CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

ICML 2026
Sejong University
TL;DR

LiteLVLM is a training-free, text-guided token pruning method
for efficient pixel grounding in Large Vision-Language Models. Don't let CLIP trick you!

MOTIVATION

Do We Need All Visual Tokens?

01

Visual Tokens Are Not Free

Ever asked ChatGPT or Gemini about an image and waited longer than expected? These large vision-language models are powerful, but they can be slow because they process hundreds or even thousands of visual tokens.

SOLUTION

A Counterintuitive Finding

02

Visual-Text Similarity Reversal

We explore how to speed up LVLM inference by pruning unnecessary visual tokens. Our finding is counterintuitive: tokens inside the object are less similar to the text.
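One way to check for this reversal on your own data is to compare the mean visual-text cosine similarity inside and outside the object mask. The sketch below is a minimal illustration, not the authors' code: the function names, the 2-D toy embeddings, and the mask are all hypothetical and hand-picked, and real inputs would be CLIP visual token embeddings and a text embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_gap(visual_tokens, text_embedding, inside_mask):
    """Mean similarity of tokens inside the object minus the mean outside.

    A negative value means tokens inside the object are LESS similar
    to the text, i.e. the reversal described above.
    """
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]
    inside = [s for s, m in zip(sims, inside_mask) if m]
    outside = [s for s, m in zip(sims, inside_mask) if not m]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Toy inputs (hypothetical, chosen by hand, not measured data).
text = [1.0, 0.0]
tokens = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [2.0, 0.0]]
mask = [True, True, False, False]  # first two tokens lie inside the object

gap = similarity_gap(tokens, text, mask)
print(f"inside - outside similarity gap: {gap:.3f}")
```

A negative gap on real embeddings would be consistent with the finding; the toy values here only exercise the function.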

03

Text Attention Sink

The [EOS] token should summarize the text, but its attention gets stuck on [SOS] instead of the words that actually matter.

Text attention sink visualization: the [EOS] token attends most strongly to [SOS]. Example prompt: [SOS] please segment the player wearing green cleats [EOS]
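To make the idea of a sink concrete, the sketch below measures what share of the [EOS] token's attention lands on [SOS]. It is only an illustration: the pre-softmax scores are hypothetical numbers chosen by hand, and in practice they would be read from the last row of a text encoder's attention map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_share(eos_scores):
    """Return (attention weight on position 0, all attention weights)."""
    weights = softmax(eos_scores)
    return weights[0], weights


tokens = ["[SOS]", "please", "segment", "the", "player",
          "wearing", "green", "cleats", "[EOS]"]
# Hypothetical scores from the [EOS] query to every token (not measured).
scores = [6.0, 0.5, 1.0, 0.2, 1.5, 1.0, 1.2, 1.5, 1.0]

share, weights = sink_share(scores)
for tok, w in zip(tokens, weights):
    print(f"{tok:>8s}  {w:.3f}")
print(f"share of [EOS] attention on [SOS]: {share:.3f}")
```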
GOAL

Pixel Grounding with Fewer Tokens

04

Fewer Tokens, Faster Grounding.

We introduce LiteLVLM, a simple yet effective token pruning method for pixel grounding that makes LVLMs faster, lighter, and easier to use, even on non-cutting-edge hardware.

How Fast Can Pixel Grounding Be?

Retain 576 Visual Tokens
Retain 192 Visual Tokens

Token Pruning Animation

Explore which tokens LiteLVLM retains and prunes.

1. Input Image
2. Similarity-aware Tokens
3. Context-aware Tokens
4. Pixel Grounding Output

Retain 32 Visual Tokens

LiteLVLM

We design LiteLVLM, a text-guided token pruning method for efficient pixel grounding. Considering the visual-text similarity reversal, we first retain the visual tokens with low similarity to the [EOS] token (similarity-aware tokens). We then recover contextually informative tokens (context-aware tokens). Finally, all retained tokens pass through the LLM, and the pixel decoder generates a pixel-level segmentation mask.
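The two selection stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `prune_tokens`, `n_similarity`, and `n_context` are hypothetical names, and because this page does not say how context-aware tokens are chosen, the second stage uses evenly spaced sampling of the remaining tokens purely as a stand-in.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens(visual_tokens, eos_embedding, n_similarity, n_context):
    """Return the sorted indices of the visual tokens to keep."""
    n = len(visual_tokens)
    sims = [cosine(tok, eos_embedding) for tok in visual_tokens]

    # Stage 1: similarity-aware tokens = LOWEST similarity to [EOS],
    # following the similarity reversal described in the text.
    ascending = sorted(range(n), key=lambda i: sims[i])
    similarity_aware = ascending[:n_similarity]

    # Stage 2 (ASSUMPTION, stand-in only): recover context tokens by
    # sampling the remaining positions at an even stride.
    chosen = set(similarity_aware)
    remaining = [i for i in range(n) if i not in chosen]
    n_context = min(n_context, len(remaining))
    context_aware = []
    if n_context > 0:
        step = len(remaining) / n_context
        context_aware = [remaining[int(j * step)] for j in range(n_context)]

    # Keep the original token order for the LLM input.
    return sorted(similarity_aware + context_aware)


# Toy inputs (hypothetical 2-D embeddings, not real CLIP features).
eos = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
          [-1.0, 0.0], [2.0, 0.1], [0.5, 0.5]]
kept = prune_tokens(tokens, eos, n_similarity=2, n_context=2)
print(kept)  # -> [0, 1, 3, 4]
```

Tokens 3 and 1 are kept in the first stage because they have the lowest similarity to the text; tokens 0 and 4 come from the placeholder second stage.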

LiteLVLM Method Overview
EFFICIENCY

How Efficient Is LiteLVLM?

Methods compared: FastV, VisPruner, LiteLVLM (Ours)

FLOPs (TB), Baseline: 4.66

              FastV     VisPruner   LiteLVLM (Ours)
Retain 192    2.65      2.17        2.11
Retain 64     2.23      1.33        1.27

Prefilling Time (ms), Baseline: 166.25

              FastV     VisPruner   LiteLVLM (Ours)
Retain 192    162.88    75.65       74.88
Retain 64     157.50    54.77       54.02

CUDA Time (ms), Baseline: 340.89

              FastV     VisPruner   LiteLVLM (Ours)
Retain 192    340.25    276.09      265.83
Retain 64     338.65    253.10      237.35

Storing Activation (GB), Baseline: 0.81

              FastV     VisPruner   LiteLVLM (Ours)
Retain 192    0.81      0.37        0.35
Retain 64     0.80      0.23        0.21

22% faster CUDA time @192
30.4% faster CUDA time @64
2.3× lower activation memory @192
3.9× lower activation memory @64
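
The headline figures can be reproduced from the Baseline and LiteLVLM values reported above. The short check below assumes "faster" means the relative reduction in CUDA time and "lower" means the ratio of baseline to pruned activation memory; the variable and function names are illustrative.

```python
# Values copied from the efficiency results above (Baseline vs. LiteLVLM).
baseline_cuda_ms = 340.89
litelvlm_cuda_ms = {192: 265.83, 64: 237.35}
baseline_activation_gb = 0.81
litelvlm_activation_gb = {192: 0.35, 64: 0.21}


def percent_reduction(baseline, value):
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def reduction_factor(baseline, value):
    """How many times smaller the value is than the baseline."""
    return baseline / value


for retained in (192, 64):
    cuda = percent_reduction(baseline_cuda_ms, litelvlm_cuda_ms[retained])
    mem = reduction_factor(baseline_activation_gb,
                           litelvlm_activation_gb[retained])
    print(f"@{retained}: {cuda:.1f}% less CUDA time, "
          f"{mem:.1f}x lower activation memory")
```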

Visualizations

LiteLVLM qualitative visualization

Contact

If you have any questions, please feel free to contact us:

BibTeX

@article{lee2026clip,
  author    = {Lee, Sangin and Choi, Yukyung},
  title     = {CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models},
  journal   = {arXiv preprint arXiv:2605.13178},
  year      = {2026},
}