CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

TL;DR

LiteLVLM is a training-free, text-guided token pruning method
for efficient pixel grounding in Large Vision-Language Models. Don't let CLIP trick you!

MOTIVATION

Do We Need All Visual Tokens?

01

Visual Tokens Are Not Free

Ever asked ChatGPT or Gemini about an image and waited longer than expected? These large vision-language models are powerful, but they can be slow because they process hundreds or even thousands of visual tokens.

SOLUTION

A Counterintuitive Finding

02

Visual-Text Similarity Reversal

We explore how to speed up LVLM inference by pruning unnecessary visual tokens. Our finding is counterintuitive: visual tokens inside the object are less similar to the text.

Please segment the player wearing green cleats

03

Text Attention Sink

The [EOS] token should summarize the text, but its attention gets stuck on [SOS] instead of the words that actually matter.

GOAL

Pixel Grounding with Fewer Tokens

04

Fewer Tokens, Faster Grounding.

We introduce LiteLVLM, a simple yet effective token pruning method for pixel grounding that makes LVLMs faster, lighter, and easier to use, even on non-cutting-edge hardware.

How Fast Can Pixel Grounding Be?

[Interactive demo: Token Pruning Animation. Explore which tokens LiteLVLM retains and prunes when keeping 576, 192, or 32 visual tokens.]

LiteLVLM

We design LiteLVLM, a text-guided token pruning method for efficient pixel grounding. Because of the visual-text similarity reversal, we first retain the visual tokens with low similarity to the [EOS] token (similarity-aware tokens). Then, we recover contextually informative tokens (context-aware tokens). Finally, all retained tokens pass through the LLM, and the pixel decoder generates a pixel-level segmentation mask.
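The two selection stages described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity is assumed as the similarity measure, `context_scores` is a hypothetical per-token score standing in for whatever signal identifies contextually informative tokens, and the split of the token budget between the two stages (`num_similarity`, `num_context`) is likewise an assumption.

```python
import numpy as np


def cosine_similarity(tokens, query, eps=1e-8):
    """Cosine similarity between each row of `tokens` and the vector `query`."""
    tokens_n = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    query_n = query / (np.linalg.norm(query) + eps)
    return tokens_n @ query_n


def prune_visual_tokens(visual_tokens, eos_embedding, context_scores,
                        num_similarity, num_context):
    """Return sorted indices of the visual tokens to keep.

    Stage 1 keeps the tokens LEAST similar to the [EOS] text embedding
    (similarity-aware tokens). Stage 2 recovers, from the remaining tokens,
    those with the highest `context_scores` (context-aware tokens).
    `context_scores` is a hypothetical input, not defined by the source.
    """
    sim = cosine_similarity(visual_tokens, eos_embedding)
    order = np.argsort(sim)                      # ascending: lowest similarity first
    similarity_idx = order[:num_similarity]      # stage 1
    remaining = order[num_similarity:]
    ranked = remaining[np.argsort(-context_scores[remaining])]
    context_idx = ranked[:num_context]           # stage 2
    return np.sort(np.concatenate([similarity_idx, context_idx]))


# Usage with random stand-in features: 576 visual tokens, keep 192 in total.
rng = np.random.default_rng(0)
visual = rng.normal(size=(576, 64))
eos = rng.normal(size=64)
scores = rng.random(576)                         # hypothetical context scores
keep = prune_visual_tokens(visual, eos, scores, num_similarity=128, num_context=64)
pruned_tokens = visual[keep]                     # these would be passed to the LLM
print(pruned_tokens.shape)
```

Returning sorted indices keeps the retained tokens in their original spatial order, which is usually what a downstream LLM with positional embeddings expects.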

EFFICIENCY

How Efficient Is LiteLVLM?

Metric                    Baseline   Setting      FastV     VisPruner   LiteLVLM (Ours)
FLOPs (TB)                4.66       Retain 192   2.65      2.17        2.11
                                     Retain 64    2.23      1.33        1.27
Prefilling Time (ms)      166.25     Retain 192   162.88    75.65       74.88
                                     Retain 64    157.50    54.77       54.02
CUDA Time (ms)            340.89     Retain 192   340.25    276.09      265.83
                                     Retain 64    338.65    253.10      237.35
Storing Activation (GB)   0.81       Retain 192   0.81      0.37        0.35
                                     Retain 64    0.80      0.23        0.21

LiteLVLM compared with the unpruned baseline:

22% faster CUDA time @192
30.4% faster CUDA time @64
2.3× lower activation memory @192
3.9× lower activation memory @64
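The headline figures can be reproduced from the reported baseline and LiteLVLM measurements for CUDA time and activation memory. The short check below is only a sketch of that arithmetic; the helper names `relative_reduction` and `reduction_factor` are illustrative.

```python
# Reported values: unpruned baseline vs. LiteLVLM at 192 and 64 retained tokens.
baseline = {"cuda_ms": 340.89, "activation_gb": 0.81}
litelvlm = {
    192: {"cuda_ms": 265.83, "activation_gb": 0.35},
    64: {"cuda_ms": 237.35, "activation_gb": 0.21},
}


def relative_reduction(baseline_value, pruned_value):
    """Fraction of the baseline cost that pruning removes."""
    return (baseline_value - pruned_value) / baseline_value


def reduction_factor(baseline_value, pruned_value):
    """How many times smaller the pruned cost is than the baseline."""
    return baseline_value / pruned_value


for retained, measured in litelvlm.items():
    faster = 100 * relative_reduction(baseline["cuda_ms"], measured["cuda_ms"])
    memory = reduction_factor(baseline["activation_gb"], measured["activation_gb"])
    print(f"@{retained}: {faster:.1f}% faster CUDA time, {memory:.1f}x lower activation memory")
# prints:
# @192: 22.0% faster CUDA time, 2.3x lower activation memory
# @64: 30.4% faster CUDA time, 3.9x lower activation memory
```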

Visualizations

Contact

If you have any questions, please feel free to contact us:

Sangin Lee: silee@rcv.sejong.ac.kr
Yukyung Choi: ykchoi@sejong.ac.kr

BibTeX

@article{lee2026clip,
  author    = {Lee, Sangin and Choi, Yukyung},
  title     = {CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models},
  journal   = {arXiv preprint arXiv:2605.13178},
  year      = {2026},
}