Visual Tokens Are Not Free
Ever asked ChatGPT or Gemini about an image and waited longer than expected? These large vision-language models are powerful, but they can be slow because they process hundreds or even thousands of visual tokens.
We explore how to speed up LVLM inference by pruning unnecessary visual tokens. Our finding is counterintuitive: tokens inside the object are less similar to the text.
The [EOS] token should summarize the text, but it fixates on [SOS] instead of what actually matters.
We introduce LiteLVLM, a simple yet effective token pruning method for pixel grounding that makes LVLMs faster, lighter, and easier to use, even on non-cutting-edge hardware.
Explore which tokens LiteLVLM retains and prunes.
We design LiteLVLM, a text-guided token pruning method for efficient pixel grounding. To account for the visual-text similarity reversal, we first retain the visual tokens with low similarity to the [EOS] token (similarity-aware tokens). Then, we recover contextually informative tokens (context-aware tokens). Finally, all of these tokens pass through the LLM, and the pixel decoder generates a pixel-level segmentation mask.
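The two selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the LiteLVLM implementation: the function name select_tokens, the two budgets n_similarity and n_context, and the per-token context_scores input are all hypothetical. The page does not say how context-aware tokens are scored, so the sketch takes that score as an input instead of guessing it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens(visual, eos, n_similarity, n_context, context_scores):
    """Return the sorted indices of the visual tokens to keep.

    visual         : list of visual token embeddings (lists of floats)
    eos            : text [EOS] embedding in the same space
    n_similarity   : budget for similarity-aware tokens (assumed parameter)
    n_context      : budget for context-aware tokens (assumed parameter)
    context_scores : hypothetical per-token score, higher = more informative
    """
    sims = [cosine(v, eos) for v in visual]

    # Step 1: keep the tokens LEAST similar to [EOS], following the
    # visual-text similarity reversal described above.
    by_low_similarity = sorted(range(len(visual)), key=lambda i: sims[i])
    kept = set(by_low_similarity[:n_similarity])

    # Step 2: recover context tokens from what step 1 pruned, ranked by
    # the caller-supplied (assumed) context score.
    pruned = [i for i in range(len(visual)) if i not in kept]
    pruned.sort(key=lambda i: context_scores[i], reverse=True)
    kept.update(pruned[:n_context])

    return sorted(kept)


# Toy example: four 2-D "visual tokens" and one [EOS] embedding.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
eos = [1.0, 0.0]
keep = select_tokens(visual, eos, n_similarity=2, n_context=1,
                     context_scores=[0.2, 0.9, 0.1, 0.3])
print(keep)  # [1, 2, 3]
```

In the toy example, tokens 2 and 3 are the least similar to [EOS] and are kept in step 1. Token 1 is then recovered in step 2 because it has the highest assumed context score among the pruned tokens. Only the kept indices would be passed on to the LLM.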
Efficiency gains: faster CUDA time and lower activation memory at both @192 and @64.
If you have any questions, please feel free to contact us:
@article{lee2026clip,
author = {Lee, Sangin and Choi, Yukyung},
title = {CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models},
journal = {arXiv preprint arXiv:2605.13178},
year = {2026},
}