
VisPlay

Self-Evolving Vision-Language Models
Yicheng He1* · Chengsong Huang2* · Zongxia Li3* · Jiaxin Huang2 · Yonghui Yang4
1University of Illinois Urbana–Champaign    2Washington University in St. Louis
3University of Maryland    4National University of Singapore
yh84@uiuc.edu    chengsong@wustl.edu    zli12321@umd.edu
*Equal contribution
Figure: Overview of the VisPlay framework.
Abstract
Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches rely heavily on human-annotated labels or task-specific heuristics. We introduce VisPlay, a self-evolving framework that enables VLMs to autonomously improve their reasoning from large-scale unlabeled image data. A single base model alternates between two interacting roles, an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained with Group Relative Policy Optimization (GRPO) using difficulty and diversity rewards. VisPlay improves visual reasoning, compositionality, and hallucination robustness across eight benchmarks.
Method
As illustrated in the figure above, the framework operates as a closed-loop system involving two agents evolved from the same base model: an Image-Conditioned Questioner and a Multimodal Reasoner. The process begins with the Questioner taking an image as input and generating a visual query. The Reasoner then receives both the image and the generated query and produces a response. Both the Questioner and the Reasoner are initialized from a shared pretrained backbone. The two agents co-evolve through iterative interactions: the Questioner is trained to pose progressively more challenging questions, while the Reasoner is trained to answer increasingly challenging questions, as sketched below.
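The snippet below is a minimal, self-contained sketch of how one such self-evolution round could be organized. It is an illustration under stated assumptions, not the released implementation: the stub functions questioner_generate and reasoner_answer, the majority-vote pseudo-label, and the exact reward formulas are hypothetical placeholders introduced only to make the difficulty/diversity rewards and the GRPO-style group-relative advantages concrete.

```python
"""Sketch of one VisPlay-style self-evolution round (illustrative only).

The policies are stubbed with placeholder functions; `group_relative_advantage`
mimics the group-normalised reward used by GRPO. All names and reward weights
here are assumptions, not the authors' code.
"""
import random
import statistics
from typing import List


def questioner_generate(image_id: str) -> str:
    """Stub Questioner: produce a visual query conditioned on an image."""
    return f"What objects interact in {image_id}?"


def reasoner_answer(image_id: str, question: str) -> str:
    """Stub Reasoner: produce an answer for an (image, question) pair."""
    return f"answer-{random.randint(0, 3)}"


def difficulty_reward(answers: List[str]) -> float:
    """Reward the Questioner when the Reasoner group disagrees (harder question)."""
    majority = max(set(answers), key=answers.count)
    return 1.0 - answers.count(majority) / len(answers)


def diversity_reward(question: str, history: List[str]) -> float:
    """Penalise repeated questions (here: exact duplicates only)."""
    return 0.0 if question in history else 1.0


def group_relative_advantage(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: reward minus the group mean, scaled by the group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]


def self_evolution_round(images: List[str], group_size: int = 4) -> None:
    question_history: List[str] = []
    for image_id in images:
        question = questioner_generate(image_id)
        # Sample a group of Reasoner rollouts for the same (image, question) pair.
        answers = [reasoner_answer(image_id, question) for _ in range(group_size)]
        # Questioner reward combines difficulty and diversity terms.
        q_reward = difficulty_reward(answers) + diversity_reward(question, question_history)
        # Reasoner rollouts are scored against a pseudo-label (majority vote here).
        pseudo_label = max(set(answers), key=answers.count)
        r_rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
        advantages = group_relative_advantage(r_rewards)
        question_history.append(question)
        print(image_id, round(q_reward, 2), [round(a, 2) for a in advantages])
        # In the actual framework these rewards would drive GRPO updates of both roles.


if __name__ == "__main__":
    self_evolution_round(["img_001", "img_002"])
```

In this sketch the Questioner is rewarded when its question is novel and splits the Reasoner group's answers, while each Reasoner rollout is scored relative to its own group, which is the group-relative normalization that distinguishes GRPO from value-function baselines.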
Cite
@misc{he2025visplay,
      title={VisPlay: Self-Evolving Vision-Language Models from Images},
      author={Yicheng He and Chengsong Huang and Zongxia Li and Jiaxin Huang and Yonghui Yang},
      year={2025},
      eprint={2511.15661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15661},
}