VisPlay
Self-Evolving Vision-Language Models from Images
Yicheng He1* ·
Chengsong Huang2* ·
Zongxia Li3* ·
Jiaxin Huang2 ·
Yonghui Yang4
1University of Illinois Urbana–Champaign
2Washington University in St. Louis
3University of Maryland
4National University of Singapore
yh84@uiuc.edu chengsong@wustl.edu zli12321@umd.edu
*: Equal Contribution
Abstract
Reinforcement learning (RL) provides a principled framework for improving
vision-language models (VLMs) on complex reasoning tasks. However,
existing RL approaches rely heavily on human-annotated labels or
task-specific heuristics. We introduce VisPlay, a self-evolving
framework that enables VLMs to autonomously improve reasoning from
massive unlabeled image data. A single base model alternates between two roles,
an Image-Conditioned Questioner and a Multimodal Reasoner, both trained with
Group Relative Policy Optimization (GRPO) using difficulty and diversity rewards.
Across eight benchmarks, VisPlay improves visual reasoning, compositionality,
and hallucination robustness.
Method
As illustrated in the figure above, the framework operates as a closed-loop system with two agents, an Image-Conditioned Questioner and a Multimodal Reasoner, both initialized from the same pretrained base model. In each round, the Questioner takes an image as input and generates a visual query; the Reasoner then receives both the image and the generated query and produces a response. The two agents co-evolve through iterative interactions: the Questioner is trained to pose increasingly challenging questions, while the Reasoner is trained to solve these increasingly challenging questions.
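To make the closed loop concrete, below is a minimal Python sketch of one co-evolution round. The generate and grpo_update interfaces, the consensus-based Reasoner reward, and the specific difficulty and diversity formulas are illustrative assumptions, not the released implementation; the text above only states that both roles are trained with GRPO using difficulty and diversity rewards on unlabeled images.

# Minimal sketch of one VisPlay self-play round (assumed interfaces and rewards).
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    image: object          # raw image (e.g., a PIL.Image)
    question: str          # query produced by the Questioner
    answers: List[str]     # group of Reasoner samples used by GRPO

def difficulty_reward(answers: List[str], consensus: str) -> float:
    """Assumed difficulty signal: reward questions on which the Reasoner's
    samples are neither trivially consistent nor pure noise."""
    agree = sum(a == consensus for a in answers) / len(answers)
    return 1.0 - abs(agree - 0.5) * 2.0   # peaks near 50% agreement

def diversity_reward(questions: List[str]) -> List[float]:
    """Assumed diversity signal: penalize duplicate questions within a batch."""
    counts = {q: questions.count(q) for q in questions}
    return [1.0 / counts[q] for q in questions]

def self_play_round(questioner, reasoner, images, group_size=8):
    """One co-evolution round: the Questioner asks, the Reasoner answers,
    and both are updated with GRPO on their respective rewards."""
    rollouts = []
    for img in images:
        question = questioner.generate(image=img)                 # assumed API
        answers = [reasoner.generate(image=img, prompt=question)  # assumed API
                   for _ in range(group_size)]
        rollouts.append(Rollout(img, question, answers))

    # Reasoner reward: agreement with the group consensus (self-consistency),
    # since no human-annotated labels are available.
    for r in rollouts:
        consensus = max(set(r.answers), key=r.answers.count)
        answer_rewards = [float(a == consensus) for a in r.answers]
        reasoner.grpo_update(image=r.image, prompt=r.question,
                             samples=r.answers, rewards=answer_rewards)

    # Questioner reward: question difficulty plus batch-level diversity.
    div = diversity_reward([r.question for r in rollouts])
    for r, d in zip(rollouts, div):
        consensus = max(set(r.answers), key=r.answers.count)
        q_reward = difficulty_reward(r.answers, consensus) + d
        questioner.grpo_update(image=r.image, prompt=None,
                               samples=[r.question], rewards=[q_reward])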
Cite
@misc{he2025visplay,
title={VisPlay: Self-Evolving Vision-Language Models from Images},
author={Yicheng He and Chengsong Huang and Zongxia Li and Jiaxin Huang and Yonghui Yang},
year={2025},
eprint={2511.15661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.15661},
}