VisPlay
Self-Evolving Vision-Language Models from Images
Yicheng He1* ·
Chengsong Huang2* ·
Zongxia Li3* ·
Jiaxin Huang2 ·
Yonghui Yang4
1University of Illinois Urbana–Champaign
2Washington University in St. Louis
3University of Maryland
4National University of Singapore
yh84@uiuc.edu chengsong@wustl.edu zli12321@umd.edu
*: Equal Contribution
Abstract
Reinforcement learning (RL) provides a principled framework for improving
vision-language models (VLMs) on complex reasoning tasks. However,
existing RL approaches rely heavily on human-annotated labels or
task-specific heuristics. We introduce VisPlay, a self-evolving
framework that enables VLMs to autonomously improve reasoning from
massive unlabeled image data. A single base model alternates between two roles,
an Image-Conditioned Questioner and a Multimodal Reasoner, both trained with
Group Relative Policy Optimization (GRPO) using difficulty and diversity rewards.
Across eight benchmarks, VisPlay improves visual reasoning, compositionality,
and hallucination robustness.
Method
As illustrated in the figure above, the framework operates as a closed-loop system with two agents, an Image-Conditioned Questioner and a Multimodal Reasoner, both initialized from the same pretrained base model. In each round, the Questioner takes an image as input and generates a visual query; the Reasoner then receives both the image and the generated query and produces a response. The two agents co-evolve through iterative interactions: the Questioner is trained to pose increasingly challenging questions, while the Reasoner is trained to solve these increasingly challenging questions.
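To make the closed loop concrete, below is a minimal Python sketch of one co-evolution round. The generate and grpo_update interfaces, the consensus-based Reasoner reward, and the specific difficulty and diversity formulas are illustrative assumptions, not the released implementation; the text above only states that both roles are trained with GRPO using difficulty and diversity rewards on unlabeled images.

# Minimal sketch of one VisPlay self-play round (assumed interfaces and rewards).
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    image: object          # raw image (e.g., a PIL.Image)
    question: str          # query produced by the Questioner
    answers: List[str]     # group of Reasoner samples used by GRPO

def difficulty_reward(answers: List[str], consensus: str) -> float:
    """Assumed difficulty signal: reward questions on which the Reasoner's
    samples are neither trivially consistent nor pure noise."""
    agree = sum(a == consensus for a in answers) / len(answers)
    return 1.0 - abs(agree - 0.5) * 2.0   # peaks near 50% agreement

def diversity_reward(questions: List[str]) -> List[float]:
    """Assumed diversity signal: penalize duplicate questions within a batch."""
    counts = {q: questions.count(q) for q in questions}
    return [1.0 / counts[q] for q in questions]

def self_play_round(questioner, reasoner, images, group_size=8):
    """One co-evolution round: the Questioner asks, the Reasoner answers,
    and both are updated with GRPO on their respective rewards."""
    rollouts = []
    for img in images:
        question = questioner.generate(image=img)                 # assumed API
        answers = [reasoner.generate(image=img, prompt=question)  # assumed API
                   for _ in range(group_size)]
        rollouts.append(Rollout(img, question, answers))

    # Reasoner reward: agreement with the group consensus (self-consistency),
    # since no human-annotated labels are available.
    for r in rollouts:
        consensus = max(set(r.answers), key=r.answers.count)
        answer_rewards = [float(a == consensus) for a in r.answers]
        reasoner.grpo_update(image=r.image, prompt=r.question,
                             samples=r.answers, rewards=answer_rewards)

    # Questioner reward: question difficulty plus batch-level diversity.
    div = diversity_reward([r.question for r in rollouts])
    for r, d in zip(rollouts, div):
        consensus = max(set(r.answers), key=r.answers.count)
        q_reward = difficulty_reward(r.answers, consensus) + d
        questioner.grpo_update(image=r.image, prompt=None,
                               samples=[r.question], rewards=[q_reward])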
Cite
@misc{he2025visplay,
title={VisPlay: Self-Evolving Vision-Language Models from Images},
author={Yicheng He and Chengsong Huang and Zongxia Li and Jiaxin Huang and Yonghui Yang},
year={2025},
eprint={2511.15661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.15661},
}