VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

Zengjie Hu1,2,*, Jiantao Qiu2,*,\(^\ddagger\), Tianyi Bai3,*, Haojin Yang1,
Binhang Yuan3, Qi Jing1,†, Conghui He2,†, Wentao Zhang1,†
1Peking University, 2Shanghai AI Lab, 3HKUST,
*Equal contribution. \(^\ddagger\)Project leader. Corresponding authors.
{wentao.zhang, jingqi}@pku.edu.cn, {qiujiantao, heconghui}@pjlab.org.cn

Abstract

Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical \(\textit{gradient vanishing}\) problem: when all responses within a group receive identical rewards, advantage estimates collapse and training signals diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, incurring substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose VADE, a Variance-Aware Dynamic sampling framework via online sample-level difficulty Estimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three-component design enables VADE to dynamically select the most informative samples, amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serve as a plug-and-play component that integrates seamlessly into existing group-based RL algorithms.

VADE Framework

Method

Overview of the VADE framework. Our method maintains a distribution \(\text{Beta}(\alpha_i + 1, \beta_i + 1)\) for each sample to enable online difficulty estimation. Through Thompson sampling and maximization of the information gain \(\mathcal{I}_i = p_t(1-p_t)^2\), VADE dynamically selects informative batches for group-wise rollouts. The two-scale prior decay mechanism keeps estimates accurate throughout policy evolution.
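The selection loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the class and function names, the exponential form of the prior decay, and the binary-reward update are all assumptions; only the Beta posterior \(\text{Beta}(\alpha_i + 1, \beta_i + 1)\), the Thompson draw, and the gain \(\mathcal{I}_i = p_t(1-p_t)^2\) come from the overview.

```python
import random


class SampleStats:
    """Per-sample Beta posterior over the correctness probability."""

    def __init__(self):
        self.alpha = 0.0  # decayed count of correct rollouts
        self.beta = 0.0   # decayed count of incorrect rollouts

    def thompson_draw(self):
        # Draw p_t ~ Beta(alpha + 1, beta + 1); the +1 offsets match the
        # uniform prior in the paper's notation.
        return random.betavariate(self.alpha + 1.0, self.beta + 1.0)


def info_gain(p):
    # InfoGain I_i = p_t (1 - p_t)^2 from the overview; it vanishes for
    # samples the policy always or never solves and peaks at p = 1/3.
    return p * (1.0 - p) ** 2


def select_batch(stats, batch_size):
    # Score every candidate by the InfoGain of a Thompson draw from its
    # posterior, then keep the top-scoring samples for group-wise rollout.
    scored = sorted(
        stats.items(),
        key=lambda kv: info_gain(kv[1].thompson_draw()),
        reverse=True,
    )
    return [sample_id for sample_id, _ in scored[:batch_size]]


def update(stat, rewards, decay=0.99):
    # One (hypothetical) realization of prior decay: shrink old counts
    # before adding new binary-reward evidence, so the estimate tracks
    # the evolving policy instead of averaging over its whole history.
    correct = sum(rewards)
    stat.alpha = decay * stat.alpha + correct
    stat.beta = decay * stat.beta + (len(rewards) - correct)
```

A usage pattern would be to call `select_batch` before each rollout step and `update` with the group's binary rewards afterward; the actual two-scale decay in the paper uses two timescales rather than the single `decay` shown here.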

Experimental Results

Benchmark Performance


Training Dynamics


Ablation Study


Citation


@article{hu2025vade,
  title={VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL},
  author={Hu, Zengjie and Qiu, Jiantao and Bai, Tianyi and Yang, Haojin and Yuan, Binhang and He, Conghui and Jing, Qi and Zhang, Wentao},
  journal={arXiv preprint arXiv:2511.18902},
  year={2025},
  url={https://arxiv.org/abs/2511.18902},
  archivePrefix={arXiv},
  eprint={2511.18902},
  primaryClass={cs.AI}
}