2.8m Gmail.txt Apr 2026

: Uses 22k data pairs focusing on textual accuracy (

The paper addresses the "SFT plateau," a phenomenon where Supervised Fine-Tuning (SFT) performance on Large Language Models (LLMs) stops improving even as the dataset size increases [11, 22]. The authors use a specific of chart-to-code data to demonstrate this limitation and propose Multimodal Structured Reinforcement Learning (MSRL) as a solution [11, 22]. 2. Methodology Supervised Fine-Tuning (SFT) Phase : Baseline Model : Qwen2.5-VL-7B-Instruct [11, 22]. 2.8M GMAIL.txt

) to ensure the generated code matches the visual intent [11]. : Uses 22k data pairs focusing on textual

To break the plateau, the authors implement a two-stage Reinforcement Learning (RL) process [11]. : Qwen2

: Qwen2.5-VL-72B-Instruct is used as the judge model for calculating visual rewards during training [11]. 4. Experimental Results

: The model is tested on subsets ranging from 200k to 2.8 million samples.