PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

CVPR 2026 Findings

Yantao Li^1,2,3, Qiang Hui^2,3, Chenyang Yan^2,3, Kanzhi Cheng^2,3, Fang Zhao^2,3, Chao Tan^2,3, Huanling Gao^2,3, Jianbing Zhang^2,3, Kai Wang^2,3, Xinyu Dai^1, Shiguo Lian^2,3

^1 National Key Laboratory for Novel Software Technology, Nanjing University; ^2 Data Science & Artificial Intelligence Research Institute, China Unicom; ^3 Unicom Data Intelligence, China Unicom

Abstract

Reinforcement learning has recently improved the reasoning ability of large language models (LLMs) and multimodal LLMs, yet prevailing reward designs emphasize final-answer correctness and consequently tolerate process hallucinations: cases where a model reaches the right answer while misperceiving the visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself.

PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimization layer that combines a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains of thought and improve training stability.

Method Overview

PaLMR introduces multimodal process alignment for visual reasoning. Instead of rewarding only final-answer correctness, it makes visual facts and intermediate reasoning steps explicit, then optimizes the model toward chains of thought that remain faithful to the image.

Perception-Aligned Data Layer

Builds process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, so that the visual evidence used in the reasoning process is itself supervised.
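To make the idea of a process-aware record concrete, here is a minimal sketch of what one training sample might look like. The schema is an assumption for illustration: the class names, field names, and the example values are hypothetical and are not taken from the PaLMR release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualFact:
    """One checkable statement about the image (hypothetical schema)."""
    statement: str       # e.g. "the bar for 2021 is the tallest"
    region: str = ""     # optional free-text pointer to where it is visible


@dataclass
class ProcessAwareSample:
    """A record pairing a question with process-level supervision (hypothetical)."""
    image_path: str
    question: str
    answer: str                                                # final-answer label
    visual_facts: List[VisualFact] = field(default_factory=list)
    pseudo_gt_steps: List[str] = field(default_factory=list)   # reference reasoning chain

    def is_process_supervised(self) -> bool:
        # The record supervises the process only if it carries both
        # visual facts and a reference reasoning chain.
        return bool(self.visual_facts) and bool(self.pseudo_gt_steps)


sample = ProcessAwareSample(
    image_path="chart_001.png",  # placeholder path, not a real dataset file
    question="Which year has the highest value?",
    answer="2021",
    visual_facts=[VisualFact("the bar for 2021 is the tallest")],
    pseudo_gt_steps=[
        "Read the height of each bar.",
        "Compare the heights; the 2021 bar is the tallest.",
    ],
)
print(sample.is_process_supervised())
```

A record with only `answer` filled in would supervise the outcome alone, which is the setting the abstract argues tolerates process hallucinations.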

Process-Aligned Optimization Layer

Uses hierarchical reward fusion and process-aware scoring to encourage visually faithful reasoning traces while maintaining reinforcement learning stability.
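One plausible reading of hierarchical reward fusion is a gate: the final-answer reward only counts once the reasoning trace passes a perception check. The sketch below illustrates that reading under stated assumptions; the function name, the threshold, and the additive combination are hypothetical and are not the paper's actual formula.

```python
def fused_reward(process_score: float, answer_correct: bool,
                 process_threshold: float = 0.5) -> float:
    """Gated two-level reward (illustrative assumption, not the paper's formula).

    process_score: faithfulness of the reasoning trace to the image, in [0, 1].
    answer_correct: whether the final answer matches the label.
    """
    reward = process_score  # level 1: perception / process quality
    # Level 2: the answer bonus is granted only if perception passed the gate,
    # so a lucky answer built on misread evidence earns no bonus.
    if answer_correct and process_score >= process_threshold:
        reward += 1.0
    return reward


# Correct answer but unfaithful reasoning: no answer bonus.
print(fused_reward(0.25, answer_correct=True))
# Correct answer with faithful reasoning: process score plus bonus.
print(fused_reward(0.75, answer_correct=True))
```

Under this gating, a trace cannot raise its reward by guessing the answer while ignoring the image, which is the failure mode the optimization layer is meant to discourage.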

Overview of PaLMR, including perception-aligned data construction and process-aligned optimization.

Key Highlights

Targets process hallucination: reduces cases where the final answer is correct but the reasoning process contradicts visual evidence.
Enhances human evaluation alignment: replaces point-wise scoring with pairwise comparisons, reaching a human alignment ratio above 80%. This provides a robust, human-aligned signal for optimizing perceptual grounding.
Ensures training stability: employs a hierarchical reward fusion scheme that requires coherent visual perception before rewarding the final answer. This gating keeps accuracy rising stably and monotonically during training and keeps the reasoning behavior stable.
Aligns evidence and reasoning: connects structured visual facts with chains of thought through process-level supervision.
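The pairwise scoring mentioned in the highlights can be sketched as a win rate: a judge compares the candidate reasoning trace against other traces and the score is the fraction of comparisons it wins. This is a minimal sketch under assumptions. The source does not say how comparisons are aggregated, and the keyword-matching judge below is a toy stand-in for whatever judge the authors use; all names are hypothetical.

```python
from typing import Callable, List

# A judge returns 1 if the first trace is more visually faithful, else 0.
Judge = Callable[[str, str], int]


def pairwise_process_score(candidate: str, others: List[str], judge: Judge) -> float:
    """Fraction of pairwise comparisons the candidate wins (0.0 if none)."""
    if not others:
        return 0.0
    wins = sum(judge(candidate, other) for other in others)
    return wins / len(others)


def make_toy_judge(visual_facts: List[str]) -> Judge:
    """Toy judge: prefers the trace that mentions more of the known visual facts."""
    def count(trace: str) -> int:
        return sum(fact.lower() in trace.lower() for fact in visual_facts)

    def judge(a: str, b: str) -> int:
        return 1 if count(a) > count(b) else 0

    return judge


judge = make_toy_judge(["red car", "two people"])
score = pairwise_process_score(
    "A red car with two people inside.",
    ["A blue car.", "A red car, empty."],
    judge,
)
print(score)
```

Compared with assigning each trace an absolute score, a relative comparison only asks the judge which of two traces is better grounded, which is the property the highlight credits for the higher human alignment ratio.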

Results

Experiments on Qwen2.5-VL-7B show that PaLMR substantially reduces reasoning hallucinations and improves visual reasoning fidelity: relative to the base model, PaLMR-7B scores higher on all five benchmarks below, for example 63.8 to 70.9 on HallusionBench.

Model	#Data	MMMU (val)	HallusionBench	MathVerse (Vision Only)	MMStar	MathVista
GPT-4o	-	60.0	68.0	-	-	63.8
Gemini2-Flash	-	70.6	69.4	-	-	70.4
Qwen2.5-VL-72B	-	68.2	71.4	-	70.8	74.8
Qwen2.5-VL-32B	-	63.7	72.1	54.3	67.3	74.7
InternVL2.5-8B	-	56.2	67.4	-	62.9	64.4
MM-Eureka-7B	15K	55.4	69.5	46.6	64.6	73.0
OpenVLThinker-7B	12K	56.3	66.9	40.4	62.1	70.2
Perception-R1-7B	2K	56.3	70.0	46.1	66.3	73.6
Qwen2.5-VL-7B	-	56.4	63.8	42.6	64.3	68.2
+ GRPO	4.7K	57.8	66.7	45.9	66.0	74.1
PaLMR-7B	4.7K	59.3	70.9	47.5	67.1	73.8
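The last three rows of the table share the same backbone, so they can be compared directly. The short script below copies those rows and computes per-benchmark differences; the variable names are illustrative and the numbers are exactly those in the table.

```python
benchmarks = ["MMMU_val", "HallusionBench", "MathVerse_VisionOnly", "MMStar", "MathVista"]

# Scores copied from the results table.
base = [56.4, 63.8, 42.6, 64.3, 68.2]   # Qwen2.5-VL-7B
grpo = [57.8, 66.7, 45.9, 66.0, 74.1]   # + GRPO (4.7K samples)
palmr = [59.3, 70.9, 47.5, 67.1, 73.8]  # PaLMR-7B (4.7K samples)

# Round to one decimal to match the table's precision.
gains_over_base = {b: round(p - q, 1) for b, p, q in zip(benchmarks, palmr, base)}
gains_over_grpo = {b: round(p - q, 1) for b, p, q in zip(benchmarks, palmr, grpo)}

for name in benchmarks:
    print(f"{name}: {gains_over_base[name]:+.1f} vs base, "
          f"{gains_over_grpo[name]:+.1f} vs GRPO")
```

On these numbers PaLMR-7B is ahead of the base model on all five benchmarks and ahead of the GRPO baseline on four, trailing it by 0.3 points on MathVista.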

BibTeX

@inproceedings{li2026palmr,
  title     = {PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment},
  author    = {Yantao Li and Qiang Hui and Chenyang Yan and Kanzhi Cheng and Fang Zhao and Chao Tan and Huanling Gao and Jianbing Zhang and Kai Wang and Xinyu Dai and Shiguo Lian},
  booktitle = {CVPR 2026 Findings},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.06652},
  doi       = {10.48550/arXiv.2603.06652}
}