Reinforcement learning has recently improved the reasoning ability of Large Language Models and
Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently
tolerate process hallucinations: cases where models reach the right answer while misperceiving
visual evidence. We address this process-level misalignment with PaLMR, a framework that
aligns not only outcomes but also the reasoning process itself.
PaLMR comprises two complementary components: a perception-aligned data layer that constructs
process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts,
and a process-aligned optimisation layer that applies a hierarchical reward-fusion scheme with
a process-aware scoring function to encourage visually faithful chains of thought and improve
training stability.