* Equal contribution
University of California, Berkeley
We propose using action reconstruction as a scoring criterion for synthesized reasoning traces in user modeling, yielding more causally faithful reasoning and improved downstream action prediction.
User modeling aims to use language models (LMs) to mimic an individual’s behavior from a corpus of past context–action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human–AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the latent causal decision path that produced it. We propose RECON, which uses action reconstruction to score reasoning traces by their predictive power: given a context and a candidate reasoning trace, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, RECON achieves a 54.7% win rate over the post-hoc rationalization baseline. Training a reasoning synthesis model with rewards derived from RECON improves downstream performance further, achieving win rates of up to 70.0% over baselines. We also show that RECON-synthesized reasoning transfers across models and improves user modeling beyond the reconstruction model itself.
We evaluate across four free-form conversation domains spanning formal legal proceedings to casual online debate, modeling eight individuals in total. In each domain, context is a sequence of prior conversation turns and the action is an individual’s next-turn response.
Standard reasoning synthesis conditions on both context and the ground-truth action, producing rationalizations that are consistent with the action but need not explain why the user chose it over alternatives. RECON addresses this by scoring candidate rationalizations through action reconstruction: given the original context and a synthesized reasoning trace, can a model recover the user’s observed action? High reconstruction fidelity serves as a tractable proxy for how well a trace captures the user’s latent, causal reasoning.
Training-free. Sample N=4 candidate rationalizations, reconstruct the action from each, and select the trace whose reconstruction best matches the observed action.
RL-based training. Use reconstruction alignment as a GRPO reward to fine-tune the reasoning synthesizer. Because the action model stays frozen, the synthesizer must produce interpretable, model-agnostic traces.
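The two variants above share one scoring step, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `reconstruct` stands in for the frozen reconstruction model, `token_jaccard` is a toy substitute for whatever alignment measure RECON actually uses, and `group_relative_advantages` shows the standard group-normalized advantage commonly used with GRPO (population standard deviation is an assumption here). All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercased tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def score_traces(
    context: str,
    action: str,
    traces: Sequence[str],
    reconstruct: Callable[[str, str], str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> List[float]:
    """Score each trace by how well the action reconstructed from
    (context, trace) matches the observed action."""
    return [similarity(reconstruct(context, t), action) for t in traces]


def select_best(traces: Sequence[str], scores: Sequence[float]) -> Tuple[str, float]:
    """Training-free variant: keep the trace with the highest score."""
    if not traces or len(traces) != len(scores):
        raise ValueError("traces and scores must be non-empty and aligned")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return traces[best], scores[best]


def group_relative_advantages(scores: Sequence[float], eps: float = 1e-8) -> List[float]:
    """RL variant: normalize reconstruction rewards within one sampled
    group, as in GRPO-style training (population std is an assumption)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]
```

In the training-free setting, `score_traces` would be called on the N=4 sampled candidates and `select_best` would return the trace to keep. In the RL setting, the same scores would feed `group_relative_advantages` to weight policy updates for the synthesizer.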
We evaluate by measuring win rate against Backward Synthesis, the simple post-hoc rationalization baseline, in a retrieval-augmented generation pipeline. An LM judge compares generated actions to ground truth along three dimensions (style, intent, values).
We compare three methods: RECON-Select (best-of-N selection), RECON-GRPO (reasoning synthesizer trained with reconstruction rewards), and E2E-GRPO, a trained baseline that generates both reasoning and action end-to-end, rewarded solely on whether the generated action matches the ground truth. E2E-GRPO represents the natural RL alternative: rather than scoring reasoning by reconstruction fidelity, it optimizes reasoning implicitly by rewarding action accuracy.
All win rates vs. Backward Synthesis baseline (ties excluded).
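As a small illustration of the metric, a win rate with ties excluded can be computed as below. The verdict labels `"win"`, `"loss"`, and `"tie"` are hypothetical, and how the judge's three dimensions are combined into one verdict is not specified here, so this sketch assumes one verdict per comparison.

```python
from collections import Counter
from typing import Iterable


def win_rate(verdicts: Iterable[str]) -> float:
    """Fraction of decisive comparisons won against the baseline.

    Each verdict is assumed to be "win", "loss", or "tie" (hypothetical
    labels); ties are dropped from the denominator.
    """
    counts = Counter(verdicts)
    decided = counts["win"] + counts["loss"]
    if decided == 0:
        raise ValueError("no decisive comparisons to compute a win rate from")
    return counts["win"] / decided
```

Under this definition, a win rate of 50% means the method and the baseline are preferred equally often among comparisons where the judge expressed a preference.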
In the PMQ domain, RECON-GRPO learns to identify the Prime Minister’s rhetorical strategy — attacking the opposition rather than responding defensively — and produces reasoning that guides the action model toward a response much closer to the ground truth. The post-hoc rationalization baseline, by contrast, produces a defensive framing that misses the PM’s communicative intent.
@misc{zhu2026recon,
title={RECON: Reconstruction-Guided Reasoning Synthesis for User Modeling},
author={Zhu, Alan and Miroyan, Mihran and Wang, Carolyn and Zhou, Andrew
and Dunlap, Lisa and Norouzi, Narges and Gonzalez, Joseph E.},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}