The Challenge of Inference-Time Alignment
At this week’s ML reading group at DeepFlow, we dove into InfAlign, a paper from DeepMind recently accepted at ICML. It’s one of those mathematically dense works that aims not at an immediate engineering recipe but at broadening our horizons about what “alignment” really means when models are deployed in the wild.
The conversation ranged from reinforcement learning formalisms to the practical headaches of inference-time evaluation. What emerged was a sharper sense of where alignment research is heading, and how some of those ideas might eventually touch the orchestration problems we tackle at DeepFlow.
Why this paper caught our attention
The InfAlign paper asks a subtle but vital question: How do we ensure models behave well not just at training time, but when they are actually used with inference-time tricks like sampling or reranking?
Traditionally, alignment efforts (think RLHF) optimize models against human preference data during training. But real-world use often introduces a mismatch: users deploy models with inference-time strategies (like best-of-n sampling) that change the distribution of outputs. The paper frames this as the “inference-time alignment gap”: the mismatch between the base policy that was aligned during training and the effective policy those strategies induce at deployment.
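During the discussion we found a toy example helpful for seeing why this gap exists. The snippet below is purely illustrative and entirely our own (the policy, reward, and numbers are made up, not taken from the paper): the same base policy produces a noticeably different output distribution once a deployer wraps it in best-of-n selection, which is exactly the behavior a purely training-time alignment objective never sees.

```python
import random

random.seed(0)

def base_policy():
    """Toy stand-in for an aligned language model: emits a scalar 'quality' score."""
    return random.gauss(0.0, 1.0)

def proxy_reward(y):
    """Toy reward the deployer uses to rank candidates at inference time."""
    return y  # higher is better in this illustration

def best_of_n(n):
    """Inference-time strategy: draw n candidates, keep the highest-reward one."""
    return max((base_policy() for _ in range(n)), key=proxy_reward)

num_trials = 10_000
plain = [base_policy() for _ in range(num_trials)]
bon_4 = [best_of_n(4) for _ in range(num_trials)]

# The effective deployed distribution is shifted relative to what training saw.
print(f"mean reward, plain sampling: {sum(map(proxy_reward, plain)) / num_trials:+.2f}")
print(f"mean reward, best-of-4:      {sum(map(proxy_reward, bon_4)) / num_trials:+.2f}")
```

Nothing here is specific to language models, which is the point: any wrapper that filters or reranks samples changes what the user actually receives.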
What the paper claims
The core proposal is a three-step recipe called Calibrate-and-Transform Reinforcement Learning (CTRL), sketched in code after this list:
- Calibration: Recalibrate reward-model scores against the distribution of outputs sampled from the reference policy, so that scores mean the same thing across prompts and reward scales.
- Transformation: Apply a transform, chosen with the inference-time procedure in mind, to the calibrated scores, turning the problem into a familiar KL-regularized, RLHF-style optimization.
- Optimization: Train the policy against this transformed reward to maximize the inference-time win rate: how often the policy’s outputs are preferred over a reference policy’s outputs when both are decoded with the same inference-time procedure.
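To make the first two steps concrete for the discussion, we sketched them in a few lines of Python. This is our own toy illustration, not the authors’ implementation: calibration is approximated by an empirical quantile against reference-policy samples, and the transformation is an exponential curve used purely as an illustrative monotone choice (the paper derives transforms tailored to specific inference-time procedures).

```python
import math
import random

random.seed(0)

def reward(y):
    """Toy reward-model score; stands in for a learned reward model."""
    return y + 0.3 * math.sin(5 * y)

def calibrate(y, reference_samples):
    """Step 1 (calibration): map the raw reward of y to its empirical quantile
    among responses drawn from the reference policy, so scores become
    comparable across prompts and reward scales."""
    r = reward(y)
    return sum(reward(ref) <= r for ref in reference_samples) / len(reference_samples)

def transform(calibrated_score, temperature=4.0):
    """Step 2 (transformation): reshape the calibrated score before the
    RLHF-style optimization step. The exponential here is an illustrative
    monotone choice, not the paper's derived transform."""
    return math.exp(temperature * calibrated_score)

# Illustrative usage with a stand-in reference policy (standard normal samples).
reference_samples = [random.gauss(0.0, 1.0) for _ in range(1_000)]
candidate_response = 0.8
c = calibrate(candidate_response, reference_samples)
print(f"calibrated (quantile) reward:          {c:.3f}")
print(f"transformed reward fed to the RL step: {transform(c):.3f}")
```

The third step then plugs this transformed reward into whatever KL-regularized RL machinery a team already runs, which is why the recipe slots in alongside existing RLHF pipelines.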
The authors’ method, InfAlign-CTRL, is shown to perform comparably to baselines such as PPO and IPO when no inference-time adjustments are used, and significantly better when such strategies are applied, as indicated by higher win rates at the same KL-divergence budget.

Figure (from InfAlign, DeepMind, ICML 2025): win-rate comparisons across three benchmark tasks. InfAlign-CTRL outperforms baseline methods under both standard inference (top row) and best-of-N / worst-of-N inference-time strategies (bottom row).
Why it matters for DeepFlow
For us at DeepFlow, the relevance lies in the principles behind inference-time alignment:
- Multi-agent orchestration: Just as InfAlign addresses the gap between training-time and inference-time behavior, our platform must anticipate mismatches between designed workflows and live orchestration.
- Reward shaping in delegation: The calibration-transform loop echoes how we might design scoring functions when orchestrating across agents and tools.
- Auditability & compliance: InfAlign’s focus on inference-time win rates aligns with our mission of tracking and auditing workflows, making sure decisions are transparent not just in design, but also in execution.
Closing reflections
InfAlign may not be the next algorithm we drop into production. Its methods are too heavy for now, and its assumptions are not always realistic. But that’s not what makes it important.
Its contribution is in naming and formalizing the inference-time alignment gap—a problem that sits at the heart of deploying aligned AI systems in practice. For DeepFlow, that recognition is key: it sharpens our awareness that orchestration must account for real-world deviations, not just theoretical designs.
As alignment research moves closer to deployment realities, our reading group takeaway is clear: orchestration and alignment are converging problems. The question is no longer just how to train aligned models, but how to ensure they stay aligned once embedded in complex, live systems—the very problem space DeepFlow was built to tackle.
This post is part of DeepFlow’s ML Reading Group series, where we share reflections on the latest AI research and its impact on workflow automation.