The Challenge of Inference-Time Alignment
At this week’s ML reading group at DeepFlow, we dove into InfAlign, a paper from DeepMind recently accepted at ICML. It’s one of those mathematically dense works that aims not at an immediate engineering recipe but at broadening our horizons about what “alignment” really means when models are deployed in the wild.
The conversation ranged from reinforcement learning formalisms to the practical headaches of inference-time evaluation. What emerged was a sharper sense of where alignment research is heading, and how some of those ideas might eventually touch the orchestration problems we tackle at DeepFlow.
Why this paper caught our attention
The InfAlign paper asks a subtle but vital question: How do we ensure models behave well not just at training time, but when they are actually used with inference-time tricks like sampling or reranking?
Traditionally, alignment efforts (think RLHF) optimize models against human preference data during training. But real-world use often introduces a mismatch: users deploy models with inference-time strategies (like best-of-n sampling) that change the distribution of outputs. The paper frames this as the “inference-time alignment gap”: the mismatch between the base policy that was aligned during training and the effective policy those strategies induce at deployment.
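During the discussion we found a toy example helpful for seeing why this gap exists. The snippet below is purely illustrative and entirely our own (the policy, reward, and numbers are made up, not taken from the paper): the same base policy produces a noticeably different output distribution once a deployer wraps it in best-of-n selection, which is exactly the behavior a purely training-time alignment objective never sees.

```python
import random

random.seed(0)

def base_policy():
    """Toy stand-in for an aligned language model: emits a scalar 'quality' score."""
    return random.gauss(0.0, 1.0)

def proxy_reward(y):
    """Toy reward the deployer uses to rank candidates at inference time."""
    return y  # higher is better in this illustration

def best_of_n(n):
    """Inference-time strategy: draw n candidates, keep the highest-reward one."""
    return max((base_policy() for _ in range(n)), key=proxy_reward)

num_trials = 10_000
plain = [base_policy() for _ in range(num_trials)]
bon_4 = [best_of_n(4) for _ in range(num_trials)]

# The effective deployed distribution is shifted relative to what training saw.
print(f"mean reward, plain sampling: {sum(map(proxy_reward, plain)) / num_trials:+.2f}")
print(f"mean reward, best-of-4:      {sum(map(proxy_reward, bon_4)) / num_trials:+.2f}")
```

Nothing here is specific to language models, which is the point: any wrapper that filters or reranks samples changes what the user actually receives.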
What the paper claims
The core proposal is a three-step recipe called Calibrate-and-Transform Reinforcement Learning (CTRL), sketched in code after this list:
- Calibration: Recalibrate reward-model scores against the distribution of outputs sampled from the reference policy, so that scores mean the same thing across prompts and reward scales.
- Transformation: Apply a transform, chosen with the inference-time procedure in mind, to the calibrated scores, turning the problem into a familiar KL-regularized, RLHF-style optimization.
- Optimization: Train the policy against this transformed reward to maximize the inference-time win rate: how often the policy’s outputs are preferred over a reference policy’s outputs when both are decoded with the same inference-time procedure.
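To make the first two steps concrete for the discussion, we sketched them in a few lines of Python. This is our own toy illustration, not the authors’ implementation: calibration is approximated by an empirical quantile against reference-policy samples, and the transformation is an exponential curve used purely as an illustrative monotone choice (the paper derives transforms tailored to specific inference-time procedures).

```python
import math
import random

random.seed(0)

def reward(y):
    """Toy reward-model score; stands in for a learned reward model."""
    return y + 0.3 * math.sin(5 * y)

def calibrate(y, reference_samples):
    """Step 1 (calibration): map the raw reward of y to its empirical quantile
    among responses drawn from the reference policy, so scores become
    comparable across prompts and reward scales."""
    r = reward(y)
    return sum(reward(ref) <= r for ref in reference_samples) / len(reference_samples)

def transform(calibrated_score, temperature=4.0):
    """Step 2 (transformation): reshape the calibrated score before the
    RLHF-style optimization step. The exponential here is an illustrative
    monotone choice, not the paper's derived transform."""
    return math.exp(temperature * calibrated_score)

# Illustrative usage with a stand-in reference policy (standard normal samples).
reference_samples = [random.gauss(0.0, 1.0) for _ in range(1_000)]
candidate_response = 0.8
c = calibrate(candidate_response, reference_samples)
print(f"calibrated (quantile) reward:          {c:.3f}")
print(f"transformed reward fed to the RL step: {transform(c):.3f}")
```

The third step then plugs this transformed reward into whatever KL-regularized RL machinery a team already runs, which is why the recipe slots in alongside existing RLHF pipelines.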
The authors’ method, InfAlign-CTRL, is shown to perform comparably to baselines such as PPO and IPO when no inference-time adjustments are used, and significantly better when such strategies are applied, as indicated by higher win rates at the same KL-divergence budget.

Figure (from InfAlign, DeepMind, ICML 2025): win-rate comparisons across three benchmark tasks. InfAlign-CTRL outperforms baseline methods under both standard inference (top row) and best-of-N / worst-of-N inference-time strategies (bottom row).
Why it matters for DeepFlow
For us at DeepFlow, the relevance lies in the principles behind inference-time alignment:
- Multi-agent orchestration: Just as InfAlign addresses the gap between training-time and inference-time behavior, our platform must anticipate mismatches between designed workflows and live orchestration.
- Reward shaping in delegation: The calibration-transform loop echoes how we might design scoring functions when orchestrating across agents and tools.
- Auditability & compliance: InfAlign’s focus on inference-time win rates aligns with our mission of tracking and auditing workflows, making sure decisions are transparent not just in design, but also in execution.
Closing reflections
InfAlign may not be the next algorithm we drop into production. Its methods are too heavy for now, and its assumptions are not always realistic. But that’s not what makes it important.
Its contribution is in naming and formalizing the inference-time alignment gap—a problem that sits at the heart of deploying aligned AI systems in practice. For DeepFlow, that recognition is key: it sharpens our awareness that orchestration must account for real-world deviations, not just theoretical designs.
As alignment research moves closer to deployment realities, our reading group takeaway is clear: orchestration and alignment are converging problems. The question is no longer just how to train aligned models, but how to ensure they stay aligned once embedded in complex, live systems—the very problem space DeepFlow was built to tackle.
This post is part of DeepFlow’s ML Reading Group series, where we share reflections on the latest AI research and its impact on workflow automation.