LLMs Improving LLMs:
Agentic Discovery for Test-Time Scaling
1UMD, 2UVA, 3WUSTL, 4UNC, 5Google, 6Meta
Abstract
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored.
We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search.
As a concrete instantiation, we formulate width–depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails.
Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy–cost tradeoff over strong manually designed baselines. They generalize to held-out benchmarks and model scales, while the entire discovery run costs only $39.9 and 160 minutes.
Motivation
Looking back at the trajectory of TTS development, as shown in the figure on the right, many distinct TTS methods can be viewed as special cases within a single shared control space. Take the width–depth view: width is the number of parallel branches; depth is how far each branch is extended. If each existing method is just a point in this space, why keep hand-designing more points instead of treating the space as a search problem? This motivates us to rethink the path forward for TTS research: build reusable environments, not more heuristics — define the space, then let a coding agent discover new controllers inside it automatically.
Problem setup
Test-time scaling as algorithmic search
We treat adaptive test-time inference as allocating a finite budget across branches:
open branches, extend them in fixed-length generation intervals, probe to reveal intermediate answers, prune, then aggregate into a final reply—capturing best-of-N, self-consistency, adaptive branching, and related schedules within one formulation.
Branches & probes
Each branch i yields prefixes z_{i,1}, z_{i,2}, …; after each fixed-length interval, an intermediate answer ω_{i,k} exists but enters the controller's observation only when a probing action is taken.
Budget & cost
Computation is counted in interval units up to budget B.
State cost sums depths plus probing overhead:
Cost(s_t) = Σ_i ℓ_{t,i} + κ_probe · |Ω_t| (often κ_probe = 0).
State
At step t,
s_t = (q, m_t, I_t, ℓ_t, Ω_t):
question q, number of instantiated branches m_t, active branch set I_t, depth vector ℓ_t, and revealed probe triples Ω_t.
Admissible actions
- BRANCH — open a new branch through the first interval.
- CONTINUE(i) — advance branch i by one interval.
- PROBE(i) — reveal ω_{i,ℓ} without advancing depth.
- PRUNE(i) — deactivate i; depths & past probes stay recorded.
- ANSWER — terminate and apply aggregation.
Policy, aggregation & objective
A code-defined policy π(· | s, β) maps state s and a scalar meta-parameter β (which deterministically schedules internal knobs) to actions until ANSWER.
The controller may ship its own terminal aggregator Agg_{π,β}, yielding prediction ŷ_{π,β}(q) and cost C_{π,β}(q).
Over tasks (q, y) ~ 𝒟, we maximize accuracy minus penalized cost:
max_{π,β} 𝔼_{(q,y)} [ 𝟙{ŷ_{π,β}(q) = y} − γ · C_{π,β}(q) ].
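To ground the notation, here is a minimal Python sketch of the state, admissible actions, and cost defined above; all names (State, ActionType, cost) are illustrative assumptions, not the paper's released interface.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    BRANCH = auto()    # open a new branch through its first interval
    CONTINUE = auto()  # advance branch i by one interval
    PROBE = auto()     # reveal omega_{i,k} without advancing depth
    PRUNE = auto()     # deactivate branch i; records persist
    ANSWER = auto()    # terminate and apply aggregation

@dataclass
class State:
    question: str                 # q
    m: int                        # m_t, number of instantiated branches
    active: set[int]              # I_t, active branch set
    depth: dict[int, int]         # ell_t, intervals generated per branch
    probes: list[tuple]           # Omega_t, revealed triples (i, k, omega_{i,k})

KAPPA_PROBE = 0.0  # probing overhead kappa_probe; the text notes it is often 0

def cost(s: State) -> float:
    """Cost(s_t) = sum_i ell_{t,i} + kappa_probe * |Omega_t|."""
    return sum(s.depth.values()) + KAPPA_PROBE * len(s.probes)
```

A controller then repeatedly selects admissible actions until ANSWER, at which point its aggregator maps the revealed probes to a final prediction.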
Method
Environment construction before discovery
The MDP of Problem setup is instantiated here as a concrete environment:
fix the interface, pay for base-model forwards once to materialize trajectories, then export a replay table that discovery evaluates without new decoding.
Data collection follows Parallel-Probe: each question receives N independent traces in segments of Δ tokens; all forwards precede the discovery loop.
1. Specify the interface. Define the state s_t, admissible actions A(s_t), budget and cost Cost(s_t), and the accuracy–cost objective.
2. Offline trajectory collection. For each query, draw N parallel, mutually independent reasoning traces from the backbone (full strings first). Only after this batch completes do we partition each trace into fixed-length segments of Δ tokens and enumerate branch prefixes z_{i,k} with probe responses ω_{i,k}.
3. Materialize the replay store. Every environment transition consults the archived table: executing PROBE(i), for instance, retrieves the archived ω_{i,k} without advancing decoding. Repeated scoring passes and sweeps along β therefore incur only replay cost.
4. Hand off to discovery. With the store fixed, candidate controllers are simulated exclusively through observe/step; asymptotic evaluation cost is dominated by table replay rather than live decoding. A sketch of such a replay-backed step function follows this list.
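Below is a minimal sketch of the replay-backed transition from step 3, reusing the State and ActionType definitions from the earlier sketch; the (question, branch, segment) table key and the ReplayEnv name are assumptions about how the store might be indexed, not the released format.

```python
# Assumes State and ActionType from the interface sketch above.
class ReplayEnv:
    def __init__(self, replay: dict, budget: int):
        # replay maps (question, branch i, segment k) -> (prefix z_ik, probe answer omega_ik)
        self.replay = replay
        self.budget = budget  # budget B, counted in interval units

    def step(self, s: State, action: ActionType, i: int = -1) -> State:
        if action is ActionType.BRANCH:
            s.depth[s.m] = 1           # new branch runs through its first interval
            s.active.add(s.m)
            s.m += 1
        elif action is ActionType.CONTINUE:
            s.depth[i] += 1            # advance branch i by one Δ-token segment
        elif action is ActionType.PROBE:
            k = s.depth[i]
            _, omega = self.replay[(s.question, i, k)]  # cached, no LLM call
            s.probes.append((i, k, omega))              # depth unchanged
        elif action is ActionType.PRUNE:
            s.active.discard(i)        # depths and past probes stay recorded
        assert sum(s.depth.values()) <= self.budget, "interval budget exceeded"
        return s
```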
Discovery: β parameterization & trace feedback
Historical records couple offline replay trajectories with search traces; controllers map an external scalar β monotonically into interior hyper-parameters, collapsing outer tuning to a one-dimensional sweep.
Beta parameterization for tractable search
Empirically, automatically generated controllers can introduce on the order of ten coupled hyper-parameters. Across only five outer rounds, joint optimization gravitates toward extreme regimes—overly aggressive pruning, for example—that minimize measured tokens on the search benchmark yet fail as generic allocation schedules.
We therefore impose beta parameterization: each artifact exports a single scalar β together with a deterministic map specifying every internal knob, monotonic so that larger β never reduces the permissible token envelope. Outer search collapses to sweeping β while discouraging brittle thresholds tailored exclusively to Q_search; in our realization, both the programmatic policy and the mapping are authored by the coding agent.
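As a concrete illustration of such a map, the sketch below expands β into a handful of knobs; the knob names, ranges, and the choice of four knobs are illustrative assumptions, with every direction chosen so that a larger β only enlarges the token envelope.

```python
def knobs_from_beta(beta: float) -> dict:
    """Deterministically expand the scalar beta into internal knobs.

    Monotone by construction: every knob moves so that a larger beta
    never shrinks the permissible token envelope.
    """
    beta = max(0.0, min(1.0, beta))              # clamp to [0, 1]
    return {
        "max_branches": 4 + round(12 * beta),    # width cap: 4 .. 16
        "max_depth":    8 + round(24 * beta),    # depth cap in Δ-token segments
        "prune_margin": 0.9 - 0.5 * beta,        # laxer pruning as beta grows
        "stop_conf":    0.80 + 0.15 * beta,      # harder to stop early as beta grows
    }
```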
History augmentation with execution traces
Scalar summaries such as accuracy or aggregate token totals indicate whether an iteration succeeds but rarely explain systematic failure. Accordingly, alongside each round’s multi-β sweep we archive both empirical scaling curves and the full action-by-action trajectories reconstructed during replay. The former summarizes aggregate quality–budget trade-offs; the latter supplies fine-grained behavioral evidence analogous to tracing harness logs, enabling the explorer to localize defects prior to rewriting code. This design parallels reported gains from execution-level supervision in autonomous software engineering pipelines.
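One plausible shape for such a per-round record, sketched in Python; the field names are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RoundRecord:
    controller_source: str       # candidate controller.py text for this round
    sweep: list[dict]            # per-beta summaries: {beta, accuracy, tokens}
    traces: list[list[str]]      # full action-by-action replay trajectories
    diagnosis: str = ""          # agent-written notes on observed failures

history: list[RoundRecord] = []  # appended once per discovery round
```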
Live case demo
A coding agent refines a controller from grid feedback
This demo replays one held-out AIME25 instance as a 2-D branch-depth grid. Each cell is a cached intermediate answer from the offline environment; the coding agent reads the failure trace, proposes a new OptimalController, evaluates it, appends feedback to history, and uses that history in the next round.
AIME25 · problem 29 · gold answer 104
Propose code, execute on the grid, store feedback, then refine
Five rounds of a coding agent: turn 1 stops too early and returns 196; turns 2–5 rewrite controller.py until the run converges on the gold answer 104.
- ① Read repo history and propose controller.py
- ② Run the controller in Env on the grid
- ③ Inspect trace, vote evidence, tokens, and accuracy
- ④ Store feedback to History for the next turn
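Steps ① through ④ compose into an outer loop along the lines sketched below; agent.propose and env.evaluate are hypothetical stand-ins for the coding-agent call and replay evaluation, not the paper's actual API.

```python
def compile_controller(source: str):
    """Load the proposed controller class from its source text."""
    namespace = {}
    exec(source, namespace)                      # the artifact defines OptimalController
    return namespace["OptimalController"]

def discovery_loop(agent, env, betas, rounds=5):
    """Propose code, execute on the grid, store feedback, then refine."""
    history = []
    for _ in range(rounds):
        source = agent.propose(history)          # (1) read history, write controller.py
        controller = compile_controller(source)
        sweep, traces = [], []
        for beta in betas:                       # (2) run in Env across the beta sweep
            acc, tokens, trace = env.evaluate(controller, beta)
            sweep.append({"beta": beta, "acc": acc, "tokens": tokens})
            traces.append(trace)                 # (3) keep fine-grained trace evidence
        history.append({"source": source, "sweep": sweep, "traces": traces})  # (4)
    return history
```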
Execution Process
Run the proposed controller on the real AIME25 replay grid.
Right triangle ABC has a right angle at A and BC = 38; points K, L inside the triangle satisfy AK = AL = BK = CL = KL = 14. Find n such that the area of quadrilateral BKLC equals n√3.
Results
Experimental results
Main results: accuracy & total tokens
- Setup. Four Qwen3 backbones; columns report AIME24 (search) plus held-out AIME25/HMMT25 (and the held-out average where shown). Higher accuracy and lower cumulative tokens are better; within each backbone block the strongest entry per column is bold. Rows contrast handcrafted baselines with AutoTTS-discovered controllers at two scalar β settings.
- Trade-offs. Discovered controllers typically achieve a better empirical accuracy–token frontier than the handcrafted methods compared here.
- Generalization. Policies are optimized using only AIME24 replay constructions yet transfer to held-out benchmarks: they outperform every handcrafted baseline on average accuracy for three of four models, and remain competitive on Qwen3-8B (62.7 vs. 62.8 for SC@64).
- β = 0.5. Cuts aggregate token usage by roughly 69.5% vs. SC@64 while keeping matched mean held-out accuracy averaged across models (45.3 vs. 45.2).
- β = 1.0. Pushes peak accuracy beyond all handcrafted baselines in five of the eight tabulated comparison cells.
| Method | Type | AIME24 (search) Acc. ↑ | AIME24 Tokens ↓ | AIME25 (held-out) Acc. ↑ | AIME25 Tokens ↓ | HMMT25 (held-out) Acc. ↑ | HMMT25 Tokens ↓ | Avg. (held-out) Acc. ↑ | Avg. Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| **Base model: Qwen3-0.6B** | | | | | | | | | |
| SC@64 | Handcrafted | 21.4 | 1008.6k | 28.9 | 890.5k | 18.1 | 937.8k | 23.2 | 914.2k |
| ASC | Handcrafted | 21.4 | 805.5k | 28.9 | 653.8k | 18.1 | 580.8k | 23.2 | 617.3k |
| ESC | Handcrafted | 21.4 | 986.7k | 28.9 | 868.8k | 18.1 | 923.9k | 23.2 | 896.4k |
| Parallel-Probe | Handcrafted | **21.8** | 773.8k | 29.7 | 697.8k | **18.5** | 734.5k | 24.1 | 716.2k |
| AutoTTS (β = 0.5) | Discovered | 19.2 | **283.6k** | 28.6 | **250.3k** | 14.9 | **257.6k** | 21.8 | **254.0k** |
| AutoTTS (β = 1.0) | Discovered | 20.9 | 542.2k | **31.1** | 474.7k | 18.0 | 487.1k | **24.6** | 480.9k |
| **Base model: Qwen3-1.7B** | | | | | | | | | |
| SC@64 | Handcrafted | **72.5** | 1025.8k | 44.4 | 1054.1k | 24.2 | 1132.9k | 34.3 | 1093.5k |
| ASC | Handcrafted | 72.3 | 482.6k | 44.4 | 600.9k | 24.2 | 586.3k | 34.3 | 593.6k |
| ESC | Handcrafted | **72.5** | 909.2k | 44.4 | 913.8k | 24.2 | 1014.2k | 34.3 | 964.3k |
| Parallel-Probe | Handcrafted | 68.1 | 748.5k | 44.7 | 775.8k | 22.6 | 860.2k | 33.7 | 818.0k |
| AutoTTS (β = 0.5) | Discovered | 68.5 | **276.3k** | 46.7 | **327.9k** | 30.5 | **359.1k** | 38.6 | **343.5k** |
| AutoTTS (β = 1.0) | Discovered | 70.4 | 499.1k | **49.0** | 612.6k | **32.1** | 679.6k | **40.6** | 646.1k |
| **Base model: Qwen3-4B** | | | | | | | | | |
| SC@64 | Handcrafted | 80.0 | 886.8k | **76.6** | 1088.1k | 43.6 | 1168.3k | 60.1 | 1128.2k |
| ASC | Handcrafted | 80.1 | **175.7k** | 76.4 | **277.3k** | 44.0 | 388.9k | 60.2 | **333.1k** |
| ESC | Handcrafted | 80.0 | 528.9k | **76.6** | 793.3k | 43.6 | 990.2k | 60.1 | 891.8k |
| Parallel-Probe | Handcrafted | 79.7 | 688.9k | 76.1 | 806.0k | 44.7 | 872.3k | 60.4 | 839.2k |
| AutoTTS (β = 0.5) | Discovered | 82.0 | 236.7k | 73.8 | 332.3k | 45.7 | **365.0k** | 59.8 | 348.7k |
| AutoTTS (β = 1.0) | Discovered | **83.5** | 424.9k | 74.4 | 610.4k | **46.5** | 686.8k | **60.5** | 648.6k |
| **Base model: Qwen3-8B** | | | | | | | | | |
| SC@64 | Handcrafted | 80.4 | 910.8k | 76.7 | 1124.4k | 48.9 | 1267.0k | **62.8** | 1195.7k |
| ASC | Handcrafted | 80.4 | **226.0k** | 76.7 | 406.2k | 48.8 | 565.1k | **62.8** | 485.7k |
| ESC | Handcrafted | 80.4 | 459.4k | 76.7 | 793.1k | 48.9 | 1062.1k | **62.8** | 927.6k |
| Parallel-Probe | Handcrafted | 81.5 | 730.8k | **76.9** | 846.7k | 47.1 | 897.2k | 62.0 | 872.0k |
| AutoTTS (β = 0.5) | Discovered | 84.3 | 255.3k | 74.1 | **361.2k** | 48.1 | **396.7k** | 61.1 | **379.0k** |
| AutoTTS (β = 1.0) | Discovered | **85.8** | 467.4k | 75.8 | 672.4k | **49.5** | 749.1k | 62.7 | 710.8k |
Scaling with β (accuracy vs. tokens)
Sweeping β traces empirical accuracy–token curves against fixed baselines; in each subplot the learned controller shifts toward a better Pareto frontier.
Self-evolving discovery trajectory
Discovery proceeds round by round: each iteration proposes an updated controller from the explorer (Claude Code in our experiments), pushing replay-evaluated accuracy and token cost toward a better accuracy–cost frontier.
Discovered controller
The discovered controller, which we term the Confidence Momentum Controller (CMC), reveals four non-obvious mechanisms.
Trend-based stopping. Rather than gating termination on instantaneous confidence, CMC maintains an EMA of pool confidence and stops only when both the EMA level is high and the trend is non-negative, preventing premature stopping on transient confidence spikes.
Coupled width–depth control. Widening and deepening are linked through the EMA delta: strong confidence gains suppress new branch spawning, while stagnation or regression triggers widening, creating a closed feedback loop absent in all hand-crafted baselines.
Alignment-aware depth allocation. Each round, branches whose latest answer matches the pool winner receive burst_aligned probe steps, concentrating computation on the emerging consensus while still advancing all active branches.
Conservative branch abandonment. A branch is abandoned only after persistently deviating for abandon_patience consecutive rounds, with at least two active branches always preserved.
Together, these mechanisms represent a level of coordinated complexity that would be difficult to arrive at through manual intuition alone.
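The full turn_5.py source is not reproduced here. The sketch below illustrates the four mechanisms in miniature: burst_aligned and abandon_patience come from the description above, while the EMA decay, thresholds, and class structure are illustrative assumptions, not the discovered controller itself.

```python
class ConfidenceMomentumSketch:
    """Hedged sketch of CMC's mechanisms; all constants are assumptions."""

    def __init__(self, decay=0.7, stop_level=0.85, widen_delta=0.0,
                 burst_aligned=2, abandon_patience=3):
        self.decay, self.stop_level = decay, stop_level
        self.widen_delta = widen_delta
        self.burst_aligned = burst_aligned          # extra probe steps for aligned branches
        self.abandon_patience = abandon_patience
        self.ema, self.prev_ema = 0.0, 0.0
        self.misaligned = {}                        # branch -> consecutive deviations

    def update(self, pool_confidence: float) -> None:
        """Fold the round's pool confidence into the EMA."""
        self.prev_ema = self.ema
        self.ema = self.decay * self.ema + (1 - self.decay) * pool_confidence

    def should_stop(self) -> bool:
        # Trend-based stopping: require a high EMA level AND a non-negative
        # trend, so transient confidence spikes do not trigger termination.
        return self.ema >= self.stop_level and self.ema >= self.prev_ema

    def should_widen(self) -> bool:
        # Coupled width-depth control: stagnation or regression in the EMA
        # triggers widening; strong gains suppress new branch spawning.
        return (self.ema - self.prev_ema) <= self.widen_delta

    def should_abandon(self, branch: int, aligned: bool, n_active: int) -> bool:
        # Conservative abandonment: only after abandon_patience consecutive
        # deviations, and never below two active branches.
        self.misaligned[branch] = 0 if aligned else self.misaligned.get(branch, 0) + 1
        return self.misaligned[branch] >= self.abandon_patience and n_active > 2
```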
Citation
@article{zheng2026autotts,
title = {LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling},
author = {Zheng, Tong and Liu, Haolin and Huang, Chengsong and Bao, Huiwen and Zhang, Sheng and Liu, Rui and Dai, Runpeng and Chen, Ruibo and Liu, Chenxi and Xiong, Tianyi and Wu, Xidong and Zhang, Hongming and Huang, Heng},
journal = {arXiv preprint},
year = {2026}
}
@misc{zheng2026parallelprobe,
title = {Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing},
author = {Zheng, Tong and Huang, Chengsong and Dai, Runpeng and He, Yun and Liu, Rui and Ni, Xin and Bao, Huiwen and Wang, Kaishen and Zhu, Hongtu and Huang, Jiaxin and Huang, Furong and Huang, Heng},
year = {2026},
eprint = {2602.03845},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2602.03845},
}