PolyPPO for Bidirectional Decoding: Early Evidence From PushT

GitHub: sabdulmajid/bid_lerobot

This project explores a simple question: if an imitation policy can produce many plausible futures, can we train it to improve the whole set of futures, not just one sampled action at a time? The setting is VQ-BeT on PushT. VQ-BeT predicts actions as discrete residual vector-quantized (RVQ) code IDs, then decodes those codes into continuous actions. That structure gives a useful control point: we can treat the sampled RVQ code tuple as the PPO action while leaving the continuous decoder deterministic.
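
To make "code tuple as the PPO action" concrete, here is a minimal sketch of sampling one code ID per RVQ layer and scoring the tuple with a joint log-prob. It assumes one categorical head per layer, sampled independently given the observation; that factorization, the function names, and the toy logits are illustrative assumptions, not the repository's code. If a later layer is conditioned on an earlier sampled code, the same sum applies by the chain rule, with the later logits computed after the earlier sample.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample_code_tuple(layer_logits, rng):
    """Sample one code ID per RVQ layer; return (codes, joint log-prob).

    Assumption: layers are independent categoricals given the observation.
    """
    codes, total = [], 0.0
    for logits in layer_logits:
        logp = log_softmax(logits)
        probs = [math.exp(v) for v in logp]
        idx = rng.choices(range(len(logits)), weights=probs, k=1)[0]
        codes.append(idx)
        total += logp[idx]
    return tuple(codes), total


def code_tuple_log_prob(layer_logits, codes):
    """Recompute the joint log-prob of an already-sampled code tuple."""
    return sum(log_softmax(logits)[c] for logits, c in zip(layer_logits, codes))
```

Recomputing the log-prob of a stored tuple with unchanged logits should reproduce the value recorded at sampling time, which is the property the PPO ratio depends on.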

What We Tried

We compared five approaches. Direct VQ-BeT samples from the pretrained policy as-is. BID and PolySelect-style baselines add inference-time selection and reranking on top of it. PPO without diversity optimizes the sampled RVQ code IDs with standard PPO. Naive PolyPPO adds set-normalized advantages plus a code-diversity reward. Quality-gated PolyPPO, the next-stage objective, rewards diversity only among high-return attempts and can penalize diversity among bad ones.

The underlying idea comes from the distinction between pass@1 and pass@k: the chance that a single sampled attempt succeeds versus the chance that at least one of k sampled attempts succeeds. A policy can be poor at selecting one best action while still containing useful diverse candidates in its sample distribution. PolyPPO tries to improve that candidate distribution by training over sets of attempts from the same restored environment prefix.
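
The pass@1 versus pass@k distinction can be made concrete with the standard combinatorial estimator: given n attempts from one start state with c successes, the probability that a random size-k subset contains at least one success is 1 - C(n-c, k) / C(n, k). This write-up does not say which estimator the experiments used, so treat the sketch below as one reasonable choice; the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n attempts with c successes (same start state)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random k-subset of the n attempts has no success.
    p_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_fail


def mean_pass_at_k(successes_per_start, n, k):
    """Average pass@k over start states, each with n attempts."""
    values = [pass_at_k(n, c, k) for c in successes_per_start]
    return sum(values) / len(values)
```

With k = 1 this reduces to the plain success rate c / n, so a policy can score low at k = 1 and still score high at larger k as long as most start states have at least a few successful attempts.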

Correctness Before Scaling

We put gates in place before scaling anything: reproducing the direct VQ-BeT evaluation; checking that sampled RVQ code log-probs are correct; checking that recomputed old log-probs equal the stored ones before each update; confirming the PPO ratio equals 1 before the update; value-head overfit sanity checks; exact PushT state restore and deterministic prefix replay; set-normalized advantages summing to zero per set; lambda_div = 0 matching PPO without diversity; artifact validation for malformed or non-finite results; and same-seed, same-start evaluation discipline. This mattered because it is very easy to fool yourself with PolyPPO. If repeated attempts aren't actually from the same start state, pass@k is not measuring candidate diversity; it is mostly measuring seed difficulty.
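
Two of these gates are easy to express as small checks. The sketch below normalizes returns within one set of attempts and verifies the zero-sum property, then verifies that the PPO ratio exp(new_logp - old_logp) is 1 when nothing has been updated yet. Dividing by the within-set standard deviation is an assumption here (the text only says "set-normalized"); the mean subtraction alone is what guarantees the zero sum. All names are illustrative.

```python
import math


def set_normalized_advantages(returns, eps=1e-8):
    """Center (and scale) returns within one set of same-prefix attempts."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = math.sqrt(var)
    # Scaling by std is an assumption; centering gives the zero-sum property.
    return [(r - mean) / (std + eps) for r in returns]


def check_zero_sum(advantages, tol=1e-6):
    """Gate: advantages within a set must sum to (approximately) zero."""
    return abs(sum(advantages)) <= tol


def check_ratio_is_one(old_log_probs, new_log_probs, tol=1e-6):
    """Gate: before any gradient step, the PPO ratio must be 1 everywhere."""
    return all(
        abs(math.exp(new - old) - 1.0) <= tol
        for old, new in zip(old_log_probs, new_log_probs)
    )
```

If every attempt in a set earns the same return, all advantages come out as zero, so that set contributes no policy gradient.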

What We Found

The direct local VQ-BeT baseline landed around 0.50 pass@1 over 500 episodes. That's lower than public model-card expectations, so we treat it as a local reproducibility baseline. The 500-episode paired results were:

Method	pass@1	Avg max overlap
Direct VQ-BeT	0.504	0.750
PPO without diversity	0.398	0.761
Naive PolyPPO code diversity	0.392	0.761

This is a negative result for naive PolyPPO as a pass@1 method. It doesn't beat direct VQ-BeT, and it doesn't beat PPO without diversity. The grouped same-start pass@k experiment told a more interesting story though:

Method	pass@1	pass@2	pass@4	pass@8	pass@16	coverage@16
PPO without diversity	0.375	0.625	0.750	0.875	0.875	7/8
Naive PolyPPO code diversity	0.500	0.625	0.875	0.875	1.000	8/8

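Naive PolyPPO improved candidate coverage and best-of-set behavior, even though it didn't improve single-attempt success in the 500-episode paired evaluation. The grouped table's pass@1 column does favor PolyPPO, but the coverage denominators show it spans only eight start states, so it carries little weight next to the 500-episode result. The current research signal is that diversity helped the sample pool but hurt (or at least failed to improve) the policy's default sample quality.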
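
A toy example shows how coverage and single-attempt success can move in opposite directions. The sketch below computes grouped pass@1 and coverage from a success matrix, where each row holds the attempts from one start state. The two matrices are made-up illustrations, not the experiment's data, and the function name is hypothetical.

```python
def grouped_metrics(successes):
    """successes[i][j] is True if attempt j from start state i succeeded.

    Returns (pass@1, number of covered start states, number of start states).
    """
    n_starts = len(successes)
    per_start_rate = [sum(row) / len(row) for row in successes]
    pass_at_1 = sum(per_start_rate) / n_starts
    covered = sum(1 for row in successes if any(row))
    return pass_at_1, covered, n_starts


# Toy pool A: always solves two start states, never solves the other two.
concentrated = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]

# Toy pool B: solves every start state, but only once in four attempts.
spread = [
    [False, True, False, False],
    [False, False, False, True],
    [True, False, False, False],
    [False, False, True, False],
]
```

Pool A has the higher pass@1 and pool B has full coverage, which is the same qualitative pattern as the tables above: a more diverse pool can reach more start states while each individual sample is less reliable.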

Interpretation

The likely failure mode is off-manifold diversity. Diversity is only useful if it expands the set of good continuations. If the diversity reward is unconditional, it can reward strange RVQ codes that decode into behaviorally different but low-quality actions, improving pass@k while degrading pass@1. That points toward a cleaner hypothesis: diversity should be return-conditioned, rewarding diversity among good attempts rather than diversity for its own sake. This is what motivates quality-gated PolyPPO. The idea is to compute returns for attempts from the same restored prefix, identify good attempts by a mean or quantile threshold, add a diversity reward only for those good attempts, and optionally penalize diversity among bad ones.
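
The gating steps described above can be sketched as a per-attempt bonus computed within one set. This is one possible reading, not the project's objective: it assumes a mean-return gate, normalized Hamming distance between code tuples as the diversity measure, and a penalty on bad attempts proportional to their distance from other bad attempts. The names quality_gated_bonus and bad_penalty are hypothetical; lambda_div matches the coefficient name used earlier.

```python
def hamming(a, b):
    """Fraction of positions where two equal-length code tuples differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def quality_gated_bonus(codes, returns, lambda_div=0.1, bad_penalty=0.0):
    """Diversity term per attempt for one set of same-prefix attempts.

    Assumptions: mean-return gate (a quantile gate would swap the threshold),
    Hamming distance over RVQ code tuples as the diversity measure.
    """
    threshold = sum(returns) / len(returns)
    # Strict inequality: if all returns are equal, no attempt counts as good.
    good = [i for i, r in enumerate(returns) if r > threshold]
    bad = [i for i, r in enumerate(returns) if r <= threshold]

    def mean_distance(i, group):
        others = [j for j in group if j != i]
        if not others:
            return 0.0
        return sum(hamming(codes[i], codes[j]) for j in others) / len(others)

    bonus = [0.0] * len(returns)
    for i in good:
        bonus[i] = lambda_div * mean_distance(i, good)
    for i in bad:
        bonus[i] = -bad_penalty * mean_distance(i, bad)
    return bonus
```

With lambda_div = 0 and bad_penalty = 0 every bonus is zero, so the objective falls back to PPO without diversity, which keeps the existing lambda_div = 0 gate meaningful for the gated variant.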

Stress Tests

We also ran small stress evaluations under action noise and observation noise.

Variant	PPO pass@1	PolyPPO pass@1
Standard	0.38	0.36
Action noise	0.40	0.44
Observation noise	0.40	0.36

This is too small to be conclusive. It does suggest the candidate-pool improvement might sometimes help under action perturbation, but it's not a strong signal. A natural follow-up is to run the same-start grouped pass@k experiment under noise to see whether the candidate-pool improvement holds up better than pass@1 does.
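
A rough way to see why small runs are inconclusive is the binomial standard error of a success rate. The sketch below treats the two methods as independent samples, which ignores any pairing and is only an approximation. The episode count for the stress runs is not stated here, so any n passed in for them is a hypothetical value chosen for illustration.

```python
import math


def binomial_se(p, n):
    """Standard error of a success rate p estimated from n episodes."""
    return math.sqrt(p * (1.0 - p) / n)


def diff_z(p_a, p_b, n_a, n_b):
    """z-score for the gap between two independent success rates."""
    se = math.sqrt(binomial_se(p_a, n_a) ** 2 + binomial_se(p_b, n_b) ** 2)
    return (p_a - p_b) / se


# Hypothetical n = 50 for the action-noise row (the true count is not stated).
z_stress = diff_z(0.44, 0.40, 50, 50)
```

At a hypothetical 50 episodes per method, a gap of 0.04 is well under one standard error, so many more episodes would be needed before reading anything into the action-noise row.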

Where Things Stand

The foundation is in place. PPO over VQ-BeT RVQ code IDs works end-to-end, the PPO correctness gates are established, exact same-prefix grouped pass@k evaluation is working, and we have clean baselines for direct VQ-BeT, PPO, and naive PolyPPO. The negative result is that naive code diversity isn't enough to improve pass@1. The positive result is that set-level diversity can improve pass@k and coverage.

What's Next

The next step is quality-gated PolyPPO. Concretely, that means training quality-gated code-diversity PolyPPO beyond one update, comparing it against all three existing baselines, running same-start grouped pass@k across all methods, and running a 100-episode paired evaluation for pass@1. We'll scale only if quality-gated PolyPPO actually improves pass@1, pass@k, coverage, robustness, or sample efficiency at matched compute. The broader direction is best-of-k distillation, where we would use grouped sampling to discover high-quality continuations, then distill the best attempt back into the policy so that pass@k gains become pass@1 gains.
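
The data-selection half of best-of-k distillation can be sketched in a few lines: for each restored prefix, keep the highest-return attempt as a supervised target, and skip prefixes where even the best attempt was poor. This direction is only proposed above, so everything here is an assumption, including the record layout (prefix_id, codes, returns) and the min_return filter.

```python
def best_of_k_targets(groups, min_return=0.0):
    """Pick the best attempt per prefix as a distillation target.

    groups: list of dicts with hypothetical keys
      'prefix_id' (any id), 'codes' (list of code tuples),
      'returns' (list of floats, aligned with 'codes').
    Returns a list of (prefix_id, best_codes, best_return).
    """
    targets = []
    for g in groups:
        best = max(range(len(g["returns"])), key=lambda i: g["returns"][i])
        if g["returns"][best] >= min_return:
            targets.append((g["prefix_id"], g["codes"][best], g["returns"][best]))
    return targets
```

The selected code tuples would then serve as cross-entropy targets for the policy's code heads at the matching prefix, which is how a pass@k gain could in principle be pushed into pass@1.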