Risk-Aware Supervision for OpenPI Robot Policies on LIBERO

GitHub: sabdulmajid/robotics

This project asks a simple question: before a robot foundation policy acts, can we predict whether the scene is too risky to attempt? The motivating problem is that long-horizon robot behavior compounds local failures. A policy might be strong on nominal tasks, but a small visual occlusion, poor object visibility, or an unfavorable initial state can make an otherwise reasonable action sequence fall apart. If the policy can't reliably tell when it's outside its reliable operating region, then "always try" isn't a good deployment strategy.

The system I built wraps OpenPI on LIBERO with a runtime risk supervisor. The supervisor estimates the probability of rollout failure from the current scene and early execution signals, then decides whether to continue or abstain. The important design choice is that the risk model isn't trying to replace the robot policy: OpenPI still produces actions. The risk layer only asks whether execution is likely to be productive.
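The continue-or-abstain gate described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `RiskSupervisor`, `risk_fn`, `toy_risk`, and the 0.5 threshold are hypothetical names and values, and the policy itself is left out because the supervisor never produces actions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical interface: the project's real class and function names
# are not given in the text.
@dataclass
class RiskSupervisor:
    risk_fn: Callable[[Sequence[float]], float]  # features -> P(rollout failure)
    threshold: float  # abstain when estimated risk exceeds this value

    def decide(self, features: Sequence[float]) -> str:
        """Return "continue" or "abstain"; the policy still chooses the actions."""
        risk = self.risk_fn(features)
        return "abstain" if risk > self.threshold else "continue"


def toy_risk(features: Sequence[float]) -> float:
    # Placeholder risk model: treat the first feature as an occlusion
    # score already scaled to [0, 1].
    return min(1.0, max(0.0, features[0]))


supervisor = RiskSupervisor(risk_fn=toy_risk, threshold=0.5)
decisions = [supervisor.decide([0.1]), supervisor.decide([0.9])]
print(decisions)
```

Keeping the gate separate from the policy means the threshold can be retuned without touching OpenPI.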

First Principles

A robot policy has two distinct problems. The first is control: what action should I take? The second is execution risk: should I trust this policy in this state? Most policy evaluations focus on the first question. This project focuses on the second. If a policy has a 70% success rate, the naive interpretation is that it's "pretty good." But for deployment, the structure of the failures matters more than the average. If failures are predictable from state, a supervisor can reject high-risk attempts and preserve the useful parts of the policy. This leads to a coverage-aware objective: maximize useful task completion while reducing failures among attempted executions, without rejecting everything. The tradeoff is unavoidable — a risk-aware system can reduce failures by abstaining, but too much abstention lowers coverage and total completion. The project is about finding the operating point where the risk signal actually improves decision quality.

What I Built

The project has three layers.

A toy symbolic risk-planning harness. Before running robot simulation, I built a small stochastic symbolic environment to test the interfaces and prove that state-conditioned risk can improve planning when there are meaningful alternatives. The toy domain includes skills like direct pick, conservative pick, move distractor, fast place, slow place, and recover. The result established the first gate: oracle state-risk planning beats naive and fixed-risk planning when the scenario has exploitable risk structure. This mattered because it prevented jumping into learned critics before proving the planning idea could work at all.

Real OpenPI/LIBERO execution. The main experiment uses OpenPI's pi05_libero policy on LIBERO tasks. OpenPI provides the robot foundation policy, LIBERO provides tabletop manipulation tasks and simulation, and the risk layer logs rollouts, extracts state and progress information, and trains failure predictors. The current dataset contains 993 direct OpenPI episodes for risk modeling, 630 held-out runtime supervisor episodes, LIBERO-Spatial stress tests with occlusion and action noise, and nominal rollouts across LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-10. This makes the project more than a toy planning demo: it's a working risk-supervision stack around a real vision-language-action robot policy.

VLM-based runtime risk supervision. The strongest risk model uses frozen SigLIP image embeddings. If visual conditions make a task risky, an image model should capture some of that signal. The model combines frozen SigLIP image features from the scene, task identity and language information, early rollout progress statistics, action norm and smoothness signals, and no-progress indicators. It doesn't use hidden stressor metadata as an input, so the image model has to infer risk from observable scene evidence.
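To make the feature combination concrete, here is a minimal sketch of how such inputs could be concatenated and scored. Everything in it is an assumption for illustration: the text does not say what kind of prediction head is used, so a logistic head stands in, the three-number embedding stands in for a real SigLIP vector, and `build_features`, `failure_probability`, and the weights are hypothetical.

```python
import math


def build_features(image_embedding, task_id, num_tasks,
                   progress, action_norm, smoothness, no_progress):
    """Concatenate observable signals into one flat feature vector."""
    task_one_hot = [1.0 if i == task_id else 0.0 for i in range(num_tasks)]
    return (list(image_embedding) + task_one_hot
            + [progress, action_norm, smoothness, float(no_progress)])


def failure_probability(features, weights, bias):
    """Logistic head (an assumed choice): sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Stand-in for a frozen image embedding; real embeddings are far larger.
embedding = [0.2, -0.1, 0.4]
features = build_features(embedding, task_id=1, num_tasks=3,
                          progress=0.05, action_norm=0.8,
                          smoothness=0.3, no_progress=True)

# Made-up weights: only the no-progress indicator contributes here.
weights = [0.0] * len(features)
weights[-1] = 2.0
risk = failure_probability(features, weights, bias=-1.0)
print(len(features), round(risk, 3))
```

Note that no stressor metadata appears among the inputs, matching the constraint that risk must be inferred from observable evidence.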

Offline Risk Modeling Results

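The offline results show that visual features carry most of the predictive signal: the SigLIP model reaches an AUROC of 0.905 and an AUPRC of 0.811, against roughly 0.70 and at most 0.35 for the two non-visual baselines.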

Model	AUROC	AUPRC
Fixed task prior	0.695	0.347
Structured progress risk	0.702	0.297
SigLIP vision-language risk	0.905	0.811
Metadata oracle	0.930	0.840

The metadata oracle is allowed to see injected stressor information, so it's only a diagnostic upper bound. The important result is that the SigLIP model nearly recovers the oracle signal without using hidden stressor labels. The visual scene contains meaningful information about whether OpenPI is likely to fail.
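For readers less familiar with the two metrics in the table, the sketch below computes them from scratch on a tiny made-up example, treating failure as the positive class. It is an illustration only: the labels and scores are invented, and whether the project computes AUPRC as average precision (as done here) is an assumption.

```python
def auroc(labels, scores):
    """Probability that a random failure outranks a random success.

    Ties between a failure and a success count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one success")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision measured at the rank of each true failure.

    Assumes no tied scores, which keeps the ranking unambiguous.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one failure")
    return total / hits


# Invented data: 1 = rollout failed, 0 = rollout succeeded.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(round(roc, 3), round(ap, 3))
```

AUPRC is the more demanding number when failures are the minority class, which is why the gap between the baselines and the SigLIP model is wider there than in AUROC.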

Runtime Supervisor Results

The first runtime supervisor used the offline calibration threshold directly, run on 630 held-out OpenPI/LIBERO episodes.

Runtime mode	Coverage	Completion	Attempted failure	Abstain	Utility
Direct OpenPI	1.000	0.695	0.305	0.000	0.528
Fixed task prior selective	1.000	0.686	0.314	0.000	0.514
Runtime SigLIP selective	0.681	0.595	0.126	0.319	0.480

The result is mixed but useful. On the positive side, runtime SigLIP supervision cuts attempted failure from 30.5% to 12.6%. On the negative side, the original threshold is too conservative — it abstains on 31.9% of episodes, lowering total completion and utility. The risk signal is real, but the operating point needs tuning.
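The table columns are related by simple arithmetic: completion equals coverage times one minus attempted failure, and the SigLIP row checks out as 0.681 × (1 − 0.126) ≈ 0.595. The sketch below computes these quantities from per-episode outcomes. The definitions are inferred from that arithmetic rather than stated in the text, the ten episodes are invented, and utility is left out because its formula is not given here.

```python
def selective_metrics(episodes):
    """Compute selective-execution metrics.

    episodes: list of (attempted, success) booleans; an abstained
    episode is (False, False).
    """
    n = len(episodes)
    attempted = [success for was_attempted, success in episodes if was_attempted]
    successes = sum(attempted)
    failures = len(attempted) - successes
    return {
        # Fraction of all episodes the supervisor allowed to run.
        "coverage": len(attempted) / n,
        # Successes over all episodes, so abstentions count against it.
        "completion": successes / n,
        # Failures over attempted episodes only.
        "attempted_failure": failures / len(attempted) if attempted else 0.0,
        "abstain": (n - len(attempted)) / n,
    }


# Invented example: 10 episodes, 7 attempted (6 succeed), 3 abstained.
episodes = [(True, True)] * 6 + [(True, False)] + [(False, False)] * 3
m = selective_metrics(episodes)
print(m)
```

Because completion is normalized by all episodes, a supervisor cannot improve it by abstaining; abstention can only improve the attempted-failure column.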

Threshold Sweep

To optimize the operating point, I ran a task-disjoint runtime threshold sweep. Runtime tasks 0–4 were used as calibration, tasks 5–9 were held out for testing, and thresholds were selected only from the calibration split before being evaluated once on the test split. Direct OpenPI on the test split had utility = 0.571 and attempted failure = 0.276. The tuned SigLIP supervisor produced:

Calibration target	Test coverage	Completion	Attempted failure	Utility	Utility delta
0.70	0.686	0.629	0.083	0.528	-0.043
0.75	0.781	0.714	0.085	0.627	+0.056
0.80	0.810	0.714	0.118	0.618	+0.047
0.85	0.857	0.724	0.156	0.617	+0.046
0.90	0.933	0.724	0.224	0.592	+0.021
0.95	0.962	0.724	0.248	0.583	+0.012
1.00	1.000	0.724	0.276	0.571	0.000

The best utility operating point is the 75% calibration target, with test coverage of 0.781, utility of 0.627, and attempted failure of 0.085. That beats direct OpenPI utility (0.571) while cutting failure among attempted episodes from 0.276 to 0.085. At higher calibration targets the safety benefit remains but weakens: attempted failure rises to 0.156 at the 85% target, 0.224 at 90%, and 0.248 at 95%. Risk-aware runtime supervision can beat direct execution, but only when the threshold is selected for the runtime distribution.
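The sweep procedure can be sketched as follows. This is a minimal version under stated assumptions: "calibration target" is read as the fraction of calibration episodes the threshold lets through, episodes are attempted when risk is at or below the threshold, and the episode data and function names are invented for illustration.

```python
import math


def threshold_for_coverage(calib_risks, target_coverage):
    """Smallest threshold that attempts at least the target fraction."""
    ranked = sorted(calib_risks)
    # Epsilon guards against round-off such as 0.7 * 10 -> 7.000000000000001.
    k = math.ceil(target_coverage * len(ranked) - 1e-9)
    k = max(1, min(k, len(ranked)))
    return ranked[k - 1]


def evaluate(episodes, threshold):
    """Apply a fixed threshold to (task_id, risk, success) episodes."""
    n = len(episodes)
    attempted = [success for _, risk, success in episodes if risk <= threshold]
    successes = sum(attempted)
    return {
        "coverage": len(attempted) / n,
        "completion": successes / n,
        "attempted_failure": (len(attempted) - successes) / len(attempted)
        if attempted else 0.0,
    }


# Invented episodes: (task_id, predicted risk, succeeded).
episodes = [
    (0, 0.05, True), (1, 0.10, True), (2, 0.15, True), (3, 0.20, True),
    (4, 0.30, False), (0, 0.40, True), (1, 0.70, False), (2, 0.90, False),
    (5, 0.10, True), (6, 0.20, False), (7, 0.35, True), (8, 0.45, True),
    (9, 0.80, False),
]

# Task-disjoint split: tasks 0-4 calibrate, tasks 5-9 are held out.
calib = [e for e in episodes if e[0] <= 4]
test = [e for e in episodes if e[0] >= 5]

# The threshold is chosen from calibration data only, then applied once.
threshold = threshold_for_coverage([risk for _, risk, _ in calib], 0.75)
result = evaluate(test, threshold)
print(threshold, result)
```

Test coverage need not match the calibration target, which is consistent with the table: the 0.75 target yields 0.781 coverage on the held-out tasks.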

What We Tried That Didn't Win

Adaptive chunking and early-abort style interventions were also explored. The intuition was that if risk is high, the system could shorten OpenPI's action horizon, query the policy more often, or stop early when progress stalls. Those ideas are still worth pursuing as future control mechanisms, but they weren't the main win at this stage. Simple risk-aware abstention was cleaner and more effective. So we reached an important negative conclusion: the first useful intervention is not to "control harder," but to "know when not to try."

What's Been Established

The project now has real OpenPI/LIBERO integration, large-scale rollout logging on robot simulation tasks, calibrated failure-risk modeling, frozen VLM features as risk predictors, runtime risk-based abstention, coverage-aware evaluation, threshold selection using a runtime calibration split, and a tuned operating point that improves utility over direct OpenPI in the paired runtime sweep. The system is a practical risk-supervision layer for brittle robot foundation policy execution.

What We Found

The central finding is that scene-conditioned visual risk estimates can identify many OpenPI failures before the robot fully commits to execution. The second finding is equally important: calibration matters as much as model quality. The offline SigLIP model was strong, but its original threshold was too cautious online. Once thresholds were tuned on a runtime calibration split, the same model found much better operating points. That suggests the next improvements should focus less on new architectures and more on calibration, coverage, and recovery behavior.

Next Steps

The immediate next experiment is running fresh runtime rollouts using the tuned threshold, rather than only evaluating thresholds against paired runtime outcomes. Beyond that, the plan is to repeat the threshold sweep across LIBERO-Object, LIBERO-Goal, and LIBERO-10, add recovery policies for rejected or failed attempts, explore temporal VLM features instead of single-frame embeddings, add a lightweight predictive dynamics model for no-progress and timeout risk, and evaluate whether risk-aware supervision improves utility under a fixed rejection budget. The longer-term direction is connecting this risk layer back to skill planning. Instead of only asking whether OpenPI should try a task, the planner should choose among alternatives — try the direct policy, use a conservative variant, move obstructing objects, recover to a safe pose, ask for help, or abstain. That turns risk prediction into decision-making.
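One way to picture the planner-level decision is an expected-utility comparison across options. The sketch below is purely hypothetical: the option names echo the list above, but the success probabilities, rewards, and costs are invented, and the scoring rule is one simple choice among many rather than the project's planned method.

```python
def expected_utility(option):
    """Score an option given as (p_success, reward, failure_cost, action_cost)."""
    p_success, reward, failure_cost, action_cost = option
    return p_success * reward - (1.0 - p_success) * failure_cost - action_cost


def choose_option(options):
    """Pick the option name with the highest expected utility."""
    return max(options, key=lambda name: expected_utility(options[name]))


# Invented numbers. In a risky scene the direct policy is likely to fail,
# so the slower conservative variant wins despite its extra cost.
risky_scene = {
    "direct": (0.40, 1.0, 1.0, 0.0),
    "conservative": (0.70, 1.0, 1.0, 0.1),
    "abstain": (0.0, 0.0, 0.0, 0.0),
}

# In a nominal scene the direct policy is already reliable.
nominal_scene = {
    "direct": (0.90, 1.0, 1.0, 0.0),
    "conservative": (0.92, 1.0, 1.0, 0.1),
    "abstain": (0.0, 0.0, 0.0, 0.0),
}

risky_choice = choose_option(risky_scene)
nominal_choice = choose_option(nominal_scene)
print(risky_choice, nominal_choice)
```

Abstention becomes just one option with zero expected utility, so the current supervisor is the special case where the only alternatives are "direct" and "abstain".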

Summary

This project started from a simple observation that robot policies fail unevenly, and uneven failures can be predicted. The current system shows that a frozen VLM risk model can detect many risky OpenPI/LIBERO executions, reduce attempted failures, and (after runtime threshold tuning) improve utility over direct execution in a paired runtime evaluation. A robot foundation policy should not only know what action to take, but also when the scene is too risky to trust itself.