StageACT: Stage-Conditioned Imitation for Robust Humanoid Door Opening

TL;DR

Stage-conditioned imitation learning (StageACT) helps disambiguate partial observations in long-horizon tasks.
55% success on unseen doors → 2–5× the ACT baselines—with shorter completion time.
Stage prompts enable recovery and intentional out-of-sequence behaviors.

Abstract

Humanoid robots promise to operate in everyday human environments without requiring modifications to the surroundings. Among the many skills needed, opening doors is essential, as doors are the most common gateways in built spaces and often limit where a robot can go. Door opening, however, poses unique challenges as it is a long-horizon task under partial observability, such as reasoning about the door’s unobservable latch state that dictates whether the robot should rotate the handle or push the door. This ambiguity makes standard behavior cloning prone to mode collapse, yielding blended or out-of-sequence actions.

We introduce StageACT, a stage-conditioned imitation learning framework that augments low-level policies with task-stage inputs. This simple addition increases robustness to partial observability, leading to higher success rates and shorter completion times. On a humanoid operating in a real-world office environment, StageACT achieves a 55% success rate on previously unseen doors, more than doubling the best baseline. Moreover, our method supports intentional behavior guidance through stage prompting, enabling recovery behaviors. These results highlight stage conditioning as a lightweight yet powerful mechanism for long-horizon humanoid loco-manipulation.

Video

Challenges in Partial Observability

Key information, such as the latch state of the door, is not directly observable. Given visually almost-identical observations, similar actions can lead to drastically different outcomes.

Failure case - door remains latched

Successful case - door unlatched

Failure cases of the naive imitation learning policy that struggle to disambiguate partial observations in a long-horizon door opening task. (Left) Blending of trajectories, (Middle) out of sequence motions, (Right) imprecise trajectories.

StageACT

StageACT is a stage-conditioned imitation learning policy for humanoid door opening. The core idea is to append a compact stage label to the usual sensory inputs so the policy knows where it is in the task. We keep a single low-level policy rather than splitting locomotion and manipulation, and feed a one-hot stage vector together with the current image and robot state. Stages are defined as search, approach, rotate, push, and stop, and are annotated offline by combining visual inspection with proprioceptive cues, where sharp torque spikes mark transitions. Conditioning on this stage information adds temporal context that resolves look-alike observations in a partially observable, long-horizon setting and reduces mode collapse.

In practice, the policy follows ACT's execution style. It predicts short action chunks of roughly 3 seconds that are temporally smoothed for stability, using the current camera image and robot state as inputs along with the stage vector, and it outputs short sequences of whole-body actions. Training uses standard imitation objectives, with minimal changes to accommodate stage conditioning. The same stage vector also serves as a prompt at test time, which enables recovery by returning to an earlier stage and supports intentional non-sequential behaviors when needed.

Our framework combines stage-level guidance with lowlevel control, allowing policies to disambiguate partial observations and execute contact-rich tasks more reliably.

Q1. Does stage-conditioning improve policy performance compared to a standard imitation baseline?

Yes. StageACT reaches 55% success on an unseen door while the best ACT baseline is 20%, and it completes faster (20.7s vs 27.5s). The biggest lift shows up in the ambiguous Approach phase: ACT succeeds 7/20 times versus 17/19 for StageACT, which aligns with the claim that stage input resolves observation ambiguity.

Q2. Beyond replaying the dataset, can stage prompts guide new behaviors at test time?

Yes. Stage prompts provide controllability, including intentional non-sequential execution: we can command S1→S4→S5, skipping Approach and Rotate, and the policy executes actions consistent with the prompted stage rather than immediate observations. This is useful when the latch is already disengaged so the robot can push directly, even though no such demonstrations were in the dataset; Fig. 7 serves as the primary reference for stage-guided behavior.

Q3. How do you train this?

We train StageACT on teleoperated demonstrations collected with a whole-body setup: a Unitree G1 with Dex-3 hands, where an Apple Vision Pro tracks the operator’s hands and retargets them to the robot via inverse kinematics. Locomotion is issued as simple base velocity commands while the operator controls arms and hands, which makes loco-manipulation teleop practical. The resulting dataset has 135 successful demos, gathered by two operators over two days in two offices (>8 hours). Each trajectory logs a 480×640 egocentric RGB image, a 29-D robot state (upper body, both hands,

Overview of whole body teleoperation setup based on the G1 humanoid robot.

To inject temporal context, we annotate five stages (search, approach, rotate, push, stop) offline, combining visual inspection with proprioceptive cues, where sharp torque spikes indicate contact transitions. During training and evaluation, these stage labels are one-hot vectors concatenated with the usual inputs so the policy can disambiguate look-alike observations and avoid mode collapse.

Long-horizon task of door opening categorized into sub-stages.

The learning recipe follows ACT with a small modification for stage input: we optimize a standard imitation reconstruction loss together with a KL regularizer, predict ~3-second action chunks, and temporally smooth them for stable execution. Hyperparameters and procedures mirror ACT, with changes only to accept the stage vector.