Introduction

Humanoids are the ultimate physical embodiment of AI. They are an important playground for testing the limits of current learning algorithms, and a critical benchmark against the most general-purpose cognitive embodiment we know of: humans.

One of the biggest challenges in bringing humanoid robots into the physical world is building human-level bipedal controllers that generalize to arbitrary environments (hills, snow, stairs), master multiple skills (walking, running, jogging), stay robust to failures, and remain energy efficient enough to perform long-horizon tasks.

In this blog, we will go over the biomechanics of walking, the limitations of classical control methods such as Model Predictive Control, how reinforcement learning addresses these limitations, and how we apply these techniques to train a Unitree G1 bipedal robot in Isaac Gym, with sim2sim evaluation in Mujoco.

Biomechanics of Walking

If you’ve ever taken a mechanical vibration course, you’d know that a lot of complex dynamic systems can be simplified and modeled with elastic springs.

In its simplest form, we can think of humanoid walking as balancing an inverted pendulum with springs that store and release energy by exchanging kinetic and potential energy (in a friction-free world this is known as passive dynamics).

Figure 1: Simplified Inverted Pendulum Analogy of Bipedal Walking
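
To make the analogy concrete, here is a minimal sketch of the linear inverted pendulum model (LIPM), in which the center of mass (CoM) is held at a constant height z0 and the horizontal dynamics reduce to x_ddot = (g / z0) * (x - p), where p is the stance-foot position. The numbers are illustrative only, and this is not part of the controller trained later.

# Minimal linear inverted pendulum (LIPM) rollout
g, z0 = 9.81, 0.9              # gravity [m/s^2], constant CoM height [m]
dt, steps = 0.002, 500         # integration step [s], number of steps

x, x_dot, p = 0.05, 0.0, 0.0   # CoM offset [m], CoM velocity [m/s], foot position [m]
for _ in range(steps):
    x_ddot = (g / z0) * (x - p)    # the pendulum accelerates away from the pivot
    x_dot += x_ddot * dt           # explicit Euler integration
    x += x_dot * dt

print(f"CoM offset after {steps * dt:.1f}s: {x:.3f} m")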

However, such simple dynamic models fail almost immediately when used for control in the physical world: they do not account for non-linear friction, varied terrain, the need to withstand perturbations, and so on.

The following figure shows a typical human walking gait cycle, to build an intuitive picture of the kind of problem we're solving1:

Figure 2: Detailed Human Walking Gait Cycle

We can go one step further than the simplified inverted pendulum and model each leg with six degrees of freedom (DoF)2, which is the approximation used by many humanoid robot companies:

  • Hip: yaw + roll + pitch = 3 DoF
  • Knee: pitch = 1 DoF
  • Ankle: pitch + roll = 2 DoF

In fact, with these approximations we can derive the dynamic equations of bipedal robots using Lagrangian dynamics, and then use the well-understood Model Predictive Control (MPC) toolbox to control bipedal locomotion3.
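
Concretely, the Lagrangian approach yields the standard rigid-body (manipulator) equation; the exact terms depend on the model you choose, but it generally takes the form

M(q)\,\ddot{q} + C(q,\dot{q})\,\dot{q} + g(q) = S^\top \tau + J_c(q)^\top \lambda

where q collects the floating-base and joint coordinates, M is the mass matrix, C captures Coriolis/centrifugal effects, g is gravity, S selects the actuated joints, τ are the motor torques, and λ are the ground contact forces acting through the contact Jacobian J_c. MPC then repeatedly solves a finite-horizon optimization over this (usually simplified) model at every control step.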

The dynamics of how a leg moves, and should move, is a relatively solved problem, and humanoid and legged robotics companies have pretty much solved teleoperation. However, when it comes to deploying these robots in the real world, deciding how the robot should sequence its legs and how long and how hard to swing each one is a very high-dimensional problem, and it's really hard to come up with good solutions using classical control.

Limitations of Classical Controllers

As it turns out, non-linear models take a lot of time to solve, the system is under-actuated, the models are very simplified and don't capture the full legged dynamics, and it is extremely hard to find hyper-parameters that generalize to the complex real world.

Moreover, building a general-purpose bipedal controller is a multi-objective control optimization: we need to simultaneously solve for stability, energy efficiency, speed, and natural gaits.

Which makes this problem suitable to solve using reinforcement learning!

Reinforcement Learning for Locomotion

At the heart of training with RL is data. It's all about generating data that shows how the robot can interact with its environment (i.e. a world model), and then using that data to train an agent. In RL, our model is an agent, and it learns through its experiences.

RL works best when we can make the real-world distribution part of the training data distribution. I really like this analogy4 made by Jim Fan from Nvidia during his recent talk at Sequoia Capital: “simulation principle is mapping from 1,000,000 to 1,000,001st”, meaning the real world should look very similar to what we have in our training set.

Therefore, the best control policies come from training setups that can massively benefit from parallel environments and GPUs, with rewards designed to take advantage of that parallelization.

Moreover, since humanoid companies operate large fleets of robots, it's essential to train a control policy that accounts for all the physical variance between robots and can be deployed from one to all in one shot. Parallelization makes this possible: we can randomize each scene with different physical parameters and learn a generalized policy from all of the experiences.

RL Formulation for Bipedal Walking

To formalize how RL is applied to bipedal walking, we can break it down into 4 key components that are part of any RL problem:

  1. State Space:
    • joint positions and velocities
    • body orientation (roll, pitch, yaw)
    • center of mass position and velocity
    • contact sensors (binary signals indicating foot contact)
    • command signals (desired velocity, direction)
  2. Action Space:
    • commanded joint positions/torques for each actuator
  3. Reward Function:
    • accounts for things like energy efficiency, speed, failure recovery, etc.
  4. Policy Network Architecture:
    • observation encoder (fully connected layers)
    • memory component (GRU/LSTM) to handle temporal dependencies
    • actor network (outputs actions)
    • critic network (estimates value function)
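
To make the last component concrete, here is a minimal PyTorch-style sketch of such an actor-critic with a recurrent memory component. The layer sizes and names are illustrative, not the exact architecture used in the experiments below.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal recurrent actor-critic sketch (sizes are illustrative)."""
    def __init__(self, num_obs, num_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_obs, hidden), nn.ELU())  # observation encoder
        self.memory = nn.GRU(hidden, hidden, batch_first=True)              # temporal memory
        self.actor = nn.Linear(hidden, num_actions)                         # action mean
        self.critic = nn.Linear(hidden, 1)                                  # value estimate
        self.log_std = nn.Parameter(torch.zeros(num_actions))               # exploration noise

    def forward(self, obs, hidden_state=None):
        x = self.encoder(obs)                                # (batch, time, hidden)
        x, hidden_state = self.memory(x, hidden_state)
        dist = torch.distributions.Normal(self.actor(x), self.log_std.exp())
        return dist, self.critic(x), hidden_state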

PPO for Bipedal Walking

Proximal Policy Optimization (PPO) has become the go-to algorithm for training legged robots for several reasons:

  • Sample Efficiency: compared to other policy gradient methods, PPO makes better use of collected experience through multiple update epochs
  • Stability: the clipped objective function prevents destructively large policy updates
  • Parallelization: PPO can efficiently utilize thousands of parallel environments to collect diverse experiences
  • Simple Implementation: compared to more complex algorithms like SAC or TD3, PPO has fewer hyperparameters and is more forgiving
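
The "clipped objective" mentioned above is worth seeing once. Here is a minimal sketch of the clipped surrogate loss; the function name and signature are illustrative:

import torch

def ppo_clip_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (minimal sketch)."""
    ratio = torch.exp(log_prob - old_log_prob)                        # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                      # negate because we minimize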

A simplified PPO training loop for bipedal walking looks something like this:

for iteration in range(num_iterations):
    # Collect experience using current policy
    trajectories = collect_trajectories(policy, environments)
    
    # Compute advantages and returns
    advantages = compute_gae(trajectories, value_function)
    
    # Update policy using PPO objective
    for epoch in range(num_epochs):
        for mini_batch in mini_batches(trajectories):
            update_policy(mini_batch, advantages)
            update_value_function(mini_batch)
    
    # Adjust environment difficulty i.e. curriculum
    if mean_reward > threshold:
        increase_difficulty(environments)
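
The compute_gae step above refers to Generalized Advantage Estimation. A minimal sketch of how it is typically computed, assuming flat arrays of rewards, value estimates, and done flags (a different signature than the pseudocode above, purely for illustration):

import numpy as np

def compute_gae_arrays(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (minimal sketch)."""
    advantages = np.zeros_like(rewards)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # TD error
        gae = delta + gamma * lam * not_done * gae                       # discounted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values        # regression targets for the value function
    return advantages, returns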

Something I've noticed while reading lots of papers on training legged robots is that the RL algorithm itself doesn't change much, nor does changing it bring significant improvements per se; what matters is the time spent on the training setup in simulation. This observation has also been replicated in recent advances applying RL to LLMs, where the RL algorithms are still the same but the priors and reasoning have significantly improved5.

Observation and Action Processing

A critical aspect often overlooked in RL for locomotion is proper observation and action processing:

  1. Observation Normalization: standardizing observations (zero mean, unit variance) across different dimensions helps stabilize training. This is especially important for bipedal robots where sensor values have different scales (angles vs. velocities).
  2. Action Clipping and Scaling: actions need proper scaling to match the robot’s physical limitations:
     import numpy as np

     def process_actions(raw_actions):
         # Scale from [-1, 1] to the robot's actual joint ranges
         scaled_actions = raw_actions * action_scale

         # Safety clipping so commands never exceed the joint limits
         clipped_actions = np.clip(scaled_actions, joint_min_limits, joint_max_limits)

         return clipped_actions
    
  3. Phase-Based Control: many successful implementations use phase variables (cyclical values indicating where in the gait cycle the robot is) to help the policy learn periodic behaviours.
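
As a concrete (and purely illustrative) example of points 1 and 3, here is a sketch of a running observation normalizer and a cyclic phase feature; the class, function, and parameter names are hypothetical:

import numpy as np

class RunningNormalizer:
    """Tracks a running mean/variance and standardizes observations (sketch)."""
    def __init__(self, size, eps=1e-8):
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = eps

    def update(self, batch):
        # Merge batch statistics into the running estimate (parallel update rule)
        batch_mean, batch_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta, total = batch_mean - self.mean, self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n + delta**2 * self.count * n / total) / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

def phase_features(t, gait_period=0.8):
    """Encode where we are in the gait cycle as a point on the unit circle."""
    phase = (t % gait_period) / gait_period
    return np.array([np.sin(2 * np.pi * phase), np.cos(2 * np.pi * phase)])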

If you ever feel like you're doing everything right yet your robot still isn't learning properly, make sure you've got your observation and action processing right! Ideally, I've found it best to get this done in the early stages of training by taking an overfit-then-regularize approach, so that you can trust your entire training infrastructure.

Rewards for Walking

One of the most intuitive ways I've found to think about crafting rewards for robot locomotion is as generating supervised learning data: we engineer the "labels" through our rewards, simulation setup, etc. This means that more effort spent on crafting a general reward function translates into higher quality data.

Some important rewards to include as part of your overall reward function to optimize for during training are:

  • not falling
  • walking at certain periodicity
  • symmetry
  • smoothness
  • adhering to commands
  • energy efficiency
  • etc
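
As an illustration (and not the exact reward used in the accompanying code), a simple walking reward combining a few of these terms might look like the following; all weights, dictionary keys, and helper names are hypothetical:

import numpy as np

def walking_reward(state, action, prev_action, commands):
    """Toy weighted sum of common locomotion reward terms (illustrative only)."""
    # adhere to the commanded planar velocity (exponential tracking kernel)
    vel_error = np.sum((state["base_lin_vel"][:2] - commands["lin_vel"]) ** 2)
    r_tracking = 1.0 * np.exp(-vel_error / 0.25)

    # not falling: small bonus for staying alive, large penalty on termination
    r_alive = 0.5 if not state["fell"] else -10.0

    # energy efficiency: penalize joint torques
    r_energy = -1e-4 * np.sum(np.square(state["torques"]))

    # smoothness: penalize rapid changes in the action
    r_smooth = -0.01 * np.sum(np.square(action - prev_action))

    return r_tracking + r_alive + r_energy + r_smooth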

Designing reward functions is a topic that deserves its own post! For the purposes of this blog, check out the experiments section and the accompanying code to see how we've constructed a simple reward function for walking.

Performance Metrics

Although this is not an exhaustive list, these are some high level metrics to look out for when training bipedal robots6:

  • forward velocity
  • energy per meter
  • time to fall
  • success on push-recovery perturbations tests
  • etc.
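
Many of these can be computed directly from logged quantities. For example, "energy per meter" is closely related to the dimensionless cost of transport (CoT) referenced in the open problems section below; a sketch of the computation, with made-up example numbers:

def cost_of_transport(power_watts, mass_kg, speed_mps, g=9.81):
    """Dimensionless cost of transport: power / (weight * speed)."""
    return power_watts / (mass_kg * g * speed_mps)

# e.g. a 35 kg robot drawing 250 W while walking at 1.0 m/s:
# cost_of_transport(250.0, 35.0, 1.0) ≈ 0.73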

While training walking policies, you'd also typically look at metrics like average episode length, the overall reward trend (is it increasing?), etc.

Symmetry Exploitation

Bipedal robots have an inherent symmetry that can be exploited:

  1. Mirrored Experiences: converting experiences from left to right and vice versa can effectively double our training data
  2. Symmetric Policy Architecture: using network architectures that enforce symmetric responses for symmetric situations
  3. Reward Symmetry: ensuring the reward function doesn’t favour one leg over the other
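
A minimal sketch of the mirrored-experiences idea; the index permutation and sign conventions are hypothetical and depend entirely on the robot's observation/action layout:

import numpy as np

# Toy layout: [left_hip, left_knee, right_hip, right_knee, base_vel_y, base_roll]
# The permutation swaps left/right joints; the sign map flips lateral quantities.
MIRROR_IDX = np.array([2, 3, 0, 1, 4, 5])       # hypothetical, robot-specific
MIRROR_SIGN = np.array([1, 1, 1, 1, -1, -1])    # flip y-velocity and roll

def mirror(vec):
    """Return the left/right mirrored version of an observation or action vector."""
    return MIRROR_SIGN * vec[MIRROR_IDX]

# Every collected (obs, action) pair yields a second, mirrored training sample
# (mirror(obs), mirror(action)), effectively doubling the data.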

Exploration Strategies

For bipedal locomotion, engineering exploration is crucial:

  1. Action Noise Decay: starting with high exploration noise and gradually decreasing it allows the policy to first explore the action space broadly, then refine its own movements
  2. Reference Motion Bias: initializing exploration around motion primitives from mocap data or simple controllers can accelerate learning
  3. Curriculum-Based Exploration: adapting the noise level based on the environment difficulty helps maintain an appropriate exploration level throughout training
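
A simple way to implement the first point is an exponential decay schedule on the policy's action noise; the numbers here are placeholders:

def action_noise_std(iteration, initial_std=1.0, final_std=0.2, decay=0.999):
    """Exponentially decay exploration noise, but never below a floor."""
    return max(final_std, initial_std * decay ** iteration)

# e.g. the std starts at 1.0, reaches ~0.37 after 1,000 iterations, and is floored at 0.2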

Challenges of Reinforcement Learning

While RL offers immense leverage for building human-level bipedal locomotion controllers, there are still critical challenges on the way to general-purpose models. Some of them include:

  • Reward Design: rewards still have to be designed by hand to balance the many objectives involved
  • Multi-Skill Policies: how to get walking, running, jogging, and so on out of one unified controller
  • Style: encoding human style and motion into walking
  • Generalization: handling challenging terrains (snow, ice, steep hills, etc.)

Also, it's hard to trust RL training results immediately and conclude that certain methods don't work: too little environment parallelization means less exploration, getting curriculum learning wrong means the robot might have to master complex tasks before learning the basics, and so forth.

Experiments

In my experiments, I trained on a 4GB NVIDIA GTX 1650 GPU, which is much smaller than what most research projects use, let alone the training infrastructure of robotics companies.

Even with limited hardware, results can improve substantially by adjusting parameters and managing the exploration vs. exploitation trade-off carefully.

Figure 3: Parallel Training of Unitree G1 in Isaac Lab to Walk

After doing hundreds of RL training runs for different types of legged robots like Spot, Cassie, and Unitree's, I've found that building a production-grade RL controller for bipedal robots should look something like this:

  • Simulation Maxing: I've found that, for the time being, training in Isaac Lab is the most effective way to build real-world controllers. You can massively parallelize training runs (= exploration), leverage GPUs to compress 10,000 human hours into 2 hours of simulation, ramp up iterations (= exploitation), and use many existing projects as references for applying domain randomization
  • Sim2Sim Evals: having infrastructure that makes it super easy to go from training in Isaac Gym to testing in Mujoco (or any other sophisticated physics engine) is critical to build at the beginning, along with some kind of benchmark that tests these policies on different terrains before they're deployed into the physical world
  • Sim2Real: although I didn't experiment much with this here, I've worked on this problem during my time at Inria as well as at my first job. Early on, even though teleoperation isn't the end goal, being confident that your robots can track motions accurately, receive the right signals, and perform what they're told is very important too; it helps you make sure you've tuned and modeled your actuators properly. If you want to learn more about sim2real, read here
Figure 4: Sim2Sim Eval of Unitree G1 in Mujoco

Training Runs

All of the training runs I did used PPO as the RL algorithm.

Figure 5: Sample of Mean Episode Length from a Training Run

1) Initial Test Run


python legged_gym/scripts/train.py --task=g1 --experiment_name=g1_test --num_envs=128 --max_iterations=100 --seed=42

  • Parameters: 128 parallel environments, 100 iterations
  • Results:
    • mean reward: 0.02
    • mean episode length: 32.32 steps
    • training time: ~10 minutes
    • robot behaviour: robots could barely stand before falling over

2) Extended Training Run


python legged_gym/scripts/train.py --task=g1 --experiment_name=g1_overnight_full --headless --num_envs=128 --max_iterations=10000 --seed=42

  • Parameters: 128 parallel environments, 10,000 iterations, headless mode
  • Results:
    • mean reward: 1.52 (76x improvement over initial run)
    • mean episode length: 579.58 steps (18x improvement)
    • training time: ~2 hours (significantly faster than expected)
    • key reward components:
      • tracking linear velocity: 0.2404 (positive)
      • tracking angular velocity: 0.0632 (positive)
      • contact: 0.0955 (positive foot contact patterns)
    • robot behaviour: stable walking with good balance and response to velocity commands
Figure 6: Sample Reward Mean Going Wrong i.e. Drop in Performance as Environment Becomes More Challenging

3) Final Training Run

python legged_gym/scripts/train.py --task=g1 --experiment_name=g1_curiculum --headless --num_envs=96 --max_iterations=15000 --seed=42 --curriculum --domain_rand

  • Domain Randomization Improvements (a config sketch follows this list)
    • wider friction range (0.2-1.5) for better terrain adaptation
    • added gravity randomization (-1.0 to 1.0) for better balance
    • added motor strength randomization (0.8-1.2) for robustness
    • more frequent and stronger pushes (every 4s, up to 2.0 m/s)
  • Network Architecture
    • deeper actor/critic networks [64, 32] instead of single layer
    • larger RNN (128 units, 2 layers) for better temporal modelling
    • increased initial noise (1.0) for better exploration
  • Training Parameters
    • reduced environments to 96 (from 128) to allow for larger networks
    • increased iterations to 15000 for better convergence
    • added gradient clipping and KL divergence control
    • more frequent checkpointing (every 100 iterations)
  • Reward Structure
    • increased tracking rewards (1.2x linear, 1.2x angular)
    • stronger penalties for unstable behaviours
    • added collision penalty
    • increased contact reward (0.25)
  • Control Parameters
    • increased stiffness and damping for better stability
    • higher action scale (0.3) for more dynamic movements
    • shorter episodes (20s) for faster learning
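
To make the domain-randomization settings above concrete, here is a hedged sketch of how such per-environment randomization might be expressed. The structure and names are illustrative and do not mirror the exact config schema of the training code; the ranges are the ones listed above.

import numpy as np

# Illustrative randomization ranges matching the run described above
DOMAIN_RAND = {
    "friction_range": (0.2, 1.5),          # ground friction coefficient
    "gravity_offset_range": (-1.0, 1.0),   # offset added to gravity [m/s^2]
    "motor_strength_range": (0.8, 1.2),    # multiplier on commanded torques
    "push_interval_s": 4.0,                # how often to push the robot
    "max_push_vel_mps": 2.0,               # velocity impulse applied by a push
}

def sample_env_params(rng=None):
    """Draw one set of physics parameters for a single parallel environment."""
    if rng is None:
        rng = np.random.default_rng()
    return {
        "friction": rng.uniform(*DOMAIN_RAND["friction_range"]),
        "gravity_offset": rng.uniform(*DOMAIN_RAND["gravity_offset_range"]),
        "motor_strength": rng.uniform(*DOMAIN_RAND["motor_strength_range"]),
    }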

Results:

  • better generalization through domain randomization
  • more stable walking through improved control parameters
  • faster learning through optimized reward structure
  • better exploration through network architecture changes

Some practical tips/experiments thoughts log

The following are some thoughts I wrote down as a log while writing code to squeeze out performance:

  • follow along other world-class projects that have very complex rewards to take inspiration on how to construct the problem
  • the best way I found so far to think about RL is generating the data that we’re training on!
  • getting codebase to work properly at the beginning is essential, do not skip this step! this includes making sure observation/action numbers are accurate
  • metrics to focus on during training: reward per episode, episode length, W&B plots
  • train in Isaac Lab, validate inside Mujoco for physics, try to apply perturbations or changes
  • find the balance between exploration and exploitation: increasing the number of environments increases the policy exploration, and increasing the number of iterations helps with exploitation i.e. refinement of the policy!
  • one tip that would be nice with agents: spin up different agents, each working on tuning different hyper-parameters, to scale experimentation!
  • try to interpret the performance of the walking robot in the lens of the reward functions you have, so that you can change your reward function or experiments with a strong intuition vs. adding more exploration/exploitation
  • run in headless mode after you’re confident about the training for longer runs to maximize GPU compute and scale your training!
  • there's something striking about how extremely small the neural networks trained with RL are, to the point that it's unclear how we're supposed to fit priors into them. On the plus side, this makes inference very cheap; on the other hand, it suggests we haven't yet found a way to build in intuitions or things like vision policies when training a robot to walk.
  • spikes occur and are normal during RL training: the policy is constantly trying new things to find better solutions, which can temporarily decrease performance. Also, when curriculum learning is implemented properly, things get more challenging as episodes last longer or reach new milestones.
  • focus on a strong upward trend, the values reached in reward/episode length, and stability near the end of training, i.e. convergence
  • observation normalization
  • overfit then regularize

Moving Forward

While this blog covered the basics of how to train a legged robot to walk and demonstrated it with code, there are still many open problems and ways to make the policy better on the path to a production-level bipedal controller.

Can you beat the baseline performance? I included sample weights in the log directory.

Methods to Improve Project

The following is a list of ideas to try on to take the existing training code forward:

  • Domain Randomization
    • terrain (slopes, steps, friction coefficients)
    • dynamic parameters (mass, inertia, motor strength, damping properties)
    • sensor noise simulation (noise to observations)
    • actuator delays (simulate communication/action delays that happen irl)
  • Advanced Reward Engineering
    • energy efficiency (minimize energy consumption, natural gaits)
    • foot clearance (foot clearance during swing phases)
    • smoothness (rate of change of acceleration)
    • style-based (reference motions from real animals or robots, rewards for matching gaits)
    • LLM-based rewards: similar to the Eureka project, although be careful about what it omits
  • Curriculum Learning
    • start with simple tasks: standing, then walking on flat terrain
    • gradually increase difficulty
    • velocity-based curriculum: start with slow movements and then gradually increase target velocities
    • robustness training: introduce external pushes of increasing magnitude as training progresses
  • Multi-task Learning
    • different movement modes: train for walking, trotting, galloping, and recovering from falls
    • agility tasks: add jumping, turning in place, or navigating obstacles as additional tasks
    • recovery behaviours: specifically train for recovering from unstable states
  • Advanced RL Techniques
    • hierarchical RL: implement high-level and low-level controllers for better motion-planning
    • model-based components: add a dynamics model to improve sample efficiency
    • experience replay prioritization: prioritize rare or difficult scenarios during training
    • priors: can we leverage prior knowledge before training a controller? VLAs for manipulation are very popular; can we do the same for bipedal walking?
    • Mixture of Experts: to choose between different modes
    • Hybrid Controllers: use MPC when we have models we trust to work well, and RL when the models might be too hard to run 100 times per second; then train a model that selects the best action based on what it currently has at hand
  • Sim2Real
    • deploy in real world, assess performance
    • collect failure cases, zoom in on the dynamics (I've done this with mobile robots)
    • online fine-tuning (same exploration methods to fine-tune on real robot)
    • real-world RL (reward model like Dyna, learn from failures)

Beyond just improving the learned control policy in this project, there are still many challenging open problems to solve in locomotion in order to build human level controllers.

Open Problems

While building RL controllers that can walk is a relatively solved problem, here is a list of some of the most critical problems that could benefit from new approaches:

  • terrain-blind generalist gait: one policy that marches from floor to mud to foam to ice
  • unified controller that does both planning + control
  • seamless gait-switch for all walking scenarios (walking, running, ice, stop)
  • vision-conditioned foot placement and walking
  • whole body loco-manipulation (walk while holding 10kg, door open, skating that tucks hands)
  • reward-engineering: describing things like walk quietly
  • language conditioned tasks
  • benchmarks for walking performance
  • robustness to perturbations and fast resets
  • encoding human preferences into walking like arms swinging
  • push recovery ≥ 200N in any direction and robot stays up
  • zero-shot stair & ladder ascent
  • apply behaviour cloning, frame problem as next action prediction with transformers
  • using LLMs to write diverse rewards like Eureka
  • match human speed ~1.3m/s walk + 3m/s jog
  • long path planning like following a map
  • energy efficiency: the cost of transport (CoT) of humans is ~0.2, while humanoids are at ~0.4-0.6
  • ability to carry different payloads and walk
  • use VLA for bipedal walking by learning from human data
  • leveraging prior physical knowledge for controllers

Conclusion

In this blog, we took a deep dive into controlling bipedal humanoids, from biomechanics to RL, and ran an actual experiment training a Unitree G1 humanoid in Isaac Lab and validating it in Mujoco.

Hope you learned something new from this blog. If you work on these kinds of problems and have any questions, feel free to reach out!

References

The following are the reference papers and code from the deep dive on bipedal humanoid training and control:

Cassie Environments

  • Cassie RL Walking: trains a versatile walking policy for tracking velocity, height, and turning commands; code for the Berkeley paper
  • Gym Cassie: trains Cassie to walk/run forward in MuJoCo; rewards forward motion and penalizes jumping
  • Gym Cassie Run: rewards walking/running forward as fast as possible
  • Walking Cassie: after 150 million time-steps, achieves slow/quick forward gaits, lateral and multi-velocity movement
  • Gym Cassie: legacy code for the OpenAI Gym implementation
  • bipedal_walker_terrain: teacher-student model for Cassie in Mujoco

Notes

  1. Biomechanics of Movement, Uchida and Delp 

  2. although in reality, the human leg has around 7-8 DoF, not counting the toes

  3. feel free to try this out in MATLAB 

  4. Understanding this principle early on and designing our simulations with maximum domain randomization to capitalize on it will be very important, as we will demonstrate later

  5. The Second Half, Shunyu Yao 

  6. check out my experiment thoughts log