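Unitree Robot Reinforcement Learning: PPO Algorithm Python Implementation and Walkthrough

Preface

This installment walks through the Python implementation of the PPO algorithm in the rsl_rl repository.

Unitree RL GYM is an open-source reinforcement learning (RL) control example project built around Unitree robots. It is used to train, test, and deploy control policies for legged robots. The repository supports several Unitree models, including Go2, H1, H1_2, and G1. Repository: https://github.com/unitreerobotics/unitree_rl_gym.git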

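0 Repository installation

The official documentation already covers installation and environment setup clearly (see the official tutorial), so I will not repeat it here.
The commands below fetch the rsl_rl code at the v1.0.2 tag used in this walkthrough.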

git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
git checkout v1.0.2

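0-1 PPO formula review

Let us briefly review the core PPO formulas. The PPO objective is:
L^{clip}(\theta) = \mathbb{E}[\min(r(\theta)A, \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A)]
where:
- r(\theta): probability ratio between the new and old policies
- A: advantage (advantage function)
- \epsilon: clipping range, typically 0.1 to 0.2

Probability ratio:
r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}
It expresses:
- The ratio of the probabilities that the new and old policies assign to a given action.
- If r ≈ 1, the new and old policies are about the same
- If r >> 1 or r << 1, the policy has changed too much
With the objective above, PPO removes the incentive to move r(\theta) outside [1-\epsilon, 1+\epsilon]. Once the ratio leaves this range in the direction that would improve the objective, the clipped term is selected, its gradient is zero, and the update stops pushing further.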
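
To make the clipping behavior concrete, here is a minimal per-sample sketch of the objective in plain Python (no PyTorch, so it runs without dependencies). The function name `clipped_surrogate` is made up for this illustration and is not part of rsl_rl.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above 1 + eps: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage, ratio below 1 - eps: the objective is flat at (1 - eps) * A.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + eps: NOT clipped, the full penalty r * A applies.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)

print(capped_gain, capped_loss, full_penalty)
```

The third case shows that the clip is one-sided: it only removes the incentive to keep moving in the direction that looks beneficial, while moves that make the objective worse are still penalized in full.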

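GAE (Generalized Advantage Estimation):
- It introduces a parameter λ (lambda) and takes an exponentially weighted sum of multi-step TD errors, which yields a more stable advantage estimate.
Formula: A_t^{GAE} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, where \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) is the TD error (Temporal Difference Error).
The parameter λ controls the trade-off between bias and variance:
- λ = 0: uses only the one-step TD error; low variance, higher bias
- λ = 1: approaches the Monte Carlo return; low bias, higher variance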
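
The two extremes of λ can be checked numerically. The sketch below evaluates the GAE sum directly from its definition on a short, made-up rollout (truncated at the end of the rollout rather than running to infinity). The function name and the numbers are illustrative assumptions.

```python
def gae_from_definition(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, truncated at the rollout end.

    `values` must have len(rewards) + 1 entries; the last one is V(s_T).
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # V(s_0) .. V(s_3)

# lam = 0 keeps only the one-step TD error delta_t.
one_step = gae_from_definition(rewards, values, gamma=0.9, lam=0.0)

# lam = 1 telescopes into (discounted return) - V(s_t).
monte_carlo = gae_from_definition(rewards, values, gamma=0.9, lam=1.0)

print(one_step[0], monte_carlo[0])
```

With λ = 1 the intermediate value estimates cancel out, which is why the result depends only on the rewards, the final bootstrap value, and V(s_t).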

The policy entropy is: H(\pi) = -\sum_a \pi(a|s)\log \pi(a|s)
- The more random the policy, the larger the entropy
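
As a quick illustration of "more random means larger entropy", the sketch below computes the entropy of a discrete distribution from the formula above, plus the closed-form entropy of a one-dimensional Gaussian, 0.5 * log(2πeσ²), since the policy discussed later is Gaussian. Both helper names are made up for this example.

```python
import math


def categorical_entropy(probs):
    """H = -sum_a p(a) * log p(a); terms with p(a) = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])  # maximally random
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])   # almost deterministic

print(uniform, peaked)
print(gaussian_entropy(1.0), gaussian_entropy(2.0))  # wider Gaussian, larger entropy
```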

1 Repository overview

After cloning the repository, we can use the tree command to look at the overall project structure.
rsl_rl directory structure

rsl_rl/
├── algorithms/
├── env/
├── modules/
├── runners/
├── storage/
└── utils/

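1-1 `algorithms/` directory

algorithms/
├── __init__.py
└── ppo.py

Purpose: holds the RL algorithm implementations; ppo.py implements PPO (Proximal Policy Optimization).
Characteristics:
- Can be extended with more algorithms (e.g. DDPG, TD3, DPPO).
- Provides the core training logic (policy update, loss computation, and so on).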

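1-2 `env/` directory

env/
├── __init__.py
└── vec_env.py

Purpose: wraps the environment interface.
- vec_env.py defines the vectorized environment (Vectorized Environment) interface, which supports training on many environments in parallel.
Role:
- Connects to the simulation environment (e.g. PyBullet / MuJoCo).
- Exposes a standard interface (step, reset, render, etc.) to the training algorithm.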

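1-3 `modules/` directory

modules/
├── actor_critic.py
└── actor_critic_recurrent.py

Purpose: defines the policy network architecture.
- actor_critic.py: the plain Actor-Critic network.
- actor_critic_recurrent.py: the RNN / LSTM version of the Actor-Critic network.
Role:
- Provides the policy and value networks that PPO or other algorithms call.
- Supports modeling of state sequences, which suits time-dependent robot motion control.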

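1-4 `runners/` directory

runners/
└── on_policy_runner.py

Purpose: training scheduler.
- on_policy_runner.py collects data with the current policy and runs the training loop.
Role:
- Manages data collection, the number of training iterations, and model checkpoints.
- Ties the algorithm, environment, and storage modules together into a complete training pipeline.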

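1-5 `storage/` directory

storage/
└── rollout_storage.py

Purpose: stores the collected trajectories (rollouts).
Role:
- PPO needs to keep the state, action, reward, and related data for every step.
- Provides mini-batch generation for updates, normalization, and similar functions.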

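1-6 `utils/` directory

utils/
└── utils.py

Purpose: utility functions.
- For example logging, model saving/loading, and tensor operations.
Role:
- Provides the general-purpose helpers needed for training and deployment, keeping the main logic lean.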

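2 Python implementation of the PPO algorithm

2-1 Code at a glance

The code lives at

algorithms/
├── __init__.py
└── ppo.py

The complete code is shown below. No rush, we will analyze it part by part.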

# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Copyright (c) 2021 ETH Zurich, Nikita Rudin
import torch
import torch.nn as nn
import torch.optim as optim

from rsl_rl.modules import ActorCritic
from rsl_rl.storage import RolloutStorage

class PPO:
    actor_critic: ActorCritic

    def __init__(self, actor_critic, num_learning_epochs=1, num_mini_batches=1, clip_param=0.2, gamma=0.998, lam=0.95, value_loss_coef=1.0, entropy_coef=0.0, learning_rate=1e-3, max_grad_norm=1.0, use_clipped_value_loss=True, schedule="fixed", desired_kl=0.01, device='cpu',):
        self.device = device
        self.desired_kl = desired_kl
        self.schedule = schedule
        self.learning_rate = learning_rate
        # PPO components
        self.actor_critic = actor_critic
        self.actor_critic.to(self.device)
        self.storage = None  # initialized later
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=learning_rate)
        self.transition = RolloutStorage.Transition()
        # PPO parameters
        self.clip_param = clip_param
        self.num_learning_epochs = num_learning_epochs
        self.num_mini_batches = num_mini_batches
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.gamma = gamma
        self.lam = lam
        self.max_grad_norm = max_grad_norm
        self.use_clipped_value_loss = use_clipped_value_loss

    def init_storage(self, num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape):
        self.storage = RolloutStorage(num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape, self.device)

    def test_mode(self):
        self.actor_critic.test()

    def train_mode(self):
        self.actor_critic.train()

    def act(self, obs, critic_obs):
        if self.actor_critic.is_recurrent:
            self.transition.hidden_states = self.actor_critic.get_hidden_states()
        # Compute the actions and values
        self.transition.actions = self.actor_critic.act(obs).detach()
        self.transition.values = self.actor_critic.evaluate(critic_obs).detach()
        self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()
        self.transition.action_mean = self.actor_critic.action_mean.detach()
        self.transition.action_sigma = self.actor_critic.action_std.detach()
        # need to record obs and critic_obs before env.step()
        self.transition.observations = obs
        self.transition.critic_observations = critic_obs
        return self.transition.actions

    def process_env_step(self, rewards, dones, infos):
        self.transition.rewards = rewards.clone()
        self.transition.dones = dones
        # Bootstrapping on time outs
        if 'time_outs' in infos:
            self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)
        # Record the transition
        self.storage.add_transitions(self.transition)
        self.transition.clear()
        self.actor_critic.reset(dones)

    def compute_returns(self, last_critic_obs):
        last_values = self.actor_critic.evaluate(last_critic_obs).detach()
        self.storage.compute_returns(last_values, self.gamma, self.lam)

    def update(self):
        mean_value_loss = 0
        mean_surrogate_loss = 0
        if self.actor_critic.is_recurrent:
            generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
        else:
            generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
        for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
            old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:
            self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
            actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)
            value_batch = self.actor_critic.evaluate(critic_obs_batch, masks=masks_batch, hidden_states=hid_states_batch[1])
            mu_batch = self.actor_critic.action_mean
            sigma_batch = self.actor_critic.action_std
            entropy_batch = self.actor_critic.entropy
            # KL
            if self.desired_kl != None and self.schedule == 'adaptive':
                with torch.inference_mode():
                    kl = torch.sum(
                        torch.log(sigma_batch / old_sigma_batch + 1.e-5) +
                        (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5,
                        axis=-1)
                    kl_mean = torch.mean(kl)
                    if kl_mean > self.desired_kl * 2.0:
                        self.learning_rate = max(1e-5, self.learning_rate / 1.5)
                    elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
                        self.learning_rate = min(1e-2, self.learning_rate * 1.5)
                for param_group in self.optimizer.param_groups:
                    param_group['lr'] = self.learning_rate
            # Surrogate loss
            ratio = torch.exp(actions_log_prob_batch - torch.squeeze(old_actions_log_prob_batch))
            surrogate = -torch.squeeze(advantages_batch) * ratio
            surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
            surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()
            # Value function loss
            if self.use_clipped_value_loss:
                value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
                value_losses = (value_batch - returns_batch).pow(2)
                value_losses_clipped = (value_clipped - returns_batch).pow(2)
                value_loss = torch.max(value_losses, value_losses_clipped).mean()
            else:
                value_loss = (returns_batch - value_batch).pow(2).mean()
            loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()
            # Gradient step
            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
            self.optimizer.step()
            mean_value_loss += value_loss.item()
            mean_surrogate_loss += surrogate_loss.item()
        num_updates = self.num_learning_epochs * self.num_mini_batches
        mean_value_loss /= num_updates
        mean_surrogate_loss /= num_updates
        self.storage.clear()
        return mean_value_loss, mean_surrogate_loss

2-2 Initialization function

Let us look at how the class is initialized:

class PPO:
    actor_critic: ActorCritic
    def __init__(self, actor_critic, num_learning_epochs=1, num_mini_batches=1, clip_param=0.2, gamma=0.998, lam=0.95, value_loss_coef=1.0, entropy_coef=0.0, learning_rate=1e-3, max_grad_norm=1.0, use_clipped_value_loss=True, schedule="fixed", desired_kl=0.01, device='cpu',):
        self.device = device
        self.desired_kl = desired_kl
        self.schedule = schedule
        self.learning_rate = learning_rate
        # PPO components
        self.actor_critic = actor_critic
        self.actor_critic.to(self.device)
        self.storage = None  # initialized later
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=learning_rate)
        self.transition = RolloutStorage.Transition()
        # PPO parameters
        self.clip_param = clip_param
        self.num_learning_epochs = num_learning_epochs
        self.num_mini_batches = num_mini_batches
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.gamma = gamma
        self.lam = lam
        self.max_grad_norm = max_grad_norm
        self.use_clipped_value_loss = use_clipped_value_loss

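The initializer takes a large set of PPO hyperparameters:
- actor_critic: the Actor-Critic network that PPO requires (defined in modules/actor_critic.py; we will cover it in a later installment)
- num_learning_epochs=1: how many passes are made over each batch of rollout data
- num_mini_batches=1: how many mini-batches the rollout data is split into for each pass
- clip_param=0.2: PPO's core ε parameter, which bounds how far the policy ratio is rewarded for moving
- gamma=0.998: reward discount factor, which controls the weight of long-term rewards
- lam=0.95: the GAE λ parameter, which trades off bias against variance in the advantage estimate
- value_loss_coef=1.0: weight of the value loss in the total loss; the higher it is, the more the update focuses on the value network
- entropy_coef=0.0: weight of the policy-entropy bonus, which encourages the policy to stay somewhat random and so acts as an extra exploration incentive
- learning_rate=1e-3: learning rate of the neural network
- max_grad_norm=1.0: gradient clipping threshold; when the total gradient norm exceeds this value the gradients are scaled down, which guards against exploding gradients
- use_clipped_value_loss=True: whether to use value clipping, which keeps the Critic from changing too much in one update
- schedule="fixed": the learning rate stays fixed during training and is not adjusted from the KL divergence; update() switches to KL-based adjustment when this is 'adaptive'
- desired_kl=0.01: target KL divergence, meaning the KL distance between the new and old policies should be around 0.01; it controls the update size under the adaptive learning-rate schedule
- device='cpu': the device to run on
A few member variables are also defined: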

self.storage = None  # initialized later
self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=learning_rate)
self.transition = RolloutStorage.Transition()

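self.storage: placeholder for the rollout buffer, created later by init_storage()
self.optimizer: the Adam optimizer that updates the Actor-Critic network parameters
self.transition: a temporary data structure (step buffer) holding the data of the current step

2-3 Rollout buffer initialization function `init_storage()`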

def init_storage(self, num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape):
    self.storage = RolloutStorage(num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape, self.device)

This function creates the rollout buffer that stores the collected data.
RolloutStorage is defined in storage/rollout_storage.py, which we will also analyze later.

2-4 Mode functions

def test_mode(self):
    self.actor_critic.test()

def train_mode(self):
    self.actor_critic.train()

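Both functions switch the mode of actor_critic. Note that torch.nn.Module provides train() and eval() but no test() method, so test_mode() as written only works if the ActorCritic class defines test() itself.
The network is defined in modules/actor_critic.py, which we will also analyze later.

2-5 Action function `act()`

What this function does: given the current state, compute and return an action while recording the training data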

def act(self, obs, critic_obs):
    if self.actor_critic.is_recurrent:
        self.transition.hidden_states = self.actor_critic.get_hidden_states()
    # Compute the actions and values
    self.transition.actions = self.actor_critic.act(obs).detach()
    self.transition.values = self.actor_critic.evaluate(critic_obs).detach()
    self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()
    self.transition.action_mean = self.actor_critic.action_mean.detach()
    self.transition.action_sigma = self.actor_critic.action_std.detach()
    # need to record obs and critic_obs before env.step()
    self.transition.observations = obs
    self.transition.critic_observations = critic_obs
    return self.transition.actions

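Let us go step by step, starting with the function's inputs

def act(self, obs, critic_obs):

obs is the input to the policy network
critic_obs is the input to the value network

if self.actor_critic.is_recurrent:
    self.transition.hidden_states = self.actor_critic.get_hidden_states()

This checks whether an RNN / LSTM network is in use. If so, the hidden_state has to be saved; otherwise the sequence state cannot be restored during training.

self.transition.actions = self.actor_critic.act(obs).detach()
self.transition.values = self.actor_critic.evaluate(critic_obs).detach()

The Actor network computes the action from the policy-network input; .detach() means the result takes no part in gradient computation, because this step only samples.
The Critic network computes the value; .detach() likewise keeps it out of the gradient computation.

self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()

This computes the log-probability of the action, log π_θ(a|s), which is used later when computing the probability ratio

self.transition.action_mean = self.actor_critic.action_mean.detach()
self.transition.action_sigma = self.actor_critic.action_std.detach()

This saves the parameters of the policy distribution for the KL divergence computation. Actions are typically drawn from a Gaussian distribution: a ~ N(μ, σ^2)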
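
For reference, the log-probability of an action under a diagonal Gaussian policy can be written out by hand. The sketch below is a plain-Python illustration of that formula, not the code inside `get_actions_log_prob`; the function name and the sample numbers are assumptions.

```python
import math


def diag_gaussian_log_prob(action, mean, std):
    """log N(a; mu, diag(sigma^2)), summed over the action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


action = [0.3, -0.1]

# Log-probability stored at sampling time (old policy).
logp_old = diag_gaussian_log_prob(action, mean=[0.0, 0.0], std=[1.0, 1.0])

# Log-probability after the policy mean has moved toward the action.
logp_new = diag_gaussian_log_prob(action, mean=[0.2, -0.1], std=[1.0, 1.0])

# The probability ratio used by PPO is exp(logp_new - logp_old).
ratio = math.exp(logp_new - logp_old)
print(logp_old, logp_new, ratio)
```

Because the new mean is closer to the sampled action, the ratio comes out above 1, which is exactly the quantity the clipping term later constrains.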

# need to record obs and critic_obs before env.step()
self.transition.observations = obs
self.transition.critic_observations = critic_obs
return self.transition.actions

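Save the observations and return the action. They have to be saved before env.step(), because afterwards the state has already changed

2-6 Environment feedback function `process_env_step()`

This function processes the results returned by the environment and stores the data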

def process_env_step(self, rewards, dones, infos):
    self.transition.rewards = rewards.clone()
    self.transition.dones = dones
    # Bootstrapping on time outs
    if 'time_outs' in infos:
        self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)
    # Record the transition
    self.storage.add_transitions(self.transition)
    self.transition.clear()
    self.actor_critic.reset(dones)

self.transition.rewards = rewards.clone()
self.transition.dones = dones

Save the reward r_t together with the termination signal

# Bootstrapping on time outs
if 'time_outs' in infos:
    self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)

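Some episodes end not because of a failure but because the maximum number of steps was reached, so the future value V(s) must not be treated as 0.
In that case the reward is corrected to r = r + γV(s). Note that the code uses the value stored in act(), i.e. the estimate for the observation before the step, as a stand-in for the value of the state that follows.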
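
The effect of the time-out correction is easy to see on two parallel environments, one that timed out and one that did not. This is a plain-Python sketch with made-up numbers; `bootstrap_timeouts` is a hypothetical helper that mirrors the tensor expression above element by element.

```python
def bootstrap_timeouts(rewards, values, time_outs, gamma):
    """Add gamma * V to the reward of every env whose episode hit the time limit."""
    return [
        r + gamma * v * (1.0 if timed_out else 0.0)
        for r, v, timed_out in zip(rewards, values, time_outs)
    ]


rewards = [1.0, 1.0]        # same raw reward in both environments
values = [10.0, 10.0]       # critic estimate recorded in act()
time_outs = [True, False]   # only env 0 ended because of the step limit

corrected = bootstrap_timeouts(rewards, values, time_outs, gamma=0.99)
print(corrected)  # env 0 gets the discounted value added, env 1 is unchanged
```

Without this correction, the done flag would cut off the bootstrap term for env 0 and the critic would learn that states near the time limit are worth less than they really are.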

# Record the transition
self.storage.add_transitions(self.transition)
self.transition.clear()
self.actor_critic.reset(dones)

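Store the data in the rollout storage; each step holds (s, a, r, V, log_prob)
Then clear transition and reset the RNN state for the environments flagged in dones

2-7 Return computation function `compute_returns`

Computes the returns and the advantage function that PPO training needs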

def compute_returns(self, last_critic_obs):
    last_values = self.actor_critic.evaluate(last_critic_obs).detach()
    self.storage.compute_returns(last_values, self.gamma, self.lam)

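The first line computes the value of the last state, V(s_T)
The second line is the GAE computation, which takes a weighted sum of multi-step TD errors to obtain a more stable advantage estimate.
Return: R_t = r_t + γR_{t+1} (this recursion is the λ = 1 case; with GAE the value target is commonly taken as A_t + V(s_t))
Advantage (GAE): δ_t = r_t + γV(s_{t+1}) - V(s_t), A_t = δ_t + γλδ_{t+1} + (γλ)^2δ_{t+2} + ...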
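
In practice the GAE sum is evaluated with a single backward pass over the rollout, and episode boundaries cut the recursion. The sketch below shows that common pattern in plain Python for one environment. It is an illustration under stated assumptions, not a copy of storage/rollout_storage.py (which is covered in a later installment); `compute_gae` and its arguments are hypothetical names.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    A done flag at step t zeroes both the bootstrap value and the carried
    advantage, so nothing leaks across an episode boundary.
    Returns (advantages, returns) with returns[t] = A_t + V(s_t).
    """
    T = len(rewards)
    advantages = [0.0] * T
    returns = [0.0] * T
    advantage = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_terminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + not_terminal * gamma * next_value - values[t]
        advantage = delta + not_terminal * gamma * lam * advantage
        advantages[t] = advantage
        returns[t] = advantage + values[t]
    return advantages, returns


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]

adv, ret = compute_gae(rewards, values, [False, False, False], last_value=0.0, gamma=0.9, lam=1.0)
adv_cut, _ = compute_gae(rewards, values, [True, False, False], last_value=0.0, gamma=0.9, lam=1.0)
print(adv[0], ret[0], adv_cut[0])
```

In the second call the episode ends at step 0, so its advantage reduces to r_0 - V(s_0) and the later rewards no longer contribute.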

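3 The core function update

3-1 Full implementation

The code so far handles data collection and advantage computation, while the core PPO training logic all lives in update(). This function:
- recomputes the policy probabilities
- computes the PPO loss
- backpropagates and updates the network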

def update(self):
    mean_value_loss = 0
    mean_surrogate_loss = 0
    if self.actor_critic.is_recurrent:
        generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
    else:
        generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
    for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
        old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:
        self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
        actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)
        value_batch = self.actor_critic.evaluate(critic_obs_batch, masks=masks_batch, hidden_states=hid_states_batch[1])
        mu_batch = self.actor_critic.action_mean
        sigma_batch = self.actor_critic.action_std
        entropy_batch = self.actor_critic.entropy
        # KL
        if self.desired_kl != None and self.schedule == 'adaptive':
            with torch.inference_mode():
                kl = torch.sum(
                    torch.log(sigma_batch / old_sigma_batch + 1.e-5) +
                    (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5,
                    axis=-1)
                kl_mean = torch.mean(kl)
                if kl_mean > self.desired_kl * 2.0:
                    self.learning_rate = max(1e-5, self.learning_rate / 1.5)
                elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
                    self.learning_rate = min(1e-2, self.learning_rate * 1.5)
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = self.learning_rate
        # Surrogate loss
        ratio = torch.exp(actions_log_prob_batch - torch.squeeze(old_actions_log_prob_batch))
        surrogate = -torch.squeeze(advantages_batch) * ratio
        surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
        surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()
        # Value function loss
        if self.use_clipped_value_loss:
            value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
            value_losses = (value_batch - returns_batch).pow(2)
            value_losses_clipped = (value_clipped - returns_batch).pow(2)
            value_loss = torch.max(value_losses, value_losses_clipped).mean()
        else:
            value_loss = (returns_batch - value_batch).pow(2).mean()
        loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()
        # Gradient step
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
        self.optimizer.step()
        mean_value_loss += value_loss.item()
        mean_surrogate_loss += surrogate_loss.item()
    num_updates = self.num_learning_epochs * self.num_mini_batches
    mean_value_loss /= num_updates
    mean_surrogate_loss /= num_updates
    self.storage.clear()
    return mean_value_loss, mean_surrogate_loss

3-2 Parameter definitions

Let us go step by step:

mean_value_loss = 0
mean_surrogate_loss = 0
if self.actor_critic.is_recurrent:
    generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
else:
    generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)

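mean_value_loss and mean_surrogate_loss accumulate the losses so that their averages over all updates in this call can be logged.
Then, depending on whether an RNN / LSTM network is in use, we build the mini-batch iterator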
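
To give a feel for what such an iterator yields, here is a minimal index-only sketch: for every epoch it shuffles the sample indices and cuts them into equal mini-batches. The name `mini_batch_indices`, the per-epoch reshuffle, and the seed argument are assumptions made for illustration; the actual generator in storage/rollout_storage.py may differ in these details.

```python
import random


def mini_batch_indices(batch_size, num_mini_batches, num_epochs, seed=0):
    """Yield num_epochs * num_mini_batches lists of sample indices."""
    rng = random.Random(seed)
    mini_batch_size = batch_size // num_mini_batches
    for _ in range(num_epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)  # assumption: a fresh permutation every epoch
        for i in range(num_mini_batches):
            yield indices[i * mini_batch_size:(i + 1) * mini_batch_size]


# 8 samples (e.g. 4 envs x 2 steps), 4 mini-batches, 2 epochs -> 8 gradient updates.
batches = list(mini_batch_indices(batch_size=8, num_mini_batches=4, num_epochs=2))
print(len(batches), batches[0])
```

The total number of yielded batches equals num_learning_epochs * num_mini_batches, which is the same num_updates that update() uses at the end to average the losses.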

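3-3 Each batch

for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
    old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:

In each batch we unpack these variables:
- obs_batch: Actor network input
- critic_obs_batch: Critic network input
- actions_batch: sampled actions
- target_values_batch: old value estimates V(s) recorded during the rollout
- advantages_batch: GAE advantages
- returns_batch: value targets
- old_actions_log_prob_batch: log-probability under the old policy, log π_{θ_old}(a|s)
- old_mu_batch: old policy mean μ
- old_sigma_batch: old policy standard deviation σ
- hid_states_batch, masks_batch: hidden states and masks passed to the networks, used by the recurrent (RNN / LSTM) policy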

self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])

First the act function is called for the Actor forward pass, which recomputes the action distribution of the current policy, π_θ(a|s) = N(μ_θ(s), σ_θ(s)^2)

actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)

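Then we obtain the log-probability of the stored actions under the current policy, log π_θ(a_t|s_t), which is used shortly to compute the probability ratio

value_batch = self.actor_critic.evaluate(critic_obs_batch, masks=masks_batch, hidden_states=hid_states_batch[1])

The Critic computes the value function estimate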

mu_batch = self.actor_critic.action_mean
sigma_batch = self.actor_critic.action_std
entropy_batch = self.actor_critic.entropy

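mu_batch: policy mean μ
sigma_batch: policy standard deviation σ
entropy_batch: policy entropy

3-4 KL divergence control

KL divergence control does one thing: it adapts the learning rate, lowering it when the KL is too large and raising it when the KL is too small.
This is a simple approximation of a trust region, used to keep overly large policy updates from destabilizing training.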

# KL
if self.desired_kl != None and self.schedule == 'adaptive':
    with torch.inference_mode():
        kl = torch.sum(
            torch.log(sigma_batch / old_sigma_batch + 1.e-5) +
            (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5,
            axis=-1)
        kl_mean = torch.mean(kl)
        if kl_mean > self.desired_kl * 2.0:
            self.learning_rate = max(1e-5, self.learning_rate / 1.5)
        elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
            self.learning_rate = min(1e-2, self.learning_rate * 1.5)
    for param_group in self.optimizer.param_groups:
        param_group['lr'] = self.learning_rate

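Here kl corresponds to the KL formula for Gaussian distributions, summed over the action dimensions: KL(π_old || π_new) = log(σ/σ_old) + (σ_old^2 + (μ_old - μ)^2)/(2σ^2) - 1/2. The code adds 1e-5 inside the logarithm for numerical stability.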
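
The formula can be sanity-checked with a few hand-picked inputs. The sketch below implements it in plain Python, leaving out the 1e-5 stabilizer that the tensor code uses; `gaussian_kl` is a made-up name for this illustration.

```python
import math


def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL(N(mu_old, sigma_old^2) || N(mu_new, sigma_new^2)), summed over action dims."""
    return sum(
        math.log(sn / so) + (so ** 2 + (mo - mn) ** 2) / (2.0 * sn ** 2) - 0.5
        for mo, so, mn, sn in zip(mu_old, sigma_old, mu_new, sigma_new)
    )


same = gaussian_kl([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])
shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])   # mean moved by one sigma
widened = gaussian_kl([0.0], [1.0], [0.0], [2.0])   # std doubled

print(same, shifted, widened)
```

Identical distributions give zero, and a mean shift of one standard deviation with unit σ gives exactly 0.5, which matches the (μ_old - μ)^2 / (2σ^2) term.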

if kl_mean > self.desired_kl * 2.0:
    self.learning_rate = max(1e-5, self.learning_rate / 1.5)
elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
    self.learning_rate = min(1e-2, self.learning_rate * 1.5)
for param_group in self.optimizer.param_groups:
    param_group['lr'] = self.learning_rate

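Adapt the learning rate and write it into the optimizer

KL	Meaning	Action
kl_mean > 2 × desired_kl	update too aggressive	lr = max(1e-5, lr / 1.5)
0 < kl_mean < desired_kl / 2	update too slow	lr = min(1e-2, lr × 1.5)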
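
The schedule can be isolated into a small pure function, which makes the thresholds and bounds easy to test. This is a sketch that restates the branch logic from update(); the function name and keyword arguments are hypothetical, while the constants are the ones in the code above.

```python
def adapt_learning_rate(lr, kl_mean, desired_kl, lr_min=1e-5, lr_max=1e-2, factor=1.5):
    """Shrink lr when the KL is more than 2x the target, grow it when below half."""
    if kl_mean > desired_kl * 2.0:
        return max(lr_min, lr / factor)
    if 0.0 < kl_mean < desired_kl / 2.0:
        return min(lr_max, lr * factor)
    return lr  # KL within [desired_kl / 2, 2 * desired_kl]: keep lr unchanged


lr = 1e-3
lr_after_big_kl = adapt_learning_rate(lr, kl_mean=0.05, desired_kl=0.01)
lr_after_small_kl = adapt_learning_rate(lr, kl_mean=0.001, desired_kl=0.01)
lr_in_band = adapt_learning_rate(lr, kl_mean=0.01, desired_kl=0.01)
print(lr_after_big_kl, lr_after_small_kl, lr_in_band)
```

The band between half and twice the target acts as a dead zone, so the learning rate does not oscillate on every mini-batch when the KL is already close to desired_kl.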

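3-5 PPO core: the probability ratio

ratio = torch.exp(actions_log_prob_batch - torch.squeeze(old_actions_log_prob_batch))

This is PPO's core formula: r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), computed as the exponential of the difference of the two log-probabilities.
It expresses:
- The ratio of the probabilities that the new and old policies assign to a given action.
- If r ≈ 1, the new and old policies are about the same
- If r >> 1 or r << 1, the policy has changed too much

3-6 Another PPO core: clipping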

surrogate = -torch.squeeze(advantages_batch) * ratio
surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()

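This corresponds to L^{CLIP} = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]
The first line is the unclipped policy-gradient term, i.e. L^{PG} = E[r_t(θ)A_t] in the formula
With this objective, PPO removes the incentive to move r(θ) outside [1-ε, 1+ε]: beyond that range the clipped term is selected and contributes no gradient.
Note: the minus sign is there because the optimizer minimizes the loss. Since both terms are negated, the min in the formula becomes torch.max in the code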
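
The sign flip is worth verifying once: the mean of max(-A·r, -A·clip(r)) should equal the negative of the mean of min(A·r, A·clip(r)). The sketch below checks this on a three-sample mini-batch in plain Python; the function names and numbers are made up for illustration.

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))


def surrogate_loss(ratios, advantages, clip_param=0.2):
    """Loss form used in the code: mean of max(-A * r, -A * clip(r))."""
    losses = [
        max(-a * r, -a * clamp(r, 1.0 - clip_param, 1.0 + clip_param))
        for r, a in zip(ratios, advantages)
    ]
    return sum(losses) / len(losses)


def clip_objective(ratios, advantages, clip_param=0.2):
    """Objective form from the formula: mean of min(A * r, A * clip(r))."""
    terms = [
        min(a * r, a * clamp(r, 1.0 - clip_param, 1.0 + clip_param))
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


ratios = [1.5, 0.5, 1.0]
advantages = [2.0, -1.0, 0.3]

loss = surrogate_loss(ratios, advantages)
objective = clip_objective(ratios, advantages)
print(loss, objective)  # the two values differ only in sign
```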

3-7 Loss function

# Value function loss
if self.use_clipped_value_loss:
    value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
    value_losses = (value_batch - returns_batch).pow(2)
    value_losses_clipped = (value_clipped - returns_batch).pow(2)
    value_loss = torch.max(value_losses, value_losses_clipped).mean()
else:
    value_loss = (returns_batch - value_batch).pow(2).mean()
loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()

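This computes the full loss L = L_policy + c_1 L_value - c_2 H(π)
where:
- self.value_loss_coef * value_loss: value network loss
- surrogate_loss: policy network loss
- self.entropy_coef * entropy_batch.mean(): policy entropy term (it is subtracted, so higher entropy lowers the loss)
The value_loss is computed in one of two ways, depending on whether PPO value clipping is used

Plain value loss: L_V = (V_θ(s) - R_t)^2
PPO value clip: V^{clip} = V_old + clip(V_θ - V_old, -ε, ε), L_V = max((V_θ - R)^2, (V^{clip} - R)^2)

Using PPO value clipping keeps the critic from changing too much in a single update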
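
A small numeric case shows how the clipped variant behaves. In the sketch below (plain Python, hypothetical function name, made-up numbers) the new value prediction already matches the return, yet the loss stays positive because the prediction moved further from the old value than clip_param allows.

```python
def clipped_value_loss(values, old_values, returns, clip_param=0.2):
    """Mean of max((V - R)^2, (V_clip - R)^2) with V_clip = V_old + clip(V - V_old)."""
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        v_clipped = v_old + max(-clip_param, min(v - v_old, clip_param))
        losses.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return sum(losses) / len(losses)


# Prediction jumped from 0.0 to 1.0 and hits the return exactly.
# Unclipped loss would be 0, but V_clip = 0.2, so the loss is (0.2 - 1.0)^2.
big_jump = clipped_value_loss(values=[1.0], old_values=[0.0], returns=[1.0])

# Prediction moved by 0.1 only: clipping is inactive and both terms agree.
small_step = clipped_value_loss(values=[0.1], old_values=[0.0], returns=[1.0])

print(big_jump, small_step)
```

Taking the max of the two squared errors means the clipped term can only increase the loss, so it acts as a pessimistic bound in the same spirit as the policy clip.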

3-8 Gradient step

# Gradient step
self.optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
self.optimizer.step()
mean_value_loss += value_loss.item()
mean_surrogate_loss += surrogate_loss.item()

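What remains is
- Clear the gradients
- Backpropagate
- Gradient clipping: rescale the gradients so that their total norm ||g|| does not exceed max_grad_norm
- Update the parameters
- Record the loss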
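
Norm-based clipping rescales the whole gradient vector instead of cutting individual entries, so the update direction is preserved. The sketch below shows the idea on a flat list of numbers in plain Python; it illustrates the technique and is not the PyTorch implementation, and the small eps in the denominator is an assumption made here to avoid division by zero.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Norm 5.0 exceeds max_norm = 1.0: every component is scaled by about 1/5.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Norm 0.5 is already below the limit: gradients pass through unchanged.
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, norm_after, untouched)
```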

3-9 Wrapping up after the loop

Compute the average losses and clear the rollout buffer

num_updates = self.num_learning_epochs * self.num_mini_batches
mean_value_loss /= num_updates
mean_surrogate_loss /= num_updates
self.storage.clear()

3-10 PPO training loop

PPO training loop:

Collect rollouts
Compute advantages (GAE)
Multiple epochs of mini-batch updates
Clip the policy
Clip the value
Entropy regularization

Loss being minimized: L = -E[min(r_t A_t, clip(r_t, 1-ε, 1+ε)A_t)] + c_1 E[(V-R)^2] - c_2 E[H(π)]

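Summary

This installment gave a full walkthrough of the Python implementation of PPO in the rsl_rl repository: from hyperparameter initialization, the rollout buffer, action sampling, and environment feedback handling to advantage computation and the policy update. The core mechanisms are probability-ratio clipping (clip), GAE advantage estimation, value function clipping, gradient-norm clipping against exploding gradients, and an optional KL-driven adaptive learning rate. Combining the policy loss, the value loss, and the policy entropy gives the complete optimization objective, which enables stable and efficient reinforcement learning training for legged robots.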
