Background and Motivation
This article describes a hands-on deployment of the GSPO algorithm, built on the VeRL framework, on Ascend NPUs, and briefly reviews the GRPO algorithm and how GSPO differs from it.
Algorithm Principles
Paper links: GRPO: https://arxiv.org/abs/2402.03300 GSPO: https://arxiv.org/abs/2507.18071
This article covers the deployment practice of the GSPO algorithm from the VeRL framework on Ascend NPUs (Atlas 800T A2). It compares the principles of GRPO and GSPO, pointing out that GSPO resolves the stability problems of MoE model training through sequence-level importance sampling. It provides the debugging environment configuration, debugging scripts and result analysis for Qwen2.5-3B and Qwen3-30B-A3B, covers verification of the vLLM backend and the engineworker feature, and summarizes common errors and their solutions.
The core innovation of Group Relative Policy Optimization (GRPO) is eliminating the dependence on a computationally expensive value model. The algorithm uses a group-based generation and relative-evaluation mechanism: for a given input prompt, the system generates a group of G different responses, and a reward model then scores every response in the group. From the mean and standard deviation of the scores within the group, the algorithm computes a relative advantage $\hat{A}_i$ for each response: responses above the group average receive a positive advantage, those below it a negative one. This design markedly reduces the memory footprint and computational complexity of RL training, making large-scale model training more efficient and feasible. However, its underlying formulation has a key design flaw that causes serious stability problems when training large models.
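The group-relative mechanism above can be sketched in a few lines. This is an illustrative sketch with made-up reward values, not verl's advantage-estimator implementation:

```python
import math

# Illustrative rewards from a reward model for G = 8 responses to one prompt
# (values are made up for this example).
rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]

mean = sum(rewards) / len(rewards)
std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))

# Group-relative advantage: responses above the group mean get a positive
# advantage, responses below it a negative one.
advantages = [(r - mean) / (std + 1e-6) for r in rewards]
print(advantages)
```

Every response in the group is judged only relative to its siblings, which is what removes the need for a learned value model.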
The root of the problem is a granularity mismatch between reward assignment and optimization updates: the advantage $\hat{A}_i$ is computed over the complete sequence, while GRPO's optimization updates are applied per token. To apply the sequence-level reward to each token, GRPO introduces token-level importance weights $w_{i,t}(\theta)$, where $w_{i,t}(\theta)$ is the ratio of the current policy's probability of generating a token to the old policy's probability of generating the same token.

The importance weights of individual tokens within a sequence can differ sharply, which makes the learning signal noisy and inconsistent. As training sequences grow longer, this noise accumulates and can destabilize the entire training run, ultimately collapsing the model. The problem is especially severe for sparse Mixture-of-Experts (MoE) models: because the set of activated experts can change after each model update, the off-policy discrepancy becomes larger. GRPO's token-level importance sampling can therefore introduce even greater bias and crash training.
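The mismatch is easy to see numerically. Below is a minimal sketch with made-up per-token log-probability differences (not the verl code), contrasting GRPO's per-token weights with GSPO's single length-normalized sequence ratio:

```python
import math

# Hypothetical per-token log-prob differences (log pi_theta - log pi_theta_old)
# for one 8-token response; values are invented for illustration.
log_diffs = [0.5, -0.6, 0.8, -0.2, 0.9, -0.7, 0.4, -0.3]

# GRPO: one importance weight per token -> widely spread learning signals.
token_ratios = [math.exp(d) for d in log_diffs]

# GSPO: a single length-normalized sequence ratio
# s_i = exp( (1/|y_i|) * sum_t log_diff_t )
seq_ratio = math.exp(sum(log_diffs) / len(log_diffs))

print(min(token_ratios), max(token_ratios))  # wide spread of token-level weights
print(seq_ratio)                             # one stable sequence-level weight
```

Even though the per-token ratios here span roughly a factor of five, the length-normalized sequence ratio stays close to 1, which is exactly the stability property GSPO relies on.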
The root cause of the above is that the optimization granularity and the reward granularity are not in the same unit. GSPO therefore proposes sequence-level importance sampling and introduces a length normalization (the $1/|y_i|$ exponent) so that the ratio stays numerically stable whether a sequence is 10 tokens or 1000 tokens long. The algorithm replaces the noisy token-level weights with a stable sequence-level importance ratio $s_i(\theta)$; all tokens within a given sequence receive exactly the same update, determined by $s_i(\theta)\hat{A}_i$. Inconsistent token-level feedback is eliminated and replaced with a unified update mechanism based on the full-sequence reward.

For multi-turn scenarios, GSPO provides a token-level variant for finer-grained application.
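Concretely, in the paper's notation (also mirrored in the verl code comments below), the token-level variant keeps the sequence-level magnitude while restoring a per-token gradient path via the stop-gradient operator $\mathrm{sg}[\cdot]$:

$$
s_{i,t}(\theta) = \mathrm{sg}\!\big[s_i(\theta)\big]\cdot\frac{\pi_\theta\!\left(y_{i,t}\mid x,\,y_{i,<t}\right)}{\mathrm{sg}\!\big[\pi_\theta\!\left(y_{i,t}\mid x,\,y_{i,<t}\right)\big]}
$$

Numerically $s_{i,t}(\theta)$ equals $s_i(\theta)$, but its gradient flows only through the current token's log-probability.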

The paper's experiments and analysis show that when training GRPO on MoE models, the off-policy discrepancy worsens because the activated experts can change after each model update, and this fluctuation in expert activation severely hurts convergence. With Routing Replay (fixing the activated experts to be identical between sampling and training), the training reward does rise normally, but it introduces extra memory and communication overhead and limits the effective MoE capacity. GSPO instead optimizes with a sequence-level clip, is more tolerant of precision differences, fundamentally resolves the expert-activation fluctuation problem in MoE models, simplifies and stabilizes training, and achieves better results.
VeRL's overall GRPO flow is shown in the figure below. GSPO is similar; the differences lie in the importance-sampling computation and the loss computation modules.

Analyzing the verl code, the importance-sampling computation lives in verl/trainer/ppo/core_algos.py:
@register_policy_loss("gspo")
def compute_policy_loss_gspo(
    old_log_prob: torch.Tensor,
    log_prob: torch.Tensor,
    advantages: torch.Tensor,
    response_mask: torch.Tensor,
    loss_agg_mode: str = "seq-mean-token-mean",
    config: Optional[ActorConfig] = None,
    rollout_is_weights: torch.Tensor | None = None,
) -> tuple[torch.Tensor, dict[str, Any]]:
    """Compute the clipped policy objective and related metrics for GSPO.

    See https://arxiv.org/pdf/2507.18071 for more details.

    Args:
        old_log_prob (torch.Tensor): Log-probabilities of actions under the old policy, shape (batch_size, response_length).
        log_prob (torch.Tensor): Log-probabilities of actions under the current policy, shape (batch_size, response_length).
        advantages (torch.Tensor): Advantage estimates for each action, shape (batch_size, response_length).
        response_mask (torch.Tensor): Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
        loss_agg_mode (str, optional): Aggregation mode for `agg_loss`. For GSPO, it is recommended to use "seq-mean-token-mean".
    """
    assert config is not None
    assert isinstance(config, ActorConfig)
    clip_ratio_low = config.clip_ratio_low if config.clip_ratio_low is not None else config.clip_ratio
    clip_ratio_high = config.clip_ratio_high if config.clip_ratio_high is not None else config.clip_ratio

    negative_approx_kl = log_prob - old_log_prob

    # compute sequence-level importance ratio:
    # si(θ) = (π_θ(yi|x)/π_θold(yi|x))^(1/|yi|) =
    # exp [(1/|y_i|) * Σ_t log(π_θ(y_i,t|x,y_i,<t)/π_θold(y_i,t|x,y_i,<t))]
    seq_lengths = torch.sum(response_mask, dim=-1).clamp(min=1)
    negative_approx_kl_seq = torch.sum(negative_approx_kl * response_mask, dim=-1) / seq_lengths

    # Combined ratio at token level:
    # s_i,t(θ) = sg[s_i(θ)] · π_θ(y_i,t|x, y_i,<t) / sg[π_θ(y_i,t|x, y_i,<t)]
    # In log space: log(s_i,t(θ)) = sg[log(s_i(θ))] + log_prob - sg[log_prob]
    log_seq_importance_ratio = log_prob - log_prob.detach() + negative_approx_kl_seq.detach().unsqueeze(-1)
    log_seq_importance_ratio = torch.clamp(log_seq_importance_ratio, max=10.0)  # clamp for numerical stability

    # finally exp() to remove log
    seq_importance_ratio = torch.exp(log_seq_importance_ratio)

    pg_losses1 = -advantages * seq_importance_ratio
    pg_losses2 = -advantages * torch.clamp(seq_importance_ratio, 1 - clip_ratio_low, 1 + clip_ratio_high)
    pg_losses = torch.maximum(pg_losses1, pg_losses2)

    # Apply rollout correction weights if provided
    if rollout_is_weights is not None:
        pg_losses = pg_losses * rollout_is_weights

    # for GSPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
    pg_loss = agg_loss(
        loss_mat=pg_losses,
        loss_mask=response_mask,
        loss_agg_mode="seq-mean-token-mean",
        **config.global_batch_info,
    )

    # For compatibility, return zero for pg_clipfrac_lower (not used in standard GSPO)
    pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)
    pg_clipfrac_lower = torch.tensor(0.0, device=pg_loss.device)
    ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

    pg_metrics = {
        "actor/pg_clipfrac": pg_clipfrac.detach().item(),
        "actor/ppo_kl": ppo_kl.detach().item(),
        "actor/pg_clipfrac_lower": pg_clipfrac_lower.detach().item(),
    }
    return pg_loss, pg_metrics
GRPO computes the importance ratio at the token level, while GSPO computes it at the sequence level; the corresponding code differences are shown in the figure below.
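Beyond the ratio itself, both algorithms apply the same PPO-style pessimistic clipped objective; only the quantity being clipped differs. A simplified, torch-free sketch of that objective, using the paper-recommended GSPO clip band (clip_ratio_low=3e-4, clip_ratio_high=4e-4, Sec. 5.1):

```python
# PPO-style clipped policy-gradient loss for one sequence, using GSPO's
# very narrow clip band. Illustrative sketch, not the verl implementation.
def clipped_pg_loss(ratio, advantage, eps_low=3e-4, eps_high=4e-4):
    unclipped = -advantage * ratio
    clipped = -advantage * min(max(ratio, 1 - eps_low), 1 + eps_high)
    # pessimistic bound, as in PPO: take the larger (worse) loss
    return max(unclipped, clipped)

# A ratio just outside the band gets clipped at 1 + eps_high;
# a ratio inside the band passes through unchanged.
print(clipped_pg_loss(1.01, 1.0))
print(clipped_pg_loss(1.0002, 1.0))
```

Because GSPO's ratio is a sequence-level geometric mean, it naturally stays near 1, which is why such a narrow clip band is workable at all.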

Background: GRPO is already supported on NPU
Inspecting the code diff between GRPO and GSPO shows that it involves only ordinary torch code; no other components and no NPU adaptation work are needed.
Debugging goal
Configure the GSPO parameters, launch training, and check that the reward curve rises normally.
Based on the GRPO script, enable the parameters:
loss_mode=gspo
loss_agg_mode="seq-mean-token-mean"
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
Debug case 1: Qwen2.5-3B
Debug the dense model on the FSDP backend, following the official script, and check whether the reward curve rises.
Debug case 2: Qwen3-30B-A3B
Debug the MoE model on the Megatron backend and compare the reward curves of GRPO and GSPO.
| Item | Version |
|---|---|
| AI server | Atlas 800T A2 64G |
| Driver / firmware | 24.1.0.3 |
| Python | 3.10.12 |
| CANN | 8.2.RC2 |
| torch | 2.7.1 |
| torch_npu | 2.7.1 |
| transformers | 4.53.3 |
| vllm | 0.10.0 |
| vllm-ascend | 0.10.0rc1 |
| verl | 0.7.0.dev0 |
| Megatron-core | 0.12.1 |
| MindSpeed | 2.2.0_core_r0.12.1 |
| Mbridge | 0.13.1 |
set -x
pkill -9 python
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
ray stop --force
pkill -9 python
pkill -9 torchrun
ps -ef | grep "defunct" | grep python | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep "defunct" | grep torchrun | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep -i python | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep -i torchrun | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
# Set how many GPUs we actually have on this node.
export GPUS_PER_NODE=8
export NNODES=1
echo "Using $NNODES nodes for training..."
# ------------------------------------- Setup xp params ---------------------------------------
project_name='RL-GSPO'
adv_estimator=grpo
loss_mode=gspo
loss_agg_mode="seq-mean-token-mean"
MODEL_PATH=XX/Qwen25-3B-Instruct
offload=false # it's a small model, offloading will just slow-down training
rollout_engine=vllm
rollout_mode=sync # can be async to speedup large scale
gpu_memory_utilization=0.6
reward_manager=dapo
adv_estimator=grpo
shuffle_dataset=true
first_time_dataset_prep=true # prepare dataset
test_freq=10
save_freq=10
total_epochs=10
total_training_steps=500
val_before_train=false
use_kl_in_reward=false
kl_coef=0.0
use_kl_loss=false
kl_loss_coef=0.0
clip_ratio_low=0.0003 # as recommended by the paper, see Sec. 5.1
clip_ratio_high=0.0004 # as recommended by the paper, see Sec. 5.1
train_batch_size=512
ppo_mini_batch_size=128 # maintain 4 mini-batches as recommended by the paper, see Sec. 5.1
ppo_micro_batch_size_per_gpu=8 # setup depending on your GPU memory
n_resp_per_prompt=16
max_prompt_length=$((1024*2))
max_response_length=$((1024*8))
# dapo reward manager params
enable_overlong_buffer=false # true
overlong_buffer_len=$((1024*4))
overlong_penalty_factor=1.0
# Paths and namings
SFT_MODEL=$(basename $MODEL_PATH)
exp_name="${loss_mode}-epslow-${clip_ratio_low}-epshigh-${clip_ratio_high}-${SFT_MODEL}-RL"
CKPTS_DIR=/rl/checkpoints/experimental/4b/${loss_mode}/${exp_name}
# Sampling params at rollout
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameters
sp_size=4
use_dynamic_bsz=true
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
offload=true
gen_tp=2
entropy_checkpointing=true # This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.
gsm8k_train_path=xx/gsm8k/post_data/gsm8k/train.parquet
gsm8k_test_path=xx/gsm8k/post_data/gsm8k/test.parquet # set the path
train_files="['$gsm8k_train_path']"
test_files="['$gsm8k_test_path']"
#! NOTE: change filter_overlong_prompts to false if needed
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=${adv_estimator} \
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
data.train_files="${train_files}" \
data.val_files="${test_files}" \
data.shuffle=$shuffle_dataset \
data.prompt_key=prompt \
data.truncation='error' \
data.filter_overlong_prompts=true \
data.train_batch_size=${train_batch_size} \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.model.use_remove_padding=true \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.name=${rollout_engine} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.enable_gradient_checkpointing=true \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=true \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=true \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
reward_model.reward_manager=${reward_manager} \
+reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
+reward_model.reward_kwargs.max_resp_len=${max_response_length} \
trainer.logger='["console"]' \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.actor.use_torch_compile=False \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.val_before_train=${val_before_train} \
trainer.test_freq=${test_freq} \
trainer.save_freq=${save_freq} \
trainer.total_epochs=${total_epochs} \
trainer.total_training_steps=${total_training_steps} \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
trainer.device=npu \
trainer.log_val_generations=2 \
"$@"


Conclusion: the curve rises normally with a consistent upward trend, matching the expected result.
project_name='DAPO'
exp_name='GSPO-Qwen3-30B-A3B-4nodes'
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0
clip_ratio_low=3e-4
clip_ratio_high=4e-4
max_prompt_length=$((1024*2))
max_response_length=$((1024*8))
enable_overlong_buffer=True
overlong_buffer_len=$((1024*4))
overlong_penalty_factor=1.0
loss_agg_mode="token-mean"
loss_mode=gspo
train_prompt_bsz=32
n_resp_per_prompt=2
train_prompt_mini_bsz=4
NNODES=${NNODES:-2}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
MODEL_PATH=xx/weight/Qwen3_30B/Qwen3_30B
CKPTS_DIR=$DATA_ROOT/checkpoint/${project_name}/${exp_name}
TRAIN_FILE=xx/rl_data/dapo-math/dapo-math-17k.parquet
aime24_test_path=xx/rl_data/dapo-math/dapo-math-17k.parquet
TEST_FILE="['$aime24_test_path']"
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
use_dynamic_bsz=True
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
offload=True
# gen
rollout_name=vllm # vllm or sglang
gen_tp=4
gen_dp=8
# train
train_tp=4
train_pp=4
EP=8
ETP=1
RUNTIME_ENV=verl/trainer/mc2_env.yaml
cd /opt/verl
ray job submit --runtime-env="${RUNTIME_ENV}" \
-- python3 -m verl.trainer.main_ppo \
--config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.return_raw_chat=True \
data.truncation='left' \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.optim.clip_grad=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.megatron.param_offload=${offload} \
actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
actor_rollout_ref.actor.megatron.grad_offload=${offload} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.rollout.name=${rollout_name} \
actor_rollout_ref.rollout.mode=sync \
actor_rollout_ref.rollout.calculate_log_probs=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.data_parallel_size=${gen_dp} \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.ref.megatron.param_offload=${offload} \
actor_rollout_ref.actor.megatron.use_mbridge=True \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
reward_model.reward_manager=dapo \
+reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
+reward_model.reward_kwargs.max_resp_len=${max_response_length} \
trainer.logger="console" \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}-tp${gen_tp}-ep${EP}" \
trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
trainer.nnodes="${NNODES}" \
trainer.val_before_train=False \
trainer.test_freq=-1 \
trainer.save_freq=-1 \
trainer.total_epochs=10 \
trainer.total_training_steps=300 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
trainer.log_val_generations=10 \
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True \
trainer.device="npu" \
"$@"

Conclusion: GSPO shows better training results on the MoE model.
Debug the server-mode vLLM backend and the engineworker feature on the latest verl.
| Item | Version | Notes |
|---|---|---|
| CANN | 8.3.RC1 | |
| torch | 2.7.1 | |
| torch_npu | 2.7.1 | |
| transformers | 4.57.3 | |
| vllm | v0.11.0 | |
| vllm-ascend | v0.11.0rc1 | |
| verl | commit:5a2e0b1c272b33 | code as of Oct 10 |
#!/usr/bin/env bash
set -x
pkill -9 python
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
ray stop --force
pkill -9 python
pkill -9 torchrun
ps -ef | grep "defunct" | grep python | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep "defunct" | grep torchrun | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep -i python | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep -i torchrun | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
# Set how many GPUs we actually have on this node.
export GPUS_PER_NODE=8
export NNODES=1
export VLLM_ASCEND_ENABLE_NZ=0
echo "Using $NNODES nodes for training..."
#export ASCEND_LAUNCH_BLOCKING=1
# ------------------------------------- Setup xp params ---------------------------------------
project_name='RL-GSPO'
adv_estimator=grpo
loss_mode=gspo
loss_agg_mode="seq-mean-token-mean"
MODEL_PATH=xx/weights/Qwen2.5-3B-Instruct
offload=false # it's a small model, offloading will just slow-down training
rollout_engine=vllm
rollout_mode=async
return_raw_chat="True"
if [ "$rollout_engine" = "vllm" ]; then
export VLLM_USE_V1=1
fi
gpu_memory_utilization=0.6
reward_manager=dapo
adv_estimator=grpo
shuffle_dataset=true
first_time_dataset_prep=true # prepare dataset
test_freq=10
save_freq=10
total_epochs=10
total_training_steps=500
val_before_train=false
use_kl_in_reward=false
kl_coef=0.0
use_kl_loss=false
kl_loss_coef=0.0
clip_ratio_low=0.0003 # as recommended by the paper, see Sec. 5.1
clip_ratio_high=0.0004 # as recommended by the paper, see Sec. 5.1
train_batch_size=512
ppo_mini_batch_size=128 # maintain 4 mini-batches as recommended by the paper, see Sec. 5.1
ppo_micro_batch_size_per_gpu=8 # setup depending on your GPU memory
n_resp_per_prompt=16
max_prompt_length=$((1024*2))
max_response_length=$((1024*8))
# dapo reward manager params
enable_overlong_buffer=false # true
overlong_buffer_len=$((1024*4))
overlong_penalty_factor=1.0
# Paths and namings
SFT_MODEL=$(basename $MODEL_PATH)
exp_name="${loss_mode}-epslow-${clip_ratio_low}-epshigh-${clip_ratio_high}-${SFT_MODEL}-RL"
CKPTS_DIR=/rl/checkpoints/experimental/4b/${loss_mode}/${exp_name}
# Sampling params at rollout
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameters
sp_size=4
use_dynamic_bsz=true
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
offload=true
gen_tp=2
entropy_checkpointing=true # This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.
# ------------------------------------- train/val data preparation ---------------------------------------
# if [ "$first_time_dataset_prep" = true ]; then
#     echo "Preprocessing GSM8K dataset..."
#     python examples/data_preprocess/gsm8k.py --local_save_dir /data01/huawei-2025/rl_data/gsm8k/data_later --local_dataset_path /data01/huawei-2025/rl_data/gsm8k/
# fi
gsm8k_train_path=xx/data/post_gsm8k/train.parquet
gsm8k_test_path=xx/data/post_gsm8k/test.parquet  # set the path
train_files="['$gsm8k_train_path']"
test_files="['$gsm8k_test_path']"
#! NOTE: change filter_overlong_prompts to false if needed
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=${adv_estimator} \
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
data.train_files="${train_files}" \
data.val_files="${test_files}" \
data.shuffle=$shuffle_dataset \
data.prompt_key=prompt \
data.truncation='error' \
data.filter_overlong_prompts=true \
data.return_raw_chat=${return_raw_chat} \
data.train_batch_size=${train_batch_size} \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.model.use_remove_padding=true \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.name=${rollout_engine} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.enable_gradient_checkpointing=true \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=true \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=true \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
reward_model.reward_manager=${reward_manager} \
+reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
+reward_model.reward_kwargs.max_resp_len=${max_response_length} \
trainer.logger='["console"]' \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.actor.use_torch_compile=False \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.val_before_train=${val_before_train} \
trainer.test_freq=${test_freq} \
trainer.save_freq=-1 \
trainer.total_epochs=${total_epochs} \
trainer.total_training_steps=${total_training_steps} \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
trainer.device=npu \
trainer.log_val_generations=2 \
"$@"

Conclusion: the curve rises normally, the trend matches the official script, and the debugging result meets the requirements.
To verify the engineworker feature, add the following parameter on top of the script above:
trainer.use_legacy_worker_impl=disable \
#!/usr/bin/env bash
set -x
pkill -9 python
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
ray stop --force
pkill -9 python
pkill -9 torchrun
ps -ef | grep "defunct" | grep python | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep "defunct" | grep torchrun | awk '{print $3}' | xargs -t -i kill -9 {}
ps -ef | grep -i python | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep -i torchrun | grep -i [name] | grep -v grep | awk '{print $2}' | xargs -t -I {} kill -9 {}
ps -ef | grep "python" | grep -v grep | awk '{print $2}' | xargs -t -i kill -9 {}
# Set how many GPUs we actually have on this node.
export GPUS_PER_NODE=8
export NNODES=1
export VLLM_ASCEND_ENABLE_NZ=0
echo "Using $NNODES nodes for training..."
#export ASCEND_LAUNCH_BLOCKING=1
# ------------------------------------- Setup xp params ---------------------------------------
project_name='RL-GSPO'
adv_estimator=grpo
loss_mode=gspo
loss_agg_mode="seq-mean-token-mean"
MODEL_PATH=xx/weights/Qwen2.5-3B-Instruct
offload=false # it's a small model, offloading will just slow-down training
rollout_engine=vllm
rollout_mode=async
return_raw_chat="True"
if [ "$rollout_engine" = "vllm" ]; then
export VLLM_USE_V1=1
fi
gpu_memory_utilization=0.6
reward_manager=dapo
adv_estimator=grpo
shuffle_dataset=true
first_time_dataset_prep=true # prepare dataset
test_freq=10
save_freq=10
total_epochs=10
total_training_steps=500
val_before_train=false
use_kl_in_reward=false
kl_coef=0.0
use_kl_loss=false
kl_loss_coef=0.0
clip_ratio_low=0.0003 # as recommended by the paper, see Sec. 5.1
clip_ratio_high=0.0004 # as recommended by the paper, see Sec. 5.1
train_batch_size=512
ppo_mini_batch_size=128 # maintain 4 mini-batches as recommended by the paper, see Sec. 5.1
ppo_micro_batch_size_per_gpu=8 # setup depending on your GPU memory
n_resp_per_prompt=16
max_prompt_length=$((1024*2))
max_response_length=$((1024*8))
# dapo reward manager params
enable_overlong_buffer=false # true
overlong_buffer_len=$((1024*4))
overlong_penalty_factor=1.0
# Paths and namings
SFT_MODEL=$(basename $MODEL_PATH)
exp_name="${loss_mode}-epslow-${clip_ratio_low}-epshigh-${clip_ratio_high}-${SFT_MODEL}-RL"
CKPTS_DIR=/rl/checkpoints/experimental/4b/${loss_mode}/${exp_name}
# Sampling params at rollout
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameters
sp_size=4
use_dynamic_bsz=true
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length)*1))
offload=true
gen_tp=2
entropy_checkpointing=true # This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.
# ------------------------------------- train/val data preparation ---------------------------------------
# if [ "$first_time_dataset_prep" = true ]; then
#     echo "Preprocessing GSM8K dataset..."
#     python examples/data_preprocess/gsm8k.py --local_save_dir /xx/rl_data/gsm8k/data_later --local_dataset_path xx/rl_data/gsm8k/
# fi
gsm8k_train_path=xx/data/post_gsm8k/train.parquet
gsm8k_test_path=xx/data/post_gsm8k/test.parquet  # set the path
train_files="['$gsm8k_train_path']"
test_files="['$gsm8k_test_path']"
#! NOTE: filter_overlong_prompts changed to false here
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=${adv_estimator} \
actor_rollout_ref.actor.policy_loss.loss_mode=${loss_mode} \
data.train_files="${train_files}" \
data.val_files="${test_files}" \
data.shuffle=$shuffle_dataset \
data.prompt_key=prompt \
data.truncation='error' \
data.filter_overlong_prompts=false \
data.return_raw_chat=${return_raw_chat} \
data.train_batch_size=${train_batch_size} \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.model.use_remove_padding=true \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.name=${rollout_engine} \
actor_rollout_ref.rollout.mode=${rollout_mode} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.enable_gradient_checkpointing=true \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=true \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=true \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
reward_model.reward_manager=${reward_manager} \
+reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
+reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
+reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
+reward_model.reward_kwargs.max_resp_len=${max_response_length} \
trainer.logger='["console"]' \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.actor.use_torch_compile=False \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.val_before_train=${val_before_train} \
trainer.test_freq=${test_freq} \
trainer.save_freq=-1 \
trainer.total_epochs=${total_epochs} \
trainer.total_training_steps=${total_training_steps} \
trainer.use_legacy_worker_impl=disable \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
trainer.device=npu \
trainer.log_val_generations=2 \
"$@"

Conclusion: the reward curve rises normally; the debugging result meets the requirements.
Diagnosis: the NPU patch for transformers was not enabled correctly. Modifying the code as shown below enables it; switching to transformers 4.52.4 or 4.57.3 also resolves the issue.

During the code investigation, an upstream verl bug was found (https://github.com/volcengine/verl/pull/3978), which has since been merged.
File "/opt/pyvenv/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
Diagnosis: execution entered graph (torch.compile) mode.


So add the parameter:
actor_rollout_ref.actor.use_torch_compile=False
After the fix, relaunch training.


Reduce the global batch size and adjust the gpu_memory_utilization value; this must be tuned to the sequence length and the model, otherwise fragmented tensors easily cause OOM.

Solution:
train_prompt_bsz=32
n_resp_per_prompt=8
train_prompt_mini_bsz=8

Change the training-phase strategy to use sequence parallelism:
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True \

Model parameter constraint
num_query_groups (4) must be divisible by the TP size, so switch the training parallelism strategy.
After the change, relaunch the Megatron backend.
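The constraint can be checked directly: Megatron shards attention across TP ranks, so the number of KV-head groups must be divisible by the TP size. An illustrative check (num_query_groups = 4 comes from the constraint above):

```python
# GQA/Megatron constraint: num_query_groups must be divisible by the
# tensor-parallel size, or KV heads cannot be sharded evenly across ranks.
num_query_groups = 4  # value reported for this model

for tp in (1, 2, 4, 8):
    print(tp, num_query_groups % tp == 0)
```

This is why train_tp=4 works for this model while a TP size of 8 would not.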

Cause: GSPO does not have the filter_groups attribute; this was an oversight when adapting the script, so just fix the script.

