Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
I. The GRPO Loss Function
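For reference, the GRPO objective as stated in the paper (notation lightly simplified here) maximizes a clipped surrogate over a group of G sampled completions, minus a KL penalty against a reference policy:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon\right) A_i\right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right)\right)\right]
$$

where the group-relative advantage for completion $i$ is computed from the rewards $r_1, \dots, r_G$:

$$
A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}
$$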
II. Core Components of the GRPO Algorithm
The GRPO algorithm decomposes into four key parts:
Policy loss: the per-token probability ratio between the model with the adapter enabled (current policy) and without it (old policy).
Advantages: computed from the reward function, by normalizing each completion's reward within its group of samples.
Ratio clipping (clip): bounds the probability ratio so that no single step produces an excessively large update.
KL divergence: keeps the model from drifting too far from the reference model during training.
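The four parts above can be combined into a single loss. The following is a minimal sketch (not the paper's official implementation) that takes summed per-completion log-probabilities under the current, old, and reference policies for one group of G completions; `grpo_loss`, `eps`, and `beta` are names chosen here for illustration:

```python
import torch

def grpo_loss(policy_logps, old_logps, ref_logps, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO loss for one group of G completions.

    policy_logps / old_logps / ref_logps: (G,) summed log-probs of each completion
    rewards: (G,) scalar rewards from the reward function
    """
    # Advantages: group-normalized rewards
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the current policy and the old (sampling) policy
    ratio = torch.exp(policy_logps - old_logps)

    # PPO-style clipped surrogate: the clip keeps any single step's update bounded
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Unbiased KL estimator r - log(r) - 1, with r = pi_ref / pi_theta
    log_r = ref_logps - policy_logps
    kl = torch.exp(log_r) - log_r - 1

    # We maximize (surrogate - beta * KL), so minimize its negative mean
    return -(surrogate - beta * kl).mean()
```

When the current policy equals both the old and the reference policy, the ratio is 1 and the KL term is 0, so the loss reduces to the (zero-mean) negative advantages and the gradient drives probability mass toward above-average completions.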
1. Model Loading and Initialization
First, load the required model and tokenizer, then print the model's architecture and a sample of its generated text.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Initialize the model and tokenizer
model_str = "babylm/babyllama-100m-2024"
base_model = AutoModelForCausalLM.from_pretrained(model_str)
tokenizer = AutoTokenizer.from_pretrained(model_str)

# Pad on the left so we can append new tokens on the right
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

print(base_model)

prompt = "The quick brown fox jumped over the "
input_ids = tokenizer(prompt, return_tensors="pt")
print(input_ids)

# Generate the next 2 tokens under torch.no_grad()
with torch.no_grad():
    outputs = base_model.generate(
        **input_ids,
        max_new_tokens=2,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode the generated text and separate the newly generated portion
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_portion = generated_text[len(prompt):]
print(f"Generated text: {prompt}{generated_portion}")
import copy
from peft import LoraConfig, get_peft_model

# Create a copy of the base model to use as the reference model
ref_model = copy.deepcopy(base_model)

# Initialize the LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights=False,  # random init, so the adapter is not a no-op at the start
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model
model = get_peft_model(base_model, lora_config)
print(model)
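During training, both `model` and `ref_model` must score the same completions token by token so their log-probabilities can feed the ratio and KL terms above. A small helper like the following (the name `per_token_logps` is chosen here for illustration) extracts those per-token log-probabilities from a causal LM's logits:

```python
import torch
import torch.nn.functional as F

def per_token_logps(logits, input_ids):
    """Log-probability of each realized token under a causal LM.

    logits: (batch, seq_len, vocab) from a forward pass
    input_ids: (batch, seq_len) token ids
    Returns (batch, seq_len - 1); position t scores the token at t + 1.
    """
    # Logits at position t predict the token at position t + 1
    logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    # Gather the log-prob assigned to the token that actually occurred
    return torch.gather(logp, 2, targets.unsqueeze(-1)).squeeze(-1)
```

In use, one would call it with `model(ids).logits` and `ref_model(ids).logits` on the same `ids`; the difference of the two outputs is exactly the `log_r` term the KL estimator needs.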