
PPO

1. Complete Pipeline

prompt batch -> actor.forward -> reward model -> critic.forward
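
A high-level sketch of this data-collection pipeline (PyTorch-style pseudocode; the object names `prompt_loader`, `actor`, `reward_model`, `critic`, and `buffer` are illustrative assumptions, not a specific library API):

```python
# One PPO rollout / data-collection step (illustrative pseudocode).
for prompts in prompt_loader:                             # prompt batch
    responses, old_logprobs = actor.generate(prompts)     # actor.forward: sample responses, keep log pi_old
    rewards = reward_model.score(prompts, responses)      # reward model: one scalar per response
    values = critic.value(prompts, responses)             # critic.forward: per-token value estimates
    buffer.add(prompts, responses, old_logprobs, rewards, values)
```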

2. Algorithm Implementation

  1. Importance sampling: reuse a single sampled batch for multiple gradient updates. This maximizes sample efficiency while correcting for the divergence between the old policy and the new policy.
  2. Clipping (most common) / KL constraint: limits the shift between the old and new policies to prevent gradient explosion or collapse (see the update-loop sketch after this list).
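
A minimal sketch of the resulting update loop (PyTorch-style pseudocode; `rollout_buffer`, `actor`, `optimizer`, and the hyperparameter names are illustrative assumptions, not from the original text). The clipped loss itself, `clipped_surrogate`, is defined in the Formulas section below.

```python
# Reuse one sampled rollout batch for several epochs of updates (importance sampling);
# the clipped objective keeps the new policy close to the old one.
for epoch in range(ppo_epochs):
    for batch in rollout_buffer.minibatches(minibatch_size):
        new_logprobs = actor.log_prob(batch.states, batch.actions)   # log pi_theta(a_t|s_t)
        loss = clipped_surrogate(new_logprobs, batch.old_logprobs, batch.advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```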

1. Formulas

  1. Probability ratio: $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$

$\pi$: policy
$\theta$: parameters
$a$: action
$s$: state
$t$: time step

Personal interpretation: the relative difference in decision-making between the new and old policies given the same state.

  2. Clipped objective

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]$$

$L$: loss function
$\mathbb{E}_t$: expected value of the importance-sampling results at time step $t$
$\hat{A}_t$: advantage estimate at the current time step
$\epsilon$: clipping coefficient

Personal interpretation: clipping keeps the update step within a bounded range, neither too large nor too small, ensuring training stability.
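
A self-contained sketch of this objective as a loss function (assuming PyTorch; the function name and the default $\epsilon = 0.2$ are illustrative, not from the original text):

```python
import torch

def clipped_surrogate(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO objective L^CLIP, returned negated so it can be minimized."""
    ratio = torch.exp(new_logprobs - old_logprobs)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # clip(r_t, 1-eps, 1+eps) * A_t
    return -torch.min(unclipped, clipped).mean()                 # E_t[min(...)], negated
```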

  3. Advantage function
  • GAE (Generalized Advantage Estimation)

$$A^{GAE}(a_t,s_t) = \sum^{\infty}_{l=0}(\gamma\lambda)^l\delta_{t+l}$$

$$\delta_t = r_t+\gamma V(s_{t+1})-V(s_t)$$

$\gamma$: discount factor
$r$: reward
$l$: number of delayed steps
$\lambda$: controls the bias-variance tradeoff of TD
$\lambda = 1$: equivalent to the Monte Carlo return; retains every time step's TD error
$\lambda = 0$: equivalent to single-step TD
$0 < \lambda < 1$: retains every time step's TD error, but with a different weight at each step
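
A minimal PyTorch sketch of GAE over a single finite trajectory, using the recursive form $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$ (the function name and the default $\gamma$, $\lambda$ values are illustrative assumptions):

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory.

    rewards: shape (T,); values: shape (T + 1,), where the extra entry is the
    bootstrap value V(s_T) of the state after the last step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # accumulates (gamma*lambda)^l * delta_{t+l}
        advantages[t] = gae
    return advantages
```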

  4. Value function regression

$$L^{value} = \frac{1}{2}\left(V_{\theta}(s_t)-\hat{R}_t\right)^2$$

$V_{\theta}(s_t)$: value function at state $s_t$ at time $t$, approximated by an MLP.
$\hat{R}_t$: return target (for example, $\hat{A}_t + V(s_t)$).

  5. Entropy regularization (encourages exploration)

$$H(\pi_{\theta})=\mathbb{E}_t\left[-\sum_{a}\pi_{\theta}(a|s_t)\log\pi_\theta(a|s_t)\right]$$

Note: This is simply computing the entropy and taking its mean.
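
This note translates directly into code; a minimal PyTorch sketch (the function name is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def policy_entropy(logits):
    """Mean entropy of the policy distribution over all (batch, time) positions."""
    logprobs = F.log_softmax(logits, dim=-1)        # log pi_theta(a|s_t)
    probs = logprobs.exp()
    return -(probs * logprobs).sum(dim=-1).mean()   # -sum_a pi log pi, averaged
```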

  6. Total loss

$$L^{PPO} = L^{CLIP}(\theta)-c_1 L^{value}+c_2 H(\pi_{\theta})$$

$c_1, c_2$: weighting coefficients for the value loss and the entropy bonus.
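
Putting the pieces together, a self-contained sketch of the total loss to minimize (assuming PyTorch; the negation reflects that $L^{PPO}$ above is an objective to maximize, and the defaults $c_1 = 0.5$, $c_2 = 0.01$ are common but illustrative values, not from the original text):

```python
import torch

def ppo_total_loss(new_logprobs, old_logprobs, advantages, values, returns, entropy,
                   eps=0.2, c1=0.5, c2=0.01):
    """Total PPO loss: -(L^CLIP - c1 * L^value + c2 * H)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    policy_obj = torch.min(ratio * advantages,
                           torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()  # L^CLIP
    value_loss = 0.5 * (values - returns).pow(2).mean()                               # L^value
    return -(policy_obj - c1 * value_loss + c2 * entropy)
```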

2. Detailed Walkthrough

1. Initialization Phase

1. Prompt batch

A batch of input data.

2. actor.forward

  • Uses the backbone network to generate a response for each data point, and saves the per-token log-probabilities (computed from the logits) step by step as $\log\pi_{old}$.
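
A minimal sketch of that bookkeeping step (assuming PyTorch; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def gather_logprobs(logits, token_ids):
    """Per-token log pi_old for the sampled tokens.

    logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)
    """
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)   # (batch, seq_len)
```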

3. Reward

  • Reward model

Purpose: scores each output to obtain a reward value; the reward is used in the formulas above to construct the loss function. The reward value $r$ is stored in a buffer for use in subsequent training.

Architecture: an MLP

Position: attached after the last layer of the backbone network

Advantages: strong generalization ability

Disadvantages: requires large amounts of labeled data, poor interpretability, poor stability, computationally expensive

  • Reward function

Uses concrete rules to assign rewards, such as edit distance, repetition rate, etc.

Advantages: computationally simple and fast, strong interpretability, stable

Disadvantages: poor generalization — only effective in specific scenarios
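
A toy example of such a rule-based reward, combining a string-similarity term (standing in for edit distance) with a repetition-rate penalty; the specific rules and weights are illustrative assumptions, not from the original text:

```python
import difflib

def rule_based_reward(response: str, solution: str) -> float:
    """Toy rule-based reward: similarity to the reference answer minus a repetition penalty."""
    # Similarity to the reference solution (difflib's ratio used in place of edit distance).
    similarity = difflib.SequenceMatcher(None, response, solution).ratio()

    # Repetition rate: fraction of repeated words in the response.
    words = response.split()
    repetition = 1.0 - len(set(words)) / max(len(words), 1)

    return similarity - 0.5 * repetition
```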

4. critic.forward

Purpose: approximates the value function of the backbone network, computes the advantage function, and stores it in a buffer for use in subsequent training.

Architecture: an MLP

Position: attached after the last layer of the backbone network
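
A sketch of how the reward and critic heads described above might be attached to the backbone's last hidden states (assuming PyTorch; the layer sizes and activations are illustrative assumptions). The reward head produces one scalar per sequence, while the critic head produces one value estimate per token position:

```python
import torch.nn as nn

class RewardHead(nn.Module):
    """MLP attached after the backbone's last layer; scores the whole response."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, last_hidden_state):                        # (batch, seq_len, hidden)
        # Use the hidden state of the final token: one reward per sequence.
        return self.mlp(last_hidden_state[:, -1]).squeeze(-1)    # (batch,)

class ValueHead(nn.Module):
    """MLP attached after the backbone's last layer; approximates V(s_t) per token."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, last_hidden_state):                        # (batch, seq_len, hidden)
        return self.mlp(last_hidden_state).squeeze(-1)           # (batch, seq_len)
```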

2. Training Phase

Repeatedly train the model using data stored in the buffer; re-sample after several rounds.

3. Network Architecture

graph TD
    Inputs --> b[Backbone network]
    b -->|optional| reward
    b --> critic

4. Dataset Construction

  • reward function

{
  "role": "user",
  "content": "<full context, including special tokens to distinguish roles>"
}

solution: <model answer (a positive example or a set of positive examples)>, used by the reward function

Important Note

If there are any errors or unclear explanations, please reach out for corrections. WeChat: m1197501753

