PPO
1. Complete Pipeline
prompt batch -> actor.forward -> reward model -> critic.forward
2. Algorithm Implementation
- Importance sampling: reuse each sampled batch for multiple gradient updates. This maximizes sample efficiency while correcting for the divergence between the old policy and the new policy.
- Clipping (most common) / KL constraint: Limits the shift between the old and new policies to prevent gradient explosion or collapse.
1. Formulas
- Probability ratio
  $$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
  - $\pi$: policy
  - $\theta$: parameters
  - $a_t$: action
  - $s_t$: state
  - $t$: time step
  - Personal interpretation: the relative difference in decision-making between the new and old policies given the same state.
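  A minimal sketch of the ratio in code; the log-prob tensors below are placeholder values, and in practice they come from the actor's forward passes under the old and current parameters.

  ```python
  import torch

  # Placeholder per-token log-probs; in practice these come from
  # actor.forward under the old and the current parameters.
  old_logp = torch.tensor([-1.0, -0.9, -2.0])
  new_logp = torch.tensor([-1.2, -0.8, -2.1])

  # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
  # computed in log space for numerical stability.
  ratio = torch.exp(new_logp - old_logp)
  ```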
- Clipped objective
  $$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
  - $L^{\text{CLIP}}(\theta)$: loss function
  - $\mathbb{E}_t$: expected value of the importance-sampling results at time step $t$
  - $\hat{A}_t$: advantage function at the current time step
  - $\epsilon$: clipping coefficient
  - Personal interpretation: clipping constrains the update step size (neither too large nor too small), ensuring training stability.
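  A minimal sketch of the clipped surrogate as a PyTorch loss; the sign is flipped so it can be minimized by gradient descent, and `eps` corresponds to the clipping coefficient $\epsilon$.

  ```python
  import torch

  def ppo_clip_loss(ratio, advantage, eps=0.2):
      """Negative clipped surrogate objective, suitable for minimization."""
      unclipped = ratio * advantage
      clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
      # min(.) keeps the more pessimistic surrogate, limiting the step size.
      return -torch.min(unclipped, clipped).mean()
  ```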
- Advantage function
  $$A_t = Q(s_t, a_t) - V(s_t)$$
- GAE (Generalized Advantage Estimation)
  $$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
  - $\gamma$: discount factor
  - $r_t$: reward
  - $l$: number of delayed steps
  - $\lambda$: controls the bias-variance tradeoff of TD
  - $\lambda = 1$: equivalent to the Monte Carlo return (retains every time step's TD residual)
  - $\lambda = 0$: equivalent to single-step TD
  - $0 < \lambda < 1$: retains every time step's TD residual, but with different weights for each
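  A minimal sketch of GAE computed backwards over a single trajectory; the tensor layout (a trailing bootstrap value $V(s_T)$ in `values`) is an assumption about the buffer, not a detail from the original notes.

  ```python
  import torch

  def compute_gae(rewards, values, gamma=0.99, lam=0.95):
      """GAE over one trajectory; `values` has one extra bootstrap entry V(s_T)."""
      advantages = torch.zeros_like(rewards)
      gae = 0.0
      for t in reversed(range(len(rewards))):
          # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
          delta = rewards[t] + gamma * values[t + 1] - values[t]
          # Recursive form of the (gamma * lambda)-weighted sum of residuals.
          gae = delta + gamma * lam * gae
          advantages[t] = gae
      return advantages
  ```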
- Value function regression
  $$L^{VF}(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right]$$
  - $V_\phi(s_t)$: value function at state $s_t$ at time $t$, approximated by an MLP
  - $\hat{R}_t$: the empirical return target
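  In code this is a plain MSE regression; the tensors below are placeholders for the critic's outputs and the return targets.

  ```python
  import torch
  import torch.nn.functional as F

  values = torch.randn(8)    # placeholder critic outputs V_phi(s_t)
  returns = torch.randn(8)   # placeholder targets, e.g. advantages + values
  value_loss = F.mse_loss(values, returns)
  ```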
- Entropy regularization (encourages exploration)
  $$S[\pi_\theta](s_t) = -\sum_{a} \pi_\theta(a \mid s_t)\,\log \pi_\theta(a \mid s_t)$$
  Note: this is simply computing the entropy of the policy distribution and taking its mean.
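  For a categorical policy this is one line with `torch.distributions`; the logits below are random placeholders.

  ```python
  import torch

  logits = torch.randn(8, 32000)  # placeholder (batch, vocab) policy logits
  dist = torch.distributions.Categorical(logits=logits)
  entropy_bonus = dist.entropy().mean()  # mean entropy over the batch
  ```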
- Total loss
  $$L_t(\theta) = \mathbb{E}_t\left[L_t^{\text{CLIP}}(\theta) - c_1\,L_t^{VF}(\theta) + c_2\,S[\pi_\theta](s_t)\right]$$
  - $c_1$, $c_2$: weighting coefficients for the value loss and the entropy bonus
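  Putting the pieces together as a single minimized loss (signs flipped relative to the maximized objective above); the coefficient defaults are common choices, not values from the original notes.

  ```python
  import torch
  import torch.nn.functional as F

  def ppo_total_loss(ratio, advantage, values, returns, entropy,
                     eps=0.2, c1=0.5, c2=0.01):
      """Total PPO loss to minimize: -L_CLIP + c1 * L_VF - c2 * entropy."""
      unclipped = ratio * advantage
      clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
      policy_loss = -torch.min(unclipped, clipped).mean()
      value_loss = F.mse_loss(values, returns)
      return policy_loss + c1 * value_loss - c2 * entropy.mean()
  ```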
2. Detailed Walkthrough
1. Initialization Phase
1. Prompt batch
A batch of input data.
2. actor.forward
- Uses the backbone network to generate a response for each data point, saving the logits at each token step; the resulting log-probabilities serve as $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ in the probability ratio above.
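A minimal sketch of this rollout step, assuming a Hugging Face-style causal LM as the actor; `model`, `tokenizer`, and `prompts` are placeholder names, not a fixed interface.

```python
import torch

@torch.no_grad()
def actor_rollout(model, tokenizer, prompts, max_new_tokens=128):
    """Generate one response per prompt and keep per-step logits."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        return_dict_in_generate=True,
        output_scores=True,  # one logits tensor per generated token step
    )
    # The log-probs of the sampled tokens under these logits act as
    # pi_theta_old when the PPO ratio is computed later.
    return out.sequences, out.scores
```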
3. Reward
- Reward model
Purpose: scores each output to obtain a reward value; the reward is used in the formulas above to construct the loss function. The reward value is stored in a buffer for use in subsequent training.
Architecture: an MLP
Position: attached after the last layer of the backbone network
Advantages: strong generalization ability
Disadvantages: requires large amounts of labeled data, poor interpretability, poor stability, computationally expensive
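A minimal sketch of such a scoring head; the hidden size and last-token pooling are assumptions, not details from the original notes.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """MLP reward head attached after the backbone's last layer."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden):  # (batch, seq_len, hidden_size)
        # Score the final token's hidden state: one scalar reward per sequence.
        return self.mlp(last_hidden[:, -1]).squeeze(-1)
```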
- Reward function
Uses concrete rules to assign rewards, such as edit distance, repetition rate, etc.
Advantages: computationally simple and fast, strong interpretability, stable
Disadvantages: poor generalization; only effective in specific scenarios
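The rule-based counterpart is a plain function; the repetition-rate rule below is just one illustrative choice.

```python
def rule_based_reward(response: str) -> float:
    """Illustrative rule: penalize token repetition in the response."""
    tokens = response.split()
    if not tokens:
        return 0.0
    repetition_rate = 1.0 - len(set(tokens)) / len(tokens)
    return -repetition_rate  # fewer repeated tokens -> higher reward
```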
4. critic.forward
Purpose: approximates the value function of the backbone network, computes the advantage function, and stores it in a buffer for use in subsequent training.
Architecture: an MLP
Position: attached after the last layer of the backbone network
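A minimal sketch of the critic head under the same assumptions as the reward head above; unlike the reward head, it emits one value estimate per token position.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Critic: MLP value head on the backbone's last hidden states."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states):  # (batch, seq_len, hidden_size)
        # One value estimate V(s_t) per token position.
        return self.mlp(hidden_states).squeeze(-1)
```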
2. Training Phase
Repeatedly train the model on the data stored in the buffer; re-sample after several rounds, as sketched below.
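A sketch of that loop. `collect_rollouts`, the buffer/minibatch interface, `actor.evaluate`, and the hyperparameter names are hypothetical placeholders; `ppo_total_loss` is the helper sketched in the formulas section.

```python
import torch

# Hypothetical driver loop: collect_rollouts, buffer.minibatches, and
# actor.evaluate are placeholder interfaces, not a fixed API.
for iteration in range(num_iterations):
    buffer = collect_rollouts(actor, critic, reward_fn, prompt_batch)  # re-sample
    for _ in range(ppo_epochs):                # reuse the same samples
        for batch in buffer.minibatches():
            # Re-evaluate the stored actions under the current policy.
            new_logp, entropy = actor.evaluate(batch.obs, batch.actions)
            ratio = torch.exp(new_logp - batch.old_logp)
            loss = ppo_total_loss(ratio, batch.advantages,
                                  critic(batch.obs), batch.returns, entropy)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```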
3. Network Architecture
graph TD
    Inputs --> b[Backbone network]
    b -- optional --> reward
    b --> critic

4. Dataset Construction
- reward function
  {
      role: user,
      content: <full context, including special tokens to distinguish roles>
  }
  solution: <model answer (a positive example or set of positive examples)>; used by the reward function

Important Note
If there are any errors or unclear explanations, please reach out for corrections. WeChat: m1197501753