Training parameters
| Flag | Default | Suggested range | When to change |
|---|---|---|---|
| --epochs | 1 | 1 – 10 (whole numbers only) | Add 1–2 more passes if the reward still climbs steadily near the end of training. Too many epochs risk over-fitting. |
| --learning-rate | 1e-4 | 1e-5 – 5e-4 | Decrease when the reward spikes then collapses; increase when the curve plateaus too early. |
| --lora-rank | 8 | 4 – 32 (powers of 2) | Higher ranks may improve training quality but often require more data and iterations to train. |
| --max-context-length | 8192 tokens | Up to the model limit | Raise only when your prompts truncate; remember that attention compute grows quadratically with sequence length. |
Example usage
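A minimal sketch of a training invocation. The `rft train` entry point is a placeholder (substitute your actual command); only the flags documented above are used, with values chosen purely for illustration.

```bash
# Hypothetical entry point; the flag names are the ones documented above.
# Two epochs because the reward was still climbing at the end of the first pass;
# learning rate halved after an earlier spike-then-collapse; rank 16 for extra
# capacity on a larger dataset.
rft train \
  --epochs 2 \
  --learning-rate 5e-5 \
  --lora-rank 16 \
  --max-context-length 8192
```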
Rollout (sampling) parameters
During each training step, the model generates multiple responses with stochastic decoding. These parameters control that generation process.

| Field | CLI flag | Default | Recommended range | Why it matters |
|---|---|---|---|---|
| Maximum tokens | --max-tokens | 2,048 | 16 – 16,384 | Longer responses improve reward on summarisation / story tasks but add cost. |
| Temperature | --temperature | 0.7 | 0.1 – 2.0 (> 0 only) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5–1.0 is a sweet spot for RLHF. |
| Top-p | --top-p | 1.0 | 0 – 1 | Lower to 0.2–0.5 to clamp long-tail tokens when the reward penalises hallucinations. |
| Top-k | --top-k | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep ≤50 for latency. |
| n (choices) | --n | 4 | 2 – 8 | Policy optimization needs multiple candidates per prompt to compute a meaningful KL term; ≥ 2 is mandatory. |
| Extra body JSON | --inference-extra-body | empty | valid JSON | Pass extra OpenAI-style params (e.g., stop, logit_bias). Invalid JSON is rejected. |
Example usage
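A minimal sketch of the rollout settings, again assuming the placeholder `rft train` entry point from above. The JSON passed to --inference-extra-body must be valid; the stop sequence shown is illustrative only.

```bash
# Hypothetical entry point; rollout flags as documented above.
# Four candidates per prompt, moderate temperature, nucleus sampling tightened
# slightly, and an OpenAI-style extra parameter passed as JSON.
rft train \
  --max-tokens 1024 \
  --temperature 0.7 \
  --top-p 0.9 \
  --top-k 40 \
  --n 4 \
  --inference-extra-body '{"stop": ["###"]}'
```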
Quick reference by goal
| Goal | Parameters to adjust |
|---|---|
| Faster convergence | ↑ epochs; raise learning-rate, staying under 2× the default |
| Safer / less toxic | ↓ temperature, top-p, top-k |
| More creative | temperature ≈ 1 – 1.2, top-p 0.9 |
| Cheaper roll-outs | ↓ n, max-tokens, batch size |
| Higher capacity | ↑ lora-rank |
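For example, the "Cheaper roll-outs" row might translate into overrides like the following (same placeholder entry point; values are illustrative):

```bash
# Fewer candidates per prompt and shorter completions cut rollout cost;
# all other flags keep their defaults.
rft train \
  --n 2 \
  --max-tokens 512
```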
Important constraints
Temperature must be > 0
Greedy sampling (temperature 0) is deterministic and cannot produce multiple distinct rollouts, which RFT requires. It can also lead to mode-dropping and repetitive text.

At least 2 rollouts required
Policy optimization needs multiple candidates per prompt to compute a meaningful KL divergence term. Setting --inference-n 1 will fail.
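To illustrate both constraints with the same placeholder entry point used above: the first invocation below violates them and should be rejected, while the second is a valid configuration.

```bash
# Rejected: greedy decoding (temperature 0) plus a single rollout.
rft train --temperature 0 --n 1

# Accepted: stochastic decoding with multiple candidates per prompt.
rft train --temperature 0.7 --n 4
```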