## Quickstart

Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:
```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```
## Checkpoint kinds

Every checkpoint uses a `CheckpointKind`:

| Kind | What is saved | Resumable | Promotable |
|---|---|---|---|
| `STATE` | DCP (optimizer + weights) | Yes | No |
| `SAMPLER` | HF weights for inference | No | Yes |
| `BOTH` | DCP + HF weights | Yes | Yes |
- Mid-training saves (`dcp_save_interval`) use `STATE`, which is resumable but not promotable.
- The final checkpoint (end of training) always uses `BOTH`, which is resumable and promotable.
- To promote a mid-training checkpoint, call `save_checkpoint` explicitly with `kind=SAMPLER` or `kind=BOTH`.

`dcp_save_interval` defaults to 0 (off). Without setting it, training cannot be resumed from intermediate steps.
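As an illustration of the interval semantics described above (a hypothetical sketch, not the recipe's actual internals), the gating logic amounts to:

```python
# Hypothetical sketch of interval-based checkpoint gating.
# `should_save_state` is an illustrative name, not part of the recipe's API.
def should_save_state(step: int, dcp_save_interval: int) -> bool:
    """Return True when a mid-training STATE checkpoint is due."""
    if dcp_save_interval <= 0:  # 0 = off (the default)
        return False
    return step > 0 and step % dcp_save_interval == 0

# With dcp_save_interval=10, saves fire at steps 10, 20, 30, ...
due = [s for s in range(1, 31) if should_save_state(s, 10)]
print(due)  # [10, 20, 30]
```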
## checkpoints.jsonl

Checkpoint metadata is written to `{log_path}/checkpoints.jsonl`, one JSON line per save. The fields present depend on the kind:
{"name": "step-10", "step": 10, "data_consumed": 40, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
{"name": "step-50", "step": 50, "data_consumed": 200, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
| Field | Present in | Description |
|---|---|---|
| `state_path` | `STATE`, `BOTH` | Remote DCP reference for resume |
| `sampler_path` | `SAMPLER`, `BOTH` | Snapshot name for promotion |
| `source_job_id` | All | Trainer job that created this checkpoint |
| `base_model` | All | Base model (auto-detected by the promote script) |
`WeightSyncer.save_and_hotload()` saves HF weights to GCS for hotloading but does not write to `checkpoints.jsonl`. Those checkpoints exist remotely but are not tracked here.
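Because each line is plain JSON, the file is easy to inspect by hand. A minimal sketch, using two shortened entries in the format shown above:

```python
import json

# Two example lines in the checkpoints.jsonl format (paths shortened).
lines = [
    '{"name": "step-10", "step": 10, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc"}',
    '{"name": "step-50", "step": 50, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc"}',
]
entries = [json.loads(line) for line in lines]

# Resumable entries carry state_path; promotable entries carry sampler_path.
resumable = [e for e in entries if "state_path" in e]
promotable = [e for e in entries if "sampler_path" in e]
print([e["name"] for e in resumable])   # both entries
print([e["name"] for e in promotable])  # only step-50
```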
## Resume

### Automatic (same log_path)

Just rerun with the same `log_path`. The recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, loads the DCP state, and continues from the saved step.
### From another job (init_from_checkpoint)

```python
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```

This loads weights from the specified job and resets the step counter to 0. It is mutually exclusive with automatic resume.
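The reference packs two identifiers into one colon-separated string, which splits cleanly into its parts:

```python
# init_from_checkpoint format: "job_id:checkpoint_name"
ref = "i44pvd4syzg8hjfk:step-4"
job_id, checkpoint_name = ref.split(":", 1)
print(job_id)           # i44pvd4syzg8hjfk
print(checkpoint_name)  # step-4
```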
## Promote

Only entries with a `sampler_path` can be promoted (`kind=SAMPLER` or `kind=BOTH`). The final checkpoint is always promotable; mid-training DCP saves are not.
```shell
export FIREWORKS_API_KEY=...

# Promote the latest promotable checkpoint:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 50
```
Without `checkpoints.jsonl`, use the SDK directly with the `source_job_id` and `sampler_path`:
```python
import os
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=os.environ["FIREWORKS_API_KEY"])
client.promote_checkpoint("job-abc", "step-50-a1b2c3d4", "my-fine-tuned-model")
```
See Saving and Loading — Promoting for full API details.
## Config fields

| Field | Type | Default | Description |
|---|---|---|---|
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |