## Quickstart

Set `dcp_save_interval` and `log_path`, then rerun with the same `log_path` to resume:
```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```
## Checkpoint kinds

Every checkpoint uses a `CheckpointKind`:

| Kind | What is saved | Resumable | Promotable |
|---|---|---|---|
| `STATE` | DCP (optimizer + weights) | Yes | No |
| `SAMPLER` | HF weights for inference | No | Yes |
| `BOTH` | DCP + HF weights | Yes | Yes |
- Mid-training saves (`dcp_save_interval`) use `STATE`, which is resumable but not promotable.
- The final checkpoint (end of training) always uses `BOTH`, which is resumable and promotable.
- To promote a mid-training checkpoint, call `save_checkpoint` explicitly with `kind=SAMPLER` or `kind=BOTH`.

`dcp_save_interval` defaults to 0 (off). Without setting it, training cannot be resumed from intermediate steps.
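As an illustration of the interval semantics described above (a hypothetical sketch, not the recipe's actual internals), the gating logic amounts to:

```python
# Hypothetical sketch of interval-based checkpoint gating.
# `should_save_state` is an illustrative name, not part of the recipe's API.
def should_save_state(step: int, dcp_save_interval: int) -> bool:
    """Return True when a mid-training STATE checkpoint is due."""
    if dcp_save_interval <= 0:  # 0 = off (the default)
        return False
    return step > 0 and step % dcp_save_interval == 0

# With dcp_save_interval=10, saves fire at steps 10, 20, 30, ...
due = [s for s in range(1, 31) if should_save_state(s, 10)]
print(due)  # [10, 20, 30]
```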
## checkpoints.jsonl

Checkpoint metadata is written to `{log_path}/checkpoints.jsonl`, one JSON line per save. The fields present depend on the kind:
{"name": "step-10", "step": 10, "data_consumed": 40, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
{"name": "step-50", "step": 50, "data_consumed": 200, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
| Field | Present in | Description |
|---|---|---|
| `state_path` | `STATE`, `BOTH` | Remote DCP reference for resume |
| `sampler_path` | `SAMPLER`, `BOTH` | Snapshot name for promotion |
| `source_job_id` | All | Trainer job that created this checkpoint |
| `base_model` | All | Base model (auto-detected by the promote script) |
`WeightSyncer.save_and_hotload()` saves HF weights to GCS for hotloading but does not write to `checkpoints.jsonl`. Those checkpoints exist remotely but are not tracked here.
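Because each line is plain JSON, the file is easy to inspect by hand. A minimal sketch, using two shortened entries in the format shown above:

```python
import json

# Two example lines in the checkpoints.jsonl format (paths shortened).
lines = [
    '{"name": "step-10", "step": 10, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc"}',
    '{"name": "step-50", "step": 50, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc"}',
]
entries = [json.loads(line) for line in lines]

# Resumable entries carry state_path; promotable entries carry sampler_path.
resumable = [e for e in entries if "state_path" in e]
promotable = [e for e in entries if "sampler_path" in e]
print([e["name"] for e in resumable])   # both entries
print([e["name"] for e in promotable])  # only step-50
```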
## Resume

### Automatic (same log_path)

Just rerun with the same `log_path`. The recipe reads `checkpoints.jsonl`, finds the last entry with a `state_path`, loads the DCP state, and continues from the saved step.
### From another job (init_from_checkpoint)

```python
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```

This loads weights from the specified job and resets the step counter to 0. It is mutually exclusive with automatic resume.
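The reference packs two identifiers into one colon-separated string, which splits cleanly into its parts:

```python
# init_from_checkpoint format: "job_id:checkpoint_name"
ref = "i44pvd4syzg8hjfk:step-4"
job_id, checkpoint_name = ref.split(":", 1)
print(job_id)           # i44pvd4syzg8hjfk
print(checkpoint_name)  # step-4
```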
## Promote

Only entries with a `sampler_path` can be promoted (`kind=SAMPLER` or `kind=BOTH`). The final checkpoint is always promotable; mid-training DCP saves are not.
```shell
export FIREWORKS_API_KEY=...

# Promote the latest promotable checkpoint:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 50
```
Without `checkpoints.jsonl`, use the SDK directly with the `source_job_id` and `sampler_path`:
```python
import os
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=os.environ["FIREWORKS_API_KEY"])
client.promote_checkpoint("job-abc", "step-50-a1b2c3d4", "my-fine-tuned-model")
```
See Saving and Loading — Promoting for full API details.
## Config fields

| Field | Type | Default | Description |
|---|---|---|---|
| `log_path` | `str` | (required) | Directory for `checkpoints.jsonl` and logs |
| `dcp_save_interval` | `int` | `0` | Save a DCP checkpoint every N steps. `0` = off. |
| `init_from_checkpoint` | `str \| None` | `None` | Load DCP state from another job (`"job-id:checkpoint-name"`). Step resets to 0. |