Saving and Loading

What this is

During training, you save checkpoints for three purposes:

Weight sync (save_weights_for_sampler_ext): Push updated weights to a running inference deployment without restarting it.
Resuming (save_state / load_state_with_optimizer): Persist full training state (weights + optimizer) so you can continue training from where you left off.
Promotion (promote_checkpoint): Turn a saved sampler checkpoint into a deployable Fireworks model.

Sampler checkpoints

Base vs. delta

Type	What it saves	Size	Promotable
`"base"`	Full model weights	Large (~16 GB for 8B model)	Yes
`"delta"`	XOR diff from previous base	Small (~10x smaller)	No

Delta checkpoints are much faster to save and transfer, making per-step weight sync practical for on-policy training. The deployment applies: current_weights = base XOR delta.

LoRA exception: The server always saves full adapter weights regardless of checkpoint_type. Every LoRA sampler checkpoint is promotable.
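
The XOR relationship above can be illustrated with a small, self-contained sketch. It is not the server's implementation: the `xor_bytes` helper and the four-float "weight" buffers are hypothetical stand-ins, and the real on-disk format is not described here. The sketch only shows that XOR is its own inverse, so applying the delta to the base recovers the updated weights exactly.

```python
import struct


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Toy "weights": four float32 values packed little-endian (illustrative only).
base_weights = struct.pack("<4f", 0.10, -0.25, 0.50, 1.00)
new_weights = struct.pack("<4f", 0.10, -0.25, 0.51, 1.00)

# Trainer side: the delta is the XOR of the new weights with the base.
delta = xor_bytes(new_weights, base_weights)

# Deployment side: current_weights = base XOR delta.
restored = xor_bytes(base_weights, delta)
assert restored == new_weights
```

Values that did not change between the base and the new weights XOR to zero bytes, which is consistent with deltas being much smaller than a full base checkpoint once stored.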

Saving checkpoints

# First checkpoint — must be base (full weights)
result = training_client.save_weights_for_sampler_ext(
    "step-0001",
    checkpoint_type="base",
)
# result.snapshot_name is session-qualified (e.g. "step-0001-a1b2c3d4")

# Subsequent checkpoints — delta is faster
result = training_client.save_weights_for_sampler_ext(
    "step-0010",
    checkpoint_type="delta",
)

# With TTL (auto-delete after N seconds)
result = training_client.save_weights_for_sampler_ext(
    "temp-checkpoint",
    checkpoint_type="delta",
    ttl_seconds=3600,
)
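
One way to decide which checkpoint_type to pass on each save is sketched below. The `checkpoint_type_for` helper and its `base_every` parameter are hypothetical and not part of the SDK; the sketch assumes you want the first save to be a base checkpoint (as required above) and a fresh base at a fixed interval so that a promotable checkpoint is available periodically during full-parameter training.

```python
def checkpoint_type_for(save_index: int, base_every: int = 10) -> str:
    """Return "base" for the first save and every `base_every`-th save after it,
    and "delta" otherwise. Hypothetical helper, not an SDK function."""
    if save_index < 0:
        raise ValueError("save_index must be non-negative")
    if base_every < 1:
        raise ValueError("base_every must be at least 1")
    return "base" if save_index % base_every == 0 else "delta"


# Show the schedule for the first few saves.
schedule = [checkpoint_type_for(i, base_every=4) for i in range(6)]
print(schedule)
```

The returned string could then be passed as checkpoint_type to save_weights_for_sampler_ext. Whether a periodic base is worth its extra save time depends on how often you expect to promote.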

Promoting a checkpoint to a model

Promote a sampler checkpoint to a deployable Fireworks model. Available on both FireworksClient and TrainerJobManager. The trainer job does not need to be running — job_id only resolves where checkpoint files are stored. Only base sampler checkpoints are promotable for full-parameter training. For LoRA, all sampler checkpoints are promotable.

from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=api_key)
model = client.promote_checkpoint(
    job_id=endpoint.job_id,
    checkpoint_id=result.snapshot_name,
    output_model_id="my-fine-tuned-qwen3-8b",
)

Parameter	Type	Description
`job_id`	`str`	RLOR trainer job ID that produced the checkpoint
`checkpoint_id`	`str`	The `snapshot_name` from `save_weights_for_sampler_ext`
`output_model_id`	`str`	Desired model ID (1-63 chars, lowercase a-z, 0-9, hyphen only)
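
The output_model_id constraints in the table can be checked locally before calling promote_checkpoint. The sketch below is a minimal client-side check, assuming the rule is exactly as stated (1 to 63 characters drawn from lowercase a-z, 0-9, and hyphen); `is_valid_model_id` is a hypothetical helper, and the server may enforce additional rules that are not listed here.

```python
import re

# 1-63 characters, each a lowercase letter, a digit, or a hyphen.
_MODEL_ID_RE = re.compile(r"[a-z0-9-]{1,63}")


def is_valid_model_id(model_id: str) -> bool:
    """Return True if model_id satisfies the documented character and length rule."""
    return _MODEL_ID_RE.fullmatch(model_id) is not None


print(is_valid_model_id("my-fine-tuned-qwen3-8b"))
print(is_valid_model_id("My_Model"))
```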

Weight sync

Weight sync pushes a checkpoint onto a running inference deployment without restarting it. See WeightSyncer for the recommended lifecycle manager.

from fireworks.training.sdk import WeightSyncer

syncer = WeightSyncer(
    policy_client=training_client,
    deploy_mgr=deploy_mgr,
    deployment_id="my-deployment",
    base_model="accounts/fireworks/models/qwen3-8b",
    hotload_timeout=600,
    first_checkpoint_type="base",
)

# Automatically handles base (first) vs delta (subsequent)
syncer.save_and_hotload(f"step-{step:05d}")

save_and_hotload saves HF weights to remote storage and hotloads them, but does not write to the cookbook’s checkpoints.jsonl. To create a promotable checkpoint tracked in checkpoints.jsonl, use the cookbook’s save_checkpoint with kind=SAMPLER or kind=BOTH.

Train-state checkpoints

Use save_state to persist full training state, and one of two load methods to restore it:

Method	Weights	Optimizer state
`load_state_with_optimizer(path)`	Restored	Restored
`load_state(path)`	Restored	Reset to zero

# Save full train state for resume
training_client.save_state("train_state_step_100").result()

# Resume training (weights + optimizer restored)
training_client.load_state_with_optimizer("train_state_step_100").result()

save_state accepts an optional ttl_seconds parameter for auto-expiring checkpoints.

For the raw FiretitanTrainingClient, save_state(), load_state(), and load_state_with_optimizer() return futures — call .result() to block. The cookbook’s ReconnectableClient wrapper blocks for you.
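
The future-then-.result() pattern can be seen in isolation with Python's standard library. This is only an analogy: `fake_save_state` is a hypothetical stand-in for a slow save call, and the SDK's future type is not necessarily `concurrent.futures.Future`. The point is that submitting the work returns immediately, and .result() blocks until the operation has finished.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_save_state(name: str) -> str:
    """Stand-in for a slow save operation (not an SDK call)."""
    time.sleep(0.05)
    return name


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns a future right away; the save runs in the background.
    future = pool.submit(fake_save_state, "train_state_step_100")
    # result() blocks until the save has completed (or the timeout expires).
    saved_name = future.result(timeout=5)

print(saved_name)
```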

Cross-job checkpoint resolution

To load a checkpoint written by a different trainer job, first resolve its path with resolve_checkpoint_path, passing that job's ID as source_job_id, then pass the result to a load method.

checkpoint_ref = training_client.resolve_checkpoint_path(
    "step-4",
    source_job_id="previous-job-id",
)
training_client.load_state_with_optimizer(checkpoint_ref).result()

List available checkpoints

checkpoint_names = training_client.list_checkpoints()
print(checkpoint_names)  # e.g. ["step-2", "step-4"]
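
When resuming, you often want the most recent checkpoint from that list. The sketch below assumes names follow the "step-N" pattern shown in the example output; `latest_step_checkpoint` is a hypothetical helper, not an SDK function. It compares step numbers numerically, because a plain string sort would rank "step-4" above "step-10".

```python
import re
from typing import Iterable, Optional

_STEP_RE = re.compile(r"step-(\d+)")


def latest_step_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the name with the highest step number, or None if no name
    matches the assumed "step-N" pattern. Non-matching names are skipped."""
    best_name: Optional[str] = None
    best_step = -1
    for name in names:
        match = _STEP_RE.fullmatch(name)
        if match is None:
            continue
        step = int(match.group(1))
        if step > best_step:
            best_name, best_step = name, step
    return best_name


latest = latest_step_checkpoint(["step-2", "step-10", "step-4"])
print(latest)
```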

Related guides

Checkpoints and Resume (cookbook): checkpoints.jsonl, resume, and promotion via cookbook
WeightSyncer reference: full weight sync lifecycle
DeploymentManager reference: direct hotload API
