# Price comparison vs Tinker

> Estimate the cost of multi-turn agentic RL rollouts on Fireworks compared to Tinker's per-token pricing

export const MultiTurnCostCalculator = () => {
  const TINKER = {
    "kimi-k2p6": {
      prefill: 5.15,
      sample: 12.81,
      label: "Kimi K2.6 (128K)"
    },
    "kimi-k2p5": {
      prefill: 5.15,
      sample: 12.81,
      label: "Kimi K2.5 (128K)"
    },
    "qwen3p5-397b": {
      prefill: 4.0,
      sample: 10.0,
      label: "Qwen3.5-397B-A17B (256K)"
    },
    "gpt-oss-120b": {
      prefill: 0.63,
      sample: 1.54,
      label: "GPT-OSS-120B (128K)"
    }
  };
  const FW_GPU = {
    H100_80GB: 7.0,
    H200_141GB: 7.0,
    B200_180GB: 10.0,
    B300_288GB: 12.0
  };
  const DEDICATED_DEFAULTS = {
    "kimi-k2p6": {
      gpus: 8,
      gpuType: "B300_288GB",
      promptTokensPerSec: 30000,
      decodeTokensPerSec: 1500
    },
    "kimi-k2p5": {
      gpus: 8,
      gpuType: "B300_288GB",
      promptTokensPerSec: 30000,
      decodeTokensPerSec: 1500
    },
    "qwen3p5-397b": {
      gpus: 8,
      gpuType: "B200_180GB",
      promptTokensPerSec: 20000,
      decodeTokensPerSec: 1200
    },
    "gpt-oss-120b": {
      gpus: 4,
      gpuType: "H200_141GB",
      promptTokensPerSec: 25000,
      decodeTokensPerSec: 2500
    }
  };
  const [modelKey, setModelKey] = useState("kimi-k2p6");
  const [episodes, setEpisodes] = useState(1000);
  const [turnsPerEpisode, setTurnsPerEpisode] = useState(8);
  const [initContext, setInitContext] = useState(8000);
  const [tokensAddedPerTurn, setTokensAddedPerTurn] = useState(12000);
  const [decodePerTurn, setDecodePerTurn] = useState(2000);
  const dedDefaults = DEDICATED_DEFAULTS[modelKey];
  const [gpus, setGpus] = useState(dedDefaults.gpus);
  const [promptTps, setPromptTps] = useState(dedDefaults.promptTokensPerSec);
  const [decodeTps, setDecodeTps] = useState(dedDefaults.decodeTokensPerSec);
  const onModelChange = k => {
    setModelKey(k);
    const d = DEDICATED_DEFAULTS[k];
    setGpus(d.gpus);
    setPromptTps(d.promptTokensPerSec);
    setDecodeTps(d.decodeTokensPerSec);
  };
  const T = turnsPerEpisode;
  const sumPromptTokens = T * initContext + tokensAddedPerTurn * (T * (T - 1)) / 2;
  const finalContext = initContext + (T - 1) * tokensAddedPerTurn;
  const decodeTokens = T * decodePerTurn;
  const tinkerPromptTokens = sumPromptTokens;
  const tinkerDecodeTokens = decodeTokens;
  const fwUncachedPromptTokens = finalContext;
  const fwCachedPromptTokens = Math.max(0, sumPromptTokens - finalContext);
  const cacheHitPct = sumPromptTokens > 0 ? fwCachedPromptTokens / sumPromptTokens * 100 : 0;
  const fmt = n => `$${n.toFixed(n < 10 ? 2 : 0)}`;
  const M = 1_000_000;
  const tinker = TINKER[modelKey];
  const tinkerPerEpisode = tinkerPromptTokens / M * tinker.prefill + tinkerDecodeTokens / M * tinker.sample;
  const tinkerTotal = tinkerPerEpisode * episodes;
  const totalUncachedPromptTokens = fwUncachedPromptTokens * episodes;
  const totalDecodeTokens = decodeTokens * episodes;
  const prefillTimeSec = totalUncachedPromptTokens / promptTps;
  const decodeTimeSec = totalDecodeTokens / decodeTps;
  const dedicatedSec = Math.max(prefillTimeSec, decodeTimeSec);
  const dedicatedHours = dedicatedSec / 3600;
  const gpuRate = FW_GPU[dedDefaults.gpuType] || 10.0;
  const dedicatedTotal = dedicatedHours * gpus * gpuRate;
  const compareLabel = (tinker, fw) => {
    if (!(fw > 0) || !(tinker > 0)) return "—";
    const r = tinker / fw;
    if (r >= 1.05) return r.toFixed(1) + "× cheaper";
    if (r <= 0.95) return (1 / r).toFixed(1) + "× more";
    return "at par";
  };
  const tinkerSec = tinkerTotal * 3600 / (gpus * gpuRate);
  const promptTpsParity = totalUncachedPromptTokens / Math.max(tinkerSec, 1);
  const decodeTpsParity = totalDecodeTokens / Math.max(tinkerSec, 1);
  const baseSec = Math.max(totalUncachedPromptTokens / promptTps, totalDecodeTokens / decodeTps);
  const gpusParityMax = Math.max(1, Math.floor(tinkerTotal * 3600 / (baseSec * gpuRate)));
  const labelCls = "text-sm font-medium text-zinc-700 dark:text-zinc-300";
  const inputCls = "w-full mt-2 rounded-lg border border-zinc-300 dark:border-zinc-600 bg-white dark:bg-zinc-800 px-3 py-2 text-sm text-zinc-900 dark:text-zinc-100";
  return <div className="not-prose mt-4 mb-4 p-6 rounded-2xl border border-zinc-200 dark:border-zinc-700 bg-white dark:bg-zinc-900">
      <div style={{
    display: "grid",
    gridTemplateColumns: "repeat(auto-fit, minmax(180px, 1fr))",
    gap: "16px",
    marginBottom: "24px"
  }}>
        <div style={{
    gridColumn: "1 / -1"
  }}>
          <label className={labelCls}>Model</label>
          <select value={modelKey} onChange={e => onModelChange(e.target.value)} className={inputCls}>
            {Object.entries(TINKER).map(([k, v]) => <option key={k} value={k}>{v.label}</option>)}
          </select>
        </div>

        <div>
          <label className={labelCls}>Episodes</label>
          <input type="number" min={1} max={10_000_000} value={episodes} onChange={e => setEpisodes(Math.max(1, Number(e.target.value) || 1))} className={inputCls} />
        </div>
        <div>
          <label className={labelCls}>Turns per episode</label>
          <input type="number" min={1} max={200} value={turnsPerEpisode} onChange={e => setTurnsPerEpisode(Math.max(1, Number(e.target.value) || 1))} className={inputCls} />
        </div>
        <div>
          <label className={labelCls}>Initial prompt (tokens)</label>
          <input type="number" min={0} value={initContext} onChange={e => setInitContext(Math.max(0, Number(e.target.value) || 0))} className={inputCls} />
        </div>
        <div>
          <label className={labelCls}>Tokens added per turn (gen + tool result)</label>
          <input type="number" min={0} value={tokensAddedPerTurn} onChange={e => setTokensAddedPerTurn(Math.max(0, Number(e.target.value) || 0))} className={inputCls} />
        </div>
        <div>
          <label className={labelCls}>Generation tokens per turn</label>
          <input type="number" min={0} value={decodePerTurn} onChange={e => setDecodePerTurn(Math.max(0, Number(e.target.value) || 0))} className={inputCls} />
        </div>
      </div>

      <details>
        <summary className="cursor-pointer text-sm text-zinc-600 dark:text-zinc-400 mb-4">
          Advanced — dedicated cluster sizing & throughput (defaults are saturated estimates)
        </summary>
        <div style={{
    display: "grid",
    gridTemplateColumns: "repeat(auto-fit, minmax(180px, 1fr))",
    gap: "16px",
    marginBottom: "16px"
  }}>
          <div>
            <label className={labelCls}>
              Dedicated GPUs ({dedDefaults.gpuType})
              <span style={{
    color: "#888",
    fontWeight: "normal"
  }}> · max {gpusParityMax} (at-par with Tinker)</span>
            </label>
            <input type="number" min={1} max={gpusParityMax} value={gpus} onChange={e => {
    const v = Number(e.target.value) || 1;
    setGpus(Math.min(gpusParityMax, Math.max(1, v)));
  }} className={inputCls} />
          </div>
          <div>
            <label className={labelCls}>
              Prompt tokens/sec (cluster)
              <span style={{
    color: "#888",
    fontWeight: "normal"
  }}> · min {Math.ceil(promptTpsParity).toLocaleString()} (at-par)</span>
            </label>
            <input type="number" min={Math.ceil(promptTpsParity)} value={promptTps} onChange={e => {
    const v = Number(e.target.value) || promptTpsParity;
    setPromptTps(Math.max(Math.ceil(promptTpsParity), v));
  }} className={inputCls} />
          </div>
          <div>
            <label className={labelCls}>
              Decode tokens/sec (cluster)
              <span style={{
    color: "#888",
    fontWeight: "normal"
  }}> · min {Math.ceil(decodeTpsParity).toLocaleString()} (at-par)</span>
            </label>
            <input type="number" min={Math.ceil(decodeTpsParity)} value={decodeTps} onChange={e => {
    const v = Number(e.target.value) || decodeTpsParity;
    setDecodeTps(Math.max(Math.ceil(decodeTpsParity), v));
  }} className={inputCls} />
          </div>
        </div>
      </details>

      <div className="text-sm text-zinc-700 dark:text-zinc-300 mb-3">
        <div>
          <strong>Per-episode token math:</strong>{" "}
          {sumPromptTokens.toLocaleString()} prompt-tokens-summed (Tinker bills all of these as prefill);{" "}
          {finalContext.toLocaleString()} unique tokens prefilled on Fireworks (rest hit the prefix cache);{" "}
          {decodeTokens.toLocaleString()} decode tokens.{" "}
          <strong>Cache hit:</strong> {cacheHitPct.toFixed(1)}%
        </div>
      </div>

      <div style={{
    display: "flex",
    flexDirection: "column",
    gap: "12px"
  }}>
        {[{
    key: "tinker",
    title: "Tinker",
    subtitle: "per-token, no cross-turn cache",
    perEp: tinkerPerEpisode,
    total: tinkerTotal,
    vs: "1×",
    highlight: false
  }, {
    key: "fw-ded",
    title: "Fireworks Dedicated",
    subtitle: `${gpus}× ${dedDefaults.gpuType.replace("_", " ")} · $${gpuRate}/GPU-hr · ≈ ${dedicatedHours.toFixed(2)} cluster-hours at chosen throughput`,
    perEp: dedicatedTotal / episodes,
    total: dedicatedTotal,
    vs: compareLabel(tinkerTotal, dedicatedTotal),
    highlight: true
  }].map(row => <div key={row.key} className={"rounded-xl border p-4 " + (row.highlight ? "border-zinc-300 dark:border-zinc-600 bg-zinc-50 dark:bg-zinc-800/40" : "border-zinc-200 dark:border-zinc-700")}>
            <div className="text-sm font-semibold text-zinc-900 dark:text-zinc-100" style={{
    wordBreak: "break-word"
  }}>
              {row.title}
            </div>
            <div className="text-xs text-zinc-500 dark:text-zinc-400 mt-1" style={{
    wordBreak: "break-word"
  }}>
              {row.subtitle}
            </div>
            <div style={{
    display: "grid",
    gridTemplateColumns: "repeat(auto-fit, minmax(110px, 1fr))",
    gap: "8px",
    marginTop: "10px"
  }}>
              <div>
                <div className="text-[11px] uppercase tracking-wide text-zinc-500 dark:text-zinc-400">$ / episode</div>
                <div className="text-base font-medium text-zinc-900 dark:text-zinc-100" style={{
    fontVariantNumeric: "tabular-nums"
  }}>
                  {fmt(row.perEp)}
                </div>
              </div>
              <div>
                <div className="text-[11px] uppercase tracking-wide text-zinc-500 dark:text-zinc-400">
                  $ / {episodes.toLocaleString()} ep
                </div>
                <div className="text-base font-medium text-zinc-900 dark:text-zinc-100" style={{
    fontVariantNumeric: "tabular-nums"
  }}>
                  {fmt(row.total)}
                </div>
              </div>
              <div>
                <div className="text-[11px] uppercase tracking-wide text-zinc-500 dark:text-zinc-400">vs. Tinker</div>
                <div className="text-base font-semibold" style={{
    fontVariantNumeric: "tabular-nums",
    color: row.key === "tinker" ? "#71717a" : "#10b981"
  }}>
                  {row.vs}
                </div>
              </div>
            </div>
          </div>)}
      </div>

      <p style={{
    marginTop: 16,
    fontSize: 12,
    color: "#666"
  }}>
        Order-of-magnitude estimates. Pricing constants reflect rates last verified 2026-05-16:
        Tinker rates from <a href="https://thinkingmachines.ai/tinker">thinkingmachines.ai/tinker</a>,
        Fireworks GPU-hour from <a href="https://fireworks.ai/pricing">fireworks.ai/pricing</a>.
        Dedicated throughput defaults are saturated estimates; tune them in the Advanced panel for your workload.
      </p>
    </div>;
};

If you're running RL or agentic post-training on a long-context model and your
provider bills you per token with **no cross-turn prefix cache**, the prefill
cost grows quadratically with the number of turns — every turn re-prefills the
full conversation history. On Fireworks Dedicated, session-affinity routing
keeps an episode pinned to one replica so the KV cache is reused across turns,
and cached prompt tokens contribute essentially zero extra compute.

The calculator below makes that difference concrete. Set your episode shape
(turns, context growth, generation length) and compare:

* **Tinker** — flat per-token billing, no cross-turn cache (re-prefill every turn)
* **Fireworks Dedicated** — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate

<MultiTurnCostCalculator />

## Performance and benchmarking notes

### Dedicated trainer vs pooled/serverless resourcing

Tinker runs training jobs on a **pooled/serverless** GPU fleet, which lets a
single job burst onto many more GPUs than you would dedicate to a replica on
Fireworks. That burst is what makes individual Tinker steps feel fast — but it
also **caps the maximum training speed you can buy**: you cannot pay to scale
beyond the pool's per-job allocation, and you cannot reserve isolated capacity.

Fireworks dedicated trainers take the opposite trade-off: predictable,
isolated execution with no shared-pool queueing or noisy-neighbor variance,
and the ability to scale **wall-clock time and cost independently** by
adjusting replica count. If you want faster steps on dedicated, increase
replica count and parallelize work.

For **large model training or longer rollouts**, we have consistently found
that a dedicated setup like ours is **cheaper overall and can also be faster**,
depending on the customer's resourcing needs.

### Context-length benchmarking caveat

Benchmark comparisons are only apples-to-apples when truncation policy and
effective context length are matched. If one system truncates `>32k` samples
and another does not, the non-truncating run is doing more work and will
appear slower.

### Replica count is a speed/cost knob

Scaling replicas lets you trade cost against wall-clock time. A quick
back-of-envelope estimate of the resulting per-token rate:

$$
\text{\$ / 1M tokens} \approx \frac{\text{GPU count} \cdot \text{\$ / GPU-hour}}{\text{tokens/sec(cluster)} \cdot 3600} \cdot 10^6
$$
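
The estimate above can be sketched in a few lines of Python. The function name and the example inputs (8 GPUs at \$12/GPU-hour sustaining 30,000 tokens/sec across the cluster) are illustrative assumptions, not measured figures:

```python
def dollars_per_million_tokens(
    gpu_count: int,
    dollars_per_gpu_hour: float,
    cluster_tokens_per_sec: float,
) -> float:
    """Back-of-envelope $/1M tokens for a cluster billed by GPU-hour."""
    if cluster_tokens_per_sec <= 0:
        raise ValueError("cluster_tokens_per_sec must be positive")
    cluster_dollars_per_hour = gpu_count * dollars_per_gpu_hour
    tokens_per_hour = cluster_tokens_per_sec * 3600
    return cluster_dollars_per_hour / tokens_per_hour * 1_000_000


# Illustrative inputs only: 8 GPUs, $12/GPU-hour, 30,000 tokens/sec.
rate = dollars_per_million_tokens(8, 12.0, 30_000)
print(f"${rate:.2f} per 1M tokens")  # $0.89 per 1M tokens
```

If doubling the GPU count less than doubles cluster throughput, the per-token rate goes up even though wall-clock time goes down.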

### Check utilization before scaling

Fireworks Dedicated is billed by GPU-hour, so low rollout traffic can make a
job look slow or expensive even when the deployment has spare capacity. Before
adding replicas, first confirm whether the inference deployment is saturated or
waiting for more work from your rollout client.

Useful signals:

* **Per-request performance metrics:** log Fireworks response metrics such as
  prompt tokens, cached prompt tokens, time to first token, and total server
  processing time from your rollout client. Non-streaming requests include
  these in response headers; for streaming requests, set
  [`perf_metrics_in_response`](/guides/querying-text-models#usage--performance-tracking)
  to include them in the final response chunk.
* **Deployment-level metrics:** export
  [Prometheus-style metrics](/deployments/exporting-metrics) for request rate,
  prompt and cached-token rates, queue latency, KV-cache usage, and concurrent
  request count. Low request/concurrency metrics with low queueing usually mean
  the deployment can accept more traffic.
* **Training API efficiency hints:** when available, monitor
  `trainer/training_efficiency/.../effective_batch_fill_ratio:last` and
  `trainer/training_efficiency/.../trainer_waiting_for_work:last`. These are
  returned in the `metrics` dict on your `forward` / `forward_backward`
  responses, not on the deployment dashboard. Low batch fill or a
  trainer-waiting-for-work signal usually points to the rollout side not
  feeding the trainer fast enough. See
  [Reading Training API efficiency metrics](#reading-training-api-efficiency-metrics)
  below for how to access and interpret them.

If the deployment is not saturated, increase rollout traffic first. For managed
RFT and Training API jobs, the main throughput knob is concurrent rollouts; see
[`max_concurrent_rollouts`](/fine-tuning/parameter-tuning#max-concurrent-rollouts)
and the Training API [deployment replica guidance](/fine-tuning/training-api/reference/deployment-manager#deployment-shape-and-training-shapes).

#### Reading Training API efficiency metrics

The two `trainer/training_efficiency/...` metrics are returned in the `metrics`
dict on your `forward` / `forward_backward` responses. They do **not** appear on
inference deployment dashboards; the per-request and deployment-level signals
above are separate.

```python theme={null}
# forward / forward_backward return a future; call .result() to get the response
result = training_client.forward_backward(datums, "cross_entropy").result()

# result.metrics is a dict; it includes:
#   trainer/training_efficiency/.../effective_batch_fill_ratio:last
#   trainer/training_efficiency/.../trainer_waiting_for_work:last
print(result.metrics)
```

* **`effective_batch_fill_ratio:last`**: the number of tokens in a batch
  divided by the maximum possible. **1.0 means fully saturated**; consistently
  low values across steps indicate under-filling.
* **`trainer_waiting_for_work:last`**: how much time the trainer (GPU) sat idle
  since the last op, i.e. the gap between `forward` calls. More waiting means the
  trainer is starved for work.

Low fill or significant waiting-for-work means the rollout side isn't feeding the
trainer fast enough: raise rollout concurrency
(`max_concurrent_rollouts`) before adding deployment replicas.
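
One way to act on these two metrics is a small check over the `metrics` dict, sketched below. It matches keys by suffix because the middle of the metric path is elided here. The helper names and both thresholds (`min_fill`, `max_wait`) are hypothetical, and the wait threshold uses whatever unit the trainer reports:

```python
from typing import Mapping, Optional


def find_metric(metrics: Mapping[str, float], suffix: str) -> Optional[float]:
    """Return the first metric whose key ends with `suffix`, or None."""
    for key, value in metrics.items():
        if key.endswith(suffix):
            return value
    return None


def rollout_starved(
    metrics: Mapping[str, float],
    min_fill: float = 0.8,  # hypothetical threshold, tune for your job
    max_wait: float = 1.0,  # hypothetical threshold, in the reported unit
) -> bool:
    """True if either signal suggests the trainer is waiting on rollouts."""
    fill = find_metric(metrics, "effective_batch_fill_ratio:last")
    wait = find_metric(metrics, "trainer_waiting_for_work:last")
    low_fill = fill is not None and fill < min_fill
    long_wait = wait is not None and wait > max_wait
    return low_fill or long_wait


# Example with made-up values; the "..." in each key is kept as a literal.
sample = {
    "trainer/training_efficiency/.../effective_batch_fill_ratio:last": 0.45,
    "trainer/training_efficiency/.../trainer_waiting_for_work:last": 0.2,
}
print(rollout_starved(sample))  # True: batch fill is below min_fill
```

In a training loop you would pass `result.metrics` from each step and look at the trend over several steps rather than reacting to a single reading.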

## How the numbers come together

### Tinker (the cost customers describe)

Each turn re-prefills the full accumulated context:

$$
\text{Prefill tokens (Tinker)} = \sum_{t=1}^{T} P_t = T \cdot P_1 + \Delta \cdot \frac{T(T-1)}{2}
$$

…where $P_1$ is the initial prompt (system + tools + task), $\Delta$ is the
context added per turn (model response + tool result), and $T$ is the turn
count. This is **quadratic in $T$**.

$$
\text{Cost (Tinker)} = \frac{\text{Prefill tokens}}{10^6} \cdot r_{\text{prefill}} + \frac{\text{Decode tokens}}{10^6} \cdot r_{\text{sample}}
$$
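
As a worked example, the sketch below evaluates both formulas with the calculator's default episode shape (8 turns, 8,000-token initial prompt, 12,000 tokens added per turn, 2,000 generated tokens per turn) and the Kimi K2.6 rates from the table on this page. The function name is illustrative:

```python
def tinker_episode_cost(
    turns: int,
    initial_prompt: int,
    added_per_turn: int,
    decode_per_turn: int,
    prefill_rate: float,  # $ per 1M prefill tokens
    sample_rate: float,   # $ per 1M sampled tokens
) -> tuple[int, int, float]:
    """Return (prefill tokens, decode tokens, $ cost) for one episode."""
    # Sum of P_t over turns: T * P_1 + delta * T(T-1)/2
    prefill_tokens = turns * initial_prompt + added_per_turn * turns * (turns - 1) // 2
    decode_tokens = turns * decode_per_turn
    cost = prefill_tokens / 1e6 * prefill_rate + decode_tokens / 1e6 * sample_rate
    return prefill_tokens, decode_tokens, cost


prefill, decode, cost = tinker_episode_cost(8, 8_000, 12_000, 2_000, 5.15, 12.81)
print(prefill, decode, f"${cost:.2f}")  # 400000 16000 $2.26
```

Doubling the turn count to 16 pushes the summed prefill to 1,568,000 tokens, nearly four times the 8-turn figure, which is the quadratic growth in action.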

### Fireworks Dedicated — GPU-hour billing

Dedicated deployments are billed per GPU-second, so the prefix cache shows up
as **higher effective throughput** rather than a discount on per-token rates.
Across one episode, each unique token is prefilled at most once — the rest of
the prompt is served from the prefix cache and contributes essentially no GPU
work. The uncached portion that actually hits prefill is:

$$
\text{Uncached prompt} = P_T = P_1 + (T - 1) \Delta
$$
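
To see how much of the summed prompt the prefix cache absorbs, the sketch below splits one episode's prompt tokens into uncached and cached portions, using the same arithmetic as the calculator. The function name is illustrative and the inputs are the calculator's default episode shape:

```python
def prefix_cache_split(
    turns: int, initial_prompt: int, added_per_turn: int
) -> tuple[int, int, float]:
    """Return (uncached tokens, cached tokens, cache hit fraction) per episode."""
    # Prompt tokens summed over all turns (what a no-cache provider prefills).
    summed = turns * initial_prompt + added_per_turn * turns * (turns - 1) // 2
    # Final context P_T: each unique token is prefilled at most once.
    uncached = initial_prompt + (turns - 1) * added_per_turn
    cached = max(0, summed - uncached)
    hit = cached / summed if summed > 0 else 0.0
    return uncached, cached, hit


uncached, cached, hit = prefix_cache_split(8, 8_000, 12_000)
print(uncached, cached, f"{hit:.1%}")  # 92000 308000 77.0%
```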

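On a saturated cluster, the calculator takes the slower of the prefill and
decode phases as the wall-clock time for a run of $E$ episodes, where decode
tokens per episode are $T$ times the generation tokens per turn:

$$
\text{Cluster-hours} = \frac{E \cdot \max\left(\dfrac{\text{Uncached prompt}}{\text{prefill TPS}},\ \dfrac{\text{Decode tokens}}{\text{decode TPS}}\right)}{3600}
$$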

$$
\text{Cost} = \text{Cluster-hours} \cdot N_{\text{GPU}} \cdot r_{\text{GPU/hr}}
$$
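
As a rough sketch of the dedicated-side arithmetic, the snippet below follows the calculator's approach: it scales the per-episode uncached prompt and decode tokens by the episode count, takes the slower of the prefill and decode phases as wall-clock time, and multiplies cluster-hours by GPU count and hourly rate. The function name is illustrative. The inputs are the calculator's Kimi K2.6 defaults (8 GPUs at \$12/GPU-hour, 30,000 prompt tokens/sec, 1,500 decode tokens/sec), which are saturated estimates rather than measurements:

```python
def dedicated_run_cost(
    episodes: int,
    uncached_prompt_per_episode: int,
    decode_tokens_per_episode: int,
    prefill_tps: float,  # cluster-wide prompt tokens/sec
    decode_tps: float,   # cluster-wide decode tokens/sec
    gpu_count: int,
    dollars_per_gpu_hour: float,
) -> tuple[float, float]:
    """Return (cluster-hours, $ cost) for a run on a saturated cluster."""
    prefill_sec = episodes * uncached_prompt_per_episode / prefill_tps
    decode_sec = episodes * decode_tokens_per_episode / decode_tps
    # The slower phase sets wall-clock time, as in the calculator.
    cluster_hours = max(prefill_sec, decode_sec) / 3600
    return cluster_hours, cluster_hours * gpu_count * dollars_per_gpu_hour


# 1,000 episodes; 92,000 uncached prompt and 16,000 decode tokens per episode.
hours, cost = dedicated_run_cost(1_000, 92_000, 16_000, 30_000, 1_500, 8, 12.0)
print(f"{hours:.2f} cluster-hours, ${cost:.2f}")  # 2.96 cluster-hours, $284.44
```

With these defaults decode is the bottleneck, so raising only the prompt throughput would not change the estimate.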

Because cached tokens contribute essentially nothing to wall-clock work, the
cluster's effective \$/M token rate falls as utilization rises. For continuous
RL training, where rollouts run at sustained pace, dedicated is typically the
cheapest path at scale.

<Note>
  The calculator's dedicated path uses *saturated* throughput estimates as
  defaults. A small, lightly-loaded test deployment will look more expensive
  per token than these numbers because the cluster is paid for whether it's
  busy or idle. Tune the throughput inputs in the **Advanced** panel to match
  your actual rollout pace.
</Note>

## What's covered

The calculator currently includes the four models for which Tinker publishes
per-token rates:

| Model                    | Tinker prefill / sample (per 1M) |
| ------------------------ | -------------------------------- |
| Kimi K2.6 (128K)         | $5.15 / $12.81                   |
| Kimi K2.5 (128K)         | $5.15 / $12.81                   |
| Qwen3.5-397B-A17B (256K) | $4.00 / $10.00                   |
| GPT-OSS-120B (128K)      | $0.63 / $1.54                    |

All Fireworks-side rates are taken from the public pages linked under Sources
below. The constants live in `snippets/multi-turn-cost-calculator.jsx`; update
them there if either side's pricing changes.

## FAQ

### What is the fastest way to reduce wall-clock time?

Increase replicas and overlap sampling/training where your workflow allows it.
Those are usually the most direct levers for shortening end-to-end cycle time.

### How should I compare costs between providers?

Use matched assumptions for context length, truncation policy, and effective
resource allocation. The calculator at the top of this page handles the token
math once you plug in your episode shape, but align truncation policy and
effective context window between providers before drawing conclusions.

## Sources

* Tinker pricing: [thinkingmachines.ai/tinker](https://thinkingmachines.ai/tinker)
* Fireworks GPU-hour pricing: [fireworks.ai/pricing](https://fireworks.ai/pricing)
* Related: [RFT Cost Estimator](/fine-tuning/rft-cost-estimator) — same idea, but
  for the training-side bill (Fireworks GPU-hour, no comparison column).

<Warning>
  This is an estimator, not a quote. Real costs depend on your exact workload,
  cache hit rate, hardware utilization, and rate-card terms at run time.
</Warning>
