
Prompt Caching vs. Structured Output on Amazon Bedrock: What Actually Invalidates the Cache

· Grzegorz Juszkiewicz
aws · bedrock · claude · prompt-caching · structured-output · python

TL;DR

  • output_config / outputConfig is part of the prompt-cache key. Changing the JSON schema between calls — even by a single description field — invalidates the cache completely. This is empirically confirmed but not documented by AWS or Anthropic.
  • Caching and Bedrock-enforced structured output are mutually exclusive when the schema varies. There is no way today to get both simultaneously across calls with different schemas.
  • The workaround: keep the schema out of output_config and put it in the prompt instead. Place a cache_control / cachePoint boundary after the document, let the schema text sit after the boundary, and validate the output client-side with jsonschema. You get full cache hits and reliable JSON.
  • The Converse API with cachePoint is the simplest pattern — the boundary fences the document hash cleanly, and changing the schema text after it has no effect on cache hits.
  • If you must use InvokeModel with varying schemas, use dual cache_control checkpoints: one on the document, one on a static instruction block. Keep the variable schema in a third, uncached block.
  • Description-only schema changes also invalidate the cache. The cache key is the full serialised JSON string — no semantic parsing occurs.

Prompt caching on Amazon Bedrock promises up to a 90 % discount on repeated input tokens and dramatically lower latency once a prefix is warm. The pitch is compelling for document-processing pipelines: upload a long PDF once, cache it, then fire multiple extraction requests against it.

Structured output is equally attractive — Bedrock’s native output_config / outputConfig enforces a JSON schema server-side, so you never have to parse or retry malformed model responses.

The natural question is: can you have both at the same time? Upload a document, cache it, and extract different schemas from it on separate calls — each with full server-side schema enforcement?

The short answer is no. But the full picture is more nuanced and worth walking through carefully, because there is a clean workaround that gets you most of the way there.


Background

How prompt caching works

When you mark a content block with cache_control: {"type": "ephemeral"} (InvokeModel) or insert a cachePoint block (Converse API), Bedrock computes a hash of everything from the start of the request up to and including that block, stores the key-value attention tensors from prefill, and returns them on subsequent requests whose prefix hash matches. The default TTL is 5 minutes.
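In request terms, the two markers look like this (a minimal sketch; long_document_text is a placeholder for the content you want cached):

long_document_text = "<many thousands of tokens of document text>"  # placeholder

# InvokeModel (Anthropic messages format): attach cache_control to the last
# block that should be part of the cached prefix
invoke_content = [
    {"type": "text", "text": long_document_text,
     "cache_control": {"type": "ephemeral"}},          # prefix hashed up to here
    {"type": "text", "text": "First question..."},     # varies freely per call
]

# Converse API: a standalone cachePoint block plays the same role
converse_content = [
    {"text": long_document_text},
    {"cachePoint": {"type": "default"}},               # prefix hashed up to here
    {"text": "First question..."},                     # varies freely per call
]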

The hash is cumulative and strict:

“Because the hash is cumulative, covering everything up to and including the breakpoint, changing any block at or before the breakpoint produces a different hash on the next request.” — Anthropic prompt caching docs

Cache reads are billed at 0.1× the normal input-token price. Cache writes cost 1.25× for the default 5-minute TTL, or 2× for the 1-hour TTL available on select models. For a large document processed many times the savings compound quickly.
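A quick back-of-the-envelope sketch of how those multipliers play out (the $3-per-million-token input price is illustrative, not quoted from Bedrock pricing):

def doc_token_cost(doc_tokens: int, calls: int, price_per_token: float = 3e-6):
    # One cache write at 1.25x on the first call, then 0.1x reads on the rest
    uncached = calls * doc_tokens * price_per_token
    cached   = doc_tokens * price_per_token * (1.25 + 0.1 * (calls - 1))
    return uncached, cached

# 10,000-token document, 10 extraction calls:
# uncached ≈ $0.300, cached ≈ $0.0645, roughly 4.7x cheaper on document tokens
print(doc_token_cost(10_000, 10))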

Two separate caches — do not confuse them

When structured output is involved, Bedrock actually maintains two distinct caches:

| Cache               | What it stores                                     | TTL             | Scope                   |
|---------------------|----------------------------------------------------|-----------------|-------------------------|
| Prompt / KV cache   | Attention key-value tensors from prefill           | 5 min (default) | Per request prefix hash |
| Grammar / FSM cache | Compiled finite-state machine from the JSON schema | 24 hours        | Per AWS account         |

This investigation focuses entirely on the prompt cache. The grammar cache is separately documented and not the subject of the findings below.

Minimum token threshold

The minimum cacheable prefix depends on the model: Sonnet 4.6, earlier Sonnets, and Opus 4 require at least 1,024 tokens per cache checkpoint, while Claude Opus 4.5 / 4.6 / 4.7 and Haiku 4.5 require 4,096. Our test PDFs consistently produce ~9,900–10,260 tokens — well above either threshold.
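A prefix below the minimum is processed normally but silently skipped for caching, so the only signal is zeroes in both usage counters. A small guard (Converse key names; a sketch):

def assert_cache_active(usage: dict) -> None:
    # A sub-minimum prefix raises no error: the request just runs uncached.
    # After the first call, one of the two counters should be non-zero.
    wrote = usage.get("cacheWriteInputTokens", 0)
    read  = usage.get("cacheReadInputTokens", 0)
    if wrote == 0 and read == 0:
        raise RuntimeError("No cache activity; prefix may be below the token minimum")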


Experimental setup

Model and APIs

  • Model: eu.anthropic.claude-sonnet-4-6 (eu-west-1)
  • APIs tested: Converse API (cachePoint + outputConfig) and InvokeModel API (cache_control + output_config)

Why we generate a unique PDF per run

If you reuse the same document across test runs, a previous run’s warm cache (5-minute TTL) can silently produce false-positive cache hits. To guarantee a cold cache at the start of every run we generate a fresh, UUID-seeded PDF each time using fpdf2:

# pdf.py (simplified)
import uuid
from pathlib import Path
from fpdf import FPDF

def generate():
    run_id = uuid.uuid4().hex[:8]
    filename = Path(f"report_{run_id}.pdf")   # unique output path per run
    pdf = FPDF()
    pdf.set_margins(left=15, top=15, right=15)
    pdf.add_page()
    pdf.set_font("Helvetica", size=11)
    W = pdf.epw

    pdf.cell(W, 10, f"Distributed Systems Performance Report - Run {run_id}")
    pdf.ln(12)   # drop below the title line before writing the sections

    # 20 sections, each seeded with fresh UUIDs
    for i in range(1, 21):
        section_id = uuid.uuid4().hex
        latency     = uuid.uuid4().int % 500 + 100
        throughput  = uuid.uuid4().int % 8000 + 500
        trace_id    = uuid.uuid4()
        text = (
            f"Section {i:02d} [{section_id}]: Latency {latency}ms, "
            f"throughput {throughput} req/s, trace {trace_id}. ..."
        )
        pdf.multi_cell(W, 7, text)

    pdf.output(str(filename))
    return filename, filename.read_bytes()

The generator produces PDFs that tokenise to roughly 9,900–10,200 tokens — safely above the 1,024-token minimum.

Utility wrapper

All API calls go through a thin boto3 wrapper that normalises token counts and measures latency:

# client.py
import json, time, boto3
from dataclasses import dataclass

MODEL_ID = "eu.anthropic.claude-sonnet-4-6"
REGION   = "eu-west-1"

@dataclass
class CacheResult:
    output_text:        str
    cache_write_tokens: int
    cache_read_tokens:  int
    input_tokens:       int
    output_tokens:      int
    latency_ms:         float

    @property
    def cache_hit(self) -> bool:
        return self.cache_read_tokens > 0


def converse(messages, system=None, output_config=None, max_tokens=4096) -> CacheResult:
    client = boto3.client("bedrock-runtime", region_name=REGION)
    kwargs = {"modelId": MODEL_ID, "messages": messages,
              "inferenceConfig": {"maxTokens": max_tokens}}
    if system:        kwargs["system"]       = system
    if output_config: kwargs["outputConfig"] = output_config

    t0       = time.perf_counter()
    response = client.converse(**kwargs)
    latency  = (time.perf_counter() - t0) * 1000

    usage = response.get("usage", {})
    return CacheResult(
        output_text        = _extract_text_converse(response),
        cache_write_tokens = usage.get("cacheWriteInputTokens", 0),
        cache_read_tokens  = usage.get("cacheReadInputTokens",  0),
        input_tokens       = usage.get("inputTokens",  0),
        output_tokens      = usage.get("outputTokens", 0),
        latency_ms         = latency,
    )


def invoke_model(body: dict, max_tokens=4096) -> CacheResult:
    client = boto3.client("bedrock-runtime", region_name=REGION)
    body.setdefault("anthropic_version", "bedrock-2023-05-31")
    body.setdefault("max_tokens", max_tokens)

    t0       = time.perf_counter()
    response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body),
                                   contentType="application/json", accept="application/json")
    latency  = (time.perf_counter() - t0) * 1000

    payload = json.loads(response["body"].read())
    usage   = payload.get("usage", {})
    return CacheResult(
        output_text        = _extract_text_invoke(payload),
        cache_write_tokens = usage.get("cache_creation_input_tokens", 0),
        cache_read_tokens  = usage.get("cache_read_input_tokens",     0),
        input_tokens       = usage.get("input_tokens",  0),
        output_tokens      = usage.get("output_tokens", 0),
        latency_ms         = latency,
    )
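
For completeness, minimal versions of the two extractor helpers the wrapper references (a sketch that assumes plain text content blocks in the response):

def _extract_text_converse(response: dict) -> str:
    # Converse: text lives in output.message.content as a list of blocks
    blocks = response["output"]["message"]["content"]
    return "".join(b.get("text", "") for b in blocks)

def _extract_text_invoke(payload: dict) -> str:
    # InvokeModel (Anthropic messages format): content blocks at the top level
    return "".join(b.get("text", "") for b in payload.get("content", []))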

Notice the different key names: cacheWriteInputTokens / cacheReadInputTokens (Converse) vs. cache_creation_input_tokens / cache_read_input_tokens (InvokeModel). A common gotcha when switching between the two APIs.


The tests

Test 01 — Converse API + cachePoint, different schemas ✅ HIT

Question: Does cachePoint allow the document to be cached while the extraction prompt — including the JSON schema — varies between calls?

# tests/t01_converse_cachepoint_different_schemas.py
import json
from schemas import SCHEMA_SUMMARY, SCHEMA_METADATA
import client, pdf as pdf_gen

_, pdf_bytes = pdf_gen.generate()

def call(schema: dict) -> client.CacheResult:
    messages = [{
        "role": "user",
        "content": [
            {
                "document": {
                    "format": "pdf",
                    "name":   "input_document",
                    "source": {"bytes": pdf_bytes},
                }
            },
            {"cachePoint": {"type": "default"}},          # <-- boundary
            {"text": f"Extract JSON matching this schema:\n{json.dumps(schema, indent=2)}"},
        ],
    }]
    return client.converse(messages=messages)

r1 = call(SCHEMA_SUMMARY)   # Call 1 — writes cache
r2 = call(SCHEMA_METADATA)  # Call 2 — different schema, but...

Result:

Call 1: Cache: MISS | write=9935 read=0    input=208 output=282 | latency=8032ms
Call 2: Cache: HIT  | write=0    read=9935 input=248 output=130 | latency=3154ms

Why it works: The cachePoint block acts as a strict fence. Bedrock hashes everything before it (the PDF bytes) and ignores everything after it (the text block carrying the schema). Since the PDF is identical between calls, the hash matches and the KV tensors are reused. The latency on call 2 drops from 8 s to 3 s — a ~60 % reduction.


Test 02 — InvokeModel + output_config, different schemas ❌ MISS

Question: Does output_config.format participate in the cache key?

# tests/t02_invokemodel_output_config_different_schemas.py
# (imports and pdf_b64, the base64-encoded PDF bytes, prepared as in Test 01)
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type":          "document",
                    "source":        {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},   # checkpoint on the document
                },
                {"type": "text", "text": "Extract structured data from this document."},
            ],
        }],
        "output_config": {
            "format": {
                "type":   "json_schema",
                "schema": schema,          # <-- changes between calls
            }
        },
    }
    return client.invoke_model(body)

r1 = call(SCHEMA_SUMMARY)   # Call 1 — writes cache
r2 = call(SCHEMA_METADATA)  # Call 2 — different schema

Result:

Call 1: Cache: MISS | write=10218 read=0     input=11 output=339 | latency=12193ms
Call 2: Cache: MISS | write=10258 read=0     input=11 output=68  | latency=7168ms

Both calls write to the cache independently. Despite the cache_control marker on the document, changing output_config produces a different prefix hash every time. (Note that the two write counts differ by 40 tokens; the same delta appears in Tests 06 and 07, consistent with the serialised schema itself being folded into the hashed prefix.)


Test 03 — InvokeModel + output_config, same schema ✅ HIT

Before concluding that output_config is the culprit, we need to confirm caching actually works with structured output when nothing changes.

# tests/t03_invokemodel_output_config_same_schema.py
body = build_body(pdf_b64, schema=SCHEMA_SUMMARY)  # identical each time

r1 = client.invoke_model(body)  # writes cache
r2 = client.invoke_model(body)  # same body — should hit

Result:

Call 1: Cache: MISS | write=10207 read=0     input=11 output=340 | latency=9217ms
Call 2: Cache: HIT  | write=0     read=10207 input=11 output=334 | latency=9126ms

Caching with output_config works fine. The issue is specifically changing the schema between calls.


Test 04 — InvokeModel, no output_config, different prompts ✅ HIT

What if we drop output_config entirely and embed the schema instructions in the prompt text instead?

# tests/t04_invokemodel_no_structured_output.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type":          "document",
                    "source":        {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"Return valid JSON matching this schema:\n{json.dumps(schema, indent=2)}",
                    # No output_config — schema is just part of the prompt
                },
            ],
        }],
        # No output_config at all
    }
    return client.invoke_model(body)

Result:

Call 1: Cache: MISS | write=9948 read=0    input=215 output=329 | latency=8414ms
Call 2: Cache: HIT  | write=0    read=9948 input=255 output=128 | latency=3360ms

Cache hit. The schema text changes, but it sits after the cache_control marker so it is excluded from the prefix hash. The model still returns valid JSON — it just isn’t server-enforced.


Test 05 — InvokeModel + dual cache_control, different schemas ✅ HIT

This is the recommended workaround for the output_config problem.

The idea: place a second cache_control marker on a static instruction block. The variable schema goes into a third block with no marker at all — meaning it sits entirely outside the cached prefix and can change freely.

# tests/t05_invokemodel_dual_cache_control.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type":          "document",
                    "source":        {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},   # checkpoint 1: document
                },
                {
                    "type":          "text",
                    "text":          "Extract structured data from this document.",
                    "cache_control": {"type": "ephemeral"},   # checkpoint 2: static instruction
                },
                {
                    "type": "text",
                    "text": f"Return a JSON object matching this schema:\n{json.dumps(schema, indent=2)}",
                    # No cache_control — lives outside the prefix hash, free to vary
                },
            ],
        }],
        # No output_config
    }
    return client.invoke_model(body)

Result:

Call 1: Cache: MISS | write=9971 read=0    input=209 output=365 | latency=10254ms
Call 2: Cache: HIT  | write=0    read=9971 input=249 output=131 | latency=3681ms

The hash covers blocks 1 and 2. Block 3 — containing the variable schema — is new input on each call. Cache hit, latency halved, 9,971 tokens read from cache at 0.1× the normal price.

Trade-off: You lose Bedrock’s server-side constrained decoding. Claude is highly reliable at following JSON schema instructions in practice, but add client-side validation as a safety net:

import json, jsonschema

result = client.invoke_model(body)
data   = json.loads(result.output_text)
jsonschema.validate(data, schema)   # raises ValidationError if the model goes off-script

Test 06 — Converse API + cachePoint + outputConfig.textFormat, different schemas ❌ MISS

What if we use the Converse API’s cachePoint together with Bedrock’s structured output via outputConfig.textFormat? Maybe the explicit boundary saves us.

# tests/t06_converse_cachepoint_outputconfig.py
def call(schema: dict, schema_name: str) -> client.CacheResult:
    messages = [{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "input_document", "source": {"bytes": pdf_bytes}}},
            {"cachePoint": {"type": "default"}},
            {"text": "Extract structured data from this document."},
        ],
    }]
    output_config = {
        "textFormat": {
            "type": "json_schema",
            "structure": {
                "jsonSchema": {
                    "schema": json.dumps(schema),
                    "name":   schema_name,
                }
            },
        }
    }
    return client.converse(messages=messages, output_config=output_config)

r1 = call(SCHEMA_SUMMARY,  "SummarySchema")
r2 = call(SCHEMA_METADATA, "MetadataSchema")

Result:

Call 1: Cache: MISS | write=10225 read=0     input=11 output=291 | latency=8929ms
Call 2: Cache: MISS | write=10265 read=0     input=11 output=64  | latency=3853ms

Miss. outputConfig is a request-level parameter passed outside the messages array — the explicit cachePoint boundary inside the messages cannot fence it out of the hash computation.


Test 07 — InvokeModel + dual cache_control + output_config, different schemas ❌ MISS

Can dual cache_control checkpoints (the Test 05 workaround) shield the prefix when output_config is also present?

# tests/t07_invokemodel_dual_cache_control_output_config.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type":          "document",
                    "source":        {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},   # checkpoint 1
                },
                {
                    "type":          "text",
                    "text":          "Extract structured data from this document.",
                    "cache_control": {"type": "ephemeral"},   # checkpoint 2
                },
            ],
        }],
        "output_config": {          # <-- still here, still changes
            "format": {"type": "json_schema", "schema": schema}
        },
    }
    return client.invoke_model(body)

Result:

Call 1: Cache: MISS | write=10221 read=0     input=3 output=368 | latency=9893ms
Call 2: Cache: MISS | write=10261 read=0     input=3 output=63  | latency=3790ms

Miss. output_config is unconditionally included in the prefix hash regardless of where cache_control markers sit inside the message body. The cache_control placement strategy simply cannot reach a request-level parameter.


Test 08 — InvokeModel + output_config, description-only schema change ❌ MISS

Some community posts claimed that changing only description fields in a schema — while keeping property names, types, and required identical — is cache-safe. We test this directly.

# schemas.py (excerpt)
SCHEMA_DESC_A = {
    "type": "object",
    "properties": {
        "title":      {"type": "string", "description": "The main title of the document"},
        "summary":    {"type": "string", "description": "A brief summary of the content"},
        "key_topics": {"type": "array",  "items": {"type": "string"},
                       "description": "List of key topics covered in the document"},
    },
    "required": ["title", "summary", "key_topics"],
    "additionalProperties": False,
}

SCHEMA_DESC_B = {
    "type": "object",
    "properties": {
        "title":      {"type": "string", "description": "Extract the document heading or report name"},
        "summary":    {"type": "string", "description": "Provide a comprehensive overview of findings"},
        "key_topics": {"type": "array",  "items": {"type": "string"},
                       "description": "Enumerate the primary subjects discussed"},
    },
    "required": ["title", "summary", "key_topics"],   # identical to SCHEMA_DESC_A
    "additionalProperties": False,
}
# tests/t08_invokemodel_description_only_change.py
r1 = call(SCHEMA_DESC_A)   # original descriptions
r2 = call(SCHEMA_DESC_B)   # same structure, different descriptions only

Result:

Call 1: Cache: MISS | write=10166 read=0     input=11 output=186 | latency=7996ms
Call 2: Cache: MISS | write=10166 read=0     input=11 output=390 | latency=10993ms

Miss. The cache treats the entire serialised JSON schema as an opaque byte string in the hash — it performs no semantic parsing to distinguish structural changes from metadata changes. Any character that differs in the schema produces a different hash.

As a side note: the output token count jumped from 186 to 390 between the two calls, confirming that descriptions genuinely influence model behaviour — but that has no bearing on the cache.


Results summary

| #  | API                                              | Structured output | Schema changes     | Result                 |
|----|--------------------------------------------------|-------------------|--------------------|------------------------|
| 01 | Converse + cachePoint                            | Prompt-based      | Yes                | ✅ HIT (9,935 tokens)  |
| 02 | InvokeModel + output_config                      | Bedrock-enforced  | Yes                | ❌ MISS                |
| 03 | InvokeModel + output_config                      | Bedrock-enforced  | No                 | ✅ HIT (10,207 tokens) |
| 04 | InvokeModel, no output_config                    | Prompt-based      | Yes                | ✅ HIT (9,948 tokens)  |
| 05 | InvokeModel + dual cache_control                 | Prompt-based      | Yes                | ✅ HIT (9,971 tokens)  |
| 06 | Converse + cachePoint + outputConfig             | Bedrock-enforced  | Yes                | ❌ MISS                |
| 07 | InvokeModel + dual cache_control + output_config | Bedrock-enforced  | Yes                | ❌ MISS                |
| 08 | InvokeModel + output_config (descriptions only)  | Bedrock-enforced  | Yes (descriptions) | ❌ MISS                |

Key findings

1. output_config / outputConfig is part of the cache key

Whether you use output_config in InvokeModel or outputConfig.textFormat in the Converse API, changing the JSON schema between calls invalidates the prompt cache completely. This holds even when an explicit cachePoint is present and even when you use the dual cache_control workaround. output_config is a request-level parameter and Bedrock includes it unconditionally in the prefix hash.

This behaviour is empirically observed but not documented by either Anthropic or AWS. The official “What invalidates the cache” tables mention tool_choice, tool definitions, and thinking parameters — but are silent on structured output config. Multiple independent community reports corroborate our findings.

2. Without Bedrock-enforced structured output, caching works reliably

All three patterns without output_config produce cache hits when the schema or prompt varies after the last cache boundary: cachePoint in Converse (Test 01), a single cache_control in InvokeModel (Test 04), and dual cache_control in InvokeModel (Test 05). Claude reliably returns valid JSON when instructed via prompt; the only thing you give up is the server-side guarantee.

3. Schema content is hashed as an opaque string

There is no semantic parsing of the schema during cache key computation. Changing a single description field is indistinguishable from changing a property type — both produce a cache miss. This rules out any strategy of making “description-only” changes to preserve a cache entry.
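One practical corollary, assuming the key really is computed over the serialised bytes as observed: if your pipeline rebuilds schema dicts at runtime, serialise them deterministically so incidental key-order or whitespace differences never cause a spurious miss. A minimal sketch:

import json

def canonical_schema(schema: dict) -> str:
    # Identical dicts always serialise to identical bytes, so two semantically
    # equal schemas built in different key orders hash the same way.
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))

This only removes accidental differences; any genuine content change still misses, as Test 08 shows.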

4. Two distinct caches — don’t mix them up

The grammar / FSM cache (24-hour TTL, account-scoped) that stores compiled decoding grammars is separate from the prompt / KV cache (5-minute TTL) we tested here. The grammar cache being schema-sensitive is documented. The prompt cache being schema-sensitive is what we discovered empirically.


Practical recommendations

If you need document caching across multiple schema extractions

Use the Converse API with cachePoint and prompt-based schema instructions (Test 01 pattern). It is the simplest approach and produces reliable cache hits:

import json, boto3

MODEL_ID = "eu.anthropic.claude-sonnet-4-6"
bedrock  = boto3.client("bedrock-runtime", region_name="eu-west-1")

messages = [{
    "role": "user",
    "content": [
        {
            "document": {
                "format": "pdf",
                "name":   "input_document",
                "source": {"bytes": pdf_bytes},
            }
        },
        {"cachePoint": {"type": "default"}},   # boundary: everything above is cached
        {"text": f"Extract JSON matching this schema:\n{json.dumps(schema, indent=2)}"},
    ],
}]
response = bedrock.converse(modelId=MODEL_ID, messages=messages)

Validate the output client-side if your application requires schema compliance:

import json, jsonschema
data = json.loads(response["output"]["message"]["content"][0]["text"])
jsonschema.validate(data, schema)
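
Wrapped into a single hypothetical helper (assumes the model returns bare JSON; add fence-stripping if your outputs come back wrapped in markdown):

import json, jsonschema

def extract(bedrock, model_id: str, pdf_bytes: bytes, schema: dict) -> dict:
    # Document before the cachePoint (cached), schema after it (free to vary)
    messages = [{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "input_document",
                          "source": {"bytes": pdf_bytes}}},
            {"cachePoint": {"type": "default"}},
            {"text": f"Extract JSON matching this schema:\n{json.dumps(schema, indent=2)}"},
        ],
    }]
    response = bedrock.converse(modelId=model_id, messages=messages)
    data = json.loads(response["output"]["message"]["content"][0]["text"])
    jsonschema.validate(data, schema)   # client-side enforcement
    return data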

If you must use InvokeModel with Anthropic-native features

Use the dual cache_control pattern (Test 05). Place the second checkpoint on a static instruction block and keep the variable schema in an uncached trailing block:

body = {
    "messages": [{
        "role": "user",
        "content": [
            {
                "type":          "document",
                "source":        {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                "cache_control": {"type": "ephemeral"},   # checkpoint 1: document
            },
            {
                "type":          "text",
                "text":          "Extract structured data from this document.",
                "cache_control": {"type": "ephemeral"},   # checkpoint 2: static text
            },
            {
                "type": "text",
                "text": f"Return a JSON object matching this schema:\n{json.dumps(schema, indent=2)}",
                # No cache_control — free to vary without invalidating the prefix
            },
        ],
    }],
    # No output_config
}

Token economics from our test run:

  • Call 1: 9,971 tokens written to cache, 209 uncached input tokens
  • Call 2: 9,971 tokens read from cache at ~0.1× price, 249 uncached input tokens

If you need server-enforced schema compliance AND caching

Today there is no way to get both simultaneously when the schema varies between calls. Your options are:

  1. Fix the schema across all calls to a given document. Cache will hit reliably (Test 03). Design your pipeline so the extraction schema is stable per document type.
  2. Accept prompt-based compliance with client-side validation (Tests 01 / 05). Claude is highly reliable at following JSON schema instructions; add jsonschema validation and retry logic as a safety net (see the sketch after this list).
  3. Pre-warm with a fixed schema, then extract. Run a single structured-output call to warm the grammar cache (24-hour TTL), then use prompt-based extraction with cache_control for the remaining calls.
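
For option 2, a minimal validate-and-retry loop (a sketch; `call` stands in for any of the Test 01 / 05 request builders, and retries are cheap because they hit the warm cache):

import json, jsonschema

def extract_validated(call, schema: dict, max_retries: int = 2) -> dict:
    # `call` is any function returning a CacheResult for the given schema.
    last_error = None
    for _ in range(max_retries + 1):
        result = call(schema)
        try:
            data = json.loads(result.output_text)
            jsonschema.validate(data, schema)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
            last_error = exc
    raise RuntimeError(f"No schema-compliant output after retries: {last_error}")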

Caveat: observed vs. documented behaviour

The central finding — that output_config participates in the prompt-cache key — is not explicitly documented by Anthropic or AWS. It is consistent across all 8 of our tests and corroborated by independent community reports, but it is technically undocumented behaviour that AWS could change at any time.

Architecturally the behaviour makes sense: constrained decoding changes the model’s generation mode at a fundamental level, and production KV-cache implementations commonly include the full inference-time configuration in the cache key to prevent cross-mode cache contamination — the same reason tool_choice and thinking parameters are documented invalidators.


References

  1. Anthropic — Prompt Caching Documentation
  2. AWS — Prompt caching for faster model inference (Bedrock User Guide)
  3. AWS — Get validated JSON results from models (Structured Output)
  4. AWS Blog — Effectively use prompt caching on Amazon Bedrock
  5. AWS Blog — Structured outputs on Amazon Bedrock: Schema-compliant AI responses
  6. AWS re:Post — “Does Bedrock include outputConfig in its prompt caching key?”
  7. AWS re:Post — “Is Bedrock Converse API prompt caching expected to hit when only the structured output schema is changed?”
  8. Amazon Bedrock Pricing
  9. Anthropic — Structured Outputs Documentation