Prompt Caching vs. Structured Output on Amazon Bedrock: What Actually Invalidates the Cache
TL;DR
- output_config / outputConfig is part of the prompt-cache key. Changing the JSON schema between calls — even by a single description field — invalidates the cache completely. This is empirically confirmed but not documented by AWS or Anthropic.
- Caching and Bedrock-enforced structured output are mutually exclusive when the schema varies. There is no way today to get both simultaneously across calls with different schemas.
- The workaround: keep the schema out of output_config and put it in the prompt instead. Place a cache_control / cachePoint boundary after the document, let the schema text sit after the boundary, and validate the output client-side with jsonschema. You get full cache hits and reliable JSON.
- The Converse API with cachePoint is the simplest pattern — the boundary fences the document hash cleanly, and changing the schema text after it has no effect on cache hits.
- If you must use InvokeModel with varying schemas, use dual cache_control checkpoints: one on the document, one on a static instruction block. Keep the variable schema in a third, uncached block.
- Description-only schema changes also invalidate the cache. The cache key is the full serialised JSON string — no semantic parsing occurs.
Prompt caching on Amazon Bedrock promises up to a 90 % discount on repeated input tokens and dramatically lower latency once a prefix is warm. The pitch is compelling for document-processing pipelines: upload a long PDF once, cache it, then fire multiple extraction requests against it.
Structured output is equally attractive — Bedrock’s native output_config / outputConfig
enforces a JSON schema server-side, so you never have to parse or retry malformed model
responses.
The natural question is: can you have both at the same time? Upload a document, cache it, and extract different schemas from it on separate calls — each with full server-side schema enforcement?
The short answer is no. But the full picture is more nuanced and worth walking through carefully, because there is a clean workaround that gets you most of the way there.
Background
How prompt caching works
When you mark a content block with cache_control: {"type": "ephemeral"} (InvokeModel)
or insert a cachePoint block (Converse API), Bedrock computes a hash of everything
from the start of the request up to and including that block, stores the
key-value attention tensors from prefill, and returns them on subsequent requests
whose prefix hash matches. The default TTL is 5 minutes.
The hash is cumulative and strict:
“Because the hash is cumulative, covering everything up to and including the breakpoint, changing any block at or before the breakpoint produces a different hash on the next request.” — Anthropic prompt caching docs
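The cumulative-prefix idea can be illustrated with a toy sketch. SHA-256 over concatenated blocks stands in here for whatever hashing Bedrock actually performs; the real implementation is opaque, so treat this as a mental model only:

```python
import hashlib

def prefix_key(blocks: list[str], breakpoint: int) -> str:
    """Hash everything up to and including the breakpoint block."""
    prefix = "".join(blocks[: breakpoint + 1])
    return hashlib.sha256(prefix.encode()).hexdigest()

# Blocks after the breakpoint do not affect the key...
assert prefix_key(["doc", "<cachePoint>", "schema A"], 1) == \
       prefix_key(["doc", "<cachePoint>", "schema B"], 1)

# ...but any change at or before it produces a new key, i.e. a cache miss.
assert prefix_key(["doc v2", "<cachePoint>"], 1) != \
       prefix_key(["doc", "<cachePoint>"], 1)
```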
Cache reads are billed at 0.1× the normal input-token price. Cache writes cost 1.25× (5-minute TTL) or 2× (1-hour TTL, available for select models). For a large document processed many times the savings compound quickly.
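To see when caching pays for itself, a quick back-of-envelope sketch (the base per-token rate is a parameter you plug in from current Bedrock pricing; 5-minute-TTL write pricing shown):

```python
def total_cost(cached_tokens: int, n_calls: int, base_per_token: float) -> float:
    """Cost of processing the same cached prefix n_calls times (5-min TTL)."""
    write = cached_tokens * base_per_token * 1.25            # first call writes the cache
    reads = cached_tokens * base_per_token * 0.10 * (n_calls - 1)  # later calls read it
    return write + reads

def uncached_cost(cached_tokens: int, n_calls: int, base_per_token: float) -> float:
    """Cost of resending the same prefix on every call, no caching."""
    return cached_tokens * base_per_token * n_calls

# Break-even: 1.25 + 0.1 * (n - 1) < n, so caching pays off from the 2nd call on;
# with a single call it actually costs 25% more than an uncached request.
```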
Two separate caches — do not confuse them
Bedrock’s structured output system actually maintains two distinct caches:
| Cache | What it stores | TTL | Scope |
|---|---|---|---|
| Prompt / KV cache | Attention key-value tensors from prefill | 5 min (default) | Per request prefix hash |
| Grammar / FSM cache | Compiled finite-state machine from the JSON schema | 24 hours | Per AWS account |
This investigation focuses entirely on the prompt cache. The grammar cache is separately documented and not the subject of the findings below.
Minimum token threshold
Claude models require at least 1,024 tokens per cache checkpoint (Sonnet 4.6, earlier Sonnets, Opus 4). Claude Opus 4.5 / 4.6 / 4.7 and Haiku 4.5 require 4,096 tokens. Our test PDFs consistently produce ~9,900–10,260 tokens — well above either threshold.
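A crude pre-flight guard can catch prefixes that are too short to cache. The ~4 characters-per-token ratio is a rough heuristic for English text, not Claude's real tokeniser; the authoritative check is inspecting the cache-write token count on the first response:

```python
MIN_CACHE_TOKENS = 1024  # 4096 for the models with the higher threshold

def estimate_tokens(text: str) -> int:
    # Very rough approximation: ~4 characters per token for English prose.
    return len(text) // 4

def cacheable(text: str, minimum: int = MIN_CACHE_TOKENS) -> bool:
    """Heuristic check that a prefix is long enough to be worth a checkpoint."""
    return estimate_tokens(text) >= minimum
```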
Experimental setup
Model and APIs
- Model: eu.anthropic.claude-sonnet-4-6 (eu-west-1)
- APIs tested: Converse API (cachePoint + outputConfig) and InvokeModel API (cache_control + output_config)
Why we generate a unique PDF per run
If you reuse the same document across test runs, a previous run’s warm cache (5-minute TTL) can silently produce false-positive cache hits. To guarantee a cold cache at the start of every run we generate a fresh, UUID-seeded PDF each time using fpdf2:
# pdf.py (simplified)
import uuid
from pathlib import Path
from fpdf import FPDF

def generate():
    run_id = uuid.uuid4().hex[:8]
    filename = Path(f"report_{run_id}.pdf")
    pdf = FPDF()
    pdf.set_margins(left=15, top=15, right=15)
    pdf.add_page()
    pdf.set_font("Helvetica", size=11)
    W = pdf.epw
    pdf.cell(W, 10, f"Distributed Systems Performance Report - Run {run_id}")
    # 20 sections, each seeded with fresh UUIDs
    for i in range(1, 21):
        section_id = uuid.uuid4().hex
        latency = uuid.uuid4().int % 500 + 100
        throughput = uuid.uuid4().int % 8000 + 500
        trace_id = uuid.uuid4()
        text = (
            f"Section {i:02d} [{section_id}]: Latency {latency}ms, "
            f"throughput {throughput} req/s, trace {trace_id}. ..."
        )
        pdf.multi_cell(W, 7, text)
    pdf.output(str(filename))
    return filename, filename.read_bytes()
The generator produces PDFs that tokenise to roughly 9,900–10,200 tokens — safely above the 1,024-token minimum.
Utility wrapper
All API calls go through a thin boto3 wrapper that normalises token counts and measures latency:
# client.py
import json, time, boto3
from dataclasses import dataclass

MODEL_ID = "eu.anthropic.claude-sonnet-4-6"
REGION = "eu-west-1"

def _extract_text_converse(response):
    # Concatenate all text blocks from the assistant message.
    return "".join(b.get("text", "") for b in response["output"]["message"]["content"])

def _extract_text_invoke(payload):
    return "".join(b.get("text", "") for b in payload.get("content", []))

@dataclass
class CacheResult:
    output_text: str
    cache_write_tokens: int
    cache_read_tokens: int
    input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def cache_hit(self) -> bool:
        return self.cache_read_tokens > 0

def converse(messages, system=None, output_config=None, max_tokens=4096) -> CacheResult:
    client = boto3.client("bedrock-runtime", region_name=REGION)
    kwargs = {"modelId": MODEL_ID, "messages": messages,
              "inferenceConfig": {"maxTokens": max_tokens}}
    if system: kwargs["system"] = system
    if output_config: kwargs["outputConfig"] = output_config
    t0 = time.perf_counter()
    response = client.converse(**kwargs)
    latency = (time.perf_counter() - t0) * 1000
    usage = response.get("usage", {})
    return CacheResult(
        output_text=_extract_text_converse(response),
        cache_write_tokens=usage.get("cacheWriteInputTokens", 0),
        cache_read_tokens=usage.get("cacheReadInputTokens", 0),
        input_tokens=usage.get("inputTokens", 0),
        output_tokens=usage.get("outputTokens", 0),
        latency_ms=latency,
    )

def invoke_model(body: dict, max_tokens=4096) -> CacheResult:
    client = boto3.client("bedrock-runtime", region_name=REGION)
    body.setdefault("anthropic_version", "bedrock-2023-05-31")
    body.setdefault("max_tokens", max_tokens)
    t0 = time.perf_counter()
    response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body),
                                   contentType="application/json", accept="application/json")
    latency = (time.perf_counter() - t0) * 1000
    payload = json.loads(response["body"].read())
    usage = payload.get("usage", {})
    return CacheResult(
        output_text=_extract_text_invoke(payload),
        cache_write_tokens=usage.get("cache_creation_input_tokens", 0),
        cache_read_tokens=usage.get("cache_read_input_tokens", 0),
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        latency_ms=latency,
    )
Notice the different key names: cacheWriteInputTokens / cacheReadInputTokens
(Converse) vs. cache_creation_input_tokens / cache_read_input_tokens
(InvokeModel). A common gotcha when switching between the two APIs.
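One way to defuse the gotcha is to normalise usage dicts from either API into a single shape before logging or asserting on cache hits. A small sketch (the output key names here are my own choice, not anything Bedrock defines):

```python
def normalize_usage(usage: dict) -> dict:
    """Map Converse (camelCase) and InvokeModel (snake_case) usage keys
    to one common shape."""
    return {
        "cache_write": usage.get("cacheWriteInputTokens",
                                 usage.get("cache_creation_input_tokens", 0)),
        "cache_read": usage.get("cacheReadInputTokens",
                                usage.get("cache_read_input_tokens", 0)),
        "input": usage.get("inputTokens", usage.get("input_tokens", 0)),
        "output": usage.get("outputTokens", usage.get("output_tokens", 0)),
    }
```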
The tests
Test 01 — Converse API + cachePoint, different schemas ✅ HIT
Question: Does cachePoint allow the document to be cached while the extraction
prompt — including the JSON schema — varies between calls?
# tests/t01_converse_cachepoint_different_schemas.py
import json

from schemas import SCHEMA_SUMMARY, SCHEMA_METADATA
import client, pdf as pdf_gen

_, pdf_bytes = pdf_gen.generate()

def call(schema: dict) -> client.CacheResult:
    messages = [{
        "role": "user",
        "content": [
            {
                "document": {
                    "format": "pdf",
                    "name": "input_document",
                    "source": {"bytes": pdf_bytes},
                }
            },
            {"cachePoint": {"type": "default"}},  # <-- boundary
            {"text": f"Extract JSON matching this schema:\n{json.dumps(schema, indent=2)}"},
        ],
    }]
    return client.converse(messages=messages)
r1 = call(SCHEMA_SUMMARY) # Call 1 — writes cache
r2 = call(SCHEMA_METADATA) # Call 2 — different schema, but...
Result:
Call 1: Cache: MISS | write=9935 read=0 input=208 output=282 | latency=8032ms
Call 2: Cache: HIT | write=0 read=9935 input=248 output=130 | latency=3154ms
Why it works: The cachePoint block acts as a strict fence. Bedrock hashes
everything before it (the PDF bytes) and ignores everything after it (the text
block carrying the schema). Since the PDF is identical between calls, the hash
matches and the KV tensors are reused. The latency on call 2 drops from 8 s to 3 s —
a ~60 % reduction.
Test 02 — InvokeModel + output_config, different schemas ❌ MISS
Question: Does output_config.format participate in the cache key?
# tests/t02_invokemodel_output_config_different_schemas.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},  # checkpoint on the document
                },
                {"type": "text", "text": "Extract structured data from this document."},
            ],
        }],
        "output_config": {
            "format": {
                "type": "json_schema",
                "schema": schema,  # <-- changes between calls
            }
        },
    }
    return client.invoke_model(body)
r1 = call(SCHEMA_SUMMARY) # Call 1 — writes cache
r2 = call(SCHEMA_METADATA) # Call 2 — different schema
Result:
Call 1: Cache: MISS | write=10218 read=0 input=11 output=339 | latency=12193ms
Call 2: Cache: MISS | write=10258 read=0 input=11 output=68 | latency=7168ms
Both calls write to the cache independently. Despite the cache_control marker on
the document, changing output_config produces a different prefix hash every time.
Test 03 — InvokeModel + output_config, same schema ✅ HIT
Before concluding that output_config is the culprit, we need to confirm caching
actually works with structured output when nothing changes.
# tests/t03_invokemodel_output_config_same_schema.py
body = build_body(pdf_b64, schema=SCHEMA_SUMMARY) # identical each time
r1 = client.invoke_model(body) # writes cache
r2 = client.invoke_model(body) # same body — should hit
Result:
Call 1: Cache: MISS | write=10207 read=0 input=11 output=340 | latency=9217ms
Call 2: Cache: HIT | write=0 read=10207 input=11 output=334 | latency=9126ms
Caching with output_config works fine. The issue is specifically changing the
schema between calls.
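One practical corollary: since the cache key covers the serialised request, a schema that is logically identical but serialised differently (dict key order, whitespace) may also produce an accidental miss when the schema dict is rebuilt per request. We did not test key-order sensitivity directly, so treat deterministic serialisation as cheap insurance rather than a confirmed requirement:

```python
import json

def canonical_schema(schema: dict) -> dict:
    # Round-trip through sorted-key JSON so the same logical schema always
    # serialises to the same bytes, however the source dict was built.
    return json.loads(json.dumps(schema, sort_keys=True))
```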
Test 04 — InvokeModel, no output_config, different prompts ✅ HIT
What if we drop output_config entirely and embed the schema instructions in the
prompt text instead?
# tests/t04_invokemodel_no_structured_output.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"Return valid JSON matching this schema:\n{json.dumps(schema, indent=2)}",
                    # No output_config — schema is just part of the prompt
                },
            ],
        }],
        # No output_config at all
    }
    return client.invoke_model(body)
Result:
Call 1: Cache: MISS | write=9948 read=0 input=215 output=329 | latency=8414ms
Call 2: Cache: HIT | write=0 read=9948 input=255 output=128 | latency=3360ms
Cache hit. The schema text changes, but it sits after the cache_control marker
so it is excluded from the prefix hash. The model still returns valid JSON — it just
isn’t server-enforced.
Test 05 — InvokeModel + dual cache_control, different schemas ✅ HIT
This is the recommended workaround for the output_config problem.
The idea: place a second cache_control marker on a static instruction block.
The variable schema goes into a third block with no marker at all — meaning it sits
entirely outside the cached prefix and can change freely.
# tests/t05_invokemodel_dual_cache_control.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},  # checkpoint 1: document
                },
                {
                    "type": "text",
                    "text": "Extract structured data from this document.",
                    "cache_control": {"type": "ephemeral"},  # checkpoint 2: static instruction
                },
                {
                    "type": "text",
                    "text": f"Return a JSON object matching this schema:\n{json.dumps(schema, indent=2)}",
                    # No cache_control — lives outside the prefix hash, free to vary
                },
            ],
        }],
        # No output_config
    }
    return client.invoke_model(body)
Result:
Call 1: Cache: MISS | write=9971 read=0 input=209 output=365 | latency=10254ms
Call 2: Cache: HIT | write=0 read=9971 input=249 output=131 | latency=3681ms
The hash covers blocks 1 and 2. Block 3 — containing the variable schema — is new input on each call. Cache hit, latency halved, 9,971 tokens read from cache at 0.1× the normal price.
Trade-off: You lose Bedrock’s server-side constrained decoding. Claude is highly reliable at following JSON schema instructions in practice, but add client-side validation as a safety net:
import json, jsonschema
result = client.invoke_model(body)
data = json.loads(result.output_text)
jsonschema.validate(data, schema) # raises ValidationError if the model goes off-script
Test 06 — Converse API + cachePoint + outputConfig.textFormat, different schemas ❌ MISS
What if we use the Converse API’s cachePoint together with Bedrock’s structured
output via outputConfig.textFormat? Maybe the explicit boundary saves us.
# tests/t06_converse_cachepoint_outputconfig.py
def call(schema: dict, schema_name: str) -> client.CacheResult:
    messages = [{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "input_document", "source": {"bytes": pdf_bytes}}},
            {"cachePoint": {"type": "default"}},
            {"text": "Extract structured data from this document."},
        ],
    }]
    output_config = {
        "textFormat": {
            "type": "json_schema",
            "structure": {
                "jsonSchema": {
                    "schema": json.dumps(schema),
                    "name": schema_name,
                }
            },
        }
    }
    return client.converse(messages=messages, output_config=output_config)
r1 = call(SCHEMA_SUMMARY, "SummarySchema")
r2 = call(SCHEMA_METADATA, "MetadataSchema")
Result:
Call 1: Cache: MISS | write=10225 read=0 input=11 output=291 | latency=8929ms
Call 2: Cache: MISS | write=10265 read=0 input=11 output=64 | latency=3853ms
Miss. outputConfig is a request-level parameter passed outside the messages
array — the explicit cachePoint boundary inside the messages cannot fence it out
of the hash computation.
Test 07 — InvokeModel + dual cache_control + output_config, different schemas ❌ MISS
Can dual cache_control checkpoints (the Test 05 workaround) shield the prefix
when output_config is also present?
# tests/t07_invokemodel_dual_cache_control_output_config.py
def call(schema: dict) -> client.CacheResult:
    body = {
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                    "cache_control": {"type": "ephemeral"},  # checkpoint 1
                },
                {
                    "type": "text",
                    "text": "Extract structured data from this document.",
                    "cache_control": {"type": "ephemeral"},  # checkpoint 2
                },
            ],
        }],
        "output_config": {  # <-- still here, still changes
            "format": {"type": "json_schema", "schema": schema}
        },
    }
    return client.invoke_model(body)
Result:
Call 1: Cache: MISS | write=10221 read=0 input=3 output=368 | latency=9893ms
Call 2: Cache: MISS | write=10261 read=0 input=3 output=63 | latency=3790ms
Miss. output_config is unconditionally included in the prefix hash regardless of
where cache_control markers sit inside the message body. The cache_control
placement strategy simply cannot reach a request-level parameter.
Test 08 — InvokeModel + output_config, description-only schema change ❌ MISS
Some community posts claimed that changing only description fields in a schema —
while keeping property names, types, and required identical — is cache-safe.
We test this directly.
# schemas.py (excerpt)
SCHEMA_DESC_A = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The main title of the document"},
        "summary": {"type": "string", "description": "A brief summary of the content"},
        "key_topics": {"type": "array", "items": {"type": "string"},
                       "description": "List of key topics covered in the document"},
    },
    "required": ["title", "summary", "key_topics"],
    "additionalProperties": False,
}

SCHEMA_DESC_B = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Extract the document heading or report name"},
        "summary": {"type": "string", "description": "Provide a comprehensive overview of findings"},
        "key_topics": {"type": "array", "items": {"type": "string"},
                       "description": "Enumerate the primary subjects discussed"},
    },
    "required": ["title", "summary", "key_topics"],  # identical to SCHEMA_DESC_A
    "additionalProperties": False,
}
# tests/t08_invokemodel_description_only_change.py
r1 = call(SCHEMA_DESC_A) # original descriptions
r2 = call(SCHEMA_DESC_B) # same structure, different descriptions only
Result:
Call 1: Cache: MISS | write=10166 read=0 input=11 output=186 | latency=7996ms
Call 2: Cache: MISS | write=10166 read=0 input=11 output=390 | latency=10993ms
Miss. The cache treats the entire serialised JSON schema as an opaque byte string in the hash — it performs no semantic parsing to distinguish structural changes from metadata changes. Any character that differs in the schema produces a different hash.
As a side note: the output token count jumped from 186 to 390 between the two calls, confirming that descriptions genuinely influence model behaviour — but that has no bearing on the cache.
Results summary
| # | API | Structured output | Schema changes | Result |
|---|---|---|---|---|
| 01 | Converse + cachePoint | Prompt-based | Yes | ✅ HIT (9,935 tokens) |
| 02 | InvokeModel + output_config | Bedrock-enforced | Yes | ❌ MISS |
| 03 | InvokeModel + output_config | Bedrock-enforced | No | ✅ HIT (10,207 tokens) |
| 04 | InvokeModel, no output_config | Prompt-based | Yes | ✅ HIT (9,948 tokens) |
| 05 | InvokeModel + dual cache_control | Prompt-based | Yes | ✅ HIT (9,971 tokens) |
| 06 | Converse + cachePoint + outputConfig | Bedrock-enforced | Yes | ❌ MISS |
| 07 | InvokeModel + dual cache_control + output_config | Bedrock-enforced | Yes | ❌ MISS |
| 08 | InvokeModel + output_config (descriptions only) | Bedrock-enforced | Yes (descriptions) | ❌ MISS |
Key findings
1. output_config / outputConfig is part of the cache key
Whether you use output_config in InvokeModel or outputConfig.textFormat in the
Converse API, changing the JSON schema between calls invalidates the prompt cache
completely. This holds even when an explicit cachePoint is present and even when
you use the dual cache_control workaround. output_config is a request-level
parameter and Bedrock includes it unconditionally in the prefix hash.
This behaviour is empirically observed but not documented by either Anthropic or
AWS. The official “What invalidates the cache” tables mention tool_choice, tool
definitions, and thinking parameters — but are silent on structured output config.
Multiple independent community reports corroborate our findings.
2. Without Bedrock-enforced structured output, caching works reliably
All three patterns without output_config — cachePoint in Converse (Test 01),
single cache_control in InvokeModel (Test 04), and dual cache_control in
InvokeModel (Test 05) — produce cache hits when the schema or prompt varies after
the last cache boundary. Claude reliably returns valid JSON when instructed via
prompt; the only thing you give up is the server-side guarantee.
3. Schema content is hashed as an opaque string
There is no semantic parsing of the schema during cache key computation. Changing a
single description field is indistinguishable from changing a property type — both
produce a cache miss. This rules out any strategy of making “description-only”
changes to preserve a cache entry.
4. Two distinct caches — don’t mix them up
The grammar / FSM cache (24-hour TTL, account-scoped) that stores compiled decoding grammars is separate from the prompt / KV cache (5-minute TTL) we tested here. The grammar cache being schema-sensitive is documented. The prompt cache being schema-sensitive is what we discovered empirically.
Practical recommendations
If you need document caching across multiple schema extractions
Use the Converse API with cachePoint and prompt-based schema instructions
(Test 01 pattern). It is the simplest approach and produces reliable cache hits:
messages = [{
    "role": "user",
    "content": [
        {
            "document": {
                "format": "pdf",
                "name": "input_document",
                "source": {"bytes": pdf_bytes},
            }
        },
        {"cachePoint": {"type": "default"}},  # boundary: everything above is cached
        {"text": f"Extract JSON matching this schema:\n{json.dumps(schema, indent=2)}"},
    ],
}]

response = bedrock.converse(modelId=MODEL_ID, messages=messages)
Validate the output client-side if your application requires schema compliance:
import json, jsonschema
data = json.loads(response["output"]["message"]["content"][0]["text"])
jsonschema.validate(data, schema)
If you must use InvokeModel with Anthropic-native features
Use the dual cache_control pattern (Test 05). Place the second checkpoint on
a static instruction block and keep the variable schema in an uncached trailing block:
body = {
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
                "cache_control": {"type": "ephemeral"},  # checkpoint 1: document
            },
            {
                "type": "text",
                "text": "Extract structured data from this document.",
                "cache_control": {"type": "ephemeral"},  # checkpoint 2: static text
            },
            {
                "type": "text",
                "text": f"Return a JSON object matching this schema:\n{json.dumps(schema, indent=2)}",
                # No cache_control — free to vary without invalidating the prefix
            },
        ],
    }],
    # No output_config
}
Token economics from our test run:
- Call 1: 9,971 tokens written to cache, 209 uncached input tokens
- Call 2: 9,971 tokens read from cache at ~0.1× price, 249 uncached input tokens
If you need server-enforced schema compliance AND caching
Today there is no way to get both simultaneously when the schema varies between calls. Your options are:
- Fix the schema across all calls to a given document. Cache will hit reliably (Test 03). Design your pipeline so the extraction schema is stable per document type.
- Accept prompt-based compliance with client-side validation (Tests 01 / 05). Claude is highly reliable at following JSON schema instructions; add jsonschema validation and retry logic as a safety net.
- Pre-warm with a fixed schema, then extract. Run a single structured-output call to warm the grammar cache (24-hour TTL), then use prompt-based extraction with cache_control for the remaining calls.
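The validate-and-retry safety net from the second option can be sketched as follows. `call_fn` stands in for whichever request builder you use (the Test 05 body, say), and the validator is injected so you can pass `jsonschema.validate` or any callable that raises on non-compliant data:

```python
import json

def extract(call_fn, schema: dict, validate, max_attempts: int = 3) -> dict:
    """Call the model, parse JSON, validate; retry on failure.
    `validate(data, schema)` must raise on non-compliant output."""
    last_err = None
    for _ in range(max_attempts):
        text = call_fn(schema)
        try:
            data = json.loads(text)
            validate(data, schema)
            return data          # schema-compliant, done
        except Exception as e:   # malformed JSON or failed validation
            last_err = e         # retry; the cached document prefix still hits
    raise RuntimeError(f"no compliant output in {max_attempts} attempts") from last_err
```

Retries are cheap here precisely because the document prefix stays warm: each retry re-reads the cache at 0.1x instead of re-prefilling the full document.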
Caveat: observed vs. documented behaviour
The central finding — that output_config participates in the prompt-cache key — is
not explicitly documented by Anthropic or AWS. It is consistent across all 8 of
our tests and corroborated by independent community reports, but it is technically
undocumented behaviour that AWS could change at any time.
Architecturally the behaviour makes sense: constrained decoding changes the model’s
generation mode at a fundamental level, and production KV-cache implementations
commonly include the full inference-time configuration in the cache key to prevent
cross-mode cache contamination — the same reason tool_choice and thinking
parameters are documented invalidators.
References
- Anthropic — Prompt Caching Documentation
- AWS — Prompt caching for faster model inference (Bedrock User Guide)
- AWS — Get validated JSON results from models (Structured Output)
- AWS Blog — Effectively use prompt caching on Amazon Bedrock
- AWS Blog — Structured outputs on Amazon Bedrock: Schema-compliant AI responses
- AWS re:Post — “Does Bedrock include outputConfig in its prompt caching key?”
- AWS re:Post — “Is Bedrock Converse API prompt caching expected to hit when only the structured output schema is changed?”
- Amazon Bedrock Pricing
- Anthropic — Structured Outputs Documentation