AWS Bedrock in Production: Lessons Learned
Running LLMs in Production
After running Bedrock for 6 months across multiple applications, here's what I've learned about making it reliable and cost-effective.
Model Selection by Use Case
Don't default to the largest model. Match model to task:
| Use Case | Model | Why |
|----------|-------|-----|
| Embeddings | Cohere Embed v3 | Best price/performance for search |
| Classification | Claude Haiku | Fast, cheap, accurate for structured output |
| Summarization | Claude Sonnet | Good balance for medium complexity |
| Complex reasoning | Claude Opus | When accuracy matters more than cost |
| Code generation | Claude Sonnet | Opus overkill for most code tasks |
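In practice this becomes a small routing table so call sites never hard-code model IDs. A sketch (the Bedrock identifiers below were current when I wrote this; verify against `ListFoundationModels` for your region):

```ts
// Task-to-model routing table. IDs are illustrative Bedrock identifiers;
// confirm them before shipping.
const MODEL_BY_TASK = {
  embeddings: "cohere.embed-english-v3",
  classification: "anthropic.claude-3-haiku-20240307-v1:0",
  summarization: "anthropic.claude-3-sonnet-20240229-v1:0",
  reasoning: "anthropic.claude-3-opus-20240229-v1:0",
  codegen: "anthropic.claude-3-sonnet-20240229-v1:0",
} as const;

function modelFor(task: keyof typeof MODEL_BY_TASK): string {
  return MODEL_BY_TASK[task];
}
```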
Cost Control Strategies
1. Caching is your friend
```ts
// Key on a content hash so identical texts share one cache entry.
async function getEmbedding(text: string): Promise<number[]> {
  const cacheKey = `embed:${hashContent(text)}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  const embedding = await bedrock.embed(text);
  await redis.set(cacheKey, JSON.stringify(embedding), "EX", 86400); // 24h TTL
  return embedding;
}
```
2. Batch when possible
Cohere Embed accepts up to 96 texts per request. Batch your embeddings:
```ts
// Fan out one request per 96-text chunk, then flatten in order.
async function embedAll(texts: string[]): Promise<number[][]> {
  const chunks = chunkArray(texts, 96);
  const embeddings = await Promise.all(
    chunks.map((chunk) => bedrock.embedBatch(chunk))
  );
  return embeddings.flat();
}
```
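The `chunkArray` helper is assumed above; a minimal version:

```ts
// Split an array into consecutive slices of at most `size` elements.
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```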
3. Use Provisioned Throughput for predictable workloads
If you're processing >100K tokens/day consistently, Provisioned Throughput (sold in model units) typically saves 30-50% over on-demand pricing.
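Provisioning is a one-time control-plane call, not a runtime concern. A sketch with the AWS SDK's Bedrock control-plane client; the name and unit count are placeholders, and sizing should come from your measured throughput, not this example:

```ts
import {
  BedrockClient,
  CreateProvisionedModelThroughputCommand,
} from "@aws-sdk/client-bedrock";

const control = new BedrockClient({ region: "us-east-1" });

// Purchase one model unit against a specific base model.
await control.send(
  new CreateProvisionedModelThroughputCommand({
    provisionedModelName: "prod-sonnet", // hypothetical name
    modelId: "anthropic.claude-3-sonnet-20240229-v1:0",
    modelUnits: 1,
  })
);
```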
Error Handling
Bedrock throws `ThrottlingException` under load. Implement exponential backoff with jitter:
```ts
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function bedrockWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Only retry throttling; rethrow everything else immediately.
      const throttled =
        error instanceof Error && error.name === "ThrottlingException";
      if (throttled && i < maxRetries - 1) {
        // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
        await sleep(2 ** i * 1000 + Math.random() * 1000);
        continue;
      }
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}
```
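Call sites wrap the underlying call without changing its shape:

```ts
// Works for any Bedrock call that returns a promise.
const embedding = await bedrockWithRetry(() => bedrock.embed(text));
```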
Monitoring
CloudWatch metrics to track:
- `Invocations` by model: catches unexpected usage spikes
- `InvocationLatency`: p50/p95/p99 for SLA monitoring
- `InvocationThrottles`: if more than ~1% of calls throttle, you need Provisioned Throughput
- `InputTokenCount` / `OutputTokenCount`: for cost attribution
Set alarms on:
- Daily cost exceeding budget
- Throttling rate >5%
- p99 latency exceeding SLA
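The throttling alarm as a sketch with the CloudWatch SDK. Note the simplification: this alarms on a raw throttle count, while a true >5% rate needs metric math dividing `InvocationThrottles` by `Invocations`; thresholds here are placeholders:

```ts
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

// Alarm on Bedrock throttles for one model over 5-minute windows.
await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: "bedrock-throttling", // hypothetical name
    Namespace: "AWS/Bedrock",
    MetricName: "InvocationThrottles",
    Dimensions: [
      { Name: "ModelId", Value: "anthropic.claude-3-sonnet-20240229-v1:0" },
    ],
    Statistic: "Sum",
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: 50, // placeholder; derive from baseline traffic
    ComparisonOperator: "GreaterThanThreshold",
  })
);
```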
Structured Output
Use Pydantic + JSON mode for reliable parsing:
```python
from pydantic import BaseModel

class Sentiment(BaseModel):
    score: float
    label: str
    confidence: float

# Spelling out the expected fields in the prompt makes the model far
# more likely to return JSON that validates against the schema.
response = bedrock.invoke(
    model="anthropic.claude-3-haiku",
    messages=[{"role": "user", "content": f"Analyze: {text}"}],
    response_format={"type": "json_object"},
)
result = Sentiment.model_validate_json(response.content)
```
What's Next
Looking forward to:
- Bedrock Agents for complex multi-step workflows
- Knowledge Bases with custom chunking options
- Native function calling improvements
The platform is maturing fast. What seemed bleeding-edge 6 months ago is now table stakes.