AWS Bedrock in Production: Lessons Learned
Running LLMs in Production
After running Bedrock for 6 months across multiple applications, here's what I've learned about making it reliable and cost-effective.
Model Selection by Use Case
Don't default to the largest model. Match model to task:
| Use Case | Model | Why |
|----------|-------|-----|
| Embeddings | Cohere Embed v3 | Best price/performance for search |
| Classification | Claude Haiku | Fast, cheap, accurate for structured output |
| Summarization | Claude Sonnet | Good balance for medium complexity |
| Complex reasoning | Claude Opus | When accuracy matters more than cost |
| Code generation | Claude Sonnet | Opus overkill for most code tasks |
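In practice this becomes a small routing table so call sites never hard-code model IDs. A sketch (the Bedrock identifiers below were current when I wrote this; verify against `ListFoundationModels` for your region):

```ts
// Task-to-model routing table. IDs are illustrative Bedrock identifiers;
// confirm them before shipping.
const MODEL_BY_TASK = {
  embeddings: "cohere.embed-english-v3",
  classification: "anthropic.claude-3-haiku-20240307-v1:0",
  summarization: "anthropic.claude-3-sonnet-20240229-v1:0",
  reasoning: "anthropic.claude-3-opus-20240229-v1:0",
  codegen: "anthropic.claude-3-sonnet-20240229-v1:0",
} as const;

function modelFor(task: keyof typeof MODEL_BY_TASK): string {
  return MODEL_BY_TASK[task];
}
```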
Cost Control Strategies
1. Caching is your friend
```ts
// Key on a content hash so identical texts share one cache entry.
async function getEmbedding(text: string): Promise<number[]> {
  const cacheKey = `embed:${hashContent(text)}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  const embedding = await bedrock.embed(text);
  await redis.set(cacheKey, JSON.stringify(embedding), "EX", 86400); // 24h TTL
  return embedding;
}
```
2. Batch when possible
Cohere Embed accepts up to 96 texts per request. Batch your embeddings:
```ts
// Fan out one request per 96-text chunk, then flatten in order.
async function embedAll(texts: string[]): Promise<number[][]> {
  const chunks = chunkArray(texts, 96);
  const embeddings = await Promise.all(
    chunks.map((chunk) => bedrock.embedBatch(chunk))
  );
  return embeddings.flat();
}
```
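The `chunkArray` helper is assumed above; a minimal version:

```ts
// Split an array into consecutive slices of at most `size` elements.
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```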
3. Use Provisioned Throughput for predictable workloads
If you're processing >100K tokens/day consistently, Provisioned Throughput (sold in model units) typically saves 30-50% over on-demand pricing.
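Provisioning is a one-time control-plane call, not a runtime concern. A sketch with the AWS SDK's Bedrock control-plane client; the name and unit count are placeholders, and sizing should come from your measured throughput, not this example:

```ts
import {
  BedrockClient,
  CreateProvisionedModelThroughputCommand,
} from "@aws-sdk/client-bedrock";

const control = new BedrockClient({ region: "us-east-1" });

// Purchase one model unit against a specific base model.
await control.send(
  new CreateProvisionedModelThroughputCommand({
    provisionedModelName: "prod-sonnet", // hypothetical name
    modelId: "anthropic.claude-3-sonnet-20240229-v1:0",
    modelUnits: 1,
  })
);
```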
Error Handling
Bedrock throws `ThrottlingException` under load. Implement exponential backoff with jitter:
```ts
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function bedrockWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Only retry throttling; rethrow everything else immediately.
      const throttled =
        error instanceof Error && error.name === "ThrottlingException";
      if (throttled && i < maxRetries - 1) {
        // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
        await sleep(2 ** i * 1000 + Math.random() * 1000);
        continue;
      }
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}
```
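Call sites wrap the underlying call without changing its shape:

```ts
// Works for any Bedrock call that returns a promise.
const embedding = await bedrockWithRetry(() => bedrock.embed(text));
```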
Monitoring
CloudWatch metrics to track:
- `Invocations` by model: catches unexpected usage spikes
- `InvocationLatency`: p50/p95/p99 for SLA monitoring
- `InvocationThrottles`: if more than ~1% of calls throttle, you need Provisioned Throughput
- `InputTokenCount` / `OutputTokenCount`: for cost attribution
Set alarms on:
- Daily cost exceeding budget
- Throttling rate >5%
- p99 latency exceeding SLA
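The throttling alarm as a sketch with the CloudWatch SDK. Note the simplification: this alarms on a raw throttle count, while a true >5% rate needs metric math dividing `InvocationThrottles` by `Invocations`; thresholds here are placeholders:

```ts
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

// Alarm on Bedrock throttles for one model over 5-minute windows.
await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: "bedrock-throttling", // hypothetical name
    Namespace: "AWS/Bedrock",
    MetricName: "InvocationThrottles",
    Dimensions: [
      { Name: "ModelId", Value: "anthropic.claude-3-sonnet-20240229-v1:0" },
    ],
    Statistic: "Sum",
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: 50, // placeholder; derive from baseline traffic
    ComparisonOperator: "GreaterThanThreshold",
  })
);
```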
Structured Output
Use Pydantic + JSON mode for reliable parsing:
```python
from pydantic import BaseModel

class Sentiment(BaseModel):
    score: float
    label: str
    confidence: float

# Spelling out the expected fields in the prompt makes the model far
# more likely to return JSON that validates against the schema.
response = bedrock.invoke(
    model="anthropic.claude-3-haiku",
    messages=[{"role": "user", "content": f"Analyze: {text}"}],
    response_format={"type": "json_object"},
)
result = Sentiment.model_validate_json(response.content)
```
What's Next
Looking forward to:
- Bedrock Agents for complex multi-step workflows
- Knowledge Bases with custom chunking options
- Native function calling improvements
The platform is maturing fast. What seemed bleeding-edge 6 months ago is now table stakes.