Amazon Bedrock
Quicklinks (Bookmark):
Coming Soon
Amazon Bedrock is AWS's fully managed platform for building generative-AI apps with a catalog of foundation models (FMs) plus managed RAG (Knowledge Bases), Agents, Flows, Guardrails, Evaluations, and a unified Converse API. It's powerful, but it's easy to overspend on tokens in/out, Provisioned Throughput (PT) hours, RAG ingestion/retrieval, agent/tool loops, and Flows node transitions.
→ What you're using
→ What you're paying
→ What you should be doing
→ AWS-native tools to make it happen.
What is Bedrock?
Amazon Bedrock gives you one API/console to access models from multiple providers (e.g., Amazon Nova/Titan, Anthropic, Meta, Mistral, AI21, Cohere, Stability) and tools to customize, evaluate, secure, and operate gen-AI workloads. Pricing for inference is token-based (for text) or per-asset (for images/video), with Batch and PT options for specific workload shapes.
Features
Model choice via one API (text, vision, image, embeddings; some video).
Knowledge Bases for managed RAG (Retrieve / RetrieveAndGenerate).
Agents / AgentCore for tool use and multi-step workflows.
Flows for visually orchestrating prompts/agents/KBs/Guardrails/AWS services, priced per node transition.
Guardrails (safety/PII/topic grounding) with reduced pricing for content filters.
Prompt Caching & Intelligent Prompt Routing to cut token costs and latency.
Components: pick the right ones
Foundation Models (FMs)
Text/chat, vision, image, embeddings (some video)
Core LLM tasks
Tokens in/out (text) or per-asset (image/video), per model/Region.
Knowledge Bases (RAG)
Managed retrieval + Retrieve / RetrieveAndGenerate
Ground responses on private data
Pay for ingestion/retrieval, vector store, and any inference; the KB feature itself isn't separately metered.
Agents / AgentCore
Tool calling & multi-step plans
Automations & API workflows
Each step can trigger model calls + tools; watch recursion.
Flows
Visual builder for agentic apps
Orchestrating complex workflows
$0.035 / 1K node transitions + underlying services (effective 2025-02-01).
Guardrails
Safety/PII/topic/grounding filters
Enterprise controls
$0.15 / 1K text units (content filters) after 2024 price cut; blocked inputs avoid downstream model charges.
Evaluations
Auto evals for prompts/models/RAG
Model selection & tuning
Pay for generator/evaluator tokens and any KB retrieval.
Converse / ConverseStream
Unified chat API & streaming
Portability across providers
No separate fee; standard model pricing applies.
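As a concrete sketch of the unified API, the helper below issues a single-turn call through a boto3 `bedrock-runtime` client (constructed by the caller); the temperature and token cap are illustrative defaults, not recommendations, and any Converse-capable model ID works unchanged:

```python
def ask(bedrock_runtime, model_id, user_text, max_tokens=256):
    """Single-turn chat via the unified Converse API.

    bedrock_runtime: a boto3 'bedrock-runtime' client.
    The same request/response shape works across providers, which is
    what makes swapping models a one-line change.
    """
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        # Always set maxTokens: it is the single biggest output-cost lever.
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Because the signature is provider-neutral, routing the same `ask()` to a cheaper or pricier model is just a different `model_id` string.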
Inference & deployment options
On-Demand
Prototyping; variable usage
Billed by tokens (text) or per image/video/embedding.
Batch
Large offline jobs
Selected models run at ~50% lower cost than on-demand; outputs land in S3.
Provisioned Throughput (PT)
Steady, high TPS with SLOs
Hourly per model unit with 1- or 6-month commitments; discounted vs on-demand; under-utilization wastes spend.
Custom Model Import / Fine-tuning
Domain-specific apps
Import free; pay for inference (on-demand or PT) and any customization/storage.
Bedrock optimization strategy (FinOps + reliability)
Pick the right model & mode
Start with smaller/cheaper models for classify/extract/routing; escalate only where impact is proven.
For steady traffic, compute PT breakeven from tokens/min & concurrency; for bulk jobs, trial Batch (often ~50% less).
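The PT breakeven arithmetic can be sketched as a quick back-of-envelope check; all prices and per-unit capacities below are placeholders to substitute with the published numbers for your model and Region, and the single blended per-1K rate is a simplification over separate input/output prices:

```python
def pt_breakeven(tokens_per_min, od_price_per_1k, pt_hourly_per_unit,
                 unit_tokens_per_min):
    """Compare on-demand vs Provisioned Throughput hourly cost.

    tokens_per_min: sustained throughput (use p95, not average).
    od_price_per_1k: blended on-demand price per 1K tokens (placeholder).
    pt_hourly_per_unit: hourly price of one PT model unit (placeholder).
    unit_tokens_per_min: tokens/min one PT unit can serve (placeholder).
    Returns (units needed, on-demand $/hr, PT $/hr).
    """
    on_demand_hourly = tokens_per_min * 60 / 1000 * od_price_per_1k
    units = -(-tokens_per_min // unit_tokens_per_min)  # ceiling division
    pt_hourly = units * pt_hourly_per_unit
    return units, on_demand_hourly, pt_hourly
```

If PT $/hr exceeds on-demand $/hr at your p95 throughput, PT is idle waste; re-run the check whenever traffic shape changes.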
Control tokens
Shorten system/prompt/context; enforce maxOutputTokens; favor structured outputs.
Use Knowledge Bases to retrieve only what's needed; tune chunking/topK/filters to minimize context size.
Turn on Prompt Caching for repeat prefixes (up to 90% cost reduction on cached tokens).
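A minimal sketch of those token controls, using character counts as a crude stand-in for tokens (swap in a real tokenizer for precise budgets); both budget constants are illustrative:

```python
MAX_OUTPUT_TOKENS = 300     # hard cap on generation length (illustrative)
MAX_CONTEXT_CHARS = 8_000   # crude context budget, roughly ~2K tokens

def build_request(system_prompt, context_chunks, question):
    """Assemble a Converse request body under a fixed context budget.

    Retrieved chunks are admitted in order until the budget is spent,
    so the most relevant chunks should come first.
    """
    context, used = [], 0
    for chunk in context_chunks:
        if used + len(chunk) > MAX_CONTEXT_CHARS:
            break  # drop the remainder rather than blow the budget
        context.append(chunk)
        used += len(chunk)
    prompt = "\n\n".join(context + [question])
    return {
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_OUTPUT_TOKENS},
    }
```

A stable, shared `system_prompt` prefix is also what makes Prompt Caching effective across requests.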
Guardrails first
Run Guardrails before expensive model calls; blocked prompts incur guardrail cost only and save inference spend.
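The guardrails-first pattern might look like this with the standalone ApplyGuardrail action; the guardrail and model identifiers are hypothetical, and the boto3 `bedrock-runtime` client is passed in rather than constructed here:

```python
def guarded_generate(runtime, guardrail_id, guardrail_version,
                     model_id, user_text):
    """Screen the input with Guardrails before paying for model tokens.

    runtime: a boto3 'bedrock-runtime' client.
    A blocked prompt returns early, so only the (cheaper) guardrail
    call is billed and the model call is skipped entirely.
    """
    check = runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",
        content=[{"text": {"text": user_text}}],
    )
    if check["action"] == "GUARDRAIL_INTERVENED":
        return None  # log the blocked category upstream if needed
    response = runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```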
Agents/Flows with budgets
Keep toolsets minimal; cap steps/time; avoid recursive loops; prefer deterministic plans; monitor node transitions in Flows.
Evaluate continuously
Use Evaluations to A/B prompts/models/RAG; keep a champion/challenger cadence.
Pricing model: where the money goes
Model inference: input/output tokens (text) or per-asset (image/video); embeddings by tokens; varies by provider/Region.
Batch: ~50% lower vs on-demand for supported models (Anthropic, Meta, Mistral, Amazon).
Provisioned Throughput: hourly per model unit, 1 or 6-month terms; discounted vs on-demand.
Knowledge Bases: pay for ingestion/retrieval and vector store; KB feature itself not billed separately.
Guardrails: content filters priced at $0.15 / 1K text units after 2024 price cut.
Flows: $0.035 / 1K node transitions + any underlying services (e.g., Guardrails, model calls).
Prompt Caching: cached-token reads up to 90% discount; model-specific write rules.
Model prices, context limits, and availability vary by Region/provider, so always validate on the official pricing page for your models.
Automated management
Budgets/alerts on tokens, PT hours, KB ingestion/retrieval, and Flows transitions.
Quotas per team (daily tokens, max output length).
EventBridge jobs to rotate prompt templates, auto-pause PT at end of trials, and archive logs.
Tag app/team/env in request metadata and feed Cost Explorer/CUR dashboards.
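One way to attach those tags is the Converse API's `requestMetadata` field, which shows up in model invocation logs (when logging is enabled in Bedrock settings) and can be joined against CUR data downstream; the tag keys here are just the conventions suggested above:

```python
def tagged_converse(runtime, model_id, messages, app, team, env):
    """Attach app/team/env metadata to every call for cost attribution.

    runtime: a boto3 'bedrock-runtime' client. requestMetadata is
    pass-through key/value strings; it does not affect the model output.
    """
    return runtime.converse(
        modelId=model_id,
        messages=messages,
        requestMetadata={"app": app, "team": team, "env": env},
    )
```

Wrapping every call site behind a helper like this is what makes the per-team dashboards and budgets possible later.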
Security & compliance
Access Bedrock and AgentCore privately with VPC interface endpoints (PrivateLink).
Apply Guardrails for safety/PII/topic and grounding; monitor interventions.
Keep data in-Region and encrypted (KMS). Review retention settings per model/provider.
Monitoring & tools
CloudWatch/CloudTrail around Bedrock APIs, KB ingestion, Agent actions, and Flows.
Cost Explorer / CUR + Budgets: segment by model / provider / Region / app tag.
Evaluations: track faithfulness, correctness, completeness during A/Bs.
Ops dashboards: latency, time-to-first-token, output length, error rates, agent step counts, Flows transitions.
Bedrock FinOps Checklist
AWS Bedrock Cost Optimization Challenges (Q&A)
Cloud GenAI spend tends to balloon from tokens, under-utilized PT, RAG retrieval/ingestion, agent loops, and Flows node transitions. Here are the toughest issues teams hit, and how to fix them fast.
Q1: Our token bill exploded right after launch. Where do we start?
Solution
Instrument prompt length, context size, and output tokens per request.
Enforce maxOutputTokens, trim system prompts, remove boilerplate, and switch to structured output (JSON schemas) to constrain verbosity.
Start on a smaller/cheaper model; route only hard cases to a larger model (router or classifier in front).
Turn on Prompt Caching for repeated prefixes/system prompts to cut repeat token costs.
Q2: We bought Provisioned Throughput (PT), but itβs not cheaper than on-demand.
Solution
Calculate PT breakeven from p95 concurrency × tokens/min; right-size PT units to that line.
Use on-demand spillover for peaks; avoid paying for idle PT hours.
Set utilization SLOs and alerts; pause or reduce PT when traffic drops.
Q3: RAG (Knowledge Bases) costs are higher than expected.
Solution
Tune chunk size/overlap, topK, and filters; retrieve fewer, more relevant chunks.
Pre-compute embeddings in batch; re-embed only changed docs; dedupe and compress sources.
Cache frequent answers; enforce context budgets (max tokens from RAG per query).
Prefer Retrieve with filtered context over dumping full documents into the prompt.
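A sketch of that retrieval pattern via the `Retrieve` API (a boto3 `bedrock-agent-runtime` client, passed in); the knowledge base ID is a placeholder, and `top_k` is the lever that directly caps context size:

```python
def retrieve_context(agent_runtime, kb_id, query, top_k=4):
    """Fetch a small set of relevant chunks instead of whole documents.

    agent_runtime: a boto3 'bedrock-agent-runtime' client.
    Lowering numberOfResults shrinks the context (and the token bill)
    of the generation step that follows.
    """
    resp = agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [r["content"]["text"] for r in resp["retrievalResults"]]
```

The returned chunks can then feed a context-budgeted prompt builder, so RAG never dumps more tokens into the model than the query actually needs.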
Q4: Agents keep looping and burning tokens/tool calls.
Solution
Cap max steps/time, add guard states, and prefer deterministic plans over open-ended reasoning.
Reduce the toolset to essentials; cache intermediate results across steps.
Terminate on ambiguous goals; add user confirmation checkpoints for risky branches.
Q5: Guardrails add cost without obvious benefit.
Solution
Run Guardrails before expensive model calls; blocked requests incur guardrail cost only and avoid model tokens.
Log blocked categories; fix noisy sources (spam inputs, malformed payloads) upstream.
Keep guardrail policies modular (turn on only what you need).
Q6: Model choice feels like guesswork and we overpay for "premium" models.
Solution
Use Evaluations with a representative dataset; A/B small vs. large models and pick the cheapest that meets SLOs.
Separate use cases: classification/extraction/routing on small models; escalate only when needed.
Track cost per 1K tokens and quality metrics together; re-evaluate after prompt/RAG changes.
Q7: Latency SLOs push us to larger (pricier) models.
Solution
Use streaming (ConverseStream) to improve time-to-first-token without upgrading models.
Shorten prompts; use few-shot with compact exemplars; remove irrelevant context.
Move heavy retrieval/enrichment to Batch or background jobs where possible.
Q8: Flows orchestration is convenient, but node transitions are adding up.
Solution
Consolidate nodes; combine lightweight transforms; memoize outputs within a flow (reuse results).
Avoid unnecessary fan-out; prune dead branches; set fail-fast conditions.
Monitor node transitions per request; budget transitions like tokens.
Q9: Embeddings/vector store costs are creeping up.
Solution
Batch embedding jobs; re-embed only diffs (changed pages/sections).
Use lower-dimensional embeddings that meet recall needs; prune duplicate or low-value content.
Apply retention policies on the vector store; tier old sources to cheaper storage.
Q10: Finance wants proof of savings and unit economics.
Solution
Report before/after cost per 1K tokens, avg prompt length, avg output length, KB hit rate, agent steps, Flows transitions, and PT utilization.
Tag every request (app/team/env) and break down costs by model/provider/Region.
Keep a monthly change log (prompts, models, RAG settings) with measured deltas.
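The unit-economics report can start from something as simple as this; the per-1K prices are placeholders for your model's published on-demand rates, and `records` stands in for whatever per-request token log you already collect:

```python
def unit_economics(records, price_in_per_1k, price_out_per_1k):
    """Summarize per-request token logs into the metrics finance asks for.

    records: list of dicts with 'tokens_in' and 'tokens_out' per request.
    Prices are on-demand rates per 1K tokens (placeholders here).
    """
    total_in = sum(r["tokens_in"] for r in records)
    total_out = sum(r["tokens_out"] for r in records)
    cost = (total_in / 1000 * price_in_per_1k
            + total_out / 1000 * price_out_per_1k)
    return {
        "requests": len(records),
        "avg_tokens_in": total_in / len(records),
        "avg_tokens_out": total_out / len(records),
        "total_cost": round(cost, 4),
        "cost_per_request": round(cost / len(records), 6),
    }
```

Run this before and after each prompt/model/RAG change and the monthly change log writes itself.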
Quick Wins (Bedrock)
Cap Output + Tighten Prompts: Enforce maxOutputTokens, switch to structured JSON outputs, and trim system prompts/boilerplate. Impact: 20–50% fewer output tokens in hours.
Route to Smaller Models First: Use a light classifier/router to send only hard cases to premium models. Impact: 25–60% token cost reduction on mixed workloads.
Turn On Prompt Caching: Cache repeated system/preamble tokens for chat/workflows. Impact: Up to ~90% discount on cached prefix tokens + faster latency.
Right-Size PT (or Turn It Off): Compute PT breakeven; set utilization SLOs; spill peaks to on-demand; pause or resize underused PT units. Impact: Eliminates idle-hour waste immediately.
Batch for Offline Work: Move backfills, enrichment, and scoring to Batch where supported. Impact: ~50% lower inference unit cost vs on-demand.
Fewer, Better RAG Chunks: Reduce chunk size/overlap, tune topK, dedupe sources, and set a context budget (max RAG tokens/query). Impact: 20–40% less context, fewer model tokens.
Embed in Bulk, Only the Diffs: Batch embeddings; re-embed changed pages only; use lower-dim vectors that meet recall targets. Impact: Cuts ingestion + vector store growth.
Guardrails Before Expensive Calls: Filter inputs/outputs up front so blocked requests skip model tokens. Impact: Prevents pure-waste generations.
Agents, Cap Steps & Memoize: Set hard limits on steps/time, prune toolset, and cache intermediate results between steps. Impact: 20–40% fewer tool/model calls on agent flows.
Flows, Watch Node Transitions: Consolidate trivial transforms, remove dead branches, fail fast; budget transitions per request. Impact: Immediate savings on orchestration charges.
Dashboards, Tags, Budgets: Tag every call (app/team/env). Add budgets/alerts for tokens, PT hours, KB retrieval, and Flows transitions. Impact: Faster detection of regressions & anomalies.
Separation of Use Cases: Split workloads (routing/extraction vs reasoning/chat) so each uses the cheapest capable model. Impact: Keeps "premium" models off commodity tasks.