Amazon Bedrock

Amazon Bedrock is AWS's fully managed platform for building generative-AI apps with a catalog of foundation models (FMs) plus managed RAG (Knowledge Bases), Agents, Flows, Guardrails, Evaluations, and a unified Converse API. It's powerful, but it's easy to overspend on tokens in/out, Provisioned Throughput (PT) hours, RAG ingestion/retrieval, agent/tool loops, and Flows node transitions.

→ What you're using

→ What you're paying

→ What you should be doing

→ AWS-native tools to make it happen


🚀 What is Bedrock?

Amazon Bedrock gives you one API/console to access models from multiple providers (e.g., Amazon Nova/Titan, Anthropic, Meta, Mistral, AI21, Cohere, Stability) and tools to customize, evaluate, secure, and operate gen-AI workloads. Pricing for inference is token-based (for text) or per-asset (for images/video), with Batch and PT options for specific workload shapes.

Features

  • Model choice via one API (text, vision, image, embeddings; some video).

  • Knowledge Bases for managed RAG (Retrieve / RetrieveAndGenerate).

  • Agents / AgentCore for tool use and multi-step workflows.

  • Flows for visually orchestrating prompts/agents/KBs/Guardrails/AWS services, priced per node transition.

  • Guardrails (safety/PII/topic/grounding filters) with reduced pricing for content filters.

  • Prompt Caching & Intelligent Prompt Routing to cut token costs and latency.


🧩 Components: pick the right ones

| Component | What it does | Best for | Cost signals |
| --- | --- | --- | --- |
| Foundation Models (FMs) | Text/chat, vision, image, embeddings (some video) | Core LLM tasks | Tokens in/out (text) or per-asset (image/video), per model/Region. |
| Knowledge Bases (RAG) | Managed retrieval + Retrieve / RetrieveAndGenerate | Ground responses on private data | Pay for ingestion/retrieval, vector store, and any inference; the KB feature itself isn't separately metered. |
| Agents / AgentCore | Tool calling & multi-step plans | Automations & API workflows | Each step can trigger model calls + tools; watch recursion. |
| Flows | Visual builder for agentic apps | Orchestrating complex workflows | $0.035 / 1K node transitions + underlying services (effective 2025-02-01). |
| Guardrails | Safety/PII/topic/grounding filters | Enterprise controls | $0.15 / 1K text units (content filters) after the 2024 price cut; blocked inputs avoid downstream model charges. |
| Evaluations | Auto evals for prompts/models/RAG | Model selection & tuning | Pay for generator/evaluator tokens and any KB retrieval. |
| Converse / ConverseStream | Unified chat API & streaming | Portability across providers | No separate fee; standard model pricing applies. |
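
To get a concrete feel for the unified API, here is a minimal Converse call via boto3. This is a sketch, not a prescribed setup: the Region and model ID are assumptions, so substitute whatever models are enabled in your account.

```python
import boto3

# Bedrock Runtime client; Region and model ID below are assumptions.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # any Converse-supported model works here
    system=[{"text": "You are a concise assistant. Answer in at most three sentences."}],
    messages=[{"role": "user", "content": [{"text": "Summarize what Amazon Bedrock does."}]}],
    inferenceConfig={
        "maxTokens": 256,   # hard cap on output tokens -- a key cost control
        "temperature": 0.2,
    },
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # inputTokens / outputTokens for per-request cost tracking
```

Because the request shape is identical across providers, swapping modelId is usually the only change needed to trial a cheaper model.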


βš™οΈ Inference & deployment options

| Mode | When to use | Notes |
| --- | --- | --- |
| On-Demand | Prototyping; variable usage | Billed by tokens (text) or per image/video/embedding. |
| Batch | Large offline jobs | Select models run at ~50% lower cost than on-demand; outputs are written to S3. |
| Provisioned Throughput (PT) | Steady, high TPS with SLOs | Hourly per model unit with 1- or 6-month commitments; discounted vs on-demand; under-utilized units are wasted spend. |
| Custom Model Import / Fine-tuning | Domain-specific apps | Import is free; pay for inference (on-demand or PT) plus any customization/storage. |


🧠 Bedrock optimization strategy (FinOps + reliability)

Pick the right model & mode

  • Start with smaller/cheaper models for classify/extract/routing; escalate only where impact is proven.

  • For steady traffic, compute the PT breakeven from tokens/min & concurrency (see the sketch below); for bulk jobs, trial Batch (often ~50% less).
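
A back-of-envelope breakeven check might look like the sketch below; every price and traffic figure is a placeholder, so pull real rates for your model/Region from the pricing page.

```python
# PT vs on-demand breakeven -- all numbers below are assumed placeholders.
on_demand_per_1k_in = 0.0008    # $ / 1K input tokens
on_demand_per_1k_out = 0.0032   # $ / 1K output tokens
pt_hourly_per_unit = 20.0       # $ / model unit / hour

# Steady-state traffic one PT unit would absorb (measure p95, not average).
tokens_in_per_min = 400_000
tokens_out_per_min = 80_000

on_demand_hourly = 60 * (
    tokens_in_per_min / 1000 * on_demand_per_1k_in
    + tokens_out_per_min / 1000 * on_demand_per_1k_out
)

print(f"on-demand: ${on_demand_hourly:.2f}/h vs PT unit: ${pt_hourly_per_unit:.2f}/h")
# PT pays off only if the on-demand cost of the traffic a unit absorbs stays
# above the unit's hourly rate for most hours of the commitment term.
```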

Control tokens

  • Shorten system/prompt/context; enforce maxOutputTokens; favor structured outputs.

  • Use Knowledge Bases to retrieve only what's needed; tune chunking/topK/filters to minimize context size.

  • Turn on Prompt Caching for repeated prefixes (up to 90% cost reduction on cached tokens; sketch below).
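
As a sketch of how caching is requested through the Converse API, a cachePoint content block marks everything before it as a cacheable prefix. Caching support and the exact usage fields are model-specific, so treat the model ID here as an assumption to verify.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# The long, stable preamble (instructions, schemas, few-shot examples) is the
# part worth caching; placeholder text stands in for it here.
LONG_SYSTEM_PROMPT = "...several thousand tokens of stable instructions..."

response = bedrock.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # caching support varies by model
    system=[
        {"text": LONG_SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},  # cache boundary after the static prefix
    ],
    messages=[{"role": "user", "content": [{"text": "Classify this ticket: ..."}]}],
    inferenceConfig={"maxTokens": 128},
)

# usage reports cache reads/writes so you can verify the discount is applied
print(response["usage"])
```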

Guardrails first

  • Run Guardrails before expensive model calls (sketch below); blocked prompts incur guardrail cost only and save inference spend.
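
One way to wire this up is the standalone ApplyGuardrail API in front of the model call, as in the sketch below; the guardrail ID/version and model ID are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

GUARDRAIL_ID, GUARDRAIL_VERSION = "gr-example123", "1"  # placeholders

def safe_invoke(user_text: str) -> str:
    # Screen the input first; a blocked prompt costs only the guardrail check.
    check = bedrock.apply_guardrail(
        guardrailIdentifier=GUARDRAIL_ID,
        guardrailVersion=GUARDRAIL_VERSION,
        source="INPUT",
        content=[{"text": {"text": user_text}}],
    )
    if check["action"] == "GUARDRAIL_INTERVENED":
        return "Request blocked by policy."  # no model tokens spent

    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed model
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 256},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Passing guardrailConfig directly to Converse applies the same policies inline; the standalone check is useful when you want to skip the model call entirely.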

Agents/Flows with budgets

  • Keep toolsets minimal; cap steps/time; avoid recursive loops; prefer deterministic plans; monitor node transitions in Flows.

Evaluate continuously

  • Use Evaluations to A/B prompts/models/RAG; keep a champion/challenger cadence.


💸 Pricing model: where the money goes

  • Model inference: input/output tokens (text) or per-asset (image/video); embeddings by tokens; varies by provider/Region.

  • Batch: ~50% lower vs on-demand for supported models (Anthropic, Meta, Mistral, Amazon).

  • Provisioned Throughput: hourly per model unit on 1- or 6-month terms; discounted vs on-demand.

  • Knowledge Bases: pay for ingestion/retrieval and vector store; KB feature itself not billed separately.

  • Guardrails: content filters priced at $0.15 / 1K text units after the 2024 price cut.

  • Flows: $0.035 / 1K node transitions + any underlying services (e.g., Guardrails, model calls).

  • Prompt Caching: cached-token reads up to 90% discount; model-specific write rules.

Model prices, context limits, and availability vary by Region/provider; always validate against the official pricing page for your models.
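
To make the unit economics tangible, here is a toy per-request calculation; every rate below is a placeholder to be replaced with current pricing for your models.

```python
# Where one request's money goes -- all rates are assumed placeholders.
PRICE_IN, PRICE_OUT = 0.0008, 0.0032   # $ / 1K input & output tokens
GUARDRAIL_TEXT_UNIT = 0.15 / 1000      # $ / text unit (content filters)
FLOW_TRANSITION = 0.035 / 1000         # $ / Flows node transition

tokens_in, tokens_out = 3_000, 500     # one RAG-augmented request
text_units, transitions = 4, 6         # guardrail text units + flow hops

cost = (tokens_in / 1000 * PRICE_IN
        + tokens_out / 1000 * PRICE_OUT
        + text_units * GUARDRAIL_TEXT_UNIT
        + transitions * FLOW_TRANSITION)
print(f"${cost:.5f}/request -> ${cost * 1_000_000:,.0f} per million requests")
```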


⏱️ Automated management

  • Budgets/alerts on tokens, PT hours, KB ingestion/retrieval, and Flows transitions (see the alarm sketch below).

  • Quotas per team (daily tokens, max output length).

  • EventBridge jobs to rotate prompt templates, auto-pause PT at end of trials, and archive logs.

  • Tag app/team/env in request metadata → feed Cost Explorer/CUR dashboards.
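
As one example of wiring an alert, the sketch below alarms on Bedrock's CloudWatch token metrics; the threshold, model ID, and SNS topic ARN are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when hourly output tokens for one model cross a budget line.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-output-tokens-hourly",
    Namespace="AWS/Bedrock",
    MetricName="OutputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-lite-v1:0"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=5_000_000,                      # assumed hourly token budget
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:finops-alerts"],
)
```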


🔒 Security & compliance

  • Access Bedrock and AgentCore privately with VPC interface endpoints (PrivateLink).

  • Apply Guardrails for safety/PII/topic and grounding; monitor interventions.

  • Keep data in-Region and encrypted (KMS). Review retention settings per model/provider.


📊 Monitoring & tools

  • CloudWatch/CloudTrail around Bedrock APIs, KB ingestion, Agent actions, and Flows.

  • Cost Explorer / CUR + Budgets: segment by model / provider / Region / app tag.

  • Evaluations: track faithfulness, correctness, completeness during A/Bs.

  • Ops dashboards: latency, time-to-first-token, output length, error rates, agent step counts, Flows transitions.



🧠 AWS Bedrock Cost Optimization Challenges (Q&A)

Cloud GenAI spend tends to balloon from tokens, under-utilized PT, RAG retrieval/ingestion, agent loops, and Flows node transitions. Here are the toughest issues teams hit, and how to fix them fast.


Q1: Our token bill exploded right after launch. Where do we start?

✅ Solution

  • Instrument prompt length, context size, and output tokens per request.

  • Enforce maxOutputTokens, trim system prompts, remove boilerplate, and switch to structured output (JSON schemas) to constrain verbosity.

  • Start on a smaller/cheaper model; route only hard cases to a larger model (a router or classifier in front; see the sketch below).

  • Turn on Prompt Caching for repeated prefixes/system prompts to cut repeat token costs.
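
A hypothetical two-tier router is sketched below: a cheap model triages each request, and only those it marks as hard reach the premium model. The model IDs and triage prompt are illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

CHEAP = "amazon.nova-micro-v1:0"                       # assumed small model
PREMIUM = "anthropic.claude-3-5-sonnet-20241022-v2:0"  # assumed large model

def route(user_text: str) -> str:
    # Tiny triage call: a few output tokens decide which model pays the bill.
    triage = bedrock.converse(
        modelId=CHEAP,
        messages=[{"role": "user", "content": [{
            "text": f"Answer EASY or HARD only. Is this request hard?\n{user_text}"}]}],
        inferenceConfig={"maxTokens": 4},
    )
    verdict = triage["output"]["message"]["content"][0]["text"].strip().upper()

    answer = bedrock.converse(
        modelId=PREMIUM if "HARD" in verdict else CHEAP,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return answer["output"]["message"]["content"][0]["text"]
```

Bedrock's managed Intelligent Prompt Routing can make this decision for you on supported model families; the hand-rolled version is useful when you need custom routing criteria.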


Q2: We bought Provisioned Throughput (PT), but it's not cheaper than on-demand.

✅ Solution

  • Calculate PT breakeven from p95 concurrency × tokens/min; right-size PT units to that line.

  • Use on-demand spillover for peaks (sketch below); avoid paying for idle PT hours.

  • Set utilization SLOs and alerts; pause or reduce PT when traffic drops.
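
Spillover can be as simple as invoking the provisioned-model ARN first and falling back to the on-demand model ID on throttling; in this sketch both identifiers are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

PT_ARN = "arn:aws:bedrock:us-east-1:123456789012:provisioned-model/abc123"  # placeholder
ON_DEMAND_ID = "amazon.nova-lite-v1:0"                                      # placeholder

def invoke_with_spillover(messages: list) -> dict:
    # Prefer the PT endpoint you already pay for by the hour.
    try:
        return bedrock.converse(modelId=PT_ARN, messages=messages,
                                inferenceConfig={"maxTokens": 256})
    except ClientError as err:
        if err.response["Error"]["Code"] != "ThrottlingException":
            raise
        # PT is saturated: spill this peak request to on-demand.
        return bedrock.converse(modelId=ON_DEMAND_ID, messages=messages,
                                inferenceConfig={"maxTokens": 256})
```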


Q3: RAG (Knowledge Bases) costs are higher than expected.

✅ Solution

  • Tune chunk size/overlap, topK, and filters; retrieve fewer, more relevant chunks.

  • Pre-compute embeddings in batch; re-embed only changed docs; dedupe and compress sources.

  • Cache frequent answers; enforce context budgets (max tokens from RAG per query).

  • Prefer Retrieve → filtered context over dumping full documents into the prompt (see the sketch below).
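
A minimal sketch of Retrieve with a small topK and a hard context budget follows; the KB ID is a placeholder, and the character budget is a crude stand-in for a real token budget.

```python
import boto3

agent_rt = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

KB_ID = "KB12345EXAMPLE"       # placeholder knowledge base ID
CONTEXT_BUDGET_CHARS = 6_000   # crude proxy for a token budget

def retrieve_context(query: str) -> str:
    resp = agent_rt.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 4}},
    )
    # Keep the highest-ranked chunks until the context budget is spent.
    context, used = [], 0
    for hit in resp["retrievalResults"]:
        text = hit["content"]["text"]
        if used + len(text) > CONTEXT_BUDGET_CHARS:
            break
        context.append(text)
        used += len(text)
    return "\n\n".join(context)
```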


Q4: Agents keep looping and burning tokens/tool calls.

✅ Solution

  • Cap max steps/time (see the sketch below), add guard states, and prefer deterministic plans over open-ended reasoning.

  • Reduce the toolset to essentials; cache intermediate results across steps.

  • Terminate on ambiguous goals; add user confirmation checkpoints for risky branches.
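
Bedrock Agents expose their own limits, but the cap-and-memoize pattern is easy to enforce in any orchestration loop; here is a generic sketch in which step_fn is a hypothetical plan/act iteration.

```python
import time

MAX_STEPS, MAX_SECONDS = 6, 30   # hard budgets -- tune per workflow

def run_agent(step_fn, goal: str) -> str:
    """step_fn stands in for one plan/act iteration (a model call plus any
    tool invocation) and returns {"done": bool, "answer": str | None}."""
    start, memo = time.monotonic(), {}   # memo caches intermediate results
    for _ in range(MAX_STEPS):
        if time.monotonic() - start > MAX_SECONDS:
            return "aborted: time budget exhausted"
        result = step_fn(goal, memo)
        if result.get("done"):
            return result["answer"]
    return "aborted: step budget exhausted"
```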


Q5: Guardrails add cost without obvious benefit.

✅ Solution

  • Run Guardrails before expensive model calls; blocked requests incur guardrail cost only and avoid model tokens.

  • Log blocked categories; fix noisy sources (spam inputs, malformed payloads) upstream.

  • Keep guardrail policies modular (turn on only what you need).


Q6: Model choice feels like guesswork and we overpay for "premium" models.

βœ… Solution

  • Use Evaluations with a representative dataset; A/B small vs. large models and pick the cheapest that meets SLOs.

  • Separate use cases: classification/extraction/routing on small models; escalate only when needed.

  • Track cost per 1K tokens and quality metrics together; re-evaluate after prompt/RAG changes.


Q7: Latency SLOs push us to larger (pricier) models.

✅ Solution

  • Use streaming (ConverseStream) to improve time-to-first-token without upgrading models (measured in the sketch below).

  • Shorten prompts; use few-shot with compact exemplars; remove irrelevant context.

  • Move heavy retrieval/enrichment to Batch or background jobs where possible.
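
Measuring time-to-first-token with ConverseStream is straightforward, as in the sketch below (Region and model ID assumed).

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

start = time.monotonic()
response = bedrock.converse_stream(
    modelId="amazon.nova-lite-v1:0",  # assumed model
    messages=[{"role": "user", "content": [{"text": "Explain RAG in two sentences."}]}],
    inferenceConfig={"maxTokens": 256},
)

first_token = None
for event in response["stream"]:
    delta = event.get("contentBlockDelta")
    if delta:
        if first_token is None:
            first_token = time.monotonic() - start  # time-to-first-token
        print(delta["delta"].get("text", ""), end="", flush=True)

print(f"\nTTFT: {first_token:.2f}s")
```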


Q8: Flows orchestration is convenient, but node transitions are adding up.

✅ Solution

  • Consolidate nodes; combine lightweight transforms; memoize outputs within a flow (reuse results).

  • Avoid unnecessary fan-out; prune dead branches; set fail-fast conditions.

  • Monitor node transitions per request; budget transitions like tokens.


Q9: Embeddings/vector store costs are creeping up.

✅ Solution

  • Batch embedding jobs; re-embed only diffs (changed pages/sections; see the sketch below).

  • Use lower-dimensional embeddings that meet recall needs; prune duplicate or low-value content.

  • Apply retention policies on the vector store; tier old sources to cheaper storage.
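
A hash-based diff check keeps unchanged documents out of the embedding bill. The sketch below calls Titan Text Embeddings V2 at a reduced dimension; the in-memory seen map is illustrative (persist it, e.g., in DynamoDB).

```python
import hashlib
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
seen: dict[str, str] = {}  # doc_id -> content hash; persist this in practice

def embed_if_changed(doc_id: str, text: str):
    digest = hashlib.sha256(text.encode()).hexdigest()
    if seen.get(doc_id) == digest:
        return None  # unchanged document -- skip the embedding spend
    seen[doc_id] = digest
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 512, "normalize": True}),
    )
    return json.loads(resp["body"].read())["embedding"]
```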


Q10: Finance wants proof of savings and unit economics.

✅ Solution

  • Report before/after cost per 1K tokens, avg prompt length, avg output length, KB hit rate, agent steps, Flows transitions, and PT utilization.

  • Tag every request (app/team/env) and break down costs by model/provider/Region (sketch below).

  • Keep a monthly change log (prompts, models, RAG settings) with measured deltas.
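
Once calls carry cost-allocation tags, Cost Explorer can produce the per-app breakdown finance wants. In this sketch the "app" tag key and the exact SERVICE filter string are assumptions to verify against your CUR.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
    GroupBy=[{"Type": "TAG", "Key": "app"}],  # assumed cost-allocation tag
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag = group["Keys"][0]
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f"{tag}: ${float(amount):,.2f}")
```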


βš™οΈ Quick Wins (Bedrock)

  • Cap Output + Tighten Prompts: Enforce maxOutputTokens, switch to structured JSON outputs, and trim system prompts/boilerplate. Impact: 20–50% fewer output tokens in hours.

  • Route to Smaller Models First: Use a light classifier/router to send only hard cases to premium models. Impact: 25–60% token cost reduction on mixed workloads.

  • Turn On Prompt Caching: Cache repeated system/preamble tokens for chat/workflows. Impact: Up to ~90% discount on cached prefix tokens, plus faster latency.

  • Right-Size PT (or Turn It Off): Compute the PT breakeven; set utilization SLOs; spill peaks to on-demand; pause or resize underused PT units. Impact: Eliminates idle-hour waste immediately.

  • Batch for Offline Work: Move backfills, enrichment, and scoring to Batch where supported. Impact: ~50% lower inference unit cost vs on-demand.

  • Fewer, Better RAG Chunks: Reduce chunk size/overlap, tune topK, dedupe sources, and set a context budget (max RAG tokens/query). Impact: 20–40% less context, fewer model tokens.

  • Embed in Bulk, Only the Diffs: Batch embeddings; re-embed changed pages only; use lower-dim vectors that meet recall targets. Impact: Cuts ingestion + vector store growth.

  • Guardrails Before Expensive Calls: Filter inputs/outputs up front so blocked requests skip model tokens. Impact: Prevents pure-waste generations.

  • Cap Agent Steps & Memoize: Set hard limits on steps/time, prune the toolset, and cache intermediate results between steps. Impact: 20–40% fewer tool/model calls on agent flows.

  • Watch Flows Node Transitions: Consolidate trivial transforms, remove dead branches, fail fast; budget transitions per request. Impact: Immediate savings on orchestration charges.

  • Dashboards, Tags, Budgets: Tag every call (app/team/env). Add budgets/alerts for tokens, PT hours, KB retrieval, and Flows transitions. Impact: Faster detection of regressions & anomalies.

  • Separate Use Cases: Split workloads (routing/extraction vs reasoning/chat) so each uses the cheapest capable model. Impact: Keeps "premium" models off commodity tasks.

