Amazon Bedrock
Quicklinks (Bookmark):
Coming Soon
Amazon Bedrock is AWS's fully managed platform for building generative-AI apps with a catalog of foundation models (FMs) plus managed RAG (Knowledge Bases), Agents, Flows, Guardrails, Evaluations, and a unified Converse API. It's powerful, but it's easy to overspend on tokens in/out, Provisioned Throughput (PT) hours, RAG ingestion/retrieval, agent/tool loops, and Flows node transitions.
→ What you're using
→ What you're paying
→ What you should be doing
→ AWS-native tools to make it happen.
What is Bedrock?
Amazon Bedrock gives you one API/console to access models from multiple providers (e.g., Amazon Nova/Titan, Anthropic, Meta, Mistral, AI21, Cohere, Stability) and tools to customize, evaluate, secure, and operate gen-AI workloads. Pricing for inference is token-based (for text) or per-asset (for images/video), with Batch and PT options for specific workload shapes.
Features
Model choice via one API (text, vision, image, embeddings; some video).
Knowledge Bases for managed RAG (Retrieve / RetrieveAndGenerate).
Agents / AgentCore for tool use and multi-step workflows.
Flows for visually orchestrating prompts/agents/KBs/Guardrails/AWS services, priced per node transition.
Guardrails (safety/PII/topic grounding) with reduced pricing for content filters.
Prompt Caching & Intelligent Prompt Routing to cut token costs and latency.
Components: pick the right ones
Foundation Models (FMs)
Text/chat, vision, image, embeddings (some video)
Core LLM tasks
Tokens in/out (text) or per-asset (image/video), per model/Region.
Knowledge Bases (RAG)
Managed retrieval + Retrieve / RetrieveAndGenerate
Ground responses on private data
Pay for ingestion/retrieval, vector store, and any inference; the KB feature itself isn't separately metered.
Agents / AgentCore
Tool calling & multi-step plans
Automations & API workflows
Each step can trigger model calls + tools; watch recursion.
Flows
Visual builder for agentic apps
Orchestrating complex workflows
$0.035 / 1K node transitions + underlying services (effective 2025-02-01).
Guardrails
Safety/PII/topic/grounding filters
Enterprise controls
$0.15 / 1K text units (content filters) after 2024 price cut; blocked inputs avoid downstream model charges.
Evaluations
Auto evals for prompts/models/RAG
Model selection & tuning
Pay for generator/evaluator tokens and any KB retrieval.
Converse / ConverseStream
Unified chat API & streaming
Portability across providers
No separate fee; standard model pricing applies.
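As a concrete sketch of the unified API, the helper below issues a single-turn call through a boto3 `bedrock-runtime` client (constructed by the caller); the temperature and token cap are illustrative defaults, not recommendations, and any Converse-capable model ID works unchanged:

```python
def ask(bedrock_runtime, model_id, user_text, max_tokens=256):
    """Single-turn chat via the unified Converse API.

    bedrock_runtime: a boto3 'bedrock-runtime' client.
    The same request/response shape works across providers, which is
    what makes swapping models a one-line change.
    """
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        # Always set maxTokens: it is the single biggest output-cost lever.
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Because the signature is provider-neutral, routing the same `ask()` to a cheaper or pricier model is just a different `model_id` string.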
Inference & deployment options
On-Demand
Prototyping; variable usage
Billed by tokens (text) or per image/video/embedding.
Batch
Large offline jobs
Selected models run at ~50% lower cost than on-demand; outputs land in S3.
Provisioned Throughput (PT)
Steady, high TPS with SLOs
Hourly per model unit with 1- or 6-month commitments; discounted vs on-demand; under-utilization wastes spend.
Custom Model Import / Fine-tuning
Domain-specific apps
Import free; pay for inference (on-demand or PT) and any customization/storage.
Bedrock optimization strategy (FinOps + reliability)
Pick the right model & mode
Start with smaller/cheaper models for classify/extract/routing; escalate only where impact is proven.
For steady traffic, compute PT breakeven from tokens/min & concurrency; for bulk jobs, trial Batch (often ~50% less).
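The PT breakeven arithmetic can be sketched as a quick back-of-envelope check; all prices and per-unit capacities below are placeholders to substitute with the published numbers for your model and Region, and the single blended per-1K rate is a simplification over separate input/output prices:

```python
def pt_breakeven(tokens_per_min, od_price_per_1k, pt_hourly_per_unit,
                 unit_tokens_per_min):
    """Compare on-demand vs Provisioned Throughput hourly cost.

    tokens_per_min: sustained throughput (use p95, not average).
    od_price_per_1k: blended on-demand price per 1K tokens (placeholder).
    pt_hourly_per_unit: hourly price of one PT model unit (placeholder).
    unit_tokens_per_min: tokens/min one PT unit can serve (placeholder).
    Returns (units needed, on-demand $/hr, PT $/hr).
    """
    on_demand_hourly = tokens_per_min * 60 / 1000 * od_price_per_1k
    units = -(-tokens_per_min // unit_tokens_per_min)  # ceiling division
    pt_hourly = units * pt_hourly_per_unit
    return units, on_demand_hourly, pt_hourly
```

If PT $/hr exceeds on-demand $/hr at your p95 throughput, PT is idle waste; re-run the check whenever traffic shape changes.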
Control tokens
Shorten system/prompt/context; enforce maxOutputTokens; favor structured outputs.
Use Knowledge Bases to retrieve only what's needed; tune chunking/topK/filters to minimize context size.
Turn on Prompt Caching for repeat prefixes (up to 90% cost reduction on cached tokens).
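A minimal sketch of those token controls, using character counts as a crude stand-in for tokens (swap in a real tokenizer for precise budgets); both budget constants are illustrative:

```python
MAX_OUTPUT_TOKENS = 300     # hard cap on generation length (illustrative)
MAX_CONTEXT_CHARS = 8_000   # crude context budget, roughly ~2K tokens

def build_request(system_prompt, context_chunks, question):
    """Assemble a Converse request body under a fixed context budget.

    Retrieved chunks are admitted in order until the budget is spent,
    so the most relevant chunks should come first.
    """
    context, used = [], 0
    for chunk in context_chunks:
        if used + len(chunk) > MAX_CONTEXT_CHARS:
            break  # drop the remainder rather than blow the budget
        context.append(chunk)
        used += len(chunk)
    prompt = "\n\n".join(context + [question])
    return {
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_OUTPUT_TOKENS},
    }
```

A stable, shared `system_prompt` prefix is also what makes Prompt Caching effective across requests.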
Guardrails first
Run Guardrails before expensive model calls; blocked prompts incur guardrail cost only and save inference spend.
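The guardrails-first pattern might look like this with the standalone ApplyGuardrail action; the guardrail and model identifiers are hypothetical, and the boto3 `bedrock-runtime` client is passed in rather than constructed here:

```python
def guarded_generate(runtime, guardrail_id, guardrail_version,
                     model_id, user_text):
    """Screen the input with Guardrails before paying for model tokens.

    runtime: a boto3 'bedrock-runtime' client.
    A blocked prompt returns early, so only the (cheaper) guardrail
    call is billed and the model call is skipped entirely.
    """
    check = runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",
        content=[{"text": {"text": user_text}}],
    )
    if check["action"] == "GUARDRAIL_INTERVENED":
        return None  # log the blocked category upstream if needed
    response = runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_text}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```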
Agents/Flows with budgets
Keep toolsets minimal; cap steps/time; avoid recursive loops; prefer deterministic plans; monitor node transitions in Flows.
Evaluate continuously
Use Evaluations to A/B prompts/models/RAG; keep a champion/challenger cadence.
Pricing model: where the money goes
Model inference: input/output tokens (text) or per-asset (image/video); embeddings by tokens; varies by provider/Region.
Batch: ~50% lower vs on-demand for supported models (Anthropic, Meta, Mistral, Amazon).
Provisioned Throughput: hourly per model unit, 1 or 6-month terms; discounted vs on-demand.
Knowledge Bases: pay for ingestion/retrieval and vector store; KB feature itself not billed separately.
Guardrails: content filters priced at $0.15 / 1K text units after 2024 price cut.
Flows: $0.035 / 1K node transitions + any underlying services (e.g., Guardrails, model calls).
Prompt Caching: cached-token reads up to 90% discount; model-specific write rules.
Model prices, context limits, and availability vary by Region/provider, so always validate on the official pricing page for your models.
Automated management
Budgets/alerts on tokens, PT hours, KB ingestion/retrieval, and Flows transitions.
Quotas per team (daily tokens, max output length).
EventBridge jobs to rotate prompt templates, auto-pause PT at end of trials, and archive logs.
Tag app/team/env in request metadata and feed Cost Explorer/CUR dashboards.
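One way to attach those tags is the Converse API's `requestMetadata` field, which shows up in model invocation logs (when logging is enabled in Bedrock settings) and can be joined against CUR data downstream; the tag keys here are just the conventions suggested above:

```python
def tagged_converse(runtime, model_id, messages, app, team, env):
    """Attach app/team/env metadata to every call for cost attribution.

    runtime: a boto3 'bedrock-runtime' client. requestMetadata is
    pass-through key/value strings; it does not affect the model output.
    """
    return runtime.converse(
        modelId=model_id,
        messages=messages,
        requestMetadata={"app": app, "team": team, "env": env},
    )
```

Wrapping every call site behind a helper like this is what makes the per-team dashboards and budgets possible later.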
Security & compliance
Access Bedrock and AgentCore privately with VPC interface endpoints (PrivateLink).
Apply Guardrails for safety/PII/topic and grounding; monitor interventions.
Keep data in-Region and encrypted (KMS). Review retention settings per model/provider.
Monitoring & tools
CloudWatch/CloudTrail around Bedrock APIs, KB ingestion, Agent actions, and Flows.
Cost Explorer / CUR + Budgets: segment by model / provider / Region / app tag.
Evaluations: track faithfulness, correctness, completeness during A/Bs.
Ops dashboards: latency, time-to-first-token, output length, error rates, agent step counts, Flows transitions.
Bedrock FinOps Checklist
AWS Bedrock Cost Optimization Challenges (Q&A)
Cloud GenAI spend tends to balloon from tokens, under-utilized PT, RAG retrieval/ingestion, agent loops, and Flows node transitions. Here are the toughest issues teams hit, and how to fix them fast.
Q1: Our token bill exploded right after launch. Where do we start?
Solution
Instrument prompt length, context size, and output tokens per request.
Enforce maxOutputTokens, trim system prompts, remove boilerplate, and switch to structured output (JSON schemas) to constrain verbosity.
Start on a smaller/cheaper model; route only hard cases to a larger model (router or classifier in front).
Turn on Prompt Caching for repeated prefixes/system prompts to cut repeat token costs.
Q2: We bought Provisioned Throughput (PT), but itβs not cheaper than on-demand.
Solution
Calculate PT breakeven from p95 concurrency × tokens/min; right-size PT units to that line.
Use on-demand spillover for peaks; avoid paying for idle PT hours.
Set utilization SLOs and alerts; pause or reduce PT when traffic drops.
Q3: RAG (Knowledge Bases) costs are higher than expected.
Solution
Tune chunk size/overlap, topK, and filters; retrieve fewer, more relevant chunks.
Pre-compute embeddings in batch; re-embed only changed docs; dedupe and compress sources.
Cache frequent answers; enforce context budgets (max tokens from RAG per query).
Prefer Retrieve with filtered context over dumping full documents into the prompt.
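A sketch of that retrieval pattern via the `Retrieve` API (a boto3 `bedrock-agent-runtime` client, passed in); the knowledge base ID is a placeholder, and `top_k` is the lever that directly caps context size:

```python
def retrieve_context(agent_runtime, kb_id, query, top_k=4):
    """Fetch a small set of relevant chunks instead of whole documents.

    agent_runtime: a boto3 'bedrock-agent-runtime' client.
    Lowering numberOfResults shrinks the context (and the token bill)
    of the generation step that follows.
    """
    resp = agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [r["content"]["text"] for r in resp["retrievalResults"]]
```

The returned chunks can then feed a context-budgeted prompt builder, so RAG never dumps more tokens into the model than the query actually needs.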
Q4: Agents keep looping and burning tokens/tool calls.
Solution
Cap max steps/time, add guard states, and prefer deterministic plans over open-ended reasoning.
Reduce the toolset to essentials; cache intermediate results across steps.
Terminate on ambiguous goals; add user confirmation checkpoints for risky branches.
Q5: Guardrails add cost without obvious benefit.
Solution
Run Guardrails before expensive model calls; blocked requests incur guardrail cost only and avoid model tokens.
Log blocked categories; fix noisy sources (spam inputs, malformed payloads) upstream.
Keep guardrail policies modular (turn on only what you need).
Q6: Model choice feels like guesswork and we overpay for "premium" models.
Solution
Use Evaluations with a representative dataset; A/B small vs. large models and pick the cheapest that meets SLOs.
Separate use cases: classification/extraction/routing on small models; escalate only when needed.
Track cost per 1K tokens and quality metrics together; re-evaluate after prompt/RAG changes.
Q7: Latency SLOs push us to larger (pricier) models.
Solution
Use streaming (ConverseStream) to improve time-to-first-token without upgrading models.
Shorten prompts; use few-shot with compact exemplars; remove irrelevant context.
Move heavy retrieval/enrichment to Batch or background jobs where possible.
Q8: Flows orchestration is convenient, but node transitions are adding up.
Solution
Consolidate nodes; combine lightweight transforms; memoize outputs within a flow (reuse results).
Avoid unnecessary fan-out; prune dead branches; set fail-fast conditions.
Monitor node transitions per request; budget transitions like tokens.
Q9: Embeddings/vector store costs are creeping up.
Solution
Batch embedding jobs; re-embed only diffs (changed pages/sections).
Use lower-dimensional embeddings that meet recall needs; prune duplicate or low-value content.
Apply retention policies on the vector store; tier old sources to cheaper storage.
Q10: Finance wants proof of savings and unit economics.
Solution
Report before/after cost per 1K tokens, avg prompt length, avg output length, KB hit rate, agent steps, Flows transitions, and PT utilization.
Tag every request (app/team/env) and break down costs by model/provider/Region.
Keep a monthly change log (prompts, models, RAG settings) with measured deltas.
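The unit-economics report can start from something as simple as this; the per-1K prices are placeholders for your model's published on-demand rates, and `records` stands in for whatever per-request token log you already collect:

```python
def unit_economics(records, price_in_per_1k, price_out_per_1k):
    """Summarize per-request token logs into the metrics finance asks for.

    records: list of dicts with 'tokens_in' and 'tokens_out' per request.
    Prices are on-demand rates per 1K tokens (placeholders here).
    """
    total_in = sum(r["tokens_in"] for r in records)
    total_out = sum(r["tokens_out"] for r in records)
    cost = (total_in / 1000 * price_in_per_1k
            + total_out / 1000 * price_out_per_1k)
    return {
        "requests": len(records),
        "avg_tokens_in": total_in / len(records),
        "avg_tokens_out": total_out / len(records),
        "total_cost": round(cost, 4),
        "cost_per_request": round(cost / len(records), 6),
    }
```

Run this before and after each prompt/model/RAG change and the monthly change log writes itself.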
Quick Wins (Bedrock)
Cap Output + Tighten Prompts: Enforce maxOutputTokens, switch to structured JSON outputs, and trim system prompts/boilerplate. Impact: 20–50% fewer output tokens in hours.
Route to Smaller Models First: Use a light classifier/router to send only hard cases to premium models. Impact: 25–60% token cost reduction on mixed workloads.
Turn On Prompt Caching: Cache repeated system/preamble tokens for chat/workflows. Impact: Up to ~90% discount on cached prefix tokens + faster latency.
Right-Size PT (or Turn It Off): Compute PT breakeven; set utilization SLOs; spill peaks to on-demand; pause or resize underused PT units. Impact: Eliminates idle-hour waste immediately.
Batch for Offline Work: Move backfills, enrichment, and scoring to Batch where supported. Impact: ~50% lower inference unit cost vs on-demand.
Fewer, Better RAG Chunks: Reduce chunk size/overlap, tune topK, dedupe sources, and set a context budget (max RAG tokens/query). Impact: 20–40% less context, fewer model tokens.
Embed in Bulk, Only the Diffs: Batch embeddings; re-embed changed pages only; use lower-dim vectors that meet recall targets. Impact: Cuts ingestion + vector store growth.
Guardrails Before Expensive Calls: Filter inputs/outputs up front so blocked requests skip model tokens. Impact: Prevents pure-waste generations.
Agents, Cap Steps & Memoize: Set hard limits on steps/time, prune toolset, and cache intermediate results between steps. Impact: 20–40% fewer tool/model calls on agent flows.
Flows, Watch Node Transitions: Consolidate trivial transforms, remove dead branches, fail fast; budget transitions per request. Impact: Immediate savings on orchestration charges.
Dashboards, Tags, Budgets: Tag every call (app/team/env). Add budgets/alerts for tokens, PT hours, KB retrieval, and Flows transitions. Impact: Faster detection of regressions & anomalies.
Separation of Use Cases: Split workloads (routing/extraction vs reasoning/chat) so each uses the cheapest capable model. Impact: Keeps "premium" models off commodity tasks.