# Amazon Bedrock

### 🔗 **Quicklinks (Bookmark):**

**Coming Soon**

**Amazon Bedrock** is AWS’s fully managed platform for building generative-AI apps with a catalog of foundation models (FMs) plus managed RAG (Knowledge Bases), Agents, **Flows**, Guardrails, Evaluations, and a unified Converse API. It’s powerful — but it’s easy to overspend on **tokens in/out**, **Provisioned Throughput (PT)** hours, **RAG ingestion/retrieval**, **agent/tool loops**, and **Flows node transitions**.

What you’re using → what you’re paying → what you should be doing → AWS-native tools to make it happen.

***

### 🚀 What is Bedrock?

**Amazon Bedrock** gives you one API/console to access models from multiple providers (e.g., Amazon Nova/Titan, Anthropic, Meta, Mistral, AI21, Cohere, Stability) and tools to customize, evaluate, secure, and operate gen-AI workloads. Pricing for inference is **token-based** (for text) or **per-asset** (for images/video), with **Batch** and **PT** options for specific workload shapes.

**Features**

* **Model choice via one API** (text, vision, image, embeddings; some video).
* **Knowledge Bases** for managed RAG (Retrieve / RetrieveAndGenerate).
* **Agents / AgentCore** for tool use and multi-step workflows.
* **Flows** for visually orchestrating prompts/agents/KBs/Guardrails/AWS services — **priced per node transition**.
* **Guardrails** (safety/PII/topic grounding) with reduced pricing for content filters.
* **Prompt Caching** & **Intelligent Prompt Routing** to cut token costs and latency.
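Every provider is reachable through the same request shape. A minimal sketch of assembling a Converse call, where the model ID, token cap, and temperature are illustrative assumptions, not recommendations:

```python
# Build kwargs for bedrock-runtime's converse(); capping maxTokens bounds output spend.
# The model ID and inference settings here are placeholder examples.
def build_converse_kwargs(prompt, model_id="amazon.nova-lite-v1:0", max_tokens=256):
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }

# Usage (requires AWS credentials and model access):
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**build_converse_kwargs("Summarize this ticket."))
```

Centralizing request construction like this also gives you one place to enforce token caps and tagging across teams.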

***

### 🧩 Components — pick the right ones

| Component                     | What it does                                           | Best for                         | Cost signals                                                                                                      |
| ----------------------------- | ------------------------------------------------------ | -------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Foundation Models (FMs)**   | Text/chat, vision, image, embeddings (some video)      | Core LLM tasks                   | **Tokens in/out** (text) or per-asset (image/video), per model/Region.                                            |
| **Knowledge Bases (RAG)**     | Managed retrieval + `Retrieve` / `RetrieveAndGenerate` | Ground responses on private data | Pay for **ingestion/retrieval**, vector store, and any inference; the KB feature itself isn’t separately metered. |
| **Agents / AgentCore**        | Tool calling & multi-step plans                        | Automations & API workflows      | Each step can trigger **model calls + tools**; watch recursion.                                                   |
| **Flows**                     | Visual builder for agentic apps                        | Orchestrating complex workflows  | **$0.035 / 1K node transitions** + underlying services (effective 2025-02-01).                                    |
| **Guardrails**                | Safety/PII/topic/grounding filters                     | Enterprise controls              | **$0.15 / 1K text units** (content filters) after 2024 price cut; blocked inputs avoid downstream model charges.  |
| **Evaluations**               | Auto evals for prompts/models/RAG                      | Model selection & tuning         | Pay for generator/evaluator tokens and any KB retrieval.                                                          |
| **Converse / ConverseStream** | Unified chat API & streaming                           | Portability across providers     | No separate fee; standard model pricing applies.                                                                  |

***

### ⚙️ Inference & deployment options

| Mode                                  | When to use                 | Notes                                                                                                      |
| ------------------------------------- | --------------------------- | ---------------------------------------------------------------------------------------------------------- |
| **On-Demand**                         | Prototyping; variable usage | Billed by **tokens** (text) or **per image/video/embedding**.                                              |
| **Batch**                             | Large offline jobs          | Select models available at **\~50% lower** than on-demand; outputs to S3.                                  |
| **Provisioned Throughput (PT)**       | Steady, high TPS with SLOs  | **Hourly per model unit** with 1- or 6-month commitments; discount vs on-demand; idle units still bill.    |
| **Custom Model Import / Fine-tuning** | Domain-specific apps        | Import free; pay for inference (on-demand or PT) and any customization/storage.                            |

***

### 🧠 Bedrock optimization strategy (FinOps + reliability)

**Pick the right model & mode**

* Start with **smaller/cheaper models** for classify/extract/routing; escalate only where impact is proven.
* For steady traffic, compute **PT breakeven** from tokens/min & concurrency; for bulk jobs, trial **Batch** (often \~50% less).
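The breakeven above reduces to simple arithmetic; every rate below is an illustrative placeholder, not a real price:

```python
# Rough PT breakeven sketch -- all rates are illustrative placeholders.
def pt_breakeven_tokens_per_hour(pt_hourly_usd, od_usd_per_1k_tokens):
    """Tokens/hour above which one PT unit is cheaper than on-demand at these rates."""
    return pt_hourly_usd / od_usd_per_1k_tokens * 1_000

# e.g., a $40/hour unit vs $0.008 per 1K on-demand tokens:
breakeven = pt_breakeven_tokens_per_hour(40.0, 0.008)  # about 5,000,000 tokens/hour
```

If your sustained throughput sits well below that line, stay on-demand or Batch; above it, price out a PT commitment.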

**Control tokens**

* Shorten system/prompt/context; enforce **`maxOutputTokens`**; favor **structured outputs**.
* Use **Knowledge Bases** to retrieve only what’s needed; tune **chunking/topK/filters** to minimize context size.
* Turn on **Prompt Caching** for repeat prefixes (up to **90% cost reduction** on cached tokens).
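As a back-of-envelope, the effect of caching on input cost looks like this; the 90% read discount and the $/1K rate are assumptions, and real write rules vary by model:

```python
# Estimate input-token cost with a cached prefix (rates and discount are assumptions).
def cached_input_cost(total_in_tokens, cached_tokens, usd_per_1k_in, cache_discount=0.90):
    fresh = total_in_tokens - cached_tokens
    discounted = cached_tokens * (1 - cache_discount)  # cached reads at 10% of full rate
    return (fresh + discounted) / 1_000 * usd_per_1k_in

# 8K-token prompt, 6K of it a stable cached prefix, at an assumed $0.003 / 1K:
with_cache = cached_input_cost(8_000, 6_000, 0.003)
no_cache = cached_input_cost(8_000, 0, 0.003)
```

The bigger and more stable your shared prefix (system prompt, few-shot exemplars), the closer you get to the headline discount.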

**Guardrails first**

* Run **Guardrails** before expensive model calls; blocked prompts incur **guardrail cost only** and save inference spend.
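The ordering matters. A sketch with injected stand-ins, where `check_guardrail` and `invoke_model` are placeholders for the real guardrail and inference calls (e.g., ApplyGuardrail and Converse):

```python
# Screen input first so blocked requests never reach the model.
# check_guardrail / invoke_model are injected stand-ins, not Bedrock APIs.
def guarded_invoke(prompt, check_guardrail, invoke_model):
    if check_guardrail(prompt) == "BLOCKED":
        return {"blocked": True, "text": None}   # guardrail cost only, no model tokens
    return {"blocked": False, "text": invoke_model(prompt)}
```

Keeping the two calls behind one wrapper also makes it easy to log blocked categories for upstream cleanup.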

**Agents/Flows with budgets**

* Keep toolsets minimal; cap steps/time; avoid recursive loops; prefer deterministic plans; monitor **node transitions** in Flows.

**Evaluate continuously**

* Use **Evaluations** to A/B prompts/models/RAG; keep a **champion/challenger** cadence.

***

### 💸 Pricing model — where the money goes

* **Model inference**: **input/output tokens** (text) or **per-asset** (image/video); embeddings by tokens; varies by provider/Region.
* **Batch**: \~**50% lower** vs on-demand for supported models (Anthropic, Meta, Mistral, Amazon).
* **Provisioned Throughput**: **hourly per model unit**, 1- or 6-month terms; discounted vs on-demand.
* **Knowledge Bases**: pay for **ingestion/retrieval** and vector store; KB feature itself not billed separately.
* **Guardrails**: content filters priced at **$0.15 / 1K text units** after 2024 price cut.
* **Flows**: **$0.035 / 1K node transitions** + any underlying services (e.g., Guardrails, model calls).
* **Prompt Caching**: cached-token reads up to **90% discount**; model-specific write rules.

> Model prices, context limits, and availability vary by Region/provider — always validate on the official pricing page for your models.

***

### ⏱️ Automated management

* **Budgets/alerts** on **tokens**, **PT hours**, **KB ingestion/retrieval**, and **Flows transitions**.
* **Quotas** per team (daily tokens, max output length).
* **EventBridge** jobs to rotate **prompt templates**, auto-pause PT at end of trials, and archive logs.
* Tag **app/team/env** in request metadata → feed **Cost Explorer/CUR** dashboards.
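A per-team daily token quota can be a plain pre-flight check in your gateway; the limit below is an arbitrary example:

```python
# Reject requests that would exceed a team's daily token quota (limit is illustrative).
from collections import defaultdict

class TokenQuota:
    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)             # (team, date) -> tokens consumed

    def allow(self, team, date, tokens):
        key = (team, date)
        if self.used[key] + tokens > self.daily_limit:
            return False                         # deny before any model call is made
        self.used[key] += tokens
        return True
```

In production the counter would live in a shared store (e.g., DynamoDB or Redis) rather than process memory.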

***

### 🔒 Security & compliance

* Access Bedrock and AgentCore privately with **VPC interface endpoints** (PrivateLink).
* Apply **Guardrails** for safety/PII/topic and grounding; monitor interventions.
* Keep data in-Region and encrypted (KMS). Review retention settings per model/provider.

***

### 📊 Monitoring & tools

* **CloudWatch/CloudTrail** around Bedrock APIs, KB ingestion, Agent actions, and **Flows**.
* **Cost Explorer / CUR + Budgets**: segment by **model / provider / Region / app tag**.
* **Evaluations**: track **faithfulness, correctness, completeness** during A/Bs.
* Ops dashboards: latency, time-to-first-token, output length, error rates, **agent step counts**, **Flows transitions**.

***

### ✅ Bedrock FinOps Checklist

* [ ] Choose **On-Demand vs Batch vs PT** by **workload shape**; set **per-workload budgets**.
* [ ] Standardize on **Converse/ConverseStream** for portability & streaming.
* [ ] Use **Knowledge Bases** to trim context; tune **chunking/topK/filters**.
* [ ] Put **Guardrails** in front of expensive calls; log blocked prompts/responses.
* [ ] For steady QPS, **price out PT** and **commit only what you’ll use**; monitor utilization.
* [ ] Turn on **Prompt Caching** for repeat prefixes; target cacheable prompt budgets.
* [ ] Watch **agent step counts** and **Flows transitions**; cap recursion/timeouts.
* [ ] Tag every call (app/team/env); build dashboards for **tokens**, **PT**, **KB retrieval**, **Flows**.

***

### 🧠 AWS Bedrock Cost Optimization Challenges (Q\&A)

Cloud GenAI spend tends to balloon from **tokens**, **under-utilized PT**, **RAG retrieval/ingestion**, **agent loops**, and **Flows node transitions**. Here are the toughest issues teams hit — and how to fix them fast.

***

#### **Q1: Our token bill exploded right after launch. Where do we start?**

**✅ Solution**

* Instrument **prompt length, context size, and output tokens** per request.
* Enforce **`maxOutputTokens`**, trim system prompts, remove boilerplate, and switch to **structured output** (JSON schemas) to constrain verbosity.
* Start on a **smaller/cheaper model**; route only hard cases to a larger model (router or classifier in front).
* Turn on **Prompt Caching** for repeated prefixes/system prompts to cut repeat token costs.
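Instrumentation can start from the `usage` block that Converse returns with each response; the per-1K prices below are placeholders:

```python
# Log per-request token usage and cost from a Converse-style response.
# Field names follow the Converse `usage` block; rates are illustrative.
def record_usage(response, usd_in_per_1k, usd_out_per_1k):
    u = response.get("usage", {})
    in_tok = u.get("inputTokens", 0)
    out_tok = u.get("outputTokens", 0)
    cost = in_tok / 1_000 * usd_in_per_1k + out_tok / 1_000 * usd_out_per_1k
    return {"in": in_tok, "out": out_tok, "usd": cost}
```

Emit this per request (tagged with app/team/env) and the "where did the bill come from" question answers itself.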

***

#### **Q2: We bought Provisioned Throughput (PT), but it’s not cheaper than on-demand.**

**✅ Solution**

* Calculate **PT breakeven** from p95 concurrency × tokens/min; right-size PT **units** to that line.
* Use **on-demand spillover** for peaks; avoid paying for idle PT hours.
* Set **utilization SLOs** and alerts; pause or reduce PT when traffic drops.
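Spillover itself is just a min/max split between committed capacity and on-demand:

```python
# Send traffic to committed PT capacity first; spill the excess to on-demand.
def split_traffic(requested_tps, pt_capacity_tps):
    to_pt = min(requested_tps, pt_capacity_tps)
    return {"pt_tps": to_pt, "on_demand_tps": max(0.0, requested_tps - to_pt)}
```

Sizing PT units to the p95 line and letting peaks spill keeps utilization high without dropping requests.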

***

#### **Q3: RAG (Knowledge Bases) costs are higher than expected.**

**✅ Solution**

* Tune **chunk size/overlap**, **topK**, and **filters**; retrieve fewer, more relevant chunks.
* **Pre-compute embeddings in batch**; re-embed only changed docs; dedupe and compress sources.
* Cache frequent answers; enforce **context budgets** (max tokens from RAG per query).
* Prefer **Retrieve → filtered context** over dumping full documents into the prompt.
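A context budget can be enforced after `Retrieve` by keeping only the best chunks that fit; the field names below are illustrative, not the KB response schema:

```python
# Keep only the highest-scoring retrieved chunks that fit a fixed token budget.
def fit_context(chunks, max_tokens):
    """chunks: dicts with 'score' and 'tokens' keys (field names are illustrative)."""
    kept, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + c["tokens"] <= max_tokens:
            kept.append(c)
            used += c["tokens"]
    return kept
```

This caps the RAG contribution to the prompt regardless of how generous topK is.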

***

#### **Q4: Agents keep looping and burning tokens/tool calls.**

**✅ Solution**

* Cap **max steps/time**, add **guard states**, and prefer **deterministic plans** over open-ended reasoning.
* Reduce the **toolset** to essentials; cache intermediate results across steps.
* Terminate on ambiguous goals; add user confirmation checkpoints for risky branches.

***

#### **Q5: Guardrails add cost without obvious benefit.**

**✅ Solution**

* Run **Guardrails before** expensive model calls; blocked requests incur guardrail cost only and **avoid model tokens**.
* Log **blocked categories**; fix noisy sources (spam inputs, malformed payloads) upstream.
* Keep guardrail policies **modular** (turn on only what you need).

***

#### **Q6: Model choice feels like guesswork and we overpay for “premium” models.**

**✅ Solution**

* Use **Evaluations** with a representative dataset; A/B **small vs. large** models and pick the **cheapest that meets SLOs**.
* Separate use cases: **classification/extraction/routing** on small models; escalate only when needed.
* Track **cost per 1K tokens** and **quality metrics** together; re-evaluate after prompt/RAG changes.
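Selection then reduces to "cheapest candidate above the quality bar"; the names, rates, and scores below are made up:

```python
# Choose the cheapest model whose eval score clears the quality floor.
def pick_model(candidates, min_quality):
    """candidates: (name, usd_per_1k_tokens, eval_score) tuples (all values illustrative)."""
    viable = [c for c in candidates if c[2] >= min_quality]
    return min(viable, key=lambda c: c[1])[0] if viable else None
```

Re-run the selection whenever prompts or RAG settings change, since those shifts can move a smaller model above the bar.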

***

#### **Q7: Latency SLOs push us to larger (pricier) models.**

**✅ Solution**

* Use **streaming (ConverseStream)** to improve **time-to-first-token** without upgrading models.
* Shorten prompts; use **few-shot** with compact exemplars; remove irrelevant context.
* Move heavy retrieval/enrichment to **Batch** or background jobs where possible.

***

#### **Q8: Flows orchestration is convenient, but node transitions are adding up.**

**✅ Solution**

* Consolidate nodes; combine lightweight transforms; **memoize** outputs within a flow (reuse results).
* Avoid unnecessary fan-out; prune dead branches; set **fail-fast** conditions.
* Monitor **node transitions per request**; budget transitions like tokens.

***

#### **Q9: Embeddings/vector store costs are creeping up.**

**✅ Solution**

* Batch embedding jobs; **re-embed only diffs** (changed pages/sections).
* Use **lower-dimensional** embeddings that meet recall needs; prune duplicate or low-value content.
* Apply **retention policies** on the vector store; tier old sources to cheaper storage.
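Diff detection can be as simple as hashing each document's content between ingestion runs:

```python
# Re-embed only documents whose content hash changed since the last run.
import hashlib

def docs_to_reembed(docs, seen_hashes):
    """docs: {doc_id: text}; seen_hashes: {doc_id: sha256 hex} persisted from the last run."""
    changed = []
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(doc_id) != h:
            changed.append(doc_id)
            seen_hashes[doc_id] = h
    return changed
```

Persist `seen_hashes` alongside the vector store so unchanged pages never re-enter the embedding pipeline.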

***

#### **Q10: Finance wants proof of savings and unit economics.**

**✅ Solution**

* Report before/after **cost per 1K tokens**, **avg prompt length**, **avg output length**, **KB hit rate**, **agent steps**, **Flows transitions**, and **PT utilization**.
* Tag every request (**app/team/env**) and break down costs by model/provider/Region.
* Keep a monthly **change log** (prompts, models, RAG settings) with measured deltas.
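The before/after headline reduces to one ratio; all figures below are sample data, not real spend:

```python
# Cost per 1K tokens, before vs after an optimization pass (sample figures only).
def cost_per_1k(total_usd, total_tokens):
    return total_usd / total_tokens * 1_000

before = cost_per_1k(12_000, 400_000_000)      # about $0.03 / 1K tokens
after = cost_per_1k(7_500, 380_000_000)        # about $0.0197 / 1K tokens
savings_pct = (before - after) / before * 100  # roughly a third cheaper per token
```

Reporting the unit rate rather than the raw bill keeps the story honest when traffic grows month over month.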

***

### ⚙️ Quick Wins (Bedrock)

* **Cap Output + Tighten Prompts**\
  Enforce `maxOutputTokens`, switch to **structured JSON** outputs, and trim system prompts/boilerplate.\
  *Impact*: 20–50% fewer output tokens in hours.
* **Route to Smaller Models First**\
  Use a light classifier/router to send only hard cases to premium models.\
  *Impact*: 25–60% token cost reduction on mixed workloads.
* **Turn On Prompt Caching**\
  Cache repeated system/preamble tokens for chat/workflows.\
  *Impact*: Up to \~90% discount on cached prefix tokens + faster latency.
* **Right-Size PT (or Turn It Off)**\
  Compute PT breakeven; set utilization SLOs; spill peaks to on-demand; pause or resize underused PT units.\
  *Impact*: Eliminates idle-hour waste immediately.
* **Batch for Offline Work**\
  Move backfills, enrichment, and scoring to **Batch** where supported.\
  *Impact*: \~50% lower inference unit cost vs on-demand.
* **RAG: Fewer, Better Chunks**\
  Reduce chunk size/overlap, tune `topK`, dedupe sources, and set a **context budget** (max RAG tokens/query).\
  *Impact*: 20–40% less context, fewer model tokens.
* **Embed in Bulk, Only the Diffs**\
  Batch embeddings; re-embed changed pages only; use lower-dim vectors that meet recall targets.\
  *Impact*: Cuts ingestion + vector store growth.
* **Guardrails Before Expensive Calls**\
  Filter inputs/outputs up front so blocked requests skip model tokens.\
  *Impact*: Prevents pure-waste generations.
* **Agents: Cap Steps & Memoize**\
  Set hard limits on steps/time, prune toolset, and cache intermediate results between steps.\
  *Impact*: 20–40% fewer tool/model calls on agent flows.
* **Flows: Watch Node Transitions**\
  Consolidate trivial transforms, remove dead branches, fail fast; budget transitions per request.\
  *Impact*: Immediate savings on orchestration charges.
* **Dashboards, Tags, Budgets**\
  Tag every call (`app/team/env`). Add budgets/alerts for **tokens**, **PT hours**, **KB retrieval**, and **Flows transitions**.\
  *Impact*: Faster detection of regressions & anomalies.
* **Separation of Use Cases**\
  Split workloads (routing/extraction vs reasoning/chat) so each uses the **cheapest capable model**.\
  *Impact*: Keeps “premium” models off commodity tasks.

***

### 📚 References

* [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)
* [What is Amazon Bedrock?](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html)
* [Get started with Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html)
* [Amazon Bedrock API Reference](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)
* [Tutorial: Building a simple Amazon Bedrock agent](https://docs.aws.amazon.com/bedrock/latest/userguide/agent-tutorial.html)
* [Get started with the API](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started-api.html)
* [Setting up AWS Bedrock for API-based text inference](https://medium.com/%40harangpeter/setting-up-aws-bedrock-for-api-based-text-inference-dc25ab2b216b)
