# Amazon Bedrock

### 🔗 **Quicklinks (Bookmark):**

**Coming Soon**

**Amazon Bedrock** is AWS’s fully managed platform for building generative-AI apps with a catalog of foundation models (FMs) plus managed RAG (Knowledge Bases), Agents, **Flows**, Guardrails, Evaluations, and a unified Converse API. It’s powerful — but it’s easy to overspend on **tokens in/out**, **Provisioned Throughput (PT)** hours, **RAG ingestion/retrieval**, **agent/tool loops**, and **Flows node transitions**.

→ What you’re using&#x20;

→ What you’re paying&#x20;

→ What you should be doing&#x20;

→ AWS-native tools to make it happen.

***

### 🚀 What is Bedrock?

**Amazon Bedrock** gives you one API/console to access models from multiple providers (e.g., Amazon Nova/Titan, Anthropic, Meta, Mistral, AI21, Cohere, Stability) and tools to customize, evaluate, secure, and operate gen-AI workloads. Pricing for inference is **token-based** (for text) or **per-asset** (for images/video), with **Batch** and **PT** options for specific workload shapes.&#x20;

**Features**

* **Model choice via one API** (text, vision, image, embeddings; some video).&#x20;
* **Knowledge Bases** for managed RAG (Retrieve / RetrieveAndGenerate).&#x20;
* **Agents / AgentCore** for tool use and multi-step workflows.
* **Flows** for visually orchestrating prompts/agents/KBs/Guardrails/AWS services — **priced per node transition**.&#x20;
* **Guardrails** (safety/PII/topic grounding) with reduced pricing for content filters.
* **Prompt Caching** & **Intelligent Prompt Routing** to cut token costs and latency.&#x20;

***

### 🧩 Components — pick the right ones

| Component                     | What it does                                           | Best for                         | Cost signals                                                                                                      |
| ----------------------------- | ------------------------------------------------------ | -------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Foundation Models (FMs)**   | Text/chat, vision, image, embeddings (some video)      | Core LLM tasks                   | **Tokens in/out** (text) or per-asset (image/video), per model/Region.                                            |
| **Knowledge Bases (RAG)**     | Managed retrieval + `Retrieve` / `RetrieveAndGenerate` | Ground responses on private data | Pay for **ingestion/retrieval**, vector store, and any inference; the KB feature itself isn’t separately metered. |
| **Agents / AgentCore**        | Tool calling & multi-step plans                        | Automations & API workflows      | Each step can trigger **model calls + tools**; watch recursion.                                                   |
| **Flows**                     | Visual builder for agentic apps                        | Orchestrating complex workflows  | **$0.035 / 1K node transitions** + underlying services (effective 2025-02-01).                                    |
| **Guardrails**                | Safety/PII/topic/grounding filters                     | Enterprise controls              | **$0.15 / 1K text units** (content filters) after 2024 price cut; blocked inputs avoid downstream model charges.  |
| **Evaluations**               | Auto evals for prompts/models/RAG                      | Model selection & tuning         | Pay for generator/evaluator tokens and any KB retrieval.                                                          |
| **Converse / ConverseStream** | Unified chat API & streaming                           | Portability across providers     | No separate fee; standard model pricing applies.                                                                  |

***

### ⚙️ Inference & deployment options

| Mode                                  | When to use                 | Notes                                                                                                      |
| ------------------------------------- | --------------------------- | ---------------------------------------------------------------------------------------------------------- |
| **On-Demand**                         | Prototyping; variable usage | Billed by **tokens** (text) or **per image/video/embedding**.                                              |
| **Batch**                             | Large offline jobs          | Select models available at **\~50% lower** than on-demand; outputs to S3.                                  |
| **Provisioned Throughput (PT)**       | Steady, high TPS with SLOs  | **Hourly per model unit** with 1- or 6-month commitments; discount vs on-demand; under-utilization wastes. |
| **Custom Model Import / Fine-tuning** | Domain-specific apps        | Import free; pay for inference (on-demand or PT) and any customization/storage.                            |

***

### 🧠 Bedrock optimization strategy (FinOps + reliability)

**Pick the right model & mode**

* Start with **smaller/cheaper models** for classify/extract/routing; escalate only where impact is proven.
* For steady traffic, compute **PT breakeven** from tokens/min & concurrency; for bulk jobs, trial **Batch** (often \~50% less).&#x20;

**Control tokens**

* Shorten system/prompt/context; enforce **`maxOutputTokens`**; favor **structured outputs**.
* Use **Knowledge Bases** to retrieve only what’s needed; tune **chunking/topK/filters** to minimize context size.&#x20;
* Turn on **Prompt Caching** for repeat prefixes (up to **90% cost reduction** on cached tokens).&#x20;

**Guardrails first**

* Run **Guardrails** before expensive model calls; blocked prompts incur **guardrail cost only** and save inference spend.&#x20;

**Agents/Flows with budgets**

* Keep toolsets minimal; cap steps/time; avoid recursive loops; prefer deterministic plans; monitor **node transitions** in Flows.&#x20;

**Evaluate continuously**

* Use **Evaluations** to A/B prompts/models/RAG; keep a **champion/challenger** cadence.&#x20;

***

### 💸 Pricing model — where the money goes

* **Model inference**: **input/output tokens** (text) or **per-asset** (image/video); embeddings by tokens; varies by provider/Region.&#x20;
* **Batch**: \~**50% lower** vs on-demand for supported models (Anthropic, Meta, Mistral, Amazon).&#x20;
* **Provisioned Throughput**: **hourly per model unit**, 1 or 6-month terms; discounted vs on-demand.&#x20;
* **Knowledge Bases**: pay for **ingestion/retrieval** and vector store; KB feature itself not billed separately.&#x20;
* **Guardrails**: content filters priced at **$0.15 / 1K text units** after 2024 price cut.&#x20;
* **Flows**: **$0.035 / 1K node transitions** + any underlying services (e.g., Guardrails, model calls).&#x20;
* **Prompt Caching**: cached-token reads up to **90% discount**; model-specific write rules.&#x20;

> Model prices, context limits, and availability vary by Region/provider — always validate on the official pricing page for your models.

***

### ⏱️ Automated management

* **Budgets/alerts** on **tokens**, **PT hours**, **KB ingestion/retrieval**, and **Flows transitions**.
* **Quotas** per team (daily tokens, max output length).
* **EventBridge** jobs to rotate **prompt templates**, auto-pause PT at end of trials, and archive logs.
* Tag **app/team/env** in request metadata → feed **Cost Explorer/CUR** dashboards.&#x20;

***

### 🔒 Security & compliance

* Access Bedrock and AgentCore privately with **VPC interface endpoints** (PrivateLink).&#x20;
* Apply **Guardrails** for safety/PII/topic and grounding; monitor interventions.&#x20;
* Keep data in-Region and encrypted (KMS). Review retention settings per model/provider.

***

### 📊 Monitoring & tools

* **CloudWatch/CloudTrail** around Bedrock APIs, KB ingestion, Agent actions, and **Flows**.
* **Cost Explorer / CUR + Budgets**: segment by **model / provider / Region / app tag**.&#x20;
* **Evaluations**: track **faithfulness, correctness, completeness** during A/Bs.
* Ops dashboards: latency, time-to-first-token, output length, error rates, **agent step counts**, **Flows transitions**.

***

### ✅ Bedrock FinOps Checklist

* [ ] Choose **On-Demand vs Batch vs PT** by **workload shape**; set **per-workload budgets**.&#x20;
* [ ] Standardize on **Converse/ConverseStream** for portability & streaming.
* [ ] Use **Knowledge Bases** to trim context; tune **chunking/topK/filters**.
* [ ] Put **Guardrails** in front of expensive calls; log blocked prompts/responses.&#x20;
* [ ] For steady QPS, **price out PT** and **commit only what you’ll use**; monitor utilization.&#x20;
* [ ] Turn on **Prompt Caching** for repeat prefixes; target cacheable prompt budgets.&#x20;
* [ ] Watch **agent step counts** and **Flows transitions**; cap recursion/timeouts.
* [ ] Tag every call (app/team/env); build dashboards for **tokens**, **PT**, **KB retrieval**, **Flows**.&#x20;

***

### 🧠 AWS Bedrock Cost Optimization Challenges (Q\&A)

Cloud GenAI spend tends to balloon from **tokens**, **under-utilized PT**, **RAG retrieval/ingestion**, **agent loops**, and **Flows node transitions**. Here are the toughest issues teams hit — and how to fix them fast.

***

#### **Q1: Our token bill exploded right after launch. Where do we start?**

**✅ Solution**

* Instrument **prompt length, context size, and output tokens** per request.
* Enforce **`maxOutputTokens`**, trim system prompts, remove boilerplate, and switch to **structured output** (JSON schemas) to constrain verbosity.
* Start on a **smaller/cheaper model**; route only hard cases to a larger model (router or classifier in front).
* Turn on **Prompt Caching** for repeated prefixes/system prompts to cut repeat token costs.

***

#### **Q2: We bought Provisioned Throughput (PT), but it’s not cheaper than on-demand.**

**✅ Solution**

* Calculate **PT breakeven** from p95 concurrency × tokens/min; right-size PT **units** to that line.
* Use **on-demand spillover** for peaks; avoid paying for idle PT hours.
* Set **utilization SLOs** and alerts; pause or reduce PT when traffic drops.

***

#### **Q3: RAG (Knowledge Bases) costs are higher than expected.**

**✅ Solution**

* Tune **chunk size/overlap**, **topK**, and **filters**; retrieve fewer, more relevant chunks.
* **Pre-compute embeddings in batch**; re-embed only changed docs; dedupe and compress sources.
* Cache frequent answers; enforce **context budgets** (max tokens from RAG per query).
* Prefer **Retrieve → filtered context** over dumping full documents into the prompt.

***

#### **Q4: Agents keep looping and burning tokens/tool calls.**

**✅ Solution**

* Cap **max steps/time**, add **guard states**, and prefer **deterministic plans** over open-ended reasoning.
* Reduce the **toolset** to essentials; cache intermediate results across steps.
* Terminate on ambiguous goals; add user confirmation checkpoints for risky branches.

***

#### **Q5: Guardrails add cost without obvious benefit.**

**✅ Solution**

* Run **Guardrails before** expensive model calls; blocked requests incur guardrail cost only and **avoid model tokens**.
* Log **blocked categories**; fix noisy sources (spam inputs, malformed payloads) upstream.
* Keep guardrail policies **modular** (turn on only what you need).

***

#### **Q6: Model choice feels like guesswork and we overpay for “premium” models.**

**✅ Solution**

* Use **Evaluations** with a representative dataset; A/B **small vs. large** models and pick the **cheapest that meets SLOs**.
* Separate use cases: **classification/extraction/routing** on small models; escalate only when needed.
* Track **cost per 1K tokens** and **quality metrics** together; re-evaluate after prompt/RAG changes.

***

#### **Q7: Latency SLOs push us to larger (pricier) models.**

**✅ Solution**

* Use **streaming (ConverseStream)** to improve **time-to-first-token** without upgrading models.
* Shorten prompts; use **few-shot** with compact exemplars; remove irrelevant context.
* Move heavy retrieval/enrichment to **Batch** or background jobs where possible.

***

#### **Q8: Flows orchestration is convenient, but node transitions are adding up.**

**✅ Solution**

* Consolidate nodes; combine lightweight transforms; **memoize** outputs within a flow (reuse results).
* Avoid unnecessary fan-out; prune dead branches; set **fail-fast** conditions.
* Monitor **node transitions per request**; budget transitions like tokens.

***

#### **Q9: Embeddings/vector store costs are creeping up.**

**✅ Solution**

* Batch embedding jobs; **re-embed only diffs** (changed pages/sections).
* Use **lower-dimensional** embeddings that meet recall needs; prune duplicate or low-value content.
* Apply **retention policies** on the vector store; tier old sources to cheaper storage.

***

#### **Q10: Finance wants proof of savings and unit economics.**

**✅ Solution**

* Report before/after **cost per 1K tokens**, **avg prompt length**, **avg output length**, **KB hit rate**, **agent steps**, **Flows transitions**, and **PT utilization**.
* Tag every request (**app/team/env**) and break down costs by model/provider/Region.
* Keep a monthly **change log** (prompts, models, RAG settings) with measured deltas.

***

### ⚙️ Quick Wins (Bedrock)

* **Cap Output + Tighten Prompts**\
  Enforce `maxOutputTokens`, switch to **structured JSON** outputs, and trim system prompts/boilerplate.\
  \&#xNAN;*Impact*: 20–50% fewer output tokens in hours.
* **Route to Smaller Models First**\
  Use a light classifier/router to send only hard cases to premium models.\
  \&#xNAN;*Impact*: 25–60% token cost reduction on mixed workloads.
* **Turn On Prompt Caching**\
  Cache repeated system/preamble tokens for chat/workflows.\
  \&#xNAN;*Impact*: Up to \~90% discount on cached prefix tokens + faster latency.
* **Right-Size PT (or Turn It Off)**\
  Compute PT breakeven; set utilization SLOs; spill peaks to on-demand; pause or resize underused PT units.\
  \&#xNAN;*Impact*: Eliminates idle-hour waste immediately.
* **Batch for Offline Work**\
  Move backfills, enrichment, and scoring to **Batch** where supported.\
  \&#xNAN;*Impact*: \~50% lower inference unit cost vs on-demand.
* **RAG: Fewer, Better Chunks**\
  Reduce chunk size/overlap, tune `topK`, dedupe sources, and set a **context budget** (max RAG tokens/query).\
  \&#xNAN;*Impact*: 20–40% less context, fewer model tokens.
* **Embed in Bulk, Only the Diffs**\
  Batch embeddings; re-embed changed pages only; use lower-dim vectors that meet recall targets.\
  \&#xNAN;*Impact*: Cuts ingestion + vector store growth.
* **Guardrails Before Expensive Calls**\
  Filter inputs/outputs up front so blocked requests skip model tokens.\
  \&#xNAN;*Impact*: Prevents pure-waste generations.
* **Agents: Cap Steps & Memoize**\
  Set hard limits on steps/time, prune toolset, and cache intermediate results between steps.\
  \&#xNAN;*Impact*: 20–40% fewer tool/model calls on agent flows.
* **Flows: Watch Node Transitions**\
  Consolidate trivial transforms, remove dead branches, fail fast; budget transitions per request.\
  \&#xNAN;*Impact*: Immediate savings on orchestration charges.
* **Dashboards, Tags, Budgets**\
  Tag every call (`app/team/env`). Add budgets/alerts for **tokens**, **PT hours**, **KB retrieval**, and **Flows transitions**.\
  \&#xNAN;*Impact*: Faster detection of regressions & anomalies.
* **Separation of Use Cases**\
  Split workloads (routing/extraction vs reasoning/chat) so each uses the **cheapest capable model**.\
  \&#xNAN;*Impact*: Keeps “premium” models off commodity tasks.

***

### 📚 References

* [Amazon Bedrock Documentation](https://docs.aws.amazon.com/bedrock/)
* [What is Amazon Bedrock?](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html)
* [Get started with Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html)
* [Amazon Bedrock API Reference](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)
* [Tutorial: Building a simple Amazon Bedrock agent](https://docs.aws.amazon.com/bedrock/latest/userguide/agent-tutorial.html)
* [Get started with the API](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started-api.html)
* [Setting up AWS Bedrock for API-based text inference](https://medium.com/%40harangpeter/setting-up-aws-bedrock-for-api-based-text-inference-dc25ab2b216b)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://aws.cloudshim.com/aws-top-services/amazon-bedrock.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
