# Amazon Redshift

Amazon Redshift is AWS’s cloud data warehousing workhorse—distributed SQL analytics at (multi-)petabyte scale with tight AWS integrations. It’s also easy to overspend without discipline—**compute left running**, **managed storage growth**, **Spectrum scans on raw files**, and **inefficient queries** can snowball.

This page walks through: what you’re using → what you’re paying → what you should be doing → the AWS-native tools to make it happen.

***

### 🚀 What is Redshift?

**Amazon Redshift** is a fully managed, columnar, MPP (massively parallel processing) data warehouse. It runs ANSI SQL, plugs into mainstream BI/ETL tools, and supports both **provisioned** clusters and **Serverless** (auto-scaling) workgroups.

**Highlights**

* **Workload isolation**: multiple warehouses/workgroups & data sharing to separate ELT, BI, and data science.
* **Zero-ETL & native connectors**: ingest from Aurora/RDS, DynamoDB, Kinesis/MSK streams, S3, and more.
* **Open table/query**: query open formats in **S3 via Spectrum**; integrate with **Lake Formation**.
* **Performance features**: **AQUA** (hardware-accelerated caching), **materialized views**, **result cache**, **SQA/Concurrency Scaling**.
* **ML & GenAI hooks**: Redshift ML (SageMaker under the hood) and integrations for LLM workloads.
* **Serverless**: no cluster sizing; pay for actual compute seconds and managed storage.

***

### ⚙️ Node types — pick the right hammer

Choose based on how independently you need to scale **compute** vs **storage**, and whether your dataset is small & hot or large & growing.

| Family                                                      | Best for                                                         | Notes                                                                                                                                                        |
| ----------------------------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **RA3** (e.g., `ra3.xlplus`, `ra3.4xlarge`, `ra3.16xlarge`) | Most new deployments; variable or growing storage                | **Redshift Managed Storage (RMS)** separates storage from compute (SSD + S3 hierarchy). Elastic resize; AQUA eligible. Storage billed separately from nodes. |
| **DC2** (e.g., `dc2.large`, `dc2.8xlarge`)                  | Small, compute-bound warehouses with roughly sub-TB compressed datasets | Local NVMe SSD; fixed storage per node. Good for tight, hot datasets; otherwise prefer RA3.                                                                  |

> New build? Start with **RA3** unless you have a strong reason not to.

***

### 🧬 Deployment & topology

| Mode                      | Architecture                           | Use when                                     | Notes                                                                                   |
| ------------------------- | -------------------------------------- | -------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Provisioned (cluster)** | Leader node + compute nodes            | Steady/known workloads, fine-grained control | **Resize** (elastic for speed, classic for node-type changes) to scale; **pause/resume** to save cost off-hours. |
| **Serverless**            | No nodes to manage; per-second compute | Spiky/unknown demand, many small teams       | Auto-scales; isolate by **workgroup**; set **base capacity** & **cost controls**.       |
| **Single-node**           | Leader+compute combined                | Dev/test, POCs                               | Not for prod.                                                                           |
| **Multi-node**            | 1 leader + N compute                   | Production                                   | Distribution/Sort Keys still matter (even with ATO).                                    |

> Use **data sharing** to expose curated schemas to other accounts/workgroups without copying.

***

### 🧠 Redshift optimization strategy (FinOps + performance)

**Quick wins**

* **Pause** non-prod clusters nightly/weekends; for Serverless, set **concurrency/cost limits**.
* **Right-size compute**: on provisioned, prefer **elastic resize** over permanent upsizing; on Serverless, tune **base capacity**.
* **Separate workloads**: isolate ELT from BI with **data sharing** or dedicated workgroups to avoid noisy neighbors.
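The pause/resume quick win is easy to automate. A minimal sketch of the scheduling logic (the quiet-hours window is an assumption; in practice a scheduled Lambda would pair this with boto3's real `pause_cluster`/`resume_cluster` Redshift APIs):

```python
from datetime import time

def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True when `now` falls inside the quiet window; handles windows
    that wrap past midnight (e.g., 20:00 -> 06:00)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

# A scheduled Lambda could call this, then invoke
# boto3.client("redshift").pause_cluster(ClusterIdentifier=...) or
# resume_cluster(...) accordingly (cluster identifier is yours to fill in).
```

Pausing stops compute billing but not managed-storage billing, so this helps most on compute-heavy non-prod clusters.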

**Tables & storage**

* Let **Automatic Table Optimization (ATO)** pick **distribution & sort keys** and compression; override only when profiling shows a measurably better choice.
* Regularly **VACUUM/ANALYZE** (or verify auto-maintenance is doing its job), especially after large deletes/copies.
* Use **materialized views** for heavy joins/aggregations; enable **auto-refresh** on incremental loads.
* Prune managed storage: drop/partition cold data; unload snapshots to S3 (Parquet) and query via Spectrum.
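To decide which tables are worth a manual VACUUM, you can scan `SVV_TABLE_INFO` output. A hedged sketch of the triage logic (the column names mirror the real view; the thresholds are illustrative, not AWS guidance):

```python
def needs_vacuum(stats: dict, max_unsorted_pct: float = 10.0,
                 min_rows: int = 1_000_000) -> bool:
    """Flag a table for manual VACUUM. `stats` mimics one row of
    SVV_TABLE_INFO ('tbl_rows', 'unsorted' = percent unsorted);
    thresholds here are illustrative defaults, tune for your workload."""
    unsorted = stats["unsorted"] or 0  # the view reports NULL as None
    return stats["tbl_rows"] >= min_rows and unsorted > max_unsorted_pct
```

Small tables rarely justify a manual VACUUM; auto-vacuum usually keeps them sorted on its own.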

**Spectrum & lake**

* Store S3 data as **columnar** (Parquet/ORC) with **partitioning** and compression; avoid tiny files; push down predicates.
* Control scans with proper **WHERE** clauses and partition filters; use **Lake Formation** for centralized permissions.
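Partition pruning is the whole cost story for Spectrum: you pay for bytes scanned, and a partition filter removes entire prefixes from the scan. A toy model of that effect (partition names and sizes are sample data):

```python
def spectrum_scan_bytes(partition_sizes: dict[str, int],
                        wanted: set[str]) -> int:
    """Bytes Spectrum would scan after partition pruning. Keys are
    partition values (e.g., 'dt=2025-01-01'); values are the compressed
    Parquet bytes under each S3 prefix (all example numbers)."""
    return sum(size for part, size in partition_sizes.items()
               if part in wanted)
```

A `WHERE dt = '2025-01-02'` that matches the partition column scans one prefix instead of all of them; an unpartitioned filter scans everything.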

**Concurrency & query control**

* Enable **Short Query Acceleration (SQA)** for sub-second BI queries.
* Use **Workload Management (WLM)** & **Query Monitoring Rules (QMR)** to throttle outliers and protect SLAs.
* Watch the **result cache** hit rate; design dashboards to re-use results.
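Conceptually, QMR is a list of metric/threshold/action predicates evaluated against each running query. Real QMR predicates (e.g., `query_execution_time`, `scan_row_count`) are configured in WLM, not in application code; this sketch only illustrates the model with made-up thresholds:

```python
# Hypothetical QMR-style rules: (metric, threshold, action).
RULES = [
    ("query_execution_time", 300, "abort"),        # seconds
    ("scan_row_count", 1_000_000_000, "log"),      # rows scanned
]

def triggered_actions(metrics: dict) -> list[str]:
    """Return the actions whose metric exceeded its threshold."""
    return [action for metric, limit, action in RULES
            if metrics.get(metric, 0) > limit]
```

In production you would express the same predicates as WLM JSON and let Redshift enforce them; the value of the mental model is knowing which runaway behavior each rule catches.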

**Governance**

* Track warehouse usage by **tags** (team/app/env); alert on cost/throughput spikes via Budgets + CloudWatch.

***

### 💸 Purchase model optimization

| Model                       | Use when                                        | How to save                                                                                              |
| --------------------------- | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| **On-Demand (provisioned)** | Variable but mostly steady                      | Combine **pause/resume** with **elastic resize**.                                                        |
| **Reserved**                | Predictable baselines (months+ of steady usage) | 1- or 3-year commitments can significantly reduce compute cost; keep a fraction on On-Demand for bursts. |
| **Serverless**              | Spiky, unpredictable, many small tenants        | Pay per compute second & RMS; set **workgroup limits** and budgets.                                      |
| **Concurrency Scaling**     | Short peaks                                     | Clusters earn free Concurrency Scaling credits (roughly one hour per day of cluster uptime); once credits are exhausted it bills per-second, so monitor usage.  |
| **Spectrum**                | Query S3 data directly                          | Charged by **data scanned**—use Parquet/ORC + partitions to cut scan costs dramatically.                 |

> Avoid hard-coding prices in docs; model with **AWS Pricing Calculator** and keep Region-specific sheets.
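The Reserved-vs-On-Demand decision mostly reduces to utilization: a reservation bills every hour whether or not the cluster runs, while On-Demand plus pause/resume bills only while it runs. A sketch of that break-even comparison (rates are placeholders; pull real numbers from the AWS Pricing Calculator):

```python
HOURS_IN_MONTH = 730.0  # common AWS billing convention

def cheaper_model(od_rate: float, ri_rate: float, hours_on: float) -> str:
    """Compare On-Demand (billed only while running, assuming
    pause/resume discipline) against a reservation (billed every hour).
    Both rates are per-hour placeholders, not real prices."""
    on_demand = od_rate * hours_on
    reserved = ri_rate * HOURS_IN_MONTH
    return "reserved" if reserved < on_demand else "on-demand"
```

The intuition: the higher your uptime fraction, the more a reservation wins; below the break-even utilization, stay On-Demand and pause aggressively.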

***

### 🧩 Cluster configurations & migration tips

* **DC2 → RA3**: migrate when storage growth or mixed hot/cold data makes local SSD a constraint. Re-test distribution keys; **AQUA** activates automatically on supported RA3 node types.
* **Single → multi-node**: for HA and parallelism; mind **DISTKEY/SORTKEY** and skew.
* **Provisioned → Serverless**: great for labs or multi-team self-service; set guardrails first.
* **Multiple warehouses**: use **data sharing** to fan out read-only access across teams without duplicating data.

***

### 📊 Additional optimization tips

* **ETL/ELT**: prefer **COPY** from staged Parquet in S3; batch writes; avoid single-row INSERT loops.
* **Compression & encoding**: keep column encodings fresh (ATO/`ANALYZE COMPRESSION`).
* **Skew busting**: check `SVV_TABLE_INFO`, `SVL_QUERY_SUMMARY` for distribution & query skew.
* **Result set limits**: cap BI exports/SELECT \* patterns; paginate.
* **Cost hygiene**: auto-purge orphaned snapshots, unload rarely used tables to S3, and retire stale materialized views.
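The skew check above boils down to one ratio: how full the fullest slice is relative to the emptiest. A rough analogue of `SVV_TABLE_INFO`'s `skew_rows` column, computed from per-slice row counts (sample numbers only):

```python
def skew_rows(rows_per_slice: list[int]) -> float:
    """Ratio of the fullest slice to the emptiest, like the skew_rows
    column in SVV_TABLE_INFO. Values far above 1 suggest a poor DISTKEY
    (e.g., a key with few distinct or heavily repeated values)."""
    return max(rows_per_slice) / max(min(rows_per_slice), 1)
```

A skewed table forces the hot slice to do most of the scan work, so query runtime degrades toward single-slice speed even on a large cluster.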

***

### 🔒 Security & compliance

* **Encryption**: at rest with KMS; in transit with TLS.
* **Network**: VPC-only; **Enhanced VPC Routing** for deterministic egress paths.
* **Access control**: IAM federation + Redshift RBAC; **row-level** and **column-level** security; data masking for sensitive columns.
* **Audit**: system tables & STL/SVL logs; export to S3/CloudWatch; integrate with Lake Formation for lake-wide governance.

***

### 📈 Monitoring & tools

* **Amazon Redshift Console**: performance insights, Advisor, auto-tuning status.
* **CloudWatch metrics/alarms**: CPU %, WLM queue length, query runtime, read/write IOPS, Spectrum bytes scanned.
* **System views**: `STL_QUERY`, `SVL_QLOG`, `SVV_TABLE_INFO`, `SVL_S3QUERY_SUMMARY`, `SVL_STATEMENTTEXT`.
* **Cost Explorer / CUR**: tag by team/workload; watch **RMS GB-months**, **Spectrum TB scanned**, **Concurrency Scaling seconds**.
* **AWS Budgets**: caps/alerts per cluster or workgroup.
* **dbt/Glue**: model governance; auto-docs, tests, and lineage.
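CloudWatch alarms on these metrics are straightforward to codify. A sketch of the parameter dict for the real `put_metric_alarm` API (the namespace, metric, and dimension names are the ones Redshift actually publishes; the alarm name and threshold are placeholders):

```python
def cpu_alarm_params(cluster_id: str, threshold_pct: float = 85.0) -> dict:
    """Parameters shaped for CloudWatch's put_metric_alarm. The
    AWS/Redshift namespace, CPUUtilization metric, and
    ClusterIdentifier dimension are real; everything else is a
    placeholder to adapt."""
    return {
        "AlarmName": f"redshift-{cluster_id}-cpu-high",
        "Namespace": "AWS/Redshift",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint
        "EvaluationPeriods": 3,     # sustained breach, not a blip
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params("prod-dw"))
```

Three consecutive five-minute breaches before alarming filters out transient ELT spikes while still catching sustained saturation.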

***

### 🧪 Practical selection cheat-sheet

* **Small, hot dataset; steady BI** → **DC2** (if it truly fits) or small **RA3**; materialized views + SQA.
* **Growing lake + warehouse** → **RA3 + Spectrum**; store raw in Parquet; data share curated marts.
* **Spiky analytics / many teams** → **Serverless** with workgroup guardrails and budgets.
* **Heavy joins & wide scans** → Consider **AQUA**, sort keys, and MVs; verify distribution style and skew.

***

### ✅ Checklist

* [ ] Start on **RA3** (or Serverless); only pick DC2 for tiny, fixed, compute-bound datasets.
* [ ] Turn on **pause/resume** (provisioned) or **workgroup limits** (Serverless).
* [ ] Use **Parquet + partitions** in S3; keep files moderately sized (128–512 MB).
* [ ] Let **ATO** manage dist/sort/encoding; override only with evidence.
* [ ] Enable **SQA**, set **WLM** queues, and add **QMR** for runaway queries.
* [ ] Track **RMS growth**, **Spectrum scan bytes**, **Concurrency Scaling** seconds, and snapshot counts.
* [ ] Share data via **data sharing**, not copies.
* [ ] Tag everything (team/app/env) and wire **Budgets** alerts.

***

### References (add your org’s canonical links)

* Redshift pricing & calculator (provisioned, Serverless, RMS, Spectrum)
* Best practices: distribution/sort keys, ATO, VACUUM/ANALYZE, materialized views
* Spectrum & Lake Formation setup (partitions, Parquet/ORC, governance)
* Concurrency Scaling, SQA, WLM & QMR runbooks
* Redshift Advisor & performance dashboards; dbt project template

> *Features and limits evolve—always validate in the AWS console/docs for your Region before rollout.*
