Amazon Redshift
Amazon Redshift is AWS’s cloud data warehousing workhorse—distributed SQL analytics at (multi-)petabyte scale with tight AWS integrations. It’s also easy to overspend without discipline—compute left running, managed storage growth, Spectrum scans on raw files, and inefficient queries can snowball.
What you’re using → what you’re paying → what you should be doing → AWS-native tools to make it happen.
🚀 What is Redshift?
Amazon Redshift is a fully managed, columnar, MPP (massively parallel processing) data warehouse. It runs ANSI SQL, plugs into mainstream BI/ETL tools, and supports both provisioned clusters and Serverless (auto-scaling) workgroups.
Highlights
Workload isolation: multiple warehouses/workgroups & data sharing to separate ELT, BI, and data science.
Zero-ETL & native connectors: ingest from Aurora/RDS, DynamoDB, Kinesis/MSK streams, S3, and more.
Open table/query: query open formats in S3 via Spectrum; integrate with Lake Formation.
Performance features: AQUA (hardware-accelerated caching), materialized views, result cache, SQA/Concurrency Scaling.
ML & GenAI hooks: Redshift ML (SageMaker under the hood) and integrations for LLM workloads.
Serverless: no cluster sizing; pay for actual compute seconds and managed storage.
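As a rough sketch of the Serverless billing model above, total cost is compute (RPU-seconds, aggregated here to RPU-hours) plus managed storage. The rates below are placeholders, not real prices; pull your Region's rates from the AWS Pricing Calculator.

```python
# Back-of-envelope Redshift Serverless cost model (rates are PLACEHOLDERS --
# look up the real per-RPU-hour and RMS GB-month prices for your Region).
def serverless_monthly_cost(rpu_hours, rms_gb,
                            rpu_hour_rate=0.375,      # hypothetical $/RPU-hour
                            rms_gb_month_rate=0.024):  # hypothetical $/GB-month
    """Compute = RPU-hours consumed * rate; storage = RMS GB-months * rate."""
    return rpu_hours * rpu_hour_rate + rms_gb * rms_gb_month_rate

# e.g. 8 base RPUs busy ~2 h/day for 30 days, with 500 GB of managed storage:
cost = serverless_monthly_cost(rpu_hours=8 * 2 * 30, rms_gb=500)
```

Note that idle time costs nothing on the compute side, which is the whole appeal for spiky workloads.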
⚙️ Node types — pick the right hammer
Choose based on how independently you need to scale compute vs storage, and whether your dataset is small & hot or large & growing.
RA3 (e.g., ra3.xlplus, ra3.4xlarge, ra3.16xlarge)
Most new deployments; variable or growing storage
Redshift Managed Storage (RMS) separates storage from compute (SSD + S3 hierarchy). Elastic resize; AQUA eligible. Storage billed separately from nodes.
DC2 (e.g., dc2.large, dc2.8xlarge)
Small, compute-bound warehouses whose compressed data stays around a terabyte or less
Local NVMe SSD; fixed storage per node. Good for tight, hot datasets—otherwise prefer RA3.
New build? Start with RA3 unless you have a strong reason not to.
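The decision rule above can be condensed into a toy heuristic. This is an illustration of the guidance, not AWS policy; the ~1 TB threshold is an assumption from the table above.

```python
def pick_node_family(compressed_tb, growth_expected, compute_bound):
    """Toy node-family chooser mirroring the guidance above (a heuristic only):
    DC2 fits small, hot, compute-bound data on local SSD; everything else -> RA3,
    which separates storage (RMS) from compute."""
    if compressed_tb < 1 and compute_bound and not growth_expected:
        return "dc2"
    return "ra3"
```

Run it against your own profile; when in doubt the tie-breaker is RA3, since RMS absorbs growth without node changes.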
🧬 Deployment & topology
Provisioned (cluster)
Leader node + compute nodes
Steady/known workloads, fine-grained control
Resize (elastic or classic) to scale; pause/resume to save cost off-hours.
Serverless
No nodes to manage; per-second compute
Spiky/unknown demand, many small teams
Auto-scales; isolate by workgroup; set base capacity & cost controls.
Single-node
Leader+compute combined
Dev/test, POCs
Not for prod.
Multi-node
1 leader + N compute
Production
Distribution/Sort Keys still matter (even with ATO).
Use data sharing to expose curated schemas to other accounts/workgroups without copying.
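To see why pause/resume matters for provisioned clusters, the arithmetic is simple: compute billing stops while paused (managed storage keeps billing), so savings scale with idle hours.

```python
def pause_savings_fraction(active_hours_per_week):
    """Fraction of provisioned COMPUTE cost avoided by pausing a cluster
    outside active hours. Storage (RMS) continues to bill while paused."""
    return 1 - active_hours_per_week / 168  # 168 hours in a week

# A dev cluster used 10 h/day on weekdays (50 h/week) pauses the rest:
saving = pause_savings_fraction(50)  # roughly 70% of compute cost avoided
```

The same arithmetic argues for Serverless on very bursty workloads, where the "pause" is automatic.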
🧠 Redshift optimization strategy (FinOps + performance)
Quick wins
Pause non-prod clusters nightly/weekends; for Serverless, set concurrency/cost limits.
Right-size compute: on provisioned, prefer elastic resize over permanent upsizing; on Serverless, tune base capacity.
Separate workloads: isolate ELT from BI with data sharing or dedicated workgroups to avoid noisy neighbors.
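Nightly pause/resume can be automated with Redshift scheduled actions. The sketch below only builds the parameter dicts for the `create_scheduled_action` API; the field names follow the boto3 documentation as I understand it, so verify against your SDK version before wiring it up.

```python
def pause_resume_schedule(cluster_id, pause_cron, resume_cron, iam_role_arn):
    """Build parameter dicts for Redshift's create_scheduled_action API
    (one pause action, one resume action). Shape per boto3 docs -- verify
    before use; no AWS call is made here."""
    def action(name, target, cron):
        return {
            "ScheduledActionName": name,
            "TargetAction": target,
            "Schedule": f"cron({cron})",
            "IamRole": iam_role_arn,
            "Enable": True,
        }
    return [
        action(f"pause-{cluster_id}",
               {"PauseCluster": {"ClusterIdentifier": cluster_id}}, pause_cron),
        action(f"resume-{cluster_id}",
               {"ResumeCluster": {"ClusterIdentifier": cluster_id}}, resume_cron),
    ]
```

Pass each dict to `client.create_scheduled_action(**params)` with a `redshift` boto3 client; the cron expressions use the AWS cron syntax.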
Tables & storage
Let Automatic Table Optimization (ATO) pick distribution & sort keys and compression; override only when profile proves better.
Regularly VACUUM/ANALYZE (or verify auto-maintenance is doing its job), especially after large deletes/copies.
Use materialized views for heavy joins/aggregations; enable auto-refresh on incremental loads.
Prune managed storage: drop/partition cold data; unload snapshots to S3 (Parquet) and query via Spectrum.
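Pruning cold data to S3 typically means an UNLOAD to partitioned Parquet. The helper below just renders the statement text; the `UNLOAD ... FORMAT AS PARQUET PARTITION BY` syntax follows the Redshift docs, but treat the quoting and option handling as a sketch, not production SQL generation.

```python
def unload_to_parquet(query, s3_prefix, iam_role_arn, partition_cols=()):
    """Render an UNLOAD statement that archives query results to partitioned
    Parquet in S3 (then queryable via Spectrum). Sketch only -- naive quoting;
    validate options against the Redshift UNLOAD reference."""
    sql = (f"UNLOAD ('{query}') TO '{s3_prefix}' "
           f"IAM_ROLE '{iam_role_arn}' FORMAT AS PARQUET")
    if partition_cols:
        sql += f" PARTITION BY ({', '.join(partition_cols)})"
    return sql

stmt = unload_to_parquet("SELECT * FROM sales_2019",
                         "s3://dw-archive/sales_2019/",
                         "arn:aws:iam::111122223333:role/redshift-unload",
                         partition_cols=("sale_date",))
```

After unloading, drop or truncate the hot-side table so RMS GB-months actually shrink.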
Spectrum & lake
Store S3 data as columnar (Parquet/ORC) with partitioning and compression; avoid tiny files; push down predicates.
Control scans with proper WHERE clauses and partition filters; use Lake Formation for centralized permissions.
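Why columnar formats and partition pruning matter so much: Spectrum bills per TB scanned, so compression and pruning multiply. The rate below is a placeholder; the compression ratio and partition counts are illustrative assumptions.

```python
def spectrum_scan_cost(tb_scanned, per_tb_rate=5.0):
    """Spectrum billing is per TB scanned; per_tb_rate is a PLACEHOLDER --
    substitute your Region's actual price."""
    return tb_scanned * per_tb_rate

# Illustration: 10 TB of raw CSV scanned whole, vs the same data as Parquet
# at an assumed ~4x compression where a query touches 2 of 100 partitions:
raw_cost = spectrum_scan_cost(10.0)
pruned_cost = spectrum_scan_cost(10.0 / 4 * (2 / 100))
```

Here the combined effect cuts the scan bill by two orders of magnitude, which is typical of well-partitioned Parquet versus raw text files.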
Concurrency & query control
Enable Short Query Acceleration (SQA) for sub-second BI queries.
Use Workload Management (WLM) & Query Monitoring Rules (QMR) to throttle outliers and protect SLAs.
Watch the result cache hit rate; design dashboards to re-use results.
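QMR rules live inside the WLM JSON configuration. The helper below builds one rule entry; the field names (`rule_name`, `predicate`, `metric_name`, `action`) follow the AWS docs as I recall them, so double-check the schema before applying it to a parameter group.

```python
def qmr_rule(name, metric, threshold, action="log"):
    """One Query Monitoring Rule entry for wlm_json_configuration
    (field names per AWS docs -- verify the schema before applying)."""
    return {
        "rule_name": name,
        "predicate": [{"metric_name": metric, "operator": ">", "value": threshold}],
        "action": action,  # "log", "hop", or "abort"
    }

# Throttle outliers: abort billion-row scans, log anything over 5 minutes.
rules = [
    qmr_rule("runaway_scan", "scan_row_count", 1_000_000_000, action="abort"),
    qmr_rule("long_query", "query_execution_time", 300),
]
```

Start with `log` to calibrate thresholds against real traffic before escalating to `hop` or `abort`.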
Governance
Track warehouse usage by tags (team/app/env); alert on cost/throughput spikes via Budgets + CloudWatch.
💸 Purchase model optimization
On-Demand (provisioned)
Variable but mostly steady
Combine pause/resume with elastic resize.
Reserved
Predictable baselines (months+ of steady usage)
1- or 3-year commitments can significantly reduce compute cost; keep a fraction on On-Demand for bursts.
Serverless
Spiky, unpredictable, many small tenants
Pay per compute second & RMS; set workgroup limits and budgets.
Concurrency Scaling
Short peaks
You accrue free credits daily; after that it bills per-second—monitor usage.
Spectrum
Query S3 data directly
Charged by data scanned—use Parquet/ORC + partitions to cut scan costs dramatically.
Avoid hard-coding prices in docs; model with AWS Pricing Calculator and keep Region-specific sheets.
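In that spirit, model the Reserved-vs-On-Demand decision with rates from the calculator rather than fixed prices. The break-even logic is simple: a commitment pays off when expected utilization exceeds the ratio of effective rates. The figures below are hypothetical.

```python
def reserved_breakeven_utilization(reserved_hourly_eff, on_demand_hourly):
    """A reservation wins when expected utilization (fraction of hours running)
    exceeds effective_reserved_rate / on_demand_rate. Rates are inputs you
    pull from the AWS Pricing Calculator, not constants."""
    return reserved_hourly_eff / on_demand_hourly

# Hypothetical: a 1-yr commitment works out to $0.65/h-equivalent vs $1.00/h
# on-demand, so it pays off if the cluster runs more than ~65% of the time.
breakeven = reserved_breakeven_utilization(0.65, 1.00)
```

Keep a fraction of capacity on On-Demand for bursts, as noted above, and re-run the model per Region.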
🧩 Cluster configurations & migration tips
DC2 → RA3: migrate when storage growth or mixed hot/cold data makes local SSD a constraint. Re-test distribution keys; enable AQUA where available.
Single → multi-node: for HA and parallelism; mind DISTKEY/SORTKEY and skew.
Provisioned → Serverless: great for labs or multi-team self-service; set guardrails first.
Multiple warehouses: use data sharing to fan out read-only access across teams without duplicating data.
📊 Additional optimization tips
ETL/ELT: prefer COPY from staged Parquet in S3; batch writes; avoid single-row INSERT loops.
Compression & encoding: keep column encodings fresh (ATO / ANALYZE COMPRESSION).
Skew busting: check SVV_TABLE_INFO and SVL_QUERY_SUMMARY for distribution & query skew.
Result set limits: cap BI exports and SELECT * patterns; paginate.
Cost hygiene: auto-purge orphaned snapshots, unload rarely used tables to S3, and retire stale materialized views.
🔒 Security & compliance
Encryption: at rest with KMS; in transit with SSL.
Network: VPC-only; Enhanced VPC Routing for deterministic egress paths.
Access control: IAM federation + Redshift RBAC; row-level and column-level security; data masking for sensitive columns.
Audit: system tables & STL/SVL logs; export to S3/CloudWatch; integrate with Lake Formation for lake-wide governance.
📈 Monitoring & tools
Amazon Redshift Console: performance insights, Advisor, auto-tuning status.
CloudWatch metrics/alarms: CPU %, WLM queue length, query runtime, read/write IOPS, Spectrum bytes scanned.
System views: STL_QUERY, SVL_QLOG, SVV_TABLE_INFO, SVL_S3QUERY_SUMMARY, SVL_STATEMENTTEXT.
Cost Explorer / CUR: tag by team/workload; watch RMS GB-months, Spectrum TB scanned, Concurrency Scaling seconds.
AWS Budgets: caps/alerts per cluster or workgroup.
dbt/Glue: model governance; auto-docs, tests, and lineage.
🧪 Practical selection cheat-sheet
Small, hot dataset; steady BI → DC2 (if it truly fits) or small RA3; materialized views + SQA.
Growing lake + warehouse → RA3 + Spectrum; store raw in Parquet; data share curated marts.
Spiky analytics / many teams → Serverless with workgroup guardrails and budgets.
Heavy joins & wide scans → Consider AQUA, sort keys, and MVs; verify distribution style and skew.
✅ Checklist
References (add your org’s canonical links)
Redshift pricing & calculator (provisioned, Serverless, RMS, Spectrum)
Best practices: distribution/sort keys, ATO, VACUUM/ANALYZE, materialized views
Spectrum & Lake Formation setup (partitions, Parquet/ORC, governance)
Concurrency Scaling, SQA, WLM & QMR runbooks
Redshift Advisor & performance dashboards; dbt project template
Features and limits evolve—always validate in the AWS console/docs for your Region before rollout.