Amazon CloudWatch
Amazon CloudWatch is AWS's observability backbone: metrics, logs, traces, alarms, dashboards, Synthetics, and RUM in one place. It's powerful, but costs can spike fast from custom-metric cardinality, log ingestion & Logs Insights scans, per-alarm charges, and "nice-to-have" features left on forever. This page blends Grok's highlights with a pragmatic, FinOps-oriented playbook.
What is CloudWatch?
CloudWatch collects and visualizes metrics, logs, and traces; alerts via alarms; powers dashboards; runs Synthetics canaries and Real User Monitoring (RUM); and integrates with EventBridge for automation. Use it to detect issues, trigger actions, and build SLO dashboards across AWS, hybrid, and on-prem.
Core building blocks
Metrics: native service metrics + custom metrics you publish.
Logs: centralized log ingestion, storage, Live Tail, and Logs Insights (SQL-like queries).
Alarms: threshold & anomaly detection; composite alarms to reduce noise.
Dashboards: service/team KPIs and executive views.
Traces: X-Ray/ServiceLens for distributed tracing.
Synthetics: headless/browser/API checks; synthetic journeys.
RUM: client-side performance & UX telemetry for web apps.
Application Signals / Application Insights: faster app observability setup (auto metrics/SLOs; legacy app monitors).
Quicklinks (bookmark these)
Pricing for Metrics, Logs, Alarms, Synthetics, RUM, Traces
Logs cost levers: log classes, retention, Insights scanning
Real-time metrics/logs & Metric Streams
Cross-account observability (centralized viewing)
Data protection for logs (PII detection/masking)
(Keep org-specific links here to your runbooks, dashboards, and cost guardrails.)
Components: pick the right one
Metrics: infra KPIs (CPU, mem, net) and app KPIs. Billing is per metric/time series; watch dimensions (cardinality).
Logs: app/server/platform logs (access, error, debug). Pay to ingest (uncompressed), store (compressed), and scan (Insights).
Alarms: paging, autoscaling, remediation triggers. Anomaly detection consumes multiple internal series; use where it pays off.
Dashboards: team & exec views. Priced per dashboard beyond a small free allowance.
Traces (X-Ray): microservices & dependency analysis. Sampling controls cost; first cross-account trace copy may be included.
Synthetics: API/browser canaries. Bill per run + supporting Lambda/Logs/Metrics.
RUM: web UX telemetry. Bill per event; sample to control volume.
Contributor Insights: top-N patterns from logs. Bill per matched event; great for hot keys/actors.
Metric Streams: push metrics to Firehose/partners. Bill per metric update; good when GetMetricData polling is heavy.
Log classes & retention
Standard: operational logs you alert on. Full features (metric filters, alarming, Live Tail, data protection).
Infrequent Access (IA): keep-but-rarely-use archives. Lower ingest price with feature trade-offs (no Live Tail/filters/alarming).
Retention tips: Set per-group retention (e.g., 7 to 30 days for noisy app logs, longer for audit). Export very long-term history to S3 and query with Athena to avoid high Insights scan costs.
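A minimal sketch of per-group retention, assuming a hypothetical naming convention in which log group names contain "/debug/", "/app/", or "/audit/". The day counts in DESIRED_DAYS are illustrative defaults, not recommendations from AWS. PutRetentionPolicy only accepts a fixed set of day values, so the helper rounds up to the nearest accepted one.

```python
# Day values accepted by the CloudWatch Logs PutRetentionPolicy API.
ALLOWED_RETENTION_DAYS = [1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400,
                          545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653]

# Hypothetical org convention: a keyword in the log group path picks the retention.
DESIRED_DAYS = {"debug": 7, "app": 30, "audit": 365}


def snap_retention(days: int) -> int:
    """Round a desired retention up to the nearest value the API accepts."""
    for allowed in ALLOWED_RETENTION_DAYS:
        if allowed >= days:
            return allowed
    return ALLOWED_RETENTION_DAYS[-1]


def retention_for(log_group_name: str, default_days: int = 30) -> int:
    """Pick retention from the (assumed) naming convention, else the default."""
    for keyword, days in DESIRED_DAYS.items():
        if f"/{keyword}/" in log_group_name:
            return snap_retention(days)
    return snap_retention(default_days)


def apply_retention(log_group_name: str) -> None:
    """Apply the chosen retention; needs boto3 and AWS credentials at call time."""
    import boto3

    boto3.client("logs").put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=retention_for(log_group_name),
    )
```

Keeping the mapping in one place makes it easy to review in a pull request and to reuse from a scheduled job or IaC hook.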
Resolution & retention knobs
Metric resolution: Basic (5-min), Detailed (1-min), High-res (1-s). Use 1-min for key resources; reserve 1-s for spiky SLOs and short windows.
Logs retention: 1 day to never expire. Shorten non-prod; keep prod tight; archive to S3 if needed.
Trace sampling & retention: control % sampled & days kept. Start low (e.g., 5-10%), raise on incident or critical paths.
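For custom metrics, the resolution knob is the StorageResolution field on each datum sent to PutMetricData: 60 for standard, 1 for high resolution. A small sketch follows; the metric name, namespace, and dimension used in the usage note are made up for illustration.

```python
from datetime import datetime, timezone


def build_datum(name, value, unit="Milliseconds", high_resolution=False, dimensions=None):
    """Build one MetricData entry for PutMetricData.

    high_resolution=True stores the point at 1-second resolution;
    otherwise it is stored at the standard 60-second resolution.
    """
    dims = dimensions or {}
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dims.items())],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the datums; needs boto3 and AWS credentials at call time."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)
```

Usage might look like publish("Acme/Checkout", [build_datum("CheckoutLatency", 123, high_resolution=True, dimensions={"Service": "checkout"})]), with the high-resolution flag switched on only for the few series that need it.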
Integrations & ingestion
CloudWatch Agent: EC2/on-prem metrics & logs. Unified agent; supports EMF (Embedded Metric Format) for low-cardinality custom metrics.
ADOT/OpenTelemetry: standardized metrics/traces. Use for polyglot microservices; export to CW + partner backends.
PutMetricData / EMF: app-emitted KPIs. Batch & aggregate to cap cardinality; avoid request-ID dimensions.
Logs subscription filters: stream logs to Kinesis/Firehose/Lambda. Offload analytics or real-time processing; mind downstream costs.
EventBridge (formerly CW Events): event-driven automation. Schedules, rules, cross-service triggers for auto-remediation.
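To make the EMF option concrete, here is a sketch of one Embedded Metric Format record: a JSON log line whose "_aws" block tells CloudWatch which top-level fields to extract as metrics and which to treat as dimensions. The namespace, dimension names, and metric name in the example are assumptions for illustration.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value, unit="Milliseconds"):
    """Build one EMF record as a dict.

    `dimensions` maps dimension name to value. Keep these low-cardinality
    (service, endpoint), never request or user IDs.
    """
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set: the keys that together identify a series.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        **dimensions,          # dimension values live at the top level
        metric_name: value,    # so does the metric value
    }


if __name__ == "__main__":
    record = emf_record("Acme/Checkout", {"Service": "checkout", "Endpoint": "/pay"}, "Latency", 42)
    print(json.dumps(record))  # write this line to stdout or a log stream
```

Extra top-level fields that are not listed under Dimensions (a request ID, for example) stay searchable in the log without creating new metric series.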
Cross-account observability lets you view many accounts/Regions from a single "monitoring" account without duplicating data.
CloudWatch FinOps playbook
Metrics (cardinality killers)
Design dimensions intentionally (service, endpoint, status class); never user/request IDs.
Prefer metric math & percentiles over emitting many near-duplicate series.
Downsample where 1-min is enough; avoid 1-s except where it's proven necessary.
Consider Metric Streams if polling (GetMetricData) is heavy/expensive.
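The cardinality risk is easy to quantify before shipping a metric. Each unique combination of dimension values is its own billable time series, so the worst case for one metric name is the product of the distinct values per dimension. A quick sketch, with made-up counts:

```python
from math import prod


def series_count(distinct_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric name.

    `distinct_values` maps each dimension name to how many distinct
    values it can take. The bound assumes every combination occurs.
    """
    return prod(distinct_values.values())


if __name__ == "__main__":
    sane = {"service": 10, "endpoint": 20, "status_class": 5}
    print(series_count(sane))                         # 1000
    print(series_count({**sane, "user_id": 50_000}))  # 50000000
```

Adding a single unbounded dimension multiplies the series count by its cardinality, which is why user and request IDs belong in logs rather than in dimensions.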
Logs (volume + scan)
Set retention per log group; don't keep debug forever.
Use IA class for keep-but-rarely-use streams; keep alert-worthy streams in Standard.
Partition groups by app/env/Region so Insights scans stay small; always scope time windows & fields.
Drop noise at the agent (filters/sampling) before ingestion where safe.
Export archives to S3 and query with Athena for long-term analytics.
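A sketch of a tightly scoped Logs Insights query, assuming one log group per app and a short look-back window. The helper names and the example log group are hypothetical; the returned dict matches the keyword arguments of the boto3 logs client's start_query call.

```python
import time


def scoped_query(fields, filter_expr, limit=100):
    """Compose a Logs Insights query that names its fields and caps its rows."""
    return (
        f"fields {', '.join(fields)} "
        f"| filter {filter_expr} "
        f"| sort @timestamp desc "
        f"| limit {limit}"
    )


def start_query_kwargs(log_group, minutes, query, now=None):
    """Build start_query arguments for a single group and a short window.

    Times are epoch seconds. `now` is injectable for testing.
    """
    end = int(now if now is not None else time.time())
    return {
        "logGroupNames": [log_group],
        "startTime": end - minutes * 60,
        "endTime": end,
        "queryString": query,
    }


def run(log_group, minutes, query):
    """Start the query; needs boto3 and AWS credentials at call time."""
    import boto3

    return boto3.client("logs").start_query(**start_query_kwargs(log_group, minutes, query))
```

For example, run("/acme/app/checkout", 15, scoped_query(["@timestamp", "@message"], "@message like /ERROR/", 20)) looks at 15 minutes of one group instead of a day of everything.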
Alarms & analytics
Use composite alarms to reduce pages; gate noisy series behind a single "service health" alarm.
Reserve anomaly detection for seasonally noisy metrics where static thresholds fail.
Scope real-time logs and Live Tail to bursts; don't leave on by default.
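The "service health" gate can be sketched as a composite alarm whose rule ORs together the child alarms, so only the composite pages. The alarm names and SNS topic below are placeholders; the dict matches the keyword arguments of the boto3 cloudwatch client's put_composite_alarm call.

```python
def any_in_alarm(alarm_names):
    """Build an AlarmRule that fires when any child alarm is in ALARM state."""
    if not alarm_names:
        raise ValueError("need at least one child alarm")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_kwargs(name, child_alarms, sns_topic_arn):
    """Arguments for put_composite_alarm: one page for many noisy children."""
    return {
        "AlarmName": name,
        "AlarmRule": any_in_alarm(child_alarms),
        "AlarmActions": [sns_topic_arn],
        "ActionsEnabled": True,
    }


def create(name, child_alarms, sns_topic_arn):
    """Create the composite alarm; needs boto3 and AWS credentials at call time."""
    import boto3

    boto3.client("cloudwatch").put_composite_alarm(
        **composite_alarm_kwargs(name, child_alarms, sns_topic_arn)
    )
```

With this layout the child alarms keep their state and history for drill-down but carry no paging actions of their own.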
Org layout
Centralize views with cross-account observability; standardize dashboards & alarms via IaC.
Track spend in Cost Explorer/CUR by namespace/log group; add Budgets alerts for ingestion and Insights scans.
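One way to start the spend review is a Cost Explorer query for CloudWatch grouped by usage type, which separates ingestion, storage, metrics, and alarms on the bill. This is a sketch: the SERVICE value "AmazonCloudWatch" is an assumption to verify against your own Cost Explorer data, and a per-log-group breakdown would need cost allocation tags or CUR rather than this call.

```python
from datetime import date, timedelta


def cloudwatch_cost_request(days=30, today=None):
    """Build get_cost_and_usage arguments for daily CloudWatch cost by usage type.

    `today` is injectable for testing. Cost Explorer treats End as exclusive.
    """
    end = today or date.today()
    start = end - timedelta(days=days)
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Assumed service name; confirm it in your account before relying on it.
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["AmazonCloudWatch"]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def fetch(days=30):
    """Run the query; needs boto3 and AWS credentials at call time."""
    import boto3

    return boto3.client("ce").get_cost_and_usage(**cloudwatch_cost_request(days))
```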
Pricing model & common gotchas
Metrics: pay per custom metric/time series (dimensions explode cost); API requests bill beyond free allowances.
Logs: pay to ingest (uncompressed), store (compressed), and scan (Insights). IA class lowers ingest price but removes some features.
Alarms: per alarm; anomaly detection meters multiple internal series.
Synthetics/RUM: per run/per event; sample deliberately.
Vended logs credits: some services credit part of log delivery (reduces Logs charges).
Regional & tiered pricing varies; always model with your Region's pricing page or AWS Pricing Calculator.
Rule of thumb: Don't hard-code prices in docs. Keep links to pricing and your internal calculator/runbooks.
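In the same spirit, an internal calculator can take rates as inputs rather than embedding them. The sketch below models only the three Logs levers named above (ingest, store, scan) and ignores tiering and free allowances; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRates:
    """Per-GB rates, supplied by the caller from the current Region's pricing page."""
    ingest_per_gb: float
    storage_per_gb_month: float
    scan_per_gb: float


def monthly_logs_cost(rates: LogRates, ingested_gb: float, stored_gb: float, scanned_gb: float) -> float:
    """Flat-rate estimate: ingestion + storage + Logs Insights scanning."""
    return (
        ingested_gb * rates.ingest_per_gb
        + stored_gb * rates.storage_per_gb_month
        + scanned_gb * rates.scan_per_gb
    )
```

Running the same volumes through Standard and IA rates side by side shows quickly whether moving a stream is worth the feature trade-off.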
Automation patterns
Retention-as-code: set default retention per account/OU; shorten non-prod.
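Lifecycle to IA: a log group's class is fixed at creation, so route cold streams to new Infrequent Access groups rather than trying to convert existing ones; expire fast-churn logs quickly.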
EventBridge + Lambda: auto-remediate when ingestion spikes, when new high-cardinality dimensions appear, or when logs go unencrypted.
Pipelines: auto-create dashboards/alarms from tags; version them with Terraform/CloudFormation/CDK.
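As one instance of the EventBridge + Lambda pattern, the sketch below sets a default retention whenever a log group is created. It assumes CloudTrail management events are enabled so that CreateLogGroup calls reach EventBridge; the 30-day default and the injectable client parameter are illustrative choices.

```python
# EventBridge rule pattern: match CreateLogGroup calls recorded by CloudTrail.
CREATE_LOG_GROUP_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}

DEFAULT_RETENTION_DAYS = 30  # hypothetical org default


def log_group_from_event(event):
    """Pull the new log group's name out of the CloudTrail-shaped event."""
    return event["detail"]["requestParameters"]["logGroupName"]


def handler(event, context, logs_client=None):
    """Lambda entry point: apply the default retention to the new group.

    `logs_client` is injectable for tests; in Lambda it is created from boto3.
    """
    if logs_client is None:
        import boto3

        logs_client = boto3.client("logs")
    name = log_group_from_event(event)
    logs_client.put_retention_policy(
        logGroupName=name, retentionInDays=DEFAULT_RETENTION_DAYS
    )
    return name
```

The same rule-plus-handler shape extends to the other triggers listed above by swapping the event pattern and the remediation call.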
Security & compliance
Encryption: Logs are encrypted at rest; use KMS CMKs per env/app where policy requires.
Data protection for logs: targeted PII detection/masking (priced per GB scanned); enable only on the groups that need it.
Least privilege: scope IAM on PutMetricData, PutLogEvents, GetMetricData, StartQuery, and KMS actions.
Private access: use VPC endpoints for private ingestion and queries.
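A sketch of a least-privilege writer policy for one application. PutMetricData has no resource-level permissions, so it is narrowed with the cloudwatch:namespace condition key instead; log writes are limited to a single log group. The namespace and ARN in the usage line are placeholders.

```python
import json


def writer_policy(namespace, log_group_arn):
    """IAM policy document: publish metrics to one namespace, write to one log group."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PutMetricsInOneNamespace",
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",  # required: the action does not support resource ARNs
                "Condition": {"StringEquals": {"cloudwatch:namespace": namespace}},
            },
            {
                "Sid": "WriteOneLogGroup",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",  # streams inside the group
            },
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:logs:us-east-1:111122223333:log-group:/acme/app/checkout"
    print(json.dumps(writer_policy("Acme/Checkout", arn), indent=2))
```

Read-side actions such as GetMetricData and StartQuery would go in a separate policy attached to the people and tools that query, not to the workload role.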
Monitoring & tools
CloudWatch Metrics & Alarms: golden signals (latency, errors, traffic, saturation).
Dashboards: SLOs & cost owner views; keep to essentials to control dashboard charges.
Logs Insights: ad-hoc queries; always narrow time & fields to cut GB scanned.
ServiceLens/X-Ray: dependency maps & traces for incident drill-downs.
Cost Explorer/CUR + Budgets: monthly review of metric counts, log ingest vs retention, Insights scans, Synthetics/RUM volume.
Practical selection cheat-sheet
Infra basics: native service metrics + a handful of custom metrics → standard alarms + a team dashboard.
Heavy logs: keep prod/alerted streams Standard; move bulk debug to IA; 7-30d retention; archive to S3.
API SLOs: Application Signals + anomaly alarms where needed; 1-min metrics, 1-s only for critical hot spots.
User experience: RUM (sampled) + a few Synthetics on checkout/login/search.
Multi-account: turn on cross-account observability; central dashboards/alarms; enforce guardrails via SCP/Config.
Checklist
References (fill with your org's canonical links)
CloudWatch pricing & AWS Pricing Calculator
Logs classes, retention, and Logs Insights best practices
Cross-account observability & ServiceLens/X-Ray
Data protection for logs
Metric Streams, Application Signals/Insights
Internal runbooks: dimension standards, retention defaults, cost guardrails
Features & prices evolve. Validate in your Region before production changes.