Amazon CloudWatch

Amazon CloudWatch is AWS’s observability backbone—metrics, logs, traces, alarms, dashboards, Synthetics, and RUM in one place. It’s powerful, but costs can spike fast from custom-metric cardinality, log ingestion & Logs Insights scans, per-alarm charges, and “nice-to-have” features left on forever. This page blends Grok’s highlights with a pragmatic, FinOps-oriented playbook.

🚀 What is CloudWatch?

CloudWatch collects and visualizes metrics, logs, and traces; alerts via alarms; powers dashboards; runs Synthetics canaries and Real User Monitoring (RUM); and integrates with EventBridge for automation. Use it to detect issues, trigger actions, and build SLO dashboards across AWS, hybrid, and on-prem.

Core building blocks

Metrics: native service metrics + custom metrics you publish.
Logs: centralized log ingestion, storage, Live Tail, and Logs Insights (a purpose-built piped query language; OpenSearch SQL and PPL are also supported).
Alarms: threshold & anomaly detection; composite alarms to reduce noise.
Dashboards: service/team KPIs and executive views.
Traces: X-Ray/ServiceLens for distributed tracing.
Synthetics: headless/browser/API checks; synthetic journeys.
RUM: client-side performance & UX telemetry for web apps.
Application Signals / Application Insights: faster app observability setup (auto metrics/SLOs; legacy app monitors).

🔗 Quicklinks (bookmark these)

Pricing for Metrics, Logs, Alarms, Synthetics, RUM, Traces
Logs cost levers: log classes, retention, Insights scanning
Real-time metrics/logs & Metric Streams
Cross-account observability (centralized viewing)
Data protection for logs (PII detection/masking)

(Keep org-specific links here to your runbooks, dashboards, and cost guardrails.)

⚙️ Components — pick the right one

| Component | Use cases | Notes |
| --- | --- | --- |
| Metrics | Infra KPIs (CPU, mem, net), app KPIs | Billing is per metric/time series; watch dimensions (cardinality). |
| Logs | App/server/platform logs, access/error/debug | Pay to ingest (uncompressed), store (compressed), and scan (Insights). |
| Alarms | Paging, autoscaling, remediation triggers | Anomaly detection consumes multiple internal series; use where it pays off. |
| Dashboards | Team & exec views | Priced per dashboard beyond a small free allowance. |
| Traces (X-Ray) | Microservices & dependency analysis | Sampling controls cost; first cross-account trace copy may be included. |
| Synthetics | API/browser canaries | Billed per run + supporting Lambda/Logs/Metrics. |
| RUM | Web UX telemetry | Billed per event; sample to control volume. |
| Contributor Insights | Top-N patterns from logs | Billed per rule and per matched log event; great for hot keys/actors. |
| Metric Streams | Push metrics to Firehose/partners | Billed per metric update; good when GetMetricData polling is heavy. |

🗂️ Logs classes & retention

| Choice | Best for | Key behaviors |
| --- | --- | --- |
| Standard | Operational logs you alert on | Full features (metric filters, alarming, Live Tail, data protection). |
| Infrequent Access (IA) | Keep-but-rarely-use archives | Lower ingest price with feature trade-offs (no Live Tail/filters/alarming). |

Retention tips: set per-group retention (e.g., 7–30 days for noisy app logs, longer for audit). Export very long-term history to S3 and query it with Athena to avoid high Insights scan costs.
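As a rough illustration of per-group retention, the sketch below maps an environment label to a retention period and snaps it to a value the `retentionInDays` parameter accepts. The `DEFAULTS` table and the log group names are hypothetical; only the list of allowed day counts and the `put_retention_policy` call mentioned in the closing comment come from the CloudWatch Logs API.

```python
# Day counts accepted by CloudWatch Logs for retentionInDays.
VALID_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)

# Hypothetical org defaults; adjust to your own policy.
DEFAULTS = {"dev": 7, "staging": 14, "prod": 30, "audit": 365}


def snap_retention(days):
    """Return the smallest allowed retention that is >= the requested days."""
    for allowed in VALID_RETENTION_DAYS:
        if allowed >= days:
            return allowed
    return VALID_RETENTION_DAYS[-1]


def retention_plan(log_groups):
    """Map {log group name: environment} to {log group name: retention days}."""
    return {
        name: snap_retention(DEFAULTS.get(env, 30))
        for name, env in log_groups.items()
    }


plan = retention_plan({
    "/app/checkout/prod": "prod",   # hypothetical log group names
    "/app/checkout/dev": "dev",
    "/audit/iam": "audit",
})
print(plan)

# Applying the plan with boto3 would look like:
#   logs = boto3.client("logs")
#   for name, days in plan.items():
#       logs.put_retention_policy(logGroupName=name, retentionInDays=days)
```

Keeping the plan as plain data makes it easy to review in a pull request before anything is applied.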

🧬 Resolution & retention knobs

| Knob | What it does | Practical guidance |
| --- | --- | --- |
| Metric resolution | Basic (5-min), Detailed (1-min), High-res (1s) | Use 1-min for key resources; reserve 1-s for spiky SLOs and short windows. |
| Logs retention | 1 day → never expire | Shorten non-prod; keep prod tight; archive to S3 if needed. |
| Trace sampling & retention | Controls the share of requests sampled; X-Ray keeps traces for 30 days | Start low (e.g., 5–10%), raise on incident or critical paths. |
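For custom metrics, the resolution knob corresponds to the `StorageResolution` field of a `PutMetricData` datum (1 for high resolution, 60 for standard). The helper below is a minimal sketch that only builds the request entry; the metric name, namespace, and dimension values are made up for illustration.

```python
def metric_datum(name, value, unit, dimensions, high_resolution=False):
    """Build one MetricData entry for PutMetricData.

    StorageResolution=1 stores the point at 1-second resolution,
    60 stores it at the standard 1-minute resolution.
    """
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
        "Value": value,
        "Unit": unit,
        "StorageResolution": 1 if high_resolution else 60,
    }


standard = metric_datum("QueueDepth", 42, "Count", {"Service": "orders"})
spiky = metric_datum(
    "CheckoutLatency", 187.5, "Milliseconds",
    {"Service": "checkout"}, high_resolution=True,
)
print(standard["StorageResolution"], spiky["StorageResolution"])

# Publishing with boto3 would look like:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Shop/Checkout",  # hypothetical namespace
#       MetricData=[standard, spiky],
#   )
```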

🏛️ Integrations & ingestion

| Option | When to use | Notes |
| --- | --- | --- |
| CloudWatch Agent | EC2/On-prem metrics & logs | Unified agent; supports EMF (Embedded Metric Format) for low-cardinality custom metrics. |
| ADOT/OpenTelemetry | Standardized metrics/traces | Use for polyglot microservices; export to CW + partner backends. |
| PutMetricData / EMF | App-emitted KPIs | Batch & aggregate to cap cardinality; avoid request-ID dimensions. |
| Logs subscription filters | Stream logs to Kinesis/Firehose/Lambda | Offload analytics or real-time processing; mind downstream costs. |
| EventBridge (formerly CW Events) | Event-driven automation | Schedules, rules, cross-service triggers for auto-remediation. |
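To show what an EMF record looks like in practice, here is a minimal sketch that builds one Embedded Metric Format log line by hand. The namespace, dimension names, and the `requestId` property are illustrative assumptions. The point is that only the keys listed under `Dimensions` become metric dimensions, while high-cardinality fields can ride along as plain log properties.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, properties=None, timestamp_ms=None):
    """Build one Embedded Metric Format (EMF) log line.

    dimensions: {name: value}, low-cardinality only
    metrics:    {name: (value, unit)}
    properties: extra searchable fields that do NOT become dimensions
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension key.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    record.update(properties or {})
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return json.dumps(record)


line = emf_record(
    "Shop/Checkout",                                  # hypothetical namespace
    {"Service": "checkout", "StatusClass": "2xx"},
    {"Latency": (123.0, "Milliseconds")},
    properties={"requestId": "req-0001"},             # log field only, not a dimension
    timestamp_ms=1700000000000,
)
print(line)
```

Writing this line to stdout in Lambda, or to a log group via the CloudWatch Agent, lets CloudWatch extract the metric without a separate PutMetricData call.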

Cross-account observability lets you view many accounts/Regions from a single “monitoring” account without duplicating data.

🧠 CloudWatch FinOps playbook

Metrics (cardinality killers)

Design dimensions intentionally (service, endpoint, status class) — never user/request IDs.
Prefer metric math & percentiles over emitting many near-duplicate series.
Downsample where 1-min is enough; avoid 1-s except where it’s proven necessary.
Consider Metric Streams if polling (GetMetricData) is heavy/expensive.
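The cardinality warning is easier to feel with numbers. The sketch below computes an upper bound on billable time series for one metric name as the product of distinct values per dimension. The dimension counts are invented for illustration, and real counts are usually lower because not every combination occurs.

```python
from math import prod


def series_count(dimension_values):
    """Upper bound on time series for one metric name.

    dimension_values: {dimension name: number of distinct values}
    """
    return prod(dimension_values.values())


bounded = {"service": 12, "endpoint": 40, "status_class": 5}
print(series_count(bounded))       # 2400

with_user = {**bounded, "user_id": 50_000}
print(series_count(with_user))     # 120000000
```

A single unbounded dimension turns a few thousand series into a figure no budget survives, which is why request and user IDs belong in logs rather than in dimensions.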

Logs (volume + scan)

Set retention per log group; don’t keep debug forever.
Use IA class for keep-but-rarely-use streams; keep alert-worthy streams in Standard.
Partition groups by app/env/Region so Insights scans stay small; always scope time windows & fields.
Drop noise at the agent (filters/sampling) before ingestion where safe.
Export archives to S3 and query with Athena for long-term analytics.
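One way to enforce time-scoped queries is to build the `start_query` arguments through a helper that always requires a window and a single log group. This is a sketch under assumptions: the log group name, field list, and filter expression are hypothetical, and the helper only constructs the arguments rather than calling AWS.

```python
from datetime import datetime, timedelta, timezone


def scoped_query_kwargs(log_group, fields, filter_expr, minutes, limit=100, now=None):
    """Build keyword arguments for CloudWatch Logs start_query.

    Forces an explicit time window (epoch seconds) and a single log group.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    query = (
        f"fields {', '.join(fields)} "
        f"| filter {filter_expr} "
        f"| sort @timestamp desc "
        f"| limit {limit}"
    )
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),
        "endTime": int(now.timestamp()),
        "queryString": query,
        "limit": limit,
    }


kwargs = scoped_query_kwargs(
    "/app/checkout/prod",                       # hypothetical log group
    ["@timestamp", "@message"],
    "@message like /ERROR/",
    minutes=15,
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(kwargs["queryString"])

# Running it with boto3 would look like:
#   query_id = boto3.client("logs").start_query(**kwargs)["queryId"]
```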

Alarms & analytics

Use composite alarms to reduce pages; gate noisy series behind a single “service health” alarm.
Reserve anomaly detection for seasonally noisy metrics where static thresholds fail.
Scope real-time logs and Live Tail to bursts; don’t leave on by default.
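The "single service health alarm" idea maps to a composite alarm whose rule combines child alarms. Below is a minimal sketch that builds the rule expression and the arguments for `put_composite_alarm`. The alarm names and the SNS topic ARN are placeholders, not real resources.

```python
def composite_rule(child_alarms, operator="OR"):
    """Combine child alarm names into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not child_alarms:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in child_alarms)


def service_health_alarm(service, child_alarms, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_composite_alarm."""
    return {
        "AlarmName": f"{service}-service-health",
        "AlarmRule": composite_rule(child_alarms),
        "ActionsEnabled": True,
        "AlarmActions": [sns_topic_arn],
    }


alarm = service_health_alarm(
    "checkout",
    ["checkout-5xx-rate", "checkout-p99-latency"],         # hypothetical child alarms
    "arn:aws:sns:us-east-1:111122223333:oncall-checkout",  # placeholder ARN
)
print(alarm["AlarmRule"])
# ALARM("checkout-5xx-rate") OR ALARM("checkout-p99-latency")

# Creating it with boto3 would look like:
#   boto3.client("cloudwatch").put_composite_alarm(**alarm)
```

With this layout only the composite alarm carries a paging action; the child alarms can stay action-free and serve as drill-down signals.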

Org layout

Centralize views with cross-account observability; standardize dashboards & alarms via IaC.
Track spend in Cost Explorer/CUR by namespace/log group; add Budgets alerts for ingestion and Insights scans.

💸 Pricing model & common gotchas

Metrics: pay per custom metric/time series (every extra dimension value multiplies the series count and the cost); API requests are billed beyond the free allowance.
Logs: pay to ingest (uncompressed), store (compressed), and scan (Insights). IA class lowers ingest price but removes some features.
Alarms: per alarm; anomaly detection meters multiple internal series.
Synthetics/RUM: per run/per event; sample deliberately.
Vended logs: logs published natively by AWS services (e.g., VPC Flow Logs) are billed at volume-tiered delivery rates, which can come in below standard ingestion at scale.
Regional & tiered pricing varies — always model with your Region’s pricing page or AWS Pricing Calculator.

Rule of thumb: Don’t hard-code prices in docs. Keep links to pricing and your internal calculator/runbooks.
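In keeping with that rule of thumb, a cost model can take rates as inputs instead of embedding them. The sketch below models the three Logs levers named above (ingest, storage, Insights scan). The rates in the example are deliberately fake round numbers, not AWS prices, so substitute values from your Region's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRates:
    """Per-GB rates supplied by the caller; never hard-coded here."""
    ingest_per_gb: float
    storage_per_gb_month: float
    scan_per_gb: float


def monthly_logs_cost(rates, ingested_gb, stored_gb, scanned_gb):
    """Estimate one month of Logs spend from volumes and caller-supplied rates."""
    return (
        rates.ingest_per_gb * ingested_gb
        + rates.storage_per_gb_month * stored_gb
        + rates.scan_per_gb * scanned_gb
    )


# Placeholder rates for illustration only; NOT real AWS prices.
placeholder = LogRates(ingest_per_gb=1.0, storage_per_gb_month=0.1, scan_per_gb=0.01)

estimate = monthly_logs_cost(placeholder, ingested_gb=200, stored_gb=50, scanned_gb=1000)
print(f"estimated monthly Logs cost: {estimate:.2f}")
```

Because ingestion is billed on uncompressed bytes while storage is billed on compressed bytes, `ingested_gb` and `stored_gb` should come from separate measurements rather than being derived from each other.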

⏱️ Automation patterns

Retention-as-code: set default retention per account/OU; shorten non-prod.
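Lifecycle to IA: a log group's class cannot be changed after creation, so route cold streams to new Infrequent Access groups rather than converting existing ones; expire fast-churn logs quickly.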
EventBridge + Lambda: auto-remediate when ingestion spikes, when new high-cardinality dimensions appear, or when logs go unencrypted.
Pipelines: auto-create dashboards/alarms from tags; version them with Terraform/CloudFormation/CDK.
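A guardrail Lambda for these patterns could start from a check like the one below, which inspects entries shaped like the `logGroups` items returned by `describe_log_groups`. It is a sketch under the assumption that a missing `retentionInDays` key means "never expire" and a missing `kmsKeyId` key means no customer managed key is associated. The sample data and finding labels are invented.

```python
def audit_log_groups(log_groups):
    """Return (log group name, finding) pairs for groups that break guardrails."""
    findings = []
    for group in log_groups:
        name = group["logGroupName"]
        if "retentionInDays" not in group:
            findings.append((name, "no-retention"))
        if "kmsKeyId" not in group:
            findings.append((name, "no-customer-kms-key"))
    return findings


# Sample entries shaped like describe_log_groups output (values are made up).
sample = [
    {
        "logGroupName": "/app/checkout/prod",
        "retentionInDays": 30,
        "kmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example",
    },
    {"logGroupName": "/app/debug/dev"},
]

for name, finding in audit_log_groups(sample):
    print(f"{name}: {finding}")

# Feeding it live data with boto3 would look like:
#   paginator = boto3.client("logs").get_paginator("describe_log_groups")
#   for page in paginator.paginate():
#       findings = audit_log_groups(page["logGroups"])
```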

🔒 Security & compliance

Encryption: logs are encrypted at rest by default; associate customer managed KMS keys per env/app where policy requires.
Data protection for logs: targeted PII detection/masking (priced per GB scanned) — enable only on the groups that need it.
Least privilege: scope IAM on PutMetricData, PutLogEvents, GetMetricData, StartQuery, and KMS actions.
Private access: use VPC endpoints for private ingestion and queries.
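As one possible shape for a least-privilege writer role, the sketch below generates an IAM policy that allows log writes to a single log group and metric publishing to a single namespace. The account ID, Region, log group, and namespace are placeholders. `cloudwatch:PutMetricData` does not support resource-level scoping, so the namespace condition key does the narrowing.

```python
import json


def writer_policy(region, account_id, log_group, namespace):
    """Build an IAM policy document for an app that only writes logs and metrics."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteAppLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": (
                    f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"
                ),
            },
            {
                "Sid": "PublishAppMetrics",
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"cloudwatch:namespace": namespace}
                },
            },
        ],
    }


policy = writer_policy(
    "us-east-1", "111122223333",      # placeholder Region and account ID
    "/app/checkout/prod",             # hypothetical log group
    "Shop/Checkout",                  # hypothetical namespace
)
print(json.dumps(policy, indent=2))
```

Read-side actions such as `logs:StartQuery` and `cloudwatch:GetMetricData` are left out on purpose; they belong on a separate reader role.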

📊 Monitoring & tools

CloudWatch Metrics & Alarms: golden signals (latency, errors, traffic, saturation).
Dashboards: SLOs & cost owner views; keep to essentials to control dashboard charges.
Logs Insights: ad-hoc queries; always narrow the time range and log groups to cut GB scanned, and select only the fields you need to keep results readable.
ServiceLens/X-Ray: dependency maps & traces for incident drill-downs.
Cost Explorer/CUR + Budgets: monthly review of metric counts, log ingest vs retention, Insights scans, Synthetics/RUM volume.

🧪 Practical selection cheat-sheet

Infra basics: native service metrics + a handful of custom metrics → standard alarms + a team dashboard.
Heavy logs: keep prod/alerted streams Standard; move bulk debug to IA; 7–30d retention; archive to S3.
API SLOs: Application Signals + anomaly alarms where needed; 1-min metrics, 1-s only for critical hot spots.
User experience: RUM (sampled) + a few Synthetics on checkout/login/search.
Multi-account: turn on cross-account observability; central dashboards/alarms; enforce guardrails via SCP/Config.

✅ Checklist

Define metric naming & dimensions (avoid high cardinality).
Set log retention defaults; route cold groups to IA; export archives to S3.
Budget Logs Insights scans (time-bounded queries).
Use composite / anomaly alarms selectively.
Encrypt logs with KMS where required; use VPC endpoints.
Centralize via cross-account observability.
Review monthly: metric counts, log ingest/storage, Insights scans, canary/RUM volume.

References (fill with your org’s canonical links)

CloudWatch pricing & AWS Pricing Calculator
Logs classes, retention, and Logs Insights best practices
Cross-account observability & ServiceLens/X-Ray
Data protection for logs
Metric Streams, Application Signals/Insights
Internal runbooks: dimension standards, retention defaults, cost guardrails

Features & prices evolve. Validate in your Region before production changes.


Last updated 12 days ago