Amazon CloudWatch

Amazon CloudWatch is AWS’s observability backbone: metrics, logs, traces, alarms, dashboards, Synthetics, and RUM in one place. It’s powerful, but costs can spike fast from custom-metric cardinality, log ingestion and Logs Insights scans, per-alarm charges, and “nice-to-have” features left on forever. This page blends Grok’s highlights with a pragmatic, FinOps-oriented playbook.


🚀 What is CloudWatch?

CloudWatch collects and visualizes metrics, logs, and traces; alerts via alarms; powers dashboards; runs Synthetics canaries and Real User Monitoring (RUM); and integrates with EventBridge for automation. Use it to detect issues, trigger actions, and build SLO dashboards across AWS, hybrid, and on-prem.

Core building blocks

  • Metrics: native service metrics + custom metrics you publish.

  • Logs: centralized log ingestion, storage, Live Tail, and Logs Insights (SQL-like queries).

  • Alarms: threshold & anomaly detection; composite alarms to reduce noise.

  • Dashboards: service/team KPIs and executive views.

  • Traces: X-Ray/ServiceLens for distributed tracing.

  • Synthetics: headless/browser/API checks; synthetic journeys.

  • RUM: client-side performance & UX telemetry for web apps.

  • Application Signals / Application Insights: faster app observability setup (auto metrics/SLOs; legacy app monitors).


  • Pricing for Metrics, Logs, Alarms, Synthetics, RUM, Traces

  • Logs cost levers: log classes, retention, Insights scanning

  • Real-time metrics/logs & Metric Streams

  • Cross-account observability (centralized viewing)

  • Data protection for logs (PII detection/masking)

(Keep org-specific links here to your runbooks, dashboards, and cost guardrails.)


⚙️ Components: pick the right one

| Component | Use cases | Notes |
| --- | --- | --- |
| Metrics | Infra KPIs (CPU, mem, net), app KPIs | Billing is per metric/time series; watch dimensions (cardinality). |
| Logs | App/server/platform logs, access/error/debug | Pay to ingest (uncompressed), store (compressed), and scan (Insights). |
| Alarms | Paging, autoscaling, remediation triggers | Anomaly detection consumes multiple internal series; use where it pays off. |
| Dashboards | Team & exec views | Priced per dashboard beyond a small free allowance. |
| Traces (X-Ray) | Microservices & dependency analysis | Sampling controls cost; first cross-account trace copy may be included. |
| Synthetics | API/browser canaries | Bill per run + supporting Lambda/Logs/Metrics. |
| RUM | Web UX telemetry | Bill per event; sample to control volume. |
| Contributor Insights | Top-N patterns from logs | Bill per matched event; great for hot keys/actors. |
| Metric Streams | Push metrics to Firehose/partners | Bill per metric update; good when GetMetricData polling is heavy. |


🗂️ Logs classes & retention

| Choice | Best for | Key behaviors |
| --- | --- | --- |
| Standard | Operational logs you alert on | Full features (metric filters, alarming, Live Tail, data protection). |
| Infrequent Access (IA) | Keep-but-rarely-use archives | Lower ingest price with feature trade-offs (no Live Tail/filters/alarming). |

Retention tips: set per-group retention (e.g., 7–30 days for noisy app logs, longer for audit). Export very long-term history to S3 and query with Athena to avoid high Insights scan costs.
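The per-group retention tip can be sketched as a small sweep. This is a minimal illustration, not a definitive implementation: the `RETENTION_DAYS` mapping and the function name are hypothetical, and `logs_client` is assumed to be a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) with permission to call `describe_log_groups` and `put_retention_policy`.

```python
from typing import Any, List

# Hypothetical defaults per environment; tune to your own policy.
RETENTION_DAYS = {"dev": 7, "staging": 14, "prod": 30}


def apply_default_retention(logs_client: Any, env: str) -> List[str]:
    """Set a retention policy on every log group that currently never expires.

    Returns the names of the log groups that were changed.
    """
    days = RETENTION_DAYS[env]
    changed: List[str] = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups without a retention policy omit the retentionInDays key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=days,
                )
                changed.append(group["logGroupName"])
    return changed
```

Groups that already have an explicit retention are left alone, so longer audit retention set by hand is not overwritten.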


🧬 Resolution & retention knobs

| Knob | What it does | Practical guidance |
| --- | --- | --- |
| Metric resolution | Basic (5-min), Detailed (1-min), High-res (1s) | Use 1-min for key resources; reserve 1-s for spiky SLOs and short windows. |
| Logs retention | 1 day → infinite | Shorten non-prod; keep prod tight; archive to S3 if needed. |
| Trace sampling & retention | Control % sampled & days kept | Start low (e.g., 5–10%), raise on incident or critical paths. |
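For custom metrics, the resolution knob is set per data point through the `StorageResolution` field of `PutMetricData` (1 for high resolution, 60 for standard). The helper below is a hedged sketch: the function name, the metric name, and the fixed `Milliseconds` unit are illustrative assumptions.

```python
from typing import Any, Dict


def build_datum(name: str, value: float, high_resolution: bool = False) -> Dict[str, Any]:
    """Build one MetricData entry for PutMetricData.

    StorageResolution accepts 1 (high resolution) or 60 (standard, 1-minute).
    """
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Milliseconds",
        # Default to 1-minute storage; opt in to 1-second only when asked.
        "StorageResolution": 1 if high_resolution else 60,
    }
```

With a boto3 CloudWatch client this would be passed as `put_metric_data(Namespace="MyApp", MetricData=[build_datum("Latency", 12.5)])`, where `MyApp` is a placeholder namespace.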


🏛️ Integrations & ingestion

| Option | When to use | Notes |
| --- | --- | --- |
| CloudWatch Agent | EC2/On-prem metrics & logs | Unified agent; supports EMF (Embedded Metric Format) for low-cardinality custom metrics. |
| ADOT/OpenTelemetry | Standardized metrics/traces | Use for polyglot microservices; export to CW + partner backends. |
| PutMetricData / EMF | App-emitted KPIs | Batch & aggregate to cap cardinality; avoid request-ID dimensions. |
| Logs subscription filters | Stream logs to Kinesis/Firehose/Lambda | Offload analytics or real-time processing; mind downstream costs. |
| EventBridge (formerly CW Events) | Event-driven automation | Schedules, rules, cross-service triggers for auto-remediation. |
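As an illustration of the EMF option, the sketch below builds one Embedded Metric Format log line. The `_aws` envelope follows the published EMF structure; the function name, the blocklist of dimension names, and the example values are assumptions made for this page.

```python
import json
import time
from typing import Dict, Optional

# Hypothetical blocklist: dimension names that tend to explode cardinality.
FORBIDDEN_DIMENSIONS = {"request_id", "user_id"}


def emf_record(namespace: str, dimensions: Dict[str, str], metric_name: str,
               value: float, unit: str = "Milliseconds",
               timestamp_ms: Optional[int] = None) -> str:
    """Return a JSON string in Embedded Metric Format for a single metric."""
    bad = FORBIDDEN_DIMENSIONS & {name.lower() for name in dimensions}
    if bad:
        raise ValueError(f"high-cardinality dimensions not allowed: {sorted(bad)}")
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing all supplied dimension names.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    # Dimension values live at the top level of the record.
    record.update(dimensions)
    return json.dumps(record)
```

Printing the returned string to stdout in Lambda, or writing it to a log file collected by the agent, is the usual delivery path; per-request identifiers can still be added as plain top-level fields for searching without becoming dimensions.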

Cross-account observability lets you view many accounts/Regions from a single “monitoring” account without duplicating data.


🧠 CloudWatch FinOps playbook

Metrics (cardinality killers)

  • Design dimensions intentionally (service, endpoint, status class); never use user/request IDs.

  • Prefer metric math & percentiles over emitting many near-duplicate series.

  • Downsample where 1-min is enough; avoid 1-s except where it’s proven necessary.

  • Consider Metric Streams if polling (GetMetricData) is heavy/expensive.
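To make the cardinality point concrete, a rough upper bound on billed time series is the product of the distinct values of each dimension. The sketch below assumes every combination actually occurs, so it is a worst case; the dimension names and counts are made up for illustration.

```python
from math import prod
from typing import Dict


def estimated_series(distinct_values: Dict[str, int]) -> int:
    """Worst-case number of time series for one metric name.

    distinct_values maps each dimension name to its count of distinct values.
    """
    return prod(distinct_values.values())


# Bounded dimensions keep the count small.
safe = estimated_series({"service": 5, "endpoint": 20, "status_class": 4})

# Adding a per-user dimension multiplies the count by the number of users.
risky = estimated_series({"service": 5, "endpoint": 20, "status_class": 4,
                          "user_id": 10_000})
```

Here `safe` is 400 series while `risky` is 4,000,000, which is why identifier-like dimensions are best kept out of metrics.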

Logs (volume + scan)

  • Set retention per log group; don’t keep debug forever.

  • Use IA class for keep-but-rarely-use streams; keep alert-worthy streams in Standard.

  • Partition groups by app/env/Region so Insights scans stay small; always scope time windows & fields.

  • Drop noise at the agent (filters/sampling) before ingestion where safe.

  • Export archives to S3 and query with Athena for long-term analytics.
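A scoped Logs Insights query might look like the sketch below. It only builds the keyword arguments for the `StartQuery` API (boto3 `start_query`), so nothing is scanned until you pass them to a client. The function name, log group name, and query text are illustrative assumptions.

```python
from typing import Any, Dict


def scoped_query(log_group: str, minutes: int, now_epoch: int) -> Dict[str, Any]:
    """Build StartQuery arguments limited to one log group and a short window.

    startTime and endTime are epoch seconds.
    """
    return {
        "logGroupName": log_group,
        "startTime": now_epoch - minutes * 60,
        "endTime": now_epoch,
        # Select only the fields needed and cap the result size.
        "queryString": (
            "fields @timestamp, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc "
            "| limit 50"
        ),
        "limit": 50,
    }
```

Usage would be along the lines of `logs.start_query(**scoped_query("/app/prod/api", 15, int(time.time())))`. Because scans are billed on data scanned, the time window and the choice of log group are the main levers here.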

Alarms & analytics

  • Use composite alarms to reduce pages; gate noisy series behind a single “service health” alarm.

  • Reserve anomaly detection for seasonally noisy metrics where static thresholds fail.

  • Scope real-time logs and Live Tail to bursts; don’t leave on by default.
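The single "service health" alarm can be expressed as a composite alarm rule that ORs the underlying alarms. This is a small sketch that only builds the rule string; the alarm names are hypothetical.

```python
from typing import List


def service_health_rule(alarm_names: List[str]) -> str:
    """Build a composite AlarmRule that fires when any child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


rule = service_health_rule(["api-5xx-rate", "api-latency-p99"])
```

The resulting string can be passed as `AlarmRule` to `put_composite_alarm`, with paging actions attached only to the composite so the child alarms stay silent.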

Org layout

  • Centralize views with cross-account observability; standardize dashboards & alarms via IaC.

  • Track spend in Cost Explorer/CUR by namespace/log group; add Budgets alerts for ingestion and Insights scans.


💸 Pricing model & common gotchas

  • Metrics: pay per custom metric/time series (dimensions explode cost); API requests bill beyond free allowances.

  • Logs: pay to ingest (uncompressed), store (compressed), and scan (Insights). IA class lowers ingest price but removes some features.

  • Alarms: per alarm; anomaly detection meters multiple internal series.

  • Synthetics/RUM: per run/per event; sample deliberately.

  • Vended logs credits: some services credit part of log delivery (reduces Logs charges).

  • Regional & tiered pricing varies; always model with your Region’s pricing page or AWS Pricing Calculator.

Rule of thumb: Don’t hard-code prices in docs. Keep links to pricing and your internal calculator/runbooks.
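In line with that rule, a cost model can take rates as inputs rather than embedding them. The sketch below covers only the three Logs dimensions named above (ingest, store, scan); the class and function names are assumptions, and real pricing is tiered, which this flat model ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRates:
    """Per-GB rates looked up from your Region's pricing page (not hard-coded)."""
    ingest_per_gb: float
    storage_per_gb_month: float
    scan_per_gb: float


def monthly_logs_cost(rates: LogRates, ingested_gb: float,
                      stored_gb: float, scanned_gb: float) -> float:
    """Flat estimate: ingestion + storage + Logs Insights scanning."""
    return (ingested_gb * rates.ingest_per_gb
            + stored_gb * rates.storage_per_gb_month
            + scanned_gb * rates.scan_per_gb)
```

Feeding it rates from your internal calculator makes it easy to compare, for example, shorter retention against fewer Insights scans.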


⏱️ Automation patterns

  • Retention-as-code: set default retention per account/OU; shorten non-prod.

  • Lifecycle to IA: route cold streams to log groups created in the Infrequent Access class (a log group’s class cannot be changed after creation); expire fast-churn logs quickly.

  • EventBridge + Lambda: auto-remediate when ingestion spikes, when new high-cardinality dimensions appear, or when logs go unencrypted.

  • Pipelines: auto-create dashboards/alarms from tags; version them with Terraform/CloudFormation/CDK.
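The "ingestion spikes" trigger needs some definition of a spike. One simple option, sketched below, compares the latest ingestion figure against a multiple of the recent median. The function name and the default factor of 3 are assumptions; the inputs could come from the `IncomingBytes` metric in the `AWS/Logs` namespace.

```python
from statistics import median
from typing import Sequence


def is_ingestion_spike(baseline_bytes: Sequence[float], latest_bytes: float,
                       factor: float = 3.0) -> bool:
    """Return True when the latest value exceeds factor x the baseline median."""
    if not baseline_bytes:
        # No history yet, so there is nothing to compare against.
        return False
    return latest_bytes > factor * median(baseline_bytes)
```

A Lambda function invoked on a schedule by EventBridge could run this check per log group and then notify the owner or tighten retention. The median is used so that one earlier burst does not inflate the baseline.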


🔒 Security & compliance

  • Encryption: log data is encrypted at rest by default; use customer managed KMS keys per env/app where policy requires.

  • Data protection for logs: targeted PII detection/masking (priced per GB scanned); enable only on the groups that need it.

  • Least privilege: scope IAM on PutMetricData, PutLogEvents, GetMetricData, StartQuery, and KMS actions.

  • Private access: use VPC endpoints for private ingestion and queries.
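A least-privilege ingestion policy along these lines might look like the sketch below, expressed as a Python dict so it can be rendered with `json.dumps`. The function name and all argument values are placeholders. `PutMetricData` is scoped with the `cloudwatch:namespace` condition key because it does not take a resource ARN.

```python
from typing import Any, Dict


def ingestion_policy(region: str, account_id: str,
                     log_group: str, namespace: str) -> Dict[str, Any]:
    """IAM policy allowing writes to one log group and one metric namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOneLogGroup",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # The trailing :* covers the log streams inside the group.
                "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*",
            },
            {
                "Sid": "PublishOneNamespace",
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
                "Condition": {"StringEquals": {"cloudwatch:namespace": namespace}},
            },
        ],
    }
```

Read-side actions such as `GetMetricData` and `StartQuery` would go in a separate policy for the people and tools that query, keeping writers and readers apart.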


📊 Monitoring & tools

  • CloudWatch Metrics & Alarms: golden signals (latency, errors, traffic, saturation).

  • Dashboards: SLOs & cost owner views; keep to essentials to control dashboard charges.

  • Logs Insights: ad-hoc queries; always narrow time & fields to cut GB scanned.

  • ServiceLens/X-Ray: dependency maps & traces for incident drill-downs.

  • Cost Explorer/CUR + Budgets: monthly review of metric counts, log ingest vs retention, Insights scans, Synthetics/RUM volume.


🧪 Practical selection cheat-sheet

  • Infra basics: native service metrics + a handful of custom metrics → standard alarms + a team dashboard.

  • Heavy logs: keep prod/alerted streams Standard; move bulk debug to IA; 7–30d retention; archive to S3.

  • API SLOs: Application Signals + anomaly alarms where needed; 1-min metrics, 1-s only for critical hot spots.

  • User experience: RUM (sampled) + a few Synthetics on checkout/login/search.

  • Multi-account: turn on cross-account observability; central dashboards/alarms; enforce guardrails via SCP/Config.


✅ Checklist


  • CloudWatch pricing & AWS Pricing Calculator

  • Logs classes, retention, and Logs Insights best practices

  • Cross-account observability & ServiceLens/X-Ray

  • Data protection for logs

  • Metric Streams, Application Signals/Insights

  • Internal runbooks: dimension standards, retention defaults, cost guardrails

Features & prices evolve. Validate in your Region before production changes.
