Amazon EC2

🔗 Quicklinks (Bookmark):

Cost Explorer: AWS EC2 by Instance type and Running hours
Reservation Coverage: AWS EC2 RI coverage
Savings Plan Coverage: AWS EC2 SP coverage
Compute Rightsizing: AWS Compute Optimizer Rightsizing
Idle Compute: AWS Compute Optimizer Idle
EC2 Pricing table: AWS EC2 Pricing
EC2 CUR Queries: Query CUR on Athena

Amazon EC2 is the backbone of AWS compute: scalable, customizable, and dangerously easy to overspend on.

Let’s break it down by:

→ What you’re using
→ What you’re paying
→ What you should be doing
→ And the AWS-native tools to make it happen

🚀 What is EC2?

Amazon Elastic Compute Cloud (EC2) provides resizable virtual servers in the cloud.

Available in every AWS region
Billed by the second or hour
Can run Linux, Windows, or custom AMIs
Comes in dozens of instance families across generations

⚙️ Instance Families — Pick the Right Hammer

| Family | Use Case | Notes |
| --- | --- | --- |
| t (Burstable) | Dev, test, low-traffic apps | Great for mostly idle workloads with occasional bursts |
| m (General Purpose) | Web apps, small services | Good default starting point |
| c (Compute Optimized) | High CPU workloads | Perfect for encoding, ML inference |
| r / x (Memory Optimized) | DBs, caches, SAP | Watch memory:cost ratio |
| i / d (Storage Optimized) | OLTP, NoSQL, logs, IOPS-heavy | High local (instance store) throughput and IOPS |
| g, inf, p (Accelerated) | AI/ML, HPC | GPU-backed (g, p) or Inferentia-backed (inf), very expensive |

🧬 Instance Generations

| Generation | Architecture | OS Support | ⚠️ Caveats |
| --- | --- | --- | --- |
| Graviton (m6g, t4g, etc.) | ARM | ✅ Linux only | ❌ No Windows, may need recompiled apps |
| x86 Intel/AMD (m5, c6a, etc.) | x86 | ✅ Linux, ✅ Windows | More costly, but universal compatibility |

Run Linux? Try Graviton. Run Windows or legacy binaries? Stick to x86.

🏛️ Tenancy Options

| Tenancy Type | Use When | Notes |
| --- | --- | --- |
| Shared | Default | ✅ Best for 90% of workloads |
| Dedicated Instance | You need isolation | ⚠️ Slightly more expensive |
| Dedicated Host | BYOL licensing | 💸 Most expensive, billed per host; exposes sockets and cores for per-socket or per-core licenses |

🧠 EC2 Rightsizing Strategy

| Strategy | What to Do | Tools / Notes |
| --- | --- | --- |
| ✅ Quick Wins | Find underutilized instances (e.g. CPU < 10%) | Compute Optimizer, Rightsizing in Cost Explorer |
| 🔁 Same-Family Resize | Downsize within current instance family (e.g. m5.2xlarge → m5.large) | No re-architecture needed |
| 🔄 Cross-Family Change | Migrate to cost-effective families (e.g. m5 → t3 or m5 → m6g) | Use Graviton for Linux (⚠️ no Windows support) |
| 💤 Shut Down Idle | Stop non-prod or idle EC2s automatically during off-hours | Use tags + Instance Scheduler |

💡 Review and adjust sizing monthly — usage changes, so should your provisioning.
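
As a rough illustration of the "CPU < 10%" quick win, the sketch below flags instances whose average CPU stays under a threshold. The `InstanceUsage` type and `find_underutilized` helper are hypothetical names for this example; in practice the samples would come from CloudWatch or Compute Optimizer, and memory and network should be checked before resizing.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class InstanceUsage:
    # Hypothetical container; cpu_samples holds average CPU % per period.
    instance_id: str
    instance_type: str
    cpu_samples: list


def find_underutilized(instances, cpu_threshold=10.0):
    """Return IDs of instances whose mean CPU is below cpu_threshold."""
    flagged = []
    for inst in instances:
        if not inst.cpu_samples:
            continue  # no data: do not guess
        if mean(inst.cpu_samples) < cpu_threshold:
            flagged.append(inst.instance_id)
    return flagged


fleet = [
    InstanceUsage("i-aaa", "m5.2xlarge", [3.0, 5.0, 4.0]),
    InstanceUsage("i-bbb", "m5.large", [40.0, 60.0]),
    InstanceUsage("i-ccc", "t3.medium", []),
]
candidates = find_underutilized(fleet)
```

Here only `i-aaa` is flagged; the instance with no samples is skipped rather than treated as idle.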

💸 Purchase Model Optimization

| Model | Savings | Best For | Risk |
| --- | --- | --- | --- |
| On-Demand | None (baseline) | Dev/test, unpredictable workloads | 💸 High cost |
| Savings Plans | 30–66% | Steady-state compute | ⚠️ Locked 1–3 yrs |
| Reserved Instances | 30–72% | Predictable, type-specific workloads | ⚠️ Less flexibility |
| Spot | 70–90% | Fault-tolerant, stateless apps | ⚠️ Can be interrupted anytime |

➡ Use Savings Plans for baseline. ➡ Use Spot for scale-out workers.
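
To make the baseline-plus-Spot idea concrete, here is a small arithmetic sketch. The function name and the default discounts (30% for Savings Plans, 70% for Spot, the low ends of the ranges above) are assumptions for illustration, not quoted prices.

```python
def blended_hourly_cost(on_demand_rate, baseline_units, peak_units,
                        sp_discount=0.30, spot_discount=0.70):
    """Hourly cost when the baseline is on Savings Plans and the rest on Spot.

    Discounts are illustrative assumptions; real rates vary by instance
    type, region, term, and Spot market conditions.
    """
    baseline = baseline_units * on_demand_rate * (1 - sp_discount)
    burst = (peak_units - baseline_units) * on_demand_rate * (1 - spot_discount)
    return baseline + burst


# 20 instances at peak, 10 of them steady, at a hypothetical $0.10/hour.
all_on_demand = 20 * 0.10
blended = blended_hourly_cost(0.10, baseline_units=10, peak_units=20)
```

With these assumed numbers the blended cost is about half the all On-Demand cost, which is why the split between baseline and scale-out capacity matters more than either discount alone.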

⏱ Scheduled Usage

Stop non-prod resources when not in use.

Tools:

AWS Instance Scheduler
Lambda + EventBridge + Tags
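
A minimal sketch of the scheduling logic a Lambda function might apply, assuming a weekday 08:00 to 20:00 window (the window and both function names are hypothetical). It also shows the share of instance hours a schedule like this removes; stopped instances still pay for their EBS volumes.

```python
from datetime import datetime

HOURS_PER_WEEK = 168


def should_be_running(now, start_hour=8, stop_hour=20):
    """True on weekdays from start_hour (inclusive) to stop_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < stop_hour


def weekly_savings_fraction(start_hour=8, stop_hour=20, days=5):
    """Fraction of weekly instance hours avoided by the schedule."""
    running_hours = (stop_hour - start_hour) * days
    return 1 - running_hours / HOURS_PER_WEEK


monday_morning = datetime(2024, 1, 8, 9, 0)
saturday_morning = datetime(2024, 1, 6, 9, 0)
savings = weekly_savings_fraction()
```

A 12-hour weekday window keeps instances up for 60 of 168 hours, so roughly 64% of compute hours are avoided.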

🔍 EC2 FinOps Toolbox

| Tool | Purpose |
| --- | --- |
| Cost Explorer | Analyze trends, tags, reservations |
| Compute Optimizer | Rightsize + Graviton tips |
| Trusted Advisor | Idle EC2, EBS, Elastic IPs |
| Savings Plans Console | Commit to usage, save up to 66% |
| Reserved Instances | Buy fixed-term EC2 savings |
| CUR + Athena | Deep cost analytics |

📉 Cost View in Cost Explorer

In Cost Explorer:

Filter → Service = EC2
Group by → Instance Type, Region, Tag, or Linked Account
Use RI Coverage, SP Utilization, and Forecasting

🔗 Open Cost Explorer
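
The same view can be requested through the Cost Explorer API. The sketch below only builds the request parameters for `get_cost_and_usage`; the helper name is hypothetical, and the service value shown is the one Cost Explorer typically uses for EC2 instance charges, so check it against your own account.

```python
from datetime import date, timedelta


def ec2_cost_by_instance_type_request(end, days=30):
    """Build get_cost_and_usage parameters: EC2 cost grouped by instance type."""
    start = end - timedelta(days=days)
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Elastic Compute Cloud - Compute"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
    }


params = ec2_cost_by_instance_type_request(date(2024, 3, 31))
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("ce").get_cost_and_usage(**params)
```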

Cost Explorer: Fast-Triage Usage Types 🔍

When you load EC2 in Cost Explorer or in CUR, watch for these usage types and what they often indicate:

| Usage Type Pattern | Likely Meaning |
| --- | --- |
| BoxUsage:* | Base EC2 instance hours, the main compute cost bucket |
| CPUCredits:* | T-family instances in Unlimited mode paying for surplus CPU credits (sustained bursting above baseline) |
| EBSOptimized:* | EC2-Other surcharge for enabling EBS optimization on instance types that do not include it by default |
| DataTransfer-* | Network transfer (inter-AZ, inter-region, internet egress) |
| ElasticIP:* | Idle or unattached Elastic IPs, incurring cost |

Action Tips:

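Cross-reference non-zero BoxUsage with low CPU utilization (from CloudWatch or Compute Optimizer) to find idle instances.
High CPUCredits charges suggest a T-class instance is bursting above its baseline for long stretches: it is undersized or belongs on a fixed-performance family.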
Use tag filters (project, team) to group and triage waste quickly.
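
For a quick triage outside the console, the usage type patterns listed above can be applied to exported line items. This is a sketch: the rule labels and the `summarize` helper are illustrative names, and the optional prefix handling assumes region-prefixed usage types such as `USW2-BoxUsage:t3.micro`.

```python
import re
from collections import defaultdict

# Each rule allows an optional region prefix before the usage type name.
USAGE_TYPE_RULES = [
    (re.compile(r"(?:^|-)BoxUsage(?::|$)"), "instance hours"),
    (re.compile(r"(?:^|-)CPUCredits:"), "T-family CPU credit charges"),
    (re.compile(r"(?:^|-)EBSOptimized:"), "EBS optimization surcharge"),
    (re.compile(r"(?:^|-)DataTransfer-"), "data transfer"),
    (re.compile(r"(?:^|-)ElasticIP:"), "Elastic IP charges"),
]


def classify(usage_type):
    """Map a usage type string to a triage bucket, or 'other'."""
    for pattern, label in USAGE_TYPE_RULES:
        if pattern.search(usage_type):
            return label
    return "other"


def summarize(line_items):
    """Sum cost per bucket for (usage_type, cost) pairs."""
    totals = defaultdict(float)
    for usage_type, cost in line_items:
        totals[classify(usage_type)] += cost
    return dict(totals)


totals = summarize([
    ("USW2-BoxUsage:t3.micro", 10.0),
    ("BoxUsage:m5.large", 5.0),
    ("CPUCredits:t3", 1.5),
    ("EBS:VolumeUsage.gp3", 2.0),
])
```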

📊 Deep Dive with CUR

When querying CUR for EC2 insights, these are your go-to columns:

line_item_resource_id: the EC2 instance ID
product_instance_type: the instance family and size
line_item_usage_type: e.g. BoxUsage, CPUCredits, DataTransfer
line_item_operation: the operation behind the charge, e.g. RunInstances (a suffix encodes the platform)
resourceTags/*: your team/project tag dimensions (exposed in Athena as resource_tags_user_<key>)
line_item_unblended_cost / line_item_blended_cost: cost values

Example Query Prompt: Find t3 instances with steady BoxUsage hours and high CPUCredits charges. These are candidates for a fixed-performance family (m, c) or a larger size.
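
The prompt above could be turned into an Athena query along these lines. Treat it as a sketch: the table name `cur_db.cur_table` and the `year`/`month` partition columns are assumptions about how the CUR was set up, and it assumes CPUCredits line items carry the instance ID in `line_item_resource_id`. The arguments are interpolated directly, so only pass trusted values.

```python
def t3_cpu_credit_query(table="cur_db.cur_table", year="2024", month="3"):
    """Build an Athena SQL string: per-instance t3 hours and CPUCredits cost."""
    return f"""
SELECT line_item_resource_id,
       SUM(CASE WHEN line_item_usage_type LIKE '%BoxUsage:t3%'
                THEN line_item_usage_amount ELSE 0 END) AS instance_hours,
       SUM(CASE WHEN line_item_usage_type LIKE '%CPUCredits:t3%'
                THEN line_item_unblended_cost ELSE 0 END) AS cpu_credit_cost
FROM {table}
WHERE year = '{year}' AND month = '{month}'
  AND line_item_product_code = 'AmazonEC2'
  AND (line_item_usage_type LIKE '%BoxUsage:t3%'
       OR line_item_usage_type LIKE '%CPUCredits:t3%')
GROUP BY line_item_resource_id
ORDER BY cpu_credit_cost DESC
""".strip()


query = t3_cpu_credit_query()
```

The result lists instance hours and credit charges side by side, so both signals can be read per instance.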

🔗 CUR Setup

⚠️ Data Transfer & EBS Callouts

Inter-AZ traffic between EC2 instances is billable in both directions; intra-AZ traffic over private IPs is free (still monitor).
Cross-region transfers and internet egress can dominate cost in chatty applications.
EBS is tightly coupled — most storage cost lives under EBS volumes and snapshots. Migrate gp2 → gp3, right-size throughput/IOPS, clean up orphaned volumes.
Co-locate high-traffic tiers (API + DB, worker + storage) in the same AZ, or use private connectivity (PrivateLink, VPC endpoints) to reduce transfer cost.
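
A back-of-the-envelope estimate shows why chatty cross-AZ tiers add up. The default rate of $0.01 per GB in each direction is an assumption (a commonly published figure for many regions); check the pricing page for yours. The function name is hypothetical.

```python
def inter_az_monthly_cost(gb_per_month, rate_per_gb_each_way=0.01):
    """Estimate monthly inter-AZ transfer cost.

    Each GB is charged once leaving the source AZ and once entering the
    destination AZ, hence the factor of two. The rate is an assumed default.
    """
    return gb_per_month * rate_per_gb_each_way * 2


# 10 TB per month between an API tier and a database in another AZ.
estimate = inter_az_monthly_cost(10_000)
```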

🔮 Advanced Tactics

| Strategy | Why It Matters |
| --- | --- |
| Graviton Migration | Save 20–40% for Linux workloads |
| Mixed-Instance ASG | Use cheapest family type across AZs |
| Spot + On-Demand fallback | Scale with resilience |
| Instance Scheduler | Shut down dev/test nights/weekends |
| Tagging | Enables showback by team/project |
| Convertible RIs | Switch types during term |
| Auto Scaling Right | Prevent zombie capacity |
| Forecasting via CE | Plan future RI/SP purchases |

Spot Strategy — next level

Use MixedInstancesPolicy in ASGs with multiple families and sizes to increase availability.
Define an interruption budget (e.g. allow 10% of capacity to be interrupted) to trade lower cost against reliability.
If you cap the Spot max price (e.g. at 70–90% of On-Demand), expect more interruptions than with the default cap (the On-Demand price), and fall back to On-Demand when Spot is reclaimed.
Monitor spot interruption events and automate instance drainage/shutdown gracefully.
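
A minimal sketch of a `MixedInstancesPolicy` for an Auto Scaling group, built as a plain dictionary. The helper name, launch template name, instance types, and capacity split are illustrative assumptions; the keys follow the Auto Scaling API shape.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=2, on_demand_pct_above_base=25):
    """Build a MixedInstancesPolicy mixing On-Demand and Spot capacity."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several families and sizes widen the Spot pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always keep this many On-Demand instances.
            "OnDemandBaseCapacity": on_demand_base,
            # Above the base, this share is On-Demand and the rest is Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "web-workers",  # hypothetical launch template name
    ["m5.large", "m5a.large", "m6i.large", "c5.large"],
)
# With boto3, this would be passed as MixedInstancesPolicy=policy to
# create_auto_scaling_group or update_auto_scaling_group.
```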

🚨 Security & Compliance for EC2

Ensure latest AMI patching cadence, automate image refresh.
Enforce IMDSv2 and disable IMDSv1 to mitigate SSRF risks.
Limit public IP access; use NAT/Load Balancers + security groups.
Use SCPs / Guardrails to prevent unapproved instance types or regions.
Enforce SSM Patch Manager and logging agents for visibility and drift detection.
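
One way to audit the IMDSv2 item is to scan `describe_instances` output for instances that still accept IMDSv1. The sketch below works on the response's `Reservations` list; the function name and sample data are hypothetical.

```python
def instances_allowing_imdsv1(reservations):
    """Return IDs of instances whose metadata options do not require tokens."""
    offenders = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            options = inst.get("MetadataOptions", {})
            # HttpTokens is "required" when IMDSv2 is enforced.
            if options.get("HttpTokens", "optional") != "required":
                offenders.append(inst["InstanceId"])
    return offenders


sample = [
    {"Instances": [
        {"InstanceId": "i-aaa", "MetadataOptions": {"HttpTokens": "required"}},
        {"InstanceId": "i-bbb", "MetadataOptions": {"HttpTokens": "optional"}},
    ]},
    {"Instances": [{"InstanceId": "i-ccc"}]},
]
offenders = instances_allowing_imdsv1(sample)
```

Instances with missing metadata options are treated as offenders here, a deliberately conservative choice for an audit.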

✅ EC2 FinOps Checklist

Rightsize with Compute Optimizer
Schedule non-prod instance shutdowns
Migrate eligible workloads to Graviton
Buy RIs or SPs for steady workloads
Track RI/SP coverage & utilization
Audit unused EBS volumes and Elastic IPs
Set up CUR and run Athena queries
Monitor EC2 cost trends monthly

🧠 EC2 Cost Optimization Challenges

A Q&A-style deep dive into the most persistent, high-impact AWS EC2 cost problems — and actionable solutions that go beyond “just rightsize it.”

Q1: Why do EC2 bills spiral from over-provisioning or bad pricing choices?

Because workloads evolve, but instance sizes and pricing models don’t. Teams keep on-demand instances running 24/7, even when utilization hovers below 20%.

✅ Solution:

Run AWS Compute Optimizer and Cost Explorer weekly.
Shift predictable loads to Savings Plans / Reserved Instances (up to 72% off).
Use Spot Instances for fault-tolerant or batch workloads (up to 90% off).
Implement instance schedules to stop non-prod workloads after hours.

Q2: Why is committing to Savings Plans or RIs so confusing?

Because predicting your baseline usage is part science, part art. Misjudging it either locks in waste or misses savings.

✅ Solution:

Default to Compute Savings Plans for flexibility.
Use Reserved Instances only where you need a capacity reservation (zonal RIs only; regional RIs do not reserve capacity).
Monitor coverage vs utilization KPIs monthly and rebalance quarterly.
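
The two KPIs answer different questions, which this sketch spells out: coverage asks how much eligible usage a commitment covers, utilization asks how much of the purchased commitment was used. Function names and sample figures are illustrative.

```python
def coverage_pct(covered_usage, total_eligible_usage):
    """Share of eligible usage that ran under a commitment (RI or SP)."""
    if total_eligible_usage == 0:
        return 0.0
    return 100.0 * covered_usage / total_eligible_usage


def utilization_pct(used_commitment, purchased_commitment):
    """Share of the purchased commitment that was actually consumed."""
    if purchased_commitment == 0:
        return 0.0
    return 100.0 * used_commitment / purchased_commitment


# Hypothetical month: 800 of 1,000 eligible hours covered,
# 45 of a 50 (dollars per hour) commitment used on average.
coverage = coverage_pct(800, 1000)
utilization = utilization_pct(45, 50)
```

Low coverage with high utilization points to room for more commitment; high coverage with low utilization points to over-commitment.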

Q3: What’s behind random slowdowns on burstable (T-family) instances?

CPU credits. Once burst credits run out, throttling hits, silently killing performance.

✅ Solution:

Monitor CPUCreditBalance via CloudWatch alarms.
Switch to Unlimited mode (with awareness of extra cost) or scale out horizontally.
Move sustained loads to M/C/R/Graviton families.
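
A sketch of the alarm from the first bullet, built as parameters for CloudWatch `put_metric_alarm`. The helper name, alarm name, threshold, and evaluation window are assumptions to adapt per instance size.

```python
def cpu_credit_alarm_params(instance_id, threshold=50.0):
    """Build put_metric_alarm parameters for a low CPUCreditBalance alarm."""
    return {
        "AlarmName": f"low-cpu-credits-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,            # the metric is reported every 5 minutes
        "EvaluationPeriods": 3,   # 15 minutes below threshold (assumed)
        "Threshold": threshold,   # assumed value; size it to the instance
        "ComparisonOperator": "LessThanThreshold",
    }


alarm = cpu_credit_alarm_params("i-0123456789abcdef0")
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```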

Q4: Why do EBS volumes cause unpredictable slowness and high costs?

Older gp2 volumes tie IOPS to size, forcing over-provisioning for performance.

✅ Solution:

Migrate to gp3 (decouples size and performance).
Allocate precise IOPS/throughput.
For critical workloads, use io2 / io2 Block Express and enable EBS-optimized instances.
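
The gp2 coupling can be shown with its baseline formula: 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000. The sketch ignores gp2 burst credits; function names are illustrative.

```python
GP3_BASELINE_IOPS = 3000  # gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, minimum 100, maximum 16,000."""
    return max(100, min(16000, 3 * size_gib))


def gp2_size_needed_for_iops(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    return -(-target_iops // 3)  # ceiling division


small = gp2_baseline_iops(20)
medium = gp2_baseline_iops(500)
size_for_gp3_parity = gp2_size_needed_for_iops(GP3_BASELINE_IOPS)
```

A gp2 volume needs about 1,000 GiB to sustain the 3,000 IOPS that any gp3 volume gets as its baseline, which is the over-provisioning the answer above describes.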

Q5: How does using the wrong instance family burn money?

Running compute-heavy workloads on general-purpose (M-family) instances or vice versa leads to underutilization or overpayment.

✅ Solution:

Let Compute Optimizer recommend the right family.
Benchmark using sysbench or internal metrics.
Try Graviton (ARM) instances — 15–40% better price-performance, after verifying compatibility.

Q6: Why does networking architecture silently inflate EC2 costs?

Cross-AZ chatter, poor placement, and hairpin NAT traffic increase latency and data transfer costs.

✅ Solution:

Group chatty microservices in cluster Placement Groups.
Use VPC Endpoints (S3, DynamoDB) to bypass NAT.
Deploy Global Accelerator or CloudFront for edge proximity.

Q7: Why do memory-heavy workloads (ML/analytics) overrun budgets?

Memory leaks and over-sized R-instances hide behind “just working” apps.

✅ Solution:

Choose R-family or Graviton memory-optimized instances.
Install the CloudWatch agent to collect memory metrics (EC2 does not publish them by default) and use them to rightsize.
For AI workloads, add KV caching, quantization, or batching.

Q8: How can I safely use Spot Instances without chaos from interruptions?

Spot can save 70–90%, but interruptions kill unprepared apps.

✅ Solution:

Mix Spot + On-Demand in Auto Scaling Groups using attribute-based selection.
Implement checkpointing and handle 2-minute interruption notices.
Enable capacity rebalancing for smarter recovery.
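
Handling the 2-minute notice starts with reading it. The sketch below parses the JSON document that the instance metadata path `spot/instance-action` returns once an interruption is scheduled, and computes the time left. Fetching the document (with an IMDSv2 token) and the drain logic itself are left out; the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_interruption_notice(notice_json, now):
    """Return (action, seconds_remaining) from a Spot instance-action notice."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


sample_notice = '{"action": "terminate", "time": "2024-03-01T12:02:00Z"}'
now = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
action, seconds_left = parse_interruption_notice(sample_notice, now)
```

A worker could poll for the notice every few seconds and use `seconds_left` to decide whether to checkpoint or finish the current task.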

Q9: Why do self-managed databases on EC2 eat into cost savings?

DIY databases accumulate inefficiencies: missing indexes, old AMIs, I/O-heavy storage.

✅ Solution:

Audit queries using pg_stat_statements or slow query logs (Performance Insights is available only after moving to RDS/Aurora).
Move to Amazon RDS/Aurora when possible.
For EC2 DBs: use gp3/io2, tune auto-vacuum, and monitor read/write IOPS.

Q10: Why does Auto Scaling waste resources or fail to respond fast enough?

Bad scaling signals or cooldowns cause over-provisioning or late scaling events.

✅ Solution:

Use Target Tracking policies with metrics like CPU, queue depth, or requests/sec.
Mix instance types with attribute-based selection and capacity rebalancing.
Add warm pools for near-instant scale-out.
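
A target tracking policy on average CPU can be described with a few parameters for the Auto Scaling `put_scaling_policy` call. The helper name, policy name, and the 50% target are assumptions; queue depth or request count would need a different metric specification.

```python
def target_tracking_policy_params(asg_name, target_cpu=50.0):
    """Build put_scaling_policy parameters for CPU-based target tracking."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,  # assumed target; tune per workload
        },
    }


policy_params = target_tracking_policy_params("web-asg")
# With boto3: boto3.client("autoscaling").put_scaling_policy(**policy_params)
```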

⚙️ Quick Wins

Migrate all gp2 → gp3 volumes.
Cover steady baselines with Savings Plans.
Implement instance scheduling for non-prod.
Pilot Graviton instances for 20–30% better price/performance.
Add Spot diversification and cost alarms for accountability.


Last updated 4 months ago
