Amazon EC2

Amazon EC2 is the backbone of AWS compute: scalable, customizable, and dangerously easy to overspend on.

Let’s break it down by:

→ What you’re using

→ What you’re paying

→ What you should be doing

→ And the AWS-native tools to make it happen


🚀 What is EC2?

Amazon Elastic Compute Cloud (EC2) provides resizable virtual servers in the cloud.

  • Available in every AWS region

  • Billed by the second or hour

  • Can run Linux, Windows, or custom AMIs

  • Comes in dozens of instance families across generations


⚙️ Instance Families — Pick the Right Hammer

| Family | Use Case | Notes |
| --- | --- | --- |
| t (Burstable) | Dev, test, low-traffic apps | Great for mostly idle workloads |
| m (General Purpose) | Web apps, small services | Good default starting point |
| c (Compute Optimized) | High CPU workloads | Perfect for encoding, ML inference |
| r / x (Memory Optimized) | DBs, caches, SAP | Watch memory:cost ratio |
| i / d (Storage) | OLTP, NoSQL, logs, IOPS-heavy | High local instance-store throughput (NVMe or HDD) |
| g, inf, p (Accelerated) | AI/ML, HPC | GPU- or Inferentia-backed, very expensive |


🧬 Instance Generations

| Generation | Architecture | OS Support | ⚠️ Caveats |
| --- | --- | --- | --- |
| Graviton (m6g, t4g, etc.) | ARM | ✅ Linux only | ❌ No Windows, may need recompiled apps |
| x86 Intel/AMD (m5, c6a, etc.) | x86 | ✅ Linux, ✅ Windows | More costly, but universal compatibility |

Run Linux? Try Graviton. Run Windows or legacy binaries? Stick to x86.


🏛️ Tenancy Options

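| Tenancy Type | Use When | Notes |
| --- | --- | --- |
| Shared | Default | ✅ Best for 90% of workloads |
| Dedicated Instance | You need isolation | ⚠️ Slightly more expensive |
| Dedicated Host | BYOL licensing | 💸 Most expensive; billed per host, with socket and core visibility for per-socket or per-core licenses |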


🧠 EC2 Rightsizing Strategy

| Strategy | What to Do | Tools / Notes |
| --- | --- | --- |
| Quick Wins | Find underutilized instances (e.g. CPU < 10%) | |
| 🔁 Same-Family Resize | Downsize within current instance family (e.g. m5.2xlarge → m5.large) | No re-architecture needed |
| 🔄 Cross-Family Change | Migrate to cost-effective families (e.g. m5 → t3 or m5 → m6g) | Use Graviton for Linux (⚠️ no Windows support) |
| 💤 Shut Down Idle | Stop non-prod or idle EC2s automatically during off-hours | |

💡 Review and adjust sizing monthly — usage changes, so should your provisioning.
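As a rough illustration of the "Quick Wins" row, the sketch below flags instances whose average CPU sits under a threshold. The `InstanceUsage` type, the instance IDs, and the sample numbers are hypothetical; in practice the samples would come from CloudWatch or Compute Optimizer, and CPU alone is rarely enough to justify a resize.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class InstanceUsage:
    instance_id: str          # hypothetical ID, for illustration only
    instance_type: str
    cpu_samples: list         # average CPU utilization in percent, one value per period


def find_underutilized(fleet, cpu_threshold=10.0):
    """Return IDs of instances whose mean CPU is below the threshold.

    Instances with no samples are skipped rather than flagged.
    """
    return [
        i.instance_id
        for i in fleet
        if i.cpu_samples and mean(i.cpu_samples) < cpu_threshold
    ]


fleet = [
    InstanceUsage("i-aaa", "m5.2xlarge", [3.0, 5.5, 4.2]),
    InstanceUsage("i-bbb", "m5.large", [55.0, 62.0, 48.0]),
    InstanceUsage("i-ccc", "t3.medium", []),
]
print(find_underutilized(fleet))
```

Memory, network, and disk metrics are worth checking before acting on a candidate from a list like this.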


💸 Purchase Model Optimization

| Model | Savings | Best For | Risk |
| --- | --- | --- | --- |
| On-Demand | 0% | Dev/test, unpredictable workloads | 💸 High cost |
| Savings Plans | 30–66% (up to 72% for EC2 Instance Savings Plans) | Steady-state compute | ⚠️ Locked for 1 or 3 yrs |
| Reserved Instances | 30–72% | Predictable, type-specific workloads | ⚠️ Less flexibility |
| Spot | 70–90% | Fault-tolerant, stateless apps | ⚠️ Can be interrupted with a 2-minute notice |

➡ Use Savings Plans for baseline.

➡ Use Spot for scale-out workers.
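To see why the baseline-plus-Spot split matters, here is a small arithmetic sketch. The hourly rate, the instance counts, and the two discount figures are made-up assumptions chosen from the low end of the ranges above, not AWS prices.

```python
def blended_hourly_cost(baseline_units, burst_units, on_demand_rate,
                        sp_discount, spot_discount):
    """Hourly cost when the baseline is covered by a Savings Plan and the
    burst capacity runs on Spot. Discounts are fractions (0.30 = 30%)."""
    committed = baseline_units * on_demand_rate * (1 - sp_discount)
    spot = burst_units * on_demand_rate * (1 - spot_discount)
    return committed + spot


# Hypothetical fleet: 10 steady instances, 5 scale-out workers, $0.10/hour each.
rate = 0.10
all_on_demand = (10 + 5) * rate
mixed = blended_hourly_cost(10, 5, rate, sp_discount=0.30, spot_discount=0.70)
savings = 1 - mixed / all_on_demand

print(f"all On-Demand: ${all_on_demand:.2f}/h, mixed: ${mixed:.2f}/h, "
      f"saved: {savings:.0%}")
```

The same function can be rerun with the discounts from an actual Savings Plans quote and observed Spot prices.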


⏱ Scheduled Usage

Stop non-prod resources when not in use.
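A minimal sketch of the scheduling decision, assuming a hypothetical weekday window of 08:00 to 20:00. A real setup would pair this kind of check with a scheduler that actually stops and starts the instances.

```python
from datetime import datetime

HOURS_PER_WEEK = 24 * 7


def should_be_running(now, start_hour=8, stop_hour=20, workdays=(0, 1, 2, 3, 4)):
    """True if `now` falls inside the working window (Monday is weekday 0)."""
    return now.weekday() in workdays and start_hour <= now.hour < stop_hour


def weekly_hours_saved(start_hour=8, stop_hour=20, workdays=(0, 1, 2, 3, 4)):
    """Hours per week an instance is stopped under this schedule."""
    running = (stop_hour - start_hour) * len(workdays)
    return HOURS_PER_WEEK - running


print(should_be_running(datetime(2024, 1, 8, 9, 0)))   # a Monday morning
print(weekly_hours_saved())
```

With this window an instance runs 60 of 168 hours a week, so roughly two thirds of its compute hours disappear. Attached EBS volumes keep billing while the instance is stopped.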



🔍 EC2 FinOps Toolbox

| Tool | Purpose |
| --- | --- |
| Cost Explorer | Analyze trends, tags, reservations |
| Compute Optimizer | Rightsize + Graviton tips |
| Trusted Advisor | Idle EC2, EBS, Elastic IPs |
| Savings Plans Console | Commit to usage, save up to 72% (66% for Compute Savings Plans) |
| Reserved Instances | Buy fixed-term EC2 savings |
| CUR + Athena | Deep cost analytics |


📉 Cost View in Cost Explorer

In Cost Explorer:

  • Filter → Service = EC2

  • Group by → Instance Type, Region, Tag, or Linked Account

  • Use RI Coverage, SP Utilization, and Forecasting

🔗 Open Cost Explorer
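The same filter and grouping can be expressed as a request for the Cost Explorer API. The sketch below only builds the parameter dictionary, shaped for boto3's `get_cost_and_usage`. The function name is hypothetical, and the service name string is an assumption worth verifying against the values shown in your own account.

```python
from datetime import date


def ec2_cost_request(start, end, group_key="INSTANCE_TYPE"):
    """Build Cost Explorer parameters: EC2 compute cost grouped by one dimension."""
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                # Assumed display name for EC2 compute; check it in your account.
                "Values": ["Amazon Elastic Compute Cloud - Compute"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": group_key}],
    }


request = ec2_cost_request(date(2024, 1, 1), date(2024, 2, 1))
print(request["TimePeriod"])
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("ce").get_cost_and_usage(**request)
```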


Cost Explorer: Fast-Triage Usage Types 🔍

When you load EC2 in Cost Explorer or in CUR, watch for these usage types and what they often indicate:

| Usage Type Pattern | Likely Meaning |
| --- | --- |
| BoxUsage:* | Base EC2 instance hours: the main compute cost bucket |
| CPUCredits:* | T-family instances in Unlimited mode paying for surplus CPU credits |
| EBSOptimized:* | EC2-Other surcharge for EBS optimization on instance types where it is not enabled by default |
| DataTransfer-* | Network transfer (inter-AZ, inter-region, internet egress) |
| ElasticIP:* | Idle or unattached Elastic IPs, incurring cost |

Action Tips:

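  • Cross-check instances that show steady BoxUsage hours against low CloudWatch CPUUtilization to find idle instances (Cost Explorer shows hours and cost, not utilization).

  • Recurring CPUCredits charges mean a T-class instance is bursting above its baseline in Unlimited mode; it is likely under-sized or better suited to a fixed-performance family.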

  • Use tag filters (project, team) to group and triage waste quickly.
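For a first triage pass, usage types can be bucketed with a few patterns. This is a sketch over made-up rows: the bucket names are hypothetical, and the region prefixes (such as `USW2-`) reflect how usage types commonly appear in bills, so the patterns search anywhere in the string.

```python
import re
from collections import defaultdict

# Ordered (bucket, pattern) pairs; the first match wins.
BUCKETS = [
    ("compute", re.compile(r"BoxUsage")),
    ("t_surplus_credits", re.compile(r"CPUCredits")),
    ("ebs_optimized", re.compile(r"EBSOptimized")),
    ("data_transfer", re.compile(r"DataTransfer|-Bytes$")),
    ("elastic_ip", re.compile(r"ElasticIP")),
]


def bucket_for(usage_type):
    """Map a usage type string to a triage bucket, or 'other'."""
    for name, pattern in BUCKETS:
        if pattern.search(usage_type):
            return name
    return "other"


def cost_by_bucket(rows):
    """Sum cost per bucket for (usage_type, cost) pairs."""
    totals = defaultdict(float)
    for usage_type, cost in rows:
        totals[bucket_for(usage_type)] += cost
    return dict(totals)


# Made-up sample rows, not real billing data.
rows = [
    ("USW2-BoxUsage:m5.large", 120.0),
    ("CPUCredits:t3", 4.5),
    ("USW2-DataTransfer-Regional-Bytes", 30.0),
    ("ElasticIP:IdleAddress", 3.6),
    ("USW2-BoxUsage:t3.medium", 20.0),
]
print(cost_by_bucket(rows))
```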


📊 Deep Dive with CUR

When querying CUR for EC2 insights, these are your go-to columns:

  • line_item_resource_id: the EC2 instance ID (resource IDs must be enabled in the report)

  • product_instance_type: the instance family and size

  • line_item_usage_type: e.g. BoxUsage, CPUCredits, DataTransfer

  • line_item_operation: the API operation behind the charge, e.g. RunInstances, with suffixes that encode platform details

  • resourceTags/*: your team/project tag dimensions (exposed as resource_tags_user_* columns in Athena)

  • line_item_unblended_cost / line_item_blended_cost: cost values

Example Query Prompt: Find t3 instances with recurring CPUCredits charges. They are paying for surplus credits in Unlimited mode, so they are candidates for a larger size or a fixed-performance family.
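One way to start on a query like the prompt above is to sum CPUCredits charges per instance. The sketch below only builds the SQL text. The table name `cur_db.cur_table` and the `year`/`month` partition columns are assumptions based on a typical CUR-to-Athena setup, so adjust them to your own report.

```python
def cpu_credit_query(table="cur_db.cur_table", year="2024", month="1"):
    """Return Athena SQL that ranks EC2 resources by CPUCredits cost.

    Arguments are interpolated directly, so pass trusted values only.
    """
    return f"""
SELECT line_item_resource_id,
       SUM(line_item_unblended_cost) AS cpu_credit_cost
FROM {table}
WHERE line_item_product_code = 'AmazonEC2'
  AND line_item_usage_type LIKE '%CPUCredits:%'
  AND year = '{year}' AND month = '{month}'
GROUP BY line_item_resource_id
ORDER BY cpu_credit_cost DESC
""".strip()


print(cpu_credit_query())
```

Joining the result with CloudWatch CPU data is what tells you whether an instance needs a different size or a different family.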

🔗 CUR Setup


⚠️ Data Transfer & EBS Callouts

  • Inter-AZ traffic between EC2 instances is billable in both directions; intra-AZ traffic over private IPs is free (still monitor).

  • Cross-region transfers and internet egress can dominate cost in chatty applications.

  • EBS is tightly coupled — most storage cost lives under EBS volumes and snapshots. Migrate gp2 → gp3, right-size throughput/IOPS, clean up orphaned volumes.

  • Co-locate high-traffic tiers (API + DB, worker + storage) in the same AZ, or use private connectivity such as VPC endpoints, to reduce transfer cost.
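A quick estimate shows how inter-AZ chatter adds up. The price of $0.01 per GB in each direction is an assumption that matches commonly published rates; confirm the current figure for your region before relying on it.

```python
def inter_az_monthly_cost(gb_per_day, price_per_gb_each_way=0.01, days=30):
    """Estimate monthly inter-AZ transfer cost.

    Traffic is charged on both the sending and the receiving side,
    hence the factor of 2. The default price is an assumption.
    """
    return gb_per_day * days * price_per_gb_each_way * 2


# Hypothetical service pair exchanging 500 GB/day across AZs.
print(f"${inter_az_monthly_cost(500):.2f} per month")
```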


🔮 Advanced Tactics

| Strategy | Why It Matters |
| --- | --- |
| Graviton Migration | Save 20–40% for Linux workloads |
| Mixed-Instance ASG | Use cheapest family type across AZs |
| Spot + On-Demand fallback | Scale with resilience |
| Instance Scheduler | Shut down dev/test nights/weekends |
| Tagging | Enables showback by team/project |
| Convertible RIs | Switch types during term |
| Auto Scaling Right | Prevent zombie capacity |
| Forecasting via CE | Plan future RI/SP purchases |

Spot Strategy — next level

  • Use MixedInstancesPolicy in ASGs with multiple families and sizes to increase availability.

  • Define an interruption budget (e.g. allow 10% of capacity to be interrupted) to trade lower cost against reliability.

  • Use dynamic max price caps (e.g. set to 70–90% of On-Demand) and fall back to On-Demand when Spot is reclaimed.

  • Monitor spot interruption events and automate instance drainage/shutdown gracefully.
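The first bullet could look roughly like this as a `MixedInstancesPolicy` structure for boto3's `create_auto_scaling_group`. The helper name, the launch template name, the instance types, and the On-Demand split are placeholders, and the allocation strategy shown is one reasonable choice, not the only one.

```python
def mixed_instances_policy(template_name, instance_types,
                           on_demand_base=1, on_demand_pct_above_base=20):
    """Build a MixedInstancesPolicy dict mixing On-Demand and Spot capacity."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,  # hypothetical template
                "Version": "$Latest",
            },
            # More types across families and sizes widen the Spot pools available.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy("workers-template", ["m5.large", "m5a.large", "m6i.large"])
print(policy["InstancesDistribution"])
```

Here one instance is always On-Demand, and 20% of capacity above that base stays On-Demand while the rest runs on Spot.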


🚨 Security & Compliance for EC2

  • Keep AMIs on a regular patching cadence and automate image refresh.

  • Enforce IMDSv2 and disable IMDSv1 to reduce SSRF risk.

  • Limit public IP access; use NAT/Load Balancers + security groups.

  • Use SCPs / Guardrails to prevent unapproved instance types or regions.

  • Enforce SSM Patch Manager and logging agents for visibility and drift detection.
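The IMDSv2 bullet can be audited from `describe_instances` output. This sketch works on data shaped like that response, where `MetadataOptions.HttpTokens` is `required` when IMDSv2 is enforced. The sample reservations are invented.

```python
def instances_allowing_imdsv1(reservations):
    """Return IDs of instances that do not require IMDSv2 session tokens."""
    offenders = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            # Treat a missing value as "optional" so gaps are flagged, not hidden.
            tokens = inst.get("MetadataOptions", {}).get("HttpTokens", "optional")
            if tokens != "required":
                offenders.append(inst["InstanceId"])
    return offenders


# Invented sample in the shape of an EC2 describe_instances response.
sample = [
    {"Instances": [
        {"InstanceId": "i-111", "MetadataOptions": {"HttpTokens": "required"}},
        {"InstanceId": "i-222", "MetadataOptions": {"HttpTokens": "optional"}},
    ]},
    {"Instances": [{"InstanceId": "i-333"}]},
]
print(instances_allowing_imdsv1(sample))
```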


✅ EC2 FinOps Checklist


🧠 EC2 Cost Optimization Challenges

A Q&A-style deep dive into the most persistent, high-impact AWS EC2 cost problems — and actionable solutions that go beyond “just rightsize it.”


Q1: Why do EC2 bills spiral from over-provisioning or bad pricing choices?

Because workloads evolve, but instance sizes and pricing models don’t. Teams keep on-demand instances running 24/7, even when utilization hovers below 20%.

✅ Solution:

  • Run AWS Compute Optimizer and Cost Explorer weekly.

  • Shift predictable loads to Savings Plans / Reserved Instances (up to 72% off).

  • Use Spot Instances for fault-tolerant or batch workloads (up to 90% off).

  • Implement instance schedules to stop non-prod workloads after hours.


Q2: Why is committing to Savings Plans or RIs so confusing?

Because predicting your baseline usage is part science, part art. Misjudging it either locks in waste or misses savings.

✅ Solution:

  • Default to Compute Savings Plans for flexibility.

  • Use zonal Reserved Instances only where you need guaranteed capacity (regional RIs do not reserve capacity).

  • Monitor coverage vs utilization KPIs monthly and rebalance quarterly.
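The two KPIs answer different questions: utilization asks whether you are using what you committed to, and coverage asks how much eligible spend the commitment reaches. A simplified sketch with hypothetical dollar figures follows; the AWS console computes these with more nuance.

```python
def sp_utilization(used_commitment, total_commitment):
    """Share of the committed amount that was actually consumed."""
    return used_commitment / total_commitment if total_commitment else 0.0


def sp_coverage(covered_spend, on_demand_spend):
    """Share of eligible spend that ran under the commitment."""
    eligible = covered_spend + on_demand_spend
    return covered_spend / eligible if eligible else 0.0


# Hypothetical month: $100 committed, $90 used; $600 covered, $400 left On-Demand.
print(f"utilization: {sp_utilization(90, 100):.0%}")
print(f"coverage:    {sp_coverage(600, 400):.0%}")
```

Low utilization points to over-commitment. Low coverage with high utilization suggests room for a larger commitment.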


Q3: What’s behind random slowdowns on burstable (T-family) instances?

CPU credits. In standard mode, once burst credits run out the instance is throttled to its baseline, silently killing performance.

✅ Solution:

  • Monitor CPUCreditBalance via CloudWatch alarms.

  • Switch to Unlimited mode (with awareness of extra cost) or scale out horizontally.

  • Move sustained loads to M/C/R/Graviton families.
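The credit mechanics can be estimated with simple arithmetic: one CPU credit is one vCPU at 100% for one minute. The sketch below uses figures commonly documented for t3.medium (2 vCPUs, 24 credits earned per hour, 576 maximum balance); treat them as assumptions and check the current AWS tables for your instance type.

```python
def hours_until_throttled(balance, vcpus, avg_utilization, credits_earned_per_hour):
    """Hours until the credit balance reaches zero in standard mode.

    avg_utilization is a fraction (0.5 = 50%) averaged across all vCPUs.
    Returns None when credits are earned at least as fast as they are spent.
    """
    spent_per_hour = vcpus * avg_utilization * 60  # credits are vCPU-minutes
    net_drain = spent_per_hour - credits_earned_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


# Assumed t3.medium figures: full balance, sustained 50% CPU on both vCPUs.
print(hours_until_throttled(576, vcpus=2, avg_utilization=0.5,
                            credits_earned_per_hour=24))
```

A workload that drains a full balance in well under a day is a sign it belongs on a fixed-performance family.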


Q4: Why do EBS volumes cause unpredictable slowness and high costs?

Older gp2 volumes tie IOPS to size, forcing over-provisioning for performance.

✅ Solution:

  • Migrate to gp3 (decouples size and performance).

  • Allocate precise IOPS/throughput.

  • For critical workloads, use io2 / io2 Block Express and enable EBS-optimized instances.
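The over-provisioning effect follows from the gp2 baseline formula of 3 IOPS per GiB, with a floor of 100 and a cap of 16,000. gp3 instead starts at 3,000 IOPS regardless of size. A small sketch of the arithmetic, with hypothetical helper names:

```python
import math

GP3_BASELINE_IOPS = 3000   # gp3 baseline, independent of volume size
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume: 3 per GiB, clamped to [100, 16000]."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, 3 * size_gib))


def gp2_size_for_iops(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    return math.ceil(min(target_iops, GP2_MAX_IOPS) / 3)


print(gp2_baseline_iops(100))    # a 100 GiB gp2 volume
print(gp2_size_for_iops(3000))   # GiB of gp2 needed to match the gp3 baseline
```

Smaller gp2 volumes can burst above their baseline for a while, which is why the slowness tends to look unpredictable: performance drops once the burst allowance is used up.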


Q5: How does using the wrong instance family burn money?

Running compute-heavy workloads on general-purpose (M-family) instances or vice versa leads to underutilization or overpayment.

✅ Solution:

  • Let Compute Optimizer recommend the right family.

  • Benchmark using sysbench or internal metrics.

  • Try Graviton (ARM) instances — 15–40% better price-performance, after verifying compatibility.


Q6: Why does networking architecture silently inflate EC2 costs?

Cross-AZ chatter, poor placement, and hairpin NAT traffic increase latency and data transfer costs.

✅ Solution:

  • Group chatty microservices in cluster Placement Groups.

  • Use VPC Endpoints (S3, DynamoDB) to bypass NAT.

  • Deploy Global Accelerator or CloudFront for edge proximity.


Q7: Why do memory-heavy workloads (ML/analytics) overrun budgets?

Memory leaks and over-sized R-instances hide behind “just working” apps.

✅ Solution:

  • Choose R-family or Graviton memory-optimized instances.

  • Install the CloudWatch agent to collect memory metrics (EC2 does not publish them by default) and rightsize from them.

  • For AI workloads, add KV caching, quantization, or batching.


Q8: How can I safely use Spot Instances without chaos from interruptions?

Spot can save 70–90%, but interruptions kill unprepared apps.

✅ Solution:

  • Mix Spot + On-Demand in Auto Scaling Groups using attribute-based selection.

  • Implement checkpointing and handle 2-minute interruption notices.

  • Enable capacity rebalancing for smarter recovery.
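Handling the 2-minute notice starts with reading it. On a Spot instance the notice appears at the instance metadata path `spot/instance-action` as a small JSON document. The sketch below parses such a body and works out the time left; the sample body and timestamps are invented, and the polling and draining logic is left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(instance_action_body, now):
    """Parse an instance-action notice and return (action, seconds remaining)."""
    notice = json.loads(instance_action_body)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


# Invented notice body and clock reading for illustration.
body = '{"action": "terminate", "time": "2024-01-08T12:02:00Z"}'
now = datetime(2024, 1, 8, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(body, now))
```

A worker would typically poll the endpoint every few seconds, then checkpoint and deregister from the load balancer as soon as a notice shows up.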


Q9: Why do self-managed databases on EC2 eat into cost savings?

DIY databases accumulate inefficiencies: missing indexes, old AMIs, I/O-heavy storage.

✅ Solution:

  • Audit queries using pg_stat_statements or slow query logs on EC2 (Performance Insights becomes available once you are on RDS/Aurora).

  • Move to Amazon RDS/Aurora when possible.

  • For EC2 DBs: use gp3/io2, tune auto-vacuum, and monitor read/write IOPS.


Q10: Why does Auto Scaling waste resources or fail to respond fast enough?

Bad scaling signals or cooldowns cause over-provisioning or late scaling events.

✅ Solution:

  • Use Target Tracking policies with metrics like CPU, queue depth, or requests/sec.

  • Mix instance types with attribute-based selection and capacity rebalancing.

  • Add warm pools for near-instant scale-out.
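A Target Tracking policy is mostly configuration. The sketch below builds parameters shaped for boto3's Auto Scaling `put_scaling_policy`, using the predefined average-CPU metric. The helper name, the group name, and the 50% target are placeholders.

```python
def target_tracking_policy(asg_name, target_cpu=50.0):
    """Build put_scaling_policy parameters that hold average CPU near a target."""
    return {
        "AutoScalingGroupName": asg_name,  # hypothetical group name
        "PolicyName": f"{asg_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }


params = target_tracking_policy("web-asg")
print(params["PolicyName"])
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("autoscaling").put_scaling_policy(**params)
```

Queue depth or requests per target need a different metric specification, but the overall structure stays the same.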


⚙️ Quick Wins

  • Migrate all gp2 → gp3 volumes.

  • Cover steady baselines with Savings Plans.

  • Implement instance scheduling for non-prod.

  • Pilot Graviton instances for 20–30% better price/performance.

  • Add Spot diversification and cost alarms for accountability.

