Design Metrics Monitoring & Alerting System: System Design Interview (Stripe & Amazon Offers)

TechPrep

5 chapters7 takeaways18 key terms6 questions

Overview

This video details the design of a scalable metrics monitoring and alerting system, similar to DataDog and Prometheus, crucial for understanding infrastructure and application health. It covers functional requirements like metric ingestion, querying, visualization, and alerting, alongside non-functional requirements such as massive scalability (10M+ data points/sec), high availability, low-latency alerting, fast query performance, and storage efficiency. The design employs a polyglot persistence strategy using Cassandra for time-series data, PostgreSQL for metadata, Redis for caching, Kafka for buffering, and S3 for cold storage. Key deep dives include advanced compression techniques (Gorilla, delta-of-delta, XOR), real-time alerting with Flink, data downsampling for retention, and strategies for handling cardinality explosions and sharding.

How was this?

Save this permanently with flashcards, quizzes, and AI chat

Chapters

Functional requirements include metric ingestion (CPU usage, latency) with tags, querying/visualization for dashboards, alerting based on thresholds (e.g., CPU > 90% for 5 min), and notifications to external channels (Slack, PagerDuty).
Non-functional requirements emphasize massive scalability (10M+ data points/sec), high write availability (99.99%), low-latency alerting (seconds), fast query performance (<1 sec for dashboards), and extreme storage efficiency via compression and downsampling.
A polyglot persistence strategy is essential: Cassandra/HBase for time-series data (write-optimized), PostgreSQL for metadata (user accounts, rules), Redis for caching and alert state, Kafka as a durable ingestion buffer, and S3 for long-term cold storage of downsampled data.

Understanding these requirements and the data model choices is fundamental to designing a system that can handle the immense scale and diverse needs of modern infrastructure monitoring.

Alerting if CPU usage on 'host A' in the 'production' environment exceeds 90% for 5 minutes.

The API provides endpoints for ingesting metrics (POST /v1/series) with timestamps, values, and tags, querying metrics (GET /v1/query) with time ranges and expressions, and configuring alerts (POST /v1/alerts) with rules and notification targets.
The high-level architecture includes distributed agents pushing metrics to a load balancer/gateway, an ingestion service validating requests against Redis quotas, Kafka for buffering, TSDB writers compressing and storing data in Cassandra, and background Spark jobs for downsampling to S3.
Real-time alerting uses Flink, a stream processor directly connected to Kafka, maintaining sliding windows and evaluating rules. A notification service checks Redis for deduplication before dispatching alerts.
The query path routes requests to Cassandra for recent data or S3 for historical, downsampled data via a query engine. A metadata service manages alert configurations stored in PostgreSQL, with CDC pushing updates to Flink.

This outlines the system's core components and data flow, illustrating how different services interact to handle ingestion, alerting, and querying at scale.

A POST request to `/v1/series` containing metric data like 'CPU idle' with tags 'env:production' and 'host:web01'.

Storing raw metrics at scale requires extreme compression; Gorilla compression achieves 10-12x ratios.
Timestamps are compressed using 'delta of delta': storing the difference between consecutive timestamps, and then the difference between those differences, resulting in many zeros that can be stored efficiently.
Values are compressed using XOR encoding: XORing the current value with the previous one results in many leading/trailing zeros, significantly reducing storage needs for metrics that change slowly.
LSM trees and SS tables optimize write-heavy workloads by buffering data in memory (memtable) and flushing it sequentially to immutable disk files (SSTables), avoiding random I/O and enabling high write throughput.

These compression techniques are critical for managing the massive storage requirements and reducing costs associated with high-volume telemetry data.

Storing timestamp differences like (10, 10, 10, 12) and then their differences (0, 0, 2) allows most changes to be represented by single bits.

Alerting is done via stream processing (Flink) on the Kafka data stream, bringing rules to the data rather than querying the TSDB repeatedly.
An event-driven approach using Change Data Capture (CDC) from PostgreSQL ensures Flink workers always have the latest alerting rules in their local memory without polling.
Flink uses stateful sliding windows to track metric states over time (e.g., last 5 minutes) and watermarks to handle late-arriving data gracefully, ensuring accurate alert evaluation.
The notification service uses Redis to check the state of an alert (rule ID, host) and suppresses duplicate notifications for ongoing incidents, improving user experience.

This approach enables low-latency alerting by processing data and rules in-memory, ensuring timely notifications for critical issues.

Flink maintains a 5-minute window of CPU data for 'host A' and uses watermarks to correctly process a metric that arrives 3 seconds late due to network issues.

A roll-up pipeline (using Spark) periodically aggregates raw data (10-sec granularity) into lower resolutions (1-min, 1-hour) for long-term storage.
A tiered retention strategy keeps raw data for a short period (e.g., 7 days), aggregated data for longer (e.g., 30 days), and highly aggregated data for a year or more in cheaper object storage (S3).
Querying historical data intelligently routes requests to the appropriate downsampled data in S3, drastically reducing the data processed and ensuring fast dashboard loads.
Cardinality explosions (billions of unique time series due to excessive tagging) are managed by sharding Kafka/TSDB nodes by tenant ID and metric name, and implementing quotas/circuit breakers in the ingestion service to rate-limit tenants exceeding limits.

These strategies are crucial for managing storage costs and ensuring query performance over long time horizons, while also protecting the system from misconfigurations that could lead to instability.

A 6-month trend query is automatically rewritten to fetch data from 1-hour roll-up tables in S3 instead of billions of raw 10-second data points.

Key takeaways

1Designing a high-throughput metrics system requires a polyglot persistence approach, leveraging specialized databases and message queues for different tasks.
2Advanced compression techniques like delta-of-delta and XOR are essential for reducing storage costs and I/O for time-series data.
3Stream processing engines like Flink are key to achieving low-latency alerting by processing data in real-time against rules held in memory.
4Data downsampling and tiered retention are necessary to balance query performance for historical data with storage costs.
5Robust strategies for handling cardinality explosions, such as sharding and rate limiting, are vital for system stability.
6Decoupling ingestion from storage and alerting using Kafka is a fundamental pattern for handling traffic spikes and ensuring availability.
7Understanding the trade-offs between different storage solutions (e.g., Cassandra vs. PostgreSQL vs. S3) is critical for system design.

Key terms

Metrics IngestionTime Series Database (TSDB)Polyglot PersistenceKafkaGorilla CompressionDelta of DeltaXOR EncodingLSM TreesSSTablesFlinkStream ProcessingSliding WindowWatermarksDownsamplingTiered RetentionCardinality ExplosionShardingCircuit Breaker

Test your understanding

1What are the primary functional and non-functional requirements for a metrics monitoring and alerting system?
2How does a polyglot persistence strategy address the conflicting demands of time-series data ingestion and metadata management?
3Explain the core principles behind Gorilla compression, including delta-of-delta for timestamps and XOR for values.
4Why is a stream processing engine like Flink preferred over traditional database queries for real-time alerting?
5How does the system handle data downsampling and tiered retention to manage storage costs and ensure fast query performance for historical data?
6What is a cardinality explosion, and what strategies can be employed to mitigate its impact on the system?