
Design Metrics Monitoring & Alerting System: System Design Interview (Stripe & Amazon Offers)
TechPrep
Overview
This video details the design of a scalable metrics monitoring and alerting system, similar to DataDog and Prometheus, crucial for understanding infrastructure and application health. It covers functional requirements like metric ingestion, querying, visualization, and alerting, alongside non-functional requirements such as massive scalability (10M+ data points/sec), high availability, low-latency alerting, fast query performance, and storage efficiency. The design employs a polyglot persistence strategy using Cassandra for time-series data, PostgreSQL for metadata, Redis for caching, Kafka for buffering, and S3 for cold storage. Key deep dives include advanced compression techniques (Gorilla, delta-of-delta, XOR), real-time alerting with Flink, data downsampling for retention, and strategies for handling cardinality explosions and sharding.
Save this permanently with flashcards, quizzes, and AI chat
Chapters
- Functional requirements include metric ingestion (CPU usage, latency) with tags, querying/visualization for dashboards, alerting based on thresholds (e.g., CPU > 90% for 5 min), and notifications to external channels (Slack, PagerDuty).
- Non-functional requirements emphasize massive scalability (10M+ data points/sec), high write availability (99.99%), low-latency alerting (seconds), fast query performance (<1 sec for dashboards), and extreme storage efficiency via compression and downsampling.
- A polyglot persistence strategy is essential: Cassandra/HBase for time-series data (write-optimized), PostgreSQL for metadata (user accounts, rules), Redis for caching and alert state, Kafka as a durable ingestion buffer, and S3 for long-term cold storage of downsampled data.
- The API provides endpoints for ingesting metrics (POST /v1/series) with timestamps, values, and tags, querying metrics (GET /v1/query) with time ranges and expressions, and configuring alerts (POST /v1/alerts) with rules and notification targets.
- The high-level architecture includes distributed agents pushing metrics to a load balancer/gateway, an ingestion service validating requests against Redis quotas, Kafka for buffering, TSDB writers compressing and storing data in Cassandra, and background Spark jobs for downsampling to S3.
- Real-time alerting uses Flink, a stream processor directly connected to Kafka, maintaining sliding windows and evaluating rules. A notification service checks Redis for deduplication before dispatching alerts.
- The query path routes requests to Cassandra for recent data or S3 for historical, downsampled data via a query engine. A metadata service manages alert configurations stored in PostgreSQL, with CDC pushing updates to Flink.
- Storing raw metrics at scale requires extreme compression; Gorilla compression achieves 10-12x ratios.
- Timestamps are compressed using 'delta of delta': storing the difference between consecutive timestamps, and then the difference between those differences, resulting in many zeros that can be stored efficiently.
- Values are compressed using XOR encoding: XORing the current value with the previous one results in many leading/trailing zeros, significantly reducing storage needs for metrics that change slowly.
- LSM trees and SS tables optimize write-heavy workloads by buffering data in memory (memtable) and flushing it sequentially to immutable disk files (SSTables), avoiding random I/O and enabling high write throughput.
- Alerting is done via stream processing (Flink) on the Kafka data stream, bringing rules to the data rather than querying the TSDB repeatedly.
- An event-driven approach using Change Data Capture (CDC) from PostgreSQL ensures Flink workers always have the latest alerting rules in their local memory without polling.
- Flink uses stateful sliding windows to track metric states over time (e.g., last 5 minutes) and watermarks to handle late-arriving data gracefully, ensuring accurate alert evaluation.
- The notification service uses Redis to check the state of an alert (rule ID, host) and suppresses duplicate notifications for ongoing incidents, improving user experience.
- A roll-up pipeline (using Spark) periodically aggregates raw data (10-sec granularity) into lower resolutions (1-min, 1-hour) for long-term storage.
- A tiered retention strategy keeps raw data for a short period (e.g., 7 days), aggregated data for longer (e.g., 30 days), and highly aggregated data for a year or more in cheaper object storage (S3).
- Querying historical data intelligently routes requests to the appropriate downsampled data in S3, drastically reducing the data processed and ensuring fast dashboard loads.
- Cardinality explosions (billions of unique time series due to excessive tagging) are managed by sharding Kafka/TSDB nodes by tenant ID and metric name, and implementing quotas/circuit breakers in the ingestion service to rate-limit tenants exceeding limits.
Key takeaways
- Designing a high-throughput metrics system requires a polyglot persistence approach, leveraging specialized databases and message queues for different tasks.
- Advanced compression techniques like delta-of-delta and XOR are essential for reducing storage costs and I/O for time-series data.
- Stream processing engines like Flink are key to achieving low-latency alerting by processing data in real-time against rules held in memory.
- Data downsampling and tiered retention are necessary to balance query performance for historical data with storage costs.
- Robust strategies for handling cardinality explosions, such as sharding and rate limiting, are vital for system stability.
- Decoupling ingestion from storage and alerting using Kafka is a fundamental pattern for handling traffic spikes and ensuring availability.
- Understanding the trade-offs between different storage solutions (e.g., Cassandra vs. PostgreSQL vs. S3) is critical for system design.
Key terms
Test your understanding
- What are the primary functional and non-functional requirements for a metrics monitoring and alerting system?
- How does a polyglot persistence strategy address the conflicting demands of time-series data ingestion and metadata management?
- Explain the core principles behind Gorilla compression, including delta-of-delta for timestamps and XOR for values.
- Why is a stream processing engine like Flink preferred over traditional database queries for real-time alerting?
- How does the system handle data downsampling and tiered retention to manage storage costs and ensure fast query performance for historical data?
- What is a cardinality explosion, and what strategies can be employed to mitigate its impact on the system?