Post

Webhook System Design - One Pager

Webhook System Design - One Pager

Overview

A scalable webhook delivery system that enables users to register webhooks for specific events and reliably delivers event notifications to registered endpoints with comprehensive monitoring and retry capabilities. Webhook Diagram

Functional Requirements

FR1: User Webhook Registration

  • Users can register webhooks with custom URLs, authentication methods, and event types
  • Support for webhook configuration including expiration time and retry policies
  • Ability to manage (create, update, delete) webhook subscriptions

FR2: Event-Triggered Webhook Callbacks

  • System automatically triggers webhook callbacks when registered events occur
  • Asynchronous processing of webhook deliveries
  • Support for multiple event types and selective webhook routing

FR3: Webhook Observability

  • Users can monitor webhook delivery status and execution history
  • Detailed execution logs with success/failure tracking
  • Real-time status updates and historical analytics

Non-Functional Requirements

NFR1: At-Least-Once Delivery Guarantee

  • Implement retry mechanisms with configurable retry strategies
  • Persistent execution queue to prevent message loss
  • Idempotency considerations for duplicate deliveries

NFR2: High Availability & Horizontal Scalability

  • Distributed worker architecture for processing webhook deliveries
  • Load balancer for distributing traffic across services
  • Stateless services to enable easy scaling

NFR3: Security

  • Authentication support for webhook endpoints
  • Secure storage of webhook credentials
  • Request signing and validation mechanisms

System Architecture

Core Components

1. Webhook Manager Service

  • Responsibility: Handle webhook CRUD operations
  • API Endpoints:
    • POST /webhooks - Register new webhook
    • GET /webhooks/{id} - Retrieve webhook details
    • PUT /webhooks/{id} - Update webhook configuration
    • DELETE /webhooks/{id} - Remove webhook
  • Data Access: Reads/writes to Webhook Metadata DB

2. Event Trigger Service

  • Responsibility: Receive events and initiate webhook processing
  • Flow:
    1. Receive events from event queue
    2. Query Webhook Metadata DB for matching webhooks
    3. Create execution tasks and enqueue to execution queue
  • Caching: Uses cache layer for frequently accessed webhook configurations

3. Execution Queue

  • Technology: Message queue (e.g., Kafka, RabbitMQ, SQS)
  • Purpose: Decouple event reception from webhook delivery
  • Features:
    • Persistent storage for durability
    • Support for message prioritization
    • Dead letter queue for failed deliveries

4. Worker Pool

  • Responsibility: Process webhook deliveries
  • Characteristics:
    • Horizontally scalable
    • Pull messages from execution queue
    • Make HTTP requests to webhook endpoints
    • Record execution results to delivery log DB
  • Error Handling:
    • Implement retry logic based on webhook configuration
    • Update delivery status in logs
    • Route permanently failed deliveries to monitoring system

5. Retry Service

  • Responsibility: Handle failed webhook deliveries
  • Retry Strategies:
    • Exponential backoff
    • Fixed interval
    • Custom retry schedules
  • Configuration: Per-webhook retry policy settings

6. Monitoring Manager

  • Responsibility: Provide observability into webhook operations
  • Features:
    • Dashboard for webhook status
    • Execution metrics and analytics
    • Alert generation for failures
    • Historical trend analysis

7. Load Balancer

  • Purpose: Distribute incoming requests across service instances
  • Targets:
    • Webhook Manager API
    • Event Trigger Service
    • Monitoring endpoints

Data Stores

Webhook Metadata Database

Schema:

1
2
3
4
5
6
7
8
9
10
11
12
Webhook {
  web_hook_id: UUID (PK)
  url: String
  auth: JSON (credentials, tokens)
  event_type: String[]
  user_id: UUID
  expire_at: Timestamp
  status: Enum (active, inactive, expired)
  retry_type: Enum (exponential, fixed, custom)
  created_at: Timestamp
  updated_at: Timestamp
}

Characteristics:

  • Relational DB for ACID properties
  • Indexed on user_id, event_type for fast queries

Webhook Delivery Log Database

Schema:

1
2
3
4
5
6
7
8
9
10
11
execution_log {
  event_id: UUID (PK)
  event_type: String
  webhook: JSON (snapshot of webhook config)
  status: Enum (pending, success, failure, retrying)
  response_code: Integer
  response_body: Text
  attempt_count: Integer
  executed_at: Timestamp
  next_retry_at: Timestamp
}

Characteristics:

  • High write throughput optimization
  • Time-series optimized storage
  • Partitioned by date for efficient querying

Cache Layer

  • Purpose: Reduce database load for frequently accessed webhooks
  • Cached Data:
    • Active webhook configurations
    • Event-to-webhook mappings
  • Cache Invalidation: On webhook updates/deletions

Data Flow

Webhook Registration Flow

1
2
3
User → Load Balancer → Webhook Manager → Webhook Metadata DB
                                      ↓
                                    Cache (update)

Event Processing Flow

1
2
3
4
5
6
7
8
9
Event Source → Events Queue → Event Trigger Service
                                    ↓
                          Query Cache/Metadata DB
                                    ↓
                            Execution Queue → Worker Pool → Webhook Endpoint
                                                    ↓
                                          Delivery Log DB
                                                    ↓
                                    (if failure) → Retry Service

Monitoring Flow

1
2
3
User → Load Balancer → Monitoring Manager → Delivery Log DB
                                                  ↓
                                         (analytics & metrics)

Key Design Decisions

1. Queue-Based Architecture

  • Rationale: Decouples event generation from delivery, enabling better scalability and fault tolerance
  • Trade-off: Adds latency but improves reliability

2. Worker Pool Pattern

  • Rationale: Allows independent scaling of webhook delivery capacity
  • Implementation: Workers compete for messages from execution queue

3. Execution Log Persistence

  • Rationale: Enables audit trails, debugging, and monitoring
  • Optimization: Consider time-based archival for old logs

4. Cache Integration

  • Rationale: Reduces database load for hot webhook configurations
  • Strategy: Cache-aside pattern with TTL-based expiration

5. Retry Service Separation

  • Rationale: Specialized handling of failed deliveries without blocking main worker pool
  • Pattern: Scheduled retry jobs with exponential backoff

Scaling Considerations

Horizontal Scaling Components

  • Webhook Manager (stateless API)
  • Event Trigger Service
  • Worker Pool (most critical for throughput)
  • Monitoring Manager

Bottleneck Mitigation

  • Database: Read replicas for webhook metadata, sharding for delivery logs
  • Queue: Partitioned topics/queues for parallel processing
  • Workers: Auto-scaling based on queue depth

Security Measures

  1. Authentication:
    • Support for multiple auth methods (API keys, OAuth, JWT)
    • Encrypted storage of credentials
  2. Request Validation:
    • HMAC signatures for webhook payloads
    • Timestamp validation to prevent replay attacks
  3. Rate Limiting:
    • Per-webhook delivery rate limits
    • Circuit breaker pattern for problematic endpoints
  4. Data Protection:
    • TLS for all communications
    • PII handling compliance

Monitoring & Alerting

Key Metrics

  • Delivery success rate
  • Average delivery latency
  • Queue depth
  • Worker utilization
  • Retry rate
  • Endpoint availability

Alerts

  • High failure rate for specific webhooks
  • Queue backup/overflow
  • Worker pool saturation
  • Database connection issues

Future Enhancements

  1. Advanced Filtering: Allow complex event filtering based on payload content
  2. Webhook Templates: Pre-configured webhook patterns for common use cases
  3. Delivery Guarantee Options: Support for at-most-once and exactly-once semantics
  4. Payload Transformation: Allow users to customize webhook payload format
  5. Multi-region Deployment: Geographic distribution for lower latency
  6. Webhook Testing: Sandbox environment for webhook validation

Document Version: 1.0
Last Updated: November 1, 2025
Author: Copilot

This post is licensed under CC BY 4.0 by the author.