Designing a Feature Flag System (ChatGPT Interview)

💡 The Problem Statement

"Design a Feature Flag System like LaunchDarkly that allows teams to dynamically enable or disable features without redeploying code."

🔥 Functional Requirements

✅ Toggle Features Dynamically – Features can be turned ON/OFF without redeploying.

✅ Targeting Rules – Flags can be enabled per user, per region, or percentage-based rollouts.

✅ Real-Time Updates – Changes should propagate to all connected clients almost immediately.

✅ Client SDKs – Apps should be able to fetch feature flag values via an API.

✅ Audit Logs – Track who changed what & when.

⚙️ Non-Functional Requirements

🚀 Low Latency – Feature flag evaluation must be fast (<5ms per request).

⚖ High Availability – The system should stay available under load, handling millions of requests per second.

🔄 Eventual Consistency – Flags can propagate with a slight delay but should be eventually consistent.

📈 Scalability – Must work for millions of users & feature flags.

⏳ Constraints & Assumptions

📌 Feature flags are read-heavy (90% reads, 10% writes).

📌 Some flags require real-time updates, while others can be cached.

📌 System should support SDK-based polling & server-side evaluation.

High-Level Architecture

The system should support a high number of feature flags, efficient flag evaluations, and real-time updates to ensure that the most recent changes are reflected across all clients. The architecture should focus on scalability, high availability, low latency, and eventual consistency.

Components:

API Layer:
- RESTful API to interact with feature flags, allowing clients to create, update, delete, and fetch feature flags.
- The API layer is stateless and serves requests from clients (e.g., SDKs).
- The API layer is deployed on a Kubernetes cluster for high availability and scalability. Multiple replicas of the API are deployed for horizontal scaling.
Database:
- The PostgreSQL database stores all feature flags, flag configurations, user targeting data, and change tracking.
- For performance reasons, read operations will be handled by replicas (read-only), while write operations (like creating and updating flags) will be handled by a primary (write) database.
Caching Layer:
- Redis will be used as an in-memory cache to store the most recent flag values to reduce latency on read requests.
- Cached data is refreshed as soon as a feature flag changes, so reads stay fast and reasonably fresh. If Redis is unavailable, the API falls back to reading from the PostgreSQL database.
Real-Time Updates:
- WebSockets or Server-Sent Events (SSE) are used for pushing feature flag changes in real time to all connected clients (SDKs).
- A message queue (e.g., Kafka) will be used to broadcast flag change events across the system to keep all components in sync and ensure eventual consistency.
SDKs:
- Client SDKs will interact with the API to fetch feature flags.
- The SDKs will either poll for flag changes or use WebSockets/SSE for real-time updates.
Audit Logs:
- All changes to feature flags (enable/disable, target rule changes, etc.) will be tracked in the audit log, capturing the change type, timestamp, user ID, and a description of the change.

Detailed Breakdown - Database

We will use PostgreSQL for storing persistent data, and Redis for caching. Here’s how the tables and structures might look:

Feature Flags Table:

This table contains all the flags and their configurations.

CREATE TABLE feature_flags (
  flag_id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description TEXT,
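  -- Allowed values are enforced with CHECK (PostgreSQL has no inline ENUM(...) column syntax)
  type TEXT NOT NULL CHECK (type IN ('boolean', 'string', 'number')),
  filter_key TEXT NOT NULL CHECK (filter_key IN ('user', 'region', 'percentage')),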
  filter_value JSONB,  -- This stores targeting rules in JSON format
  is_active BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

filter_key names the kind of targeting rule ("user", "region", or "percentage"), and filter_value holds that rule's data as JSON.
is_active indicates whether the flag is enabled or disabled.

Change Tracking Table (Audit Logs):

This table logs all changes to feature flags for audit purposes.

CREATE TABLE feature_flag_changes (
  change_id SERIAL PRIMARY KEY,
  flag_id INT REFERENCES feature_flags(flag_id),
  user_id INT NOT NULL,  -- User who made the change
  change_type TEXT NOT NULL CHECK (change_type IN ('created', 'updated', 'deleted')),
  change_description TEXT,
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Flag Targeting Table:

This table stores specific targeting rules for flags, which can be user-specific, region-specific, or percentage-based.

CREATE TABLE flag_targeting (
  targeting_id SERIAL PRIMARY KEY,
  flag_id INT REFERENCES feature_flags(flag_id),
  filter_key TEXT NOT NULL CHECK (filter_key IN ('user', 'region', 'percentage')),
  filter_value JSONB,  -- Specific targeting data (e.g., a list of users, regions, or percentage)
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
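
To make the three targeting kinds concrete, here is a minimal Python sketch of how a flag might be evaluated against rows from flag_targeting. The JSON shapes inside filter_value (the "users", "regions", and "percent" keys) and the function names are assumptions for illustration, not part of the schema above. Percentage rollouts hash the flag name together with the user ID so that a given user always lands in the same bucket.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def matches(rule: dict, flag_name: str, user_id: str, region: str) -> bool:
    """Check one targeting rule; the filter_value keys are assumed shapes."""
    key, value = rule["filter_key"], rule["filter_value"]
    if key == "user":
        return user_id in value["users"]
    if key == "region":
        return region in value["regions"]
    if key == "percentage":
        # Users whose bucket falls below the threshold are in the rollout.
        return bucket(flag_name, user_id) < value["percent"]
    return False


def is_enabled(flag: dict, rules: list, user_id: str, region: str) -> bool:
    """An inactive flag is off for everyone; otherwise any matching rule enables it."""
    if not flag["is_active"]:
        return False
    if not rules:
        return True  # no targeting rules: the flag applies to all users
    return any(matches(r, flag["name"], user_id, region) for r in rules)
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 10 to 20 keeps the original 10% enabled and adds new users, rather than reshuffling everyone.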

Real-Time Flag Updates (Cache):
- Redis stores the most recent flag state for fast retrieval and avoids hitting the database on each request.
Redis keys might look like:
- feature_flag:{flag_id}: Stores the latest state of the flag (enabled/disabled).
- feature_flags_all: Stores a cached list of all active flags.
Message Queue (Kafka/RabbitMQ):
- When a feature flag is updated, we publish an event to the message queue. This ensures all microservices (including the API and caching layer) stay in sync with the latest flag state.
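
The read path implied by the cache keys above can be sketched as a cache-aside lookup with a database fallback. This is a rough illustration only: InMemoryCache is a hypothetical stand-in for Redis so the example runs without a server, and load_from_db represents whatever query reads the flag from PostgreSQL.

```python
import json


class CacheUnavailable(Exception):
    """Raised by the stand-in cache when it cannot be reached."""


class InMemoryCache:
    """Hypothetical stand-in for Redis, with a switch to simulate an outage."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable(key)
        self.data[key] = value


def get_flag(flag_id, cache, load_from_db):
    """Return the flag from the cache, falling back to the database."""
    key = f"feature_flag:{flag_id}"
    try:
        cached = cache.get(key)
    except CacheUnavailable:
        # Cache is down: serve directly from the database.
        return load_from_db(flag_id)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: read from the database and populate the cache.
    flag = load_from_db(flag_id)
    if flag is not None:
        try:
            cache.set(key, json.dumps(flag))
        except CacheUnavailable:
            pass  # a failed cache write must not fail the read
    return flag
```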

API Layer (RESTful)

GET /flags:

Fetches all feature flags, with optional filtering (e.g., by type or environment). This will be backed by Redis to cache results.

Example response:

[
  { "flag_id": 1, "name": "NewFeatureX", "is_active": true, "description": "Enable new feature" },
  { "flag_id": 2, "name": "BetaFeature", "is_active": false, "description": "Beta testing feature" }
]

POST /flag:
- Creates a new feature flag with configuration, type, and targeting rules.
- Writes the flag data to PostgreSQL and updates the Redis cache.
- Publishes a message to Kafka to notify clients and services.
PUT /flag/{flagId}:
- Updates an existing feature flag's configuration or targeting rules.
- Updates both PostgreSQL and Redis.
- Publishes a message to Kafka to notify clients and services.
DELETE /flag/{flagId}:
- Deletes a feature flag.
- Updates the database and Redis cache, and publishes the change via Kafka.
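
The write endpoints above share one sequence: persist the change, refresh the cache, then publish an event. A minimal sketch of that ordering for an update follows. The db and cache dictionaries are stand-ins for PostgreSQL and Redis, publish is any callable that sends a message, and the "flag-changes" topic name is a hypothetical choice.

```python
import json
from typing import Callable


def update_flag(flag_id: int, changes: dict, db: dict, cache: dict,
                publish: Callable[[str, str], None]) -> dict:
    """Apply changes to a flag: database first, then cache, then event."""
    if flag_id not in db:
        raise KeyError(f"unknown flag {flag_id}")
    updated = {**db[flag_id], **changes}

    # 1. Durable write: the database is the source of truth.
    db[flag_id] = updated

    # 2. Refresh the per-flag cache entry and drop the cached list,
    #    which is rebuilt on the next read.
    cache[f"feature_flag:{flag_id}"] = json.dumps(updated)
    cache.pop("feature_flags_all", None)

    # 3. Notify other services and connected clients.
    publish("flag-changes", json.dumps({"type": "updated", "flag": updated}))
    return updated
```

Writing to the database before touching the cache means a crash between steps leaves the cache stale rather than ahead of the source of truth, which eventual consistency can tolerate.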

Real-Time Updates (WebSockets/SSE)

WebSockets/SSE:
- The API provides an endpoint that clients connect to for real-time updates.
- When a feature flag is changed (enabled, disabled, or targeting rules are updated), a WebSocket event or SSE notification is sent to the client.
- Clients receive updates instantly and are able to refresh their flag state accordingly.
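
For the SSE option, each notification is a small text frame on a long-lived HTTP response. The sketch below formats a flag change in the standard SSE wire format (id, event, and data fields, terminated by a blank line). The event name "flag_updated" and the payload fields are assumptions for illustration.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events message."""
    lines = []
    if event_id is not None:
        # Clients send the last seen id back when they reconnect.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits no raw newlines, so the payload fits on one data line.
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"
```

Including an id lets a reconnecting client report the last event it saw, so the server can replay anything it missed.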

Consistency and Availability

Eventual Consistency: Since feature flags are read-heavy, we can afford eventual consistency for non-critical flags (those that don’t need real-time updates). The message queue (e.g., Kafka) ensures that all parts of the system (DB, cache, SDKs) stay in sync over time.
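Failover and Replication: PostgreSQL read replicas serve reads while the primary handles writes, and if Redis is unavailable the API falls back to reading from PostgreSQL.
Scalability: The API layer is stateless, so it scales horizontally by running more replicas on the Kubernetes cluster.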
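
One way to make eventual consistency safe when events arrive late or out of order is for each SDK to keep a local store that ignores stale updates. The sketch below assumes every change event carries a monotonically increasing version number per flag. That field does not exist in the schema above and is introduced here purely for illustration.

```python
class LocalFlagStore:
    """In-memory flag state inside an SDK, updated from change events."""

    def __init__(self):
        self._flags = {}  # flag_id -> (version, is_active)

    def apply(self, event: dict) -> bool:
        """Apply an event unless an equal or newer version is already stored."""
        flag_id, version = event["flag_id"], event["version"]
        current = self._flags.get(flag_id)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate event: ignore it
        self._flags[flag_id] = (version, event["is_active"])
        return True

    def is_active(self, flag_id: int, default: bool = False) -> bool:
        """Evaluate locally, with no network call on the request path."""
        entry = self._flags.get(flag_id)
        return default if entry is None else entry[1]
```

Since evaluation reads only local memory, this design also fits the low-latency requirement: the network is involved only when a flag changes.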

Audit Logs:

Every change to the flags will be logged in the feature_flag_changes table. This provides transparency and traceability of flag modifications (who made the change, when, and why).
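
For the audit trail to be trustworthy, the flag update and its log row should commit together or not at all. The sketch below shows that with a single transaction. It uses Python's built-in sqlite3 module and a simplified copy of the tables as a stand-in for PostgreSQL so that it runs without a server; the function names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified versions of the tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature_flags (
            flag_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            is_active INTEGER DEFAULT 1
        );
        CREATE TABLE feature_flag_changes (
            change_id INTEGER PRIMARY KEY,
            flag_id INTEGER REFERENCES feature_flags(flag_id),
            user_id INTEGER NOT NULL,
            change_type TEXT NOT NULL,
            change_description TEXT,
            timestamp TEXT DEFAULT CURRENT_TIMESTAMP
        );
        INSERT INTO feature_flags (flag_id, name, is_active)
        VALUES (1, 'NewFeatureX', 1);
    """)
    return conn


def set_flag_active(conn, flag_id, is_active, user_id, reason):
    """Toggle a flag and record who did it, in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE feature_flags SET is_active = ? WHERE flag_id = ?",
            (int(is_active), flag_id),
        )
        if cur.rowcount == 0:
            raise LookupError(f"unknown flag {flag_id}")
        conn.execute(
            "INSERT INTO feature_flag_changes "
            "(flag_id, user_id, change_type, change_description) "
            "VALUES (?, ?, 'updated', ?)",
            (flag_id, user_id, reason),
        )
```

A call such as set_flag_active(conn, 1, False, user_id=7, reason="kill switch") then leaves exactly one new row in feature_flag_changes, while a call for an unknown flag leaves both tables untouched.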
