Rate Limiting — Architecture

Audience: Architects, tech leads, senior engineers.


Context and Purpose

Swisper faces three distinct abuse vectors: API request flooding, LLM token exhaustion, and brute-force login attempts. Each requires a different rate limiting strategy with different storage characteristics.

The architectural goals are:

  • Low-latency enforcement — Redis-based limiters add sub-millisecond overhead to the request path
  • Accurate token budgets — PostgreSQL-based token limiting uses the actual persisted usage data, avoiding Redis TTL drift
  • Fail-open resilience — Infrastructure failures must not block legitimate users
  • Runtime configurability — Token limits should be adjustable without redeployment

Architecture Overview

graph TD
    subgraph Request["Incoming Request"]
        REQ["HTTP Request"]
    end

    subgraph Auth["Auth Rate Limiting"]
        MW["AuthRateLimitMiddleware"]
        ERL["EndpointRateLimiter"]
        REDIS1["Redis Sorted Set"]
    end

    subgraph Token["Token Rate Limiting"]
        TRL["TokenRateLimiter"]
        PG["PostgreSQL: token_usages + background_job_token_usages"]
        CFG["Config Service: configuration table"]
    end

    subgraph Login["Login Rate Limiting"]
        LRL["LoginRateLimiter"]
        REDIS2["Redis Sorted Set"]
    end

    REQ -->|All authenticated requests| MW --> ERL --> REDIS1
    REQ -->|Chat/message| TRL --> PG
    TRL --> CFG
    REQ -->|Login| LRL --> REDIS2

Component Responsibilities

| Component | File | Responsibility |
| --- | --- | --- |
| EndpointRateLimiter | services/endpoint_rate_limiter.py | Sliding window implementation over Redis sorted sets. Tracks by endpoint_rate_limit:{endpoint}:user:{user_id} and endpoint_rate_limit:{endpoint}:ip:{ip_address} |
| AuthRateLimitMiddleware | api/middleware/auth_rate_limit_middleware.py | FastAPI middleware: extracts the user from the Bearer token and applies EndpointRateLimiter with endpoint name auth. Returns 429 with Retry-After on rejection |
| TokenRateLimiter | services/token_rate_limiter.py | Sliding window over PostgreSQL. Sums total_tokens from token_usages and background_job_token_usages within the window. Reads limits from Config Service |
| LoginRateLimiter | services/login_rate_limiter.py | Sliding window over Redis sorted sets. Tracks by login_rate_limit:email:{email} and login_rate_limit:ip:{ip}. Singleton via get_login_rate_limiter() |

Algorithms

Sliding Window (Redis — Endpoint & Login)

All Redis-based limiters use the same algorithm with Redis sorted sets:

  1. Clean expired entries: ZREMRANGEBYSCORE key -inf window_start removes entries older than the window
  2. Count current entries: ZCARD key returns the count of entries within the window
  3. Check limit: If count >= max_requests, reject with 429
  4. Record request: ZADD key timestamp member adds the current request
  5. Set expiry: EXPIRE key (window_seconds + 60) ensures cleanup of idle keys

Member format: {timestamp}:{identifier} (ensures uniqueness within the sorted set).
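The five steps above can be sketched as follows. This is a minimal illustration, not the actual EndpointRateLimiter code: the function name allow_request and the InMemorySortedSets class are hypothetical. The stand-in store exists only so the sketch runs without a Redis server, and its method names are modelled on the redis-py client calls of the same name.

```python
import time


class InMemorySortedSets:
    """Hypothetical stand-in for the four Redis commands used below."""

    def __init__(self):
        self._data = {}  # key -> {member: score}

    def zremrangebyscore(self, key, min_score, max_score):
        zset = self._data.get(key, {})
        expired = [m for m, s in zset.items() if min_score <= s <= max_score]
        for member in expired:
            del zset[member]
        return len(expired)

    def zcard(self, key):
        return len(self._data.get(key, {}))

    def zadd(self, key, mapping):
        self._data.setdefault(key, {}).update(mapping)

    def expire(self, key, seconds):
        pass  # idle-key expiry is not simulated in this stand-in


def allow_request(store, key, identifier, max_requests, window_seconds, now=None):
    """Return True if the request is allowed, False if it should get a 429."""
    now = time.time() if now is None else now
    window_start = now - window_seconds

    # 1. Clean expired entries (scores at or below window_start).
    store.zremrangebyscore(key, float("-inf"), window_start)
    # 2 + 3. Count what is left and compare against the limit.
    if store.zcard(key) >= max_requests:
        return False
    # 4. Record this request; member format is {timestamp}:{identifier}.
    store.zadd(key, {f"{now}:{identifier}": now})
    # 5. Let idle keys expire shortly after the window has passed.
    store.expire(key, window_seconds + 60)
    return True
```

As written, the steps run as separate calls. Whether the real limiters group them into a pipeline or script is not stated in this document.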

Sliding Window (PostgreSQL — Token)

  1. Query usage: SELECT SUM(total_tokens) FROM token_usages WHERE user_id = ? AND created_at > (NOW() - window_hours), then run the same query against background_job_token_usages and add the two sums
  2. Apply burst: Effective limit = max_tokens × burst_allowance (e.g., 1,500,000 × 1.10 = 1,650,000)
  3. Check limit: If total_usage >= effective_limit, reject
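A sketch of the token check, under stated assumptions: SQLite stands in for PostgreSQL so the example runs anywhere, created_at is stored as epoch seconds, and the names effective_limit, USAGE_SQL, and is_within_budget are hypothetical. Only the table names, the total_tokens column, and the limit arithmetic come from the description above.

```python
import sqlite3


def effective_limit(max_tokens, burst_allowance):
    # round() guards against float artifacts: 1_500_000 * 1.10 evaluates to
    # slightly more than 1_650_000 in binary floating point.
    return round(max_tokens * burst_allowance)


# UNION ALL (not UNION) so that identical usage rows are not collapsed.
USAGE_SQL = """
SELECT COALESCE(SUM(total_tokens), 0) FROM (
    SELECT total_tokens FROM token_usages
     WHERE user_id = :user_id AND created_at > :window_start
    UNION ALL
    SELECT total_tokens FROM background_job_token_usages
     WHERE user_id = :user_id AND created_at > :window_start
) AS combined
"""


def is_within_budget(conn, user_id, now, window_hours, max_tokens, burst_allowance):
    """Return True if the user may spend more tokens, False to reject."""
    window_start = now - window_hours * 3600
    params = {"user_id": user_id, "window_start": window_start}
    (used,) = conn.execute(USAGE_SQL, params).fetchone()
    # Reject when total_usage >= effective_limit.
    return used < effective_limit(max_tokens, burst_allowance)
```

With the default values, a user who has consumed exactly 1,650,000 tokens in the window is rejected, because the comparison is inclusive.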

Configuration

Auth Endpoint Rate Limiter

Configured via environment variables (read from settings):

| Setting | Default | Description |
| --- | --- | --- |
| AUTH_RATE_LIMIT_USER_MAX | 100 | Max requests per user per window |
| AUTH_RATE_LIMIT_IP_MAX | 80 | Max requests per IP per window |
| AUTH_RATE_LIMIT_WINDOW_SECONDS | 20 | Sliding window size |
| AUTH_RATE_LIMIT_ENABLED | True | Global enable/disable |
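One way such settings could be read from the environment is sketched below. The class and function names are hypothetical, and the boolean parsing rule (1, true, or yes means enabled) is an assumption. Only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AuthRateLimitSettings:
    user_max: int = 100
    ip_max: int = 80
    window_seconds: int = 20
    enabled: bool = True


def load_auth_rate_limit_settings(env: Mapping[str, str] = os.environ) -> AuthRateLimitSettings:
    """Build settings from environment variables, falling back to the defaults."""
    defaults = AuthRateLimitSettings()
    enabled_raw = env.get("AUTH_RATE_LIMIT_ENABLED")
    if enabled_raw is None:
        enabled = defaults.enabled
    else:
        # Assumed parsing rule; the real settings loader may differ.
        enabled = enabled_raw.strip().lower() in {"1", "true", "yes"}
    return AuthRateLimitSettings(
        user_max=int(env.get("AUTH_RATE_LIMIT_USER_MAX", defaults.user_max)),
        ip_max=int(env.get("AUTH_RATE_LIMIT_IP_MAX", defaults.ip_max)),
        window_seconds=int(env.get("AUTH_RATE_LIMIT_WINDOW_SECONDS", defaults.window_seconds)),
        enabled=enabled,
    )
```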

Token Rate Limiter

Configured via the Config Service (database configuration table), adjustable at runtime:

| Config Key | Default Value | Description |
| --- | --- | --- |
| TOKEN_LIMIT_ENABLED | true | Global enable/disable |
| TOKEN_LIMIT_WINDOW_HOURS | 3 | Sliding window size in hours |
| TOKEN_LIMIT_MAX_TOKENS | 1,500,000 | Max tokens per user per window |
| TOKEN_LIMIT_BURST_ALLOWANCE | 1.10 | Multiplier for burst (10% over limit) |

Login Rate Limiter

Hardcoded in get_login_rate_limiter() singleton:

| Parameter | Value | Description |
| --- | --- | --- |
| email_max_attempts | 5 | Max attempts per email per window |
| ip_max_attempts | 30 | Max attempts per IP per window |
| window_seconds | 900 (15 min) | Sliding window size |

Redis Key Patterns

| Limiter | Key Pattern | Example |
| --- | --- | --- |
| Auth endpoint (user) | endpoint_rate_limit:{endpoint}:user:{user_id} | endpoint_rate_limit:auth:user:550e8400-... |
| Auth endpoint (IP) | endpoint_rate_limit:{endpoint}:ip:{ip_address} | endpoint_rate_limit:auth:ip:192.168.1.1 |
| Login (email) | login_rate_limit:email:{email_lower} | login_rate_limit:email:alice@example.com |
| Login (IP) | login_rate_limit:ip:{ip_address} | login_rate_limit:ip:10.0.0.1 |
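The key patterns can be expressed as small builder functions. The function names here are hypothetical. The only behaviour beyond string formatting is lowercasing the email, which the {email_lower} placeholder implies.

```python
def endpoint_user_key(endpoint: str, user_id: str) -> str:
    return f"endpoint_rate_limit:{endpoint}:user:{user_id}"


def endpoint_ip_key(endpoint: str, ip_address: str) -> str:
    return f"endpoint_rate_limit:{endpoint}:ip:{ip_address}"


def login_email_key(email: str) -> str:
    # Lowercase so that Alice@Example.com and alice@example.com share a counter.
    return f"login_rate_limit:email:{email.lower()}"


def login_ip_key(ip_address: str) -> str:
    return f"login_rate_limit:ip:{ip_address}"
```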

Key Design Decisions

Decision: Three independent limiters instead of a unified system

  • Chosen: Separate EndpointRateLimiter, TokenRateLimiter, and LoginRateLimiter with different storage backends
  • Rejected: Unified rate limiter with configurable dimensions
  • Rationale: Each limiter protects against a different threat with different characteristics. Endpoint and login limiting require sub-millisecond latency (Redis). Token limiting requires accurate historical totals (PostgreSQL). Combining them would force compromises on storage or accuracy.

Decision: Redis sorted sets for sliding window

  • Chosen: ZREMRANGEBYSCORE + ZCARD pattern for precise sliding window counting
  • Rejected: Fixed window counters (INCR with TTL), token bucket algorithm
  • Rationale: Sorted sets give exact sliding window semantics without the burst-at-boundary problem of fixed windows. The storage overhead is small (one entry per request within the window).
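The burst-at-boundary problem can be shown with a small simulation. Both functions below are illustrative models, not the production limiters. They count how many requests from a list of timestamps each strategy would admit.

```python
def fixed_window_allowed(timestamps, limit, window):
    """Fixed window: one counter per aligned bucket of `window` seconds."""
    counts = {}
    allowed = 0
    for t in timestamps:
        bucket = int(t // window)
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            allowed += 1
    return allowed


def sliding_window_allowed(timestamps, limit, window):
    """Sliding window: count accepted requests in the last `window` seconds."""
    accepted = []
    allowed = 0
    for t in timestamps:
        # Drop entries at or before the window start.
        accepted = [a for a in accepted if a > t - window]
        if len(accepted) < limit:
            accepted.append(t)
            allowed += 1
    return allowed


# Two bursts of five requests that straddle the boundary at t = 20 s.
burst = [19.0, 19.1, 19.2, 19.3, 19.4, 20.0, 20.1, 20.2, 20.3, 20.4]
```

With a limit of 5 per 20-second window, the fixed window admits all ten requests within about 1.4 seconds because the second burst lands in a fresh bucket. The sliding window admits only the first five.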

Decision: Fail-open on all limiters

  • Chosen: Catch storage exceptions and allow the request through
  • Rejected: Fail-closed (block request on storage failure)
  • Rationale: Rate limiting is a protective measure, not a security boundary. Blocking all users because a storage backend (Redis or PostgreSQL) is temporarily unavailable would cause a worse outage than allowing a few extra requests through.
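A minimal sketch of the fail-open pattern, assuming the limiter check is passed in as a callable. The function name check_fail_open and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("rate_limit")


def check_fail_open(check: Callable[[], bool]) -> bool:
    """Run a limiter check; if the storage backend raises, allow the request."""
    try:
        return check()
    except Exception:
        # Log the failure so the outage is visible, then let the request through.
        logger.warning("rate limiter backend unavailable; allowing request", exc_info=True)
        return True
```

A normal rejection (the check returning False) still blocks the request. Only an exception from the backend is turned into an allow.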

Decision: Login counter cleared on success

  • Chosen: Successful login clears the email-based counter (but not IP-based)
  • Rejected: Counter persists until window expires
  • Rationale: Clearing on success prevents legitimate users from being locked out after a few typos followed by a correct password. The IP counter remains to protect against credential-stuffing attacks that cycle through emails.
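The clear-on-success behaviour can be sketched with an in-memory tracker. The class and method names are hypothetical, a dictionary stands in for Redis, and recording every attempt under both keys is an assumption based on the sliding window steps described earlier.

```python
from collections import defaultdict


class LoginAttemptTracker:
    """Hypothetical in-memory model of the email and IP login counters."""

    def __init__(self, email_max_attempts=5, ip_max_attempts=30, window_seconds=900):
        self.email_max_attempts = email_max_attempts
        self.ip_max_attempts = ip_max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(list)  # key -> attempt timestamps

    @staticmethod
    def _email_key(email):
        return f"login_rate_limit:email:{email.lower()}"

    @staticmethod
    def _ip_key(ip):
        return f"login_rate_limit:ip:{ip}"

    def count(self, key, now):
        # Keep only attempts inside the sliding window.
        recent = [t for t in self._attempts[key] if t > now - self.window_seconds]
        self._attempts[key] = recent
        return len(recent)

    def is_allowed(self, email, ip, now):
        return (self.count(self._email_key(email), now) < self.email_max_attempts
                and self.count(self._ip_key(ip), now) < self.ip_max_attempts)

    def record_attempt(self, email, ip, now):
        self._attempts[self._email_key(email)].append(now)
        self._attempts[self._ip_key(ip)].append(now)

    def record_success(self, email):
        # Clear only the email counter; the IP counter keeps accumulating.
        self._attempts.pop(self._email_key(email), None)
```

After a successful login the email counter starts from zero again, while the attempts from that IP still count toward ip_max_attempts.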

Integration Points

| Integration | Direction | Purpose |
| --- | --- | --- |
| AuthRateLimitMiddleware → EndpointRateLimiter | Inbound | Middleware extracts user/IP, calls limiter for every authenticated request |
| chats.py → TokenRateLimiter | Inbound | Chat creation and message posting check token budget before processing |
| streaming.py → TokenRateLimiter | Outbound | Streaming response includes rate limit status in final chunk |
| login.py → LoginRateLimiter | Inbound | Login endpoint checks rate limit before credential validation |
| TokenRateLimiter → token_usages | Query | Reads persisted token usage from the Token Analytics tables |
| TokenRateLimiter → Config Service | Query | Reads current limit configuration from the configuration table |

Known Trade-offs and Debt

| Item | Impact | Remediation |
| --- | --- | --- |
| No per-endpoint differentiation | All authenticated API requests share the same auth bucket. A burst of lightweight GETs can exhaust the limit, blocking expensive POSTs | Add per-route buckets for critical endpoints (e.g., /chats, /messages) |
| Token budget includes background jobs | Background job token usage reduces a user's interactive budget for the same window | Separate interactive and background job budgets, or weight background tokens differently |
| Login limits are hardcoded | Changing login rate limits requires a code change and redeployment | Move login limits to the Config Service or environment variables |
| No rate limit headers on non-429 responses | Clients cannot see their remaining budget until they hit the limit | Add X-RateLimit-Remaining and X-RateLimit-Reset headers to all responses |