Rate Limiting: Architecture

Audience: Architects, tech leads, senior engineers.

Context and Purpose

Swisper faces three distinct abuse vectors: API request flooding, LLM token exhaustion, and brute-force login attempts. Each requires a different rate limiting strategy with different storage characteristics.

The architectural goals are:

Low-latency enforcement — Redis-based limiters add sub-millisecond overhead to the request path
Accurate token budgets — PostgreSQL-based token limiting uses the actual persisted usage data, avoiding Redis TTL drift
Fail-open resilience — Infrastructure failures must not block legitimate users
Runtime configurability — Token limits should be adjustable without redeployment

Architecture Overview

graph TD
    subgraph Request["Incoming Request"]
        REQ["HTTP Request"]
    end

    subgraph Auth["Auth Rate Limiting"]
        MW["AuthRateLimitMiddleware"]
        ERL["EndpointRateLimiter"]
        REDIS1["Redis Sorted Set"]
    end

    subgraph Token["Token Rate Limiting"]
        TRL["TokenRateLimiter"]
        PG["PostgreSQL: token_usages + background_job_token_usages"]
        CFG["Config Service: configuration table"]
    end

    subgraph Login["Login Rate Limiting"]
        LRL["LoginRateLimiter"]
        REDIS2["Redis Sorted Set"]
    end

    REQ -->|All auth requests| MW --> ERL --> REDIS1
    REQ -->|Chat/message| TRL --> PG
    TRL --> CFG
    REQ -->|Login| LRL --> REDIS2

Component Responsibilities

Component	File	Responsibility
EndpointRateLimiter	`services/endpoint_rate_limiter.py`	Sliding window implementation over Redis sorted sets. Tracks by `{endpoint}:user:{user_id}` and `{endpoint}:ip:{ip_address}`
AuthRateLimitMiddleware	`api/middleware/auth_rate_limit_middleware.py`	FastAPI middleware: extracts user from Bearer token, applies `EndpointRateLimiter` with endpoint name `auth`. Returns 429 with `Retry-After` on reject
TokenRateLimiter	`services/token_rate_limiter.py`	Sliding window over PostgreSQL. Sums `total_tokens` from `token_usages` and `background_job_token_usages` within the window. Reads limits from Config Service
LoginRateLimiter	`services/login_rate_limiter.py`	Sliding window over Redis sorted sets. Tracks by `login_rate_limit:email:{email}` and `login_rate_limit:ip:{ip}`. Singleton via `get_login_rate_limiter()`

Algorithms

Sliding Window (Redis: Endpoint and Login)

All Redis-based limiters use the same algorithm over Redis sorted sets:

1. Clean expired entries: `ZREMRANGEBYSCORE key -inf window_start` removes entries older than the window
2. Count current entries: `ZCARD key` returns the count of entries within the window
3. Check limit: If `count >= max_requests`, reject with 429
4. Record request: `ZADD key timestamp member` adds the current request
5. Set expiry: `EXPIRE key (window_seconds + 60)` ensures cleanup of idle keys

Member format: `{timestamp}:{identifier}` (ensures uniqueness within the sorted set).
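The five steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's `EndpointRateLimiter` or `LoginRateLimiter`: `InMemorySortedSets` is a hypothetical stand-in that mimics only the four sorted-set commands named above so the example runs without a Redis server, and `is_allowed` and its parameters are illustrative names.

```python
import time
from typing import Dict, Optional


class InMemorySortedSets:
    """Hypothetical stand-in for the four Redis sorted-set commands used below."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zremrangebyscore(self, key: str, min_score: float, max_score: float) -> int:
        members = self._sets.get(key, {})
        stale = [m for m, score in members.items() if min_score <= score <= max_score]
        for member in stale:
            del members[member]
        return len(stale)

    def zcard(self, key: str) -> int:
        return len(self._sets.get(key, {}))

    def zadd(self, key: str, mapping: Dict[str, float]) -> int:
        members = self._sets.setdefault(key, {})
        added = sum(1 for member in mapping if member not in members)
        members.update(mapping)
        return added

    def expire(self, key: str, seconds: int) -> bool:
        # Key expiry is not simulated; a real Redis would drop idle keys here.
        return key in self._sets


def is_allowed(store, key: str, identifier: str, max_requests: int,
               window_seconds: int, now: Optional[float] = None) -> bool:
    """Return True if the request fits in the sliding window, else False (429)."""
    now = time.time() if now is None else now
    window_start = now - window_seconds

    # 1. Clean expired entries (ZREMRANGEBYSCORE key -inf window_start).
    store.zremrangebyscore(key, float("-inf"), window_start)
    # 2-3. Count current entries (ZCARD) and check the limit.
    if store.zcard(key) >= max_requests:
        return False
    # 4. Record the request (ZADD) with member "{timestamp}:{identifier}".
    store.zadd(key, {f"{now}:{identifier}": now})
    # 5. Refresh the idle-key expiry (EXPIRE key window_seconds + 60).
    store.expire(key, window_seconds + 60)
    return True
```

As written, the count and the insert are separate calls, so two concurrent requests could both pass the check. Whether the real limiters guard against this (for example with a pipeline or a script) is not stated in this document.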

Sliding Window (PostgreSQL: Token)

1. Query usage: `SELECT SUM(total_tokens) FROM token_usages WHERE user_id = ? AND created_at > NOW() - window_hours`, combined with the same query on `background_job_token_usages`
2. Apply burst: Effective limit = `max_tokens × burst_allowance` (e.g., 1,500,000 × 1.10 = 1,650,000)
3. Check limit: If `total_usage >= effective_limit`, reject
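The query, burst, and check steps above can be illustrated with a small runnable sketch. It makes several assumptions: SQLite stands in for PostgreSQL so the example needs no server, the two tables are reduced to the three columns the text mentions, `created_at` is stored as an ISO 8601 string, and `check_token_budget` is a hypothetical name rather than the `TokenRateLimiter` API. The defaults mirror the documented configuration values.

```python
import sqlite3
from datetime import datetime, timedelta

# UNION ALL keeps duplicate rows; a plain UNION would drop rows that
# happen to carry identical token counts and undercount usage.
USAGE_SQL = """
SELECT COALESCE(SUM(total_tokens), 0) FROM (
    SELECT total_tokens FROM token_usages
     WHERE user_id = :user_id AND created_at > :window_start
    UNION ALL
    SELECT total_tokens FROM background_job_token_usages
     WHERE user_id = :user_id AND created_at > :window_start
) AS combined
"""


def check_token_budget(conn: sqlite3.Connection, user_id: str, now: datetime,
                       window_hours: int = 3, max_tokens: int = 1_500_000,
                       burst_allowance: float = 1.10):
    """Return (allowed, total_usage, effective_limit) for one user."""
    window_start = (now - timedelta(hours=window_hours)).isoformat()

    # 1. Query usage across both tables within the window.
    row = conn.execute(
        USAGE_SQL, {"user_id": user_id, "window_start": window_start}
    ).fetchone()
    total_usage = row[0]

    # 2. Apply burst. round() avoids float noise in 1_500_000 * 1.10.
    effective_limit = round(max_tokens * burst_allowance)

    # 3. Check limit: reject once usage reaches the effective limit.
    allowed = total_usage < effective_limit
    return allowed, total_usage, effective_limit
```

The comparison is strict, matching the rule that a request is rejected when `total_usage >= effective_limit`.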

Configuration

Auth Endpoint Rate Limiter

Configured via environment variables (read from settings):

Setting	Default	Description
`AUTH_RATE_LIMIT_USER_MAX`	100	Max requests per user per window
`AUTH_RATE_LIMIT_IP_MAX`	80	Max requests per IP per window
`AUTH_RATE_LIMIT_WINDOW_SECONDS`	20	Sliding window size
`AUTH_RATE_LIMIT_ENABLED`	True	Global enable/disable
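One way to load these four settings is sketched below. The variable names and defaults come from the table above. The `AuthRateLimitSettings` class, its `from_env` helper, and the set of strings accepted as "true" are assumptions for illustration, since this document does not show how the project's settings object parses them.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AuthRateLimitSettings:
    """Hypothetical container for the auth limiter settings."""

    user_max: int = 100
    ip_max: int = 80
    window_seconds: int = 20
    enabled: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "AuthRateLimitSettings":
        def as_int(name: str, default: int) -> int:
            return int(env.get(name, default))

        def as_bool(name: str, default: bool) -> bool:
            raw = env.get(name)
            if raw is None:
                return default
            # Assumed truthy spellings; anything else disables the limiter.
            return raw.strip().lower() in {"1", "true", "yes", "on"}

        return cls(
            user_max=as_int("AUTH_RATE_LIMIT_USER_MAX", 100),
            ip_max=as_int("AUTH_RATE_LIMIT_IP_MAX", 80),
            window_seconds=as_int("AUTH_RATE_LIMIT_WINDOW_SECONDS", 20),
            enabled=as_bool("AUTH_RATE_LIMIT_ENABLED", True),
        )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests.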

Token Rate Limiter

Configured via the Config Service (database configuration table), adjustable at runtime:

Config Key	Default Value	Description
`TOKEN_LIMIT_ENABLED`	`true`	Global enable/disable
`TOKEN_LIMIT_WINDOW_HOURS`	`3`	Sliding window size in hours
`TOKEN_LIMIT_MAX_TOKENS`	`1,500,000`	Max tokens per user per window
`TOKEN_LIMIT_BURST_ALLOWANCE`	`1.10`	Multiplier for burst (10% over limit)

Login Rate Limiter

Hardcoded in the `get_login_rate_limiter()` singleton:

Parameter	Value	Description
`email_max_attempts`	5	Max attempts per email per window
`ip_max_attempts`	30	Max attempts per IP per window
`window_seconds`	900 (15 min)	Sliding window size

Redis Key Patterns

Limiter	Key Pattern	Example
Auth endpoint (user)	`endpoint_rate_limit:{endpoint}:user:{user_id}`	`endpoint_rate_limit:auth:user:550e8400-...`
Auth endpoint (IP)	`endpoint_rate_limit:{endpoint}:ip:{ip_address}`	`endpoint_rate_limit:auth:ip:192.168.1.1`
Login (email)	`login_rate_limit:email:{email_lower}`	`login_rate_limit:email:alice@example.com`
Login (IP)	`login_rate_limit:ip:{ip_address}`	`login_rate_limit:ip:10.0.0.1`
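The key patterns in the table can be expressed as small builder functions. The function names are hypothetical; only the resulting key strings are taken from the table, including lowercasing the email for the login key.

```python
def endpoint_user_key(endpoint: str, user_id: str) -> str:
    """Key for the per-user bucket of an endpoint limiter."""
    return f"endpoint_rate_limit:{endpoint}:user:{user_id}"


def endpoint_ip_key(endpoint: str, ip_address: str) -> str:
    """Key for the per-IP bucket of an endpoint limiter."""
    return f"endpoint_rate_limit:{endpoint}:ip:{ip_address}"


def login_email_key(email: str) -> str:
    """Key for the per-email login bucket; the email is lowercased."""
    return f"login_rate_limit:email:{email.lower()}"


def login_ip_key(ip_address: str) -> str:
    """Key for the per-IP login bucket."""
    return f"login_rate_limit:ip:{ip_address}"
```

Lowercasing means that `Alice@Example.com` and `alice@example.com` share one counter, so varying the case of an address does not reset the attempt count.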

Key Design Decisions

Decision: Three independent limiters instead of a unified system

Chosen: Separate EndpointRateLimiter, TokenRateLimiter, and LoginRateLimiter with different storage backends
Rejected: Unified rate limiter with configurable dimensions
Rationale: Each limiter protects against a different threat with different characteristics. Endpoint and login limiting require sub-millisecond latency (Redis). Token limiting requires accurate historical totals (PostgreSQL). Combining them would force compromises on storage or accuracy.

Decision: Redis sorted sets for sliding window

Chosen: ZREMRANGEBYSCORE + ZCARD pattern for precise sliding window counting
Rejected: Fixed window counters (INCR with TTL), token bucket algorithm
Rationale: Sorted sets give exact sliding window semantics without the burst-at-boundary problem of fixed windows. The storage overhead is small (one entry per request within the window).

Decision: Fail-open on all limiters

Chosen: Catch storage exceptions and allow the request through
Rejected: Fail-closed (block request on storage failure)
Rationale: Rate limiting is a protective measure, not a security boundary. Blocking all users because Redis is temporarily unavailable would cause a worse outage than allowing a few extra requests through.
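The fail-open behaviour can be sketched as a decorator around any limiter check. This is an illustration under assumptions, not the project's code: `fail_open` and the logger name are hypothetical, and catching every `Exception` is broader than the storage exceptions the decision describes, which a real implementation would likely name explicitly.

```python
import logging
from functools import wraps
from typing import Callable

logger = logging.getLogger("rate_limit")


def fail_open(check: Callable[..., bool]) -> Callable[..., bool]:
    """Wrap a limiter check so that a failing backend allows the request."""

    @wraps(check)
    def wrapper(*args, **kwargs) -> bool:
        try:
            return check(*args, **kwargs)
        except Exception:
            # Log the failure so the outage is visible, then let the request pass.
            logger.exception("Rate limit check failed; allowing request")
            return True

    return wrapper
```

A check that returns `False` still rejects the request; only an exception from the backend turns into an allow.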

Decision: Login counter cleared on success

Chosen: Successful login clears the email-based counter (but not IP-based)
Rejected: Counter persists until window expires
Rationale: Clearing on success prevents legitimate users from being locked out after a few typos followed by a correct password. The IP counter remains to protect against credential-stuffing attacks that cycle through emails.
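The asymmetry between the two counters is shown in the sketch below. It is an in-memory illustration with hypothetical names (`LoginAttemptTracker`, `record_attempt`, `on_success`). The real `LoginRateLimiter` stores attempts in Redis sorted sets and prunes them by window, which is omitted here.

```python
from typing import Dict, List


class LoginAttemptTracker:
    """Hypothetical in-memory model of the email and IP login counters."""

    def __init__(self) -> None:
        self.attempts: Dict[str, List[float]] = {}

    @staticmethod
    def email_key(email: str) -> str:
        return f"login_rate_limit:email:{email.lower()}"

    @staticmethod
    def ip_key(ip: str) -> str:
        return f"login_rate_limit:ip:{ip}"

    def record_attempt(self, email: str, ip: str, now: float) -> None:
        # Every attempt counts against both the email and the IP.
        for key in (self.email_key(email), self.ip_key(ip)):
            self.attempts.setdefault(key, []).append(now)

    def on_success(self, email: str) -> None:
        # Only the email counter is cleared; the IP counter keeps its history.
        self.attempts.pop(self.email_key(email), None)

    def count(self, key: str) -> int:
        return len(self.attempts.get(key, []))
```

After three recorded attempts and one `on_success` call, the email counter is back at zero while the IP counter still holds three entries. An attacker cycling through many emails from one address therefore still runs into the IP limit.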

Integration Points

Integration	Direction	Purpose
`AuthRateLimitMiddleware` → `EndpointRateLimiter`	Inbound	Middleware extracts user/IP, calls limiter for every authenticated request
`chats.py` → `TokenRateLimiter`	Inbound	Chat creation and message posting check token budget before processing
`streaming.py` → `TokenRateLimiter`	Outbound	Streaming response includes rate limit status in final chunk
`login.py` → `LoginRateLimiter`	Inbound	Login endpoint checks rate limit before credential validation
`TokenRateLimiter` → `token_usages`, `background_job_token_usages`	Query	Reads persisted token usage from the Token Analytics tables
`TokenRateLimiter` → Config Service	Query	Reads current limit configuration from the `configuration` table

Known Trade-offs and Debt

Item	Impact	Remediation
No per-endpoint differentiation	All authenticated API requests share the same `auth` bucket. A burst of lightweight GETs can exhaust the limit, blocking expensive POSTs	Add per-route buckets for critical endpoints (e.g., `/chats`, `/messages`)
Token budget includes background jobs	Background job token usage reduces a user's interactive budget for the same window	Separate interactive and background job budgets, or weight background tokens differently
Login limits are hardcoded	Changing login rate limits requires a code change and redeployment	Move login limits to the Config Service or environment variables
No rate limit headers on non-429 responses	Clients cannot see their remaining budget until they hit the limit	Add `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers to all responses