Rate Limiting: Architecture
Audience: Architects, tech leads, senior engineers.
Context and Purpose
Swisper faces three distinct abuse vectors: API request flooding, LLM token exhaustion, and brute-force login attempts. Each requires a different rate limiting strategy with different storage characteristics.
The architectural goals are:
- Low-latency enforcement — Redis-based limiters add sub-millisecond overhead to the request path
- Accurate token budgets — PostgreSQL-based token limiting uses the actual persisted usage data, avoiding Redis TTL drift
- Fail-open resilience — Infrastructure failures must not block legitimate users
- Runtime configurability — Token limits should be adjustable without redeployment
Architecture Overview
graph TD
subgraph Request["Incoming Request"]
REQ["HTTP Request"]
end
subgraph Auth["Auth Rate Limiting"]
MW["AuthRateLimitMiddleware"]
ERL["EndpointRateLimiter"]
REDIS1["Redis Sorted Set"]
end
subgraph Token["Token Rate Limiting"]
TRL["TokenRateLimiter"]
PG["PostgreSQL: token_usages + background_job_token_usages"]
CFG["Config Service: configuration table"]
end
subgraph Login["Login Rate Limiting"]
LRL["LoginRateLimiter"]
REDIS2["Redis Sorted Set"]
end
REQ -->|All auth requests| MW --> ERL --> REDIS1
REQ -->|Chat/message| TRL --> PG
TRL --> CFG
REQ -->|Login| LRL --> REDIS2
Component Responsibilities
| Component | File | Responsibility |
|---|---|---|
| `EndpointRateLimiter` | `services/endpoint_rate_limiter.py` | Sliding window implementation over Redis sorted sets. Tracks by `{endpoint}:user:{user_id}` and `{endpoint}:ip:{ip_address}` |
| `AuthRateLimitMiddleware` | `api/middleware/auth_rate_limit_middleware.py` | FastAPI middleware: extracts user from Bearer token, applies `EndpointRateLimiter` with endpoint name `auth`. Returns 429 with `Retry-After` on reject |
| `TokenRateLimiter` | `services/token_rate_limiter.py` | Sliding window over PostgreSQL. Sums `total_tokens` from `token_usages` and `background_job_token_usages` within the window. Reads limits from Config Service |
| `LoginRateLimiter` | `services/login_rate_limiter.py` | Sliding window over Redis sorted sets. Tracks by `login_rate_limit:email:{email}` and `login_rate_limit:ip:{ip}`. Singleton via `get_login_rate_limiter()` |
Algorithms
Sliding Window (Redis: Endpoint & Login)
All Redis-based limiters use the same algorithm with Redis sorted sets:
- Clean expired entries: `ZREMRANGEBYSCORE key -inf window_start` removes entries older than the window
- Count current entries: `ZCARD key` returns the count of entries within the window
- Check limit: If `count >= max_requests`, reject with 429
- Record request: `ZADD key timestamp member` adds the current request
- Set expiry: `EXPIRE key (window_seconds + 60)` ensures cleanup of idle keys

Member format: `{timestamp}:{identifier}` (ensures uniqueness within the sorted set).
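The five steps above can be sketched as a single function. This is a minimal illustration, not the actual `EndpointRateLimiter` code: the function name `check_and_record` and the `InMemorySortedSets` stand-in are hypothetical, and the sketch assumes a rejected request returns before it is recorded. The four client calls mirror redis-py method names (`zremrangebyscore`, `zcard`, `zadd`, `expire`), so a real Redis client could be passed in place of the stand-in.

```python
import time
from typing import Optional


def check_and_record(client, key: str, identifier: str, max_requests: int,
                     window_seconds: int, now: Optional[float] = None) -> bool:
    """Return True if the request is allowed, False if it should get a 429."""
    now = time.time() if now is None else now
    window_start = now - window_seconds

    # 1. Clean expired entries: drop members scored at or before window_start.
    client.zremrangebyscore(key, "-inf", window_start)
    # 2. Count the entries that remain inside the window.
    count = client.zcard(key)
    # 3. Check the limit (assumption: rejected requests are not recorded).
    if count >= max_requests:
        return False
    # 4. Record the request; the member embeds the timestamp for uniqueness.
    client.zadd(key, {f"{now}:{identifier}": now})
    # 5. Set expiry so idle keys are cleaned up shortly after the window.
    client.expire(key, window_seconds + 60)
    return True


class InMemorySortedSets:
    """Hypothetical stand-in that mimics the four Redis calls used above."""

    def __init__(self):
        self.sets = {}  # key -> {member: score}
        self.ttl = {}   # key -> seconds, recorded only for inspection

    def zremrangebyscore(self, key, min_score, max_score):
        low = float(min_score)  # float("-inf") parses the Redis-style bound
        members = self.sets.get(key, {})
        stale = [m for m, s in members.items() if low <= s <= max_score]
        for member in stale:
            del members[member]
        return len(stale)

    def zcard(self, key):
        return len(self.sets.get(key, {}))

    def zadd(self, key, mapping):
        self.sets.setdefault(key, {}).update(mapping)

    def expire(self, key, seconds):
        self.ttl[key] = seconds
```

With `max_requests=3` and a 20-second window, three calls one second apart are allowed and a fourth is rejected. Once the oldest entries fall out of the window, requests are accepted again.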
Sliding Window (PostgreSQL: Token)
- Query usage: `SELECT SUM(total_tokens) FROM token_usages WHERE user_id = ? AND created_at > (NOW - window_hours)`, unioned with the same query on `background_job_token_usages`
- Apply burst: Effective limit = `max_tokens × burst_allowance` (e.g., 1,500,000 × 1.10 = 1,650,000)
- Check limit: If `total_usage >= effective_limit`, reject
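The query, burst, and check steps can be sketched as follows. This is an illustration under stated assumptions, not the `TokenRateLimiter` implementation: it runs against the standard-library `sqlite3` module as a stand-in for PostgreSQL, the function names `effective_limit` and `is_over_budget` are hypothetical, and only the table and column names come from the description above. `UNION ALL` is used so that two rows with the same token count are both summed, and the effective limit is rounded to a whole token because `1_500_000 * 1.10` is not exactly `1_650_000` in binary floating point.

```python
import sqlite3
from datetime import datetime, timedelta

# Sum usage from both tables inside the window; COALESCE turns "no rows" into 0.
USAGE_QUERY = """
SELECT COALESCE(SUM(total_tokens), 0) FROM (
    SELECT total_tokens FROM token_usages
     WHERE user_id = :user_id AND created_at > :cutoff
    UNION ALL
    SELECT total_tokens FROM background_job_token_usages
     WHERE user_id = :user_id AND created_at > :cutoff
) AS usage
"""


def effective_limit(max_tokens: int, burst_allowance: float) -> int:
    # Round to a whole token so float noise cannot shift the boundary.
    return round(max_tokens * burst_allowance)


def is_over_budget(conn: sqlite3.Connection, user_id: str, now: datetime,
                   window_hours: int = 3, max_tokens: int = 1_500_000,
                   burst_allowance: float = 1.10) -> bool:
    """Return True when the user's usage in the window reaches the limit."""
    # Assumption: created_at is stored as an ISO-8601 string in this sketch.
    cutoff = (now - timedelta(hours=window_hours)).isoformat()
    (total_usage,) = conn.execute(
        USAGE_QUERY, {"user_id": user_id, "cutoff": cutoff}
    ).fetchone()
    return total_usage >= effective_limit(max_tokens, burst_allowance)
```

Computing the cutoff in application code rather than with a database `NOW()` call is a choice made here to keep the sketch deterministic and testable. The real query may do the date arithmetic in SQL.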
Configuration
Auth Endpoint Rate Limiter
Configured via environment variables (read from settings):
| Setting | Default | Description |
|---|---|---|
| `AUTH_RATE_LIMIT_USER_MAX` | 100 | Max requests per user per window |
| `AUTH_RATE_LIMIT_IP_MAX` | 80 | Max requests per IP per window |
| `AUTH_RATE_LIMIT_WINDOW_SECONDS` | 20 | Sliding window size |
| `AUTH_RATE_LIMIT_ENABLED` | True | Global enable/disable |
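As a rough sketch of how these settings could be read with the defaults from the table, the snippet below parses them from an environment mapping. The `AuthRateLimitSettings` class, the `load_auth_rate_limit_settings` function, and the set of accepted truthy strings are assumptions for illustration. The project's actual settings object may be built differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: which strings count as "enabled" is not specified in the source.
TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AuthRateLimitSettings:
    user_max: int = 100        # AUTH_RATE_LIMIT_USER_MAX
    ip_max: int = 80           # AUTH_RATE_LIMIT_IP_MAX
    window_seconds: int = 20   # AUTH_RATE_LIMIT_WINDOW_SECONDS
    enabled: bool = True       # AUTH_RATE_LIMIT_ENABLED


def load_auth_rate_limit_settings(
    env: Mapping[str, str] = os.environ,
) -> AuthRateLimitSettings:
    """Build settings from env vars, falling back to the documented defaults."""
    defaults = AuthRateLimitSettings()
    enabled_raw = env.get("AUTH_RATE_LIMIT_ENABLED")
    return AuthRateLimitSettings(
        user_max=int(env.get("AUTH_RATE_LIMIT_USER_MAX", defaults.user_max)),
        ip_max=int(env.get("AUTH_RATE_LIMIT_IP_MAX", defaults.ip_max)),
        window_seconds=int(
            env.get("AUTH_RATE_LIMIT_WINDOW_SECONDS", defaults.window_seconds)
        ),
        enabled=(
            defaults.enabled
            if enabled_raw is None
            else enabled_raw.strip().lower() in TRUTHY
        ),
    )
```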
Token Rate Limiter
Configured via the Config Service (database `configuration` table), adjustable at runtime:
| Config Key | Default Value | Description |
|---|---|---|
| `TOKEN_LIMIT_ENABLED` | `true` | Global enable/disable |
| `TOKEN_LIMIT_WINDOW_HOURS` | `3` | Sliding window size in hours |
| `TOKEN_LIMIT_MAX_TOKENS` | `1,500,000` | Max tokens per user per window |
| `TOKEN_LIMIT_BURST_ALLOWANCE` | `1.10` | Multiplier for burst (10% over limit) |
Login Rate Limiter
Hardcoded in the `get_login_rate_limiter()` singleton:
| Parameter | Value | Description |
|---|---|---|
| `email_max_attempts` | 5 | Max attempts per email per window |
| `ip_max_attempts` | 30 | Max attempts per IP per window |
| `window_seconds` | 900 (15 min) | Sliding window size |
Redis Key Patterns
| Limiter | Key Pattern | Example |
|---|---|---|
| Auth endpoint (user) | `endpoint_rate_limit:{endpoint}:user:{user_id}` | `endpoint_rate_limit:auth:user:550e8400-...` |
| Auth endpoint (IP) | `endpoint_rate_limit:{endpoint}:ip:{ip_address}` | `endpoint_rate_limit:auth:ip:192.168.1.1` |
| Login (email) | `login_rate_limit:email:{email_lower}` | `login_rate_limit:email:alice@example.com` |
| Login (IP) | `login_rate_limit:ip:{ip_address}` | `login_rate_limit:ip:10.0.0.1` |
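The key patterns above can be expressed as small builder functions. The function names are hypothetical. Lowercasing the email is inferred from the `{email_lower}` placeholder, and any further normalization, such as trimming whitespace, is left out because the source does not mention it.

```python
def endpoint_user_key(endpoint: str, user_id: str) -> str:
    """Per-user bucket for an endpoint group such as "auth"."""
    return f"endpoint_rate_limit:{endpoint}:user:{user_id}"


def endpoint_ip_key(endpoint: str, ip_address: str) -> str:
    """Per-IP bucket for the same endpoint group."""
    return f"endpoint_rate_limit:{endpoint}:ip:{ip_address}"


def login_email_key(email: str) -> str:
    """Per-email login bucket; lowercased so case variants share one counter."""
    return f"login_rate_limit:email:{email.lower()}"


def login_ip_key(ip_address: str) -> str:
    """Per-IP login bucket."""
    return f"login_rate_limit:ip:{ip_address}"
```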
Key Design Decisions
Decision: Three independent limiters instead of a unified system
- Chosen: Separate `EndpointRateLimiter`, `TokenRateLimiter`, and `LoginRateLimiter` with different storage backends
- Rejected: Unified rate limiter with configurable dimensions
- Rationale: Each limiter protects against a different threat with different characteristics. Endpoint and login limiting require sub-millisecond latency (Redis). Token limiting requires accurate historical totals (PostgreSQL). Combining them would force compromises on storage or accuracy.
Decision: Redis sorted sets for sliding window
- Chosen: `ZREMRANGEBYSCORE` + `ZCARD` pattern for precise sliding window counting
- Rejected: Fixed window counters (`INCR` with TTL), token bucket algorithm
- Rationale: Sorted sets give exact sliding window semantics without the burst-at-boundary problem of fixed windows. The storage overhead is small (one entry per request within the window).
Decision: Fail-open on all limiters
- Chosen: Catch storage exceptions and allow the request through
- Rejected: Fail-closed (block request on storage failure)
- Rationale: Rate limiting is a protective measure, not a security boundary. Blocking all users because Redis is temporarily unavailable would cause a worse outage than allowing a few extra requests through.
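A minimal sketch of the fail-open pattern follows. The wrapper name `allow_unless_limited` and the logger name are hypothetical, and the broad `except Exception` is a simplification: real code would more likely catch the storage client's own error types (for example `redis.exceptions.RedisError`).

```python
import logging
from typing import Callable

logger = logging.getLogger("rate_limit")  # hypothetical logger name


def allow_unless_limited(check: Callable[[], bool]) -> bool:
    """Run a limiter check; if the storage backend fails, allow the request.

    `check` returns True when the request is within its limit.
    """
    try:
        return check()
    except Exception:
        # Fail open: an unavailable store must not block legitimate users.
        logger.warning("rate limiter unavailable, failing open", exc_info=True)
        return True
```

Logging the failure matters in this design, since a silently failing limiter would otherwise look identical to a healthy one that never rejects.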
Decision: Login counter cleared on success
- Chosen: Successful login clears the email-based counter (but not IP-based)
- Rejected: Counter persists until window expires
- Rationale: Clearing on success prevents legitimate users from being locked out after a few typos followed by a correct password. The IP counter remains to protect against credential-stuffing attacks that cycle through emails.
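The asymmetric reset can be sketched as below. The function `on_login_success` and the `FakeKeyStore` stand-in are hypothetical. `delete` mirrors the redis-py method of the same name, and the key format follows the Redis Key Patterns table.

```python
def on_login_success(client, email: str) -> None:
    """Clear only the per-email counter after a successful login.

    The per-IP key is deliberately left alone so it keeps counting attempts
    from an address that cycles through many emails.
    """
    client.delete(f"login_rate_limit:email:{email.lower()}")


class FakeKeyStore:
    """Hypothetical in-memory stand-in exposing a Redis-like delete()."""

    def __init__(self, keys):
        self.keys = dict(keys)

    def delete(self, *names):
        # Return the number of keys actually removed, as Redis DEL does.
        return sum(1 for name in names if self.keys.pop(name, None) is not None)
```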
Integration Points
| Integration | Direction | Purpose |
|---|---|---|
| `AuthRateLimitMiddleware` → `EndpointRateLimiter` | Inbound | Middleware extracts user/IP, calls limiter for every authenticated request |
| `chats.py` → `TokenRateLimiter` | Inbound | Chat creation and message posting check token budget before processing |
| `streaming.py` → `TokenRateLimiter` | Outbound | Streaming response includes rate limit status in final chunk |
| `login.py` → `LoginRateLimiter` | Inbound | Login endpoint checks rate limit before credential validation |
| `TokenRateLimiter` → `token_usages` | Query | Reads persisted token usage from the Token Analytics tables |
| `TokenRateLimiter` → Config Service | Query | Reads current limit configuration from the `configuration` table |
Known Trade-offs and Debt
| Item | Impact | Remediation |
|---|---|---|
| No per-endpoint differentiation | All authenticated API requests share the same `auth` bucket. A burst of lightweight GETs can exhaust the limit, blocking expensive POSTs | Add per-route buckets for critical endpoints (e.g., `/chats`, `/messages`) |
| Token budget includes background jobs | Background job token usage reduces a user's interactive budget for the same window | Separate interactive and background job budgets, or weight background tokens differently |
| Login limits are hardcoded | Changing login rate limits requires a code change and redeployment | Move login limits to the Config Service or environment variables |
| No rate limit headers on non-429 responses | Clients cannot see their remaining budget until they hit the limit | Add `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers to all responses |
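For the last remediation, the header values could be derived from data the sliding window already holds. The sketch below is one possible shape, not an existing function: `rate_limit_headers` is a hypothetical name, and it assumes `X-RateLimit-Reset` means "seconds until the oldest entry leaves the window", since these headers have no single standardized meaning.

```python
import math
from typing import Dict, Optional


def rate_limit_headers(count: int, max_requests: int,
                       oldest_timestamp: Optional[float],
                       window_seconds: int, now: float) -> Dict[str, str]:
    """Build informational headers from the current window state."""
    remaining = max(0, max_requests - count)
    if oldest_timestamp is None:
        # Empty window: nothing to wait for.
        reset_in = 0
    else:
        # The oldest entry frees a slot once it ages out of the window.
        reset_in = max(0, math.ceil(oldest_timestamp + window_seconds - now))
    return {
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_in),
    }
```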