Rate Limiting — Overview

Audience: Business stakeholders, product owners, analysts, new team members.


What This Module Does

Rate limiting prevents abuse and controls costs by restricting how frequently users can make requests and how many LLM tokens they can consume. Swisper implements three independent rate limiting layers, each protecting against a different threat:

| Layer | What It Protects Against | Storage |
| --- | --- | --- |
| Auth Endpoint | Rapid-fire API requests from authenticated users (DoS, scripting) | Redis |
| Token Budget | Excessive LLM consumption by a single user (cost control) | PostgreSQL |
| Login | Brute-force authentication attempts | Redis |

All three layers use a sliding window algorithm and fail open — if the storage backend (Redis or PostgreSQL) is unavailable, requests are allowed through to avoid blocking legitimate users during infrastructure issues.
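
The sliding window and fail-open behaviour can be illustrated with a small in-memory sketch. It is not Swisper's implementation: the class name, the `StorageUnavailable` error, and the `healthy` flag are hypothetical, and a Python dictionary stands in for Redis or PostgreSQL.

```python
import time
from collections import defaultdict, deque


class StorageUnavailable(Exception):
    """Hypothetical error standing in for a Redis or PostgreSQL outage."""


class SlidingWindowLimiter:
    """In-memory sliding window log: one timestamp per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)
        self.healthy = True  # set to False to simulate a storage outage

    def _check(self, key):
        if not self.healthy:
            raise StorageUnavailable(key)
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def allow(self, key):
        try:
            return self._check(key)
        except StorageUnavailable:
            # Fail open: a storage failure must not block the request.
            return True
```

With `limit=3` and a 20-second window, the fourth call to `allow("user:42")` inside the window returns `False`. Once the earlier timestamps age out, or while `healthy` is `False`, the call returns `True` again.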


Who It Serves

| Persona | Need |
| --- | --- |
| End users | Limits that do not interfere with normal usage. Clear feedback (HTTP 429 with Retry-After) when limits are hit |
| Platform administrators | Configurable limits per layer, ability to adjust token budgets without code changes |
| Security team | Brute-force protection on login, per-IP rate limiting on API endpoints |
| Finance | Per-user token budgets that cap LLM cost exposure |

Key Capabilities

  • Three independent layers — Each layer operates independently with its own storage, configuration, and failure mode. A request can pass endpoint rate limiting but be blocked by token rate limiting, or vice versa.
  • Dual key tracking — Both the auth endpoint limiter and the login limiter track requests by two keys simultaneously (user ID + IP address, or email + IP address). Both limits must pass for the request to proceed.
  • Configurable token budgets — Token rate limits are stored in the database (configuration table) and can be updated at runtime without redeployment. The default is 1,500,000 tokens per 3-hour window.
  • Burst allowance — The token rate limiter allows a 10% burst above the configured limit (e.g., 1,650,000 tokens on a 1,500,000 limit) to avoid penalizing users who are slightly over budget.
  • Fail-open design — All limiters catch storage errors and allow the request through. This prioritizes availability over strict enforcement.
  • Rate limit status in responses — The streaming response and session refresh endpoints include rate limit status, allowing the frontend to show remaining budget to the user.
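
The burst allowance arithmetic can be sketched as follows. The `TokenLimitConfig` class and function names are hypothetical. The sketch also assumes that the burst allowance is stored as a fraction (0.10 for 10%) and that usage exactly at the effective limit is still allowed; the text confirms neither detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLimitConfig:
    max_tokens: int = 1_500_000    # default from the text
    window_hours: int = 3          # default from the text
    burst_allowance: float = 0.10  # assumption: stored as a fraction


def effective_limit(config: TokenLimitConfig) -> int:
    """Configured limit plus the burst allowance, in whole tokens."""
    # round() removes float noise from the multiplication.
    return round(config.max_tokens * (1 + config.burst_allowance))


def is_within_budget(tokens_used: int, config: TokenLimitConfig) -> bool:
    # Assumption: usage exactly at the effective limit is still allowed.
    return tokens_used <= effective_limit(config)


def remaining_tokens(tokens_used: int, config: TokenLimitConfig) -> int:
    """Budget left in the current window, never negative."""
    return max(0, effective_limit(config) - tokens_used)


default = TokenLimitConfig()
limit_with_burst = effective_limit(default)  # 1_650_000 for the defaults
```

A user at 1,600,000 tokens is over the configured limit but inside the burst allowance, so `is_within_budget(1_600_000, default)` is `True` and `remaining_tokens` reports 50,000.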

How It Fits in the Platform

  • Auth Middleware — The endpoint rate limiter is applied as FastAPI middleware to all authenticated requests before they reach the route handler.
  • Chat & Streaming — The token rate limiter is checked when creating a chat and posting a message. Rate limit status is included in the final streaming chunk.
  • Login — The login rate limiter is checked before validating credentials. Successful login clears the email-based counter.
  • Token Analytics — The token rate limiter queries the same token_usages and background_job_token_usages tables used by the analytics module.
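
A usage check that spans both tables might look like the sketch below. Only the table names `token_usages` and `background_job_token_usages` come from the text. The column names (`user_id`, `total_tokens`, `created_at`) are assumptions, and an in-memory SQLite database stands in for PostgreSQL so the example runs on its own.

```python
import sqlite3

# Assumed minimal schema; the real tables certainly have more columns.
SCHEMA = """
CREATE TABLE token_usages (
    user_id TEXT, total_tokens INTEGER, created_at REAL);
CREATE TABLE background_job_token_usages (
    user_id TEXT, total_tokens INTEGER, created_at REAL);
"""


def tokens_in_window(conn, user_id, now, window_hours=3.0):
    """Sum interactive and background tokens used inside the window."""
    cutoff = now - window_hours * 3600
    query = """
        SELECT COALESCE(SUM(total_tokens), 0) FROM (
            SELECT total_tokens FROM token_usages
             WHERE user_id = ? AND created_at > ?
            UNION ALL
            SELECT total_tokens FROM background_job_token_usages
             WHERE user_id = ? AND created_at > ?
        ) AS combined
    """
    row = conn.execute(query, (user_id, cutoff, user_id, cutoff)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
now = 1_000_000.0  # timestamps are epoch seconds in this sketch
conn.executemany(
    "INSERT INTO token_usages VALUES (?, ?, ?)",
    [("u1", 900_000, now - 600),        # inside the 3-hour window
     ("u1", 500_000, now - 4 * 3600)],  # older than the window, ignored
)
conn.execute(
    "INSERT INTO background_job_token_usages VALUES (?, ?, ?)",
    ("u1", 400_000, now - 1800),
)
used = tokens_in_window(conn, "u1", now)
```

Here `used` is 1,300,000: the interactive row and the background row inside the window are both counted, while the four-hour-old row is not.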

Default Limits

| Layer | Dimension | Limit | Window |
| --- | --- | --- | --- |
| Auth Endpoint | Per user | 100 requests | 20 seconds |
| Auth Endpoint | Per IP | 80 requests | 20 seconds |
| Token Budget | Per user | 1,500,000 tokens (+ 10% burst) | 3 hours |
| Login | Per email | 5 attempts | 15 minutes |
| Login | Per IP | 30 attempts | 15 minutes |
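
To make the dual-key defaults concrete, the sketch below checks two dimensions per request and allows it only if both are under their limits. The function names, the key format, and the choice to record a request only when it is allowed are assumptions, and an in-memory dictionary stands in for Redis.

```python
from collections import defaultdict, deque

# Defaults from the table above: (max requests, window in seconds).
AUTH_LIMITS = {"user": (100, 20), "ip": (80, 20)}
LOGIN_LIMITS = {"email": (5, 15 * 60), "ip": (30, 15 * 60)}

_hits = defaultdict(deque)  # key -> timestamps of accepted requests


def _count_recent(key, window, now):
    """Evict expired timestamps for key and return how many remain."""
    hits = _hits[key]
    while hits and hits[0] <= now - window:
        hits.popleft()
    return len(hits)


def check_dual(scope, limits, identities, now):
    """Return (allowed, blocking_dimension) for one request.

    scope keeps auth and login counters apart, e.g. "auth" or "login".
    identities maps each dimension to its value, e.g. {"user": "u1", "ip": "10.0.0.1"}.
    """
    keys = {dim: f"{scope}:{dim}:{identities[dim]}" for dim in limits}
    for dim, (limit, window) in limits.items():
        if _count_recent(keys[dim], window, now) >= limit:
            return False, dim
    # Assumption: only requests that pass every check are recorded.
    for key in keys.values():
        _hits[key].append(now)
    return True, None
```

With these defaults, 80 requests from one IP address within 20 seconds exhaust the per-IP limit even if each comes from a different user, so the 81st returns `(False, "ip")`.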

Limits and Edge Cases

  • No per-endpoint differentiation. The auth endpoint rate limiter uses a single auth endpoint key for all authenticated routes. A user making many lightweight GET requests counts the same as one making expensive POST requests.
  • Token budget includes background jobs. If a user's background jobs (email ingestion, classification) consume many tokens, that consumption reduces the budget available for interactive use in the same window.
  • Login rate limit clears on success. A successful login clears the email-based counter but not the IP-based counter. An attacker cycling through emails from the same IP would still be rate limited.
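
The login behaviour described above (check before validating credentials, clear only the email counter on success) can be sketched as follows. All names are hypothetical, the plain-text password comparison is a stand-in for real credential validation, and the sketch assumes every allowed attempt is counted whether or not it succeeds.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # default login window from the text


class LoginRateLimiter:
    def __init__(self, email_limit=5, ip_limit=30, window=WINDOW_SECONDS):
        self.limits = {"email": email_limit, "ip": ip_limit}
        self.window = window
        self.attempts = {"email": defaultdict(list), "ip": defaultdict(list)}

    def _recent(self, kind, key, now):
        # Keep only attempts that are still inside the window.
        recent = [t for t in self.attempts[kind][key] if t > now - self.window]
        self.attempts[kind][key] = recent
        return recent

    def allow_attempt(self, email, ip, now):
        for kind, key in (("email", email), ("ip", ip)):
            if len(self._recent(kind, key, now)) >= self.limits[kind]:
                return False
        self.attempts["email"][email].append(now)
        self.attempts["ip"][ip].append(now)
        return True

    def on_success(self, email):
        # Only the email counter is cleared; the IP counter keeps its history.
        self.attempts["email"].pop(email, None)


def login(limiter, credentials, email, password, ip, now):
    """Return an HTTP-style status code for one login attempt."""
    if not limiter.allow_attempt(email, ip, now):  # checked before credentials
        return 429
    if credentials.get(email) != password:  # stand-in for real validation
        return 401
    limiter.on_success(email)
    return 200
```

In this sketch, a client that tries 31 different emails from one IP address within the window gets `401` for the first 30 attempts and `429` for the 31st, because the per-IP counter is never cleared.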

FAQ

Q: What happens when a user hits a rate limit? A: They receive an HTTP 429 response with a Retry-After header indicating when they can try again.
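
One way to derive the Retry-After value under a sliding window is sketched below. The function names are hypothetical and the text does not say how Swisper computes the value. The idea is to wait until enough of the oldest requests in the window have expired to leave room for one more.

```python
import math


def retry_after_seconds(timestamps, limit, window_seconds, now):
    """Seconds until the window has room for one more request, or 0."""
    recent = sorted(t for t in timestamps if t > now - window_seconds)
    if len(recent) < limit:
        return 0
    # The (len - limit + 1)-th oldest request must expire before a new one fits.
    blocking = recent[len(recent) - limit]
    return max(1, math.ceil(blocking + window_seconds - now))


def too_many_requests(timestamps, limit, window_seconds, now):
    """Build a minimal (status, headers) pair for a rate-limited request."""
    wait = retry_after_seconds(timestamps, limit, window_seconds, now)
    return 429, {"Retry-After": str(wait)}


# Five login attempts at t = 0..40 s, checked at t = 100 s with the
# default per-email limit of 5 attempts per 15 minutes (900 s).
status, headers = too_many_requests([0, 10, 20, 30, 40], 5, 900, 100)
```

In this example the oldest attempt (t = 0) leaves the window at t = 900, so the response is `429` with `Retry-After: 800`.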

Q: Can I change the token budget without redeploying? A: Yes. The token rate limit parameters (TOKEN_LIMIT_MAX_TOKENS, TOKEN_LIMIT_WINDOW_HOURS, TOKEN_LIMIT_BURST_ALLOWANCE) are stored in the configuration table and read at runtime.

Q: What does "fail open" mean? A: If Redis or PostgreSQL is temporarily unavailable, the rate limiter allows the request through instead of blocking it. This avoids outages caused by the rate limiting infrastructure itself.

Q: Does rate limiting apply to admin/superuser requests? A: Yes. The auth endpoint limiter applies to all authenticated requests, and the token rate limiter is checked for all users when creating chats and posting messages.