Token Usage & Analytics: Architecture
Audience: Architects, tech leads, senior engineers.
Context and Purpose
LLM API calls are the primary operational cost in Swisper. Without tracking, it is impossible to attribute costs to users, features, or specific pipeline nodes. The Token Usage system provides this attribution by instrumenting every LLM call with token-count recording, then persisting the data for analytics.
The driving architectural requirements are:
- Zero impact on request latency — Token recording must not add database writes to the critical path. This is achieved via Redis accumulation during the request.
- Per-node granularity — It must be possible to determine how many tokens each graph node consumes, not just the total per request.
- Separate job tracking — Background job token usage must be tracked separately from interactive usage, since jobs run outside the user request lifecycle.
Architecture Overview
graph TD
subgraph Request["Request Lifecycle"]
SI["session_init"] -->|initialize_tracking| RH["Redis Hash"]
LLM1["LLM Call 1"] -->|record_llm_usage| RH
LLM2["LLM Call N"] -->|record_llm_usage| RH
STREAM["streaming.py"] -->|persist_to_postgres| PG["PostgreSQL: token_usages"]
end
subgraph Job["Background Job Lifecycle"]
JI["Job start"] -->|initialize_tracking| RH2["Redis Hash"]
JLLM["LLM Calls"] -->|record_llm_usage| RH2
JE["Job end"] -->|persist_background_job_to_postgres| PG2["PostgreSQL: background_job_token_usages"]
end
subgraph Analytics["Analytics API"]
API["GET /analytics/token-usage/..."] --> PG & PG2
end
Component Responsibilities
| Component | File | Responsibility |
|---|---|---|
| Token Usage Tracking | services/token_usage_tracking.py | Core tracking logic: initialize_tracking(), record_llm_usage(), persist_to_postgres(), persist_background_job_to_postgres() |
| Token Usage Analytics | services/token_usage.py | Query service: per-user summaries, paginated user lists, node-type aggregations, LLM-type aggregations |
| Analytics API | api/routes/token_usage.py | Six admin-only endpoints under /api/v1/analytics/ |
| SwisperLLMAdapter | gateways/llm/adapter.py | Calls record_llm_usage() after each LLM invocation (structured output, streaming, embedding) |
| TokenTrackingLLMAdapter | gateways/llm/legacy/token_tracking.py | Legacy adapter wrapper that instruments get_structured_output, stream_message_from_LLM, and embed_documents |
Data Model
TokenUsage (table: token_usages)
One row per interactive request (conversation turn).
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| user_id | UUID | User who triggered the request |
| correlation_id | str | Links to logs and Redis hash |
| chat_id | UUID | Conversation this request belongs to |
| structured_output_tokens | int | Tokens from structured output calls |
| streaming_tokens | int | Tokens from streaming calls |
| embedding_tokens | int | Tokens from embedding calls |
| total_prompt_tokens | int | Sum of all prompt tokens |
| total_completion_tokens | int | Sum of all completion tokens |
| total_tokens | int | Grand total |
| llm_call_count | int | Number of LLM invocations in this request |
| request_duration_ms | int | Wall-clock time from init to persist |
| node_type_breakdown | JSONB | {"node_name": {"t": total, "p": prompt, "c": completion}} |
| created_at | datetime | When the request completed |
Indexes: created_at, chat_id, user_id + created_at (composite), node_type_breakdown (GIN).
BackgroundJobTokenUsage (table: background_job_token_usages)
One row per background job run.
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Primary key |
| user_id | UUID | User whose data was processed |
| correlation_id | str | Links to logs |
| job_type | str | Job name (e.g., ingest_emails, classify_emails) |
| total_tokens | int | Grand total tokens consumed |
| llm_call_count | int | Number of LLM invocations |
| processing_duration_ms | int | Wall-clock time |
| llm_type_breakdown | JSONB | Per-LLM-type token counts |
| created_at | datetime | When the job run completed |
Indexes: user_id + created_at (composite), job_type, llm_type_breakdown (GIN).
Two-Tier Storage
Redis (hot tier)
- Key pattern: token_usage:{correlation_id}
- Type: Redis hash
- Fields: user_id, chat_id, structured_output_tokens, streaming_tokens, embedding_tokens, total_prompt_tokens, total_completion_tokens, total_tokens, llm_call_count, node_type_breakdown (JSON string), start_time
- TTL: 3600 seconds (1 hour)
- Purpose: Fast incremental updates during request processing. Each record_llm_usage() call increments the counters via HINCRBY and updates the JSON-encoded node_type_breakdown field.
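
The accumulation step can be sketched roughly as follows. This is an illustration only, not the code in services/token_usage_tracking.py: `FakeRedisHash` is a hypothetical in-memory stand-in for a Redis client so that the example runs without a server, and the `record_llm_usage` signature and the `call_type` values are assumptions. With a real client, the counter updates would correspond to HINCRBY and the TTL to EXPIRE.

```python
import json
from collections import defaultdict

TTL_SECONDS = 3600  # 1-hour TTL, as described above


class FakeRedisHash:
    """Hypothetical in-memory stand-in for the few Redis hash commands used here."""

    def __init__(self):
        self.hashes = defaultdict(dict)
        self.ttls = {}

    def hincrby(self, key, field, amount):
        value = int(self.hashes[key].get(field, 0)) + amount
        self.hashes[key][field] = value
        return value

    def hset(self, key, field, value):
        self.hashes[key][field] = value

    def hget(self, key, field):
        return self.hashes[key].get(field)

    def expire(self, key, seconds):
        self.ttls[key] = seconds


def record_llm_usage(redis, correlation_id, call_type, node_name, prompt, completion):
    """Add one LLM call's token counts to the request's hash (assumed signature).

    call_type is assumed to be "structured_output", "streaming", or "embedding".
    """
    key = f"token_usage:{correlation_id}"
    total = prompt + completion
    redis.hincrby(key, f"{call_type}_tokens", total)
    redis.hincrby(key, "total_prompt_tokens", prompt)
    redis.hincrby(key, "total_completion_tokens", completion)
    redis.hincrby(key, "total_tokens", total)
    redis.hincrby(key, "llm_call_count", 1)

    # The per-node breakdown is a JSON string, so it is read, updated, and rewritten.
    breakdown = json.loads(redis.hget(key, "node_type_breakdown") or "{}")
    node = breakdown.setdefault(node_name, {"t": 0, "p": 0, "c": 0})
    node["t"] += total
    node["p"] += prompt
    node["c"] += completion
    redis.hset(key, "node_type_breakdown", json.dumps(breakdown))
    redis.expire(key, TTL_SECONDS)
```

Unlike the HINCRBY counters, the read-modify-write on the JSON field in this sketch is not atomic. How the real implementation handles concurrent LLM calls within one request is not covered here.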
PostgreSQL (cold tier)
- Tables: token_usages, background_job_token_usages
- Write timing: Once per request/job, at completion
- Purpose: Permanent storage for analytics queries. The Redis key is deleted after successful persistence.
Analytics API
All endpoints are under /api/v1/analytics/ and require superuser authentication.
| Endpoint | Response | Purpose |
|---|---|---|
| GET /token-usage/users/{user_id} | UserTokenUsageSummary | Total usage for one user in time range |
| GET /token-usage/users | PaginatedUserTokenUsage | All users, paginated (default page size 100, max 200) |
| GET /token-usage/background-jobs/users/{user_id} | BackgroundJobUserTokenUsageSummary | Background job usage for one user |
| GET /token-usage/background-jobs/users | PaginatedBackgroundJobUserTokenUsage | All users' job usage, paginated |
| GET /token-usage/aggregated/node-types | AggregatedNodeTypeUsage | Token breakdown by graph node across all requests |
| GET /token-usage/aggregated/llm-types | AggregatedLLMTypeUsage | Token breakdown by LLM type across all job runs |
All endpoints accept optional start and end datetime query parameters for time-range filtering.
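
As an illustration of the time-range filter and the page-size limits, the sketch below builds a request URL and clamps a requested page size. The helper names are hypothetical, and the ISO 8601 encoding of start and end is an assumption; only the parameter names start and end, the base path, and the 100/200 page-size values come from this document.

```python
from urllib.parse import urlencode

BASE_PATH = "/api/v1/analytics"


def token_usage_url(path, start=None, end=None):
    """Build an analytics URL with optional start/end filters (ISO 8601 assumed)."""
    params = {}
    if start is not None:
        params["start"] = start.isoformat()
    if end is not None:
        params["end"] = end.isoformat()
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_PATH}{path}{query}"


def clamp_page_size(requested=None, default=100, maximum=200):
    """Apply the documented default (100) and cap (200) to a requested page size."""
    if requested is None:
        return default
    return max(1, min(requested, maximum))
```

Calling `token_usage_url("/token-usage/users")` with no arguments returns the unfiltered path, which matches the endpoints treating both parameters as optional.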
Key Design Decisions
Decision: Redis accumulation during request, PostgreSQL on completion
- Chosen: Accumulate in Redis hash, flush to PostgreSQL once per request
- Rejected: Direct PostgreSQL writes per LLM call, in-memory accumulation
- Rationale: Direct DB writes would add latency to every LLM call. In-memory accumulation would lose data on process crash. Redis provides fast atomic increments with durability acceptable for the 1-hour window, and a single PostgreSQL write on completion keeps the transaction count low.
Decision: JSONB for node-type breakdown
- Chosen: Store per-node token usage as a JSONB field on each row
- Rejected: Separate child table with one row per node per request
- Rationale: JSONB avoids a high-cardinality child table (dozens of nodes × thousands of requests). The GIN index on the field speeds up key-existence and containment filters, although aggregating over the breakdown still reads every matching row. The trade-off is that schema changes to the breakdown format require careful migration.
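
To make the breakdown format concrete, the following sketch sums node_type_breakdown values across rows in plain Python. It only mirrors the kind of result the node-types endpoint returns. The function name is hypothetical, and the real service presumably aggregates in SQL rather than in application code.

```python
from collections import defaultdict


def aggregate_node_types(breakdowns):
    """Sum {"t", "p", "c"} counters per node across many node_type_breakdown dicts."""
    totals = defaultdict(lambda: {"t": 0, "p": 0, "c": 0})
    for breakdown in breakdowns:
        for node_name, counts in breakdown.items():
            for key in ("t", "p", "c"):
                # Missing keys are treated as zero so older rows do not break the sum.
                totals[node_name][key] += counts.get(key, 0)
    return dict(totals)
```

Treating missing keys as zero is one way to tolerate rows written before a format change, which is the migration concern noted in the rationale.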
Decision: Separate table for background job usage
- Chosen: background_job_token_usages with job_type and llm_type_breakdown
- Rejected: Shared table with nullable chat_id / job_type
- Rationale: Job and interactive usage have different metadata (jobs have job_type and processing_duration_ms; interactive has chat_id and request_duration_ms). Separate tables avoid nullable columns and allow independent indexing strategies.
Known Trade-offs and Debt
| Item | Impact | Remediation |
|---|---|---|
| No model name per call | Cannot determine cost per call without cross-referencing llm_node_configuration | Add model field to the tracking hash and persistence |
| No dollar-cost calculation | Administrators must manually compute cost from token counts and model pricing | Add a pricing table and compute cost at persistence time |
| Redis TTL risk | If a request exceeds 1 hour, accumulated data is lost before persistence | Increase TTL or add a periodic flush for long-running operations |
| No real-time dashboard | Analytics are query-based; no streaming dashboard for live monitoring | Add WebSocket-based live tracking or integrate with an observability platform |
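
The first two remediation items could fit together roughly as sketched below: once a model name is recorded per call, cost can be computed from a pricing table at persistence time. Everything here is hypothetical. The model name and the prices are placeholders invented for illustration, not real pricing, and no such function exists in the current system.

```python
from decimal import Decimal

# Placeholder prices per 1,000 tokens; the model name and values are invented.
PRICING = {
    "example-model": {"prompt": Decimal("0.001"), "completion": Decimal("0.002")},
}


def estimate_cost(model, prompt_tokens, completion_tokens, pricing=PRICING):
    """Return the estimated cost of one call, given per-1,000-token prices."""
    rates = pricing[model]  # raises KeyError for models missing from the table
    return (
        rates["prompt"] * prompt_tokens + rates["completion"] * completion_tokens
    ) / Decimal(1000)
```

Decimal is used instead of float so that summing many small per-call costs does not accumulate binary rounding error.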