Token Usage & Analytics: Architecture

Audience: Architects, tech leads, senior engineers.

Context and Purpose

LLM API calls are the primary operational cost in Swisper. Without tracking, it is impossible to attribute costs to users, features, or specific pipeline nodes. The Token Usage system provides this attribution by instrumenting every LLM call with token-count recording, then persisting the data for analytics.

The driving architectural requirements are:

Minimal impact on request latency: Token recording must not add database writes to the critical path. Counters are accumulated in Redis during the request and written to PostgreSQL once, at completion.
Per-node granularity: It must be possible to determine how many tokens each graph node consumes, not just the total per request.
Separate job tracking: Background job token usage must be tracked separately from interactive usage, since jobs run outside the user request lifecycle.

Architecture Overview

graph TD
    subgraph Request["Request Lifecycle"]
        SI["session_init"] -->|initialize_tracking| RH["Redis Hash"]
        LLM1["LLM Call 1"] -->|record_llm_usage| RH
        LLM2["LLM Call N"] -->|record_llm_usage| RH
        STREAM["streaming.py"] -->|persist_to_postgres| PG["PostgreSQL: token_usages"]
    end

    subgraph Job["Background Job Lifecycle"]
        JI["Job start"] -->|initialize_tracking| RH2["Redis Hash"]
        JLLM["LLM Calls"] -->|record_llm_usage| RH2
        JE["Job end"] -->|persist_background_job_to_postgres| PG2["PostgreSQL: background_job_token_usages"]
    end

    subgraph Analytics["Analytics API"]
        API["GET /analytics/token-usage/..."] --> PG & PG2
    end

Component Responsibilities

Component	File	Responsibility
Token Usage Tracking	`services/token_usage_tracking.py`	Core tracking logic: `initialize_tracking()`, `record_llm_usage()`, `persist_to_postgres()`, `persist_background_job_to_postgres()`
Token Usage Analytics	`services/token_usage.py`	Query service: per-user summaries, paginated user lists, node-type aggregations, LLM-type aggregations
Analytics API	`api/routes/token_usage.py`	Six admin-only endpoints under `/api/v1/analytics/`
SwisperLLMAdapter	`gateways/llm/adapter.py`	Calls `record_llm_usage()` after each LLM invocation (structured output, streaming, embedding)
TokenTrackingLLMAdapter	`gateways/llm/legacy/token_tracking.py`	Legacy adapter wrapper that instruments `get_structured_output`, `stream_message_from_LLM`, and `embed_documents`

Data Model

TokenUsage (table: `token_usages`)

One row per interactive request (conversation turn).

Field	Type	Purpose
`id`	UUID	Primary key
`user_id`	UUID	User who triggered the request
`correlation_id`	str	Links to logs and Redis hash
`chat_id`	UUID	Conversation this request belongs to
`structured_output_tokens`	int	Tokens from structured output calls
`streaming_tokens`	int	Tokens from streaming calls
`embedding_tokens`	int	Tokens from embedding calls
`total_prompt_tokens`	int	Sum of all prompt tokens
`total_completion_tokens`	int	Sum of all completion tokens
`total_tokens`	int	Grand total
`llm_call_count`	int	Number of LLM invocations in this request
`request_duration_ms`	int	Wall-clock time from init to persist
`node_type_breakdown`	JSONB	`{"node_name": {"t": total, "p": prompt, "c": completion}}`
`created_at`	datetime	When the request completed

Indexes: created_at, chat_id, user_id + created_at (composite), node_type_breakdown (GIN).
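
The `node_type_breakdown` format lends itself to simple per-node rollups. The following is a minimal sketch, not the project's query code, of how breakdowns from several rows could be merged in Python. The function name `aggregate_node_breakdowns` and the node names `planner` and `responder` are hypothetical; only the `t`/`p`/`c` keys come from the table above.

```python
from collections import defaultdict


def aggregate_node_breakdowns(breakdowns):
    """Sum per-node token counts across many node_type_breakdown values.

    Each element is a dict shaped like
    {"node_name": {"t": total, "p": prompt, "c": completion}}.
    """
    totals = defaultdict(lambda: {"t": 0, "p": 0, "c": 0})
    for breakdown in breakdowns:
        for node, counts in breakdown.items():
            for field in ("t", "p", "c"):
                # Missing keys count as zero so partial entries do not fail.
                totals[node][field] += counts.get(field, 0)
    return dict(totals)


# Hypothetical breakdowns from two requests.
rows = [
    {"planner": {"t": 15, "p": 10, "c": 5}},
    {"planner": {"t": 5, "p": 5, "c": 0}, "responder": {"t": 7, "p": 3, "c": 4}},
]
summary = aggregate_node_breakdowns(rows)
```

In production the same rollup would more likely run inside PostgreSQL; the Python version only illustrates the shape of the data.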

BackgroundJobTokenUsage (table: `background_job_token_usages`)

One row per background job run.

Field	Type	Purpose
`id`	UUID	Primary key
`user_id`	UUID	User whose data was processed
`correlation_id`	str	Links to logs
`job_type`	str	Job name (e.g., `ingest_emails`, `classify_emails`)
`total_tokens`	int	Grand total tokens consumed
`llm_call_count`	int	Number of LLM invocations
`processing_duration_ms`	int	Wall-clock time
`llm_type_breakdown`	JSONB	Per-LLM-type token counts
`created_at`	datetime	When the job run completed

Indexes: user_id + created_at (composite), job_type, llm_type_breakdown (GIN).

Two-Tier Storage

Redis (hot tier)

Key pattern: token_usage:{correlation_id}
Type: Redis hash
Fields: user_id, chat_id, structured_output_tokens, streaming_tokens, embedding_tokens, total_prompt_tokens, total_completion_tokens, total_tokens, llm_call_count, node_type_breakdown (JSON string), start_time
TTL: 3600 seconds (1 hour)
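Purpose: Fast incremental updates during request processing. Each record_llm_usage() call increments the scalar counters via HINCRBY and rewrites the node_type_breakdown JSON string (a read-modify-write, since Redis stores it as a plain string rather than JSONB).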
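
The accumulation pattern can be sketched as follows. This is an illustration under stated assumptions, not the contents of `services/token_usage_tracking.py`: the function signatures, the `call_type` argument, and the choice to add prompt plus completion tokens to the per-type counter are all assumed. `InMemoryHashStore` is a hypothetical stand-in that mimics the `hset`, `hget`, `hincrby`, and `expire` methods of a Redis client so the sketch runs without a server.

```python
import json
import time

TTL_SECONDS = 3600  # 1-hour TTL, as listed above


class InMemoryHashStore:
    """Hypothetical stand-in for a Redis client; values are stored as strings."""

    def __init__(self):
        self.hashes = {}
        self.ttls = {}

    def hset(self, name, key=None, value=None, mapping=None):
        target = self.hashes.setdefault(name, {})
        if key is not None:
            target[key] = str(value)
        for field, field_value in (mapping or {}).items():
            target[field] = str(field_value)

    def hget(self, name, key):
        return self.hashes.get(name, {}).get(key)

    def hincrby(self, name, key, amount=1):
        target = self.hashes.setdefault(name, {})
        target[key] = str(int(target.get(key, 0)) + amount)
        return int(target[key])

    def expire(self, name, seconds):
        self.ttls[name] = seconds


def initialize_tracking(store, correlation_id, user_id, chat_id):
    """Create the tracking hash and set its TTL (assumed signature)."""
    key = f"token_usage:{correlation_id}"
    store.hset(key, mapping={
        "user_id": user_id,
        "chat_id": chat_id,
        "node_type_breakdown": "{}",
        "start_time": time.time(),
    })
    store.expire(key, TTL_SECONDS)
    return key


def record_llm_usage(store, correlation_id, call_type, node, prompt, completion):
    """Record one LLM call; call_type is 'structured_output', 'streaming' or 'embedding'."""
    key = f"token_usage:{correlation_id}"
    total = prompt + completion
    # Scalar counters use hash increments, which Redis applies atomically.
    store.hincrby(key, f"{call_type}_tokens", total)
    store.hincrby(key, "total_prompt_tokens", prompt)
    store.hincrby(key, "total_completion_tokens", completion)
    store.hincrby(key, "total_tokens", total)
    store.hincrby(key, "llm_call_count", 1)
    # The breakdown is a JSON string, so it is read, updated, and written back.
    breakdown = json.loads(store.hget(key, "node_type_breakdown") or "{}")
    entry = breakdown.setdefault(node, {"t": 0, "p": 0, "c": 0})
    entry["t"] += total
    entry["p"] += prompt
    entry["c"] += completion
    store.hset(key, "node_type_breakdown", json.dumps(breakdown))
```

Note that the breakdown update is not atomic in this sketch. If several LLM calls for the same correlation ID can run concurrently, a real implementation would need a Lua script, a transaction, or per-node hash fields to avoid lost updates.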

PostgreSQL (cold tier)

Tables: token_usages, background_job_token_usages
Write timing: Once per request/job, at completion
Purpose: Permanent storage for analytics queries. The Redis key is deleted only after the PostgreSQL write succeeds.
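
A hedged sketch of the flush step follows. It assumes the Redis hash has already been read into a dict of strings, and it takes the database write and the key deletion as injected callables so the ordering is visible. The names `build_token_usage_row`, `persist_once`, `insert_row`, and `delete_key` are hypothetical and are not the signatures of `persist_to_postgres()`.

```python
import json
import time

INT_FIELDS = (
    "structured_output_tokens",
    "streaming_tokens",
    "embedding_tokens",
    "total_prompt_tokens",
    "total_completion_tokens",
    "total_tokens",
    "llm_call_count",
)


def build_token_usage_row(hash_fields, now=None):
    """Convert a Redis hash (all values are strings) into a token_usages row dict."""
    now = time.time() if now is None else now
    # Counters that were never incremented are absent from the hash; treat as 0.
    row = {name: int(hash_fields.get(name, 0)) for name in INT_FIELDS}
    row["user_id"] = hash_fields["user_id"]
    row["chat_id"] = hash_fields["chat_id"]
    row["node_type_breakdown"] = json.loads(hash_fields.get("node_type_breakdown", "{}"))
    # Wall-clock time from initialization to persistence, in milliseconds.
    row["request_duration_ms"] = int((now - float(hash_fields["start_time"])) * 1000)
    return row


def persist_once(hash_fields, insert_row, delete_key, key):
    """Write one row, then delete the Redis key only if the write succeeded."""
    row = build_token_usage_row(hash_fields)
    insert_row(row)  # expected to raise on failure, which leaves the key in place
    delete_key(key)
    return row
```

Deleting the key after the insert means a failed write leaves the hash in Redis until its TTL expires, which gives a retry a chance to recover the counts.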

Analytics API

All endpoints are under /api/v1/analytics/ and require superuser authentication.

Endpoint	Response	Purpose
`GET /token-usage/users/{user_id}`	`UserTokenUsageSummary`	Total usage for one user in time range
`GET /token-usage/users`	`PaginatedUserTokenUsage`	All users, paginated (default page size 100, max 200)
`GET /token-usage/background-jobs/users/{user_id}`	`BackgroundJobUserTokenUsageSummary`	Background job usage for one user
`GET /token-usage/background-jobs/users`	`PaginatedBackgroundJobUserTokenUsage`	All users' job usage, paginated
`GET /token-usage/aggregated/node-types`	`AggregatedNodeTypeUsage`	Token breakdown by graph node across all requests
`GET /token-usage/aggregated/llm-types`	`AggregatedLLMTypeUsage`	Token breakdown by LLM type across all job runs

All endpoints accept optional start and end datetime query parameters for time-range filtering.
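
As an illustration of the time-range parameters, the sketch below builds a request URL for one of the endpoints. The helper `token_usage_url`, the host `https://swisper.example`, and the ISO 8601 datetime encoding are assumptions; only the `/api/v1/analytics/` prefix, the endpoint paths, and the `start` and `end` parameter names come from this document.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def token_usage_url(base_url, path, start=None, end=None):
    """Build an analytics URL with optional start/end filters (hypothetical helper)."""
    params = {}
    if start is not None:
        params["start"] = start.isoformat()  # assumed format: ISO 8601
    if end is not None:
        params["end"] = end.isoformat()
    url = f"{base_url.rstrip('/')}/api/v1/analytics/{path.lstrip('/')}"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical host; aggregated node-type usage from 1 January 2024 onwards.
url = token_usage_url(
    "https://swisper.example",
    "token-usage/aggregated/node-types",
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

A real client would also need to send superuser credentials with the request; the authentication scheme is not described here, so it is left out.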

Key Design Decisions

Decision: Redis accumulation during request, PostgreSQL on completion

Chosen: Accumulate in Redis hash, flush to PostgreSQL once per request
Rejected: Direct PostgreSQL writes per LLM call, in-memory accumulation
Rationale: Direct DB writes would add latency to every LLM call. In-memory accumulation would lose data on process crash. Redis provides fast atomic increments with durability acceptable for the 1-hour window, and a single PostgreSQL write on completion keeps the transaction count low.

Decision: JSONB for node-type breakdown

Chosen: Store per-node token usage as a JSONB field on each row
Rejected: Separate child table with one row per node per request
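Rationale: JSONB avoids a high-cardinality child table (dozens of nodes × thousands of requests). The GIN index speeds up key-existence and containment filters on the breakdown, such as finding requests that touched a given node; per-node sums still have to expand the JSONB of every row in the queried time range. The trade-off is that schema changes to the breakdown format require careful migration.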

Decision: Separate table for background job usage

Chosen: background_job_token_usages with job_type and llm_type_breakdown
Rejected: Shared table with nullable chat_id / job_type
Rationale: Job and interactive usage have different metadata (jobs have job_type and processing_duration_ms; interactive has chat_id and request_duration_ms). Separate tables avoid nullable columns and allow independent indexing strategies.

Known Trade-offs and Debt

Item	Impact	Remediation
No model name per call	Cannot determine cost per call without cross-referencing `llm_node_configuration`	Add `model` field to the tracking hash and persistence
No dollar-cost calculation	Administrators must manually compute cost from token counts and model pricing	Add a pricing table and compute cost at persistence time
Redis TTL risk	If a request exceeds 1 hour, accumulated data is lost before persistence	Increase TTL or add a periodic flush for long-running operations
No real-time dashboard	Analytics are query-based; no streaming dashboard for live monitoring	Add WebSocket-based live tracking or integrate with an observability platform