
Token Usage & Analytics — Architecture

Audience: Architects, tech leads, senior engineers.


Context and Purpose

LLM API calls are the primary operational cost in Swisper. Without tracking, it is impossible to attribute costs to users, features, or specific pipeline nodes. The Token Usage system provides this attribution by instrumenting every LLM call with token-count recording, then persisting the data for analytics.

The driving architectural requirements are:

  • Negligible impact on request latency: Token recording must not add database writes to the critical path. Usage is accumulated in Redis during the request and written to PostgreSQL once, at completion.
  • Per-node granularity: It must be possible to determine how many tokens each graph node consumes, not just the total per request.
  • Separate job tracking: Background job token usage must be tracked separately from interactive usage, since jobs run outside the user request lifecycle.

Architecture Overview

graph TD
    subgraph Request["Request Lifecycle"]
        SI["session_init"] -->|initialize_tracking| RH["Redis Hash"]
        LLM1["LLM Call 1"] -->|record_llm_usage| RH
        LLM2["LLM Call N"] -->|record_llm_usage| RH
        STREAM["streaming.py"] -->|persist_to_postgres| PG["PostgreSQL: token_usages"]
    end

    subgraph Job["Background Job Lifecycle"]
        JI["Job start"] -->|initialize_tracking| RH2["Redis Hash"]
        JLLM["LLM Calls"] -->|record_llm_usage| RH2
        JE["Job end"] -->|persist_background_job_to_postgres| PG2["PostgreSQL: background_job_token_usages"]
    end

    subgraph Analytics["Analytics API"]
        API["GET /analytics/token-usage/..."] --> PG & PG2
    end

Component Responsibilities

| Component | File | Responsibility |
| --- | --- | --- |
| Token Usage Tracking | services/token_usage_tracking.py | Core tracking logic: initialize_tracking(), record_llm_usage(), persist_to_postgres(), persist_background_job_to_postgres() |
| Token Usage Analytics | services/token_usage.py | Query service: per-user summaries, paginated user lists, node-type aggregations, LLM-type aggregations |
| Analytics API | api/routes/token_usage.py | Six admin-only endpoints under /api/v1/analytics/ |
| SwisperLLMAdapter | gateways/llm/adapter.py | Calls record_llm_usage() after each LLM invocation (structured output, streaming, embedding) |
| TokenTrackingLLMAdapter | gateways/llm/legacy/token_tracking.py | Legacy adapter wrapper that instruments get_structured_output, stream_message_from_LLM, and embed_documents |
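
The adapter rows above describe a wrapper that reports usage after each LLM invocation. A minimal sketch of that pattern follows. The `LLMResult` type, the `TrackingAdapter` class, and the keyword arguments passed to the recorder are assumptions made for illustration; the real signatures in gateways/llm/adapter.py may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LLMResult:
    """Hypothetical result type: response text plus provider-reported token counts."""
    text: str
    prompt_tokens: int
    completion_tokens: int


class TrackingAdapter:
    """Illustrative wrapper: call the LLM, then report usage to a recorder callback."""

    def __init__(self, call_llm: Callable[[str], LLMResult],
                 record_llm_usage: Callable[..., None]) -> None:
        self._call_llm = call_llm
        self._record = record_llm_usage

    def get_structured_output(self, prompt: str, *, correlation_id: str,
                              node_name: str) -> str:
        result = self._call_llm(prompt)
        # Recording happens only after the call returns, so a call that
        # raises records nothing in this sketch.
        self._record(
            correlation_id=correlation_id,
            llm_type="structured_output",  # assumed label for this call kind
            node_name=node_name,
            prompt_tokens=result.prompt_tokens,
            completion_tokens=result.completion_tokens,
        )
        return result.text


# Usage with stand-ins for the LLM client and the recorder.
recorded: List[dict] = []


def fake_llm(prompt: str) -> LLMResult:
    return LLMResult(text="ok", prompt_tokens=len(prompt.split()), completion_tokens=1)


adapter = TrackingAdapter(fake_llm, lambda **kwargs: recorded.append(kwargs))
answer = adapter.get_structured_output(
    "classify this email", correlation_id="req-1", node_name="intent"
)
```

Keeping the recorder as an injected callback means the wrapper can be tested without Redis.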

Data Model

TokenUsage (table: token_usages)

One row per interactive request (conversation turn).

| Field | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| user_id | UUID | User who triggered the request |
| correlation_id | str | Links to logs and Redis hash |
| chat_id | UUID | Conversation this request belongs to |
| structured_output_tokens | int | Tokens from structured output calls |
| streaming_tokens | int | Tokens from streaming calls |
| embedding_tokens | int | Tokens from embedding calls |
| total_prompt_tokens | int | Sum of all prompt tokens |
| total_completion_tokens | int | Sum of all completion tokens |
| total_tokens | int | Grand total |
| llm_call_count | int | Number of LLM invocations in this request |
| request_duration_ms | int | Wall-clock time from init to persist |
| node_type_breakdown | JSONB | {"node_name": {"t": total, "p": prompt, "c": completion}} |
| created_at | datetime | When the request completed |

Indexes: created_at, chat_id, user_id + created_at (composite), node_type_breakdown (GIN).
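
To make the row shape concrete, here is a plain-Python mirror of the token_usages columns with one example row. It is a sketch and not the project's ORM model: the `TokenUsageRow` class, the `breakdown_total` helper, and the node names are hypothetical, and the idea that the per-node "t" values add up to total_tokens is an assumption about how the counters are maintained.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict
from uuid import UUID, uuid4

# Shape documented for node_type_breakdown:
# {"node_name": {"t": total, "p": prompt, "c": completion}}
NodeBreakdown = Dict[str, Dict[str, int]]


@dataclass
class TokenUsageRow:
    """Illustrative mirror of the token_usages columns (not the real model)."""
    user_id: UUID
    correlation_id: str
    chat_id: UUID
    structured_output_tokens: int = 0
    streaming_tokens: int = 0
    embedding_tokens: int = 0
    total_prompt_tokens: int = 0
    total_completion_tokens: int = 0
    total_tokens: int = 0
    llm_call_count: int = 0
    request_duration_ms: int = 0
    node_type_breakdown: NodeBreakdown = field(default_factory=dict)
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def breakdown_total(breakdown: NodeBreakdown) -> int:
    """Sum the per-node "t" entries of a breakdown."""
    return sum(node["t"] for node in breakdown.values())


# Example row for a request with two LLM calls; node names are made up.
row = TokenUsageRow(
    user_id=uuid4(),
    correlation_id="req-1",
    chat_id=uuid4(),
    structured_output_tokens=120,
    streaming_tokens=300,
    total_prompt_tokens=330,
    total_completion_tokens=90,
    total_tokens=420,
    llm_call_count=2,
    node_type_breakdown={
        "intent_classification": {"t": 120, "p": 100, "c": 20},
        "user_interface": {"t": 300, "p": 230, "c": 70},
    },
)
```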

BackgroundJobTokenUsage (table: background_job_token_usages)

One row per background job run.

| Field | Type | Purpose |
| --- | --- | --- |
| id | UUID | Primary key |
| user_id | UUID | User whose data was processed |
| correlation_id | str | Links to logs |
| job_type | str | Job name (e.g., ingest_emails, classify_emails) |
| total_tokens | int | Grand total tokens consumed |
| llm_call_count | int | Number of LLM invocations |
| processing_duration_ms | int | Wall-clock time |
| llm_type_breakdown | JSONB | Per-LLM-type token counts |
| created_at | datetime | When the job run completed |

Indexes: user_id + created_at (composite), job_type, llm_type_breakdown (GIN).


Two-Tier Storage

Redis (hot tier)

  • Key pattern: token_usage:{correlation_id}
  • Type: Redis hash
  • Fields: user_id, chat_id, structured_output_tokens, streaming_tokens, embedding_tokens, total_prompt_tokens, total_completion_tokens, total_tokens, llm_call_count, node_type_breakdown (JSON string), start_time
  • TTL: 3600 seconds (1 hour)
  • Purpose: Fast incremental updates during request processing. Each record_llm_usage() call increments the integer counters with HINCRBY and rewrites the JSON-encoded node_type_breakdown field.
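
The accumulation described in this list can be sketched as follows. The function signatures are assumptions, not the actual API of services/token_usage_tracking.py, and `FakeRedis` is an in-memory stand-in that mimics only the redis-py hash methods used here (`hset`, `hget`, `hincrby`, `expire`) so the example runs without a server. Setting the TTL once at initialization is also an assumption.

```python
import json
import time

TTL_SECONDS = 3600  # 1 hour, as documented for the hot tier


class FakeRedis:
    """In-memory stand-in for the redis-py hash methods used below."""

    def __init__(self):
        self.hashes = {}
        self.ttls = {}

    def hset(self, name, key=None, value=None, mapping=None):
        h = self.hashes.setdefault(name, {})
        if key is not None:
            h[key] = str(value)
        for k, v in (mapping or {}).items():
            h[k] = str(v)

    def hget(self, name, key):
        return self.hashes.get(name, {}).get(key)

    def hincrby(self, name, key, amount=1):
        h = self.hashes.setdefault(name, {})
        h[key] = str(int(h.get(key, "0")) + amount)
        return int(h[key])

    def expire(self, name, seconds):
        self.ttls[name] = seconds


def initialize_tracking(r, correlation_id, user_id, chat_id):
    key = f"token_usage:{correlation_id}"
    r.hset(key, mapping={"user_id": user_id, "chat_id": chat_id,
                         "node_type_breakdown": "{}", "start_time": time.time()})
    r.expire(key, TTL_SECONDS)  # assumption: TTL is set once, at init


def record_llm_usage(r, correlation_id, llm_type, node_name,
                     prompt_tokens, completion_tokens):
    key = f"token_usage:{correlation_id}"
    total = prompt_tokens + completion_tokens
    # Integer counters: each HINCRBY is atomic on the Redis side.
    r.hincrby(key, f"{llm_type}_tokens", total)
    r.hincrby(key, "total_prompt_tokens", prompt_tokens)
    r.hincrby(key, "total_completion_tokens", completion_tokens)
    r.hincrby(key, "total_tokens", total)
    r.hincrby(key, "llm_call_count", 1)
    # The breakdown is a JSON string, so it is read, modified, and written
    # back. In this sketch that step is not atomic across concurrent callers.
    breakdown = json.loads(r.hget(key, "node_type_breakdown") or "{}")
    node = breakdown.setdefault(node_name, {"t": 0, "p": 0, "c": 0})
    node["t"] += total
    node["p"] += prompt_tokens
    node["c"] += completion_tokens
    r.hset(key, "node_type_breakdown", json.dumps(breakdown))


r = FakeRedis()
initialize_tracking(r, "req-1", "user-1", "chat-1")
record_llm_usage(r, "req-1", "structured_output", "intent", 100, 20)
record_llm_usage(r, "req-1", "streaming", "answer", 230, 70)
```

With a real redis-py client, the same method names apply; a pipeline or Lua script would be one way to make the breakdown update atomic if concurrent calls share a correlation ID.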

PostgreSQL (cold tier)

  • Tables: token_usages, background_job_token_usages
  • Write timing: Once per request/job, at completion
  • Purpose: Permanent storage for analytics queries. Redis key is deleted after successful persistence.
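
The flush step can be sketched as "read the hash, write one row, delete the key only if the write succeeded". In the sketch below a plain dict stands in for Redis and a callback stands in for the PostgreSQL insert; `persist_usage` and its parameters are hypothetical names, not the signature of persist_to_postgres().

```python
import json
import time
from typing import Callable, Dict, Optional


def persist_usage(store: Dict[str, Dict[str, str]], correlation_id: str,
                  write_row: Callable[[dict], None],
                  now: Optional[float] = None) -> bool:
    """Flush one tracking hash to permanent storage; return True on success."""
    key = f"token_usage:{correlation_id}"
    fields = store.get(key)
    if fields is None:
        return False  # key expired or tracking was never initialized
    now = time.time() if now is None else now
    row = {
        "correlation_id": correlation_id,
        "user_id": fields["user_id"],
        "chat_id": fields["chat_id"],
        "total_tokens": int(fields.get("total_tokens", 0)),
        "llm_call_count": int(fields.get("llm_call_count", 0)),
        "request_duration_ms": int((now - float(fields["start_time"])) * 1000),
        "node_type_breakdown": json.loads(fields.get("node_type_breakdown", "{}")),
    }
    try:
        write_row(row)
    except Exception:
        # Keep the hot-tier key so a retry can still find the counters.
        return False
    del store[key]  # delete only after the write succeeded
    return True


def failing_write(row: dict) -> None:
    raise RuntimeError("database unavailable")


store = {"token_usage:req-1": {"user_id": "user-1", "chat_id": "chat-1",
                               "total_tokens": "420", "llm_call_count": "2",
                               "node_type_breakdown": "{}", "start_time": "100.0"}}
written = []
failed = persist_usage(store, "req-1", failing_write, now=101.5)
key_kept_after_failure = "token_usage:req-1" in store
succeeded = persist_usage(store, "req-1", written.append, now=101.5)
```

Deleting after the write, rather than before, means a failed write leaves the counters in place until the TTL expires.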

Analytics API

All endpoints are under /api/v1/analytics/ and require superuser authentication.

| Endpoint | Response | Purpose |
| --- | --- | --- |
| GET /token-usage/users/{user_id} | UserTokenUsageSummary | Total usage for one user in time range |
| GET /token-usage/users | PaginatedUserTokenUsage | All users, paginated (default page size 100, max 200) |
| GET /token-usage/background-jobs/users/{user_id} | BackgroundJobUserTokenUsageSummary | Background job usage for one user |
| GET /token-usage/background-jobs/users | PaginatedBackgroundJobUserTokenUsage | All users' job usage, paginated |
| GET /token-usage/aggregated/node-types | AggregatedNodeTypeUsage | Token breakdown by graph node across all requests |
| GET /token-usage/aggregated/llm-types | AggregatedLLMTypeUsage | Token breakdown by LLM type across all job runs |

All endpoints accept optional start and end datetime query parameters for time-range filtering.
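
As an illustration of the time-range filter, the helper below builds the request path for the per-user endpoint. It assumes the API accepts ISO 8601 datetimes for start and end, which the text above does not confirm; `user_usage_url` is a hypothetical client-side helper, and authentication is left out.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

BASE_PATH = "/api/v1/analytics"


def user_usage_url(user_id: str,
                   start: Optional[datetime] = None,
                   end: Optional[datetime] = None) -> str:
    """Build the path and query string for GET /token-usage/users/{user_id}."""
    params = {}
    if start is not None:
        params["start"] = start.isoformat()  # assumed format: ISO 8601
    if end is not None:
        params["end"] = end.isoformat()
    path = f"{BASE_PATH}/token-usage/users/{user_id}"
    # Both parameters are optional, so omit the query string when empty.
    return f"{path}?{urlencode(params)}" if params else path


unfiltered = user_usage_url("user-1")
january = user_usage_url(
    "user-1",
    start=datetime(2025, 1, 1, tzinfo=timezone.utc),
    end=datetime(2025, 2, 1, tzinfo=timezone.utc),
)
```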


Key Design Decisions

Decision: Redis accumulation during request, PostgreSQL on completion

  • Chosen: Accumulate in Redis hash, flush to PostgreSQL once per request
  • Rejected: Direct PostgreSQL writes per LLM call, in-memory accumulation
  • Rationale: Direct DB writes would add latency to every LLM call. In-memory accumulation would lose data on process crash. Redis provides fast atomic increments with durability acceptable for the 1-hour window, and a single PostgreSQL write on completion keeps the transaction count low.

Decision: JSONB for node-type breakdown

  • Chosen: Store per-node token usage as a JSONB field on each row
  • Rejected: Separate child table with one row per node per request
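  • Rationale: JSONB avoids a high-cardinality child table (dozens of nodes × thousands of requests). The GIN index speeds up key-existence and containment filters on the breakdown; summing token counts per node still requires expanding the JSON of every row in the queried time range. A further trade-off is that changes to the breakdown format require careful migration.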
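
To show what a node-type aggregation over the JSONB column amounts to, the sketch below merges several node_type_breakdown values in Python. It illustrates the computation only; the actual query service may do this in SQL, and `aggregate_node_types` and the node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable

NodeBreakdown = Dict[str, Dict[str, int]]


def aggregate_node_types(breakdowns: Iterable[NodeBreakdown]) -> NodeBreakdown:
    """Sum the "t", "p", and "c" counters per node across many rows."""
    totals = defaultdict(lambda: {"t": 0, "p": 0, "c": 0})
    for breakdown in breakdowns:
        for node_name, counts in breakdown.items():
            for field_name in ("t", "p", "c"):
                # Missing counters are treated as zero.
                totals[node_name][field_name] += counts.get(field_name, 0)
    return dict(totals)


rows = [
    {"intent": {"t": 120, "p": 100, "c": 20}, "answer": {"t": 300, "p": 230, "c": 70}},
    {"intent": {"t": 80, "p": 70, "c": 10}},
]
aggregated = aggregate_node_types(rows)
```

Every row in the range has to be unpacked to produce these sums, which is the per-row cost that a child table would instead pay as extra rows and joins.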

Decision: Separate table for background job usage

  • Chosen: background_job_token_usages with job_type and llm_type_breakdown
  • Rejected: Shared table with nullable chat_id / job_type
  • Rationale: Job and interactive usage have different metadata (jobs have job_type and processing_duration_ms; interactive has chat_id and request_duration_ms). Separate tables avoid nullable columns and allow independent indexing strategies.

Known Trade-offs and Debt

| Item | Impact | Remediation |
| --- | --- | --- |
| No model name per call | Cannot determine cost per call without cross-referencing llm_node_configuration | Add model field to the tracking hash and persistence |
| No dollar-cost calculation | Administrators must manually compute cost from token counts and model pricing | Add a pricing table and compute cost at persistence time |
| Redis TTL risk | If a request exceeds 1 hour, accumulated data is lost before persistence | Increase TTL or add a periodic flush for long-running operations |
| No real-time dashboard | Analytics are query-based; no streaming dashboard for live monitoring | Add WebSocket-based live tracking or integrate with an observability platform |
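
The "pricing table" remediation could look roughly like the sketch below. Nothing here exists in the system today: `ModelPrice`, `PRICES`, and `estimate_cost` are hypothetical, the model name is a placeholder, and the prices are made-up values chosen only to keep the arithmetic easy to check.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ModelPrice:
    """Price per one million tokens, kept as Decimal to avoid float rounding."""
    prompt_per_million: Decimal
    completion_per_million: Decimal


# Placeholder entry; real model names and prices would come from configuration.
PRICES = {
    "example-model": ModelPrice(Decimal("1.00"), Decimal("2.00")),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Compute cost from token counts; raises KeyError for an unknown model."""
    price = PRICES[model]
    million = Decimal(1_000_000)
    return (price.prompt_per_million * prompt_tokens
            + price.completion_per_million * completion_tokens) / million


cost = estimate_cost("example-model", prompt_tokens=500_000, completion_tokens=250_000)
```

Prompt and completion tokens are priced separately here, which is why the model name and both token totals would need to be available at persistence time.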