TD-002: LangGraph Pydantic State Serialization Pattern

Status: Identified (Workaround Implemented)
Priority: Medium
Estimated Effort: 1-2 weeks (Option C: State Interface Layer)
Date Identified: 2025-11-27
Identified By: heiko (during HITL bug investigation)

Description

What: We store Pydantic models directly in LangGraph TypedDict state, which conflicts with how LangGraph's checkpoint serialization works. When state is checkpointed to Redis and restored, Pydantic objects come back as plain dictionaries, causing AttributeError when accessing model attributes.

Current State (Anti-Pattern):

# backend/app/api/services/agents/global_supervisor_state.py
class GlobalSupervisorState(TypedDict):
    global_planner_decision: GlobalPlannerDecision  # Pydantic stored directly ❌
    user_in_the_loop: UserInTheLoop                  # Pydantic stored directly ❌
    agent_responses: AgentResponses                  # Pydantic stored directly ❌
    # ... more Pydantic models in state

# In nodes - accessing attributes fails after checkpoint restore:
def some_node(state):
    decision = state.get("global_planner_decision")
    plan = decision.current_plan  # ❌ AttributeError: 'dict' object has no attribute 'current_plan'
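
The failure mode can be reproduced without Redis or LangGraph. The sketch below is a simulation under stated assumptions, not the project's code: a dataclass stands in for the Pydantic model, a JSON round trip stands in for the checkpoint write and restore, `simulate_checkpoint_roundtrip` is a hypothetical helper, and the `current_plan` value is made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GlobalPlannerDecision:
    """Stand-in for the real Pydantic model (assumption: one string field)."""
    current_plan: str


def simulate_checkpoint_roundtrip(state: dict) -> dict:
    """Hypothetical helper: serialize state to bytes and load it back.

    Objects are flattened to dicts on the way out, and nothing on the way
    back in knows which class they used to be.
    """
    payload = json.dumps(state, default=asdict).encode("utf-8")
    return json.loads(payload)


live_state = {"global_planner_decision": GlobalPlannerDecision(current_plan="rebalance")}
restored_state = simulate_checkpoint_roundtrip(live_state)

# Before the round trip, attribute access works.
assert live_state["global_planner_decision"].current_plan == "rebalance"

# After the round trip, the same key holds a plain dict.
decision = restored_state["global_planner_decision"]
try:
    decision.current_plan
except AttributeError as exc:
    error_message = str(exc)
```

The data survives the round trip intact; only the type is lost, which is why the bug stays hidden until a node touches an attribute on restored state.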

Official LangGraph Pattern:

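# State should only contain primitives, dicts, lists, messages
from typing import TypedDict

from langchain_core.messages import BaseMessage
from langgraph.types import interrupt
from pydantic import BaseModel

class State(TypedDict):
    input: str
    user_feedback: str
    messages: list[BaseMessage]

# Pydantic is ONLY for LLM structured output parsing
class AskHuman(BaseModel):
    question: str

# In nodes - extract values from Pydantic, store as primitives/dicts
def ask_human(state: State):
    tool_call = state["messages"][-1].tool_calls[0]   # Tool call emitted by the LLM
    ask = AskHuman.model_validate(tool_call["args"])  # Parse with Pydantic
    location = interrupt(ask.question)                # Pause; resumes with the user's answer
    tool_message = {"type": "tool", "tool_call_id": tool_call["id"], "content": location}
    return {"messages": [tool_message]}               # Store as dict, not Pydantic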

Why It Exists:
- Original implementation predates LangGraph 1.0.x which uses msgpack serialization
- Pydantic models provide nice validation and autocomplete in IDEs
- Pattern worked locally but failed after checkpoint restore
- Issue surfaced during HITL (Human-in-the-Loop) flows where state is persisted

Workaround Implemented

We've implemented state_helpers.py modules in each agent that safely handle both dict and Pydantic model access:

# backend/app/api/services/agents/global_supervisor/state_helpers.py
def get_planner_decision(state: GlobalSupervisorState) -> GlobalPlannerDecision | None:
    """Get planner_decision from state, handling both dict and Pydantic model."""
    raw = state.get("global_planner_decision")
    if raw is None:
        return None
    if isinstance(raw, dict):
        return GlobalPlannerDecision(**raw)  # Reconstruct Pydantic
    return raw  # Already Pydantic
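
Because every getter follows the same three branches, the pattern could be factored into one generic function. The sketch below is a hypothetical variation, not code from the repository: `get_model` and its `from_dict` parameter are invented names, and a dataclass stands in for the Pydantic model so the example runs on the standard library alone. With Pydantic v2, `from_dict` would typically be the model's `model_validate` classmethod.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def get_model(
    state: Mapping[str, Any],
    key: str,
    from_dict: Callable[[dict], T],
) -> Optional[T]:
    """Return state[key] as a model, rebuilding it if it was restored as a dict."""
    raw = state.get(key)
    if raw is None:
        return None
    if isinstance(raw, dict):
        return from_dict(raw)  # restored from a checkpoint
    return raw  # still the live model instance


@dataclass
class Decision:
    """Stand-in for GlobalPlannerDecision (assumption: one string field)."""
    current_plan: str


restored_state = {"global_planner_decision": {"current_plan": "rebalance"}}
decision = get_model(restored_state, "global_planner_decision", lambda d: Decision(**d))
```

Per-field getters such as `get_planner_decision` could then become one-line wrappers, which keeps their precise return type hints while removing the duplicated branching.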

Files implementing this workaround:
- backend/app/api/services/agents/global_supervisor/state_helpers.py
- backend/app/api/services/agents/wealth_agent/state_helpers.py
- backend/app/api/services/agents/research_agent/state_helpers.py
- backend/app/api/services/agents/productivity_agent/state_helpers.py
- backend/app/api/services/agents/doc_agent/state_helpers.py

Impact

Maintainability

⚠️ Every new Pydantic field in state needs a helper function
⚠️ Easy to forget helpers - new developers might access state directly
⚠️ Workaround, not a fix - doesn't address root cause
⚠️ Pattern diverges from official LangGraph examples

Performance

✅ Minimal Impact - helper functions add negligible overhead
✅ Works correctly - HITL resume functions properly now

Testing

⚠️ Must test checkpoint scenarios - unit tests without Redis miss the issue
⚠️ Integration tests essential - need real Redis checkpointer to catch issues
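
A cheap unit-level guard can complement the Redis integration tests (it does not replace them) by forcing a serialization round trip before a node runs. The sketch below is an assumption-laden illustration: `roundtrip`, `check_node_survives_restore`, and both example nodes are hypothetical, a dataclass stands in for the Pydantic model, and JSON only approximates what the real checkpointer does.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class Decision:
    """Stand-in for a Pydantic state model (assumption: one string field)."""
    current_plan: str


def roundtrip(state: dict) -> dict:
    """Approximate a checkpoint write and restore: models come back as dicts."""
    def flatten(obj):
        if is_dataclass(obj):
            return asdict(obj)
        raise TypeError(f"cannot serialize {type(obj).__name__}")
    return json.loads(json.dumps(state, default=flatten))


def check_node_survives_restore(node, state: dict):
    """Run `node` on live and on restored state; both results must match."""
    live_result = node(state)
    restored_result = node(roundtrip(state))
    assert live_result == restored_result, "node behaves differently after restore"
    return live_result


def helper_based_node(state: dict) -> str:
    """Hypothetical node that rebuilds the model before use, as the helpers do."""
    raw = state["global_planner_decision"]
    decision = Decision(**raw) if isinstance(raw, dict) else raw
    return decision.current_plan


def naive_node(state: dict) -> str:
    """Hypothetical node that assumes the value is always a model instance."""
    return state["global_planner_decision"].current_plan


state = {"global_planner_decision": Decision(current_plan="rebalance")}
result = check_node_survives_restore(helper_based_node, state)
```

Passing `naive_node` to the same check raises `AttributeError` on the restored state, which is the signal a plain unit test would otherwise miss.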

Developer Experience

✅ IDE autocomplete works with helpers (type hints preserved)
⚠️ Easy to make mistakes when adding new state fields
⚠️ Documentation burden - must explain the pattern to new developers

Overall Impact: Medium - System works with workaround, but architectural debt accumulates.

Remediation Options

Option A: Full Refactor (Not Recommended Now)

Convert all state fields to plain dicts, never store Pydantic in state.
- Effort: 2-3 weeks
- Risk: High (touching all agents)
- When: If adding 3+ new agents or major state model changes

Option B: Keep Helpers (Current - Short Term)

Continue with state_helpers.py pattern.
- Effort: Done
- Risk: Low
- When: Now through next 1-2 sprints

Option C: State Interface Layer (Recommended - Medium Term)

Create accessor classes with automatic serialization/deserialization:

class GlobalSupervisorStateAccessor:
    def __init__(self, state: GlobalSupervisorState):
        self._state = state

    @property
    def planner_decision(self) -> GlobalPlannerDecision | None:
        raw = self._state.get("global_planner_decision")
        return GlobalPlannerDecision(**raw) if isinstance(raw, dict) else raw

    @planner_decision.setter
    def planner_decision(self, value: GlobalPlannerDecision):
        self._state["global_planner_decision"] = value.model_dump()

- Effort: 1-2 weeks
- Risk: Medium
- When: Next major sprint, or when adding new domain agent
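
Writing a property and setter pair per field repeats the same logic, so one way Option C might be kept compact is a small descriptor. The sketch below is hypothetical: `ModelField`, `StateAccessor`, and the `to_model` / `to_dict` parameters are invented for illustration, and a dataclass stands in for the Pydantic model. With Pydantic v2 the two callables would typically be the model's `model_validate` and `model_dump`.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ModelField(Generic[T]):
    """Descriptor mapping one state key to a model; the state stores plain dicts."""

    def __init__(self, key: str, to_model: Callable[[dict], T], to_dict: Callable[[T], dict]):
        self.key = key
        self.to_model = to_model
        self.to_dict = to_dict

    def __get__(self, accessor, owner=None):
        if accessor is None:
            return self  # accessed on the class, not an instance
        raw = accessor._state.get(self.key)
        # Rebuild the model from a dict; pass through None or a live model.
        return self.to_model(raw) if isinstance(raw, dict) else raw

    def __set__(self, accessor, value: T) -> None:
        accessor._state[self.key] = self.to_dict(value)  # always store a dict


@dataclass
class Decision:
    """Stand-in for GlobalPlannerDecision (assumption: one string field)."""
    current_plan: str


class StateAccessor:
    planner_decision = ModelField(
        "global_planner_decision",
        to_model=lambda d: Decision(**d),
        to_dict=asdict,
    )

    def __init__(self, state: dict):
        self._state = state


state: dict = {}
accessor = StateAccessor(state)
accessor.planner_decision = Decision(current_plan="rebalance")
```

After the assignment, `state` holds a plain dict while `accessor.planner_decision` returns a model again. Since LangGraph nodes normally communicate changes by returning an update dict, an accessor like this could wrap the dict the node returns rather than the incoming state.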

When to Increase Priority

❗ High Priority if: Adding 2+ new domain agents (to avoid spreading pattern)
❗ High Priority if: Multiple developers encounter AttributeError bugs
❗ High Priority if: LangGraph 2.0 breaks current workaround
❗ High Priority if: Major state model refactoring planned

Success Criteria (For Full Remediation)

✅ State only contains primitives, dicts, lists, messages
✅ Pydantic used only for LLM output parsing
✅ No state_helpers.py workaround files needed
✅ All HITL scenarios work without special handling
✅ New developers can follow official LangGraph docs
✅ All tests pass (unit + integration with Redis)
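
The first criterion can be checked mechanically. The sketch below is one possible guard, assuming a hypothetical `find_non_plain` function that walks a state or update dict and reports anything that is not a primitive, dict, or list. Message classes would be passed through the `allowed` parameter; the example uses a made-up `FakeModel` class as the offending value.

```python
from typing import Any

PLAIN_SCALARS = (str, int, float, bool, type(None))


def find_non_plain(value: Any, path: str = "state", allowed: tuple = ()) -> list:
    """Return the paths of values that are not primitives, dicts, lists, or allowed types."""
    if isinstance(value, PLAIN_SCALARS + allowed):
        return []
    if isinstance(value, dict):
        problems = []
        for key, item in value.items():
            problems += find_non_plain(item, f"{path}[{key!r}]", allowed)
        return problems
    if isinstance(value, list):
        problems = []
        for index, item in enumerate(value):
            problems += find_non_plain(item, f"{path}[{index}]", allowed)
        return problems
    # Anything else (for example a model instance) is reported with its type name.
    return [f"{path}: {type(value).__name__}"]


class FakeModel:
    """Made-up class standing in for a Pydantic model left in state."""


clean_state = {"input": "hi", "steps": [{"name": "plan", "done": True}]}
dirty_state = {"input": "hi", "global_planner_decision": FakeModel()}

clean_report = find_non_plain(clean_state)
dirty_report = find_non_plain(dirty_state)
```

A check like this could run in a test over each node's returned update, so a model that slips back into state is reported with its exact path.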

Root Cause: LangGraph checkpoint serialization uses msgpack, so Pydantic models come back as dicts after a restore
Official Pattern: https://langchain-ai.github.io/langgraph/how-tos/human_in_the_loop/wait-user-input/
GitHub Issue: https://github.com/langchain-ai/langgraph/issues/5733 (TypedDict recommendation)
Code References:
backend/app/api/services/agents/global_supervisor_state.py (state definition)
backend/app/api/services/agents/*/state_helpers.py (workaround files)
backend/app/api/services/state_persistence/redis_checkpoint_service.py (checkpointer)

Status Updates

2025-11-27: Technical debt identified during HITL bug investigation
2025-11-27: Workaround (state_helpers.py) implemented for all agents
2025-11-27: decode_responses=False fix applied to Redis checkpointer
Future: When implementing Option C, update Status to "In Progress"
Future: When complete, update Status to "Resolved"
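
The status entries above do not say why `decode_responses=False` was needed. A plausible reading, given that checkpoints are msgpack-encoded, is that binary payloads cannot survive being decoded as UTF-8 text on read. The sketch below illustrates only that general point: the payload is a hand-encoded msgpack map, and `read_like_redis` is a hypothetical, heavily simplified stand-in for a Redis client, not the project's checkpointer.

```python
# msgpack encoding of {"a": 200}: fixmap with 1 entry, fixstr "a", uint8 200.
MSGPACK_PAYLOAD = bytes([0x81, 0xA1, 0x61, 0xCC, 0xC8])


def read_like_redis(payload: bytes, decode_responses: bool):
    """Mimic how a client hands back a stored value (simplified assumption)."""
    return payload.decode("utf-8") if decode_responses else payload


# With decoding off, the bytes arrive untouched and can be unpacked later.
raw = read_like_redis(MSGPACK_PAYLOAD, decode_responses=False)

# With decoding on, the binary payload is not valid UTF-8.
try:
    read_like_redis(MSGPACK_PAYLOAD, decode_responses=True)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```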

Notes

Developer Note: When adding new Pydantic models to agent state:
1. Add a getter function to the agent's state_helpers.py
2. Use the helper function in nodes; never call state.get() directly for Pydantic fields
3. Consider whether the field really needs to be Pydantic, or whether a plain dict would suffice

Architecture Note: This is a known limitation of using Pydantic with LangGraph checkpointing. The official recommendation is to use TypedDict with primitive types for state, and Pydantic only for LLM structured output parsing. Our current architecture diverges from this pattern for developer ergonomics (validation, autocomplete) at the cost of needing workaround helpers.

Testing Note: Always test HITL flows with the actual Redis checkpointer, not just an in-memory one. The serialization issue only manifests when state is persisted to and restored from Redis.