January 31, 2026
Agentic Frameworks: A First-Principles Deep Dive
Part 1: The 6 Core Abstractions Every Framework Provides
An LLM alone is a stateless function: f(prompt) → text. Agentic frameworks exist to turn this into a stateful, tool-using, goal-pursuing loop. Every framework — LangGraph, CrewAI, AutoGen, Semantic Kernel, OpenAI Agents SDK, etc. — essentially provides some combination of the same ~6 core abstractions.
1. The Agent Loop (Reasoning-Action Cycle)
This is the foundational primitive. At its core:
```python
while not done:
    observation = perceive(environment)
    thought = llm(system_prompt + memory + observation)
    action = parse_action(thought)
    result = execute(action)
    memory.append((observation, thought, action, result))
```
This is the ReAct pattern (Reason + Act). Every framework implements some variant. The differences are in how much structure they impose on the loop — LangGraph gives you a full state machine with explicit edges, while lighter approaches like OpenAI's Agents SDK keep it more implicit.
Tradeoff: More structure = more predictable behavior but less flexibility. A rigid graph is easier to debug but harder to adapt to unexpected situations. A free-form loop is more capable but harder to make reliable.
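To make the pseudocode concrete, here is a minimal runnable sketch of the loop with a stub standing in for the model call. `fake_llm`, `parse_action`, and the tool table are illustrative, not any framework's API:

```python
# Minimal ReAct-style loop; fake_llm stands in for a real model call.
TOOLS = {
    "add": lambda a, b: a + b,
}

def fake_llm(transcript: str) -> str:
    # A real model would reason over the transcript; this stub emits
    # one tool call, then finishes once it sees a tool result.
    if "Result:" in transcript:
        return "FINAL: 5"
    return "ACTION: add 2 3"

def parse_action(thought: str):
    if thought.startswith("FINAL:"):
        return ("final", thought.removeprefix("FINAL:").strip())
    _, name, *args = thought.split()
    return (name, [int(a) for a in args])

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory = [f"Goal: {goal}"]
    for _ in range(max_steps):            # always bound the loop
        thought = fake_llm("\n".join(memory))
        action, payload = parse_action(thought)
        if action == "final":
            return payload
        result = TOOLS[action](*payload)  # execute the chosen tool
        memory.append(f"Result: {result}")
    raise RuntimeError("step budget exhausted")

print(run_agent("add 2 and 3"))  # → 5
```

Swapping `fake_llm` for a real completion call and `parse_action` for the provider's tool-call parsing is, structurally, all a framework's inner loop does.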
2. Tool / Function Abstraction
A tool is a function with a schema the LLM can understand:
```
Tool = {
    name: string,
    description: string,     # natural language for the LLM
    parameters: JSONSchema,  # structured input spec
    execute: (params) → result  # actual implementation
}
```
The framework handles: serializing the schema into the prompt or function-calling API, parsing the LLM's output into a valid function call, executing it, and feeding the result back. This is largely commoditized now that most model providers support native tool calling.
Tradeoff: Rich tool descriptions improve selection accuracy but consume context window. Too many tools degrade selection performance — empirically, accuracy drops significantly beyond ~15-20 tools, which is why some frameworks introduce tool routing or hierarchical tool selection.
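That record can be sketched in plain Python. The parameter spec follows JSON Schema, but the `Tool` class and registry here are illustrative, not any framework's types:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    description: str          # natural language, read by the LLM
    parameters: dict          # JSON Schema for the input
    execute: Callable[..., Any]

get_weather = Tool(
    name="get_weather",
    description="Return the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    execute=lambda city: f"Sunny in {city}",
)

# The framework's job: match the model's emitted call to a registry
# entry and invoke it with the parsed arguments.
registry = {get_weather.name: get_weather}
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
result = registry[call["name"]].execute(**call["arguments"])
print(result)  # → Sunny in Paris
```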
3. Memory / State Management
This is where frameworks diverge most. There are three layers:
- Working memory — the current context window. Trivial but bounded by context length.
- Short-term memory — conversation history management. The key question: when context gets too long, what do you summarize/drop? This is a lossy compression problem with no clean solution.
- Long-term memory — persisted knowledge across sessions. Vector stores, knowledge graphs, structured databases.
The math matters here. If your context window is C tokens and each turn averages t tokens, you get roughly C/t turns before you must evict. Summarization compresses at some ratio r, but every compression loses signal. The real engineering challenge is the curation policy: what's worth remembering vs. forgetting.
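The arithmetic above can be checked in a few lines; the window size, per-turn cost, and compression ratio below are illustrative numbers, not measurements:

```python
# Back-of-envelope context budgeting: a C-token window, t tokens per
# turn, and summarization compressing the window by ratio r.
def turns_before_eviction(C: int, t: int) -> int:
    return C // t

def turns_after_one_compression(C: int, t: int, r: float) -> int:
    # Compress the full window once, then keep filling the freed space.
    compressed = int(C * r)
    return turns_before_eviction(C, t) + (C - compressed) // t

print(turns_before_eviction(128_000, 2_000))             # → 64
print(turns_after_one_compression(128_000, 2_000, 0.1))  # → 121
```

Each compression buys roughly `(1 − r) · C / t` more turns, but the summarized turns are now lossy — which is why the curation policy, not the arithmetic, is the hard part.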
Tradeoff: Richer memory = better coherence over long tasks, but higher latency per step (retrieval cost), more tokens consumed ($$), and more failure modes (stale or contradictory memories).
4. Orchestration / Multi-Agent Coordination
When you have multiple agents, you need a coordination pattern:
- Sequential/Pipeline — agent A's output feeds agent B. Simple, predictable.
- Hierarchical — a "manager" agent delegates to specialist agents. Mimics org charts.
- Collaborative/Debate — agents critique each other's work. Can improve quality but multiplies cost.
- Graph-based — arbitrary DAG of agent interactions (LangGraph's core selling point).
First-principles question: do you actually need multiple agents? Often a single agent with good tools and prompting outperforms a multi-agent setup. Multi-agent adds coordination overhead, error propagation between agents, and debugging complexity.
Tradeoff: Multi-agent can decompose complex tasks and enable specialization, but each agent-to-agent handoff is a lossy communication channel (natural language). You're trading compute cost and latency for potential quality gains that are highly task-dependent.
5. Planning / Task Decomposition
Some frameworks add explicit planning:
```python
plan = llm("Break this goal into subtasks: " + goal)
for subtask in plan:
    result = agent.execute(subtask)
    if needs_replan(result):
        plan = revise_plan(plan, result)
```
This ranges from simple prompt-based decomposition to tree search (Tree of Thoughts), to full MCTS-style exploration. The sophistication varies enormously.
Tradeoff: Explicit planning helps on complex multi-step tasks but adds latency (extra LLM calls), can over-decompose simple tasks, and the plan itself can be wrong — leading to confidently executing the wrong steps. "Plan and execute" vs. "just start doing" is genuinely task-dependent.
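A runnable sketch of the plan-and-execute loop, with stub functions standing in for the LLM calls (`make_plan` and `execute` are illustrative; the replanning here is the simplest possible policy — retry the failed step):

```python
# Plan-and-execute with replanning; stubs replace the LLM calls.
def make_plan(goal: str) -> list[str]:
    return [f"research {goal}", f"draft {goal}", f"review {goal}"]

def execute(subtask: str) -> dict:
    # Simulate a transient failure on the draft step, first time only.
    ok = not subtask.startswith("draft") or execute.retried
    if subtask.startswith("draft"):
        execute.retried = True
    return {"subtask": subtask, "ok": ok}
execute.retried = False

def run_plan(goal: str) -> list[dict]:
    plan, done = make_plan(goal), []
    while plan:
        result = execute(plan.pop(0))
        done.append(result)
        if not result["ok"]:
            plan.insert(0, result["subtask"])  # replan: retry failed step
    return done

history = run_plan("the article")
print([r["ok"] for r in history])  # → [True, False, True, True]
```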
6. Guardrails / Control Flow
The safety and reliability layer:
- Input/output validation — schema checking, content filtering
- Human-in-the-loop gates — pause for approval before irreversible actions
- Retry/fallback logic — handle tool failures, malformed outputs
- Budget constraints — max iterations, token limits, cost caps
This is often underweighted in framework marketing but critical in production. An agent without guardrails is a while True loop with your credit card attached.
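These constraints compose naturally as a wrapper around the loop. A minimal sketch — the `step` contract (returning a done-flag and a per-step cost) and the budget figures are illustrative:

```python
# Budget guardrail around an arbitrary step function: caps iterations
# and cumulative cost, retries transient failures.
class BudgetExceeded(Exception):
    pass

def guarded_run(step, max_iters=10, max_cost=1.0, max_retries=2):
    cost = 0.0
    for _ in range(max_iters):
        for attempt in range(max_retries + 1):
            try:
                done, step_cost = step()
                break
            except RuntimeError:          # transient failure: retry
                if attempt == max_retries:
                    raise
        cost += step_cost
        if cost > max_cost:
            raise BudgetExceeded(f"spent {cost:.2f}")
        if done:
            return cost
    raise BudgetExceeded("iteration cap hit")

# A step that fails once, then finishes on its next successful call.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient tool failure")
    return calls["n"] >= 3, 0.1   # (done?, cost of this step)

print(guarded_run(flaky_step))  # → 0.2
```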
What Frameworks Actually Differ On
| Dimension | Lightweight (Agents SDK, Smolagents) | Heavyweight (LangGraph, CrewAI) |
|---|---|---|
| Loop structure | Implicit, model-driven | Explicit graphs/workflows |
| Multi-agent | Optional, simple handoffs | First-class, structured |
| State management | Minimal, bring your own | Built-in persistence |
| Learning curve | Low | High |
| Debuggability | Harder (implicit flow) | Easier (visible graph) |
| Flexibility | High | Constrained by framework |
Alternatives to Consider
Before adopting any framework, ask: what if I just wrote the loop myself? A basic agent is ~50 lines of code:
- Call the model with tools
- If it returns a tool call, execute it, feed result back
- If it returns text, you're done
- Add retry logic and a max-iteration cap
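The four bullets above fit comfortably in a page of Python. A sketch with a stub in place of a real chat-completions call — `fake_model` and `lookup_gdp` are illustrative, and a real implementation would hit a provider API:

```python
import json

# Hand-rolled tool-calling loop: call the model, dispatch tool calls,
# stop on plain text, cap iterations.
def fake_model(messages: list[dict]) -> dict:
    # Stands in for a chat-completions call with tools attached.
    if any(m["role"] == "tool" for m in messages):
        return {"text": "France's GDP is about $3 trillion."}
    return {"tool_call": {"name": "lookup_gdp",
                          "arguments": json.dumps({"country": "France"})}}

TOOLS = {"lookup_gdp": lambda country: f"GDP({country}) ≈ $3T"}

def agent(prompt: str, max_iters: int = 8) -> str:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_iters):                 # iteration cap
        reply = fake_model(messages)
        if "text" in reply:                    # plain text → done
            return reply["text"]
        call = reply["tool_call"]              # tool call → execute it
        args = json.loads(call["arguments"])
        result = TOOLS[call["name"]](**args)
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("max iterations reached")

print(agent("What's the GDP of France?"))
```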
Frameworks earn their keep when you need persistent state across sessions, complex multi-agent coordination, observability/tracing, or human-in-the-loop workflows. If you don't need those, the framework is overhead you're paying for in complexity, abstraction leakage, and version churn.
The honest assessment: the agentic framework space is still immature and churning fast. The abstractions aren't fully settled yet. Betting heavily on any single framework's specific API is risky — betting on understanding the underlying patterns is not.
Part 2: Top 5 Frameworks Deep Dive
The Contenders
Based on current adoption, momentum, architectural distinctiveness, and production readiness:
| # | Framework | v1.0 / Stable | Philosophy |
|---|---|---|---|
| 1 | LangGraph | v1.0 (Oct 2025) | Graph-based agent runtime — maximum control |
| 2 | OpenAI Agents SDK | Stable (Mar 2025) | Minimalist primitives — fewest abstractions |
| 3 | CrewAI | Stable | Role-based multi-agent — team metaphor |
| 4 | PydanticAI | v1.0 (Sep 2025) | Type-safe, FastAPI-inspired — validation-first |
| 5 | Google ADK | v0.5 (2025) | Event-driven, multi-language — Google-ecosystem-optimized |
Why these 5? AutoGen (Microsoft) is notable but has fragmented between versions (0.2 → 0.4 rewrite → AG2 fork), creating ecosystem confusion. These 5 represent the clearest, most architecturally distinct approaches with active momentum going into 2026.
Dimension 1: The Agent Loop (Reasoning-Action Cycle)
This is the most fundamental question: how does the framework structure the think → act → observe cycle?
LangGraph: Explicit State Machine
LangGraph gives you the most control. The loop is a directed graph you define explicitly:
```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list, add_messages]
    next_action: str

graph = StateGraph(State)
graph.add_node("reason", reason_node)
graph.add_node("act", act_node)
graph.add_node("evaluate", evaluate_node)
graph.add_edge(START, "reason")
graph.add_conditional_edges("reason", route_decision)
graph.add_edge("act", "evaluate")
graph.add_conditional_edges("evaluate", should_continue)
app = graph.compile()
```
You choose exactly which nodes execute, in what order, under what conditions. The execution model is inspired by Google's Pregel (bulk synchronous parallel): nodes fire in "super-steps," each processing messages from the previous step.
The math: Your agent's behavior is a function f: State × Node → State applied iteratively. The graph topology constrains the iteration space. A recursion limit (default 1000 in v1.0.6) bounds total super-steps.
What this means in practice: You can implement ReAct, plan-and-execute, tree-of-thought, or any custom loop topology. The graph is the interface. But you must _design_ the graph — there's no default "just go do things" mode (though create_react_agent provides a one-liner for the standard ReAct pattern).
OpenAI Agents SDK: Implicit Model-Driven Loop
The SDK takes the opposite approach — the loop is hidden:
```python
from agents import Agent, Runner

agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant",
    tools=[search_tool, calc_tool],
)
result = await Runner.run(agent, "What's the GDP of France?")
```
The Runner internally runs a loop:
1. Send the prompt + tools to the LLM.
2. If the LLM returns a tool call → execute the tool → feed the result back → go to step 1.
3. If the LLM returns text → done.
That's it. The loop is the standard ReAct cycle with the model deciding when to stop. You don't define edges or nodes — the LLM _is_ the router.
What this means in practice: Fast to ship, easy to understand. But if you need the model to follow a strict multi-step workflow (e.g., "always validate before submitting"), you're encoding that in the prompt, not the framework. This works surprisingly well with capable models but gives you fewer structural guarantees.
CrewAI: Task-Driven Sequential/Hierarchical Loop
CrewAI structures the loop around tasks assigned to roles:
```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(role="Researcher", goal="Find accurate data", ...)
writer = Agent(role="Writer", goal="Write compelling content", ...)

research_task = Task(description="Research AI trends", agent=researcher)
write_task = Task(description="Write article", agent=writer)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,  # or Process.hierarchical
)
result = crew.kickoff()
```
The loop is implicit within each task (agent reasons + acts until task is complete), and explicit between tasks (sequential or managed by a "manager" agent in hierarchical mode).
What this means in practice: The abstraction is high-level — you think in roles and deliverables, not graph edges. Great for "assign work to specialists" patterns. Less suitable for fine-grained control over individual reasoning steps.
PydanticAI: Typed Agent Loop with Validation Gates
PydanticAI's loop is structurally similar to OpenAI's but wrapped in strict type validation:
```python
from pydantic import BaseModel
from pydantic_ai import Agent

class CityInfo(BaseModel):
    name: str
    population: int
    country: str

agent = Agent(
    'openai:gpt-4o',
    result_type=CityInfo,
    system_prompt="Extract city information",
)
result = agent.run_sync("Tell me about Paris")
# result.data is guaranteed to be a valid CityInfo
```
The loop runs until the LLM produces output that passes Pydantic validation. If validation fails, the error is fed back to the LLM for self-correction. This is a validation-gated ReAct loop — the framework won't return until the output schema is satisfied.
The math: This adds a constraint satisfaction layer: the output space is restricted to {o ∈ LLM_outputs | validate(o, Schema) = True}. The retry mechanism is essentially rejection sampling with feedback.
Google ADK: Event-Driven Runtime
ADK structures everything as events flowing through an event loop:
```python
from google.adk.agents import Agent
from google.adk.tools import FunctionTool

def get_weather(city: str) -> str:
    return f"Weather in {city}: Sunny, 22°C"

weather_agent = Agent(
    name="weather_agent",
    model="gemini-2.0-flash",
    instruction="Help users check the weather",
    tools=[FunctionTool(get_weather)],
)
```
Internally, the runtime processes a stream of events: UserInput → ModelCall → ToolCall → ToolResult → ModelCall → FinalResponse. Each event is a discrete unit that can be inspected, logged, replayed. ADK also provides workflow agents (SequentialAgent, ParallelAgent, LoopAgent) for deterministic control flow alongside LLM-driven agents.
What this means in practice: The event-driven model is excellent for debugging and observability (you can inspect every event). The dual approach of LLM agents + workflow agents means you can mix deterministic and non-deterministic control flow in the same system.
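The event-loop shape can be sketched generically — the event names mirror the stream described above, but the queue, handlers, and log here are a toy model, not ADK's runtime:

```python
from collections import deque

# Event-driven runtime sketch: a queue of events, handlers that emit
# follow-up events, and a log recording everything for inspection/replay.
def model_handler(event: dict) -> dict:
    if event["type"] == "UserInput":
        return {"type": "ToolCall", "tool": "get_weather", "arg": "Paris"}
    return {"type": "FinalResponse", "text": "It's sunny in Paris."}

def tool_handler(event: dict) -> dict:
    return {"type": "ToolResult", "value": "Sunny, 22°C"}

HANDLERS = {"UserInput": model_handler, "ToolCall": tool_handler,
            "ToolResult": model_handler}

def run_turn(user_input: str) -> list[dict]:
    queue = deque([{"type": "UserInput", "text": user_input}])
    log = []
    while queue:
        event = queue.popleft()
        log.append(event)                 # every event is recorded
        handler = HANDLERS.get(event["type"])
        if handler:
            queue.append(handler(event))
    return log

log = run_turn("Weather in Paris?")
print([e["type"] for e in log])
# → ['UserInput', 'ToolCall', 'ToolResult', 'FinalResponse']
```

Because the log is just data, "inspect every event" and "replay a turn" fall out for free — which is the observability argument for this architecture.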
Comparative Assessment
| Aspect | LangGraph | OpenAI SDK | CrewAI | PydanticAI | Google ADK |
|---|---|---|---|---|---|
| Loop visibility | Fully explicit | Hidden | Task-level | Hidden + validation | Event stream |
| Custom topologies | Arbitrary graphs | Prompt-only | Sequential/Hierarchical | Linear + retry | LLM + Workflow agents |
| Default behavior | Must define | ReAct auto | Task execution | Validated ReAct | Event-driven ReAct |
| Best when | You need precise control | Standard patterns suffice | Role delegation works | Output correctness critical | Mixed deterministic + LLM |
First-principles verdict: The "right" loop structure depends on your failure mode tolerance. If wrong outputs are expensive (financial, safety), you want explicit graphs (LangGraph) or validation gates (PydanticAI). If speed-to-ship matters more and your model is capable, the implicit loops (OpenAI SDK, ADK) are pragmatic.
Dimension 2: Tool / Function Abstraction
LangGraph
Tools are LangChain Tool objects or plain functions decorated with @tool. LangGraph itself is tool-agnostic — it just executes nodes. Tools live inside nodes. MCP support via adapters that auto-discover and convert MCP tools to LangChain format.
Strength: Largest existing tool ecosystem via LangChain integrations. Weakness: The adapter layer for MCP adds abstraction cost vs. native MCP frameworks.
OpenAI Agents SDK
Tools are Python functions decorated with @function_tool, or MCP servers (hosted or local). Automatic schema generation. Minimal boilerplate.
Strength: Native MCP support (both hosted and local). Minimal boilerplate. Weakness: Best performance with OpenAI models; tool calling quality varies with other providers.
CrewAI
Tools inherit from BaseTool or use the @tool decorator. CrewAI also supports MCP servers in agent configuration.
Strength: Tools can be assigned per-agent, matching the role metaphor. Weakness: Tool ecosystem is smaller than LangChain's. Tool routing across agents can be opaque.
PydanticAI
Tools are plain Python functions with type annotations — Pydantic infers the schema automatically. The dependency injection via RunContext makes tools testable and type-safe. Native MCP support.
Strength: The dependency injection is genuinely elegant — it makes tools testable (you can mock deps) and type-safe. The type system catches schema mismatches at write time, not runtime. Weakness: Python-only for tool definitions.
Google ADK
Tools are FunctionTool wrappers, with built-in support for Google services, OpenAPI specs, and MCP. Also supports AgentTool — using another agent as a tool.
Strength: Rich pre-built tools for Google ecosystem. AgentTool concept is powerful. Multi-language support (Python, TypeScript, Go, Java). Weakness: Non-Google integrations require more work.
Comparative Assessment
| Aspect | LangGraph | OpenAI SDK | CrewAI | PydanticAI | Google ADK |
|---|---|---|---|---|---|
| Schema generation | Manual + decorator | Auto from types | Decorator | Auto from types (best) | Auto from types |
| MCP support | Adapter layer | Native (hosted + local) | Config-based | Native | Native |
| Testability | Standard | Standard | Standard | Excellent (DI) | Good |
| Pre-built tools | Largest (LangChain) | OpenAI built-ins | Moderate | Minimal | Google ecosystem |
| Multi-language | Python, JS | Python, TypeScript | Python | Python | Python, TS, Go, Java |
First-principles verdict: Tool abstraction is largely commoditized — they all wrap functions with schemas. The real differentiators are: PydanticAI's dependency injection (testability), Google ADK's multi-language breadth, and LangChain's existing integration ecosystem. MCP is converging as the universal protocol, which will further commoditize this layer.
Dimension 3: Memory / State Management
This is where frameworks diverge most significantly.
LangGraph: First-Class State with Checkpointing
LangGraph's state management is its defining feature. State is a typed schema that flows through the graph. Every node reads and writes to it. Reducers control how updates merge:
```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # append reducer
    documents: list[str]                     # overwrite reducer (default)
    iteration_count: int
```
Checkpointing saves state at every super-step to a persistence backend (SQLite, Postgres, etc.). This enables durable execution, time travel, and human-in-the-loop pauses.
The math: State transitions are S_{t+1} = reducer(S_t, node_output). Checkpointing creates a DAG of states. Time-travel is graph traversal on this DAG. The cost is O(n) storage per super-step × state size.
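The mechanism can be sketched independently of LangGraph — a toy `Checkpointer` (not the library's API) that snapshots state after every super-step; a list models the simple linear case, and forking from a restored state is what produces the DAG:

```python
import copy

class Checkpointer:
    def __init__(self):
        self.history = []

    def save(self, state: dict) -> int:
        # Deep-copy so later mutation can't corrupt old snapshots.
        self.history.append(copy.deepcopy(state))
        return len(self.history) - 1       # checkpoint id

    def restore(self, checkpoint_id: int) -> dict:
        return copy.deepcopy(self.history[checkpoint_id])

ckpt = Checkpointer()
state = {"messages": [], "iteration_count": 0}
for step in range(3):                      # three super-steps
    state["messages"].append(f"step-{step}")
    state["iteration_count"] += 1
    ckpt.save(state)

rewound = ckpt.restore(0)                  # time-travel back to step 0
print(rewound)  # → {'messages': ['step-0'], 'iteration_count': 1}
```

The deep copies are exactly the O(n) storage cost noted above — one full state per super-step.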
Tradeoff: Most powerful state management of any framework, but also most complex. You need to design your state schema, choose reducers, configure checkpointers.
OpenAI Agents SDK: Session-Level, BYO Long-Term
The SDK provides Sessions for automatic conversation history. Long-term memory is not included — you integrate external solutions (Mem0, your own vector store, etc.).
Tradeoff: Clean and simple for conversational agents. But durable execution, state rollback, or cross-session memory are your responsibility.
CrewAI: Memory as a Feature Set
CrewAI provides multiple memory types as configuration: short-term, long-term (persisted across crew runs via embeddings), entity memory (structured knowledge about entities), and user memory (per-user preferences).
Tradeoff: Most "batteries-included" memory system. But you don't control the curation policy (what gets stored, what gets forgotten).
PydanticAI: Message History + Deps for State
PydanticAI manages conversation history via message_history and delegates broader state to the dependency injection system.
Tradeoff: Clean, explicit, type-safe. But memory management is your responsibility — no built-in long-term memory, embeddings, or curation.
Google ADK: Session + Memory Services
ADK separates state into two layers: Session state (per-session, managed by SessionService) and Memory service (cross-session recall, managed by MemoryService). Also supports session rewind — rolling back to before a previous invocation.
Tradeoff: Good built-in support, especially within Google Cloud. But full memory service capabilities depend on the Google ecosystem.
Comparative Assessment
| Aspect | LangGraph | OpenAI SDK | CrewAI | PydanticAI | Google ADK |
|---|---|---|---|---|---|
| Working memory | State schema + reducers | Auto session | Per-crew context | Message history | Session state |
| Short-term (conversation) | Built-in via messages | Sessions | Built-in | Manual message passing | SessionService |
| Long-term (cross-session) | Persistent checkpointers | BYO | Built-in (entity, user) | BYO via deps | MemoryService |
| Durable execution | ✅ Checkpointing | ❌ | ❌ | ❌ (external via Temporal) | Partial |
| Time travel / rollback | ✅ Full DAG | ❌ | ❌ | ❌ | ✅ Session rewind |
| State complexity | High (you design it) | Low | Medium (config-based) | Medium (DI-based) | Medium |
First-principles verdict: LangGraph is the clear winner if state management is your core requirement. CrewAI is easiest to "just turn on." PydanticAI and OpenAI SDK are honest about what they don't provide, which is respectable.
Dimension 4: Orchestration / Multi-Agent Coordination
LangGraph: Arbitrary Graph Topologies
Agents are nodes. Coordination is edges. Supports sequential, parallel (via parallel super-steps), hierarchical, cyclic, and any custom topology. Sub-graphs can be composed into larger graphs.
Strength: Maximum flexibility. If you can draw the coordination pattern, you can build it. Weakness: You _must_ draw it. No sensible defaults. Graph design is a skill.
OpenAI Agents SDK: Handoffs
Multi-agent coordination via a single primitive: handoffs. An agent can transfer control to another agent. The receiving agent takes over completely.
Strength: Dead simple. One concept to learn. Weakness: Only supports delegation/transfer patterns. No parallel execution, no debate, no complex coordination topologies.
CrewAI: Role-Based Teams
Two modes: sequential (assembly line) and hierarchical (manager delegates and reviews).
Strength: The role/team metaphor is intuitive. Good for decomposable workflows. Weakness: Limited to these two coordination patterns. Can't do arbitrary topologies without workarounds.
PydanticAI: Agent-as-Tool Delegation
No native multi-agent orchestration. You compose agents by using one agent as a dependency or tool for another.
Strength: Explicit, type-safe, composable. No magic. Weakness: No built-in coordination patterns. Multi-agent workflows are hand-rolled.
Google ADK: Hierarchical Agents + Workflow Agents
Structured multi-agent via agent hierarchies and workflow agents (SequentialAgent, ParallelAgent, LoopAgent). Also supports AgentTool and the A2A protocol for cross-framework interoperability.
Strength: Clean separation of deterministic and dynamic orchestration. A2A protocol is forward-looking. Weakness: Still v0.5 — solid primitives but younger ecosystem.
Comparative Assessment
| Pattern | LangGraph | OpenAI SDK | CrewAI | PydanticAI | Google ADK |
|---|---|---|---|---|---|
| Sequential | ✅ | Via handoffs | ✅ Native | Manual | ✅ SequentialAgent |
| Parallel | ✅ Super-steps | ❌ | ❌ | ❌ | ✅ ParallelAgent |
| Hierarchical | ✅ Sub-graphs | Handoff chains | ✅ Native | Manual | ✅ Agent hierarchy |
| Cyclic / Debate | ✅ | ❌ | ❌ | Manual | ✅ LoopAgent |
| Custom topology | ✅ Arbitrary | ❌ | ❌ | Manual | Workflow + LLM combo |
| Cross-framework | ❌ | ❌ | ❌ | A2A support | A2A protocol |
First-principles verdict: Do you actually need multi-agent? If yes, LangGraph gives you the most expressive power. CrewAI gives you the fastest path to role-based delegation. Google ADK's mix of deterministic + LLM orchestration is architecturally the most interesting.
Dimension 5: Planning / Task Decomposition
| Framework | Approach | Verdict |
|---|---|---|
| LangGraph | No built-in. Implement as graph nodes. Can build plan-and-execute, tree-of-thought, MCTS as topology. | Maximum flexibility, zero built-in convenience. |
| OpenAI SDK | No explicit planning. Relies on model's internal chain-of-thought. | Works well with strong reasoning models; fragile with weaker ones. |
| CrewAI | Task lists _are_ the plan. In hierarchical mode, the manager does dynamic planning. | Good for known workflows; less flexible for dynamic decomposition. |
| PydanticAI | No built-in. Structured outputs can encode plans as typed objects. | You can build planning, but the framework doesn't provide it. |
| Google ADK | Workflow agents provide deterministic planning. LLM agents do dynamic planning. | The static/dynamic split is pragmatic. Pre-plan the structure, let LLMs handle details. |
Dimension 6: Guardrails / Control Flow
| Guardrail Type | LangGraph | OpenAI SDK | CrewAI | PydanticAI | Google ADK |
|---|---|---|---|---|---|
| Output schema enforcement | Manual | Manual | Basic | Best-in-class | Good |
| Input validation | Node-based | First-class | Basic | Via deps | Callbacks |
| HITL interrupts | Best-in-class | ❌ | Per-task | ❌ | Tool confirmation |
| Cost/token limits | Via LangSmith | Max turns | Max iterations | Usage tracking | Via Cloud |
| Observability | LangSmith | Built-in tracing | Basic logging | Logfire integration | Dev UI + Cloud |
| Evaluation tooling | Via LangSmith | Via OpenAI evals | ❌ | pydantic_evals | Built-in CLI |
Summary: When to Use What
Choose LangGraph when: You need durable execution, human-in-the-loop as a core requirement, custom coordination topologies, long-running agents (hours/days), or time-travel debugging. Cost: Steep learning curve. Graph design is a new skill.
Choose OpenAI Agents SDK when: You want to ship fast with minimal abstractions, standard ReAct + handoff patterns suffice, and you're primarily using OpenAI models. Cost: You'll build everything beyond the basics yourself.
Choose CrewAI when: Your problem maps naturally to roles and tasks, you want multi-agent out of the box, and built-in memory matters. Cost: Less control over individual agent behavior.
Choose PydanticAI when: Output correctness is your #1 priority, you want type safety and IDE support, and you value testability. Cost: Limited orchestration — you build coordination yourself.
Choose Google ADK when: You're in the Google Cloud ecosystem, want multi-language support, the deterministic + LLM agent split matches your architecture, and A2A interoperability matters. Cost: Still pre-1.0. Google-ecosystem bias.
The Honest Meta-Assessment
None of these frameworks have settled on truly stable abstractions yet. Some patterns are emerging:
- MCP is winning as the universal tool protocol. All 5 now support it.
- The "just write the loop" approach (OpenAI SDK, PydanticAI) is gaining ground as models get more capable.
- State management remains the hardest unsolved problem. LangGraph's checkpointing is the most mature but also most complex.
- Multi-agent is oversold. Most production systems use 1-3 agents with good tools, not swarms of 10+ specialists.
- Interoperability (A2A, MCP, AG-UI) is the real next frontier.
The safest strategy: understand the primitives, write your core agent logic in a way that's framework-portable, and treat the framework as the deployment/infrastructure layer rather than the business logic layer.