January 14, 2026
LLM API Abstraction Landscape
Table of Contents
- 1. The Standards War: Who Won?
- 2. Wire-Level API Comparison
- 3. Open-Source & Chinese Providers
- 4. Abstraction Layer Approaches
  - 4a. LiteLLM / Gateway Proxy
  - 4b. Vercel AI SDK
  - 4c. LangChain
  - 4d. PydanticAI
- 5. Decision Framework
- 6. The Math That Matters
- Appendix: Provider API Quick Reference
1. The Standards War: Who Won?
OpenAI's /chat/completions — The De Facto Standard
OpenAI's Chat Completions API became the de facto standard due to ChatGPT adoption (late 2022). Virtually every inference engine and open-source provider adopted it: vLLM, SGLang, Ollama, Together, Fireworks, DeepSeek, Qwen, Kimi, Zhipu (GLM), and xAI all expose OpenAI-compatible endpoints.
OpenAI's Own Migration Away
Ironically, OpenAI itself is transitioning away from this standard:
| Generation | API | Status |
|---|---|---|
| Gen 1 | /v1/completions | Deprecated |
| Gen 2 | /v1/chat/completions | Current standard, but OpenAI moving on |
| Gen 3 | /v1/responses (Responses API) | OpenAI's new direction (March 2025) |
- Assistants API deprecated August 2025, sunset August 2026
- Open Responses Specification released January 2026 — open-source spec based on the Responses API
- Supported by: Ollama, vLLM, OpenRouter, Hugging Face, Vercel, Nvidia
- Notable omissions: Anthropic, Google DeepMind
The Emerging Two-Standard Reality
Rather than convergence on one universal format, the ecosystem is converging on two dominant wire protocols:
- OpenAI's /chat/completions — the default for everything
- Anthropic's /v1/messages — emerging as a second standard, driven primarily by Claude Code compatibility
DeepSeek, MiniMax, xAI, and Ollama now support both formats. The prediction: frontier open-source labs will offer both OpenAI and Anthropic API compatibility.
2. Wire-Level API Comparison
2.1 Basic Request Structure
OpenAI (/v1/chat/completions):
{
"model": "gpt-4.1",
"messages": [
{ "role": "system", "content": "You are helpful." },
{ "role": "user", "content": "Hello" }
],
"max_tokens": 1000
}
Anthropic (/v1/messages):
{
"model": "claude-sonnet-4-20250514",
"system": "You are helpful.",
"messages": [{ "role": "user", "content": "Hello" }],
"max_tokens": 1000
}
Gemini (/v1beta/models/gemini-2.5-flash:generateContent):
{
"system_instruction": { "parts": [{ "text": "You are helpful." }] },
"contents": [{ "role": "user", "parts": [{ "text": "Hello" }] }],
"generationConfig": { "maxOutputTokens": 1000 }
}
xAI/Grok — identical to OpenAI (drop-in compatible).
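The structural differences above can be captured in one small translation helper. A minimal sketch (function name and shapes are illustrative, not from any SDK; the `model` field is omitted for brevity):

```python
def build_request(provider: str, system: str, user: str, max_tokens: int = 1000) -> dict:
    """Emit the same chat request in each provider's wire format."""
    if provider == "openai":  # also xAI/Grok and most OpenAI-compatible hosts
        return {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "max_tokens": max_tokens,
        }
    if provider == "anthropic":  # system is top-level; max_tokens is required
        return {
            "system": system,
            "messages": [{"role": "user", "content": user}],
            "max_tokens": max_tokens,
        }
    if provider == "gemini":  # camelCase config, parts-based content
        return {
            "system_instruction": {"parts": [{"text": system}]},
            "contents": [{"role": "user", "parts": [{"text": user}]}],
            "generationConfig": {"maxOutputTokens": max_tokens},
        }
    raise ValueError(f"unknown provider: {provider}")
```

Three incompatible payloads from one logical request — this function is, in miniature, what every abstraction layer in section 4 does.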
2.2 Content Model Architecture
This is the first-principles structural difference that matters:
| Provider | Content Model | Unit of Content | Nesting |
|---|---|---|---|
| OpenAI | messages[].content is string OR array of {type, ...} objects | Message | Flat |
| Anthropic | messages[].content is string OR array of content blocks (text, image, tool_use, tool_result, thinking) | Content Block | Flat but typed |
| Gemini | contents[].parts[] — list of Part objects (text, inlineData, functionCall, functionResponse, fileData) | Part | Nested with candidates wrapper |
| xAI | Same as OpenAI | Message | Flat |
Key insight: Anthropic's content-block model is the most compositional. A single assistant turn can interleave thinking, text, and tool calls as sibling blocks:
{
"role": "assistant",
"content": [
{"type": "thinking", "thinking": "Let me reason..."},
{"type": "text", "text": "The answer is..."},
{"type": "tool_use", "id": "call_1", "name": "search", "input": {...}}
]
}
OpenAI separates tool calls into a dedicated tool_calls field on the message. Gemini embeds them as parts.
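A hypothetical normalizer makes the compositional model concrete: given an Anthropic-style content-block list, split it into thinking, text, and tool calls. Field names follow the JSON example above; the helper itself is illustrative.

```python
def split_blocks(content: list[dict]) -> dict:
    """Bucket sibling content blocks from a single assistant turn by type."""
    out = {"thinking": [], "text": [], "tool_calls": []}
    for block in content:
        if block["type"] == "thinking":
            out["thinking"].append(block["thinking"])
        elif block["type"] == "text":
            out["text"].append(block["text"])
        elif block["type"] == "tool_use":
            out["tool_calls"].append(
                {"id": block["id"], "name": block["name"], "args": block["input"]}
            )
    return out
```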
2.3 System Prompt Location
| Provider | Location |
|---|---|
| OpenAI / xAI | Inside messages array with role: "system" |
| Anthropic | Top-level system field |
| Gemini | Top-level system_instruction field |
2.4 Response Structures
| Provider | Content path | Stop signal | Usage |
|---|---|---|---|
| OpenAI | choices[0].message.content | finish_reason | usage.prompt_tokens / completion_tokens |
| Anthropic | content[0].text | stop_reason | usage.input_tokens / output_tokens |
| Gemini | candidates[0].content.parts[0].text | finishReason | usageMetadata.promptTokenCount / candidatesTokenCount |
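The table above reduces to a small accessor. A sketch that pulls the text and token counts out of each provider's response dict (shapes taken from the table; an illustrative helper, not a real SDK call):

```python
def extract(provider: str, resp: dict) -> tuple[str, int, int]:
    """Return (text, input_tokens, output_tokens) for each response format."""
    if provider == "openai":
        return (
            resp["choices"][0]["message"]["content"],
            resp["usage"]["prompt_tokens"],
            resp["usage"]["completion_tokens"],
        )
    if provider == "anthropic":
        return (
            resp["content"][0]["text"],
            resp["usage"]["input_tokens"],
            resp["usage"]["output_tokens"],
        )
    if provider == "gemini":
        um = resp["usageMetadata"]
        return (
            resp["candidates"][0]["content"]["parts"][0]["text"],
            um["promptTokenCount"],
            um["candidatesTokenCount"],
        )
    raise ValueError(f"unknown provider: {provider}")
```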
2.5 Tool / Function Calling
| Aspect | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Tool definition | tools[].function.parameters (JSON Schema) | tools[].input_schema (JSON Schema) | tools[].functionDeclarations[].parameters (OpenAPI subset) |
| Tool call in response | message.tool_calls[] (separate field) | Content block {"type": "tool_use"} (inline) | Part {"functionCall": {...}} (inline) |
| Tool result | {"role": "tool", "tool_call_id": "..."} | {"type": "tool_result"} inside user message | {"role": "function", "parts": [{"functionResponse": {...}}]} |
| Tool choice | "auto" / "required" / {"function": {"name": "..."}} | {"type": "auto" / "any" / "tool", "name": "..."} | toolConfig.functionCallingConfig.mode: "AUTO" / "ANY" / "NONE" |
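Because all three accept JSON Schema (Gemini an OpenAPI subset of it), a tool definition translates mostly mechanically. An illustrative converter from OpenAI's shape to the other two:

```python
def translate_tool(openai_tool: dict) -> tuple[dict, dict]:
    """Convert an OpenAI-style tool definition to Anthropic and Gemini shapes."""
    fn = openai_tool["function"]
    anthropic = {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }
    gemini = {
        "functionDeclarations": [{
            "name": fn["name"],
            "description": fn.get("description", ""),
            # Caveat: Gemini accepts only an OpenAPI subset, so schemas using
            # advanced JSON Schema keywords may need pruning here.
            "parameters": fn["parameters"],
        }]
    }
    return anthropic, gemini
```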
2.6 Streaming Protocols
| Provider | Protocol | Structure |
|---|---|---|
| OpenAI | SSE | data: {"choices": [{"delta": {"content": "token"}}]} — incremental deltas |
| Anthropic | SSE with semantic events | message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop — most structured |
| Gemini | SSE | Chunks of GenerateContentResponse objects |
Anthropic's streaming is the most structured — you know exactly which content block is being populated via indexed start/delta/stop events.
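A sketch of what consuming those semantic events looks like, fed a synthetic event list rather than a live SSE connection. Event and payload shapes follow Anthropic's documented stream; the assembler itself is illustrative and handles only text blocks.

```python
def assemble(events: list[tuple[str, dict]]) -> list[dict]:
    """Rebuild content blocks from (event_type, payload) pairs, keyed by block index."""
    blocks: dict[int, dict] = {}
    for etype, data in events:
        if etype == "content_block_start":
            blocks[data["index"]] = {"type": data["content_block"]["type"], "text": ""}
        elif etype == "content_block_delta":
            # Indexed deltas mean we always know which block is being populated.
            blocks[data["index"]]["text"] += data["delta"].get("text", "")
    return [blocks[i] for i in sorted(blocks)]
```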
2.7 Provider-Specific Features (Not Abstractable)
| Feature | Provider(s) | Abstraction Difficulty |
|---|---|---|
| Prompt caching | Anthropic (inline cache_control), Gemini (cachedContent resource) | Hard — different mechanisms |
| Extended thinking | Anthropic (thinking blocks), OpenAI (reasoning in Responses), Gemini (thinkingConfig) | Medium |
| Citations | Anthropic (inline on text blocks), Gemini (grounding metadata) | Hard |
| Computer use | Anthropic (native tool types), Gemini | Very hard |
| Built-in tools (web search, code exec) | OpenAI Responses, xAI, Anthropic, Gemini | Provider-specific |
| Safety ratings | Gemini only | Can't map |
| max_tokens required | Anthropic requires it; others default | Small but breaks naive adapters |
3. Open-Source & Chinese Providers
API Format Adoption Map
| Provider | Primary API Format | Secondary Format | Notes |
|---|---|---|---|
| DeepSeek | OpenAI /chat/completions | Anthropic /messages | V3.2 uses OpenAI SDK directly. Added Anthropic format for Claude Code compat. |
| Kimi / Moonshot | OpenAI /chat/completions | Anthropic /messages | Anthropic API maps temperature: real_temp = request_temp * 0.6 |
| Qwen / Alibaba | OpenAI (via "compatible mode") | Native DashScope SDK | dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions |
| MiniMax | Anthropic /messages (recommended) | OpenAI /chat/completions | Recommends Anthropic format for full feature support |
| GLM / Zhipu (Z.ai) | OpenAI /chat/completions | — | Standard /v1/chat/completions format |
| xAI / Grok | OpenAI /chat/completions | Anthropic /messages | Compatible with both SDKs natively |
Self-Hosted Inference Engines
| Engine | Format |
|---|---|
| vLLM | OpenAI-compatible |
| SGLang | OpenAI-compatible |
| Ollama | OpenAI-compatible + Anthropic-compatible (announced) |
| llama.cpp | OpenAI-compatible |
| TGI (HuggingFace) | OpenAI-compatible |
Where "Compatible" Breaks Down
The compatibility is real for the 80% case but gets leaky fast:
Reasoning/thinking tokens — every provider does this differently even within the OpenAI-compatible format:
- DeepSeek: reasoning_content as extra field on message
- Kimi K2 Thinking: same reasoning_content field name (not standard OpenAI)
- Qwen3: extra_body parameters to toggle thinking
- OpenAI: separate reasoning object in Responses API
Tool calling chat templates — when running models locally, the inference engine needs model-specific chat templates for tool calls. DeepSeek V3.2, Qwen, Llama, and Mistral all have different internal formats papered over by the "OpenAI-compatible" endpoint.
Provider-specific extra_body extensions:
- Qwen: extra_body={"enable_search": True}
- DeepSeek: extra_body={"reasoning_content": "enable"}
- MiniMax: extra_body={"reasoning_split": True}
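With the OpenAI Python SDK, these toggles ride along via its `extra_body` parameter, which merges extra keys into the request JSON. A hedged sketch of a per-provider kwargs builder; the toggle values are copied from the list above and may drift as providers change their APIs.

```python
# Per-provider request toggles (illustrative; verify against each provider's docs).
EXTRA_BODY = {
    "qwen": {"enable_search": True},
    "deepseek": {"reasoning_content": "enable"},
    "minimax": {"reasoning_split": True},
}

def call_kwargs(provider: str, **base) -> dict:
    """Build kwargs for an OpenAI-SDK-style chat.completions.create call."""
    kwargs = dict(base)
    if provider in EXTRA_BODY:
        kwargs["extra_body"] = EXTRA_BODY[provider]  # SDK merges this into the JSON body
    return kwargs
```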
What's Actually Standardized vs. Fragmented
┌─────────────────────────────────────────────┐
│ WELL STANDARDIZED │
│ • POST /v1/chat/completions │
│ • messages[{role, content}] shape │
│ • response.choices[0].message.content │
│ • basic streaming (SSE data: chunks) │
│ • temperature, max_tokens, top_p │
├─────────────────────────────────────────────┤
│ MOSTLY WORKS BUT QUIRKY │
│ • Function/tool calling schemas │
│ • Structured output / JSON mode │
│ • Multi-modal (image_url in content) │
│ • Stop sequences │
├─────────────────────────────────────────────┤
│ FRAGMENTED / PROVIDER-SPECIFIC │
│ • Reasoning/thinking tokens │
│ • Prompt caching │
│ • Extended context management │
│ • Web search / built-in tools │
│ • Computer use │
│ • Streaming event semantics │
│ • Usage reporting for cached tokens │
│ • Reasoning effort controls │
└─────────────────────────────────────────────┘
4. Abstraction Layer Approaches
4a. LiteLLM / Gateway Proxy
Approach: HTTP proxy server that normalizes OpenAI-format requests into provider-specific API calls. Maintains massive translation table for 100+ providers.
Architecture:
Your App (speaks OpenAI format)
→ HTTP → LiteLLM Proxy
→ translates → Provider API
← translates ← Provider Response
← HTTP ← OpenAI-format response
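The response leg of that translation can be sketched in a few lines: converting an Anthropic /v1/messages response into the OpenAI shape a client expects. This is a simplified illustration of what a gateway does internally; real proxies also handle tool calls, streaming, and error mapping.

```python
def anthropic_to_openai(resp: dict) -> dict:
    """Translate an Anthropic /v1/messages response to chat-completions shape."""
    text = "".join(b["text"] for b in resp["content"] if b["type"] == "text")
    stop_map = {"end_turn": "stop", "max_tokens": "length", "tool_use": "tool_calls"}
    return {
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": stop_map.get(resp["stop_reason"], "stop"),
        }],
        "usage": {
            "prompt_tokens": resp["usage"]["input_tokens"],
            "completion_tokens": resp["usage"]["output_tokens"],
        },
    }
```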
Strengths:
- Language-agnostic (any HTTP client works)
- Centralized cost tracking, rate limiting, audit logging
- Widely adopted: CrewAI and Giskard use it as their default
Weaknesses:
- Lowest-common-denominator problem — loses provider-specific features or gets them late
- Latency overhead: ~8ms p95 claimed, worse at scale; gradual memory leaks reported
- Operational complexity: another service to deploy and monitor
- Competitors emerging: Bifrost (54x faster p99), Portkey, OpenRouter
Best for: Multi-provider routing with cost tracking needs, Python-centric teams.
4b. Vercel AI SDK
Approach: TypeScript-first SDK with specification layer defining abstract interfaces. Each provider has a typed adapter that preserves native capabilities via providerOptions.
Architecture:
Your App (TypeScript)
→ generateText() / streamText()
→ Provider Adapter (compile-time translation)
→ Provider API (native format)
Strengths:
- Zero runtime overhead — translation happens at build time
- Provider options with type safety — providerOptions.anthropic.thinking, providerOptions.openai.reasoningEffort
- Full type safety across the stack
- React/Vue/Svelte/Angular streaming support
- AI SDK 6 introduces reusable Agent abstraction
Weaknesses:
- TypeScript-only ecosystem
- Still needs per-provider knowledge for advanced features
- Smaller community than LangChain
Philosophy: "Abstraction without erasure" — routing is valuable, hiding provider differences is harmful.
Best for: TypeScript web apps, especially with frontend streaming needs.
4c. LangChain
Approach: Full Python framework with internal canonical type system (intermediate representation). Each provider integration translates to/from LangChain's message types.
Architecture:
Your App (Python)
│ speaks: HumanMessage, AIMessage, ToolMessage
│ calls: .invoke(), .stream(), .batch(), .bind_tools()
▼
BaseChatModel (Runnable Interface)
│
├── ChatOpenAI → translates → OpenAI API
├── ChatAnthropic → translates → Anthropic API
├── ChatGoogleGenAI → translates → Gemini API
└── ...
Internal Message Types:
BaseMessage (abstract)
├── HumanMessage → OpenAI "user", Anthropic "user"
├── AIMessage → OpenAI "assistant", Anthropic "assistant"
│ ├── .tool_calls (standardized across providers)
│ ├── .usage_metadata (standardized)
│ └── .additional_kwargs (untyped provider-specific bag)
├── SystemMessage → OpenAI "system", Anthropic top-level system
├── ToolMessage → OpenAI {"role":"tool"}, Anthropic {"type":"tool_result"}
└── FunctionMessage → (deprecated)
Provider-Specific Feature Handling — Three Mechanisms:
- Standardized fields on AIMessage — tool calls, usage metadata normalized across providers
- additional_kwargs — untyped Dict[str, Any] for raw provider data (the escape hatch)
- Provider-specific class parameters — ChatAnthropic(thinking=..., betas=[...]), not portable
init_chat_model — Universal Constructor:
from langchain.chat_models import init_chat_model
model = init_chat_model("openai:gpt-4.1")
model = init_chat_model("anthropic:claude-sonnet-4-5-20250929")
Swap models in one line — but provider-incompatible params cause runtime errors, not compile-time safety.
Strengths:
- Massive ecosystem: memory, retrievers, vector stores, document loaders, 100+ integrations
- Runnable interface: composable chains with the | pipe operator, with_fallbacks, with_retry
- LangGraph: stateful agent graphs where the model is a pluggable node
- LangSmith: observability and evaluation
Weaknesses:
- Leaky abstraction by design — standardizes 60% (text in/out, basic tools), punts 40% to provider-specific code
- additional_kwargs is an untyped bag — invisible to the type system
- Deep class hierarchy — Serializable → Runnable → RunnableSerializable → BaseLanguageModel → BaseChatModel → ChatAnthropic
- Debugging through layers — when translation mangles data, you're digging through 6+ abstraction levels
- False sense of portability — a chain looks provider-agnostic but breaks on swap if you use caching, thinking, etc.
Best for: Complex multi-step agent workflows where the model is one component among many, and you want the full ecosystem (LangGraph + LangSmith + retrievers + memory).
4d. PydanticAI
Approach: Thin, well-typed Python agent framework that cleanly separates three concerns: Model (interface), Provider (auth/endpoint), and Profile (model-specific quirks). Built by the Pydantic team with FastAPI-inspired ergonomics.
Architecture:
Agent (orchestrates the loop)
│ - typed dependencies (dependency injection)
│ - typed output (Pydantic model validation)
│ - tools (auto-schema from Python functions)
│
│ speaks: ModelRequest / ModelResponse (parts-based)
▼
Model (abstract interface)
│ - request() → ModelResponse
│ - request_stream() → StreamedResponse
│
├── OpenAIChatModel + Provider + Profile
├── AnthropicModel + Provider + Profile
├── GoogleModel + Provider + Profile
└── ...
Key Design Innovation — Three-Way Separation:
| Concern | What It Does | Example |
|---|---|---|
| Model | Wire format implementation | OpenAIChatModel (speaks /chat/completions) |
| Provider | Auth, endpoint, HTTP client | AzureProvider (Azure auth with OpenAI wire format) |
| Profile | Schema transforms, model quirks | DeepSeek profile (different tool template, same OpenAI wire format) |
This elegantly solves the "DeepSeek speaks OpenAI's wire format but isn't OpenAI" problem — no separate integration package needed.
Parts-Based IR (vs. LangChain's Role-Based Messages):
ModelRequest(parts=[
SystemPromptPart(content="You are helpful."),
UserPromptPart(content="Roll me a dice."),
])
ModelResponse(parts=[
ThinkingPart(content="..."), # first-class citizen
ToolCallPart(tool_name="roll_dice", args={}),
], provider_details={"finish_reason": "STOP"})
ModelResponsePart is a discriminated union: TextPart | ToolCallPart | BuiltinToolCallPart | BuiltinToolReturnPart | ThinkingPart | FilePart
Why this is better than LangChain's approach: When Anthropic returns [thinking_block, text_block, tool_use_block] in one response, PydanticAI's parts list naturally represents that. LangChain packs it into AIMessage.content: List[Dict] and relies on you to iterate correctly. ThinkingPart is a first-class type, not a bolt-on.
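The parts idea is easy to model with plain dataclasses. An illustrative (non-PydanticAI) sketch mapping the Anthropic content blocks shown earlier onto an ordered parts list; the class names echo PydanticAI's but the implementation is a toy.

```python
from dataclasses import dataclass

@dataclass
class TextPart:
    text: str

@dataclass
class ThinkingPart:
    content: str

@dataclass
class ToolCallPart:
    tool_name: str
    args: dict

def to_parts(content: list[dict]) -> list:
    """Map Anthropic-style content blocks onto an ordered, typed parts list."""
    mapping = {
        "thinking": lambda b: ThinkingPart(b["thinking"]),
        "text": lambda b: TextPart(b["text"]),
        "tool_use": lambda b: ToolCallPart(b["name"], b["input"]),
    }
    return [mapping[b["type"]](b) for b in content if b["type"] in mapping]
```

The sibling-block ordering survives the translation, and thinking is a first-class type rather than a dict buried in a content list.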
Provider-Specific Feature Handling:
- provider_details on responses — typed, named field (not an additional_kwargs catch-all)
- Prefixed settings — openai_service_tier='priority', anthropic_thinking=... — makes non-portable params obvious
- Builtin tools — WebSearchTool, CodeExecutionTool map to provider-native implementations when available, fall back to generic otherwise
Strengths:
- Cleanest abstraction design — Model/Provider/Profile separation
- Type safety as core principle — output_type=CityInfo enforced by Pydantic validation
- Parts-based IR is closer to how models actually work
- Dependency injection built-in (testability from day one)
- TestModel and FunctionModel for testing without API calls
- Lightweight — focused scope, not a framework for everything
- v1.0 released September 2025; production-adopted
Weaknesses:
- Smaller ecosystem — no built-in retrievers, vector stores, document loaders
- Still subject to LCD problem — same feature (structured output) implemented very differently across providers (Claude post-formats → 2x latency, Gemini zeroes logits → single call)
- Provider API often richer than what any abstraction exposes
- Breaks down for larger teams relying heavily on provider-specific capabilities
Best for: Python developers who want a thin, well-typed abstraction with a clean agent loop and are comfortable assembling other components themselves.
5. Decision Framework
Quick Reference Matrix
| Dimension | LiteLLM/Gateway | Vercel AI SDK | LangChain | PydanticAI |
|---|---|---|---|---|
| Language | Any (HTTP proxy) | TypeScript | Python-first | Python |
| Abstraction cost | Network hop per request | Zero (compile-time) | In-process function call | In-process function call |
| Feature fidelity | Lowest common denominator | High (typed providerOptions) | Medium (untyped additional_kwargs) | High (typed provider_details + prefixed params) |
| Provider features | Mostly dropped | Explicit escape hatches | Class-specific params | Model/Provider/Profile separation |
| Scope | LLM calls only | LLM calls + UI streaming | Full framework (chains, memory, RAG, agents) | Agents + tools + typed output |
| Type safety | None | Full (TypeScript) | Partial | Full (Pydantic) |
| Ecosystem | 100+ provider support | Growing, frontend-centric | Massive | Focused, growing |
| Ops complexity | Proxy server to manage | SDK dependency | Framework dependency | Library dependency |
| Thinking tokens | Varies by provider support | Provider-specific metadata | additional_kwargs | First-class ThinkingPart |
When to Use What
| Situation | Recommendation |
|---|---|
| TypeScript web app | Vercel AI SDK — principled abstraction, frontend streaming |
| Centralized gateway (cost tracking, rate limiting, audit) | LiteLLM / Portkey — plan for proxy bottleneck |
| Local open-source models | OpenAI-compatible via vLLM / SGLang / Ollama |
| Complex multi-step agent workflows (memory, RAG, graph agents) | LangChain + LangGraph + LangSmith |
| Python, clean typed agents | PydanticAI — thin abstraction, Pydantic validation |
| Heavy provider-specific features (thinking, caching, computer use) | Native provider SDK + thin adapter you control |
| Single-provider, max performance | Native SDK directly |
6. The Math That Matters
The Set Intersection Problem
Let F_i = feature set of provider i.
A universal abstraction can natively support only ∩F_i (the intersection of all providers' features).
Everything in F_i \ ∩F_i (features unique to provider i) requires provider-specific escape hatches.
As providers innovate independently:
- |F_i \ ∩F_i| (provider-unique features) grows faster than |∩F_i| (shared features)
- The universal adapter gets worse over time relative to native integration
- Every abstraction layer acknowledges this with escape hatches (additional_kwargs, providerOptions, provider_details, extra_body)
The 80/20 Reality
80% case — send text, get text, simple function calling:
- Translation layers work fine
- Conceptual model is shared across all providers
- Any abstraction layer handles this well
20% case — prompt caching, thinking tokens, structured outputs, multi-modal, provider tools:
- These are the features that differentiate models and drive provider choice
- Abstraction layers either drop them (losing value) or leak them (complicating the abstraction)
- These features are increasingly where the production value lives
The Core Principle
**Abstraction layers that unify access across providers are valuable for routing.
Abstractions that erase provider differences are harmful to quality.**
The best tools (Vercel AI SDK, PydanticAI) understand this distinction and provide typed extensibility mechanisms rather than pretending differences don't exist.
Practical Implication
The real question isn't "which abstraction layer?" — it's:
- How many providers do I actually need to swap between?
- Am I using provider-specific features?
If "maybe 2" and "yes" → thin adapter you control. If "5+" and "no" → gateway/abstraction layer is fine.
Appendix: Provider API Quick Reference
Endpoint Patterns
| Provider | Base URL | Endpoint |
|---|---|---|
| OpenAI | https://api.openai.com | /v1/chat/completions |
| Anthropic | https://api.anthropic.com | /v1/messages |
| Gemini | https://generativelanguage.googleapis.com | /v1beta/models/{model}:generateContent |
| xAI | https://api.x.ai | /v1/chat/completions |
| DeepSeek | https://api.deepseek.com | /chat/completions |
| Qwen (DashScope) | https://dashscope-intl.aliyuncs.com | /compatible-mode/v1/chat/completions |
| Kimi (Moonshot) | https://platform.moonshot.ai | OpenAI-compat + Anthropic-compat |
| MiniMax | https://api.minimax.io | /v1 (OpenAI) or /anthropic (Anthropic) |
Auth Patterns
| Provider | Auth Header |
|---|---|
| OpenAI / xAI / DeepSeek / Qwen | Authorization: Bearer {key} |
| Anthropic | x-api-key: {key} + anthropic-version: 2023-06-01 |
| Gemini | x-goog-api-key: {key} or OAuth2 |
| MiniMax | Authorization: Bearer {key} (both endpoints) |
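A minimal header builder following the table above (illustrative; check each provider's docs for current header requirements and API versions):

```python
def auth_headers(provider: str, key: str) -> dict:
    """Build the auth headers each provider expects."""
    if provider == "anthropic":
        return {"x-api-key": key, "anthropic-version": "2023-06-01"}
    if provider == "gemini":
        return {"x-goog-api-key": key}
    # OpenAI, xAI, DeepSeek, Qwen, and MiniMax all use standard bearer auth.
    return {"Authorization": f"Bearer {key}"}
```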