January 14, 2026
LLM API Abstraction Landscape
Table of Contents
- 1. The Standards War: Who Won?
- 2. Wire-Level API Comparison
- 3. Open-Source & Chinese Providers
- 4. Abstraction Layer Approaches
  - 4a. LiteLLM / Gateway Proxy
  - 4b. Vercel AI SDK
  - 4c. LangChain
  - 4d. PydanticAI
- 5. Decision Framework
- 6. The Math That Matters
- Appendix: Provider API Quick Reference
1. The Standards War: Who Won?
OpenAI's /chat/completions — The De Facto Standard
OpenAI's Chat Completions API became the de facto standard due to ChatGPT adoption (late 2022). Virtually every inference engine and open-source provider adopted it: vLLM, SGLang, Ollama, Together, Fireworks, DeepSeek, Qwen, Kimi, Zhipu (GLM), and xAI all expose OpenAI-compatible endpoints.
OpenAI's Own Migration Away
Ironically, OpenAI itself is transitioning away from this standard:
| Generation | API | Status |
|---|---|---|
| Gen 1 | /v1/completions | Deprecated |
| Gen 2 | /v1/chat/completions | Current standard, but OpenAI moving on |
| Gen 3 | /v1/responses (Responses API) | OpenAI's new direction (March 2025) |
- Assistants API deprecated August 2025, sunset August 2026
- Open Responses Specification released January 2026 — open-source spec based on the Responses API
- Supported by: Ollama, vLLM, OpenRouter, Hugging Face, Vercel, Nvidia
- Notable omissions: Anthropic, Google DeepMind
The Emerging Two-Standard Reality
Rather than convergence on one universal format, the ecosystem is converging on two dominant wire protocols:
- OpenAI's /chat/completions — the default for everything
- Anthropic's /v1/messages — emerging as a second standard, driven primarily by Claude Code compatibility
DeepSeek, MiniMax, xAI, and Ollama now support both formats. The prediction: frontier open-source labs will offer both OpenAI and Anthropic API compatibility.
2. Wire-Level API Comparison
2.1 Basic Request Structure
OpenAI (/v1/chat/completions):
{
"model": "gpt-4.1",
"messages": [
{ "role": "system", "content": "You are helpful." },
{ "role": "user", "content": "Hello" }
],
"max_tokens": 1000
}
Anthropic (/v1/messages):
{
"model": "claude-sonnet-4-20250514",
"system": "You are helpful.",
"messages": [{ "role": "user", "content": "Hello" }],
"max_tokens": 1000
}
Gemini (/v1beta/models/gemini-2.5-flash:generateContent):
{
"system_instruction": { "parts": [{ "text": "You are helpful." }] },
"contents": [{ "role": "user", "parts": [{ "text": "Hello" }] }],
"generationConfig": { "maxOutputTokens": 1000 }
}
xAI/Grok — identical to OpenAI (drop-in compatible).
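The structural differences above can be captured in one small translation helper. A minimal sketch (function name and shapes are illustrative, not from any SDK; the `model` field is omitted for brevity):

```python
def build_request(provider: str, system: str, user: str, max_tokens: int = 1000) -> dict:
    """Emit the same chat request in each provider's wire format."""
    if provider == "openai":  # also xAI/Grok and most OpenAI-compatible hosts
        return {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "max_tokens": max_tokens,
        }
    if provider == "anthropic":  # system is top-level; max_tokens is required
        return {
            "system": system,
            "messages": [{"role": "user", "content": user}],
            "max_tokens": max_tokens,
        }
    if provider == "gemini":  # camelCase config, parts-based content
        return {
            "system_instruction": {"parts": [{"text": system}]},
            "contents": [{"role": "user", "parts": [{"text": user}]}],
            "generationConfig": {"maxOutputTokens": max_tokens},
        }
    raise ValueError(f"unknown provider: {provider}")
```

Three incompatible payloads from one logical request — this function is, in miniature, what every abstraction layer in section 4 does.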
2.2 Content Model Architecture
This is the first-principles structural difference that matters:
| Provider | Content Model | Unit of Content | Nesting |
|---|---|---|---|
| OpenAI | messages[].content is string OR array of {type, ...} objects | Message | Flat |
| Anthropic | messages[].content is string OR array of content blocks (text, image, tool_use, tool_result, thinking) | Content Block | Flat but typed |
| Gemini | contents[].parts[] — list of Part objects (text, inlineData, functionCall, functionResponse, fileData) | Part | Nested with candidates wrapper |
| xAI | Same as OpenAI | Message | Flat |
Key insight: Anthropic's content-block model is the most compositional. A single assistant turn can interleave thinking, text, and tool calls as sibling blocks:
{
"role": "assistant",
"content": [
{"type": "thinking", "thinking": "Let me reason..."},
{"type": "text", "text": "The answer is..."},
{"type": "tool_use", "id": "call_1", "name": "search", "input": {...}}
]
}
OpenAI separates tool calls into a dedicated tool_calls field on the message. Gemini embeds them as parts.
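A hypothetical normalizer makes the compositional model concrete: given an Anthropic-style content-block list, split it into thinking, text, and tool calls. Field names follow the JSON example above; the helper itself is illustrative.

```python
def split_blocks(content: list[dict]) -> dict:
    """Bucket sibling content blocks from a single assistant turn by type."""
    out = {"thinking": [], "text": [], "tool_calls": []}
    for block in content:
        if block["type"] == "thinking":
            out["thinking"].append(block["thinking"])
        elif block["type"] == "text":
            out["text"].append(block["text"])
        elif block["type"] == "tool_use":
            out["tool_calls"].append(
                {"id": block["id"], "name": block["name"], "args": block["input"]}
            )
    return out
```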
2.3 System Prompt Location
| Provider | Location |
|---|---|
| OpenAI / xAI | Inside messages array with role: "system" |
| Anthropic | Top-level system field |
| Gemini | Top-level system_instruction field |
2.4 Response Structures
| Provider | Content path | Stop signal | Usage |
|---|---|---|---|
| OpenAI | choices[0].message.content | finish_reason | usage.prompt_tokens / completion_tokens |
| Anthropic | content[0].text | stop_reason | usage.input_tokens / output_tokens |
| Gemini | candidates[0].content.parts[0].text | finishReason | usageMetadata.promptTokenCount / candidatesTokenCount |
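The table above reduces to a small accessor. A sketch that pulls the text and token counts out of each provider's response dict (shapes taken from the table; an illustrative helper, not a real SDK call):

```python
def extract(provider: str, resp: dict) -> tuple[str, int, int]:
    """Return (text, input_tokens, output_tokens) for each response format."""
    if provider == "openai":
        return (
            resp["choices"][0]["message"]["content"],
            resp["usage"]["prompt_tokens"],
            resp["usage"]["completion_tokens"],
        )
    if provider == "anthropic":
        return (
            resp["content"][0]["text"],
            resp["usage"]["input_tokens"],
            resp["usage"]["output_tokens"],
        )
    if provider == "gemini":
        um = resp["usageMetadata"]
        return (
            resp["candidates"][0]["content"]["parts"][0]["text"],
            um["promptTokenCount"],
            um["candidatesTokenCount"],
        )
    raise ValueError(f"unknown provider: {provider}")
```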
2.5 Tool / Function Calling
| Aspect | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Tool definition | tools[].function.parameters (JSON Schema) | tools[].input_schema (JSON Schema) | tools[].functionDeclarations[].parameters (OpenAPI subset) |
| Tool call in response | message.tool_calls[] (separate field) | Content block {"type": "tool_use"} (inline) | Part {"functionCall": {...}} (inline) |
| Tool result | {"role": "tool", "tool_call_id": "..."} | {"type": "tool_result"} inside user message | {"role": "function", "parts": [{"functionResponse": {...}}]} |
| Tool choice | "auto" / "required" / {"function": {"name": "..."}} | {"type": "auto" / "any" / "tool", "name": "..."} | toolConfig.functionCallingConfig.mode: "AUTO" / "ANY" / "NONE" |
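Because all three accept JSON Schema (Gemini an OpenAPI subset of it), a tool definition translates mostly mechanically. An illustrative converter from OpenAI's shape to the other two:

```python
def translate_tool(openai_tool: dict) -> tuple[dict, dict]:
    """Convert an OpenAI-style tool definition to Anthropic and Gemini shapes."""
    fn = openai_tool["function"]
    anthropic = {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }
    gemini = {
        "functionDeclarations": [{
            "name": fn["name"],
            "description": fn.get("description", ""),
            # Caveat: Gemini accepts only an OpenAPI subset, so schemas using
            # advanced JSON Schema keywords may need pruning here.
            "parameters": fn["parameters"],
        }]
    }
    return anthropic, gemini
```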
2.6 Streaming Protocols
| Provider | Protocol | Structure |
|---|---|---|
| OpenAI | SSE | data: {"choices": [{"delta": {"content": "token"}}]} — incremental deltas |
| Anthropic | SSE with semantic events | message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop — most structured |
| Gemini | SSE | Chunks of GenerateContentResponse objects |
Anthropic's streaming is the most structured — you know exactly which content block is being populated via indexed start/delta/stop events.
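A sketch of what consuming those semantic events looks like, fed a synthetic event list rather than a live SSE connection. Event and payload shapes follow Anthropic's documented stream; the assembler itself is illustrative and handles only text blocks.

```python
def assemble(events: list[tuple[str, dict]]) -> list[dict]:
    """Rebuild content blocks from (event_type, payload) pairs, keyed by block index."""
    blocks: dict[int, dict] = {}
    for etype, data in events:
        if etype == "content_block_start":
            blocks[data["index"]] = {"type": data["content_block"]["type"], "text": ""}
        elif etype == "content_block_delta":
            # Indexed deltas mean we always know which block is being populated.
            blocks[data["index"]]["text"] += data["delta"].get("text", "")
    return [blocks[i] for i in sorted(blocks)]
```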
2.7 Provider-Specific Features (Not Abstractable)
| Feature | Provider(s) | Abstraction Difficulty |
|---|---|---|
| Prompt caching | Anthropic (inline cache_control), Gemini (cachedContent resource) | Hard — different mechanisms |
| Extended thinking | Anthropic (thinking blocks), OpenAI (reasoning in Responses), Gemini (thinkingConfig) | Medium |
| Citations | Anthropic (inline on text blocks), Gemini (grounding metadata) | Hard |
| Computer use | Anthropic (native tool types), Gemini | Very hard |
| Built-in tools (web search, code exec) | OpenAI Responses, xAI, Anthropic, Gemini | Provider-specific |
| Safety ratings | Gemini only | Can't map |
| max_tokens required | Anthropic requires it; others default | Small but breaks naive adapters |
3. Open-Source & Chinese Providers
API Format Adoption Map
| Provider | Primary API Format | Secondary Format | Notes |
|---|---|---|---|
| DeepSeek | OpenAI /chat/completions | Anthropic /messages | V3.2 uses OpenAI SDK directly. Added Anthropic format for Claude Code compat. |
| Kimi / Moonshot | OpenAI /chat/completions | Anthropic /messages | Anthropic API maps temperature: real_temp = request_temp * 0.6 |
| Qwen / Alibaba | OpenAI (via "compatible mode") | Native DashScope SDK | dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions |
| MiniMax | Anthropic /messages (recommended) | OpenAI /chat/completions | Recommends Anthropic format for full feature support |
| GLM / Zhipu (Z.ai) | OpenAI /chat/completions | — | Standard /v1/chat/completions format |
| xAI / Grok | OpenAI /chat/completions | Anthropic /messages | Compatible with both SDKs natively |
Self-Hosted Inference Engines
| Engine | Format |
|---|---|
| vLLM | OpenAI-compatible |
| SGLang | OpenAI-compatible |
| Ollama | OpenAI-compatible + Anthropic-compatible (announced) |
| llama.cpp | OpenAI-compatible |
| TGI (HuggingFace) | OpenAI-compatible |
Where "Compatible" Breaks Down
The compatibility is real for the 80% case but gets leaky fast:
Reasoning/thinking tokens — every provider does this differently even within the OpenAI-compatible format:
- DeepSeek: reasoning_content as extra field on message
- Kimi K2 Thinking: same reasoning_content field name (not standard OpenAI)
- Qwen3: extra_body parameters to toggle thinking
- OpenAI: separate reasoning object in Responses API
Tool calling chat templates — when running models locally, the inference engine needs model-specific chat templates for tool calls. DeepSeek V3.2, Qwen, Llama, and Mistral all have different internal formats papered over by the "OpenAI-compatible" endpoint.
Provider-specific extra_body extensions:
- Qwen: extra_body={"enable_search": True}
- DeepSeek: extra_body={"reasoning_content": "enable"}
- MiniMax: extra_body={"reasoning_split": True}
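With the OpenAI Python SDK, these toggles ride along via its `extra_body` parameter, which merges extra keys into the request JSON. A hedged sketch of a per-provider kwargs builder; the toggle values are copied from the list above and may drift as providers change their APIs.

```python
# Per-provider request toggles (illustrative; verify against each provider's docs).
EXTRA_BODY = {
    "qwen": {"enable_search": True},
    "deepseek": {"reasoning_content": "enable"},
    "minimax": {"reasoning_split": True},
}

def call_kwargs(provider: str, **base) -> dict:
    """Build kwargs for an OpenAI-SDK-style chat.completions.create call."""
    kwargs = dict(base)
    if provider in EXTRA_BODY:
        kwargs["extra_body"] = EXTRA_BODY[provider]  # SDK merges this into the JSON body
    return kwargs
```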
What's Actually Standardized vs. Fragmented
┌─────────────────────────────────────────────┐
│ WELL STANDARDIZED │
│ • POST /v1/chat/completions │
│ • messages[{role, content}] shape │
│ • response.choices[0].message.content │
│ • basic streaming (SSE data: chunks) │
│ • temperature, max_tokens, top_p │
├─────────────────────────────────────────────┤
│ MOSTLY WORKS BUT QUIRKY │
│ • Function/tool calling schemas │
│ • Structured output / JSON mode │
│ • Multi-modal (image_url in content) │
│ • Stop sequences │
├─────────────────────────────────────────────┤
│ FRAGMENTED / PROVIDER-SPECIFIC │
│ • Reasoning/thinking tokens │
│ • Prompt caching │
│ • Extended context management │
│ • Web search / built-in tools │
│ • Computer use │
│ • Streaming event semantics │
│ • Usage reporting for cached tokens │
│ • Reasoning effort controls │
└─────────────────────────────────────────────┘
4. Abstraction Layer Approaches
4a. LiteLLM / Gateway Proxy
Approach: HTTP proxy server that normalizes OpenAI-format requests into provider-specific API calls. Maintains massive translation table for 100+ providers.
Architecture:
Your App (speaks OpenAI format)
→ HTTP → LiteLLM Proxy
→ translates → Provider API
← translates ← Provider Response
← HTTP ← OpenAI-format response
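The response leg of that translation can be sketched in a few lines: converting an Anthropic /v1/messages response into the OpenAI shape a client expects. This is a simplified illustration of what a gateway does internally; real proxies also handle tool calls, streaming, and error mapping.

```python
def anthropic_to_openai(resp: dict) -> dict:
    """Translate an Anthropic /v1/messages response to chat-completions shape."""
    text = "".join(b["text"] for b in resp["content"] if b["type"] == "text")
    stop_map = {"end_turn": "stop", "max_tokens": "length", "tool_use": "tool_calls"}
    return {
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": stop_map.get(resp["stop_reason"], "stop"),
        }],
        "usage": {
            "prompt_tokens": resp["usage"]["input_tokens"],
            "completion_tokens": resp["usage"]["output_tokens"],
        },
    }
```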
Strengths:
- Language-agnostic (any HTTP client works)
- Centralized cost tracking, rate limiting, audit logging
- Widely adopted: CrewAI and Giskard use it as their default
Weaknesses:
- Lowest-common-denominator problem — loses provider-specific features or gets them late
- Latency overhead: ~8ms p95 claimed, worse at scale; gradual memory leaks reported
- Operational complexity: another service to deploy and monitor
- Competitors emerging: Bifrost (54x faster p99), Portkey, OpenRouter
Best for: Multi-provider routing with cost tracking needs, Python-centric teams.
4b. Vercel AI SDK
Approach: TypeScript-first SDK with specification layer defining abstract interfaces. Each provider has a typed adapter that preserves native capabilities via providerOptions.
Architecture:
Your App (TypeScript)
→ generateText() / streamText()
→ Provider Adapter (compile-time translation)
→ Provider API (native format)
Strengths:
- Zero runtime overhead — translation happens at build time
- Provider options with type safety — providerOptions.anthropic.thinking, providerOptions.openai.reasoningEffort
- Full type safety across the stack
- React/Vue/Svelte/Angular streaming support
- AI SDK 6 introduces reusable Agent abstraction
Weaknesses:
- TypeScript-only ecosystem
- Still needs per-provider knowledge for advanced features
- Smaller community than LangChain
Philosophy: "Abstraction without erasure" — routing is valuable, hiding provider differences is harmful.
Best for: TypeScript web apps, especially with frontend streaming needs.
4c. LangChain
Approach: Full Python framework with internal canonical type system (intermediate representation). Each provider integration translates to/from LangChain's message types.
Architecture:
Your App (Python)
│ speaks: HumanMessage, AIMessage, ToolMessage
│ calls: .invoke(), .stream(), .batch(), .bind_tools()
▼
BaseChatModel (Runnable Interface)
│
├── ChatOpenAI → translates → OpenAI API
├── ChatAnthropic → translates → Anthropic API
├── ChatGoogleGenAI → translates → Gemini API
└── ...
Internal Message Types:
BaseMessage (abstract)
├── HumanMessage → OpenAI "user", Anthropic "user"
├── AIMessage → OpenAI "assistant", Anthropic "assistant"
│ ├── .tool_calls (standardized across providers)
│ ├── .usage_metadata (standardized)
│ └── .additional_kwargs (untyped provider-specific bag)
├── SystemMessage → OpenAI "system", Anthropic top-level system
├── ToolMessage → OpenAI {"role":"tool"}, Anthropic {"type":"tool_result"}
└── FunctionMessage → (deprecated)
Provider-Specific Feature Handling — Three Mechanisms:
- Standardized fields on AIMessage — tool calls, usage metadata normalized across providers
- additional_kwargs — untyped Dict[str, Any] for raw provider data (the escape hatch)
- Provider-specific class parameters — ChatAnthropic(thinking=..., betas=[...]), not portable
init_chat_model — Universal Constructor:
from langchain.chat_models import init_chat_model
model = init_chat_model("openai:gpt-4.1")
model = init_chat_model("anthropic:claude-sonnet-4-5-20250929")
Swap models in one line — but provider-incompatible params cause runtime errors, not compile-time safety.
Strengths:
- Massive ecosystem: memory, retrievers, vector stores, document loaders, 100+ integrations
- Runnable interface: composable chains with the | pipe operator, with_fallbacks, with_retry
- LangGraph: stateful agent graphs where the model is a pluggable node
- LangSmith: observability and evaluation
Weaknesses:
- Leaky abstraction by design — standardizes 60% (text in/out, basic tools), punts 40% to provider-specific code
- additional_kwargs is an untyped bag — invisible to the type system
- Deep class hierarchy — Serializable → Runnable → RunnableSerializable → BaseLanguageModel → BaseChatModel → ChatAnthropic
- Debugging through layers — when translation mangles data, you're digging through 6+ abstraction levels
- False sense of portability — a chain looks provider-agnostic but breaks on swap if you use caching, thinking, etc.
Best for: Complex multi-step agent workflows where the model is one component among many, and you want the full ecosystem (LangGraph + LangSmith + retrievers + memory).
4d. PydanticAI
Approach: Thin, well-typed Python agent framework that cleanly separates three concerns: Model (interface), Provider (auth/endpoint), and Profile (model-specific quirks). Built by the Pydantic team with FastAPI-inspired ergonomics.
Architecture:
Agent (orchestrates the loop)
│ - typed dependencies (dependency injection)
│ - typed output (Pydantic model validation)
│ - tools (auto-schema from Python functions)
│
│ speaks: ModelRequest / ModelResponse (parts-based)
▼
Model (abstract interface)
│ - request() → ModelResponse
│ - request_stream() → StreamedResponse
│
├── OpenAIChatModel + Provider + Profile
├── AnthropicModel + Provider + Profile
├── GoogleModel + Provider + Profile
└── ...
Key Design Innovation — Three-Way Separation:
| Concern | What It Does | Example |
|---|---|---|
| Model | Wire format implementation | OpenAIChatModel (speaks /chat/completions) |
| Provider | Auth, endpoint, HTTP client | AzureProvider (Azure auth with OpenAI wire format) |
| Profile | Schema transforms, model quirks | DeepSeek profile (different tool template, same OpenAI wire format) |
This elegantly solves the "DeepSeek speaks OpenAI's wire format but isn't OpenAI" problem — no separate integration package needed.
Parts-Based IR (vs. LangChain's Role-Based Messages):
ModelRequest(parts=[
SystemPromptPart(content="You are helpful."),
UserPromptPart(content="Roll me a dice."),
])
ModelResponse(parts=[
ThinkingPart(content="..."), # first-class citizen
ToolCallPart(tool_name="roll_dice", args={}),
], provider_details={"finish_reason": "STOP"})
ModelResponsePart is a discriminated union: TextPart | ToolCallPart | BuiltinToolCallPart | BuiltinToolReturnPart | ThinkingPart | FilePart
Why this is better than LangChain's approach: When Anthropic returns [thinking_block, text_block, tool_use_block] in one response, PydanticAI's parts list naturally represents that. LangChain packs it into AIMessage.content: List[Dict] and relies on you to iterate correctly. ThinkingPart is a first-class type, not a bolt-on.
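The parts idea is easy to model with plain dataclasses. An illustrative (non-PydanticAI) sketch mapping the Anthropic content blocks shown earlier onto an ordered parts list; the class names echo PydanticAI's but the implementation is a toy.

```python
from dataclasses import dataclass

@dataclass
class TextPart:
    text: str

@dataclass
class ThinkingPart:
    content: str

@dataclass
class ToolCallPart:
    tool_name: str
    args: dict

def to_parts(content: list[dict]) -> list:
    """Map Anthropic-style content blocks onto an ordered, typed parts list."""
    mapping = {
        "thinking": lambda b: ThinkingPart(b["thinking"]),
        "text": lambda b: TextPart(b["text"]),
        "tool_use": lambda b: ToolCallPart(b["name"], b["input"]),
    }
    return [mapping[b["type"]](b) for b in content if b["type"] in mapping]
```

The sibling-block ordering survives the translation, and thinking is a first-class type rather than a dict buried in a content list.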
Provider-Specific Feature Handling:
- provider_details on responses — typed, named field (not an additional_kwargs catch-all)
- Prefixed settings — openai_service_tier='priority', anthropic_thinking=... — makes non-portable params obvious
- Builtin tools — WebSearchTool, CodeExecutionTool map to provider-native implementations when available, fall back to generic otherwise
Strengths:
- Cleanest abstraction design — Model/Provider/Profile separation
- Type safety as core principle — output_type=CityInfo enforced by Pydantic validation
- Parts-based IR is closer to how models actually work
- Dependency injection built-in (testability from day one)
- TestModel and FunctionModel for testing without API calls
- Lightweight — focused scope, not a framework for everything
- v1.0 released September 2025; production-adopted
Weaknesses:
- Smaller ecosystem — no built-in retrievers, vector stores, document loaders
- Still subject to LCD problem — same feature (structured output) implemented very differently across providers (Claude post-formats → 2x latency, Gemini zeroes logits → single call)
- Provider API often richer than what any abstraction exposes
- Breaks down for larger teams relying heavily on provider-specific capabilities
Best for: Python developers who want a thin, well-typed abstraction with a clean agent loop and are comfortable assembling other components themselves.
5. Decision Framework
Quick Reference Matrix
| Dimension | LiteLLM/Gateway | Vercel AI SDK | LangChain | PydanticAI |
|---|---|---|---|---|
| Language | Any (HTTP proxy) | TypeScript | Python-first | Python |
| Abstraction cost | Network hop per request | Zero (compile-time) | In-process function call | In-process function call |
| Feature fidelity | Lowest common denominator | High (typed providerOptions) | Medium (untyped additional_kwargs) | High (typed provider_details + prefixed params) |
| Provider features | Mostly dropped | Explicit escape hatches | Class-specific params | Model/Provider/Profile separation |
| Scope | LLM calls only | LLM calls + UI streaming | Full framework (chains, memory, RAG, agents) | Agents + tools + typed output |
| Type safety | None | Full (TypeScript) | Partial | Full (Pydantic) |
| Ecosystem | 100+ provider support | Growing, frontend-centric | Massive | Focused, growing |
| Ops complexity | Proxy server to manage | SDK dependency | Framework dependency | Library dependency |
| Thinking tokens | Varies by provider support | Provider-specific metadata | additional_kwargs | First-class ThinkingPart |
When to Use What
| Situation | Recommendation |
|---|---|
| TypeScript web app | Vercel AI SDK — principled abstraction, frontend streaming |
| Centralized gateway (cost tracking, rate limiting, audit) | LiteLLM / Portkey — plan for proxy bottleneck |
| Local open-source models | OpenAI-compatible via vLLM / SGLang / Ollama |
| Complex multi-step agent workflows (memory, RAG, graph agents) | LangChain + LangGraph + LangSmith |
| Python, clean typed agents | PydanticAI — thin abstraction, Pydantic validation |
| Heavy provider-specific features (thinking, caching, computer use) | Native provider SDK + thin adapter you control |
| Single-provider, max performance | Native SDK directly |
6. The Math That Matters
The Set Intersection Problem
Let F_i = feature set of provider i.
A universal abstraction can natively support only ∩F_i (the intersection of all providers' features).
Everything in F_i \ ∩F_i (features unique to provider i) requires provider-specific escape hatches.
As providers innovate independently:
- |F_i \ ∩F_i| (provider-unique features) grows faster than |∩F_i| (shared features)
- The universal adapter gets worse over time relative to native integration
- Every abstraction layer acknowledges this with escape hatches (additional_kwargs, providerOptions, provider_details, extra_body)
The 80/20 Reality
80% case — send text, get text, simple function calling:
- Translation layers work fine
- Conceptual model is shared across all providers
- Any abstraction layer handles this well
20% case — prompt caching, thinking tokens, structured outputs, multi-modal, provider tools:
- These are the features that differentiate models and drive provider choice
- Abstraction layers either drop them (losing value) or leak them (complicating the abstraction)
- These features are increasingly where the production value lives
The Core Principle
**Abstraction layers that unify access across providers are valuable for routing.
Abstractions that erase provider differences are harmful to quality.**
The best tools (Vercel AI SDK, PydanticAI) understand this distinction and provide typed extensibility mechanisms rather than pretending differences don't exist.
Practical Implication
The real question isn't "which abstraction layer?" — it's:
- How many providers do I actually need to swap between?
- Am I using provider-specific features?
If "maybe 2" and "yes" → thin adapter you control. If "5+" and "no" → gateway/abstraction layer is fine.
Appendix: Provider API Quick Reference
Endpoint Patterns
| Provider | Base URL | Endpoint |
|---|---|---|
| OpenAI | https://api.openai.com | /v1/chat/completions |
| Anthropic | https://api.anthropic.com | /v1/messages |
| Gemini | https://generativelanguage.googleapis.com | /v1beta/models/{model}:generateContent |
| xAI | https://api.x.ai | /v1/chat/completions |
| DeepSeek | https://api.deepseek.com | /chat/completions |
| Qwen (DashScope) | https://dashscope-intl.aliyuncs.com | /compatible-mode/v1/chat/completions |
| Kimi (Moonshot) | https://platform.moonshot.ai | OpenAI-compat + Anthropic-compat |
| MiniMax | https://api.minimax.io | /v1 (OpenAI) or /anthropic (Anthropic) |
Auth Patterns
| Provider | Auth Header |
|---|---|
| OpenAI / xAI / DeepSeek / Qwen | Authorization: Bearer {key} |
| Anthropic | x-api-key: {key} + anthropic-version: 2023-06-01 |
| Gemini | x-goog-api-key: {key} or OAuth2 |
| MiniMax | Authorization: Bearer {key} (both endpoints) |
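A minimal header builder following the table above (illustrative; check each provider's docs for current header requirements and API versions):

```python
def auth_headers(provider: str, key: str) -> dict:
    """Build the auth headers each provider expects."""
    if provider == "anthropic":
        return {"x-api-key": key, "anthropic-version": "2023-06-01"}
    if provider == "gemini":
        return {"x-goog-api-key": key}
    # OpenAI, xAI, DeepSeek, Qwen, and MiniMax all use standard bearer auth.
    return {"Authorization": f"Bearer {key}"}
```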