
January 14, 2026

LLM API Abstraction Landscape

Tags: LLM APIs, abstraction, PydanticAI, infrastructure

Table of Contents

  1. The Standards War: Who Won?
  2. Wire-Level API Comparison
  3. Open-Source & Chinese Providers
  4. Abstraction Layer Approaches
     • 4a. LiteLLM / Gateway Proxy
     • 4b. Vercel AI SDK
     • 4c. LangChain
     • 4d. PydanticAI
  5. Decision Framework
  6. The Math That Matters

1. The Standards War: Who Won?

OpenAI's /chat/completions — The De Facto Standard

OpenAI's Chat Completions API became the de facto standard due to ChatGPT adoption (late 2022). Virtually every inference engine and open-source provider adopted it: vLLM, SGLang, Ollama, Together, Fireworks, DeepSeek, Qwen, Kimi, Zhipu (GLM), and xAI all expose OpenAI-compatible endpoints.

OpenAI's Own Migration Away

Ironically, OpenAI itself is transitioning away from this standard:

Generation | API | Status
Gen 1 | /v1/completions | Deprecated
Gen 2 | /v1/chat/completions | Current standard, but OpenAI is moving on
Gen 3 | /v1/responses (Responses API) | OpenAI's new direction (March 2025)
  • Assistants API deprecated August 2025, sunset August 2026
  • Open Responses Specification released January 2026 — an open-source spec based on the Responses API
  • Supported by: Ollama, vLLM, OpenRouter, Hugging Face, Vercel, Nvidia
  • Notable omissions: Anthropic, Google DeepMind

The Emerging Two-Standard Reality

Rather than convergence on one universal format, the ecosystem is converging on two dominant wire protocols:

  1. OpenAI's /chat/completions — the default for everything
  2. Anthropic's /v1/messages — emerging as a second standard, driven primarily by Claude Code compatibility

DeepSeek, MiniMax, xAI, and Ollama now support both formats. The prediction: frontier open-source labs will offer both OpenAI and Anthropic API compatibility.


2. Wire-Level API Comparison

2.1 Basic Request Structure

OpenAI (/v1/chat/completions):

{
  "model": "gpt-4.1",
  "messages": [
    { "role": "system", "content": "You are helpful." },
    { "role": "user", "content": "Hello" }
  ],
  "max_tokens": 1000
}

Anthropic (/v1/messages):

{
  "model": "claude-sonnet-4-20250514",
  "system": "You are helpful.",
  "messages": [{ "role": "user", "content": "Hello" }],
  "max_tokens": 1000
}

Gemini (/v1beta/models/gemini-2.5-flash:generateContent):

{
  "system_instruction": { "parts": [{ "text": "You are helpful." }] },
  "contents": [{ "role": "user", "parts": [{ "text": "Hello" }] }],
  "generationConfig": { "maxOutputTokens": 1000 }
}

xAI/Grok — identical to OpenAI (drop-in compatible).
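The envelope differences above are mechanical enough to bridge by hand. A minimal sketch (plain dicts, no SDKs) that rewrites an OpenAI-style request body into the Anthropic shape by hoisting the system message and defaulting the required max_tokens:

```python
def openai_to_anthropic(req: dict) -> dict:
    """Convert an OpenAI /chat/completions request body into the
    Anthropic /v1/messages shape: system messages move to a top-level
    field, and max_tokens (which Anthropic requires) is defaulted."""
    system_parts = [m["content"] for m in req["messages"] if m["role"] == "system"]
    messages = [m for m in req["messages"] if m["role"] != "system"]
    out = {
        "model": req["model"],  # note: model names themselves do not translate
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),  # Anthropic rejects requests without it
    }
    if system_parts:
        out["system"] = "\n".join(system_parts)
    return out
```

Only the envelope translates this way; model names, sampling-parameter semantics, and provider-specific fields still need their own mapping.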

2.2 Content Model Architecture

This is the first-principles structural difference that matters:

Provider | Content Model | Unit of Content | Nesting
OpenAI | messages[].content is a string OR an array of {type, ...} objects | Message | Flat
Anthropic | messages[].content is a string OR an array of content blocks (text, image, tool_use, tool_result, thinking) | Content Block | Flat but typed
Gemini | contents[].parts[] — a list of Part objects (text, inlineData, functionCall, functionResponse, fileData) | Part | Nested, with a candidates wrapper
xAI | Same as OpenAI | Message | Flat

Key insight: Anthropic's content-block model is the most compositional. A single assistant turn can interleave thinking, text, and tool calls as sibling blocks:

{
  "role": "assistant",
  "content": [
    {"type": "thinking", "thinking": "Let me reason..."},
    {"type": "text", "text": "The answer is..."},
    {"type": "tool_use", "id": "call_1", "name": "search", "input": {...}}
  ]
}

OpenAI separates tool calls into a dedicated tool_calls field on the message. Gemini embeds them as parts.

2.3 System Prompt Location

Provider | System Prompt Location
OpenAI / xAI | Inside the messages array with role: "system"
Anthropic | Top-level system field
Gemini | Top-level system_instruction field

2.4 Response Structures

Provider | Content path | Stop signal | Usage fields
OpenAI | choices[0].message.content | finish_reason | usage.prompt_tokens / completion_tokens
Anthropic | content[0].text | stop_reason | usage.input_tokens / output_tokens
Gemini | candidates[0].content.parts[0].text | finishReason | usageMetadata.promptTokenCount / candidatesTokenCount
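A small accessor keyed by provider papers over the three content paths in the table above; a sketch assuming plain dict responses:

```python
def extract_text(provider: str, resp: dict) -> str:
    """Pull the first text payload out of a provider response dict,
    following each provider's documented content path."""
    if provider == "openai":      # also xAI and most OpenAI-compatible endpoints
        return resp["choices"][0]["message"]["content"]
    if provider == "anthropic":   # first block only; real responses may interleave block types
        return resp["content"][0]["text"]
    if provider == "gemini":
        return resp["candidates"][0]["content"]["parts"][0]["text"]
    raise ValueError(f"unknown provider: {provider}")
```

Note the Anthropic caveat in the comment: indexing content[0] is only safe when the response starts with a text block, which thinking-enabled responses may not.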

2.5 Tool / Function Calling

Aspect | OpenAI | Anthropic | Gemini
Tool definition | tools[].function.parameters (JSON Schema) | tools[].input_schema (JSON Schema) | tools[].functionDeclarations[].parameters (OpenAPI subset)
Tool call in response | message.tool_calls[] (separate field) | Content block {"type": "tool_use"} (inline) | Part {"functionCall": {...}} (inline)
Tool result | {"role": "tool", "tool_call_id": "..."} | {"type": "tool_result"} inside a user message | {"role": "function", "parts": [{"functionResponse": {...}}]}
Tool choice | "auto" / "required" / {"function": {"name": "..."}} | {"type": "auto" / "any" / "tool", "name": "..."} | toolConfig.functionCallingConfig.mode: "AUTO" / "ANY" / "NONE"
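Tool-result translation is where adapters most often diverge. A sketch converting the OpenAI form into Anthropic's, where the result becomes a tool_result content block carried inside a user message:

```python
def tool_result_openai_to_anthropic(msg: dict) -> dict:
    """Translate an OpenAI tool-result message
    ({"role": "tool", "tool_call_id": ..., "content": ...})
    into Anthropic's shape: a tool_result block inside a user turn."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": msg["tool_call_id"],
            "content": msg["content"],
        }],
    }
```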

2.6 Streaming Protocols

Provider | Protocol | Structure
OpenAI | SSE | data: {"choices": [{"delta": {"content": "token"}}]} — incremental deltas
Anthropic | SSE with semantic events | message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop
Gemini | SSE | Chunks of GenerateContentResponse objects

Anthropic's streaming is the most structured — you know exactly which content block is being populated via indexed start/delta/stop events.
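Those indexed events fold naturally into per-block buffers. A sketch over already-parsed event dicts (the SSE transport itself is omitted):

```python
def assemble_blocks(events: list) -> list:
    """Fold Anthropic streaming events into ordered content-block strings.
    Every content_block_delta carries an index identifying which block it
    extends -- the property that makes this protocol so well structured."""
    blocks = {}
    for ev in events:
        if ev["type"] == "content_block_start":
            blocks[ev["index"]] = ""
        elif ev["type"] == "content_block_delta":
            blocks[ev["index"]] += ev["delta"].get("text", "")
    return [blocks[i] for i in sorted(blocks)]
```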

2.7 Provider-Specific Features (Not Abstractable)

Feature | Provider(s) | Abstraction Difficulty
Prompt caching | Anthropic (inline cache_control), Gemini (cachedContent resource) | Hard — different mechanisms
Extended thinking | Anthropic (thinking blocks), OpenAI (reasoning in Responses), Gemini (thinkingConfig) | Medium
Citations | Anthropic (inline on text blocks), Gemini (grounding metadata) | Hard
Computer use | Anthropic (native tool types), Gemini | Very hard
Built-in tools (web search, code exec) | OpenAI Responses, xAI, Anthropic, Gemini | Provider-specific
Safety ratings | Gemini only | Can't map
max_tokens required | Anthropic requires it; others default | Small, but breaks naive adapters

3. Open-Source & Chinese Providers

API Format Adoption Map

Provider | Primary API Format | Secondary Format | Notes
DeepSeek | OpenAI /chat/completions | Anthropic /messages | V3.2 uses the OpenAI SDK directly; added Anthropic format for Claude Code compatibility
Kimi / Moonshot | OpenAI /chat/completions | Anthropic /messages | Anthropic API remaps temperature: real_temp = request_temp * 0.6
Qwen / Alibaba | OpenAI (via "compatible mode") | Native DashScope SDK | dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions
MiniMax | Anthropic /messages (recommended) | OpenAI /chat/completions | Recommends Anthropic format for full feature support
GLM / Zhipu (Z.ai) | OpenAI /chat/completions | — | Standard /v1/chat/completions format
xAI / Grok | OpenAI /chat/completions | Anthropic /messages | Natively compatible with both SDKs

Self-Hosted Inference Engines

Engine | Format
vLLM | OpenAI-compatible
SGLang | OpenAI-compatible
Ollama | OpenAI-compatible + Anthropic-compatible (announced)
llama.cpp | OpenAI-compatible
TGI (Hugging Face) | OpenAI-compatible

Where "Compatible" Breaks Down

The compatibility is real for the 80% case but gets leaky fast:

Reasoning/thinking tokens — every provider does this differently even within the OpenAI-compatible format:

  • DeepSeek: reasoning_content as extra field on message
  • Kimi K2 Thinking: same reasoning_content field name (not standard OpenAI)
  • Qwen3: extra_body parameters to toggle thinking
  • OpenAI: separate reasoning object in Responses API

Tool calling chat templates — when running models locally, the inference engine needs model-specific chat templates for tool calls. DeepSeek V3.2, Qwen, Llama, and Mistral all have different internal formats papered over by the "OpenAI-compatible" endpoint.

Provider-specific extra_body extensions:

  • Qwen: extra_body={"enable_search": True}
  • DeepSeek: extra_body={"reasoning_content": "enable"}
  • MiniMax: extra_body={"reasoning_split": True}

What's Actually Standardized vs. Fragmented

┌─────────────────────────────────────────────┐
│  WELL STANDARDIZED                          │
│  • POST /v1/chat/completions                │
│  • messages[{role, content}] shape           │
│  • response.choices[0].message.content       │
│  • basic streaming (SSE data: chunks)        │
│  • temperature, max_tokens, top_p            │
├─────────────────────────────────────────────┤
│  MOSTLY WORKS BUT QUIRKY                    │
│  • Function/tool calling schemas             │
│  • Structured output / JSON mode             │
│  • Multi-modal (image_url in content)        │
│  • Stop sequences                            │
├─────────────────────────────────────────────┤
│  FRAGMENTED / PROVIDER-SPECIFIC             │
│  • Reasoning/thinking tokens                 │
│  • Prompt caching                            │
│  • Extended context management               │
│  • Web search / built-in tools               │
│  • Computer use                              │
│  • Streaming event semantics                 │
│  • Usage reporting for cached tokens         │
│  • Reasoning effort controls                 │
└─────────────────────────────────────────────┘

4. Abstraction Layer Approaches

4a. LiteLLM / Gateway Proxy

Approach: An HTTP proxy server that normalizes OpenAI-format requests into provider-specific API calls, maintaining a large translation table covering 100+ providers.

Architecture:

Your App (speaks OpenAI format)
    → HTTP → LiteLLM Proxy
        → translates → Provider API
        ← translates ← Provider Response
    ← HTTP ← OpenAI-format response

Strengths:

  • Language-agnostic (any HTTP client works)
  • Centralized cost tracking, rate limiting, audit logging
  • Widely adopted: CrewAI and Giskard use it as their default

Weaknesses:

  • Lowest-common-denominator problem — loses provider-specific features or gets them late
  • Latency overhead: ~8ms p95 claimed, worse at scale; gradual memory leaks reported
  • Operational complexity: another service to deploy and monitor
  • Competitors emerging: Bifrost (54x faster p99), Portkey, OpenRouter

Best for: Multi-provider routing with cost tracking needs, Python-centric teams.

4b. Vercel AI SDK

Approach: TypeScript-first SDK with specification layer defining abstract interfaces. Each provider has a typed adapter that preserves native capabilities via providerOptions.

Architecture:

Your App (TypeScript)
    → generateText() / streamText()
        → Provider Adapter (compile-time translation)
            → Provider API (native format)

Strengths:

  • Zero runtime overhead — translation happens at build time
  • Provider options with type safety: providerOptions.anthropic.thinking, providerOptions.openai.reasoningEffort
  • Full type safety across the stack
  • React/Vue/Svelte/Angular streaming support
  • AI SDK 6 introduces reusable Agent abstraction

Weaknesses:

  • TypeScript-only ecosystem
  • Still needs per-provider knowledge for advanced features
  • Smaller community than LangChain

Philosophy: "Abstraction without erasure" — routing is valuable, hiding provider differences is harmful.

Best for: TypeScript web apps, especially with frontend streaming needs.

4c. LangChain

Approach: Full Python framework with internal canonical type system (intermediate representation). Each provider integration translates to/from LangChain's message types.

Architecture:

Your App (Python)
    │ speaks: HumanMessage, AIMessage, ToolMessage
    │ calls: .invoke(), .stream(), .batch(), .bind_tools()
    ▼
BaseChatModel (Runnable Interface)
    │
    ├── ChatOpenAI      → translates → OpenAI API
    ├── ChatAnthropic    → translates → Anthropic API
    ├── ChatGoogleGenAI  → translates → Gemini API
    └── ...

Internal Message Types:

BaseMessage (abstract)
├── HumanMessage       → OpenAI "user", Anthropic "user"
├── AIMessage          → OpenAI "assistant", Anthropic "assistant"
│   ├── .tool_calls    (standardized across providers)
│   ├── .usage_metadata (standardized)
│   └── .additional_kwargs (untyped provider-specific bag)
├── SystemMessage      → OpenAI "system", Anthropic top-level system
├── ToolMessage        → OpenAI {"role":"tool"}, Anthropic {"type":"tool_result"}
└── FunctionMessage    → (deprecated)

Provider-Specific Feature Handling — Three Mechanisms:

  1. Standardized fields on AIMessage — tool calls, usage metadata normalized across providers
  2. additional_kwargs — untyped Dict[str, Any] for raw provider data (the escape hatch)
  3. Provider-specific class parameters: ChatAnthropic(thinking=..., betas=[...]) — not portable

init_chat_model — Universal Constructor:

from langchain.chat_models import init_chat_model
model = init_chat_model("openai:gpt-4.1")
model = init_chat_model("anthropic:claude-sonnet-4-5-20250929")

Swap models in one line — but provider-incompatible params cause runtime errors, not compile-time safety.

Strengths:

  • Massive ecosystem: memory, retrievers, vector stores, document loaders, 100+ integrations
  • Runnable interface: composable chains with | pipe operator, with_fallbacks, with_retry
  • LangGraph: stateful agent graphs where model is a pluggable node
  • LangSmith: observability and evaluation

Weaknesses:

  • Leaky abstraction by design — standardizes 60% (text in/out, basic tools), punts 40% to provider-specific code
  • additional_kwargs is an untyped bag — invisible to type system
  • Deep class hierarchy: Serializable → Runnable → RunnableSerializable → BaseLanguageModel → BaseChatModel → ChatAnthropic
  • Debugging through layers — when translation mangles data, you're digging through 6+ abstraction levels
  • False sense of portability — chain looks provider-agnostic but breaks when swapping if you use caching, thinking, etc.

Best for: Complex multi-step agent workflows where the model is one component among many, and you want the full ecosystem (LangGraph + LangSmith + retrievers + memory).

4d. PydanticAI

Approach: Thin, well-typed Python agent framework that cleanly separates three concerns: Model (interface), Provider (auth/endpoint), and Profile (model-specific quirks). Built by the Pydantic team with FastAPI-inspired ergonomics.

Architecture:

Agent (orchestrates the loop)
│ - typed dependencies (dependency injection)
│ - typed output (Pydantic model validation)
│ - tools (auto-schema from Python functions)
│
│ speaks: ModelRequest / ModelResponse (parts-based)
▼
Model (abstract interface)
│ - request() → ModelResponse
│ - request_stream() → StreamedResponse
│
├── OpenAIChatModel + Provider + Profile
├── AnthropicModel  + Provider + Profile
├── GoogleModel      + Provider + Profile
└── ...

Key Design Innovation — Three-Way Separation:

ConcernWhat It DoesExample
ModelWire format implementationOpenAIChatModel (speaks /chat/completions)
ProviderAuth, endpoint, HTTP clientAzureProvider (Azure auth with OpenAI wire format)
ProfileSchema transforms, model quirksDeepSeek profile (different tool template, same OpenAI wire format)

This elegantly solves the "DeepSeek speaks OpenAI's wire format but isn't OpenAI" problem — no separate integration package needed.
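A conceptual sketch of the three-way split using plain dataclasses (illustrative field names, not PydanticAI's actual classes):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Provider:
    """Auth and endpoint; says nothing about the wire format."""
    base_url: str
    api_key: str

@dataclass
class Profile:
    """Per-model quirks layered on top of a shared wire format."""
    supports_thinking: bool = False
    tool_schema_transform: Optional[str] = None

@dataclass
class OpenAIChatModel:
    """One wire format (/chat/completions), many providers and profiles."""
    model_name: str
    provider: Provider
    profile: Profile = field(default_factory=Profile)

# DeepSeek: OpenAI wire format, its own endpoint, its own quirks --
# no separate integration package needed.
deepseek = OpenAIChatModel(
    model_name="deepseek-chat",
    provider=Provider("https://api.deepseek.com", "sk-..."),
    profile=Profile(supports_thinking=True),
)
```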

Parts-Based IR (vs. LangChain's Role-Based Messages):

ModelRequest(parts=[
    SystemPromptPart(content="You are helpful."),
    UserPromptPart(content="Roll me a dice."),
])

ModelResponse(parts=[
    ThinkingPart(content="..."),       # first-class citizen
    ToolCallPart(tool_name="roll_dice", args={}),
], provider_details={"finish_reason": "STOP"})

ModelResponsePart is a discriminated union: TextPart | ToolCallPart | BuiltinToolCallPart | BuiltinToolReturnPart | ThinkingPart | FilePart

Why this is better than LangChain's approach: When Anthropic returns [thinking_block, text_block, tool_use_block] in one response, PydanticAI's parts list naturally represents that. LangChain packs it into AIMessage.content: List[Dict] and relies on you to iterate correctly. ThinkingPart is a first-class type, not a bolt-on.

Provider-Specific Feature Handling:

  1. provider_details on responses — typed, named field (not additional_kwargs catch-all)
  2. Prefixed settings: openai_service_tier='priority', anthropic_thinking=... — makes non-portable params obvious
  3. Builtin tools: WebSearchTool, CodeExecutionTool map to provider-native implementations when available and fall back to a generic implementation otherwise

Strengths:

  • Cleanest abstraction design — Model/Provider/Profile separation
  • Type safety as a core principle: output_type=CityInfo enforced by Pydantic validation
  • Parts-based IR is closer to how models actually work
  • Dependency injection built-in (testability from day one)
  • TestModel and FunctionModel for testing without API calls
  • Lightweight — focused scope, not a framework for everything
  • v1.0 released September 2025; production-adopted

Weaknesses:

  • Smaller ecosystem — no built-in retrievers, vector stores, document loaders
  • Still subject to LCD problem — same feature (structured output) implemented very differently across providers (Claude post-formats → 2x latency, Gemini zeroes logits → single call)
  • Provider API often richer than what any abstraction exposes
  • Breaks down for larger teams relying heavily on provider-specific capabilities

Best for: Python developers who want a thin, well-typed abstraction with a clean agent loop and are comfortable assembling other components themselves.


5. Decision Framework

Quick Reference Matrix

Dimension | LiteLLM/Gateway | Vercel AI SDK | LangChain | PydanticAI
Language | Any (HTTP proxy) | TypeScript | Python-first | Python
Abstraction cost | Network hop per request | Zero (compile-time) | In-process function call | In-process function call
Feature fidelity | Lowest common denominator | High (typed providerOptions) | Medium (untyped additional_kwargs) | High (typed provider_details + prefixed params)
Provider features | Mostly dropped | Explicit escape hatches | Class-specific params | Model/Provider/Profile separation
Scope | LLM calls only | LLM calls + UI streaming | Full framework (chains, memory, RAG, agents) | Agents + tools + typed output
Type safety | None | Full (TypeScript) | Partial | Full (Pydantic)
Ecosystem | 100+ providers | Growing, frontend-centric | Massive | Focused, growing
Ops complexity | Proxy server to manage | SDK dependency | Framework dependency | Library dependency
Thinking tokens | Varies by provider support | Provider-specific metadata | additional_kwargs | First-class ThinkingPart

When to Use What

Situation | Recommendation
TypeScript web app | Vercel AI SDK — principled abstraction, frontend streaming
Centralized gateway (cost tracking, rate limiting, audit) | LiteLLM / Portkey — plan for the proxy bottleneck
Local open-source models | OpenAI-compatible via vLLM / SGLang / Ollama
Complex multi-step agent workflows (memory, RAG, graph agents) | LangChain + LangGraph + LangSmith
Python, clean typed agents | PydanticAI — thin abstraction, Pydantic validation
Heavy provider-specific features (thinking, caching, computer use) | Native provider SDK + a thin adapter you control
Single provider, maximum performance | Native SDK directly

6. The Math That Matters

The Set Intersection Problem

Let F_i = feature set of provider i.

A universal abstraction can natively support only ∩F_i (the intersection of all providers' features).

Everything in F_i \ ∩F_i (features unique to provider i) requires provider-specific escape hatches.

As providers innovate independently:

  • |F_i \ ∩F_i| (provider-unique features) grows faster than |∩F_i| (shared features)
  • The universal adapter gets worse over time relative to native integration
  • Every abstraction layer acknowledges this with escape hatches (additional_kwargs, providerOptions, provider_details, extra_body)
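The shrinking intersection is easy to see with toy feature sets (labels are illustrative, not exhaustive):

```python
# Toy feature sets per provider; real sets are larger and drift apart faster
openai_features = {"text", "tools", "json_mode", "reasoning", "builtin_web_search"}
anthropic_features = {"text", "tools", "json_mode", "thinking_blocks", "prompt_caching", "computer_use"}
gemini_features = {"text", "tools", "json_mode", "thinking_config", "cached_content", "safety_ratings"}

# What a universal abstraction covers natively: the intersection
shared = openai_features & anthropic_features & gemini_features

# Everything else needs a provider-specific escape hatch
needs_escape_hatch = (openai_features | anthropic_features | gemini_features) - shared
```

Here three of eleven distinct features are shared; each new provider-unique feature grows the escape-hatch set while leaving the intersection untouched.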

The 80/20 Reality

80% case — send text, get text, simple function calling:

  • Translation layers work fine
  • Conceptual model is shared across all providers
  • Any abstraction layer handles this well

20% case — prompt caching, thinking tokens, structured outputs, multi-modal, provider tools:

  • These are the features that differentiate models and drive provider choice
  • Abstraction layers either drop them (losing value) or leak them (complicating the abstraction)
  • These features are increasingly where the production value lives

The Core Principle

Abstraction layers are valuable for routing across providers. Abstractions that erase provider differences are harmful for quality.

The best tools (Vercel AI SDK, PydanticAI) understand this distinction and provide typed extensibility mechanisms rather than pretending differences don't exist.

Practical Implication

The real question isn't "which abstraction layer?" — it's:

  1. How many providers do I actually need to swap between?
  2. Am I using provider-specific features?

If "maybe 2" and "yes" → thin adapter you control. If "5+" and "no" → gateway/abstraction layer is fine.
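For the "thin adapter you control" path, the adapter can be as small as the surface your app actually uses. A minimal sketch with typing.Protocol, where a fake implementation stands in for a real provider wrapper:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface the application depends on."""
    def complete(self, system: str, user: str) -> str: ...

class FakeModel:
    """Test double; a real adapter would wrap one provider's SDK."""
    def complete(self, system: str, user: str) -> str:
        return f"[{system}] echo: {user}"

def answer(model: ChatModel, question: str) -> str:
    # Application code stays portable across anything satisfying the Protocol
    return model.complete("You are helpful.", question)
```

With two providers, two such wrapper classes are cheaper to own than a gateway, and provider-specific features stay fully accessible inside each wrapper.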


Appendix: Provider API Quick Reference

Endpoint Patterns

Provider | Endpoint
OpenAI | https://api.openai.com/v1/chat/completions
Anthropic | https://api.anthropic.com/v1/messages
Gemini | https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
xAI | https://api.x.ai/v1/chat/completions
DeepSeek | https://api.deepseek.com/chat/completions
Qwen (DashScope) | https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions
Kimi (Moonshot) | https://platform.moonshot.ai (OpenAI-compat + Anthropic-compat)
MiniMax | https://api.minimax.io/v1 (OpenAI) or /anthropic (Anthropic)

Auth Patterns

Provider | Auth Header
OpenAI / xAI / DeepSeek / Qwen | Authorization: Bearer {key}
Anthropic | x-api-key: {key} + anthropic-version: 2023-06-01
Gemini | x-goog-api-key: {key} or OAuth2
MiniMax | Authorization: Bearer {key} (both endpoints)