
December 23, 2025

Five Things AI (LLMs) Excel At: A Bayesian Statistician's Perspective

LLMs, Bayesian, first principles, AI engineering

Given my training in Bayesian statistics, my years of experience in data science and ML, and the past three years working in the AI engineering trenches, I tend to look at LLMs from first principles, using analogies from Bayesian statistics and information theory. After all, a transformer is simply predicting the next token statistically.

  1. Memorization (compressed, not rote). During pretraining, LLMs ingest a huge number of common patterns: code structure, DevOps workflows, standard practices, and so on. That's why they are so good at producing scaffolding, templates, and boilerplate. This part is a no-brainer and not much of a surprise. The nuance is that it's lossy compression, not a lookup table: the model learns something like a minimum-description-length encoding of its training corpus. That's why it can interpolate between patterns it has seen (generating plausible code in a style it never exactly encountered), but also why it hallucinates: it's reconstructing from a compressed representation, and sometimes the decompression produces artifacts that look right but aren't.
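The compression framing can be made concrete with a toy analogy (an analogy only, not how LLMs are trained): a general-purpose compressor assigns a short description to patterned text and an incompressible one to noise, which is the minimum-description-length intuition in miniature.

```python
import os
import zlib

# Toy illustration of the "learning as compression" framing: a corpus full
# of repeated patterns compresses far better than random bytes, because a
# good model of the patterns yields a short description length.
patterned = b"def add(a, b):\n    return a + b\n" * 100
random_ish = os.urandom(len(patterned))

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of original size."""
    return len(zlib.compress(data)) / len(data)

print(f"patterned corpus: {ratio(patterned):.3f}")   # small: patterns compress well
print(f"random bytes:     {ratio(random_ish):.3f}")  # near 1.0: nothing to learn
```

The gap between the two ratios is the "structure" a model can exploit; the residual it cannot compress is what it would have to memorize rote.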
  2. Formatting, translation and transformation of information. This is literally in its name! If you look at the origin of the transformer, it was designed to translate between natural languages. The attention mechanism learns alignments between different representational spaces; that's literally what cross-attention was designed for in the original seq2seq context. What's remarkable is that the same mechanism generalizes: code↔natural language, formal↔informal, JSON↔prose are all just different "languages" with learnable alignment maps. The mental model is that the model learns a shared latent space where meaning lives format-free, then projects into whatever output format you request. A "presentation layer" on top of content, if you will.
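A hedged analogy for the "presentation layer" idea (the dict and field names here are illustrative): the same underlying content can be projected into different surface formats on demand, with nothing about the content itself changing.

```python
import json

# The same "latent" content, projected into two surface formats.
content = {"name": "retry_count", "type": "int", "default": 3}

as_json = json.dumps(content, indent=2)
as_prose = (f"{content['name']} is a parameter of type {content['type']} "
            f"that defaults to {content['default']}.")

print(as_json)
print(as_prose)
```

In an LLM the projection is learned rather than hand-written, but the structure is the same: one representation of meaning, many interchangeable renderings.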
  3. Moving information up and down the scale. LLMs have many layers. As information moves up through them, it gets distilled and rises in abstraction: early layers capture syntactic/local patterns, middle layers capture semantic relationships, and later layers handle task-specific reasoning. And thanks to the residual-stream architecture, information can easily move back down the scale and pick up the details when so instructed. Think of the residual connections as a "bus": any layer can read from or write to it. If you want a physics analogy, it's like a wavelet decomposition, with different layers capturing different frequency components of the information. That's why LLMs are so good at both summarization (reading from the abstract layers) and digging up details (tapping lower layers through the residual path).
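The "bus" behavior comes from the additive form of residual connections. A minimal sketch (a drastic simplification of the real architecture, with hand-picked update vectors): each layer contributes a delta to a shared stream rather than replacing it, so information written early stays readable by every later layer.

```python
# Toy residual stream: a shared vector that layers add updates into.
def residual_layer(stream, update):
    # The layer writes a delta onto the "bus"; it never overwrites the
    # stream, so earlier contributions survive.
    return [s + u for s, u in zip(stream, update)]

stream = [1.0, 0.0, 0.0]                          # an early layer writes a detail
stream = residual_layer(stream, [0.0, 2.0, 0.0])  # a middle layer adds semantics
stream = residual_layer(stream, [0.0, 0.0, 3.0])  # a late layer adds task info

print(stream)  # [1.0, 2.0, 3.0]: the early detail (1.0) is still on the bus
```

If layers replaced the stream instead of adding to it, the low-level detail would be gone by the final layer; addition is what makes "reaching back down" cheap.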
  4. Brainstorming. This is a really fun hypothesis for me. The mental model is that an LLM embodies a gigantic distribution with a very rugged landscape: its likelihood function has many valleys and peaks, almost like a fractal structure, where zooming in reveals more richness. Brainstorming, then, is about guiding or prompting the LLM to the right area at the right scale, where it will give you the best proposal at that location.

    In Bayesian terms: the pretrained model is the prior p(θ), your prompt is the observed data D, and the response is a sample from the posterior p(θ|D) ∝ p(D|θ)·p(θ). Temperature controls how peaked vs. diffuse that posterior is. Brainstorming works because at moderate temperature, you're sampling across multiple modes of this posterior rather than just hill-climbing to the MAP estimate. The fractal structure means the distribution has useful structure at every scale of specificity — which is why you can brainstorm at "give me business ideas" or "give me variable names for this function" and get useful outputs at both levels.
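The temperature claim is just standard softmax scaling, which is easy to verify directly. A minimal sketch (generic softmax sampling, not any particular model's implementation), with three logits standing in for three nearby modes of the posterior:

```python
import math

def softmax(logits, temperature):
    """Softmax with temperature scaling; numerically stabilized."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [3.0, 2.5, 1.0]  # three modes; the first is the MAP mode
for t in (0.2, 1.0, 2.0):
    probs = softmax(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all the mass sits on the MAP mode (hill-climbing); at higher temperature the mass spreads across modes, which is the regime where brainstorming surfaces genuinely different proposals.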

    One important bound to keep in mind: the quality of brainstorming is limited by the support of the prior. LLMs can't propose things truly outside their training distribution — they can only recombine and interpolate. Excellent brainstorming partners for well-explored spaces, weaker where genuine novelty is needed.

  5. Automating routine decision making. This is a recent one, and it is also where the hype and the opportunities around agents lie. With extensive RL training, the top models have become very good at:

    • Decomposing a task into steps: my guess is the top labs have a lot of synthetic data for this. Though there's also an argument that chain-of-thought reasoning emerges somewhat naturally from next-token prediction on text that contains reasoning chains (math textbooks, StackOverflow answers, etc.).
    • Using the right tools: this comes from post-training, but it works so well because tool use is structurally similar to function calling in code, of which the model has seen billions of examples. The prior is already strong before post-training even begins.
    • Following the rules: this is most likely a combination of inference-time simulation (for models with extended thinking) and RLHF/RLAIF baked into the reward signal. Constitutional AI specifically trains the model to internalize constraint-following as a behavioral pattern; it's not just simulation, it's learned.
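The three capabilities above can be sketched as one loop. This is a hypothetical toy, not any real agent framework: the tool names, the fixed plan, and the allow-list rule are all illustrative.

```python
# Toy agent loop: a decomposed plan, a tool registry, and a rule check.
TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "write_file": lambda path: f"wrote {path}",
}

ALLOWED = {"search", "write_file"}  # the "rules": an allow-list of tools

def run(plan):
    results = []
    for tool_name, arg in plan:
        # Rule-following: refuse any step that names a tool outside the list.
        if tool_name not in ALLOWED:
            results.append(f"refused {tool_name}")
            continue
        # Tool use: dispatch the step to the matching tool.
        results.append(TOOLS[tool_name](arg))
    return results

# Decomposition: the task expressed as (tool, argument) steps.
plan = [("search", "retry backoff"), ("write_file", "notes.md"), ("rm_rf", "/")]
print(run(plan))  # the final step is refused by the rule check
```

In a real agent the plan and the tool choices come from the model at each step rather than a fixed list, but the loop structure (decompose, dispatch, check rules) is the same.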