
How Embeddings Actually Work: From Arbitrary IDs to the Geometry of Meaning


This post is based on How Embeddings Actually Work by Claudius Papirus — Episode 5 of the "How AI Actually Works" course.


Take the word king. Subtract man. Add woman. You get queen.

That's not a metaphor. That's real arithmetic, done on real numbers, learned by a model that read billions of words and figured out the relationship entirely on its own. No one programmed it. No one wrote a definition. It just… emerged.

This is the story of embeddings — the hidden layer where words stop being text and start becoming something a machine can actually think with.


The Problem: Tokens Are Meaningless

In Episode 2 of this series, we learned that text gets broken into tokens. But tokens are just IDs — arbitrary numbers. The token for cat might be 9,674. That number tells you nothing about cats.

So between the raw token and the intelligent response you get back, something has to happen. The meaningless ID has to become a set of numbers that actually captures what the word means.

That bridge is an embedding.


Why the Obvious Approaches Fail

Option 1: Sequential numbering

Give every word a number. The = 1, cat = 2, sat = 3.

Problem: the model infers that cat is somehow "between" the and sat. Arbitrary numbering creates false relationships that have nothing to do with meaning.

Option 2: One-hot encoding

Give each word its own dimension. Cat = [1, 0, 0, 0...], dog = [0, 1, 0, 0...].

With a vocabulary of 50,000 words, you get a 50,000-dimensional space where every word is exactly as far from every other word. Cat is as distant from kitten as it is from economics. You've removed the false structure — but you've removed all structure.
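A tiny sketch makes the problem concrete. With one-hot vectors, every distinct pair of words sits at exactly the same Euclidean distance (√2), so the geometry carries no information at all:

```python
import math

# Toy vocabulary: one-hot encoding gives each word its own dimension.
vocab = ["cat", "kitten", "economics", "dog"]

def one_hot(word):
    """1 in the word's own slot, 0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Every distinct pair of words is exactly the same distance apart.
d1 = euclidean(one_hot("cat"), one_hot("kitten"))
d2 = euclidean(one_hot("cat"), one_hot("economics"))
print(d1, d2)  # both are sqrt(2) ≈ 1.414
```

Cat is exactly as far from kitten as it is from economics, no matter how large the vocabulary grows.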

What you actually need is a smaller space — a few hundred dimensions — where the geometry reflects meaning. Words that mean similar things should end up close together. Words that don't should be far apart.

But you can't design that by hand. Too many words, too many relationships, too many shades of meaning.

So you don't design it. You let the data build it.


Word2Vec: Teaching a Model to Learn Meaning

In 2013, Tomas Mikolov and his team at Google published a paper that changed how the field thinks about language. The key insight came from a 1957 observation by linguist J.R. Firth:

"You shall know a word by the company it keeps."

Words that appear in similar contexts tend to mean similar things. Dog and cat both show up near pet, fed, walks. Dog and inflation don't.

Mikolov's team made that idea trainable. They built a small neural network with a deceptively simple task: given a word, predict the words that surround it. No definitions. No dictionaries. No human labels. Just billions of words of raw text and one prediction task.

During training, each word gets mapped to a vector — a list of around 300 numbers. Those numbers get adjusted millions of times as the model learns to predict context better.

When training is done, something remarkable emerges:

  • Happy, joyful, cheerful: neighbours in the space
  • Run, sprint, jog: neighbours in the space
  • Words organised into a geography of meaning, with no one telling the model what anything meant

They called it Word2Vec.
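The distributional idea behind Word2Vec can be sketched without any neural network at all. The snippet below is not Word2Vec itself — it just counts which words appear near which, then compares those context profiles with cosine similarity. The tiny corpus is invented for illustration:

```python
from collections import Counter
import math

# A toy corpus. Word2Vec trains on billions of words; this is just enough
# to show the "company it keeps" idea.
corpus = ("the dog chased the cat the cat slept "
          "we fed the dog we fed the cat "
          "inflation rose inflation fell").split()

def context_counts(target, window=2):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == target:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    counts[corpus[j]] += 1
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

dog, cat, inflation = map(context_counts, ["dog", "cat", "inflation"])
print(cosine(dog, cat))        # high: dog and cat keep similar company
print(cosine(dog, inflation))  # low: almost no shared contexts
```

Mikolov's network learns compressed vectors rather than raw counts, but the signal it learns from is exactly this: shared context.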

The Analogy Trick

Researchers then found something even more striking. The vectors didn't just cluster by similarity — they encoded relationships.

The direction from man to woman in the vector space is roughly the same direction as king to queen, and uncle to aunt. Gender is a consistent direction in the space.

So is tense: walked → walking matches swam → swimming.

So is geography: Paris − France + Italy lands near Rome.

Directions in a space nobody designed, encoding relationships nobody labelled — discovered purely from predicting which words appear near which.
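The arithmetic itself is just vector addition and subtraction plus a nearest-neighbour lookup. Here is a minimal sketch with hand-picked 3-dimensional toy vectors (real Word2Vec vectors are learned and have around 300 dimensions; the "royalty" and "gender" axes below are illustrative, not something the model labels):

```python
# Hand-crafted toy vectors. Dimension 0 loosely encodes "royalty",
# dimension 1 "gender", dimension 2 is filler.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "apple": [0.0, 0.5, 0.8],
}

def nearest(v, exclude):
    """Word whose vector is closest (squared Euclidean) to v."""
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(vectors[w], v))
    return min((w for w in vectors if w not in exclude), key=dist)

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

In real embedding spaces the match is approximate rather than exact, but queen is reliably the nearest neighbour of the result.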


The Polysemy Problem

Word2Vec had a flaw that seems obvious once you see it: each word gets exactly one vector, no matter what.

"I deposited money at the bank." "I sat by the river bank."

In Word2Vec, bank is the same embedding in both sentences — a blurry average of every context it's ever appeared in. Not quite right for the financial meaning, not quite right for the river meaning.

This is the polysemy problem. One word, multiple meanings, one vector. Light in light blue vs light bulb vs light as a feather all collapse to the same point.

Static embeddings couldn't capture the fact that meaning shifts with context.
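Numerically, a static embedding behaves like an average over every context the word has appeared in. A 2-dimensional toy sketch (the sense vectors are invented for illustration):

```python
# Two "senses" of bank, as toy 2-d vectors.
finance_sense = [0.9, 0.1]   # bank near money, deposit, loan
river_sense   = [0.1, 0.9]   # bank near river, shore, water

# A static model stores one vector — roughly a blend of both senses.
static_bank = [(a + b) / 2 for a, b in zip(finance_sense, river_sense)]
print(static_bank)  # [0.5, 0.5]: not quite right for either meaning
```

The blended vector sits halfway between the two senses — equally wrong for both sentences.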


The 2018 Breakthrough: Contextual Embeddings

In 2018, two papers cracked it open:

  • ELMo from AI2
  • BERT from Google

Both arrived at the same answer from different angles: instead of one fixed vector per word, the embedding changes based on context. Bank next to river gets pulled in one direction. Bank next to investment gets pulled in another. Same word, different numbers.

This is exactly what happens inside the transformers that power modern AI. When a model processes your input:

  1. Each token starts with an initial embedding — looked up from a learned table
  2. The attention mechanism examines every other token in the sequence
  3. Layer by layer, the vectors get repositioned based on context
  4. By the time a word has passed through dozens of layers, it's been reshaped into something specific to this exact sentence, this exact position, this exact meaning
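The repositioning step can be sketched in a few lines. This is not a real transformer layer — no learned weights, no multiple heads — just a single softmax-weighted mixing step over hand-picked toy vectors, enough to show how the same starting vector for bank ends up in different places depending on its neighbours:

```python
import numpy as np

# Toy 2-d starting embeddings (illustrative, not learned).
emb = {
    "bank":  np.array([0.5, 0.5]),
    "river": np.array([0.1, 0.9]),
    "money": np.array([0.9, 0.1]),
}

def contextualize(sentence, word):
    """One attention-like step: each token becomes a softmax-weighted
    average of every token in the sentence."""
    X = np.stack([emb[w] for w in sentence])   # (n_tokens, dim)
    scores = X @ X.T                           # dot-product similarity
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    out = weights @ X                          # reposition each token
    return out[sentence.index(word)]

v_river = contextualize(["bank", "river"], "bank")
v_money = contextualize(["bank", "money"], "bank")
print(v_river, v_money)  # same word, different coordinates
```

Next to river, bank's vector is pulled toward the river direction; next to money, toward the financial one. Stack dozens of such layers with learned weights and you have the contextual embeddings inside a transformer.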

The dimensions have scaled to match, too — from Word2Vec's 300 numbers to thousands in today's models. More dimensions, finer distinctions, more room for nuance.

A concrete example: in the sentence "The cat was tired because it hadn't slept" — by the final layer, the embedding for it has drifted toward cat. The model resolved the reference without being told to. The same word it in a different sentence would point somewhere entirely different.

The embedding isn't a label anymore. It's a coordinate that moves with meaning.


Why This Matters: It's Running Everything

This geometry isn't just elegant — it's behind most of what you use today.

Semantic search: When you search and find results that match your meaning rather than your exact words, embeddings are why. The search engine converts your question into a vector and compares it to document vectors. "How to fix a leaky faucet" matches "plumbing repair guide" — zero shared words, but their embeddings are close.
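Semantic search reduces to a nearest-neighbour lookup in embedding space. A minimal sketch, with hand-picked 3-dimensional vectors standing in for the ones a real embedding model would produce:

```python
import math

# Toy document vectors. In a real system these come from an embedding
# model with thousands of dimensions, not from shared words.
docs = {
    "plumbing repair guide":   [0.9, 0.1, 0.2],
    "macroeconomics textbook": [0.1, 0.9, 0.1],
    "cat care basics":         [0.2, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for "how to fix a leaky faucet"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # plumbing repair guide — zero words shared with the query
```

The query and the winning document share no vocabulary; they match because their vectors point in nearly the same direction.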

RAG (Retrieval-Augmented Generation): When an AI retrieves relevant documents before answering your question, it's doing vector similarity search in embedding space.

Recommendations: When a system finds content you didn't search for but somehow knew you'd want, it's comparing your preference vector to content vectors.

Translation: When translation works between languages that structure sentences completely differently, the same principle applies — meaning has a shape, and that shape can be compared across languages.


A Closing Thought

Somewhere between the words you type and the response you get back, there's a space — high-dimensional, invisible, learned purely from patterns — where happy sits near joyful, and king − man + woman points toward queen.

Not because anyone decided it should. Because across billions of words, that's where the patterns put them.

Meaning, it turns out, isn't something you define. It's something that emerges when you pay enough attention to the company words keep.


Episode 5 of the How AI Actually Works course by Claudius Papirus. Previous episodes cover LLMs, tokens, training, and context windows.
