How Word2Vec Learns Representations: A Step-by-Step Breakdown
Introduction
Word2Vec is a foundational algorithm for creating dense vector representations of words—often called embeddings. These embeddings capture semantic relationships, such as analogies like king – man + woman ≈ queen. But how exactly does Word2Vec learn these representations? Recent research reveals that its learning process is surprisingly structured: it discovers concepts one at a time, progressively building a multi-dimensional embedding space. In this guide, we’ll walk through the steps that explain what Word2Vec learns and why its embeddings exhibit linear algebraic properties. By the end, you’ll have an intuitive grasp of the algorithm’s internal mechanics.

What You Need
- Basic knowledge of word embeddings – Familiarity with vector representations of words.
- Understanding of supervised vs self-supervised learning – Word2Vec uses a contrastive objective.
- Elementary linear algebra – Concepts like vectors, subspaces, and eigenvalues.
- Familiarity with gradient descent – The algorithm trains via backpropagation.
- Open mind for learning dynamics – We’ll explore how training evolves over time.
Step 1: Frame Word2Vec as a Minimal Language Model
Word2Vec is essentially a two-layer neural network with a linear hidden layer. It takes a center word (skip-gram) or a context window (CBOW) and predicts the corresponding target words. The model uses a self-supervised contrastive loss, such as negative sampling, to maximize the probability of observed word pairs while minimizing that of random pairs. The input-to-hidden weight matrix becomes the table of word embeddings. Understanding this framing is crucial: it sets the stage for the learning dynamics we’ll examine.
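To make this framing concrete, here is a minimal NumPy sketch of one skip-gram-with-negative-sampling update. The vocabulary size, embedding dimension, learning rate, and the specific word indices are placeholders chosen purely for illustration, not a faithful reproduction of any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 50, 8                          # toy sizes for illustration
W_in = rng.normal(0, 1e-3, (vocab_size, dim))    # "input" embeddings (the ones we keep)
W_out = rng.normal(0, 1e-3, (vocab_size, dim))   # "output" (context) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.05):
    """One SGD step of skip-gram with negative sampling for a single word pair."""
    v, u = W_in[center], W_out[context]
    u_neg = W_out[negatives]                     # (k, dim) negative context vectors

    pos_score = sigmoid(v @ u)                   # should be pushed toward 1
    neg_score = sigmoid(u_neg @ v)               # should be pushed toward 0
    loss = -np.log(pos_score) - np.sum(np.log(1 - neg_score))

    # Gradients of the contrastive loss with respect to each vector involved.
    grad_v = (pos_score - 1) * u + (neg_score[:, None] * u_neg).sum(axis=0)
    W_out[context] -= lr * (pos_score - 1) * v
    W_out[negatives] -= lr * neg_score[:, None] * v
    W_in[center] -= lr * grad_v
    return loss

# Toy usage: center word 3, observed context word 7, two random negatives.
print(sgns_step(3, 7, negatives=np.array([1, 5])))
```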
Step 2: Initialize Embeddings Near the Origin
In standard training, the embedding vectors are initialized randomly with very small values—essentially starting at the origin (zero vector). This has a profound effect: initially, the embeddings are nearly zero-dimensional. From there, they expand into distinct directions as training progresses. The initialization scale determines how the learning process unfolds; a tiny scale forces the model to gradually add new dimensions.
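A quick way to see what “starting at the origin” means is to inspect the singular values of a freshly initialized embedding matrix. The sizes and scale below are illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8   # toy sizes

# Tiny random initialization: the embedding matrix starts essentially at the
# origin, so all of its singular values are close to zero ("zero-dimensional").
scale = 1e-3
W = rng.normal(0, scale, (vocab_size, dim))

print(np.linalg.svd(W, compute_uv=False))   # all singular values are close to zero
```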
Step 3: Observe Discrete, Rank-Incrementing Learning Steps
As training begins, the embedding vectors collectively learn one “concept” (an orthogonal linear subspace) at a time. Empirically, the loss decreases in a stepwise fashion, and each step corresponds to the addition of a new dimension to the embedding space. For a concrete example, imagine the embeddings first capture a broad syntactic distinction (such as noun versus verb), then later refine to capture gender or tense. This sequential discovery is reminiscent of how humans learn: first grasping the coarse structure, then filling in the details.
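The rank-incrementing behavior is easiest to see in a toy setting. The sketch below factorizes a synthetic rank-3 matrix, standing in for the real co-occurrence statistics, by gradient descent from a tiny initialization, and prints how many singular values of the factor have “switched on.” The planted signal strengths, learning rate, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rank-3 target with well-separated signal strengths (9 > 4 > 1),
# standing in for the word co-occurrence statistics.
n, dim = 30, 10
Q = np.linalg.qr(rng.normal(size=(n, 3)))[0]
M = Q @ np.diag([9.0, 4.0, 1.0]) @ Q.T

# Factorize M ≈ A @ B.T by gradient descent from a tiny initialization.
A = rng.normal(0, 1e-4, (n, dim))
B = rng.normal(0, 1e-4, (n, dim))
lr = 0.02

for step in range(5001):
    R = A @ B.T - M                                    # residual
    A, B = A - lr * R @ B, B - lr * R.T @ A
    if step % 500 == 0:
        svals = np.linalg.svd(A, compute_uv=False)
        rank = int((svals > 0.1).sum())                # dimensions that have "switched on"
        print(f"step {step:5d}  loss {np.sum(R**2):9.3f}  effective rank {rank}")
```

Run it and you should see the effective rank climb 0 → 1 → 2 → 3, one jump per loss drop, with plateaus in between.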
Step 4: Relate Learning to Matrix Factorization
Under mild approximations, the learning problem reduces to unweighted least-squares matrix factorization: the pointwise mutual information (PMI) matrix of word co-occurrences is factorized into the product of two low-rank matrices whose rows are the final word and context embeddings. This connection explains why embeddings capture linear relationships: matrix factorization naturally discovers latent factors, much like PCA. The gradient-flow dynamics can be solved exactly, showing that the final representation consists of the top principal components of the PMI matrix.
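As a rough illustration of the object being factorized, the sketch below builds a PMI matrix from a toy corpus. The corpus, window size, and the convention of zeroing undefined (zero-count) entries are assumptions made for brevity.

```python
import numpy as np

# Toy corpus and window size, chosen purely for illustration.
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Count word-context co-occurrences within the window.
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[idx[w], idx[corpus[j]]] += 1

# Pointwise mutual information: log p(w, c) / (p(w) p(c)), zero-count cells set to 0.
total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.where(counts > 0, np.log(p_wc / (p_w * p_c)), 0.0)

# Negative sampling with k negatives corresponds to the *shifted* PMI, PMI - log(k).
print(np.round(pmi, 2))
```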

Step 5: Solve Gradient Flow Dynamics – Equivalent to PCA
The continuous-time version of gradient descent (gradient flow) applied to this matrix factorization converges to a solution that corresponds to PCA of the co-occurrence matrix. In other words, the learned embeddings are given by a truncated SVD of the shifted PMI matrix. This ties Word2Vec directly to classical dimensionality-reduction techniques. The order in which dimensions are learned follows the eigenvalues in descending order: the strongest signal (largest eigenvalue) appears first, weaker signals later.
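Under that reading, the end state of training can be written down directly: take a truncated SVD of the shifted PMI matrix and split the singular values between the word and context factors. In the sketch below a random symmetric matrix stands in for the shifted PMI (PMI minus the log of the number of negative samples), purely so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shifted PMI matrix from Step 4; a random symmetric matrix is
# used here only so the snippet is self-contained.
V, d = 20, 5
A = rng.normal(size=(V, V))
shifted_pmi = (A + A.T) / 2

# Truncated SVD: keep the top-d singular directions and split the singular
# values symmetrically between word and context embeddings.
U, S, Vt = np.linalg.svd(shifted_pmi)
word_emb = U[:, :d] * np.sqrt(S[:d])
context_emb = Vt[:d].T * np.sqrt(S[:d])

# The product of the two factors is the best rank-d approximation of the target.
print(np.linalg.norm(word_emb @ context_emb.T - shifted_pmi))
```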
Step 6: Understand Linear Subspaces and Analogies
Once we have the PCA solution, the geometry of the embeddings becomes clear. Each eigenvector represents an interpretable concept; for example, the top eigenvectors often encode gender, verb tense, or semantic categories. Because PCA is linear, these directions align with simple addition and subtraction operations. The classic analogy “man : woman :: king : queen” works because the vector difference between man and woman approximates an eigenvector direction which, when added to king, lands near queen. This linear representation hypothesis is not accidental: it emerges from the matrix factorization formulation.
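In code, the analogy test is just vector arithmetic plus a nearest-neighbor search under cosine similarity. The helper below is a hypothetical sketch (the function name, arguments, and the choice to exclude the query words are illustrative), and it assumes you already have a trained embedding matrix and a word-to-index mapping.

```python
import numpy as np

def analogy(embeddings, word_to_idx, a, b, c, topn=1):
    """Return the words closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit rows for cosine similarity
    query = E[word_to_idx[b]] - E[word_to_idx[a]] + E[word_to_idx[c]]
    query /= np.linalg.norm(query)
    sims = E @ query
    for w in (a, b, c):                          # exclude the query words themselves
        sims[word_to_idx[w]] = -np.inf
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return [idx_to_word[i] for i in np.argsort(-sims)[:topn]]
```

With well-trained embeddings and a suitable vocabulary, analogy(E, vocab, "man", "king", "woman") should ideally return ["queen"].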
Step 7: Validate with Experiments and Visualizations
You can reproduce these dynamics by training a small Word2Vec model from scratch. Visualize the embedding matrix’s singular values during training—they will increase one at a time. Plot the loss curve and observe the plateaus. Use t-SNE or PCA to project the learned embeddings and see the emergence of linear structures. These experiments confirm the theory: Word2Vec does not learn arbitrarily; it systematically builds a hierarchical semantic space.
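A minimal version of this experiment, reusing the toy factorization from Step 3 and assuming matplotlib is available, might look like the sketch below; with a real Word2Vec run you would log the same quantities from the actual embedding matrix and training loss.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Same toy factorization as in Step 3, now logging full singular-value
# trajectories and the loss so they can be plotted.
n, dim = 30, 6
Q = np.linalg.qr(rng.normal(size=(n, 3)))[0]
target = Q @ np.diag([9.0, 4.0, 1.0]) @ Q.T

W = rng.normal(0, 1e-4, (n, dim))
C = rng.normal(0, 1e-4, (n, dim))
lr = 0.02

sv_history, losses = [], []
for step in range(5000):
    R = W @ C.T - target
    W, C = W - lr * R @ C, C - lr * R.T @ W
    if step % 50 == 0:
        sv_history.append(np.linalg.svd(W, compute_uv=False))
        losses.append(float(np.sum(R ** 2)))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.plot(sv_history)                  # each singular value switches on one at a time
ax1.set_xlabel("checkpoint"); ax1.set_ylabel("singular values")
ax2.plot(losses)                      # the loss drops between plateaus
ax2.set_xlabel("checkpoint"); ax2.set_ylabel("squared-error loss")
plt.tight_layout(); plt.show()
```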
Tips for Deepening Your Understanding
- Experiment with initial scales – Try larger initializations to see how the stepwise behavior blurs.
- Compare with alternatives – GloVe (also matrix factorization) or modern transformers; note the shared principles.
- Study the linear representation hypothesis – This concept extends to large language models; understanding Word2Vec is a useful precursor.
- Read the original paper – The referenced paper (which this guide summarizes) provides rigorous proofs.
- Iterate on a small corpus – Use a tiny text dataset to watch each learning step unfold in real time.