
Chapter 58: Non-negative Matrix Factorization (NMF)

Prerequisites: Non-negative Matrices (Ch17) · Singular Value Decomposition (Ch11) · Matrix Analysis (Ch14)

Chapter Outline: Motivation for NMF (The Non-negative Essence of Data) → Definition of Non-negative Matrix Factorization → The Concept of Non-negative Rank \(rank_{nn}(V)\) → Differences from SVD: From Global Orthogonality to Local Non-negativity → Core Trait: Part-based Representation → Algorithmic Implementation: Multiplicative Update Rules → Objective Functions: Frobenius Distance vs. KL Divergence → Applications: Text Mining (Topic Modeling), Facial Recognition (Feature Detection), Hyperspectral Unmixing, and Recommender Systems

Extension: NMF is the product of a compromise between linear algebra and "interpretability": by trading orthogonality for non-negativity, it reveals how complex data is composed of simple, physically meaningful components through "addition" rather than "cancellation." It is a cornerstone of modern unsupervised learning.

In traditional SVD, basis vectors may contain negative values, which often lack intuitive meaning when the data consists of image pixels or word counts. Non-negative Matrix Factorization (NMF) imposes a strict non-negativity constraint, forcing the model to synthesize data only through "addition." This add-only logic, somewhat unexpectedly, yields a part-based representation, allowing meaningful "components" to be extracted automatically from raw data. This chapter explores a decomposition technique that balances computational difficulty with explanatory power.


58.1 Definition and Non-negative Rank

Definition 58.1 (NMF)

Given a non-negative matrix \(V \in \mathbb{R}^{m \times n}\) (\(V \ge 0\)), find non-negative matrices \(W \in \mathbb{R}^{m \times k}\) and \(H \in \mathbb{R}^{k \times n}\) such that

\[ V \approx WH, \quad W \ge 0, \ H \ge 0. \]

- \(W\) is usually called the Basis Matrix (features).
- \(H\) is usually called the Coefficient Matrix (weights).
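To make the shapes concrete, here is a minimal NumPy sketch (the small matrix \(V\) below is made-up data, and the exact factorization is built by hand rather than computed):

```python
import numpy as np

# A made-up 3x4 non-negative data matrix V (rows: features, columns: samples).
V = np.array([[1.0, 2.0, 0.0, 1.0],
              [2.0, 4.0, 0.0, 2.0],
              [0.0, 0.0, 3.0, 3.0]])

# A hand-built exact factorization V = W H with inner dimension k = 2.
W = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])            # m x k basis matrix (features)
H = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])  # k x n coefficient matrix (weights)

assert np.allclose(V, W @ H)          # exact non-negative reconstruction
assert (W >= 0).all() and (H >= 0).all()
```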

Definition 58.2 (Non-negative Rank)

The smallest dimension \(k\) for which \(V = WH\) holds exactly with \(W, H \ge 0\) is the non-negative rank of \(V\), denoted \(rank_{nn}(V)\). Property: \(rank(V) \le rank_{nn}(V) \le \min(m, n)\), where the upper bound follows from the trivial factorizations \(V = I_m V = V I_n\). Computing the non-negative rank is NP-hard.


58.2 Core Trait: Part-based Representation

Explanation: Why extract "parts"?

Because the entries of \(W\) and \(H\) are non-negative, a data point (column) of \(V\) can only be built as a non-negative combination of the basis vectors: nothing is ever subtracted. To reconstruct complex patterns under this restriction, the algorithm tends to make the basis vectors in \(W\) sparse, each representing a distinct part of an object (such as eyes or a nose in a face). In SVD, by contrast, basis vectors can cancel each other out (positive against negative), producing overlapping global patterns that often lack physical meaning.
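The sign structure is easy to observe numerically. The sketch below (made-up random data, NumPy only) shows that even for entirely non-negative data, the SVD basis generally mixes signs, so its components must partly cancel during reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 8))                 # non-negative made-up data

# Columns of U form an orthonormal SVD basis; beyond the leading
# component they generally contain negative entries.
U, s, Vt = np.linalg.svd(V, full_matrices=False)
print((U < 0).any())                   # typically True: cancellation is needed
```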


58.3 Algorithm: Multiplicative Update

Algorithm 58.1 (Lee-Seung Multiplicative Update)

For the Frobenius norm objective \(\|V - WH\|_F^2\), the update rules are

\[ H_{aj} \leftarrow H_{aj} \frac{(W^T V)_{aj}}{(W^T W H)_{aj}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}. \]

Trait: as long as the initial values are positive, the updates automatically preserve non-negativity, and the objective function is non-increasing at each step.
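A compact NumPy sketch of these update rules follows. The random initialization, the fixed iteration count, and the small `eps` guard against division by zero are implementation choices for illustration, not part of the rule itself:

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for the Frobenius objective ||V - WH||_F."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps       # strictly positive initialization
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

V = np.random.default_rng(1).random((5, 7))    # made-up non-negative data
W, H = nmf_multiplicative(V, k=3)
print(np.linalg.norm(V - W @ H))       # Frobenius error after the updates
```

Because each update multiplies a non-negative entry by a non-negative ratio, non-negativity is preserved automatically, with no projection step required.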


Exercises

1. [Basics] Find the non-negative rank of \(V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\).

Solution

Analysis:

  1. The standard rank is 2, so \(rank_{nn}(V) \ge rank(V) = 2\).
  2. Intuitively, a single non-negative column vector cannot synthesize two orthogonal coordinate directions: a rank-1 factorization \(V = wh\) would require \(w_1 h_2 = w_2 h_1 = 0\) while \(w_1 h_1 = w_2 h_2 = 1\), a contradiction.

Conclusion: The non-negative rank is 2. For the identity matrix, the non-negative rank equals the standard rank.

2. [Comparison] Briefly state the main difference between NMF and SVD regarding basis vectors.

Solution

Comparison:

  - SVD: Basis vectors are mutually orthogonal and contain both positive and negative values. Ideal for variance explanation but often lacks interpretability (e.g., negative pixel values).
  - NMF: Basis vectors are not necessarily orthogonal but must be non-negative. This produces "local" features with strong physical interpretability.

3. [Calculation] Find the non-negative rank and an NMF factorization for \(V = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\).

Solution

Construction:

  1. The standard rank is 1.
  2. \(V = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \end{pmatrix}\).
  3. Both factors are non-negative.

Conclusion: The non-negative rank is 1.

4. [Existence] Does every non-negative matrix have an NMF where \(k = rank(V)\)?

Solution

Conclusion: Not necessarily. This is one of the deeper questions in NMF theory. There exist non-negative matrices (classical examples include small \(4 \times 4\) and \(5 \times 5\) constructions) with standard rank 3 for which every exact non-negative factorization requires \(k \ge 4\). This "rank gap" reflects the algebraic rigidity introduced by the non-negativity constraint.

5. [Objective] Name another common loss function for NMF besides the Frobenius norm.

Solution

Conclusion: Kullback-Leibler (KL) Divergence.

\[ D(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right) \]

This performs better when handling data with Poisson noise, such as count data or word frequencies in text.
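A small NumPy sketch of this objective follows (the epsilon guard and the treatment of zero entries are implementation choices; the convention \(x \log x \to 0\) as \(x \to 0\) is used):

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH), summed elementwise."""
    # Entries with V_ij = 0 contribute only (WH)_ij, since x log x -> 0.
    term = np.where(V > 0, V * np.log((V + eps) / (WH + eps)) - V, 0.0)
    return np.sum(term + WH)
```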

6. [Uniqueness] Is the NMF decomposition unique?

Solution

Conclusion: Usually not. Reasoning: If \(V = WH\), then for any diagonal matrix \(D\) with strictly positive diagonal, \(V = (WD)(D^{-1}H)\) is another valid factorization. More generally, any invertible \(Q\) with \(WQ \ge 0\) and \(Q^{-1}H \ge 0\) yields yet another one. To obtain a unique solution, additional constraints such as sparsity are usually imposed.
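The diagonal rescaling ambiguity is easy to verify numerically (a sketch with made-up random factors):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((4, 2))
H = rng.random((2, 5))
D = np.diag([2.0, 0.5])                # any positive diagonal matrix

W2, H2 = W @ D, np.linalg.inv(D) @ H   # rescaled factors, still non-negative
assert np.allclose(W @ H, W2 @ H2)     # identical product: the NMF is not unique
```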

7. [Text Mining] In text mining, what does the basis matrix \(W\) represent?

Solution

Explanation: In a term-document matrix, each column of \(W\) represents a Topic. Words with large values in a column are the core keywords for that topic. The coefficient matrix \(H\) represents the distribution weights of each document over these topics.
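A minimal topic-extraction sketch, assuming scikit-learn is available (the four-document corpus is made up, and a real application would use far more text). Note that scikit-learn's vectorizer produces a document-term matrix, so the roles are transposed relative to the term-document convention above: here `H` holds the topics and `W` the document weights:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "dogs and cats are friendly pets",
        "stock markets fell sharply today",
        "investors sold stocks and bonds"]

X = CountVectorizer().fit_transform(docs)   # documents x terms
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)                  # document-topic weights
H = model.components_                       # topic-term weights: large entries
                                            # in a row are that topic's keywords
```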

8. [Sparsity] Why do NMF results tend to be sparse?

Solution

Algebraic Intuition: Since terms can only add and never cancel, a zero entry \(V_{ij} = 0\) forces every product \(W_{ia}H_{aj}\) in the sum \((WH)_{ij}\) to vanish. Reconstructing a sparse data matrix therefore forces many zeros into \(W\) and \(H\). The non-negativity constraint pushes solutions toward the boundary of the non-negative orthant (the coordinate hyperplanes), naturally inducing sparsity.

9. [Complexity] Why is NMF a non-convex optimization problem?

Solution

Reasoning: While the objective function is convex in \(W\) (given \(H\)) and convex in \(H\) (given \(W\)), it is jointly non-convex in \((W, H)\). This means algorithms may get stuck in local minima, and the choice of initialization is critical.
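This is straightforward to observe: running the same factorization from different random starts can end at different objective values. A sketch assuming scikit-learn (`reconstruction_err_` is its reported final Frobenius error):

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.random.default_rng(0).random((20, 30))   # made-up non-negative data

# Same data, same k, different random initializations: the alternating
# optimization may settle in different local minima.
errors = [NMF(n_components=5, init="random", random_state=s,
              max_iter=500).fit(V).reconstruction_err_ for s in range(3)]
print(errors)                                   # typically not all equal
```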

10. [Application] Briefly describe NMF in a recommender system (e.g., movie ratings).

Solution
  1. \(V\) is the user-movie rating matrix.
  2. \(W\) represents user "latent preferences" (e.g., liking for sci-fi, action, romance).
  3. \(H\) represents movie "latent attributes" (e.g., how much a movie belongs to those genres).
  4. Predicted ratings are the inner products of these non-negative vectors. Non-negativity ensures that preferences and attributes accumulate, which is intuitive.
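A toy illustration of this pipeline (assumptions: scikit-learn is used, the rating matrix is made up, and unrated entries are naively filled with zeros; production systems instead mask unknown entries in the loss):

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up 4 users x 5 movies rating matrix; 0 marks "not rated" (naive fill).
V = np.array([[5, 4, 0, 1, 0],
              [4, 0, 0, 1, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)        # users x latent preferences
H = model.components_             # latent attributes x movies

# Predicted rating of user 0 for movie 2: inner product of non-negative vectors.
print(W[0] @ H[:, 2])
```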

Chapter Summary

Non-negative matrix factorization represents a revolution of "interpretability" in linear algebra:

  1. Reward of Constraints: It shows that by imposing a seemingly restrictive non-negativity constraint, we gain structured decompositions with real-world physical meaning, beyond pure mathematical optimization.
  2. Part and Whole: NMF realizes a genuinely "part-based" decomposition, breaking complex signals into atomic components and establishing a new paradigm for feature extraction.
  3. Computational Challenge: The NP-hardness of the non-negative rank and the joint non-convexity of the objective show that "structural interpretation" comes at a cost, driving the evolution of alternating and randomized optimization algorithms.