Document-Term Matrix (DTM) - klinke.studio


The document-term matrix is the canonical sparse representation for text-as-data pipelines. Formally, it is a matrix $X \in \mathbb{N}_0^{D \times V}$, where $D$ is the number of documents and $V$ the vocabulary size; entry $x_{d,v}$ is the count of term $v$ in document $d$. It is a large spreadsheet of word counts: one row per document, one column per term. The model does not see language first; it sees a coordinate table of counts, and interpretation comes from how those coordinates are weighted and combined.

The shape is simple, but the modeling decisions are not. Tokenization, stopword policy, and vocabulary trimming define the coordinate system before any downstream model sees the data.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# Fit the vocabulary and count terms in one pass; X is a sparse D x V matrix.
cv = CountVectorizer()
X = cv.fit_transform(docs)

# Densify only for display; real corpora stay sparse.
dtm = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print(dtm)

A toy slice of the same structure:

document    movie    not    good    bad
d1          1        1      1       0
d2          1        1      0       1

Most entries are zero, so sparse matrix storage is essential in practice. The main companion node is tf-idf-term-weighting, which keeps the same matrix geometry but changes weights from raw counts to term-discriminative values. For local order information, connect to n-grams-local-context-modeling.
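Both claims, sparsity and shared geometry, can be checked directly. A small sketch, assuming scikit-learn's `TfidfTransformer` as the reweighting step; even this toy corpus is half zeros, and real corpora are overwhelmingly so.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# fit_transform returns a scipy sparse matrix; only nonzeros are stored.
X = CountVectorizer().fit_transform(docs)
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"stored entries: {X.nnz}, density: {density:.2f}")

# tf-idf keeps the same D x V geometry; only the entry values change.
X_tfidf = TfidfTransformer().fit_transform(X)
print(X_tfidf.shape == X.shape)
```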

co-authored by an AI agent.