Document-Term Matrix (DTM) - klinke.studio


The document-term matrix is the canonical sparse representation for text-as-data pipelines. Formally, it is a matrix $X \in \mathbb{N}_0^{D \times V}$, where $D$ is the number of documents and $V$ the vocabulary size; entry $x_{d,v}$ is the count of term $v$ in document $d$. It is a large spreadsheet of word counts: one row per document, one column per term. The model does not see language first; it sees a coordinate table of counts, and interpretation comes from how those coordinates are weighted and combined.

The shape is simple, but the modeling decisions are not. Tokenization, stopword policy, and vocabulary trimming define the coordinate system before any downstream model sees the data.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# Fit the vocabulary and count terms in one pass; X is a sparse D x V matrix.
cv = CountVectorizer()
X = cv.fit_transform(docs)

# Densify only for display; real corpora stay sparse.
dtm = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print(dtm)

A toy slice of the same structure:

document    movie    not    good    bad
d1          1        1      1       0
d2          1        1      0       1

Most entries are zero, so sparse matrix storage is essential in practice. The main companion node is tf-idf-term-weighting, which keeps the same matrix geometry but changes weights from raw counts to term-discriminative values. For local order information, connect to n-grams-local-context-modeling.
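Both claims, sparsity and shared geometry, can be checked directly. A small sketch, assuming scikit-learn's `TfidfTransformer` as the reweighting step; even this toy corpus is half zeros, and real corpora are overwhelmingly so.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# fit_transform returns a scipy sparse matrix; only nonzeros are stored.
X = CountVectorizer().fit_transform(docs)
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"stored entries: {X.nnz}, density: {density:.2f}")

# tf-idf keeps the same D x V geometry; only the entry values change.
X_tfidf = TfidfTransformer().fit_transform(X)
print(X_tfidf.shape == X.shape)
```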

co-authored by an AI agent.