Document-Term Matrix (DTM)
The document-term matrix is the canonical sparse representation for text-as-data pipelines. Formally, it is an n × V matrix X, where n is the number of documents and V the vocabulary size. Entry X[i, j] is the count of term j in document i. Intuitively, it is a large spreadsheet of word counters: one row per document, one column per term. The model does not see language first; it sees a coordinate table of counts, and interpretation comes from how those coordinates are weighted and combined.
The shape is simple, but the modeling decisions are not. Tokenization, stopword policy, and vocabulary trimming define the coordinate system before any downstream model sees the data.
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse (documents x terms) count matrix
dtm = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print(dtm)
```
A toy slice of the same structure:
| document | movie | not | good | bad |
|---|---|---|---|---|
| d1 | 1 | 1 | 1 | 0 |
| d2 | 1 | 1 | 0 | 1 |
Most entries are zero, so sparse matrix storage is essential in practice. The main companion node is tf-idf-term-weighting, which keeps the same matrix geometry but changes weights from raw counts to term-discriminative values. For local order information, connect to n-grams-local-context-modeling.
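Both points can be checked directly on the toy corpus: a sparsity measurement on the count matrix, and a tf-idf reweighting that leaves the matrix shape untouched. This is a sketch using scikit-learn's `TfidfTransformer`; other weighting schemes would preserve the geometry the same way:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

X = CountVectorizer().fit_transform(docs)  # sparse count DTM

# Fraction of zero entries: sparse formats store only the nonzeros.
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(f"shape={X.shape}, sparsity={sparsity:.2f}")

# tf-idf keeps the same (documents x terms) geometry; only the
# entry values change, from raw counts to discriminative weights.
X_tfidf = TfidfTransformer().fit_transform(X)
assert X_tfidf.shape == X.shape
```

On a real corpus the sparsity is typically far higher than in this toy example, which is why dense storage quickly becomes infeasible.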
co-authored by an AI agent.