TF-IDF Term Weighting
TF-IDF keeps the same document-feature geometry as a document-term-matrix but rescales each term by corpus rarity: globally common tokens are turned down, while terms that are frequent in one document yet rare across the corpus are turned up. In effect, TF-IDF asks whether a term is frequent in this document but infrequent almost everywhere else.
The standard form:

tfidf(t, d) = tf(t, d) · idf(t),  where idf(t) = log(N / df(t))

with N documents in the corpus and df(t) documents containing term t.
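To make the formula concrete, here is a minimal hand computation over a toy corpus. It uses the raw log(N / df(t)) form above, without the smoothing or normalization that library implementations typically add:

```python
import math

# Toy corpus as tokenized documents, matching the formula above.
docs = [
    ["the", "movie", "was", "not", "good"],
    ["the", "movie", "was", "not", "bad"],
    ["good", "acting", "but", "slow", "plot"],
]
N = len(docs)  # number of documents in the corpus

def idf(term):
    # df: number of documents that contain the term
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tfidf(term, doc):
    # raw term count in this document, scaled by corpus rarity
    return doc.count(term) * idf(term)

# "the" occurs in 2 of 3 documents, "acting" in only 1,
# so "acting" earns the larger weight.
print(round(idf("the"), 3))     # 0.405  (log(3/2))
print(round(idf("acting"), 3))  # 1.099  (log(3/1))
```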
from sklearn.feature_extraction.text import TfidfVectorizer
docs = [
"the movie was not good",
"the movie was not bad",
"good acting but slow plot",
]
# min_df=1 keeps every term; max_df=0.75 drops terms appearing in more than
# 75% of documents; ngram_range=(1, 2) adds bigrams such as "not good".
tv = TfidfVectorizer(min_df=1, max_df=0.75, ngram_range=(1, 2))
X_tfidf = tv.fit_transform(docs)
Preprocessing still shapes the resulting feature space. Stopword and document-frequency decisions are not cosmetic: removing a token changes the vocabulary, which alters both the tf and idf components of every remaining term.
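As an illustration of that point (the 0.5 threshold here is a hypothetical choice, not a recommendation), tightening max_df silently removes high-frequency tokens, including the negation "not" that is the only polarity difference between the first two example documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# No cap: every token, including "the" and "not", enters the vocabulary.
full = TfidfVectorizer().fit(docs)

# max_df=0.5 drops any term present in more than half the documents;
# here that removes "not", erasing the distinction between
# "not good" and "not bad".
capped = TfidfVectorizer(max_df=0.5).fit(docs)

print("not" in full.vocabulary_)    # True
print("not" in capped.vocabulary_)  # False
```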
In classification workflows, TF-IDF often provides a strong linear baseline before moving to dense representations in supervised-text-classification-workflow and word-embeddings-semantic-analysis.
co-authored by an AI agent.