TF-IDF Term Weighting - klinke.studio

TF-IDF Term Weighting

TF-IDF keeps the same document-feature geometry as a document-term-matrix but rescales each term by corpus rarity: globally common tokens are turned down, while terms that are frequent in a document yet rare across the corpus are turned up. In effect, TF-IDF asks whether a term is frequent in this document but infrequent in almost all other documents.

The standard form:

$$\mathrm{tfidf}(t,d)=\mathrm{tf}(t,d)\cdot \log\left(\frac{N}{n_t}\right)$$

with $N$ documents in the corpus and $n_t$ documents containing term $t$.
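Before reaching for a library, the formula can be computed by hand. A minimal sketch on a tiny pre-tokenized corpus (the corpus here is an illustrative assumption):

```python
import math

# Tiny pre-tokenized corpus (illustrative, not from a real dataset)
docs = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["good", "acting"],
]

def tfidf(term, doc, docs):
    N = len(docs)                          # documents in the corpus
    tf = doc.count(term)                   # raw term frequency in this document
    n_t = sum(term in d for d in docs)     # documents containing the term
    return tf * math.log(N / n_t)

# "the" appears in 2 of 3 documents, "acting" in only 1 of 3
print(round(tfidf("the", docs[0], docs), 3))     # 0.405
print(round(tfidf("acting", docs[2], docs), 3))  # 1.099
```

The rarer term receives the larger weight even though both have the same raw frequency, which is exactly the rescaling described above.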

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# Unigrams and bigrams; drop terms appearing in more than 75% of documents
tv = TfidfVectorizer(min_df=1, max_df=0.75, ngram_range=(1, 2))
X_tfidf = tv.fit_transform(docs)
```
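The fitted vectorizer can be inspected directly. One caveat: scikit-learn's defaults deviate from the textbook formula, since `smooth_idf=True` computes $\log\left(\frac{1+N}{1+n_t}\right)+1$ and `norm="l2"` unit-normalizes every row. A sketch, restating the corpus so it runs standalone:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

tv = TfidfVectorizer(min_df=1, max_df=0.75, ngram_range=(1, 2))
X_tfidf = tv.fit_transform(docs)

print(X_tfidf.shape)               # (3, vocabulary size)
print(sorted(tv.vocabulary_)[:5])  # first few unigrams/bigrams
print(tv.idf_[:5])                 # learned per-term idf weights

# Rows are L2-normalized by default, so every document vector has unit length
row = X_tfidf[0].toarray().ravel()
print(round(float((row ** 2).sum()), 3))  # 1.0
```

The row normalization matters downstream: cosine similarity between documents reduces to a plain dot product on these vectors.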

Preprocessing still controls the result surface. Stopword decisions are not cosmetic because they alter both the $\mathrm{tf}$ and $\mathrm{idf}$ components.
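One way to see this is to fit the same corpus with and without a stopword list. The choice of scikit-learn's built-in English list here is an assumption for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
]

# Same corpus, with and without scikit-learn's built-in English stopword list
with_stops = TfidfVectorizer().fit(docs)
no_stops = TfidfVectorizer(stop_words="english").fit(docs)

# The stopword run has a smaller vocabulary, and every surviving term's
# L2-normalized tf weight shifts because the removed tokens no longer
# share the per-document mass
print(len(with_stops.vocabulary_), len(no_stops.vocabulary_))
```

Note that the built-in list drops "not", which erases the "not good" vs. "good" distinction this corpus hinges on; that is precisely why stopword decisions are not cosmetic.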

In classification workflows, TF-IDF often provides a strong linear baseline before moving to the dense representations covered in supervised-text-classification-workflow and word-embeddings-semantic-analysis.
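Such a baseline is a few lines with a pipeline. The documents and labels below are hypothetical placeholders, not a real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels (0 = negative, 1 = positive)
docs = [
    "the movie was not good",
    "the movie was not bad",
    "good acting but slow plot",
    "truly great film",
]
labels = [0, 1, 1, 1]

# TF-IDF features feeding a linear classifier: the usual strong baseline
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(docs, labels)
print(clf.predict(["not good at all"]))
```

With a real dataset one would cross-validate and tune `ngram_range`, `min_df`, and the regularization strength before concluding anything from this baseline.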

co-authored by an AI agent.