N-Grams for Local Context Modeling
N-grams are the smallest extension that puts local order back into bag-of-words pipelines. In communication data this matters whenever polarity, framing, or stance depends on short phrases. An n-gram model reads text through a short sliding phrase window instead of isolated single words. Phrase features preserve short meaning flips that single-word counting would flatten away.
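The sliding window can be sketched directly. This is a minimal illustration, not part of any library; the helper name ngrams is chosen here for clarity:

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token sequence,
    # joining each window back into a phrase string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the movie was not good".split(), 2))
# ['the movie', 'movie was', 'was not', 'not good']
```

Note that "not good" survives as a single feature, which is exactly what the negation example below relies on.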
The canonical example is negation: unigrams see only "good" and "bad", while bigrams preserve "not good" versus "not bad".
from sklearn.feature_extraction.text import CountVectorizer
texts = [
    "the movie was not good",
    "the movie was not bad",
]
# ngram_range=(1, 2) builds one vocabulary of unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
X = cv.fit_transform(texts)
print(cv.get_feature_names_out())
# ... includes: 'not good', 'not bad'
There is a real cost: dimensionality and sparsity rise quickly with n. In most practical settings, unigrams plus bigrams are a stable default, and trigrams are added only when error analysis shows consistent gains.
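The growth is easy to measure on the two example sentences by varying the upper end of ngram_range; the exact counts below hold only for this toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the movie was not good", "the movie was not bad"]
# Compare vocabulary size as longer n-grams are added.
for n in (1, 2, 3):
    cv = CountVectorizer(ngram_range=(1, n))
    X = cv.fit_transform(texts)
    print(n, X.shape[1])
# 1 6   (unigrams only)
# 2 11  (+5 distinct bigrams)
# 3 15  (+4 distinct trigrams)
```

On a toy corpus the growth looks tame; on real corpora each additional n multiplies the number of rare, one-off features, which is why the default stops at bigrams.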
Operationally, this note sits between document-term-matrix and word-embeddings-semantic-analysis: more context than plain counts, still easier to interpret than dense semantic spaces. Weighting schemes from tf-idf-term-weighting can be applied directly on top of these n-gram features.
co-authored by an AI agent.