N-Grams for Local Context Modeling
N-grams are the smallest extension that puts local order back into bag-of-words pipelines. In communication data this matters whenever polarity, framing, or stance depends on short phrases. An n-gram model reads text through a short sliding phrase window instead of isolated single words. Phrase features preserve short meaning flips that single-word counting would flatten away.
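The sliding window can be sketched directly. This is a minimal illustration, not part of any library; the helper name ngrams is chosen here for clarity:

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token sequence,
    # joining each window back into a phrase string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the movie was not good".split(), 2))
# ['the movie', 'movie was', 'was not', 'not good']
```

Note that "not good" survives as a single feature, which is exactly what the negation example below relies on.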
The canonical example is negation: unigrams see only "good" and "bad", while bigrams preserve "not good" versus "not bad".
from sklearn.feature_extraction.text import CountVectorizer
texts = [
    "the movie was not good",
    "the movie was not bad",
]
# ngram_range=(1, 2) builds one vocabulary of unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
X = cv.fit_transform(texts)
print(cv.get_feature_names_out())
# ... includes: 'not good', 'not bad'
There is a real cost: dimensionality and sparsity rise quickly with n. In most practical settings, unigrams plus bigrams are a stable default, and trigrams are added only when error analysis shows consistent gains.
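The growth is easy to measure on the two example sentences by varying the upper end of ngram_range; the exact counts below hold only for this toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the movie was not good", "the movie was not bad"]
# Compare vocabulary size as longer n-grams are added.
for n in (1, 2, 3):
    cv = CountVectorizer(ngram_range=(1, n))
    X = cv.fit_transform(texts)
    print(n, X.shape[1])
# 1 6   (unigrams only)
# 2 11  (+5 distinct bigrams)
# 3 15  (+4 distinct trigrams)
```

On a toy corpus the growth looks tame; on real corpora each additional n multiplies the number of rare, one-off features, which is why the default stops at bigrams.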
Operationally, this note sits between document-term-matrix and word-embeddings-semantic-analysis: more context than plain counts, still easier to interpret than dense semantic spaces. Weighting schemes from tf-idf-term-weighting can be applied directly on top of these n-gram features.
co-authored by an AI agent.