
How To Count Word Frequency

📖 This guide was prepared by the ToolPazar team. All our tools are free and ad-free.

Tokenization: the first hard choice

How you cut text into words determines every count downstream. Naive whitespace split:
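A minimal Python sketch (the sample sentence is invented for illustration):

```python
text = "The quick fox ran. The fox, startled, ran again."
tokens = text.split()  # splits on runs of whitespace only
print(tokens)
# ['The', 'quick', 'fox', 'ran.', 'The', 'fox,', 'startled,', 'ran', 'again.']
```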

Case folding

Punctuation is attached. Better: split on non-word characters, then lowercase:
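A sketch of that, with one reasonable character class:

```python
import re

text = "Don't panic. The fox ran; the fox, startled, ran again."
# Split on runs of anything that is not a word character or apostrophe,
# after lowercasing; drop the empty strings the split leaves at the edges.
tokens = [t for t in re.split(r"[^\w']+", text.lower()) if t]
print(tokens)
# ["don't", 'panic', 'the', 'fox', 'ran', 'the', 'fox', 'startled', 'ran', 'again']
```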

This keeps contractions (“don’t”) but strips commas and periods. Add hyphens to the class if you want “state-of-the-art” as one token.

Stop words

The most frequent words in English text (“the, of, and, a, to, in, is, you, that, it, he, was, for, on, are”) are rarely interesting. Standard stop-word lists strip them so the remaining counts reflect content.

Customize the list for your domain. SEO stop-word lists usually keep more terms than research-corpus lists.
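A sketch of the filter, reusing the words above as a stand-in list (real lists run to a few hundred entries):

```python
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "you",
              "that", "it", "he", "was", "for", "on", "are"}

tokens = ["the", "fox", "ran", "to", "the", "den"]  # illustrative
content = [t for t in tokens if t not in STOP_WORDS]
print(content)  # ['fox', 'ran', 'den']
```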

Stemming vs lemmatization

Both collapse word variants to a single form:
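For illustration, one common pairing is NLTK's Porter stemmer and WordNet lemmatizer (assumes NLTK is installed and its wordnet data has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # needs nltk.download('wordnet') once

stem = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

print(stem("running"), stem("ponies"))             # run poni  <- stems need not be real words
print(lemma("running", pos="v"), lemma("ponies"))  # run pony  <- lemmas are dictionary forms
```

Stemmers are fast but can emit non-words (“poni”); lemmatizers return dictionary forms at the cost of needing part-of-speech hints.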

Counting

Trivial with a map:
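In Python the map is collections.Counter (a plain dict works just as well):

```python
from collections import Counter

tokens = ["fox", "ran", "fox", "den", "fox", "ran"]  # output of the tokenizer above
counts = Counter(tokens)
print(counts.most_common(2))  # [('fox', 3), ('ran', 2)]
```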

N-grams: beyond single words

Single-word counts miss phrases. “San Francisco” carries information that “san” + “francisco” separately doesn’t. Bigrams (2-word) and trigrams (3-word) capture this:
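A sketch; the ngrams helper is written here for illustration, not taken from a library:

```python
def ngrams(tokens, n):
    """Slide a window of n tokens across the list, joining each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["he", "moved", "to", "san", "francisco", "last", "year"]
print(ngrams(tokens, 2))
# ['he moved', 'moved to', 'to san', 'san francisco', 'francisco last', 'last year']
```

Count n-grams by feeding the result into the same Counter as before.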

Bigram stop-word filtering is trickier — “of the” is noise but “state of the art” is signal. Strip bigrams where both tokens are stop words, keep the rest.
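That rule as a sketch, with a stand-in stop list:

```python
STOP_WORDS = {"of", "the", "to", "he", "last"}  # stand-in list

bigrams = ["of the", "san francisco", "state of", "moved to"]
kept = [b for b in bigrams if not all(w in STOP_WORDS for w in b.split())]
print(kept)  # ['san francisco', 'state of', 'moved to']
```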

TF-IDF: frequency in context

TF-IDF weighs a term’s frequency in one document against how many documents in the collection contain it, so words that are common everywhere score low. High TF-IDF = characteristic of the document. Great for tagging, topic extraction, and finding the “gist” words.
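A minimal sketch of the standard formulation, tf(t, d) × log(N / df(t)), with invented documents:

```python
import math

def tf_idf(term, doc, docs):
    """Plain tf * idf; real systems add smoothing and log-scaled tf."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [["fox", "den", "fox"], ["fox", "ran"], ["cat", "sat"]]
print(round(tf_idf("den", docs[0], docs), 3))  # 0.366 <- appears in one doc only
print(round(tf_idf("fox", docs[0], docs), 3))  # 0.27  <- common across docs, scores lower
```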

SEO application: keyword density

Keyword density = (count of keyword / total words) × 100. The old SEO target was 1–3%. Modern consensus: natural language beats forced density. Use frequency counting to audit where you stand rather than to chase a percentage.
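The formula as a function (sample tokens invented):

```python
def keyword_density(keyword, tokens):
    """(count of keyword / total words) * 100, per the formula above."""
    return 100 * tokens.count(keyword) / len(tokens)

tokens = ["fast", "vpn", "review", "the", "fast", "vpn"]
print(round(keyword_density("vpn", tokens), 1))  # 33.3 -- far past any sane target
```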

Style checking

Frequency counts reveal habitual tics: “really,” “just,” “very,” “that” overused as filler. Run your draft through a frequency pass and the top 30 content words show your patterns.
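A sketch of that pass over an invented sentence; in practice, feed it your whole draft:

```python
from collections import Counter

FILLERS = {"really", "just", "very", "that"}  # the tics named above

tokens = "it was really just a very very quiet night that felt very long".split()
counts = Counter(tokens)
for word, n in counts.most_common(5):
    flag = "  <- filler?" if word in FILLERS else ""
    print(n, word + flag)
```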

Research and corpus analysis

For larger corpora:
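One way, streaming line by line so memory stays flat (the file name is hypothetical):

```python
from collections import Counter
import re

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:  # one line at a time; the full corpus never sits in memory
        counts.update(re.findall(r"[\w']+", line.lower()))

print(counts.most_common(20))
```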

Hapax legomena and Zipf’s law

Two patterns show up in almost any corpus. A surprising share of the distinct words occur exactly once (the hapax legomena), and word frequencies follow Zipf’s law: the nth most frequent word appears roughly 1/n as often as the most frequent one.
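Both are one-liners once you have counts:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # stand-in for a real corpus
counts = Counter(tokens)

hapax = [w for w, n in counts.items() if n == 1]  # words occurring exactly once
print(hapax)                  # ['sat', 'on', 'mat', 'ran']
print(counts.most_common(3))  # [('the', 3), ('cat', 2), ('sat', 1)] -- falls off fast
```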

Common mistakes

Most of them undo a step above: skipping case folding so “The” and “the” count separately, leaving punctuation attached so “dog.” and “dog” split one word’s count, and leaving stop words in so the top of the list says nothing about the content.

Run the numbers

Paste a draft into a word frequency counter, or run the sketches above, and scan the top 30 entries. Every pattern in this guide shows up in the first screen of results.