
# Chrome common word list

Chrome's common words list is derived from Peter Norvig's `count_1w.txt`
dataset, itself derived from Google's Web Trillion Word Corpus. See
https://norvig.com/ngrams/.

## `common_words.gperf`
This file contains a list of the 10k most common English words that are at
least 3 characters long, excluding brands from the current top domain list.
It is used to generate a DAFSA (deterministic acyclic finite state automaton),
which is then embedded in Chrome.

## `common_words_test.gperf`
A version of `common_words.gperf` used in unit tests. Most (all?) lookalike
tests use a test version of the top domain list called `test_domains.list`.
`common_words_test.gperf` should be kept in sync with that test domains file.

## `brands_in_common_words.list`
This file lists brands that appear in the common words list. Each brand was
identified by manually reviewing the overlap between brand names in
`domains.list` and the common English words.

## How to generate the files

`common_words.gperf` is generated by downloading a publicly available common
words file, computing the overlap of this file with `domains.list` and manually
filtering the overlap file.

1. Generate the initial list of overlaps with:
```

# Run this from the src/ dir:
cd components/url_formatter/spoof_checks/top_domains

# Get the count_1w.txt file:
wget https://norvig.com/ngrams/count_1w.txt

# Sort the registered-part of the domains from the top domain list, such as
# "google" and "apple":
cut -f1 -d. domains.list | sort > brands_sorted.list

# Generate overlap.list file which contains words common between count_1w.txt
# and brands_sorted.list. Only uses top 10K words with at least three characters
# from count_1w.txt.
awk 'length($1) > 2 {print $1}' count_1w.txt \
  | head -n 10000 | sort | comm -12 - brands_sorted.list > overlap.list
```

2. Manually filter `overlap.list`, removing anything that either isn't a brand
(e.g. "weather"), or is a brand but is also a valid English word (e.g. "apple").

This approach isn't perfect -- the common words list still contains lots of
brands, but at least it shouldn't contain brands that are in the current top
domain list.

3. Copy the contents of `overlap.list` to `brands_in_common_words.list`.

4. Run `./generate_common_words.sh` to generate `common_words.gperf`.

The final list also contains extra unfiltered words. Spot-check the final
`common_words.gperf` file to ensure no brands snuck in. If there are any brands,
remove them manually.
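
Part of this spot-check can be done mechanically. The sketch below is not part
of the Chromium tooling, and `find_listed_brands` is a hypothetical helper. It
assumes that each entry line in `common_words.gperf` starts with the word,
optionally followed by a comma and a value, that lines starting with `%` or `#`
are not entries, and that `brands_in_common_words.list` holds one brand per
line.

```shell
# Hypothetical helper: print every word in a gperf word list that also appears
# in a brand list.
#   $1: gperf file (assumed format: "word" or "word, value" per entry line)
#   $2: brand list, one brand per line
find_listed_brands() {
  # 1. Drop gperf directive/delimiter lines ("%...") and comment lines ("#...").
  # 2. Keep only the word: strip everything from the first comma or whitespace.
  # 3. Drop lines that end up empty.
  # 4. Keep words that exactly match a whole line of the brand list.
  sed -e '/^[%#]/d' -e 's/[,[:space:]].*$//' -e '/^$/d' "$1" \
    | grep -Fx -f "$2"
}
```

Running `find_listed_brands common_words.gperf brands_in_common_words.list`
should print nothing if every listed brand was removed (`grep` then exits with
status 1). This only catches brands that are already in the brand list, so the
extra unfiltered words still need a manual look.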

**This list should be regenerated whenever the top domains list is updated.**