# Chrome common word list
Chrome's common words list is derived from Peter Norvig's `count_1w.txt`
dataset, itself derived from Google's Web Trillion Word Corpus. See
https://norvig.com/ngrams/.
## `common_words.gperf`
This file contains a list of the 10k most common English words, each at least
3 characters long, with brands from the current top domain list removed. It is
used to generate a DAFSA, which is then embedded in Chrome.
## `common_words_test.gperf`
A version of `common_words.gperf` used in unit tests. Most (all?) lookalike
tests use a test version of the top domain list called `test_domains.list`.
`common_words_test.gperf` should be kept in sync with the test domains file.
## `brands_in_common_words.list`
This file contains a list of brands that also appear in the common words list.
Each brand is identified by manually reviewing the overlap between brand names
in `domains.list` and common English words.
## How to generate the files
`common_words.gperf` is generated by downloading a publicly available common
words file, computing the overlap of this file with `domains.list` and manually
filtering the overlap file.
1. Generate the initial list of overlaps with:
```
# Run this from the src/ dir:
cd components/url_formatter/spoof_checks/top_domains
# Get the count_1w.txt file:
wget https://norvig.com/ngrams/count_1w.txt
# Extract and sort the registered part of each domain in the top domain list,
# such as "google" and "apple":
cut -f1 -d. domains.list | sort > brands_sorted.list
# Generate overlap.list file which contains words common between count_1w.txt
# and brands_sorted.list. Only uses top 10K words with at least three characters
# from count_1w.txt.
awk 'length($1) > 2 {print $1}' count_1w.txt \
| head -n 10000 | sort | comm -12 - brands_sorted.list > overlap.list
```
2. Manually filter `overlap.list`, removing anything that either isn't a brand
(e.g. "weather"), or is a brand but is also a valid English word (e.g. "apple").
This approach isn't perfect -- the common words list still contains lots of
brands, but at least it shouldn't contain brands that are in the current top
domain list.
3. Copy the contents of `overlap.list` to `brands_in_common_words.list`.
4. Run `./generate_common_words.sh` to generate `common_words.gperf`.
The final list also contains extra unfiltered words. Spot-check the final
`common_words.gperf` file to ensure no brands snuck in. If there are any brands,
remove them manually.
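
The spot-check can be partly automated. The following is a minimal sketch, not part of the existing tooling: `find_leftover_brands` is a hypothetical helper, and it assumes that each entry in the `.gperf` file sits on its own line with the word as the first comma-separated field, and that gperf directives and comments start with `%`, `#` or `//`. Adjust the filter if the generated file is laid out differently.

```shell
# Hypothetical helper (not part of the Chromium tree).
# Usage: find_leftover_brands BRANDS_FILE GPERF_FILE
# Prints every word from BRANDS_FILE that still appears as an entry in
# GPERF_FILE. ASSUMPTION: one entry per line, word in the first
# comma-separated field; lines starting with "%", "#" or "//" are skipped.
find_leftover_brands() {
  brands_file="$1"
  gperf_file="$2"
  sorted_brands="$(mktemp)"

  # comm requires both inputs to be sorted with the same collation.
  LC_ALL=C sort -u "$brands_file" > "$sorted_brands"

  # Take the first field, strip whitespace, drop directives, comments and
  # blank lines, then print only the words present in both inputs.
  awk -F, '{ gsub(/[ \t\r]/, "", $1) }
           $1 != "" && $1 !~ /^(%|#|\/\/)/ { print $1 }' "$gperf_file" \
    | LC_ALL=C sort -u \
    | LC_ALL=C comm -12 - "$sorted_brands"

  rm -f "$sorted_brands"
}
```

Running `find_leftover_brands brands_in_common_words.list common_words.gperf` should print nothing if the excluded brands were all removed. It only catches brands already listed in `brands_in_common_words.list`, so a manual read-through is still needed for anything else.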
**This list should be regenerated whenever the top domains list is updated.**