content Infra spell bot

fixes https://github.com/mdn/content/issues/12522

I've been running a spell-checker bot in my temp repo for over 1.5 years. The bot runs at the start of every Monday and files an issue listing the typos and spelling mistakes it found in the mdn/content repo. The issue is resolved by fixing the typos in mdn/content and adding new words to the project-words.txt file.
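
A minimal sketch of the check described above, assuming a plain word-list lookup. The function name, the tiny dictionary, and the sample sentence are hypothetical, and the actual bot may tokenize and match differently.

```python
import re

def find_unknown_words(text, dictionary, project_words):
    """Return words found in neither word set, in order of first appearance."""
    known = {w.lower() for w in dictionary} | {w.lower() for w in project_words}
    unknown = []
    seen = set()
    # Assumption: a word is a run of ASCII letters or apostrophes.
    for token in re.findall(r"[A-Za-z']+", text):
        word = token.strip("'").lower()
        if word and word not in known and word not in seen:
            seen.add(word)
            unknown.append(word)
    return unknown

# Hypothetical stand-ins for a base dictionary and project-words.txt.
dictionary = {"the", "element", "is", "a", "request"}
project_words = {"fetch", "webgl"}

print(find_unknown_words("The fetch request is a WebGL elemnt", dictionary, project_words))
```

Each reported word is then either a typo to fix in the content or a legitimate term to add to the word list.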

In the last weekly meeting, we got the green signal to move the bot to the content repo. This PR moves the bot to this repo.

The following changes have been made to existing files:

- The project-words.txt file contains 5k+ ignored words gathered over that period.
- The regexes have been updated to handle umlauts in fragments and to allow the valid=true case.
- There are multiple occurrences of favourite-colour. Because the article is named that way, all of the occurrences have to be ignored.

Jul 29 '24 13:07 OnkarRuikar

Preview URLs

External URLs (1)

URL: /en-US/docs/Web/HTTP/Status/204 Title: 204 No Content

https://github.com/httpwg/http-core/issues/26 (1 time) (Note! This may be a new URL 👀)

(comment last updated: 2024-08-13 07:12:42)

Jul 29 '24 13:07 github-actions[bot]

@OnkarRuikar Perhaps you can split out the content change into a separate PR so it can be quickly merged? https://github.com/mdn/content/issues/35404

Aug 11 '24 15:08 Josh-Cena

@OnkarRuikar Could you split all content changes into separate PRs? We don't want to loop more and more people into this.

Aug 13 '24 05:08 Josh-Cena

~~We already have project-words.txt right? Why two?~~

I see that they won't be added to editor suggestions. However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

Aug 13 '24 07:08 Josh-Cena

> However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

Because the bot was running standalone and editor suggestions were not a consideration, all the words accumulated in one file. There are 5578 words in the file, and classifying them is a huge, time-consuming task. But we can do it later, after the bot goes live.

If we could force contributors to use camel case variable names then a ton of words would get removed. 🙄 myvar vs myVar
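
A rough sketch of why camel case would shrink the list, assuming the checker splits identifiers at case boundaries before looking words up. Many code-aware spell checkers do this, but the thread does not say how this bot tokenizes. `split_identifier`, `needs_ignore_entry`, and the two-word dictionary are hypothetical.

```python
import re

def split_identifier(name):
    """Split an identifier at case boundaries and return lowercase parts."""
    # Acronym runs (HTML), capitalized or lowercase words, then digit runs.
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [part.lower() for part in re.findall(pattern, name)]

def needs_ignore_entry(name, known_words):
    """True if any part of the identifier is missing from the dictionary."""
    return any(part not in known_words for part in split_identifier(name))

known_words = {"my", "var"}  # hypothetical tiny dictionary

print(split_identifier("myVar"))                 # ['my', 'var']
print(split_identifier("myvar"))                 # ['myvar']
print(needs_ignore_entry("myVar", known_words))  # False
print(needs_ignore_entry("myvar", known_words))  # True
```

Under that assumption, myVar is checked as two known words, while myvar is a single unknown token that needs its own entry in the ignore file.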

Aug 13 '24 08:08 OnkarRuikar

> If we could force contributors to use camel case variable names then a ton of words would get removed. 🙄

I would not be opposed to a PR that goes slightly beyond fixing real typos and also reduces the dictionary size! In practice contributors won't be "forced" to do anything anyway since the typo check is out of band.

Aug 13 '24 14:08 Josh-Cena

I've successfully tested the workflow. Created Issue: https://github.com/OnkarRuikar/content/issues/29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

Aug 28 '24 06:08 OnkarRuikar

> I've successfully tested the workflow. Created Issue: OnkarRuikar#29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

It looks good, tnx. I think we can merge this shortly. I would also echo the sentiment from Josh that we should try to reduce the 5k+ size dictionary by some means in a follow-up. Like:

- moving actual technology terms from ignore-words to project-words
- camelCasing variable names (requires content updates)
- Some project documentation about the spellchecker, how to add words, what kinds of words go where (CONTRIBUTING.md?)

Aug 29 '24 09:08 bsmth

I ran sort file | uniq file > file on the word list files. How about names project-words-list.txt and ignored-words-list.txt for the files?
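
The command above reads as shorthand: taken literally, redirecting output into the same file that is being read risks truncating it before it has been read. A minimal Python sketch of the same sort-and-deduplicate step, assuming one word per line. The function name and the sample words are hypothetical, and `sorted` orders by code point, which may differ from a locale-aware `sort`.

```python
from pathlib import Path
import tempfile

def sort_unique_word_list(path):
    """Rewrite a one-word-per-line file sorted and without duplicates."""
    path = Path(path)
    # Read everything first so the file is never truncated mid-read.
    lines = path.read_text(encoding="utf-8").splitlines()
    words = {line.strip() for line in lines if line.strip()}
    ordered = sorted(words)
    path.write_text("".join(word + "\n" for word in ordered), encoding="utf-8")
    return ordered

# Demo on a throwaway file, not on the real word lists.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "words.txt"
    demo.write_text("webgl\ngranularities\nwebgl\n\nmyvar\n", encoding="utf-8")
    print(sort_unique_word_list(demo))  # ['granularities', 'myvar', 'webgl']
```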

Aug 29 '24 10:08 OnkarRuikar

> I ran sort file | uniq file > file on the word list files.

nice, thank you

> How about names project-words-list.txt and ignored-words-list.txt for the files?

IMO "words" is the part to get rid of in the ignore file, so I'd prefer to keep the original or do something like:

- terms-abbreviations.txt
- ignore-list.txt

What do you reckon?

Aug 29 '24 12:08 bsmth

@bsmth terms-abbreviations.txt and ignore-list.txt sound good. I've renamed the files and updated the rest of the content.

Aug 29 '24 14:08 OnkarRuikar

Don't forget to trigger the workflow manually.

Aug 29 '24 15:08 OnkarRuikar
