Infra spell bot

Open OnkarRuikar opened this issue 1 year ago • 7 comments

  • fixes https://github.com/mdn/content/issues/12522

I've been running a spell-checker bot in my temp repo for over 1.5 years. The bot runs at the start of every Monday and files an issue detailing the typos and spelling mistakes it finds in the mdn/content repo. The issue is resolved by fixing the typos in mdn/content and adding new words to the project-words.txt file.

In the last weekly meeting, we got the green signal to move the bot to the content repo. The PR moves the bot to this repo.

The following changes have been made to existing files:

  • The project-words.txt file contains 5k+ ignored words gathered over the period.
  • The regexes have been updated to handle umlauts in fragments and to allow the valid=true case.
  • There are multiple cases of favourite-colour. As the article is named in such a way, we have to ignore all the occurrences.

OnkarRuikar avatar Jul 29 '24 13:07 OnkarRuikar
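
The report-and-resolve loop described above can be sketched roughly as follows. This is only an illustration, not the bot's actual implementation: the tiny `DICTIONARY` set, the `find_unknown_words` helper, and the sample text are all hypothetical, and a real spell checker does far more than a set lookup.

```python
import re

# Hypothetical stand-in for a real dictionary; an actual checker ships a full one.
DICTIONARY = {"the", "element", "is", "rendered", "by", "browser"}


def find_unknown_words(text, project_words):
    """Return sorted lowercase words found in neither the dictionary nor the project list."""
    allowed = DICTIONARY | {w.lower() for w in project_words}
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return sorted(words - allowed)


sample = "The element is renderred by the browser flexbox"

# First run: both the real typo and the unlisted technical term are reported.
print(find_unknown_words(sample, project_words=[]))

# After adding the technical term to the word list, only the typo remains.
print(find_unknown_words(sample, project_words=["flexbox"]))
```

Fixing the typo in the content and keeping `flexbox` in the word list would leave the next report empty, which is the state in which the weekly issue can be closed.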

Preview URLs

External URLs (1)

URL: /en-US/docs/Web/HTTP/Status/204 Title: 204 No Content

(comment last updated: 2024-08-13 07:12:42)

github-actions[bot] avatar Jul 29 '24 13:07 github-actions[bot]

@OnkarRuikar Perhaps you can split out the content change into a separate PR so it can be quickly merged? https://github.com/mdn/content/issues/35404

Josh-Cena avatar Aug 11 '24 15:08 Josh-Cena

@OnkarRuikar Could you split all content changes into separate PRs? We don't want to loop more and more people into this.

Josh-Cena avatar Aug 13 '24 05:08 Josh-Cena

~~We already have project-words.txt right? Why two?~~

I see that they won't be added to editor suggestions. However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

Josh-Cena avatar Aug 13 '24 07:08 Josh-Cena

However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)

Because the bot was running standalone and editor suggestions were not a consideration, all the words accumulated in one file. There are 5578 words in the file, and classifying them is a huge, time-consuming task. But we can do it later, after the bot goes live.

If we could force contributors to use camel case variable names then a ton of words will get removed. :roll_eyes: myvar vs myVar

OnkarRuikar avatar Aug 13 '24 08:08 OnkarRuikar
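
To illustrate why camel case would shrink the list: a checker that splits identifiers at case boundaries sees `myVar` as two ordinary words, while `myvar` stays a single unknown token. The sketch below assumes that kind of splitting; `split_identifier`, `unknown_parts`, and the `KNOWN` set are hypothetical names for illustration, not part of the bot.

```python
import re

# Hypothetical mini dictionary of words the checker already accepts.
KNOWN = {"my", "var"}


def split_identifier(name):
    """Split an identifier at case boundaries and lowercase the parts."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", name)
    return [part.lower() for part in parts]


def unknown_parts(name):
    """Return the parts of an identifier that are not in KNOWN."""
    return [part for part in split_identifier(name) if part not in KNOWN]


print(unknown_parts("myVar"))  # every part is a known word
print(unknown_parts("myvar"))  # the whole token is unknown
```

Under this assumption, `myvar` needs its own entry in the ignore list, whereas `myVar` needs none.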

If we could force contributors to use camel case variable names then a ton of words will get removed. 🙄

I would not be opposed to a PR that goes slightly beyond fixing real typos and also reduces the dictionary size! In practice contributors won't be "forced" to do anything anyway since the typo check is out of band.

Josh-Cena avatar Aug 13 '24 14:08 Josh-Cena

I've successfully tested the workflow. Created Issue: https://github.com/OnkarRuikar/content/issues/29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

OnkarRuikar avatar Aug 28 '24 06:08 OnkarRuikar

I've successfully tested the workflow. Created Issue: OnkarRuikar#29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45

It looks good, tnx. I think we can merge this shortly. I would also echo the sentiment from Josh that we should try to reduce the 5k+ size dictionary by some means in a follow-up. Like:

  • moving actual technology terms from ignore-words to project-words
  • camelCasing variable names (requires content updates)
  • Some project documentation about the spellchecker, how to add words, what kinds of words go where (CONTRIBUTING.md?)

bsmth avatar Aug 29 '24 09:08 bsmth

I ran sort file | uniq file > file on the word list files. How about names project-words-list.txt and ignored-words-list.txt for the files?

OnkarRuikar avatar Aug 29 '24 10:08 OnkarRuikar
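
Read literally, the one-liner above would not work in a typical shell: `> file` truncates the file before `sort` reads it, and `uniq file` reads the file rather than the pipe, so it is presumably shorthand. `sort -u -o file file` is one safe in-place form. A minimal Python sketch of the same sort-and-deduplicate step follows; the `normalize_word_list` helper and its whole-line, case-sensitive comparison are assumptions, not the script that was actually run.

```python
from pathlib import Path
import tempfile


def normalize_word_list(path):
    """Rewrite the file with its non-empty lines deduplicated and sorted."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    unique_sorted = sorted({line.strip() for line in lines if line.strip()})
    # The file is read fully before being rewritten, so nothing is lost to truncation.
    Path(path).write_text("".join(word + "\n" for word in unique_sorted), encoding="utf-8")
    return unique_sorted


# Demo on a throwaway file; the words are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "words.txt"
    demo.write_text("flexbox\nargs\nflexbox\n\nwebkit\n", encoding="utf-8")
    print(normalize_word_list(demo))
```

Python's `sorted` orders strings by code point, which can differ from a locale-aware `sort`, so the resulting order may not match the shell's exactly.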

I ran sort file | uniq file > file on the word list files.

nice, thank you

How about names project-words-list.txt and ignored-words-list.txt for the files?

IMO "words" is the part to get rid of in the ignore file, so I'd prefer to keep the original or do something like:

terms-abbreviations.txt
ignore-list.txt

What do you reckon?

bsmth avatar Aug 29 '24 12:08 bsmth

@bsmth terms-abbreviations.txt and ignore-list.txt sound good. I've renamed the files and updated the rest of the content.

OnkarRuikar avatar Aug 29 '24 14:08 OnkarRuikar

Don't forget to trigger the workflow manually.

OnkarRuikar avatar Aug 29 '24 15:08 OnkarRuikar