Infra spell bot
- fixes https://github.com/mdn/content/issues/12522
I've been running a spell-checker bot in my temp repo for over 1.5 years. The bot runs at the start of every Monday and files an issue listing the typos and spelling mistakes found in the mdn/content repo. The issue is resolved by fixing the typos in mdn/content and adding new words to the `project-words.txt` file.
In the last weekly meeting, we got the green light to move the bot to the content repo. This PR moves the bot to this repo.
The following changes have been made to existing files:
- The `project-words.txt` file contains 5k+ ignored words gathered over the period.
- The RegExs have been updated to handle umlauts in fragments, to allow the `valid=true` case.
- There are multiple cases of `favourite-colour`. As the article is named in such a way, we have to ignore all the occurrences.
Preview URLs
- /en-US/docs/Web/HTML/Global_attributes/class
- /en-US/docs/Web/HTML/Global_attributes/id
- /en-US/docs/Web/HTTP/Status/204
- /en-US/docs/Web/HTTP/Status/226
External URLs (1)
URL: /en-US/docs/Web/HTTP/Status/204
Title: 204 No Content
- https://github.com/httpwg/http-core/issues/26 (1 time) (Note! This may be a new URL 👀)
(comment last updated: 2024-08-13 07:12:42)
@OnkarRuikar Perhaps you can split out the content change into a separate PR so it can be quickly merged? https://github.com/mdn/content/issues/35404
@OnkarRuikar Could you split all content changes into separate PRs? We don't want to loop more and more people into this.
~~We already have project-words.txt right? Why two?~~
I see that they won't be added to editor suggestions. However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)
> However I think in ignored-words.txt some of them are real words like "granularities". This is going to be tricky ;)
Because the bot was working standalone and editor suggestions were not a consideration, all the words were accumulated in one file. There are 5578 words in the file, and classifying them is a huge, time-consuming task. But we can do it later, after the bot goes live.
If we could force contributors to use camel case variable names, a ton of words would get removed. 🙄 `myvar` vs `myVar`
> If we could force contributors to use camel case variable names then a ton of words will get removed. 🙄
I would not be opposed to a PR that goes slightly beyond fixing real typos and also reduces the dictionary size! In practice contributors won't be "forced" to do anything anyway since the typo check is out of band.
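To illustrate the `myvar` vs `myVar` point: a minimal sketch, assuming the spell checker splits identifiers on case boundaries before looking words up. The helper names `split_identifier` and `unknown_words` and the tiny dictionary are hypothetical and are not taken from the bot's code.

```python
import re

# Hypothetical helper: split an identifier into word-like pieces, roughly the
# way a camelCase-aware spell checker might. This is an assumption, not the
# bot's actual implementation.
def split_identifier(name: str) -> list[str]:
    words = []
    # Break on underscores and hyphens first, then on case transitions.
    for part in re.split(r"[_\-]+", name):
        # Acronym runs (HTML), capitalised or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


# Hypothetical helper: return the pieces that are not in the dictionary.
def unknown_words(name: str, dictionary: set[str]) -> list[str]:
    return [w for w in split_identifier(name) if w not in dictionary]


# Illustrative dictionary only; a real word list would be far larger.
known = {"my", "var"}
print(unknown_words("myVar", known))  # no unknown pieces
print(unknown_words("myvar", known))  # "myvar" is reported as one unknown word
```

Under that assumption, `myVar` is checked as two ordinary words, while `myvar` is a single unknown token that has to be added to the ignore list.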
I've successfully tested the workflow. Created Issue: https://github.com/OnkarRuikar/content/issues/29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45
> I've successfully tested the workflow. Created Issue: OnkarRuikar#29 The workflow run: https://github.com/OnkarRuikar/content/actions/runs/10591216755/job/29348284353#step:4:45
It looks good, tnx. I think we can merge this shortly. I would also echo the sentiment from Josh that we should try to reduce the 5k+ size dictionary by some means in a follow-up. Like:
- moving actual technology terms from ignore-words to project-words
- camelCasing variable names (requires content updates)
- Some project documentation about the spellchecker, how to add words, what kinds of words go where (`CONTRIBUTING.md`?)
I ran `sort file | uniq file > file` on the word list files. How about the names `project-words-list.txt` and `ignored-words-list.txt` for the files?
> I ran `sort file | uniq file > file` on the word list files.
nice, thank you
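For reference, a minimal sketch of the same sort-and-deduplicate step in Python. The command above reads as shorthand: taken literally, redirecting output into the file being read would truncate it before it is read, so this sketch reads the whole file first and writes afterwards. The function names and the case-insensitive ordering are assumptions, not what was actually run on the word lists.

```python
from pathlib import Path


# Hypothetical helper: return the non-empty lines of `text`, de-duplicated
# and sorted case-insensitively (ties broken by plain string order).
def sort_unique_lines(text: str) -> str:
    words = {line.strip() for line in text.splitlines() if line.strip()}
    ordered = sorted(words, key=lambda w: (w.lower(), w))
    return "\n".join(ordered) + "\n" if ordered else ""


# Hypothetical helper: rewrite a word list file in place. The file is read
# completely before it is opened for writing, so no content is lost.
def dedupe_file(path: str) -> None:
    p = Path(path)
    original = p.read_text(encoding="utf-8")
    p.write_text(sort_unique_lines(original), encoding="utf-8")
```

Calling `dedupe_file` on each word list file would leave one entry per line with no duplicates; the exact ordering may differ from what `sort` produces under a given locale.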
> How about names `project-words-list.txt` and `ignored-words-list.txt` for the files?
IMO "words" is the part to get rid of in the ignore file, so I'd prefer to keep the original or do something like:
terms-abbreviations.txt
ignore-list.txt
What do you reckon?
@bsmth `terms-abbreviations.txt` and `ignore-list.txt` sound good. I've renamed the files and updated the rest of the content.
Don't forget to trigger the workflow manually.