Detect language and correct accordingly

Open rymai opened this issue 9 years ago • 11 comments

I received a pull request on rymai/elevator-simulator#1 in which "attendent" was treated as a misspelling of "attendant". The problem is that the README is in French, where "ils attendent" means "they are waiting".

rymai avatar Oct 01 '15 06:10 rymai

Your README mixes both English and French, which will make it extra hard to detect language.

You are probably not alone in mixing languages, as this can happen for a number of reasons.

orthographic-pedant could use a number of methods to fix this one:

  • blacklisting
  • language settings in orthographic-pedant configuration files
  • language settings embedded in md files using language tags

uiteoi avatar Oct 01 '15 11:10 uiteoi

Thanks @uiteoi. Indeed the readme mixes both languages, but at least the script could stop its work in that case instead of proposing a wrong correction...

I certainly don't want to add a configuration file for this bot, nor add settings in md files. :)

rymai avatar Oct 01 '15 12:10 rymai

Proper detection is certainly the best way forward. Given how complex it is to implement, I was considering other options. In your case, blacklisting would be the most appropriate short-term solution.

uiteoi avatar Oct 01 '15 14:10 uiteoi

Exactly!

rymai avatar Oct 01 '15 14:10 rymai

What I've found is that explicit white-listing is the way to go. I made a few early mistakes correcting "Ceasar" to "Caesar" and had half of Latin America mad at me. For this particular case I'm going to remove this word from the correction list. I've only done the A's so far; you can see what corrections will be attempted here:

https://github.com/thoppe/orthographic-pedant/blob/master/wordlists/parsed_wikipedia_list.txt

A poor man's check for a possible foreign language would be to test whether the entire README can be converted to ASCII without loss. Obviously this is a bit heavy-handed, but I'm not sure how this problem is solved in the real world.
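For illustration, a minimal Python sketch of that ASCII round-trip check might look like the following. The function name `is_ascii_only` is hypothetical and is not taken from the orthographic-pedant code base.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if the text survives conversion to ASCII without loss.

    Hypothetical helper: a README that fails this check contains accented
    or other non-ASCII characters and may not be written in English.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

Note that a check like this only catches text containing non-ASCII characters. A phrase such as "ils attendent" is plain ASCII and would pass, so the check cannot rule out French on its own.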

thoppe avatar Oct 01 '15 14:10 thoppe

@thoppe, you are going to have this same problem with countless other words. French and English in particular share many words with slightly different spellings, e.g. example / exemple, apartment / appartement, ...

So I would suggest that you start looking into some form of language detection and make it easier to blacklist repos.

Good luck with your project.

uiteoi avatar Oct 01 '15 15:10 uiteoi

Another possible suggestion: if a repo owner has rejected a pull request once, you may want to blacklist that repo automatically to avoid submitting further suggested fixes.
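As a rough sketch of that idea in Python, assuming the bot keeps an in-memory set of blacklisted repository names: the function names `record_pr_outcome` and `should_submit` are hypothetical, and how the bot learns whether a pull request was merged or rejected is left out.

```python
def record_pr_outcome(repo: str, merged: bool, blacklist: set) -> None:
    """Add the repo to the blacklist when its pull request was not merged."""
    if not merged:
        # Repository names are compared case-insensitively here (an assumption).
        blacklist.add(repo.lower())


def should_submit(repo: str, blacklist: set) -> bool:
    """Return True if the bot may still open pull requests on this repo."""
    return repo.lower() not in blacklist
```

After `record_pr_outcome("rymai/elevator-simulator", merged=False, blacklist=seen)`, a later call to `should_submit("rymai/elevator-simulator", seen)` returns `False`, so the bot would leave that repo alone.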

uiteoi avatar Oct 01 '15 15:10 uiteoi

Good suggestions @uiteoi. Since I don't speak French, is there a list of "homophonic cognates" somewhere that you can vouch for as a good starting point?

Natural language is deceptively hard to get right, especially when I have to cross the phase boundary between two of them!

As a side note, many happy users reject a PR by accident because they are unfamiliar with GitHub's PR system. By an ad-hoc estimate, this amounts to about 5%. Very few people vehemently dislike the bot (but that number is not zero).

thoppe avatar Oct 01 '15 15:10 thoppe

Here's a Wikipedia page with a list of common spelling mistakes in French. It is used by the WPCleaner bot to detect spelling mistakes.

https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Liste_de_fautes_d%27orthographe_courantes

You should expect most major languages to have similar lists for the WPCleaner bot to use.

I see you are using Python, which has a number of natural-language libraries, NLTK among them. Here's an example I found by googling "python natural language detection": https://pypi.python.org/pypi/guess-language

People who reject a PR by accident should be able to submit a PR on your repo to get removed from the blacklist.

I personally think this is a great project and I encourage you to further develop it.

uiteoi avatar Oct 01 '15 15:10 uiteoi

I'm going to reopen this issue since it turns out this is a really good idea. It shouldn't be too hard to detect if the language is not English and skip the repo outright. This should help with the words that are correct in French and English at least.
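One crude way to sketch such a check in Python, using only the standard library, is to count common stopwords. This is a swapped-in heuristic for illustration, not the guess-language library mentioned above and not necessarily what the bot uses. The word lists and the name `looks_english` are assumptions.

```python
import re

# Tiny hand-picked stopword samples (an assumption; real lists are far longer).
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "in",
                     "that", "it", "for", "with", "this", "are"}
FRENCH_STOPWORDS = {"le", "la", "les", "et", "est", "des",
                    "un", "une", "que", "dans", "pour", "ils"}


def looks_english(text: str) -> bool:
    """Return True only when English stopwords outnumber French ones."""
    words = re.findall(r"\w+", text.lower())
    english_hits = sum(word in ENGLISH_STOPWORDS for word in words)
    french_hits = sum(word in FRENCH_STOPWORDS for word in words)
    # Ties (including texts with no hits at all) count as "not English",
    # so the bot errs on the side of skipping the repo.
    return english_hits > french_hits
```

With this sketch the bot would process a repo only when `looks_english(readme_text)` is true. A README that mixes both languages could land on either side of the comparison, which is why a blacklist is still useful as a fallback.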

thoppe avatar Oct 09 '15 18:10 thoppe

Great :+1:

uiteoi avatar Oct 10 '15 05:10 uiteoi