CorporaCreator
Adding Latvian sentence cleaners
Adding Latvian cleaners to filter out sentences with broken encoding.
Good idea @raivisdejus. I think you are trying to correct this:
The "?" characters inside words were caused by an encoding issue during the import from the old sentence collector: Unicode characters in many languages were replaced by "?". Some of these sentences were recorded by volunteers anyway, because they are still readable by humans.
If this is the case, I think it should be corrected for all languages (perhaps not for es if the sentence starts with "?"; are there any more such languages?).
@HarikalarKutusu You are correct, I am fixing issues with the encoding of special characters. I have created another PR that would validate this case in all languages. Currently it does not include any special handling of Spanish; see the considerations for this in the other PR.
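
For illustration, the check discussed above could look roughly like the sketch below. This is not the code from either PR: the names `BROKEN_CHAR`, `has_broken_encoding`, and `filter_sentences` are hypothetical, and the rule itself (a "?" with a letter or digit directly on both sides) is an assumption about what "inside words" means.

```python
import re

# Assumed rule: a "?" with a word character directly before and after it
# is treated as a replaced Unicode character. In Python 3, \w is
# Unicode-aware for str patterns, so letters such as "ā" also count.
BROKEN_CHAR = re.compile(r"\w\?\w")


def has_broken_encoding(sentence: str) -> bool:
    """Return True if the sentence contains a "?" inside a word."""
    return BROKEN_CHAR.search(sentence) is not None


def filter_sentences(sentences):
    """Keep only sentences without a "?" inside a word."""
    return [s for s in sentences if not has_broken_encoding(s)]


if __name__ == "__main__":
    samples = [
        "Vai tu n?c uz m?j?m?",  # broken: "?" replaced Latvian letters
        "Kā tev iet?",           # fine: "?" only ends the sentence
    ]
    print(filter_sentences(samples))
```

A "?" at the very start or end of a sentence is not matched by this pattern, since it has no word character on one side. The pattern would also flag text where a space is missing after a real question mark (for example "What?Really"), so a real cleaner may need a stricter rule.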