CorporaCreator

Adding Latvian sentence cleaners

Open raivisdejus opened this issue 11 months ago • 2 comments

Adding Latvian cleaners to filter out sentences with broken encoding.

raivisdejus, Mar 23 '24 17:03

Good idea @raivisdejus. I think you are trying to correct this:

The "?" inside words was caused by an encoding issue during import from the old sentence collector: Unicode characters in many languages were replaced by "?". Some of these sentences were still recorded by volunteers, because they remain readable to humans.

If this is the case, I think it should be corrected for all languages (perhaps with an exception for es when the sentence starts with "?"; are there any more languages like that?).

HarikalarKutusu, Mar 23 '24 22:03

@HarikalarKutusu You are correct, I am fixing issues with the encoding of special characters. I created another PR that would validate this case in all languages. Currently it does not include any special handling of Spanish; see the considerations for this in the other PR.

raivisdejus, Mar 24 '24 10:03
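
For readers following the thread, the check discussed above (rejecting sentences where "?" appears inside a word) could be sketched roughly as follows. This is only an illustration under assumptions: the names `has_question_mark_inside_word` and `drop_broken_sentences` are hypothetical and are not taken from CorporaCreator or from either PR, and the rule used here (a "?" with a word character on both sides) is one possible reading of "inside words".

```python
import re

# Assumption: "inside a word" means the "?" has a word character on both sides.
# For str patterns, \w is Unicode-aware by default in Python 3.
_QUESTION_MARK_INSIDE_WORD = re.compile(r"\w\?\w")


def has_question_mark_inside_word(sentence: str) -> bool:
    """Return True if the sentence contains a "?" surrounded by word characters."""
    return _QUESTION_MARK_INSIDE_WORD.search(sentence) is not None


def drop_broken_sentences(sentences):
    """Keep only the sentences that do not look like they have broken encoding."""
    return [s for s in sentences if not has_question_mark_inside_word(s)]


if __name__ == "__main__":
    samples = [
        "Es dz?voju R?g?.",   # Latvian letters replaced by "?"
        "Kā tev iet?",        # ordinary question, "?" only at the end
        "?Qué tal?",          # leading "?" as in the Spanish case discussed above
    ]
    for sentence in drop_broken_sentences(samples):
        print(sentence)
```

Because the pattern requires a word character on both sides, a "?" at the start or end of a sentence is left alone, which happens to leave the Spanish leading-"?" case untouched. The same rule misses a replaced character at the very end of a word, and it would also flag text such as "what?why" where a space is missing, so a real cleaner would likely need per-language tuning.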