CorporaCreator
Will invalidate sentences with question mark in the middle
This PR adds a validation that marks a sentence as invalid when a question mark appears before a lower case character.
It follows up on the comments in https://github.com/common-voice/CorporaCreator/pull/126
The current code in this PR does not do any special validation for Spanish. It assumes that a regular question mark "?" before a lower case character is still invalid in Spanish, and that the valid cases are the upside down question mark "¿" or a regular question mark "?" before an upper case character.
Also, because this code marks sentences as invalid and does not drop them completely, they can still be picked up from the invalidated set if needed.
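For reviewers, here is a minimal sketch of the rule described above. It is not the code in this PR: the names `has_question_mark_before_lowercase` and `split_valid_invalid` are hypothetical, and skipping whitespace between the "?" and the next character is my assumption about what "before lower case character" should cover.

```python
import re

# A "?" followed (after optional whitespace) by any non-space character.
# The lookahead keeps consecutive "?" characters from hiding a later match.
_QUESTION_THEN_CHAR = re.compile(r"\?(?=\s*(\S))")


def has_question_mark_before_lowercase(sentence: str) -> bool:
    """Return True if any "?" is followed by a lower case character."""
    return any(
        match.group(1).islower()
        for match in _QUESTION_THEN_CHAR.finditer(sentence)
    )


def split_valid_invalid(sentences):
    """Partition sentences; invalid ones are kept, not dropped."""
    valid, invalid = [], []
    for sentence in sentences:
        if has_question_mark_before_lowercase(sentence):
            invalid.append(sentence)
        else:
            valid.append(sentence)
    return valid, invalid
```

With this sketch, "¿" never triggers the check because only the regular "?" is matched, and a "?" at the end of a sentence or before an upper case character passes.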
I am happy to adjust the code with more specific checks for Spanish if someone can guide me on valid and invalid examples for Spanish.