language_tool_python
Offset position "longer" than text
I have a match that looks like this:
Match({'ruleId': 'MORFOLOGIK_RULE_ES', 'message': 'Se ha encontrado un posible error ortográfico.', 'replacements': ['teléfonos', 'teléfono', 'telefotos'], 'offsetInContext': 43, 'context': '...π πππππππ podemos compartir tus telefonos con el conductor πΊπ', 'offset': 307, 'errorLength': 9, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': 'Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido π΄π πππππ
π πππ πππππππ πππππ ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje πππ
ππππππ π πππππππ podemos compartir tus telefonos con el conductor πΊπ'})
The original sentence looks like this:
Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido π΄π πππππ
π πππ πππππππ πππππ ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje πππ
ππππππ π πππππππ podemos compartir tus telefonos con el conductor πΊπ
The problem is that the reported offset is 307, while the sentence is only 296 characters long.
I think the cause is that the text contains some characters that internally take more than one position in the Unicode encoding (each one is composed of multiple chars).
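A quick way to check this hypothesis is to compare Python's `len()`, which counts code points, with the number of UTF-16 code units in the same text. This is only a sketch: it assumes the offsets are counted in UTF-16 code units (plausible because LanguageTool runs on the JVM, but not confirmed here), and `utf16_len` and `sample` are illustrative names, not part of language_tool_python.

```python
def utf16_len(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    # Each UTF-16 code unit is 2 bytes; characters above U+FFFF need two units.
    return len(text.encode("utf-16-le")) // 2


# Two mathematical italic letters (outside the Basic Multilingual Plane), then " ok".
sample = "\U0001D434\U0001D43A ok"

print(len(sample))        # length in code points, as Python sees it
print(utf16_len(sample))  # length in UTF-16 code units
```

If the two numbers differ for the real input text, the gap should equal the number of characters above U+FFFF, since each of those takes two code units.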
As a result, when I try to map the detection back to the original text I get an error, because the offset is wrong and does not point to the true position in the text.
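A possible workaround, assuming the reported `offset` and `errorLength` are both measured in UTF-16 code units, is to convert them to Python string indices before slicing. The helper below is a hypothetical sketch (`utf16_offset_to_index` is not a function of language_tool_python), and the sample text is made up for illustration.

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Convert an offset in UTF-16 code units to a Python str index."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters above U+FFFF occupy two UTF-16 code units (a surrogate pair).
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


# Hypothetical example: two non-BMP letters, a space, then the flagged word.
text = "\U0001D434\U0001D43A telefonos"
reported_offset = 5   # assumed UTF-16 offset of "telefonos"
reported_length = 9   # assumed UTF-16 length of the error

start = utf16_offset_to_index(text, reported_offset)
end = utf16_offset_to_index(text, reported_offset + reported_length)
print(text[start:end])
```

Converting the end position separately, rather than adding `errorLength` to the converted start, keeps the slice correct even if the flagged span itself contains characters above U+FFFF.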