languagetool
Suggestion match with regexp_match not working when using tashkeel
Example:
<rule id="word_use_0005_muta2akid" name="متأكد">
<pattern>
<token inflected="yes">متأكد</token>
</pattern>
<message>يفضل أن يقال:
<suggestion><match no="1" regexp_match="متأكد" regexp_replace="متحقِّق"/></suggestion>
<suggestion><match no="1" regexp_match="متأكد" regexp_replace="متيقِّن"/></suggestion>
متيقن أو متحقق بدلا من متأكد</message>
<example correction="متحقِّق|متيقِّن" type="incorrect"> هل أنت <marker>متأكد</marker>؟</example>
<!-- Wrong: هل أنتَ متأكِّد؟ -->
<!--Correct: هل أنتَ متيقِّن؟ / هل أنتَ متحقق؟ -->
</rule>
With the sentence:
هل أنتَ مُتأكِّد أنّنا نسير في الاتّجاه الصّحيح؟
Output:
1.) Line 1, column 9, Rule ID: word_use_0005_muta2akid[4]
Message: يفضل أن يقال:
'مُتأكِّد'
'مُتأكِّد'
متيقن أو متحقق بدلا من متأكد
Suggestion: مُتأكِّد
Rule source: /org/languagetool/rules/ar/grammar.xml
هل أنتَ مُتأكِّد أنّنا نسير في الاتّجاه الصّحيح؟
^^^^^^^^
The problem: regexp_match does not match when the word contains tashkeel (Arabic diacritics), so the matched word comes back unchanged as the suggestion.
I suggest reworking the "case_conversion" attribute so that it can also strip or ignore tashkeel:
https://dev.languagetool.org/tips-and-tricks#changing-the-case-of-matched-word
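To illustrate the failure and the proposed fix outside LanguageTool, here is a minimal sketch in Python (LanguageTool itself is Java). It assumes that tashkeel means the marks in the range U+064B to U+0652; `remove_tashkeel` and `suggest` are hypothetical helpers written for this example, not LanguageTool's API. The strings are given as escapes so the code points are unambiguous.

```python
import re

# Assumption: tashkeel = U+064B (fathatan) .. U+0652 (sukun).
# Real text may need more marks (e.g. U+0670) or tatweel (U+0640).
TASHKEEL = re.compile("[\u064B-\u0652]")

# "مُتأكِّد" as it appears in the test sentence (with damma, shadda, kasra)
VOCALIZED = "\u0645\u064F\u062A\u0623\u0643\u0651\u0650\u062F"
# "متأكد" as written in regexp_match (no diacritics)
BARE = "\u0645\u062A\u0623\u0643\u062F"
# "متحقِّق" as written in regexp_replace
REPLACEMENT = "\u0645\u062A\u062D\u0642\u0651\u0650\u0642"


def remove_tashkeel(word: str) -> str:
    """Return the word with Arabic diacritics removed."""
    return TASHKEEL.sub("", word)


def suggest(word: str, pattern: str, replacement: str, strip: bool = False) -> str:
    """Mimic regexp_match/regexp_replace applied to a matched token.

    With strip=True the token is stripped of tashkeel before matching,
    which is the behaviour proposed in this issue.
    """
    target = remove_tashkeel(word) if strip else word
    return re.sub(pattern, replacement, target)


if __name__ == "__main__":
    # Without stripping, the diacritics sit between the letters,
    # so the pattern never matches and the word comes back unchanged.
    print(suggest(VOCALIZED, BARE, REPLACEMENT) == VOCALIZED)
    # With stripping, the pattern matches and the replacement is used.
    print(suggest(VOCALIZED, BARE, REPLACEMENT, strip=True) == REPLACEMENT)
```

The sketch strips the token rather than rewriting the pattern, which keeps existing rules unchanged; the cost is that any diacritics on the original word are lost in the suggestion.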
I found a way to do this by making some changes to the code. Could you update the repository from upstream so that I can open a PR for this change? Thanks.
The commit:
https://github.com/linuxscout/languagetool/commit/8d0f2ea46a83333c478d6b7be12c2c2cf3812949
@linuxscout My repo is now synced with upstream.
To be closed
Should I include the removeTashkeel method?
I tried to add it; take a look at the PR. I updated core files, but perhaps there is a way to limit the changes to the Arabic module.
To be closed