[TIKA-3340] LanguageProfile for Myanmar
Adds Myanmar LanguageProfile for Apache Tika https://issues.apache.org/jira/browse/TIKA-3340
Hi @arky - thanks for the PR! Would it be possible to add "my" to the list of languages being tested in LanguageIdentifierTest? You'd have to add a tika-core/src/test/resources/org/apache/tika/language/my.test file with Burmese text as well.
@kkrugler I'll be happy to contribute test cases for Myanmar. Can you please tell me more about how to do this?
Is just adding a 'lang_code.test' file with 100 lines of Myanmar text enough? https://github.com/apache/tika/tree/main/tika-core/src/test/resources/org/apache/tika/language
How do I verify this test case? Just 'mvn run tests...'?
Hi @arky, you also need to edit the LanguageIdentifierTest.java file to add "my" to the list of languages, like this:
    private static final String[] languages = new String[] {
        // TODO - currently Estonian and Greek fail these tests.
        // Enable when language detection works better.
        "da", "de", /* "et", "el", */ "en", "es", "fi", "fr", "it",
        "lt", "my", "nl", "pt", "sv"
    };
And then run mvn clean test from the tika/tika-core directory.
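Before running the Maven tests, it may help to sanity-check the candidate my.test file. The following is a minimal sketch, not part of Tika: the class name MyanmarTestFileCheck and the method myanmarLetterRatio are hypothetical, and it only assumes the file is UTF-8 text. It reports how many non-blank lines the file has and what fraction of its letters are in the Myanmar script.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not part of Tika) for eyeballing a candidate my.test file.
public class MyanmarTestFileCheck {

    /** Returns the fraction of letters in the text that belong to the Myanmar script. */
    static double myanmarLetterRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0; // no letters at all, so nothing to classify
        }
        long myanmar = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.MYANMAR)
                .count();
        return (double) myanmar / letters;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java MyanmarTestFileCheck <path-to-my.test>");
            return;
        }
        Path path = Paths.get(args[0]);
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        long nonBlank = lines.stream().filter(l -> !l.trim().isEmpty()).count();
        double ratio = myanmarLetterRatio(String.join("\n", lines));
        System.out.printf("non-blank lines: %d, Myanmar letter ratio: %.2f%n", nonBlank, ratio);
    }
}
```

A low ratio would suggest the file contains stray Latin text (for example, headings left over from the source page) that could skew the language detection test.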
@kkrugler Thanks for that information. I'll open a pull request to add an appropriate test case for Myanmar and a few other languages that were introduced.
Any technical objections to using UDHR Burmese translated text as the testcase?
https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=bms
@arky - re using UDHR text: that's fine, but as per the Permissions section on https://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx, you would need to add an attribution to the end of the Tika top-level LICENSE.txt file (see the other examples of test data in that file).
@arky can you please update this PR so we can review and attempt to merge into main? Thank you
@arky can you please rebase?