[TIKA-3340] LanguageProfile for Myanmar

Open arky opened this issue 3 years ago • 7 comments

Adds Myanmar LanguageProfile for Apache Tika https://issues.apache.org/jira/browse/TIKA-3340

arky avatar Mar 30 '21 19:03 arky

Hi @arky - thanks for the PR! Would it be possible to add "my" to the list of languages being tested in LanguageIdentifierTest? You'd also have to add a tika-core/src/test/resources/org/apache/tika/language/my.test file containing Burmese text.

kkrugler avatar Mar 30 '21 22:03 kkrugler

@kkrugler I'll be happy to contribute test cases for Myanmar. Can you please tell me more about how to do this?

Is just adding a 'lang_code.test' file with 100 lines of Myanmar text enough? https://github.com/apache/tika/tree/main/tika-core/src/test/resources/org/apache/tika/language

How do I verify this testcase? Just 'mvn run tests...'

arky avatar Mar 31 '21 08:03 arky

Hi @arky, you also need to edit the LanguageIdentifierTest.java file to add "my" to the list of languages, like this:

    private static final String[] languages = new String[] {
        // TODO - currently Estonian and Greek fail these tests.
        // Enable when language detection works better.
        "da", "de", /* "et", "el", */ "en", "es", "fi", "fr", "it",
        "lt", "my", "nl", "pt", "sv"
    };

And then run mvn clean test from the tika/tika-core directory.
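
Before running the full test suite, it can help to confirm that every language code in the array has a matching test file. The following is only a rough sketch of such a check, not part of Tika: the class name LanguageTestFileCheck and the method missingTestFiles are hypothetical, and it assumes each language's test data lives in a file named <code>.test in the resource directory mentioned above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reports language codes
// that have no non-empty "<code>.test" file in a resource directory.
public class LanguageTestFileCheck {

    /** Returns the codes whose "<code>.test" file is missing or empty, in input order. */
    static List<String> missingTestFiles(Path resourceDir, String[] languages) throws IOException {
        List<String> missing = new ArrayList<>();
        for (String lang : languages) {
            Path testFile = resourceDir.resolve(lang + ".test");
            // A usable test file must exist as a regular file and contain some data.
            if (!Files.isRegularFile(testFile) || Files.size(testFile) == 0) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location, relative to the top of a Tika checkout.
        Path dir = Paths.get(args.length > 0
                ? args[0]
                : "tika-core/src/test/resources/org/apache/tika/language");
        // Same codes as the enabled entries in the array above.
        String[] languages = {
            "da", "de", "en", "es", "fi", "fr", "it",
            "lt", "my", "nl", "pt", "sv"
        };
        System.out.println("Missing or empty test files: " + missingTestFiles(dir, languages));
    }
}
```

If "my" shows up in the printed list, the my.test file is not where this sketch expects it.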

kkrugler avatar Mar 31 '21 14:03 kkrugler

@kkrugler Thanks for that information. I'll open a pull request adding appropriate test cases for Myanmar and the few other languages that were introduced.

Any technical objections to using UDHR Burmese translated text as the testcase?

https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=bms

arky avatar Mar 31 '21 15:03 arky

@arky - re using UDHR text...that's fine, but as per the Permissions section on https://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx, you would need to add attribution to the end of the Tika top-level LICENSE.txt file (see other examples in that file of test data).

kkrugler avatar Mar 31 '21 16:03 kkrugler

@arky can you please update this PR so we can review and attempt to merge into main? Thank you

lewismc avatar May 14 '21 15:05 lewismc

@arky can you please rebase?

lewismc avatar Feb 06 '22 02:02 lewismc