Xponents
Test Latest TextTagger in other languages/scripts
Describe the bug
TextTagger usage with languages other than English.
To Reproduce
- Java or Python version: Any Java (openjdk 8 and 12)
- Usage: Arabic text produces a "zero-length token" exception from TextTagger process()
- Data input:
- Did you enable logging (level = DEBUG)?
- Other notes:
15:59:47.288 [main] ERROR org.apache.solr.handler.RequestHandlerBase - java.lang.IllegalArgumentException: term: analyzed to a zero-length token
at org.apache.solr.handler.tagger.Tagger.process(Tagger.java:142)
at org.apache.solr.handler.tagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:231)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:191)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.opensextant.extraction.SolrMatcherSupport.tagTextCallSolrTagger(SolrMatcherSupport.java:181)
at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:444)
at org.opensextant.extractors.geo.GazetteerMatcher.tagText(GazetteerMatcher.java:404)
at org.opensextant.extractors.geo.PlaceGeocoder.extract(PlaceGeocoder.java:475)
at org.opensextant.extractors.test.TestPlaceGeocoder.tagFile(TestPlaceGeocoder.java:57)
at org.opensextant.extractors.test.TestPlaceGeocoder.main(TestPlaceGeocoder.java:164)
Expected behavior
More reasonable behavior is expected from TextTagger. It's possible the whole Solr 7.x assembly needs to be replaced with a clean setup and the data fully reindexed.
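One thing that may be worth checking before a full rebuild: the exception message prints nothing after "term:", which suggests that some token in the input was reduced to an empty string during analysis. This is an assumption, not something the trace confirms. A token made up only of tatweel (U+0640) or Arabic diacritics would be one candidate, since Arabic normalization filters typically remove both. The sketch below is a plain-Java stand-in that flags such tokens in an input string. The class and method names are hypothetical, and the normalization is a rough approximation, not the actual Solr analyzer chain configured for the gazetteer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Xponents or Solr.
class ZeroLengthTokenCheck {

    // Rough stand-in for an analysis chain (assumption): decompose the token,
    // drop combining marks (e.g. Arabic harakat) and drop tatweel (U+0640).
    public static String normalizeToken(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").replace("\u0640", "");
    }

    // Returns the whitespace-separated tokens that become empty after normalization.
    public static List<String> tokensThatVanish(String text) {
        List<String> vanished = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && normalizeToken(token).isEmpty()) {
                vanished.add(token);
            }
        }
        return vanished;
    }

    public static void main(String[] args) {
        // "Baghdad" in Arabic, then a run of tatweel, then a lone fatha.
        String sample = "\u0628\u063A\u062F\u0627\u062F \u0640\u0640\u0640 \u064E";
        for (String token : tokensThatVanish(sample)) {
            StringBuilder codePoints = new StringBuilder();
            token.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println("Token normalizes to empty: " + codePoints.toString().trim());
        }
    }
}
```

If a check like this flags tokens in the failing Arabic file, the problem is more likely in the analyzer configuration for the tagged field than in the index itself, and a reindex alone may not help.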