Different behaviour of JapaneseAnalyzer at indexing time vs. at search time results in no matches for some words. [LUCENE-5905]
A document with the word 秋葉原 in the body, when analysed by the JapaneseAnalyzer (AKA Kuromoji), cannot be found when searching for the same text as a phrase query.
Three programs (for Lucene 5.2.1, 4.9 and 3.6.2) are provided to reproduce the issue. Each prints out the term docs and positions and then the result of parsing the phrase query.
As the output shows, at analysis time there is a single Japanese term, 秋葉原. At query parsing time there are three terms: 秋葉 and 秋葉原 at position 0, and 原 at position 1. Because all terms must be present for the phrase query to match, the query never matches, which is quite a serious issue for us.
Any workarounds, no matter how hacky, would be extremely helpful at this point.
My guess is that this is a quirk with the analyser. If it happened with StandardAnalyzer, surely someone would have discovered it before I did.
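The position overlap is easy to see by running the analyser directly over the query text. Below is a minimal sketch of such a token dump against the 5.2.1 API (not part of the original reproductions; the class name DumpTokens is just illustrative):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
public class DumpTokens {
public static void main(String[] args) throws IOException {
try (Analyzer analyser = new JapaneseAnalyzer();
TokenStream stream = analyser.tokenStream("content", "\u79CB\u8449\u539F")) { // 秋葉原
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
// a position increment of 0 means the token overlaps the previous one at the same position
while (stream.incrementToken()) {
System.out.println("term: " + term + ", posInc: " + posInc.getPositionIncrement());
}
stream.end();
}
}
}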
Lucene 5.2.1 reproduction:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
public class LuceneMissingTerms {
public static void main(String[] args) throws Exception {
try (Directory directory = new RAMDirectory()) {
Analyzer analyser = new JapaneseAnalyzer();
try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyser))) {
Document document = new Document();
document.add(new TextField("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO));
writer.addDocument(document);
}
try (IndexReader multiReader = DirectoryReader.open(directory)) {
for (LeafReaderContext leaf : multiReader.leaves()) {
LeafReader reader = leaf.reader();
Terms terms = MultiFields.getFields(reader).terms("content");
TermsEnum termsEnum = terms.iterator();
BytesRef text;
//noinspection NestedAssignment
while ((text = termsEnum.next()) != null) {
System.out.println("term: " + text.utf8ToString());
Bits liveDocs = reader.getLiveDocs();
PostingsEnum postingsEnum = termsEnum.postings(liveDocs, null, PostingsEnum.POSITIONS);
int doc;
//noinspection NestedAssignment
while ((doc = postingsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
System.out.println(" doc: " + doc);
int freq = postingsEnum.freq();
for (int i = 0; i < freq; i++) {
int pos = postingsEnum.nextPosition();
System.out.println(" pos: " + pos);
}
}
}
}
StandardQueryParser queryParser = new StandardQueryParser(analyser);
queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
// quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
System.out.println(query);
TopDocs topDocs = new IndexSearcher(multiReader).search(query, 10);
System.out.println(topDocs.totalHits);
}
}
}
}
Lucene 4.9 reproduction:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
public class LuceneMissingTerms {
public static void main(String[] args) throws Exception {
try (Directory directory = new RAMDirectory()) {
Analyzer analyser = new JapaneseAnalyzer(Version.LUCENE_4_9);
try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_4_9, analyser))) {
Document document = new Document();
document.add(new TextField("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO));
writer.addDocument(document);
}
try (IndexReader multiReader = DirectoryReader.open(directory)) {
for (AtomicReaderContext atomicReaderContext : multiReader.leaves()) {
AtomicReader reader = atomicReaderContext.reader();
Terms terms = MultiFields.getFields(reader).terms("content");
TermsEnum termsEnum = terms.iterator(null);
BytesRef text;
//noinspection NestedAssignment
while ((text = termsEnum.next()) != null) {
System.out.println("term: " + text.utf8ToString());
Bits liveDocs = reader.getLiveDocs();
DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(liveDocs, null);
int doc;
//noinspection NestedAssignment
while ((doc = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
System.out.println(" doc: " + doc);
int freq = docsAndPositionsEnum.freq();
for (int i = 0; i < freq; i++) {
int pos = docsAndPositionsEnum.nextPosition();
System.out.println(" pos: " + pos);
}
}
}
}
StandardQueryParser queryParser = new StandardQueryParser(analyser);
queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
// quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
System.out.println(query);
TopDocs topDocs = new IndexSearcher(multiReader).search(query, 10);
System.out.println(topDocs.totalHits);
}
}
}
}
Lucene 3.6.2 reproduction:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.queryParser.standard.StandardQueryParser;
import org.apache.lucene.queryParser.standard.config.StandardQueryConfigHandler;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;
public class TestJapaneseAnalysis {
@Test
public void testJapaneseAnalysis() throws Exception {
try (Directory directory = new RAMDirectory()) {
Analyzer analyser = new JapaneseAnalyzer(Version.LUCENE_36);
try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyser))) {
Document document = new Document();
document.add(new Field("content", "blah blah commercial blah blah \u79CB\u8449\u539F blah blah", Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(document);
}
try (IndexReader reader = IndexReader.open(directory);
TermEnum terms = reader.terms(new Term("content", ""));
TermPositions termPositions = reader.termPositions()) {
do {
Term term = terms.term();
if (!"content".equals(term.field())) { // field names are interned in 3.x, but equals() is safer than !=
break;
}
System.out.println(term);
termPositions.seek(terms);
while (termPositions.next()) {
System.out.println(" " + termPositions.doc());
int freq = termPositions.freq();
for (int i = 0; i < freq; i++) {
System.out.println(" " +
termPositions.nextPosition());
}
}
} while (terms.next());
StandardQueryParser queryParser = new StandardQueryParser(analyser);
queryParser.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
// quoted to work around strange behaviour of StandardQueryParser treating this as a boolean query.
Query query = queryParser.parse("\"\u79CB\u8449\u539F\"", "content");
System.out.println(query);
TopDocs topDocs = new IndexSearcher(reader).search(query, 10);
assertThat(topDocs.totalHits, is(1));
}
}
}
}
Migrated from LUCENE-5905 by Trejkaz, 1 vote, updated Nov 01 2017.
Environment: Java 8u5
Masaru Hasegawa (migrated from JIRA)
A hacky workaround would be to set outputCompounds in JapaneseTokenizer to false using reflection or something (or simply use EXTENDED or NORMAL mode instead of SEARCH mode; see the sketch below).
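A minimal sketch of the non-reflective variant, assuming the 4.x/5.x JapaneseAnalyzer constructor that takes a mode and that the default stop set/tags are still wanted:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
public class NormalModeAnalyser {
public static Analyzer create() {
// NORMAL mode emits only the single best segmentation, so indexing and query
// parsing see the same terms; SEARCH mode (the default) is what emits the
// overlapping compound + decompounded tokens.
return new JapaneseAnalyzer(
null, // no user dictionary
JapaneseTokenizer.Mode.NORMAL,
JapaneseAnalyzer.getDefaultStopSet(),
JapaneseAnalyzer.getDefaultStopTags());
}
}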
But the root cause looks to be the way JapaneseTokenizer decides whether to emit the second-best segmentation.
It looks like maxCost and the calculated cost differ depending on the following token. Because of that, depending on the token after 秋葉原, it emits the second-best segmentation. But I think the behaviour shouldn't change depending on the next token (we always want consistent output for the same token, don't we?).
If I understand it correctly, if it didn't take the connection cost to the next token into account, it would produce consistent results.
Thoughts?
Trejkaz (migrated from JIRA)
Adding one more case we found, which seems to behave the same in NORMAL/EXTENDED modes as well, but where adding a space at the end somehow changes the tokenisation to split the word 1/2 instead of 2/1.
"加藤木" (no space) =>
加藤 名詞,固有名詞,人名,姓 加藤 カトウ カトー
木 名詞,一般,*,* 木 キ キ
"加藤木 " (one space at the end) =>
加 名詞,固有名詞,地域,国 加 カ カ
藤木 名詞,固有名詞,地域,一般 藤木 フジキ フジキ
(space) 記号,空白,*,* ? ?
This issue is still alive and well in v8.5.1.
The example discovered this time (in this case the mode is set to SEARCH):
- input: 非公開 => tokens:
  - 非 (接頭詞, prefix)
  - 公開 (名詞, noun)
- input: を非公開 => tokens:
  - を (助詞, particle)
  - 非公開 (名詞, noun)
Somehow を preceding the word makes it happen, while は preceding the word does not.
This is also reproducible by typing the same examples into the text field on Atilika's site.
Editing to add that the above was with discardCompoundToken=true. We had tried setting that as a workaround before discovering that what is happening now is that only the compound token comes out, so obviously that setting won't help for this issue.
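For reference, a minimal sketch of how that flag was wired up, assuming the JapaneseTokenizer constructor added for this option in 8.5 (LUCENE-9123):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
public class DiscardCompoundAnalyser extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
// discardPunctuation=true, discardCompoundToken=true. The flag only drops the
// original compound when a second-best segmentation is also emitted; for
// を非公開 the word 非公開 comes out as a single, never-decompounded token,
// so the flag changes nothing here.
Tokenizer tokenizer = new JapaneseTokenizer(null, true, true, JapaneseTokenizer.Mode.SEARCH);
return new TokenStreamComponents(tokenizer);
}
}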