
Alternative for SetNextReader to return all strings

Open mowali opened this issue 6 months ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Describe the documentation issue

PaulVrugt was asking this question, but never got a response to it:

The FieldCache GetStrings method was replaced by GetTerms, but GetTerms requires an AtomicReader. We used to be able to pass an IndexReader into this method, and it returned a string array containing the values. How do I get the same kind of behavior from the GetTerms method?

Is there no way to have the same behavior that GetStrings did in version 3.0.3?

Additional context

Here is the link to that thread: https://github.com/apache/lucenenet/issues/398

mowali avatar Jan 30 '24 17:01 mowali

The Migration Guide covers this very issue with an example:

LUCENE-2380: FieldCache.GetStrings/Index --> FieldCache.GetDocTerms/Index

  • The field values returned when sorting by SortField.STRING are now BytesRef. You can call value.Utf8ToString() to convert back to string, if necessary.

  • In FieldCache, GetStrings (returning string[]) has been replaced with GetTerms (returning a BinaryDocValues instance). BinaryDocValues provides a Get method, taking a docID and a BytesRef to fill (which must not be null), and it fills it in with the reference to the bytes for that term.
    If you had code like this before:

    string[] values = FieldCache.DEFAULT.GetStrings(reader, field);
    ...
    string aValue = values[docID];
    

    you can do this instead:

    BinaryDocValues values = FieldCache.DEFAULT.GetTerms(reader, field);
    ...
    BytesRef term = new BytesRef();
    values.Get(docID, term);
    string aValue = term.Utf8ToString();
    

    Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef.

  • Similarly, in FieldCache, GetStringIndex (returning a StringIndex instance, with direct arrays int[] order and String[] lookup) has been replaced with GetTermsIndex (returning a SortedDocValues instance). SortedDocValues provides the GetOrd(int docID) method to lookup the int order for a document, LookupOrd(int ord, BytesRef result) to lookup the term from a given order, and the sugar method Get(int docID, BytesRef result) which internally calls GetOrd and then LookupOrd.
    If you had code like this before:

    StringIndex idx = FieldCache.DEFAULT.GetStringIndex(reader, field);
    ...
    int ord = idx.order[docID];
    String aValue = idx.lookup[ord];
    

    you can do this instead:

    SortedDocValues idx = FieldCache.DEFAULT.GetTermsIndex(reader, field);
    ...
    int ord = idx.GetOrd(docID);
    BytesRef term = new BytesRef();
    idx.LookupOrd(ord, term);
    string aValue = term.Utf8ToString();
    

    Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef.
    SortedDocValues also has a GetTermsEnum() method, which returns an iterator (TermsEnum) over the term values in the index (i.e., iterates ord = 0..ValueCount-1).

Furthermore, if you drill down into the issue LUCENE-2380, there is an explanation for the change: it was done primarily for performance reasons. The field cache no longer stores a string[]; the underlying data is now a byte[], so extra steps are required to get a UTF-8 string.

Do note that you are meant to reuse the BytesRef instance that is passed in to get better performance.
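
For the original question (passing a top-level IndexReader and getting a string[] back), one possible approach is to walk the reader's segments and fill the array yourself. The following is only a sketch, not an API that ships with Lucene.NET: the `FieldCacheCompat` class and its `GetStrings` method are hypothetical names, and it assumes the Lucene.NET 4.8 `GetTerms` overload that takes a `setDocsWithField` flag. It reuses a single BytesRef, as recommended above.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.NET.
public static class FieldCacheCompat
{
    // Rebuilds a string[] indexed by top-level docID, similar in shape to
    // what FieldCache.GetStrings returned in 3.0.3.
    public static string[] GetStrings(IndexReader reader, string field)
    {
        var values = new string[reader.MaxDoc];
        var term = new BytesRef(); // one instance, reused for every document

        // Each leaf is one segment, exposed as an AtomicReader.
        foreach (AtomicReaderContext context in reader.Leaves)
        {
            AtomicReader leaf = context.AtomicReader;

            // Assumption: the overload with the setDocsWithField flag (4.8).
            BinaryDocValues terms = FieldCache.DEFAULT.GetTerms(leaf, field, false);

            for (int docID = 0; docID < leaf.MaxDoc; docID++)
            {
                terms.Get(docID, term);
                // DocBase maps the segment-local docID to the top-level docID.
                values[context.DocBase + docID] = term.Utf8ToString();
            }
        }

        return values;
    }
}
```

Two caveats with this sketch. Converting every term to a string brings back the memory and CPU cost that LUCENE-2380 set out to remove, so it is probably only reasonable for small indexes or one-off migrations. Also, documents without a value for the field are likely to come back as an empty string rather than null, which may differ from the 3.0.3 behavior. If a single AtomicReader view is all that is needed, SlowCompositeReaderWrapper.Wrap(reader) may be another option, at the cost of slower access.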

NightOwl888 avatar Jan 31 '24 04:01 NightOwl888