lucene
lucene copied to clipboard
Added static factory method for loading VectorValues
I have added a static factory method getVectorValues in VectorValues which returns the EMPTY VectorValues instance if the field is not found in the segment or if its not a vector field it throws an IAE else returns the VectorValues instance. Looking forward to the feedback.
Closes #11462
I'm a bit confused by this change, which is written as if it was trying to work around ghost fields on KNN vectors (fields that exist and that report that they have vectors while
getVectorValuesreturnnull), which I thought we had not allowed on vectors.
As mentioned in the issue this tries to add a static factory methods like there in DocValues and use that instead.
Thank you, I linked this issue in the PR description. So it's not about handling ghosts and we should be able to undo most of the change when field infos have already been checked?
One reasoning is we can have this to be consistent like how we have similar static functions in DocValues. There no code using it right now but this could be used maybe in future?
I'm starting to have more second thoughts about whether we should actually add this new method. On doc values, I'm seeing value because it helps save a few annoying null checks and then everything should work as expected on the empty instances. Saving null checks applies for vectors too, but the EMPTY vectors instance also returns 0 for the dimension() method, which is something I could see some code paths to complain about if they end up seeing different values for the dimension across different segments of the same index. So it's not clear to me that it would prevent more bugs than it would introduce.
but the EMPTY vectors instance also returns 0 for the dimension() method, which is something I could see some code paths to complain about if they end up seeing different values for the dimension across different segments
@jpountz +1. That's an interesting concern we don't have with doc values, and I could also see it causing issues/confusion for users. If we had internal usages for this (outside of tests), maybe we'd have more intuition around what makes more sense for users, but right now, we wouldn't really be a user of this functionality internally (as you point out), so it's really hard for me to know what would be most helpful to users here.
I also took a closer look at a couple places where we're reading VectorValues as a user, in our application built on Lucene. In general, we're expecting the vectors to be of a consistent dimensionality, so we could stop checking for null with this change, but—as you point out—we'd need to start checking dimensionality, and that's probably more confusing.
So, all things considered, without knowing more broadly what would be useful to our users, and having limited data points and intuition that returning this EMPTY VectorValues may be more confusing, I think it would make sense to hold off on adding this for now.
@shubhamvishu I'm sorry that your work didn't result in a merged commit, but it feels like it would be better not to merge this change. Thank you again for looking into this!
No problem at all. I get it it makes sense to not do this right now. Thanks!