spring-ai
Why are embeddings stored as List<Double>?
Hi,
I am wondering why the class Embedding internally stores the vector as List<Double>. I think double[] or float[] would be more memory efficient.
List
First, List is a basic Java collection that can hold only objects, not primitive types such as double or float. Collections allow more flexible usage than arrays of primitive types, and depending on the use case you can choose an implementation such as ArrayList or LinkedList.
If you need to add a value beyond the predefined size of a double[n] array, you must allocate a new double[n+1] and copy the existing values before adding the new one. With a List, you can simply add values through the abstracted add() method.
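To make the difference concrete, here is a minimal sketch of both approaches. The class and method names (GrowthExample, appendToArray) are made up for illustration and are not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spring AI.
final class GrowthExample {

    // Appends a value to a primitive array by allocating a copy one element longer.
    static double[] appendToArray(double[] source, double value) {
        double[] grown = Arrays.copyOf(source, source.length + 1);
        grown[source.length] = value;
        return grown;
    }

    public static void main(String[] args) {
        double[] array = {1.0, 2.0};
        array = appendToArray(array, 3.0); // manual reallocation and copy

        List<Double> list = new ArrayList<>(List.of(1.0, 2.0));
        list.add(3.0); // the list grows on its own; the double is boxed into a Double

        System.out.println(array.length + " " + list.size());
    }
}
```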
Double
A primitive double variable defaults to 0.0. If a result is 0.0, can you tell whether the operation succeeded or failed? In such cases, null can express the absence of a value or an impossible calculation, and the Double wrapper class, which inherits from Object, can represent null.
@youngmoneee I believe that's exactly the point @agoerler is trying to make.
While the list may be more flexible, it is significantly less efficient than a simple float[], which is why pretty much all LLMs and most frameworks that work with them, including Langchain and Langchain4j, use a primitive float[] to represent embeddings.
For one, the additional flexibility offered by a list is completely unnecessary: the operations you need to perform on embeddings, such as the calculations behind similarity search, are just as easy with arrays as with lists, if not easier, and you can leverage SIMD support in modern CPUs to do them more efficiently. Not to mention that you can use classes like FloatBuffer to directly access a float[] allocated in native memory by ONNX, for example. You can't do that with a list; you have to box every single value into a Double on the way in (and likely unbox it into a float on the way out).
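As an illustration, here is a minimal sketch of cosine similarity over primitive arrays, plus copying floats out of a FloatBuffer without boxing. EmbeddingMath and its method names are hypothetical, and the plain loop makes no claim about what the JIT will actually vectorize.

```java
import java.nio.FloatBuffer;

// Hypothetical helper, not part of Spring AI or any library mentioned above.
final class EmbeddingMath {

    // Cosine similarity over two primitive arrays of equal length; nothing is boxed.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Copies the remaining floats of a buffer into a float[] without creating wrapper objects.
    static float[] toArray(FloatBuffer buffer) {
        float[] out = new float[buffer.remaining()];
        buffer.duplicate().get(out); // duplicate() leaves the caller's position untouched
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f, 0f};
        float[] b = {0f, 1f, 0f};
        System.out.println(cosineSimilarity(a, a));
        System.out.println(cosineSimilarity(a, b));
    }
}
```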
But more importantly, storing a 1,536-dimension vector returned by OpenAI, for example, requires 6,144 bytes as a float[] and 36,864 bytes as a List<Double>, for the exact same payload (ignoring the space used by the array or list instance itself, which is approximately the same and irrelevant in this case). That's 6x the memory cost, for no good reason, as it is neither faster nor easier to work with. Quite the opposite, actually.
Ultimately, nobody will ever store only a handful of vectors in a vector store, so these size differences add up in a hurry, and you'll need 6x the memory to store the exact same data using List<Double>. There seems to be a trend towards models that can create quantized vectors that use a single byte per dimension (int8/uint8) or even a single bit per dimension (binary), in order to reduce the space required for vector storage by 4x or 32x without significant accuracy loss (take a look at Cohere, for example), so going in the other direction and making vectors 6x bigger than necessary seems like a bad idea.
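The arithmetic behind these figures can be sketched as follows. The 24 bytes per list element is an assumption derived from the totals quoted above (36,864 / 1,536); the real per-element cost depends on the JVM and its settings. EmbeddingFootprint and its method names are hypothetical.

```java
// Hypothetical helper that reproduces the payload arithmetic from the comment above.
final class EmbeddingFootprint {

    // Assumption: per-element cost of a boxed list entry, inferred from the quoted totals.
    static final int ASSUMED_BOXED_ELEMENT_BYTES = 24;

    // Payload of a float[] with the given number of dimensions (4 bytes per float).
    static long floatArrayPayloadBytes(int dimensions) {
        return (long) dimensions * Float.BYTES;
    }

    // Payload of a List<Double> under the per-element assumption above.
    static long boxedListPayloadBytes(int dimensions) {
        return (long) dimensions * ASSUMED_BOXED_ELEMENT_BYTES;
    }

    public static void main(String[] args) {
        int dimensions = 1_536;
        long asFloats = floatArrayPayloadBytes(dimensions); // 6,144
        long asBoxed = boxedListPayloadBytes(dimensions);   // 36,864
        System.out.println(asFloats + " vs " + asBoxed + " bytes, ratio " + (asBoxed / asFloats));
    }
}
```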
The bottom line is that for some of us implementing vector stores, especially in-memory vector stores, the difference between using primitive arrays/buffers and collections containing boxed wrapper types is so significant that the latter is a non-starter. The only way for us to support Spring AI at the moment would be to convert from List<Double> to a float[] on the way in, and the other way around on the way out. That is certainly doable, but is not free, and most importantly, it shouldn't be necessary.
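For reference, the conversion in both directions would look roughly like the sketch below. EmbeddingConversions and its method names are hypothetical; every call allocates and copies the whole vector, which is the cost mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical adapter, not part of Spring AI.
final class EmbeddingConversions {

    // List<Double> -> float[]: unboxes and narrows every element.
    static float[] toFloatArray(List<Double> values) {
        float[] out = new float[values.size()];
        int i = 0;
        for (Double v : values) {
            out[i++] = v.floatValue(); // throws NullPointerException if an element is null
        }
        return out;
    }

    // float[] -> List<Double>: widens and boxes every element.
    static List<Double> toDoubleList(float[] values) {
        List<Double> out = new ArrayList<>(values.length);
        for (float v : values) {
            out.add((double) v);
        }
        return out;
    }
}
```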
@youngmoneee As for "the ability to store null values" in a list, but not in the array, that argument makes no sense in this context. Embeddings are by and large dense vectors, so there shouldn't be any missing values, and you can't do calculations with null anyway.
For cases where sparse vectors are used, there are better data structures than either a primitive array or a list of objects, but embeddings are not such a use case.
@youngmoneee I believe that's exactly the point @agoerler is trying to make.
Yes.
I think float[] would be much more memory efficient than List<Double>. Moreover, I doubt that the option to store null would ever be needed. As pointed out by @aseovic, float[] would be more compatible with, e.g., Langchain4j and would hence avoid costly conversions.
Also, I think vector embeddings are typically not manipulated but rather used in semantic similarity search. I don't see how the List interface is useful when dealing with vector embeddings.
Hi @aseovic, @agoerler
I apologize for not considering the specific context of embeddings and for making an incorrect judgment based solely on general cases. Even though there is overhead in memory and computation, I thought it would not be a significant issue on modern servers, especially compared to the bottlenecks caused by I/O-bound tasks.
However, after considering your comments, I agree that regardless of the magnitude of the impact, performing unnecessary wrapping/unwrapping and taking up additional memory and processing time is indeed unwarranted.
Solved in https://github.com/spring-projects/spring-ai/pull/1002