Douglas Hanley
I'm seeing similar issues here with `uint8` → `float16` (or `float32`). Using nightly with an A6000. The application is quantized matrix multiplication. I've found that basically only a block size...
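For context, block-wise quantization stores each group of weights as small integers plus one scale per block, and the matmul kernel dequantizes on the fly. The sketch below is purely illustrative: the symmetric zero-point of 128 and the `block_size=32` default are assumptions for demonstration, not the actual parameters of the kernel in question.

```python
def dequantize_blocks(qvals, scales, block_size=32):
    # qvals: flat list of uint8 quantized values (0..255)
    # scales: one float scale per block of `block_size` values
    # Assumes a symmetric scheme with zero-point 128 (illustrative only):
    # each element is reconstructed as (q - 128) * scale.
    out = []
    for i, q in enumerate(qvals):
        scale = scales[i // block_size]
        out.append((q - 128) * scale)
    return out
```

The dtype of the scales (here plain Python floats standing in for `float16`/`float32`) is exactly where conversion issues like the one above can surface.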
Yeah, right now we don't support getting token-level embeddings. So generative models like llama-2 that lack pooling layers won't work. Are you looking for token-level embeddings or sequence...
@r3v1 Is it still raising an error, or is it just that it's returning token level embeddings as a list of lists? Generative models like these don't do pooling intrinsically...
Yeah, the langchain interop code is unfortunately broken right now for getting embeddings from generative models. For it to work in this case, we'd need to implement manual pooling somewhere....
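The manual pooling mentioned above could look something like this: collapse the per-token vectors a generative model returns into one sequence embedding by averaging. This is just a sketch of mean pooling in plain Python; the function name and the optional mask argument are my own, not part of any existing API.

```python
def mean_pool(token_embeddings, attention_mask=None):
    # token_embeddings: list of per-token vectors (list of lists of floats),
    # i.e. the "list of lists" a generative model returns per sequence.
    # attention_mask: optional list of 0/1 flags marking real (non-pad) tokens.
    if attention_mask is None:
        attention_mask = [1] * len(token_embeddings)
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    # average only over the unmasked tokens
    return [x / count for x in total]
```

Other pooling choices (last-token, max) drop in the same way; mean pooling is just the most common default for sentence embeddings.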
Thanks for the comments @abetlen! Yeah, so I think that this is basically a superset of (1) right now. If you call `create_completion_parallel(n*[prompt])` you'll get back `n` independent responses for...
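The calling convention described above can be sketched as follows. Note that `create_completion_parallel` is the API under discussion here, not a released function, so its call is left commented out; the runnable part just shows how `n*[prompt]` builds the batch.

```python
# Hypothetical sketch of the proposed batched call: passing the same prompt
# n times should yield n independent completions, one per list entry.
n = 4
prompt = "Once upon a time"
prompts = n * [prompt]  # list of n identical prompts
# responses = llm.create_completion_parallel(prompts)  # proposed API, one response per prompt
```

With distinct prompts in the list you'd get the more general batched case, which is why this reads as a superset of (1).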