Investigate: compute embeddings via CoreML model
(Splitting out this discussion from #17; putting it here to document what I tried in case someone else wants to follow up)
I attempted to convert the gte-small model from HuggingFace from PyTorch to CoreML and integrate it into rem.
Attempt #1: just use the CoreML model that somebody uploaded to the HF repo a few weeks ago
Result: I was able to easily get a tokenizer imported via swift-transformers, and import the CoreML model, but the actual model prediction resulted in NaNs.
Attempt #2: convert the model myself using the huggingface exporters project
Result: conversion fails in the validation phase, because it outputs NaNs... (see a pattern here? :) )
Attempt #3: manual conversion by following the coremltools documentation
Result: kind of a few different things, but mostly: NaNs.
I'm unclear whether converting a PyTorch model specifically for embeddings is something that's supported/intended by coremltools. The models they include seem much more complicated than a BERT embedding model should be, but :shrug:.
After a lot of poking and tweaking of inputs, I was able to get the PyTorch model loaded into CoreML in fp16 format (it was defaulting to fp32 for some reason -- I think that's why the model uploaded to HF was so big to begin with). At that point I started getting fp32 <--> fp16 compatibility errors from coremltools, which is a definite improvement, but... still not functional.
Error:
... snip ...
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py", line 190, in __init__
    self._validate_and_set_inputs(input_kv)
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/operation.py", line 503, in _validate_and_set_inputs
    self.input_spec.validate_inputs(self.name, self.op_type, input_kvs)
  File "/Users/robertgay/.pyenv/versions/exporters/lib/python3.10/site-packages/coremltools/converters/mil/mil/input_type.py", line 137, in validate_inputs
    raise ValueError(msg)
ValueError: In op, of type layer_norm, named input.5, the named input `epsilon` must have the same data type as the named input `gamma`. However, epsilon has dtype fp32 whereas gamma has dtype fp16.
Summary
So... I'm going to table this for now, given that there's already a more flexible/probably less finicky alternative (the rust lib + bindings). It was fun while it lasted, but there are only so many hours in the day. 😅
(Feel free to close this, I just didn't want to crap up #17 given that there are ~3 discussions happening there right now.)
Super appreciate the investigation!
Crazy that it's so difficult. FWIW, I found a bug in my thinking: I was paying too much attention to the C bindings and not enough to the embedding logic itself, and forgot to take the mean 😅 It's now fixed in https://github.com/jasonjmcghee/rust_embedding_lib
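For anyone following along, "taking the mean" here most likely refers to mean pooling: averaging the per-token vectors from the model's last hidden state into one sentence embedding, skipping padding tokens. Below is a minimal pure-Python sketch of that idea. It is not the code from rust_embedding_lib; the function name `mean_pool` and the list-of-lists input shape are assumptions made for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors into one embedding, ignoring padding.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding.
    Both names are hypothetical; a real model returns tensors instead.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")

    # Keep only the vectors whose mask entry is non-zero.
    kept = [vec for vec, flag in zip(token_embeddings, attention_mask) if flag]
    if not kept:
        raise ValueError("no unmasked tokens to average")

    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens plus one padding token; the padding vector is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without this step the model output is one vector per token rather than one vector per input text, so downstream similarity search has nothing sensible to compare.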
Still haven't taken the time to do the final step to make it a framework.
I met the creator of https://github.com/unum-cloud/usearch today and they have a Swift offering.
Could be another option instead of @sqlite-vss.
Here's a model they built that supports images and text. https://huggingface.co/unum-cloud/uform-vl-english
@roblg if you're interested in taking another shot at this, https://github.com/ashvardanian/SwiftSemanticSearch looks super promising!