
Running Embeddings encoding locally?

Open stephansturges opened this issue 2 years ago • 9 comments

Is there any plan to enable the deployment of a model locally to compute embeddings on tokenized text?

I'm currently using "text-embedding-ada-002" via the API and it's fine, but I'm trying to parse indexes with >1M items and building such an index using web requests is a pain on many levels, and I'd love to find a better-performing way to do this in the future.

stephansturges avatar Feb 09 '23 15:02 stephansturges

No plans for a local model. (That would be more complicated to get working than an API call.)

What are the pain points you'd like to see fixed?

ted-at-openai avatar Feb 09 '23 18:02 ted-at-openai

The API call works fine. I'm just trying to get embeddings for a pandas dataframe with ~500K entries, and even with parallelization it's going to take about 2 days, assuming no connection trouble (which is a mess to unwind). It would be kind of cool to have a docker image that could be used to calculate embeddings locally, even if it requires a bunch of GPUs... some of us could use it :)

stephansturges avatar Feb 09 '23 19:02 stephansturges

At the moment I'm testing with subsets of 1000 units, and I regularly hit one of these: [screenshot of API error] This throws a wrench in the whole process...

stephansturges avatar Feb 09 '23 19:02 stephansturges

I'm sorry, that's annoying. How frequently are you hitting it? Can you do exponential backoff and resume? Will escalate to eng team.
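The exponential backoff Ted suggests can be sketched with nothing but the standard library: retry the call on failure, doubling the wait each time with a little random jitter so parallel workers don't all retry in lockstep. This is a generic sketch, not the cookbook's implementation.

```python
import random
import time

def with_backoff(fn, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on any exception with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 10))
```

Wrapping each embeddings request in `with_backoff` lets a long-running job absorb transient rate-limit or connection errors and resume instead of dying mid-dataframe.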

ted-at-openai avatar Feb 09 '23 20:02 ted-at-openai

It doesn't appear to be correlated with the rate at which I'm making requests, but rather with the general level of saturation of the OpenAI API. I can chunk the dataframe into 10-row increments sequentially and it will still hit a snag; at most I've gotten about 40k rows in before it craps out. The error is also annoying because it seems to block automatic retries of the web request. I'll catch it next time and post it here.

stephansturges avatar Feb 09 '23 20:02 stephansturges

BTW I'm making a movie recommendation app :) [screenshots of the movie recommendation app]

stephansturges avatar Feb 09 '23 21:02 stephansturges

Here's a script I wrote for mass processing embeddings in case it's helpful: https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py

ted-at-openai avatar Feb 09 '23 21:02 ted-at-openai

Thanks Ted, that looks great! I'll give it a try. I'd still love to do this locally at some point; I would like to be able to have 1000x more data embedded soon 😅

stephansturges avatar Feb 10 '23 06:02 stephansturges

Seems to be working great. It throws this error on every new start but it's gathering the embeddings ok.

Traceback (most recent call last):
  File "/Users/stephansturges/GPTs/FlixGPT/api_request_parallel_processor.py", line 302, in call_API
    append_to_jsonl([self.request_json, self.result], save_filepath)
  File "/Users/stephansturges/GPTs/FlixGPT/api_request_parallel_processor.py", line 322, in append_to_jsonl
    json_string = json.dumps(data)
                  ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ClientConnectorError is not JSON serializable
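The traceback shows the script attempting to `json.dumps` an aiohttp `ClientConnectorError` object that was stored as a result. One common fix (a sketch, not the cookbook script's actual code; `append_to_jsonl_safe` is a hypothetical name) is to pass `default=repr` so anything JSON can't serialize is written as its string representation instead of crashing the writer:

```python
import json

def append_to_jsonl_safe(data, filepath: str) -> None:
    """Append one JSON line; non-serializable objects (e.g. exceptions)
    are stringified via repr() instead of raising TypeError."""
    json_string = json.dumps(data, default=repr)
    with open(filepath, "a") as f:
        f.write(json_string + "\n")
```

With this change the failed request's error message still lands in the output file, where it can be grepped for and retried later.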

stephansturges avatar Feb 10 '23 08:02 stephansturges

Thanks @ted-at-openai, this actually worked great: I let it run overnight and got everything I need (>9 GB of embeddings 🤣). FYI, there is an issue with UTF-8 encoded characters messing up the tokenizer at some point. I'm going to file a PR with a quick-and-dirty fix, but it could do with some deeper investigation.
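The exact tokenizer failure isn't shown in the thread, but a common mitigation for malformed UTF-8 in scraped text (lone surrogates, unpaired code units) is to round-trip it through an encode with `errors="replace"` before tokenizing. This is a generic sketch under that assumption, not the fix filed in the PR:

```python
def clean_text(s: str) -> str:
    """Replace characters that can't be encoded as valid UTF-8
    (e.g. lone surrogates from bad scrapes) with '?', leaving
    legitimate non-ASCII text untouched."""
    return s.encode("utf-8", errors="replace").decode("utf-8")
```

Running each row through a pass like this before tokenization keeps one bad record from killing a multi-hour embedding job.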

BTW I'd love to talk to someone at OpenAI about deploying the movie recommendation app I've made with this 😉 Any chance you can put me in contact with someone?


stephansturges avatar Feb 11 '23 10:02 stephansturges

Closed; will append the PR.

stephansturges avatar Feb 11 '23 11:02 stephansturges

FYI here is the app @ted-at-openai
https://gptflix.streamlit.app/

stephansturges avatar Feb 13 '23 19:02 stephansturges

BTW I'd love to talk to someone at OpenAI about deploying the movie recommendation app I've made with this 😉 Any chance you can put me in contact with someone?

We are pretty swamped these days, unfortunately. What's the ask?

ted-at-openai avatar Feb 14 '23 01:02 ted-at-openai

Sorry, nothing: I deployed it myself in the end to play with Pinecone and Streamlit! I was thinking it would make a cool demo that OpenAI could publish, showing how to build a massive DB on Pinecone (this one fills up an S1 pod) and do context injection at scale, but I'm sure you're up to your ears in fun demos and don't have time for more! I'll make a Loom video / tutorial explaining it all in the next few weeks, if I can fit it in between travel for completely unrelated work... there are some aspects of getting hundreds of thousands of embeddings and being able to use them in a vector DB that are not super intuitive yet 😄 Your parallel embeddings retrieval script was super helpful, however!

You can play with the demo at -> www.gptflix.ai

stephansturges avatar Feb 14 '23 11:02 stephansturges