support pg_vector
Background
If we support pg_vector, we'll get proper handling of chunked documents out of the box. The implementation should be trivial. What is not trivial is how to represent vectors/embeddings efficiently, as we'll require users to embed certain columns themselves.
Requirements
- [ ] research what is the best way to represent embeddings. is it `complex`? (soon `json`). looks like that, they are just lists or np.arrays
- [ ] add `postgres` adapter to mark certain columns as vectors, not unlike e.g. the lancedb adapter. ie. `x-postgres-data-type: vector` (see the sketch after this list)
- [ ] research which file formats we should support. insert-values and csv seem to work out of the box
- [ ] change type mapper in postgres so "logical" types are created instead of the physical ones
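A minimal sketch of what the adapter item could look like. Both `pgvector_adapter` and the `x-postgres-data-type` hint are proposals from this issue, not existing dlt API; the resource and values are made up:

```python
import dlt

@dlt.resource
def documents():
    # illustrative resource: `embedding` holds pre-computed vectors as lists of floats
    yield {"id": 1, "text": "hello", "embedding": [0.1, 0.2, 0.3]}

def pgvector_adapter(resource, vector_columns):
    # hypothetical helper, modeled on the lancedb adapter: tag columns so the
    # postgres destination creates them with the `vector` type
    for col in vector_columns:
        resource.apply_hints(columns={col: {"name": col, "x-postgres-data-type": "vector"}})
    return resource

pipeline = dlt.pipeline(pipeline_name="docs", destination="postgres")
pipeline.run(pgvector_adapter(documents(), vector_columns=["embedding"]))
```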
This is a good guide: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/
- shows how to enable pg vector
- shows how to insert embeddings as np.array with execute_values (via https://www.psycopg.org/docs/extras.html#psycopg2.extras.execute_values and list adaptations)
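Condensing the guide into a runnable snippet, assuming a local database and a 3-dimensional `documents.embedding` column (both illustrative). np.arrays are adapted to pgvector's `'[...]'` text literal:

```python
import numpy as np
import psycopg2
from psycopg2.extensions import AsIs, register_adapter
from psycopg2.extras import execute_values

# adapt np.ndarray to pgvector's text literal, e.g. '[0.1, 0.2, 0.3]'
register_adapter(np.ndarray, lambda arr: AsIs("'%s'" % arr.tolist()))

conn = psycopg2.connect("dbname=test")  # illustrative connection string
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS documents (id int, embedding vector(3))")
    rows = [(1, np.array([0.1, 0.2, 0.3])), (2, np.array([0.4, 0.5, 0.6]))]
    # execute_values batches all rows into a single INSERT
    execute_values(cur, "INSERT INTO documents (id, embedding) VALUES %s", rows)
conn.commit()
```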
Optional challenge: allow inserting pandas frames and arrow tables directly. The arrow table should contain a column with a list of floats. A copy job in postgres would read the parquet in batches and use execute_values to insert them; as a bonus we get parquet insert into postgres.
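A rough sketch of that copy job under the assumptions above; the parquet file, its `id`/`embedding` columns, and the table name are made up:

```python
import psycopg2
import pyarrow.parquet as pq
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test")  # illustrative
pf = pq.ParquetFile("documents.parquet")
with conn.cursor() as cur:
    # read the parquet file in batches instead of loading it whole
    for batch in pf.iter_batches(batch_size=1000):
        rows = [
            # a python list of floats renders as '[0.1, 0.2]', which postgres
            # parses with pgvector's input function on insert
            (row["id"], str(row["embedding"]))
            for row in batch.to_pylist()
        ]
        execute_values(cur, "INSERT INTO documents (id, embedding) VALUES %s", rows)
conn.commit()
```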