
support pg_vector

Open · rudolfix opened this issue 6 months ago · 0 comments

Background

If we support pg_vector we'll get proper handling of document chunking out of the box. The implementation should be trivial. What is not trivial is how to represent vectors/embeddings in an efficient manner, as we'll require users to embed certain columns themselves.

Requirements

    • [ ] research the best way to represent embeddings. Is the complex (soon json) type the right fit? It looks like it: embeddings are just lists or np.arrays
    • [ ] add a postgres adapter to mark certain columns as vectors, not unlike e.g. the lancedb adapter, i.e. x-postgres-data-type: vector (see the sketch after this list)
    • [ ] research which file formats we should support. insert-values and csv seem to work out of the box
    • [ ] change the type mapper in postgres so "logical" types are created instead of physical ones
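
A minimal sketch of what such an adapter could look like, modeled on dlt's existing lancedb_adapter. Note that `postgres_adapter`, the `VECTOR_HINT` constant, and the column/table names are hypothetical; only the `x-postgres-data-type` hint itself comes from the proposal above.

```python
from typing import Any, Sequence, Union

import dlt
from dlt.extract import DltResource

VECTOR_HINT = "x-postgres-data-type"  # hint key proposed above


def postgres_adapter(
    data: Any, vector: Union[str, Sequence[str], None] = None
) -> DltResource:
    """Mark columns that hold embeddings so the postgres destination
    can create them with the pgvector `vector` type."""
    # wrap raw data in a resource, like dlt's other adapters do
    if not isinstance(data, DltResource):
        data = dlt.resource(data, name="content")
    if vector:
        names = [vector] if isinstance(vector, str) else list(vector)
        data.apply_hints(
            columns={n: {"name": n, VECTOR_HINT: "vector"} for n in names}
        )
    return data


# usage: the user embeds the column themselves, e.g. with an OpenAI client
docs = [{"id": 1, "text": "hello", "embedding": [0.1, 0.2, 0.3]}]
pipeline = dlt.pipeline(pipeline_name="pgvector_demo", destination="postgres")
pipeline.run(postgres_adapter(docs, vector="embedding"), table_name="documents")
```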

This is a good guide: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/

  • shows how to enable pgvector
  • shows how to insert embeddings as np.array with execute_values (via https://www.psycopg.org/docs/extras.html#psycopg2.extras.execute_values and list adaptations)
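
A condensed sketch of that insert path, assuming a local postgres with the pgvector extension installed; the DSN, table, and column names are placeholders. The manual np.ndarray adapter below stands in for the list adaptation the guide describes (the pgvector-python package also ships a ready-made register_vector helper).

```python
import numpy as np
import psycopg2
from psycopg2.extensions import AsIs, register_adapter
from psycopg2.extras import execute_values


# adapt np.ndarray to a pgvector text literal such as '[0.1,0.2,0.3]'
def adapt_ndarray(arr: np.ndarray) -> AsIs:
    return AsIs("'[" + ",".join(str(x) for x in arr.tolist()) + "]'")


register_adapter(np.ndarray, adapt_ndarray)

conn = psycopg2.connect("dbname=test user=loader")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id int, embedding vector(3))")
    rows = [(1, np.array([0.1, 0.2, 0.3])), (2, np.array([0.4, 0.5, 0.6]))]
    execute_values(cur, "INSERT INTO docs (id, embedding) VALUES %s", rows)
```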

Optional challenge

Allow inserting pandas frames and arrow tables directly. The arrow table should contain a column with a list of floats. A copy job in postgres would read the parquet in batches and use execute_values to insert them; as a bonus we get parquet insert into postgres.
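
A sketch of that copy job under the same assumptions (placeholder file, DSN, and column names): pyarrow streams the parquet file in record batches and each batch is flushed with execute_values. Embeddings arrive as Python lists of floats and are serialized to pgvector's text form.

```python
import psycopg2
import pyarrow.parquet as pq
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test user=loader")  # placeholder DSN
pf = pq.ParquetFile("documents.parquet")  # columns: id, embedding (list<float>)
with conn, conn.cursor() as cur:
    # iterate record batches instead of loading the whole file into memory
    for batch in pf.iter_batches(batch_size=1000):
        rows = [
            # render the float list as a pgvector literal, e.g. '[0.1,0.2]'
            (r["id"], "[" + ",".join(map(str, r["embedding"])) + "]")
            for r in batch.to_pylist()
        ]
        execute_values(cur, "INSERT INTO docs (id, embedding) VALUES %s", rows)
```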

rudolfix · Aug 21 '24 21:08