feat: Support Indexing options for Astra DB columns

Open erichare opened this issue 2 months ago • 10 comments

This pull request adds support for specifying indexing options for individual columns in Astra DB, so users can avoid having long text columns indexed by default.

erichare avatar Apr 22 '24 14:04 erichare

@potter-potter Keeping this as a draft for now, as it's a fairly substantial restructuring of the initialization process. We've had users run into issues with the integration because some of their columns hold very long text and get indexed by default. The goal of this PR is to let users specify at creation time which columns should not be indexed, because Astra has a limit internally.

Would love to run the lint and other checks on this if possible! I tried to run as much as possible locally for now.

erichare avatar Apr 22 '24 15:04 erichare

Marked it as ready for review now after some testing internally with the team. The primary change here is that users can now choose which Astra DB fields are indexed. By default, we deny indexing on the metadata field (which can sometimes be very long due to the parsed HTML from PDFs), but users can override this either in advance or at collection creation time.
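For anyone skimming, here is a rough sketch of the default-deny behaviour described above. It is not the code in this PR: the helper name `resolve_indexing_policy`, the constant `DEFAULT_INDEXING_POLICY`, and the exact shape `{"deny": ["metadata"]}` are assumptions for illustration, and the rule that a policy carries only one of `allow` or `deny` reflects my understanding of how Astra DB collection indexing options work.

```python
from typing import Dict, List, Optional

# Assumed default: do not index the (potentially very long) metadata field.
DEFAULT_INDEXING_POLICY: Dict[str, List[str]] = {"deny": ["metadata"]}


def resolve_indexing_policy(
    requested: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Return the indexing policy to use when creating a collection.

    Hypothetical helper: falls back to the default deny-list when the
    caller gives nothing, otherwise validates the caller's override.
    """
    if requested is None:
        # Copy so callers cannot mutate the module-level default.
        return {key: list(fields) for key, fields in DEFAULT_INDEXING_POLICY.items()}

    keys = set(requested)
    # Assumption: a policy names exactly one of "allow" or "deny".
    if len(keys) != 1 or not keys <= {"allow", "deny"}:
        raise ValueError(
            "indexing policy must contain exactly one of 'allow' or 'deny'"
        )
    return {key: list(fields) for key, fields in requested.items()}
```

A caller who wants only a `content` field indexed would pass `{"allow": ["content"]}`; passing nothing keeps the metadata deny-list.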

erichare avatar Apr 22 '24 18:04 erichare

@erichare I'll check this out tomorrow. Thanks.

potter-potter avatar Apr 23 '24 00:04 potter-potter

Thanks @potter-potter !

erichare avatar Apr 24 '24 15:04 erichare

@potter-potter just tried to address your comments. agree with all of them and explained a little for the prior (misguided, lol) motivation :) let me know if this looks better!

erichare avatar Apr 25 '24 20:04 erichare

@erichare This is looking good. I can take over once you make the little dict change.

potter-potter avatar Apr 27 '24 17:04 potter-potter

Thanks @potter-potter ! I made the update, does it look okay?

erichare avatar Apr 28 '24 00:04 erichare

@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!

potter-potter avatar Apr 29 '24 01:04 potter-potter

Thank you very much!

erichare avatar Apr 29 '24 01:04 erichare

@erichare Just to keep you updated: I have this in a branch and was going to include "feat: Astra DB Source Connector Support" at the same time (better to do everything at once to get it merged). But Astra DB Source has some issues I need to debug, so I'm working on that. Will keep you updated, and I may ask for your help.

https://github.com/Unstructured-IO/unstructured/pull/2964

potter-potter avatar May 02 '24 21:05 potter-potter