feat: Support Indexing options for Astra DB columns
This pull request adds support for specifying indexing options for Astra DB columns, so users can avoid having long text columns indexed by default.
@potter-potter Keeping this as a draft for now, as it's a fairly substantial restructuring of the initialization process. We've had users run into issues with the integration because some of their columns hold very long text and are indexed by default. The goal of this PR is to let users specify at creation time which columns should not be indexed, because Astra has an internal limit.
Would love to run the lint and other checks on this if possible! I tried to run as much as possible locally for now.
Marked this as ready for review after some internal testing with the team. The primary change is added flexibility in which Astra DB fields are indexed. By default, we deny indexing on the metadata field (which can be very long because of the parsed HTML from PDFs), but users can override this either in advance or at collection creation time.
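To illustrate the default described above, here is a minimal sketch, not the code in this PR. The helper name `build_indexing_options` and its parameters are hypothetical, and the `allow`/`deny` dictionary shape is an assumption about the indexing options Astra DB accepts at collection creation.

```python
from typing import Optional, Sequence


def build_indexing_options(
    allow: Optional[Sequence[str]] = None,
    deny: Optional[Sequence[str]] = None,
) -> dict:
    """Hypothetical helper: choose which fields Astra DB should or should not index."""
    # Assumption: an allow list and a deny list cannot be combined.
    if allow is not None and deny is not None:
        raise ValueError("Specify either 'allow' or 'deny', not both.")
    if allow is not None:
        return {"allow": list(allow)}
    if deny is not None:
        return {"deny": list(deny)}
    # Default: skip indexing on the potentially very long metadata field.
    return {"deny": ["metadata"]}


print(build_indexing_options())                   # {'deny': ['metadata']}
print(build_indexing_options(allow=["content"]))  # {'allow': ['content']}
```

Rejecting a combined `allow` and `deny` keeps the override unambiguous; whether the connector should validate this itself or leave it to Astra DB is a design choice.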
@erichare I'll check this out tomorrow. Thanks.
Thanks @potter-potter !
@potter-potter Just tried to address your comments. I agree with all of them, and explained a little of the prior (misguided, lol) motivation :) Let me know if this looks better!
@erichare This is looking good. I can take over once you make the little dict change.
Thanks @potter-potter ! I made the update, does it look okay?
@erichare Looking good! I'll bring it to the finish line tomorrow. Thanks!
Thank you very much!
@erichare Just to keep you updated: I have this in a branch and was going to include "feat: Astra DB Source Connector Support" at the same time, since it's better to do everything at once to get it merged. But the Astra DB Source connector has some issues I need to debug, so I'm working on that. I'll keep you updated and may ask for your help.
https://github.com/Unstructured-IO/unstructured/pull/2964