unstructured icon indicating copy to clipboard operation
unstructured copied to clipboard

First pass at Astra Doc Ingest logic

Open erichare opened this issue 1 year ago • 7 comments

@potter-potter i don't have write access to your branch, so opened up a PR for my changes here

erichare avatar May 03 '24 14:05 erichare

@erichare Let's continue working on this from this branch.

Is this running successfully for you when you test it?

potter-potter avatar May 03 '24 22:05 potter-potter

@potter-potter i had something come up earlier and haven't had a chance to fully test, i will get back to you on that soon. Also sorry for the very undescriptive branch name here haha, I forgot to change the default 😄

I'll follow up soon

erichare avatar May 03 '24 22:05 erichare

I don't think we can fully sort of copy delta-table. I was just trying to point you to something that read from a "database table"-like source. There isn't really a good example of a source connector that parallels AstraDB as far as I can tell.

potter-potter avatar May 03 '24 22:05 potter-potter

Gotchya. I tried to pull out the logic that was specific (i.e., metadata fields, we dont really have a great notion of), but the basic idea of pulling the data, storing as CSV with pandas, and then using that as the input data seemed to make sense for Astra too. but with that said, yeah, let me go ahead and test it more before wasting your time looking at this further 😄 if you see anything obvious to adjust definitely let me know though

erichare avatar May 03 '24 22:05 erichare

It actually looks like this gets the entire table: astra_docs = list(self.astra_db_collection.paginated_find())

In most of the connectors, since they are not table based, get_ingest_docs just gets a pointer to the docs and then we can potentially fan out downloaders to download the docs with get_files

That being said, this is not a "docs" based connector (its table based) so it is a little different. I will look through our source connectors again to see if there is a better connector to model.

potter-potter avatar May 03 '24 22:05 potter-potter

It actually looks like this gets the entire table: astra_docs = list(self.astra_db_collection.paginated_find())

In most of the connectors, since they are not table based, get_ingest_docs just gets a pointer to the docs and then we can potentially fan out downloaders to download the docs with get_files

That being said, this is not a "docs" based connector (its table based) so it is a little different. I will look through our source connectors again to see if there is a better connector to model.

That is a really great point. Yep, it will pull the full table. paginated_find itself can take a filter condition, if there's any way to pass kwargs to get_ingest_docs?

erichare avatar May 03 '24 22:05 erichare

@erichare I haven't had time to really look at other examples. (mainly because I don't think there is one that maps quite right to this one) I'm going to redo the destination PR so we can get it merged and then dive into this source (Ingest) logic later.

potter-potter avatar May 10 '24 14:05 potter-potter

@potter-potter - Do you know if this PR is still being actively worked?

MthwRobinson avatar May 31 '24 15:05 MthwRobinson

@MthwRobinson I got pulled into some other stuff, but would like to get this merged... with that said, we sort of had two simultaneous PRs going, and i think this particular PR is out of date. If @potter-potter agrees, i'll open a new PR with just the Astra DB Source connector logic and we can close this one. I have some updates to make to it anyway as a result of recent changes on the Astra end.

erichare avatar May 31 '24 15:05 erichare

Thanks @erichare !

MthwRobinson avatar May 31 '24 15:05 MthwRobinson

@erichare Thanks. If you could close this and open a new one. We merged in the updates to the Astra destination (via a different PR), but this source connector needed more work.

potter-potter avatar May 31 '24 15:05 potter-potter

@erichare Thanks. If you could close this and open a new one. We merged in the updates to the Astra destination (via a different PR), but this source connector needed more work.

I'm on it @potter-potter . Thanks!

erichare avatar May 31 '24 15:05 erichare