duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

Python: More similar interfaces for fetching data using DuckDBPyRelation and DuckDBPyResult

Open wisp3rwind opened this issue 2 years ago • 14 comments

Addresses https://github.com/duckdb/duckdb/discussions/4091 by adding some aliases for fetching results to DuckDBPyRelation and DuckDBPyResult such that they have (almost) identical interfaces:

  • This adds fetchnumpy to Relation, which was previously missing entirely (this is my primary motivation for this PR)
  • Other aliases are added, to match both classes. The exception is fetch_df_chunk, since this didn't seem to make much sense for Relation without creating the Result explicitly first to serve as the iterator object.
  • I took the liberty to modify some docstrings to make them clearer (to me, at least).

What do you think?

wisp3rwind avatar Jul 11 '22 21:07 wisp3rwind

Force-pushed for style fixes and to add missing stubs (apparently, I missed some when using git add -p).

wisp3rwind avatar Jul 12 '22 06:07 wisp3rwind

Hi, @wisp3rwind thanks for your PR. Could you please add test cases to the fetch functions you added?

I'm also not sure how I feel about many aliases to the same function, I wonder if @Mause or @Alex-Monahan have an opinion about this, I've personally been meaning to clean out these aliases from the API at some point because I think they make it more confusing.

pdet avatar Jul 12 '22 07:07 pdet

@pdet yeah I'm not a huge fan of having four names for one function, that's just likely to confuse people even more. We should probably choose one name, and if it's different from the current one, throw a deprecation Warning

Mause avatar Jul 12 '22 07:07 Mause

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow()and fetch_arrow()

I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

pdet avatar Jul 12 '22 07:07 pdet

Hi, @wisp3rwind thanks for your PR. Could you please add test cases to the fetch functions you added?

Sure. I'll extend existing tests that seem relevant, and come back for some feedback. Anything in particular that you think should be tested?

I'm also not sure how I feel about many aliases to the same function, I wonder if @Mause or @Alex-Monahan have an opinion about this, I've personally been meaning to clean out these aliases from the API at some point because I think they make it more confusing.

@Alex-Monahan already left a comment in #4091.

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow() and fetch_arrow()

I'd argue to keep fetch_arrow() and fetch_arrow_reader() separate, since their return types are different.

I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

:+1:

wisp3rwind avatar Jul 12 '22 07:07 wisp3rwind

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow()and fetch_arrow()

I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

I'm against the flag model, if only because we can't easily type that in pythons model (if we're talking a boolean flag at least)

Edit: I'd be open to chained method calls instead? .fetch_arrow(...).as_type(...) or something? Better than a flag IMHO

Mause avatar Jul 12 '22 08:07 Mause

Hi, @wisp3rwind thanks for your PR. Could you please add test cases to the fetch functions you added?

Sure. I'll extend existing tests that seem relevant, and come back for some feedback. Anything in particular that you think should be tested?

I'm also not sure how I feel about many aliases to the same function, I wonder if @Mause or @Alex-Monahan have an opinion about this, I've personally been meaning to clean out these aliases from the API at some point because I think they make it more confusing.

@Alex-Monahan already left a comment in #4091.

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow() and fetch_arrow()

I'd argue to keep fetch_arrow() and fetch_arrow_reader() separate, since their return types are different.

I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

👍

For sure your fetch numpy code, try to hit all the lines of code with the tests (so also an invalid query), and maybe do empty result, < vector_size, > vector_size (vector_size is 1024 elements)

pdet avatar Jul 12 '22 08:07 pdet

Hi, @wisp3rwind thanks for your PR. Could you please add test cases to the fetch functions you added?

I'm also not sure how I feel about many aliases to the same function, I wonder if @Mause or @Alex-Monahan have an opinion about this, I've personally been meaning to clean out these aliases from the API at some point because I think they make it more confusing.

I'm flexible! Thank you for asking!! The migration on the Intel side will be easy, so a break in API compatibility shouldn't be an issue there. I do like the really short functions like .df(), but I understand the desire for consistency. I'm flexible on how to handle the two different arrow return types as well.

Alex-Monahan avatar Jul 12 '22 08:07 Alex-Monahan

Yet another variant would be to have Relation.fetch_xyz() and Result.xyz() (xyz = df, arrow, numpy, arrow_reader), to indicate that this executes in the former case, which will already be spelled out explicitly in the latter case.

Not sure whether that might make things more confusing in the end.

wisp3rwind avatar Jul 12 '22 08:07 wisp3rwind

Yet another variant would be to have Relation.fetch_xyz() and Result.xyz() (xyz = df, arrow, numpy, arrow_reader), to indicate that this executes in the former case, which will already be spelled out explicitly in the latter case.

Not sure whether that might make things more confusing in the end.

Interesting thought! I'd likely rather them be the same and have the two options in both APIs, but that's just me!

Alex-Monahan avatar Jul 12 '22 09:07 Alex-Monahan

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow()and fetch_arrow() I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

I'm against the flag model, if only because we can't easily type that in pythons model (if we're talking a boolean flag at least)

Edit: I'd be open to chained method calls instead? .fetch_arrow(...).as_type(...) or something? Better than a flag IMHO

Ach so, @Alex-Monahan and @Mause let's maybe go over it on discord the three of us, I'll come up with some short document of the current function names and a proposal, and you can comment/change it as you wish.

Regarding the chaining, I think I prefer two different methods than the chaining stuff.

@wisp3rwind I think this discussion is a bit orthogonal (although triggered by :-) )to your PR, maybe in this PR could you just add the fetch_numpy function?

pdet avatar Jul 12 '22 09:07 pdet

I would say we can keep df() and fetch_df(), numpy() and fetch_numpy() and merge all arrow functions with a flag to output either an arrow table or a record batch reader and call it arrow()and fetch_arrow() I would also be ok with supporting only one of those, I guess in that case I would give preference to fetch_* because is more explicit

I'm against the flag model, if only because we can't easily type that in pythons model (if we're talking a boolean flag at least)

Edit: I'd be open to chained method calls instead? .fetch_arrow(...).as_type(...) or something? Better than a flag IMHO

Ach so, @Alex-Monahan and @Mause let's maybe go over it on discord the three of us, I'll come up with some short document of the current function names and a proposal, and you can comment/change it as you wish.

Regarding the chaining, I think I prefer two different methods than the chaining stuff.

@wisp3rwind I think this discussion is a bit orthogonal (although triggered by :-) )to your PR, maybe in this PR could you just add the fetch_numpy function?

Agreed, I'd be happy to see this pr's scope reduced to the new method and the improved docstrings 😁

Mause avatar Jul 12 '22 09:07 Mause

Agreed, I'd be happy to see this pr's scope reduced to the new method and the improved docstrings grin

Sounds good; not sure though whether I can still manage to this week.

wisp3rwind avatar Jul 14 '22 08:07 wisp3rwind

Agreed, I'd be happy to see this pr's scope reduced to the new method and the improved docstrings grin

Sounds good; not sure though whether I can still manage to this week.

I'm still planning to finish this, but won't have time to do so before late August. If anyone wants to pick this up in the meantime, feel free to do so!

wisp3rwind avatar Jul 25 '22 10:07 wisp3rwind

Closing this in favor of the linked PRs.

Ach so, @Alex-Monahan and @Mause let's maybe go over it on discord the three of us, I'll come up with some short document of the current function names and a proposal, and you can comment/change it as you wish.

Out of curiosity, did that discussion already happen?

wisp3rwind avatar Aug 18 '22 12:08 wisp3rwind