pandas dataframe

Open kernelmachine opened this issue 7 years ago • 29 comments

How would you recommend converting Records to a pandas dataframe?

Also, what do you think about giving fetch the option of returning a dataframe instead of list? There might be performance concerns.

kernelmachine avatar Aug 05 '16 17:08 kernelmachine

What is your use case? A sync driver + SQLAlchemy is usually more than sufficient for such things; you even get it for free: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
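For reference, a minimal sketch of that synchronous route (the DSN and table name below are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")  # hypothetical DSN
df = pd.read_sql("SELECT id, value FROM some_table", engine)         # plain sync call, no asyncio involved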

jettify avatar Aug 05 '16 17:08 jettify

I'm also curious to hear about the use cases. Also, does Pandas provide an async API?

We can add an API to register a custom row-decoder callback (a function accepting bytes and returning a decoded row). That would make it possible for Pandas to implement its own decoding to any kind of data type.
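As a purely hypothetical illustration of what such a callback could look like (no registration API like this exists in asyncpg today):

def decode_row(raw_row: bytes) -> tuple:
    # Hypothetical: would receive the raw PostgreSQL wire-format row and return it
    # decoded into whatever representation the consumer (e.g. pandas/NumPy) prefers.
    raise NotImplementedError

# Registering it would require a new asyncpg API, e.g. some (hypothetical)
# conn.set_row_decoder(decode_row).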

1st1 avatar Aug 05 '16 17:08 1st1

Efficiently decoding data to the exact memory representation that pandas requires is currently somewhat complicated. It uses NumPy arrays as its internal representation, but null handling adds some complexity (a small sketch of this mapping follows the list):

  • float32/float64 columns — normal np.float32/float64 arrays with NaN for nulls
  • any integer columns: with no nulls, use int8/16/32/64; otherwise use float64 (with nulls encoded as NaN). Users should be warned about data loss for int64 values outside the range exactly representable in float64
  • boolean columns: numpy.bool_ for no nulls, and numpy.object_ if there are nulls (encode null as None)
  • string columns: numpy.object_ dtype with PyString/PyUnicode objects internally. Use None to encode nulls. It is preferable to avoid creating two copies of the same string (see https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h#L193 where I deal with this problem for HiveServer2)
  • datetime columns: nanoseconds since the UNIX epoch as numpy.int64 (logically: datetime64[ns]) array
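
A toy illustration of that mapping with plain NumPy/pandas (values are made up):

import numpy as np
import pandas as pd

floats = np.array([1.5, np.nan, 3.0], dtype=np.float64)           # float column, NaN for nulls
ints_no_nulls = np.array([1, 2, 3], dtype=np.int64)               # integer column without nulls
ints_with_nulls = np.array([1.0, np.nan, 3.0], dtype=np.float64)  # integers with nulls fall back to float64
bools_with_nulls = np.array([True, None, False], dtype=object)    # booleans with nulls become object
strings = np.array(["a", None, "c"], dtype=object)                # strings as PyUnicode objects, None for nulls
ts = np.array([1470417600 * 10**9], dtype=np.int64).view("datetime64[ns]")  # ns since the UNIX epoch
df = pd.DataFrame({"f": floats, "i": ints_with_nulls, "b": bools_with_nulls, "s": strings})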

The business logic for producing these arrays is best written in C/C++/Cython with limited involvement of the Python C API (only where numpy.object_ arrays are involved). I am happy to help advise on the implementation / do some code review; the C++ code in https://github.com/cloudera/hs2client/blob/master/python/hs2client/converters.h can serve as a reference for what pandas expects in the NumPy arrays (it does not do timestamp conversion; depending on PG's datetime representation, you may be able to perform the conversion in C or leave it to pandas's internal timestamp methods).

pandas does not have any async APIs at the moment AFAIK

wesm avatar Aug 05 '16 18:08 wesm

As a possible added benefit for pandas users, it would be nice to have the option for any string columns to be returned as pandas categorical dtype, which has significant performance benefits in analytics
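For example, with plain pandas:

import pandas as pd

s = pd.Series(["apple", "banana", "apple", "apple"])
cat = s.astype("category")  # stores integer codes plus a small dictionary of unique strings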

wesm avatar Aug 05 '16 18:08 wesm

@wesm I think we can build this functionality into asyncpg -- instead of Records we can return a list of columns (numpy arrays, with the decoding semantics you've outlined). Would that be enough to integrate pandas?

Also, would it be possible to provide a benchmark that we can tweak to use asyncpg and work with?

1st1 avatar Aug 05 '16 18:08 1st1

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

wesm avatar Aug 05 '16 18:08 wesm

Yeah, that would definitely work. Could put this NumPy deserializer in an optional extension so that people can still use asyncpg if they don't have NumPy installed (since you will have to build against NumPy's headers, etc.)

I actually wanted to make it an optional dependency. Let's say we hide this functionality behind an argument to the connect function (return_numpy_dataframes=False).

We should be able to help kick the tires and come up with some typical benchmarking scenarios (i.e. numeric heavy reads, string heavy reads, etc.)

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

1st1 avatar Aug 05 '16 18:08 1st1

That is something we need to have before we can start the process (we aren't Pandas users ourselves).

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?

wesm avatar Aug 05 '16 20:08 wesm

It sounds like what you need is a test suite, not benchmarks, am I interpreting that right?

Sorry, I should have clarified my request. I wanted to ask for a small script that uses one type (say int32), fetches some data from the DB and performs some rudimentary calculation. We could then use that script to prototype the implementation and see how it compares to existing solutions. In any case, never mind, I think I can jot down a simple pandas script myself.
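Something along these lines, presumably (the connection string, table, and column names are made up):

import asyncio
import asyncpg
import numpy as np

async def main():
    conn = await asyncpg.connect("postgresql://localhost/test")  # assumed DSN
    rows = await conn.fetch("SELECT value FROM measurements")    # hypothetical int4 column
    values = np.fromiter((r["value"] for r in rows), dtype=np.int32, count=len(rows))
    print(values.sum(), values.mean())
    await conn.close()

asyncio.run(main())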

1st1 avatar Aug 06 '16 17:08 1st1

I was interested in attempting to implement this - I did some naive benchmarks using asyncpg instead of psycopg to load large tables and saw about a 3x speed improvement. I suspect it could be possible to load data into pandas tables even faster by generating numpy arrays directly in the Cython layer. Any pointers on where to start poking around would be welcome - I am aiming to just produce some simple benchmark to start with, to see if it's worth pursuing lower-level integration (as opposed to just using the normal asyncpg interface).

(Edit: for context, I actually wrote quite a bit of the pandas SQLAlchemy-based SQL interface, but I was never too satisfied with its performance and type support.)

mangecoeur avatar Dec 07 '17 16:12 mangecoeur

@mangecoeur Interface-wise, it would make most sense to integrate with Connection.copy_*. Basically, the algorithm for "copy_from_*_to_pandas" would look something like this:

  1. Introspect the source relation to get the types of the attributes.
  2. Initialize the dataframe accordingly.
  3. Build a codec profile (see _ensure_rows_decoder), using decoder versions that produce pandas native format as output.
  4. Initiate COPY OUT ... (FORMAT binary), and decode the data chunks into the target dataframe.

Step number 3 is the tricky part. For this whole thing to make sense performance-wise, the decoding pipeline must not dip into the interpreter loop, thus the decoders must be plugged in as C function pointers (maybe as a special argument to copy*()).
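
Until such C-level decoders exist, a rough Python-level approximation of the same COPY-based idea is possible with the existing copy_from_query, using CSV instead of binary (so pandas re-infers the types):

import io
import asyncpg
import pandas as pd

async def copy_query_to_dataframe(conn: asyncpg.Connection, query: str) -> pd.DataFrame:
    buf = io.BytesIO()
    # COPY the query result out and let pandas parse it from the in-memory buffer.
    await conn.copy_from_query(query, output=buf, format="csv", header=True)
    buf.seek(0)
    return pd.read_csv(buf)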

elprans avatar Dec 07 '17 18:12 elprans

From a data representation point of view, it may also be worth looking at Apache Arrow as an intermediary to use -- then you don't have to deal with all the weird pandas stuff and just deal with strongly-typed nullable columns. The turbodbc folks have been using Arrow for data en route to/from pandas and that's been working very well cc @mathmagique @xhochy

wesm avatar Dec 07 '17 18:12 wesm

Thanks for the pointers, I will try to implement some really simple cases just to see what the performance is like. If it's promising, I could follow @wesm's suggestion and integrate with Arrow (although turbodbc looks a bit baroque w.r.t. mixing C++ and Python, compared to Cython only).

(Edit: I notice that there is a WIP Cython API for Arrow.)

mangecoeur avatar Dec 07 '17 23:12 mangecoeur

The main thing you will benefit from in Arrow is that you get a simple construction of the columnar buffers using the Builder classes. The transformation from the Arrow structures to a pandas DataFrame is then taken care of on the Arrow side. As Arrow is structured more simply than pandas, the implementation is much simpler while still being very efficient.
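In Python terms, the equivalent high-level flow with pyarrow (using ready-made arrays rather than the low-level builders) is roughly:

import pyarrow as pa

ints = pa.array([1, None, 3], type=pa.int64())       # nulls tracked in a validity bitmap
strs = pa.array(["a", None, "c"], type=pa.string())
table = pa.Table.from_arrays([ints, strs], names=["x", "y"])
df = table.to_pandas()                               # Arrow handles the dtype/null mapping to pandas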

API-wise, I have used Arrow with Cython, boost::python, and pybind11. If you're fluent in C++ and Python, then pybind11 is the best option from my perspective, as you get the best of both worlds. Cython is the more natural choice when you come from a Python-only perspective, but it limits your use of C++, as not all C++ features are supported in Cython. Nevertheless, in Arrow we want to support all three options, and it should be simple to use with Cython. Always happy to help you get it working, independent of the wrapper ;)

xhochy avatar Dec 09 '17 14:12 xhochy

I would definitely be interested in being able to return the results of a query as a sequence of NumPy arrays representing the columns of the result. I don't use pandas very much, but I do use NumPy, and I would love an API that allows the user to pass in column dtypes and null handling instructions.
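A hypothetical helper in that spirit, built on top of the Records that fetch already returns (the dtype/null-fill arguments are made up):

import numpy as np

def columns_from_records(records, dtypes, null_fill):
    # dtypes: {column name: NumPy dtype}; null_fill: {column name: value to substitute for NULL}
    out = {}
    for name, dtype in dtypes.items():
        fill = null_fill.get(name)
        out[name] = np.array(
            [fill if r[name] is None else r[name] for r in records],
            dtype=dtype,
        )
    return out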

deepers avatar Jul 15 '19 20:07 deepers

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

jedmitten avatar Oct 08 '19 20:10 jedmitten

I generally use

res = await pool.fetch(...)
df = pd.DataFrame([dict(r) for r in res])

This didn't work for me, I had to use:

dfs = pd.DataFrame([dict(r.items()) for r in results])

foxx avatar Mar 06 '20 02:03 foxx

Any news/updates on this enhancement?

RiccardoGiro avatar Jul 23 '20 12:07 RiccardoGiro

Hi all,

I tried to implement a naive version of @elprans proposal. Here's what I came up with: https://gist.github.com/0x0L/3d55d47f6329eb0d7e7a46d1c895b5be

A few remarks:

  • the Cython bindings for array builders are a bit shady: default ctors are missing, and there's no reset method
  • I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm
  • I couldn't reuse the py-pgproto codecs, as they spit out Python objects and I didn't want the overhead.

I don't know what the timings at the bottom are worth, but at least they don't look bad.

0x0L avatar Nov 08 '21 10:11 0x0L

We haven't invested too much in the Cython API for the array builders, so using the C++ API (in Cython) would be the way to go (if Cython is the right implementation language).

I couldn't find a canonical way to get an array builder from a given datatype - which looks strange to me. Am I missing something? @wesm

We have the arrow::MakeBuilder function in C++ https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L280

wesm avatar Nov 08 '21 14:11 wesm

@wesm Thanks! MakeBuilder is exactly what I was looking for. In terms of interop, a generic C++ lib would probably be a much better candidate here, given the narrow and well-defined scope: convert pq binary buffers to a Table or RecordBatch of a given schema. At least for primitive types, the schema translation from SQL to Arrow looks straightforward.

0x0L avatar Nov 08 '21 16:11 0x0L

I switched to C++. Writing Python bindings will be quite easy considering how simple the interface is.

It supports lists, composite types, and enums. Still a WIP, but you can check it out at https://github.com/0x0L/pg2arrow

It basically tries to do the same as https://github.com/heterodb/pg2arrow. Since we're using the Arrow API, the code is much simpler.

Also looking for advice on how to improve my ugly C++ pattern.

0x0L avatar Nov 14 '21 16:11 0x0L

hi @0x0L, great news. I'm interested to see if there is a way to create a small "C middleware library" in Apache Arrow that uses the C data interface to exchange data structures:

https://arrow.apache.org/docs/format/CDataInterface.html

The idea would be to have some C code that provides a minimal Arrow builder API along with a minimal C implementation of this data interface, so downstream applications don't necessarily need to use C++ or link to the Arrow C++ library.

cc @pitrou for thoughts

wesm avatar Nov 14 '21 19:11 wesm

I'd like to hear if people have a problem with using Arrow C++ here. Since only a minimal build of Arrow C++ should be required, you can probably easily link statically to libarrow.a, and there shouldn't be any additional dependencies.

pitrou avatar Nov 16 '21 13:11 pitrou

Hi all.

I finally got a product that should be usable enough for others to test. You can have a look at https://github.com/0x0L/pgeon

Any feedback or help would be greatly welcomed :)

0x0L avatar Apr 16 '22 13:04 0x0L

I also thought about this in my blog post and will try to hack an alternative production-ready deserializer to numpy recarrays. My requirements are:

  • Drop-in replacement for asyncpg interface. I plan to prefix the query with :rocket: to activate the columnar mode.
  • Handle nulls well. Our production needs different strategies for nulls: remove rows; coerce to constant; fallback to object array.
  • Play nice with custom types such as JSON and HSTORE. Support them and not crash, that is.
  • Do not convert strings to fixed-width S/U dtypes, because memory usage can explode. Find a way to inform the internals whether such a conversion must be done.
  • Of course, there will be no intermediate protocol parsing to Record-s. That's the whole point :smile:
  • No third-party dependencies except numpy.

Let's see how much of the original native asyncpg code will be left and how much faster it will work.

vmarkovtsev avatar Apr 22 '22 21:04 vmarkovtsev

Done. Blog post. Repository: https://github.com/athenianco/asyncpg-rkt

As I wrote in the post, merging back would require some effort, so I want to check my opportunities with the maintainers.

vmarkovtsev avatar Jun 02 '22 17:06 vmarkovtsev

I'd love to see it merged back, provided NumPy remains an optional dependency.

eirnym avatar Jun 05 '22 07:06 eirnym

@vmarkovtsev it would be awesome if you were able to merge

ale-dd avatar Jun 13 '23 21:06 ale-dd