mysqlclient
Feature request: to_pandas()/to_arrow()
Hi, is there an easy and performant way to convert the result of a query into a Pandas dataframe or PyArrow table while keeping the column names and data types?
I don't use pandas or arrow. If it is a feature request, please write a pull request, or at least be specific. If you are just asking a question, please ask elsewhere. The issue tracker is not a forum and OSS maintainers are not free tech support.
You've confirmed that this feature doesn't exist, so yes, this is a feature request.
The goal is to retrieve query results as Pandas dataframes.
Usually, this is done with pd.read_sql(), but that method lacks knowledge of the column data types and is very slow.
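For context, a minimal sketch of that status quo with mysqlclient; the credentials and the orders table are placeholders, not anything from this repo:

```python
import MySQLdb
import pandas as pd

# Placeholder credentials and a hypothetical table, for illustration only.
conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="secret", db="mydb")

# Every value is first turned into a Python object by the driver;
# pandas then infers each column's dtype from those objects.
df = pd.read_sql("SELECT id, created_at, price FROM orders", conn)
print(df.dtypes)  # inferred dtypes, not the MySQL schema
```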
The SQL library knows those data types and could convert directly from the underlying C result set to, e.g., Arrow tables.
Turbodbc, for example, uses this approach.
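For comparison, a hedged sketch of what Turbodbc exposes today; it requires an ODBC setup and a Turbodbc build with Arrow support, and the DSN and query here are placeholders:

```python
from turbodbc import connect

conn = connect(dsn="mysql_dsn")  # placeholder ODBC data source name
cursor = conn.cursor()
cursor.execute("SELECT id, created_at, price FROM orders")  # hypothetical query

# Turbodbc assembles the Arrow table in native code,
# skipping per-value Python objects entirely.
table = cursor.fetchallarrow()   # returns a pyarrow.Table
df = table.to_pandas()           # optional conversion to pandas
```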
On my side, I lack both the knowledge of this library's internals and the time to implement it. I'm surprised that you don't use Pandas, but I understand that you therefore have no interest in supporting it.
Maybe someone else is interested in Pandas support and knows how to actually implement it.
@Hoeze Pandas has a read_sql function with a chunksize option. As far as the implementation is concerned, it piggybacks on the db connection to run a query, so the column names and data types should be available to Pandas. Some dtypes do vary between the Pandas and MySQL implementations (usually dates/floats); that is where you need to tell pandas, via parse_dates, which specific columns to convert. There is also the coerce_float option for mapping MySQL's decimal values onto Pandas floats.
I believe it is fast already, but if you'd like you can chunk via chunksize. There are other factors involved too, such as the network connection; it could also be the query itself.
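A minimal sketch putting those options together; the connection details and the orders table are placeholders:

```python
import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="secret", db="mydb")  # placeholder credentials

# chunksize streams the result as an iterator of DataFrames instead of
# one big frame; parse_dates and coerce_float steer dtype conversion.
for chunk in pd.read_sql(
    "SELECT id, created_at, amount FROM orders",  # hypothetical query
    conn,
    parse_dates=["created_at"],  # parse to datetime64 instead of object
    coerce_float=True,           # map decimal.Decimal values to float64
    chunksize=10_000,
):
    print(len(chunk))  # stand-in for per-chunk processing
```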
Another way I have done it before is to store the results as an in-memory file and read that in.
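One way that could look, as a sketch: the cursor metadata supplies the header, and the query and credentials are again placeholders:

```python
import csv
import io

import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="secret", db="mydb")  # placeholder credentials
cursor = conn.cursor()
cursor.execute("SELECT id, created_at, amount FROM orders")  # hypothetical query

# Spool the rows into an in-memory CSV buffer...
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(col[0] for col in cursor.description)  # column names
writer.writerows(cursor.fetchall())
buf.seek(0)

# ...then read it back with pandas' fast C parser.
df = pd.read_csv(buf, parse_dates=["created_at"])
```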
Have you run an ODBC trace, and have you looked at server throughput on the MySQL db?
Hi @hpca01, thanks for your answer and sorry for my late reply.
The reasons why I was asking for PyArrow support are the following:
- Fixed column schema. Pandas read_sql has no idea about the column types unless I tell it. (At least passing a dtype is possible now: https://github.com/pandas-dev/pandas/issues/13049)
- Zero-copy and interoperability. With Apache Arrow, the whole deserialization inside Pandas becomes obsolete, and I can reuse the same memory chunk in other languages, e.g. PySpark. See also https://github.com/pandas-dev/pandas/issues/36893. A sketch covering both points follows below.
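A sketch of both points, assuming a pandas version where read_sql accepts a dtype argument (the outcome of the first linked issue) and reusing the placeholder table from above:

```python
import MySQLdb
import pandas as pd
import pyarrow as pa

conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="secret", db="mydb")  # placeholder credentials

# (1) Fixed schema: pin the dtypes up front instead of trusting inference.
df = pd.read_sql(
    "SELECT id, created_at, amount FROM orders",  # hypothetical query
    conn,
    parse_dates=["created_at"],
    dtype={"id": "int64", "amount": "float64"},
)

# (2) Interoperability: hand the frame to Arrow. Note this still copies;
# the feature request is about building the Arrow table directly from the
# driver's C result set, skipping the pandas detour.
table = pa.Table.from_pandas(df)
```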