
SNOW-165815: pandas datetime columns misinterpreted as VARCHAR(16777216) instead of DATETIME

Open Zaubeerer opened this issue 5 years ago • 43 comments

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?

Python 3.8.2

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?

macOS-10.15.5-x86_64-i386-64bit

  3. What are the component versions in the environment (pip list)?

appdirs 1.4.3 appier 1.19.12 appnope 0.1.0 asn1crypto 1.3.0 astroid 2.3.3 attrs 19.3.0 azure-common 1.1.25 azure-storage-blob 2.1.0 azure-storage-common 2.1.0 backcall 0.1.0 beautifulsoup4 4.9.0 black 19.10b0 bleach 3.1.4 bokeh 2.0.1 boto3 1.11.17 botocore 1.14.17 cachetools 4.0.0 certifi 2020.4.5.1 cffi 1.13.2 cftime 1.1.1.2 chardet 3.0.4 click 7.1.1 cloudpickle 1.3.0 colorama 0.4.3 cryptography 2.9.2 cycler 0.10.0 cytoolz 0.10.1 dask 2.14.0 decorator 4.4.2 defusedxml 0.6.0 distributed 2.14.0 dnslib 0.9.12 dnspython 1.16.0 docopt 0.6.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fsspec 0.7.1 google-api 0.1.12 google-api-core 1.16.0 google-api-python-client 1.8.0 google-auth 1.13.1 google-auth-httplib2 0.0.3 googleapis-common-protos 1.51.0 grpcio 1.28.1 grpcio-tools 1.28.1 h5netcdf 0.8.0 h5py 2.10.0 HeapDict 1.0.1 httplib2 0.17.1 idna 2.9 ijson 2.6.1 ipykernel 5.2.0 ipython 7.13.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 jdcal 1.4.1 jedi 0.16.0 Jinja2 2.11.1 jmespath 0.9.5 jsonschema 3.2.0 jupyter 1.0.0 jupyter-client 6.1.2 jupyter-console 6.1.0 jupyter-core 4.6.3 kiwisolver 1.1.0 lab 5.5 lazy-object-proxy 1.4.3 localstack 0.11.0 localstack-client 0.23 localstack-ext 0.11.2 locket 0.2.0 MarkupSafe 1.1.1 matplotlib 3.2.1 mccabe 0.6.1 mistune 0.8.4 mkl-fft 1.0.15 mkl-service 2.3.0 more-itertools 8.2.0 msgpack 1.0.0 mypy-extensions 0.4.3 nbconvert 5.6.1 nbformat 5.0.4 netCDF4 1.5.3 notebook 6.0.3 numpy 1.18.2 olefile 0.46 openpyxl 3.0.3 oscrypto 1.2.0 packaging 20.3 pandas 1.0.3 pandocfilters 1.4.2 parso 0.6.2 partd 1.1.0 pathspec 0.7.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 7.0.0 pip 20.1.1 pluggy 0.13.1 prometheus-client 0.7.1 prompt-toolkit 3.0.5 protobuf 3.11.3 psutil 5.7.0 ptyprocess 0.6.0 py 1.8.1 pyaes 1.6.1 pyarrow 0.16.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycparser 2.20 pycryptodomex 3.9.7 Pygments 2.6.1 PyJWT 1.7.1 pylint 2.4.4 pyOpenSSL 19.1.0 pyparsing 2.4.6 pyrsistent 0.16.0 pytest 5.4.1 python-dateutil 2.8.1 python-gitlab 2.1.2 pytz 2019.3 PyYAML 5.3.1 pyzmq 19.0.0 qtconsole 4.7.2 QtPy 1.9.0 read-protobuf 0.1.1 regex 2020.2.20 requests 2.23.0 rsa 3.4.2 s3transfer 0.3.3 Send2Trash 1.5.0 setuptools 46.1.1.post20200323 simplejson 3.17.0 six 1.14.0 snowflake-connector-python 2.2.5 snowflake-sqlalchemy 1.2.3 sortedcontainers 2.1.0 soupsieve 2.0 SQLAlchemy 1.3.16 subprocess32 3.5.4 tblib 1.6.0 terminado 0.8.3 testpath 0.4.4 toml 0.10.0 toolz 0.10.0 tornado 6.0.4 tqdm 4.46.0 traitlets 4.3.3 txt2tags 3.7 typed-ast 1.4.1 typing-extensions 3.7.4.1 uritemplate 3.0.1 urllib3 1.25.8 wcwidth 0.1.9 webencodings 0.5.1 wheel 0.34.2 widgetsnbextension 3.5.1 wrapt 1.12.1 xarray 0.15.1 zict 2.0.0

  4. What did you do? When trying to upload data using the snowflake pandas backend:
df.to_sql(table_name, engine, index=False, method=pd_writer, if_exists="replace")

The datetime column is not identified or transformed correctly (a fuller reproduction sketch is included at the end of this report).

  5. What did you expect to see?
(screenshot of the expected result)
  6. What did you see instead?
(screenshot of the actual result)
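
For completeness, here is a fuller sketch of the reproduction above (the connection URL and column names are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

# Placeholder connection URL; fill in real account details.
engine = create_engine(
    "snowflake://<user>:<password>@<account>/<database>/<schema>?warehouse=<warehouse>"
)

df = pd.DataFrame(
    {
        "measured_at": pd.to_datetime(["2020-06-01 12:00:00", "2020-06-02 13:30:00"]),
        "value": [1.5, 2.5],
    }
)

# The datetime64[ns] column is not mapped to a DATETIME/TIMESTAMP column as
# expected; it ends up as VARCHAR(16777216) in Snowflake.
df.to_sql("demo_table", engine, index=False, method=pd_writer, if_exists="replace")
```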

Zaubeerer avatar Jun 09 '20 13:06 Zaubeerer

PS: I am new to Snowflake and to the Python connector, so I'm looking forward to learning from you and improving the way we can use the Snowflake connector together. :)

Zaubeerer avatar Jun 09 '20 13:06 Zaubeerer

PPS: I wanted to correct the title for clarification, but it seems like the GitHub bot does not allow me to do so...

Zaubeerer avatar Jun 09 '20 13:06 Zaubeerer

Hey @sfc-gh-mkeller, if you need any further information, I am happy to provide and contribute.

I would like to solve this problem quickly, before ingesting large amounts of data :)

Zaubeerer avatar Jun 10 '20 10:06 Zaubeerer

I'm sorry @Zaubeerer , I got caught up in some high priority work. I promise you that I'll get to this very soon.

sfc-gh-mkeller avatar Jun 15 '20 23:06 sfc-gh-mkeller

Same issue for me, using the write_pandas method from the pandas_tools package of the Snowflake connector.

AlbCM avatar Jul 06 '20 15:07 AlbCM

@sfc-gh-mkeller same issue here using write_pandas!

mrestay avatar Jul 06 '20 16:07 mrestay

Sorry, the new write_pandas functionality has revealed some issues with our internals. So far we know that datetimes do not get parsed properly (which is where this error comes from), and lists don't get parsed properly either.

As a short-term solution, please don't use write_pandas with datetimes. As a long-term solution I'll be switching it over to use CSV files instead of Parquet; let's use this ticket to track progress.

sfc-gh-mkeller avatar Jul 06 '20 16:07 sfc-gh-mkeller

Is there any other alternative to write datetimes in the meantime? Thanks for the reply btw.

mrestay avatar Jul 06 '20 16:07 mrestay

Yes, of course. You can always use the to_sql built into pandas. It's only the snowflake.connector.pandas_tools module that won't work with datetimes.

I apologize for this issue!
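
A minimal sketch of that fallback (the table name is a placeholder), assuming a snowflake-sqlalchemy engine like the one in the original report:

```python
# Plain pandas to_sql without method=pd_writer, so the pandas_tools
# Parquet path is never involved and the datetime columns come through.
df.to_sql("my_table", engine, index=False, if_exists="replace")
```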

sfc-gh-mkeller avatar Jul 06 '20 17:07 sfc-gh-mkeller

Hey @sfc-gh-mkeller, I found this post on Stack Overflow; maybe it's useful to pinpoint the problem. https://stackoverflow.com/questions/60190210/how-can-i-reliably-use-datetime-values-in-parquet-files-to-fill-snowflake-tabl

mrestay avatar Jul 08 '20 00:07 mrestay

Thank you @sfc-gh-mkeller, we are using to_sql and it writes datetimes perfectly, but the performance is quite low. We have been doing some tests playing with the params but without significant changes. Any suggestions?

AlbCM avatar Jul 08 '20 12:07 AlbCM

The problem comes from our Parquet parser, so the fastest way to ingest data would be to dump the data into CSV files, upload those, and then copy them into a table. If anyone would like to write their own solution for this, please use write_pandas as a starting point: just use to_csv and then play with the settings until Snowflake and the pandas CSV engine agree on things. Please note that a bug was found in it just a few days ago; here's the unreleased fix: https://github.com/snowflakedb/snowflake-connector-python/commit/fa7d80d4bedaa2ab933683576719c2d1bdd2ddcf Take this into account too!
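
A rough sketch of that CSV route (stage, table, and file names are illustrative, and the CSV format options will likely need tuning), assuming the target table already exists:

```python
import pandas as pd
import snowflake.connector

conn = snowflake.connector.connect(
    account="...", user="...", password="...",
    database="...", schema="...", warehouse="...",
)

df = pd.DataFrame({"ts": pd.to_datetime(["2020-07-08 12:00:00"]), "value": [1.0]})
df.to_csv("/tmp/my_table.csv", index=False)

cur = conn.cursor()
# Upload the file to the table's internal stage.
cur.execute("PUT file:///tmp/my_table.csv @%MY_TABLE")
# Let Snowflake's CSV parser handle the timestamp strings.
cur.execute(
    "COPY INTO MY_TABLE FILE_FORMAT = "
    "(TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"' SKIP_HEADER = 1)"
)
```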

sfc-gh-mkeller avatar Jul 08 '20 16:07 sfc-gh-mkeller

Hey @sfc-gh-mkeller, I found this post on Stack Overflow; maybe it's useful to pinpoint the problem. https://stackoverflow.com/questions/60190210/how-can-i-reliably-use-datetime-values-in-parquet-files-to-fill-snowflake-tabl

fastparquet seems to solve the issue, so if anyone would like to work around it for now, install fastparquet and add engine="fastparquet" to the following call: https://github.com/snowflakedb/snowflake-connector-python/blob/cbf422897d42b7bcb441d4345d73267e16879c53/pandas_tools.py#L109

Please note that installing fastparquet is only simple if Anaconda is installed; see https://pypi.org/project/fastparquet/ for more information.
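
As a standalone illustration of the engine switch (the file path is illustrative; inside write_pandas it is the analogous to_parquet call linked above that would get the extra argument):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2020-07-11 08:00:00"]), "value": [1]})

# Use fastparquet instead of the default pyarrow writer; per the comments
# above this seems to produce timestamps that Snowflake ingests correctly.
df.to_parquet("/tmp/chunk.parquet", engine="fastparquet", compression="gzip")
```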

Rest assured, we are working on a long-term fix and appreciate your patience!

sfc-gh-mkeller avatar Jul 11 '20 02:07 sfc-gh-mkeller

Hi @sfc-gh-mkeller,

any updates on how soon the long-term fix will be available?

I will be working with snowflake-connector-python in the next few days and am wondering whether or not I have to use the workaround.

Best, Robin

Zaubeerer avatar Jul 22 '20 13:07 Zaubeerer

@Zaubeerer we managed to find a temporary workaround by modifying the write_pandas method to write the temp files to CSV, not using Parquet at all (based on the suggestion of @sfc-gh-mkeller). We did not use fastparquet as suggested by @sfc-gh-mkeller because installing it in our Docker container through conda was infeasible. You can try this as a temporary solution while they work on the long-term fix.

mrestay avatar Jul 22 '20 20:07 mrestay

I ran into the same issue. However, for us, it doesn't make too much of a difference and so we simply passed all dates and timestamps as strings. If needed, we can cast them back while presenting the data. Of course, we'd ideally like this to just work out of the box, but for now, this workaround is pretty okay.
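
A sketch of that string-cast approach (the column name is illustrative); the values can be cast back with TO_TIMESTAMP on the Snowflake side when querying:

```python
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime(["2020-08-14 09:15:00"]), "value": [42]})

# Serialize datetimes to strings before handing the frame to write_pandas /
# pd_writer; the column lands as VARCHAR and can be cast back later.
df["created_at"] = df["created_at"].dt.strftime("%Y-%m-%d %H:%M:%S.%f")
```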

akshayi1 avatar Aug 14 '20 03:08 akshayi1

Hey @sfc-gh-mkeller, after being able to workaround the issue, we are finally coming back to it. 🙂

So first question: Is this issue still open?

If so, I just tried to pip install -e . the latest snowflake-connector-python, but got a cryptic error on a Windows machine:

ERROR: Command errored out with exit status 1: '\anaconda3\envs...\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'...\snowflake-connector-python\setup.py'"'"'; file='"'"'...\snowflake-connector-python\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

That seems to make it difficult to use the workaround with fastparquet 🤔

Furthermore, we now not only get problems with datetimes when using write_pandas, but also with Unix timestamps and serial numbers in integer format. 😮

So we would be happy about any update! 🙏

Zaubeerer avatar Sep 10 '20 19:09 Zaubeerer

Hi @Zaubeerer, please don't use -e; it doesn't work with snowflake-connector-python for two reasons:

  1. We use a namespace package. The snowflake part of snowflake.connector is what causes the issue. No one else seems to be using namespace packages anymore, so this issue flies under the radar. I have given up on trying to get editable installs to work because of this, and mostly because of my next point.
  2. Even if our namespace package worked with editable installs, what should happen to the compiled C code? I think this is undefined behavior by the Python/PyPI team. I could be wrong about this, but I can't find how other packages handle this case.

Maybe someone has come up with a solution to this, but I have been unable to find a good one.

Usually just trying to editable-install the connector will corrupt the whole virtualenv; you might have to recreate the whole thing.

sfc-gh-mkeller avatar Sep 10 '20 19:09 sfc-gh-mkeller

OK, thanks for the instant feedback!

I assume the issue is not resolved yet?

So I will try it with just pip install . :)

Zaubeerer avatar Sep 10 '20 19:09 Zaubeerer

Not yet, sorry!

sfc-gh-mkeller avatar Sep 10 '20 19:09 sfc-gh-mkeller

So here is a quick summary of the small performance test so far, for a table of 100,000 rows that takes 15 s to read from our client's API into a DataFrame.

We use to_sql with chunksize=16000 (the corresponding calls are sketched after the list):

  1. simple (no pd_writer, no fastparquet): ~ 4 min
  2. fastparquet (no pd_writer, fastparquet): ~ 2.5 min
  3. pd_writer (with pd_writer, no fastparquet): ~ 1 min
  4. pd_writer-fastparquet (both pd_writer and fastparquet): ~ 50 s
  5. fastparquet-nochunksize (no pd_writer, fastparquet, no chunksize): doesn't work
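
For reference, variants 1 and 3 boil down to calls roughly like these (engine creation omitted; the fastparquet variants additionally assume the engine="fastparquet" patch to write_pandas mentioned earlier):

```python
from snowflake.connector.pandas_tools import pd_writer

# 1. simple: plain pandas to_sql, chunked
df.to_sql("my_table", engine, index=False, chunksize=16000, if_exists="replace")

# 3. pd_writer: route the inserts through write_pandas via method=pd_writer
df.to_sql("my_table", engine, index=False, chunksize=16000,
          method=pd_writer, if_exists="replace")
```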

All this is not really satisfying, as I would rather have expected something like 1-10 s for ingesting 100,000 rows.

We actually have thousands of systems with ~3 million rows each, so this would take ages to transfer...

Did I do it right?

Zaubeerer avatar Sep 10 '20 21:09 Zaubeerer

Here are my original benchmarks from when I added this feature:

| Description | Old Insert | to_sql insert, no pd_writer | pd_writer |
| --- | --- | --- | --- |
| 1M rows - 10 000 chunking | 828.96 s | 112.60 s | 283.42 s |
| 1M rows - no chunking | N/A | 31.18 s | 41.63 s |

sfc-gh-mkeller avatar Sep 10 '20 21:09 sfc-gh-mkeller

If we could reproduce the performance of no pd_writer, that would indeed be a sufficient workaround for us.

How did you implement or rather call no pd_writer - no chunking - using write_pandas?

Zaubeerer avatar Sep 11 '20 08:09 Zaubeerer

@Zaubeerer we managed to find a temporary workaround by modifying the write_pandas method to write the temp files to CSV, not using Parquet at all (based on the suggestion of @sfc-gh-mkeller). We did not use fastparquet as suggested by @sfc-gh-mkeller because installing it in our Docker container through conda was infeasible. You can try this as a temporary solution while they work on the long-term fix.

Hey @m-restrepo11, could you share your solution? Have you done some performance testing similar to the ones mentioned above?

Zaubeerer avatar Sep 11 '20 21:09 Zaubeerer

How did you implement or rather call no pd_writer - no chunking - using write_pandas?

By using https://github.com/snowflakedb/snowflake-connector-python/blob/a7ac99cafd6c2b01e2ed76fef319ca077ee259be/test/pandas/test_pandas_tools.py#L29

Note that you need to make sure that the table is available beforehand when using this function, but if it already exists then this is the fastest way to insert data; "no chunking" just means writing the whole DataFrame in one go instead of splitting it into chunks.

sfc-gh-mkeller avatar Sep 11 '20 21:09 sfc-gh-mkeller

I ran into the same issue. However, for us, it doesn't make too much of a difference and so we simply passed all dates and timestamps as strings. If needed, we can cast them back while presenting the data. Of course, we'd ideally like this to just work out of the box, but for now, this workaround is pretty okay.

@akshayi1, how do you pass the datetimes etc. as strings? I just tried to use that workaround (both casting the Unix timestamps and converting the datetimes to str in Python), but we still get NULLs 🤷 😞

Zaubeerer avatar Sep 11 '20 21:09 Zaubeerer

By using

https://github.com/snowflakedb/snowflake-connector-python/blob/a7ac99cafd6c2b01e2ed76fef319ca077ee259be/test/pandas/test_pandas_tools.py#L29

Note that you need to make sure that the table is available beforehand with this function, but if it already exists then this is the fastest way to insert data, no chunking just means that

Yep, I worked around the "table has to be available" issue by first calling df.to_sql to write just 1 row and then calling write_pandas to write the other millions of rows. This could actually be added to write_pandas. Also, a hint in the documentation would be great!
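
In case it helps others, that ordering looks roughly like this (table and engine names are placeholders; identifier casing may need care because write_pandas quotes the table name):

```python
from snowflake.connector.pandas_tools import write_pandas

# Let to_sql create the table from a single row...
df.head(1).to_sql("MY_TABLE", engine, index=False, if_exists="replace")
# ...then bulk-load the remaining rows with write_pandas.
success, nchunks, nrows, _ = write_pandas(conn, df.iloc[1:], "MY_TABLE")
```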

However, with both the fastparquet and the "cast-to-str" workarounds we still get NULLs. 😞

Btw, I highly appreciate the tests using pytest as well as the type checks. Speaking of which: wouldn't it be a good idea to add all the above-mentioned cases as tests, along with some assertions that check for NULL values?

Maybe that would increase the motivation to work on this issue a bit, if it actually showed up as worsened code coverage. I would really appreciate it if this issue got a higher priority! 🙏

I am also really up for contributing here in the ways described above.

Zaubeerer avatar Sep 11 '20 21:09 Zaubeerer

PS: Is there already a branch for this issue for contributions?

Zaubeerer avatar Sep 11 '20 22:09 Zaubeerer

PS: Is there already a branch for this issue for contributions?

Could you check if the following PR would remedy your issue @Zaubeerer ? https://github.com/snowflakedb/snowflake-connector-python/pull/362

sfc-gh-mkeller avatar Sep 11 '20 22:09 sfc-gh-mkeller

PS: Is there already a branch for this issue for contributions?

Could you check if the following PR would remedy your issue @Zaubeerer ? #362

Maybe :)

I will clone and test it right now.

Zaubeerer avatar Sep 11 '20 22:09 Zaubeerer