snowflake-connector-python
SNOW-165815: pandas datetime columns misinterpreted as VARCHAR(16777216) instead of DATETIME
Please answer these questions before submitting your issue. Thanks!
- What version of Python are you using (`python --version`)? Python 3.8.2
- What operating system and processor architecture are you using (`python -c 'import platform; print(platform.platform())'`)? macOS-10.15.5-x86_64-i386-64bit
- What are the component versions in the environment (`pip list`)?
appdirs 1.4.3 appier 1.19.12 appnope 0.1.0 asn1crypto 1.3.0 astroid 2.3.3 attrs 19.3.0 azure-common 1.1.25 azure-storage-blob 2.1.0 azure-storage-common 2.1.0 backcall 0.1.0 beautifulsoup4 4.9.0 black 19.10b0 bleach 3.1.4 bokeh 2.0.1 boto3 1.11.17 botocore 1.14.17 cachetools 4.0.0 certifi 2020.4.5.1 cffi 1.13.2 cftime 1.1.1.2 chardet 3.0.4 click 7.1.1 cloudpickle 1.3.0 colorama 0.4.3 cryptography 2.9.2 cycler 0.10.0 cytoolz 0.10.1 dask 2.14.0 decorator 4.4.2 defusedxml 0.6.0 distributed 2.14.0 dnslib 0.9.12 dnspython 1.16.0 docopt 0.6.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fsspec 0.7.1 google-api 0.1.12 google-api-core 1.16.0 google-api-python-client 1.8.0 google-auth 1.13.1 google-auth-httplib2 0.0.3 googleapis-common-protos 1.51.0 grpcio 1.28.1 grpcio-tools 1.28.1 h5netcdf 0.8.0 h5py 2.10.0 HeapDict 1.0.1 httplib2 0.17.1 idna 2.9 ijson 2.6.1 ipykernel 5.2.0 ipython 7.13.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 jdcal 1.4.1 jedi 0.16.0 Jinja2 2.11.1 jmespath 0.9.5 jsonschema 3.2.0 jupyter 1.0.0 jupyter-client 6.1.2 jupyter-console 6.1.0 jupyter-core 4.6.3 kiwisolver 1.1.0 lab 5.5 lazy-object-proxy 1.4.3 localstack 0.11.0 localstack-client 0.23 localstack-ext 0.11.2 locket 0.2.0 MarkupSafe 1.1.1 matplotlib 3.2.1 mccabe 0.6.1 mistune 0.8.4 mkl-fft 1.0.15 mkl-service 2.3.0 more-itertools 8.2.0 msgpack 1.0.0 mypy-extensions 0.4.3 nbconvert 5.6.1 nbformat 5.0.4 netCDF4 1.5.3 notebook 6.0.3 numpy 1.18.2 olefile 0.46 openpyxl 3.0.3 oscrypto 1.2.0 packaging 20.3 pandas 1.0.3 pandocfilters 1.4.2 parso 0.6.2 partd 1.1.0 pathspec 0.7.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 7.0.0 pip 20.1.1 pluggy 0.13.1 prometheus-client 0.7.1 prompt-toolkit 3.0.5 protobuf 3.11.3 psutil 5.7.0 ptyprocess 0.6.0 py 1.8.1 pyaes 1.6.1 pyarrow 0.16.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycparser 2.20 pycryptodomex 3.9.7 Pygments 2.6.1 PyJWT 1.7.1 pylint 2.4.4 pyOpenSSL 19.1.0 pyparsing 2.4.6 pyrsistent 0.16.0 pytest 5.4.1 python-dateutil 2.8.1 python-gitlab 2.1.2 pytz 2019.3 PyYAML 5.3.1 pyzmq 19.0.0 qtconsole 4.7.2 QtPy 1.9.0 read-protobuf 0.1.1 regex 2020.2.20 requests 2.23.0 rsa 3.4.2 s3transfer 0.3.3 Send2Trash 1.5.0 setuptools 46.1.1.post20200323 simplejson 3.17.0 six 1.14.0 snowflake-connector-python 2.2.5 snowflake-sqlalchemy 1.2.3 sortedcontainers 2.1.0 soupsieve 2.0 SQLAlchemy 1.3.16 subprocess32 3.5.4 tblib 1.6.0 terminado 0.8.3 testpath 0.4.4 toml 0.10.0 toolz 0.10.0 tornado 6.0.4 tqdm 4.46.0 traitlets 4.3.3 txt2tags 3.7 typed-ast 1.4.1 typing-extensions 3.7.4.1 uritemplate 3.0.1 urllib3 1.25.8 wcwidth 0.1.9 webencodings 0.5.1 wheel 0.34.2 widgetsnbextension 3.5.1 wrapt 1.12.1 xarray 0.15.1 zict 2.0.0
- What did you do? When trying to upload data using the Snowflake pandas backend:

  `df.to_sql(table_name, engine, index=False, method=pd_writer, if_exists="replace")`

  The datetime column is not identified or transformed correctly; it ends up as VARCHAR(16777216) instead of DATETIME (see the minimal sketch after the questions below).
- What did you expect to see?
- What did you see instead?
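A minimal sketch of the failing call described above, assuming a snowflake-sqlalchemy engine; the table name, connection details, and sample data are placeholders, not taken from the report:

```python
import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

# Placeholder connection string; fill in your own account details.
engine = create_engine(
    "snowflake://<user>:<password>@<account>/<database>/<schema>?warehouse=<warehouse>"
)

df = pd.DataFrame({
    "id": [1, 2],
    "created_at": pd.to_datetime(["2020-06-01 12:00:00", "2020-06-02 13:30:00"]),
})

# Expected: created_at becomes a TIMESTAMP/DATETIME column.
# Observed: the column comes out as VARCHAR(16777216) and the values are not parsed.
df.to_sql("my_table", engine, index=False, method=pd_writer, if_exists="replace")
```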
PS: I am new to Snowflake and to the Python connector, so I am looking forward to learning from you and improving the way we can use the connector. :)
PPS: I wanted to correct the title for clarification, but it seems the GitHub bot does not allow me to do so...
Hey @sfc-gh-mkeller, if you need any further information, I am happy to provide and contribute.
Would like to solve this problem quickly, before ingesting large amounts of data :)
I'm sorry @Zaubeerer, I got caught up in some high-priority work. I promise you that I'll get to this very soon.
Same issue for me, using the write_pandas method from the pandas_tools package of the Snowflake connector
@sfc-gh-mkeller same issue here using write_pandas!
Sorry, the new write_pandas functionality has revealed some issues with our internals.
So far we know that datetimes do not get parsed properly (which is where this error comes from) and lists also don't get parsed properly.
As a short-term solution, please don't use write_pandas with datetimes. As a long-term solution, I'll be switching it over to use CSV files instead of Parquet; let's use this ticket to track progress.
Is there any other alternative to write datetimes in the meantime? Thanks for the reply btw.
Yes, of course. You can always use the to_sql built into pandas. It's only the snowflake.connector.pandas_tools module that won't work with datetimes.
I apologize for this issue!
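For clarity, that interim approach looks roughly like this, continuing the placeholder names from the sketch above:

```python
# Interim workaround: plain pandas to_sql without method=pd_writer, which goes
# through ordinary INSERTs and preserves datetime columns. Slower, but correct.
df.to_sql(
    "my_table",
    engine,
    index=False,
    if_exists="replace",
    chunksize=16000,  # batch the INSERTs; tune as needed
)
```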
Hey @sfc-gh-mkeller, I found this post on Stack Overflow; maybe it's useful to pinpoint the problem. https://stackoverflow.com/questions/60190210/how-can-i-reliably-use-datetime-values-in-parquet-files-to-fill-snowflake-tabl
Thank you @sfc-gh-mkeller, we are using to_sql and it writes datetimes perfectly, but the performance is quite low. We have been doing some tests playing with the params, but without significant changes. Any suggestions?
The problem comes from our Parquet parser, so the fastest way to ingest data would be to dump the data into CSV files, upload those, and then copy them into a table. If anyone would like to write their own solution for this, please use write_pandas as a starting point: just use to_csv and then play with the settings until Snowflake and the pandas CSV engine agree on things. Please note that there was a bug found in it just a few days ago; here's the unreleased fix: https://github.com/snowflakedb/snowflake-connector-python/commit/fa7d80d4bedaa2ab933683576719c2d1bdd2ddcf
Take this into account too!
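For anyone who wants to try the CSV route by hand before a proper fix lands, a rough sketch with a raw connector cursor might look like the following; `df` continues from the sketches above, and the table name, stage, and file-format options are illustrative assumptions, not the eventual long-term solution:

```python
import snowflake.connector

# Dump the DataFrame to a local CSV (no header, matching the table's column order).
df.to_csv("/tmp/my_table.csv", index=False, header=False)

con = snowflake.connector.connect(
    user="<user>", password="<password>", account="<account>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = con.cursor()
# Upload to the table's stage, then load it; the table must already exist.
cur.execute("PUT file:///tmp/my_table.csv @%my_table OVERWRITE = TRUE")
cur.execute(
    "COPY INTO my_table FROM @%my_table "
    "FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"')"
)
con.close()
```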
> Hey @sfc-gh-mkeller, I found this post on Stack Overflow; maybe it's useful to pinpoint the problem. https://stackoverflow.com/questions/60190210/how-can-i-reliably-use-datetime-values-in-parquet-files-to-fill-snowflake-tabl
fastparquet seems to solve the issue, so if anyone would like to work around it for now, install fastparquet and add `engine="fastparquet"` to the following call: https://github.com/snowflakedb/snowflake-connector-python/blob/cbf422897d42b7bcb441d4345d73267e16879c53/pandas_tools.py#L109
Please note that installing fastparquet is only simple if Anaconda is installed; please see https://pypi.org/project/fastparquet/ for more information.
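To illustrate, the workaround boils down to writing the Parquet chunks with the fastparquet engine instead of pyarrow; a standalone sketch (not the upstream code at the linked line) could look like this:

```python
# Requires fastparquet to be installed (easiest via conda, per the note above).
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime(["2020-06-01 12:00:00"])})

# write_pandas writes each chunk with pandas' to_parquet; the suggested stopgap
# is to locally patch that call to pass engine="fastparquet".
df.to_parquet("/tmp/chunk.parquet", engine="fastparquet")
```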
Rest assured, we are working on a long-term fix and appreciate your patience!
Hi @sfc-gh-mkeller,
any updates on how soon the long-term fix will be available?
I will be working with snowflake-connector-python in the next few days and wonder whether or not I have to use the workaround.
Best, Robin
@Zaubeerer we managed to find a temporary workaround by modifying the write_pandas method to write the temp files to CSV, not using Parquet at all (based on the suggestion of @sfc-gh-mkeller). We did not use fastparquet as suggested by @sfc-gh-mkeller because installing it in our Docker container through conda was infeasible. You can try this as a temporary solution while they work on the long-term fix.
I ran into the same issue. However, for us, it doesn't make too much of a difference and so we simply passed all dates and timestamps as strings. If needed, we can cast them back while presenting the data. Of course, we'd ideally like this to just work out of the box, but for now, this workaround is pretty okay.
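A rough sketch of that string-casting approach; the column selection and timestamp format are my own assumptions, since the comment above doesn't include code:

```python
# Convert all datetime columns to plain strings before uploading; cast back
# to TIMESTAMP in SQL or in the presentation layer as needed.
datetime_cols = df.select_dtypes(include=["datetime64[ns]"]).columns
for col in datetime_cols:
    df[col] = df[col].dt.strftime("%Y-%m-%d %H:%M:%S")
```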
Hey @sfc-gh-mkeller, after being able to workaround the issue, we are finally coming back to it. 🙂
So first question: Is this issue still open?
If so, I just tried to `pip install -e .` the latest snowflake-connector-python, but got a cryptic error on a Windows machine:
ERROR: Command errored out with exit status 1: '\anaconda3\envs...\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'...\snowflake-connector-python\setup.py'"'"'; file='"'"'...\snowflake-connector-python\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
That seems to make it difficult to use the workaround with fastparquet 🤔
Furthermore, we now not only get problems with datetimes when using write_pandas, but also with unix time stamps and serial numbers in integer format. 😮
So we would be happy about any update! 🙏
Hi @Zaubeerer, please don't use `-e`; it doesn't work with snowflake-connector-python for two reasons:
- Using a namespace package. The `snowflake` part of `snowflake.connector` is causing the issue. No one else seems to be using namespace packages anymore, so this issue flies under the radar. I have given up on trying to get editable installs to work because of this, and mostly because of my next point.
- Even if our namespace package worked with editable installs, what should happen to the compiled C code? I think this is undefined behavior by the Python/PyPI team. Although I could be wrong about this, I can't find how other packages handle this case.

Maybe someone has come up with a solution to this, but I have been unable to find a good one.
Usually just trying to editable-install the connector will corrupt the whole virtualenv; you might have to recreate the whole thing.
OK, thanks for the instant feedback!
I assume the issue is not resolved yet?
So I will try it just with `pip install .` :)
Not yet, sorry!
So here is a quick summary of the small performance test so far, for a table of 100,000 rows that takes 15 s to read from our client's API into a DataFrame.
We use to_sql with chunksize=16000:
- simple (no pd_writer, no fastparquet): ~ 4 min
- fastparquet (no pd_writer, fastparquet): ~ 2.5 min
- pd_writer (with pd_writer, no fastparquet): ~ 1 min
- pd_writer-fastparquet (both pd_writer and fastparquet): ~ 50 s
- fastparquet-nochunksize (no pd_writer, fastparquet): doesn't work
All this is not really satisfying, as I would have expected something more like 1-10 s for ingesting 100,000 rows.
We actually have thousands of systems with ~3 million rows each, so this would take ages to transfer...
Did I do it right?
Here are my original benchmarks that I ran when I added this feature:
| Description | Old to_sql insert | no pd_writer | pd_writer |
|---|---|---|---|
| 1M rows - 10 000 chunking | 828.96 s | 112.60 s | 283.42 s |
| 1M rows - no chunking | N/A | 31.18 s | 41.63 s |
If we could reproduce the performance of "no pd_writer", that would indeed be a sufficient workaround for us.
How did you implement, or rather call, "no pd_writer - no chunking" using write_pandas?
> @Zaubeerer we managed to find a temporary workaround by modifying the write_pandas method to write the temp files to CSV, not using Parquet at all (based on the suggestion of @sfc-gh-mkeller). We did not use fastparquet as suggested by @sfc-gh-mkeller because installing it in our Docker container through conda was infeasible. You can try this as a temporary solution while they work on the long-term fix.
Hey @m-restrepo11, could you share your solution? Have you done some performance testing similar to the ones mentioned above?
> How did you implement, or rather call, "no pd_writer - no chunking" using write_pandas?
By using https://github.com/snowflakedb/snowflake-connector-python/blob/a7ac99cafd6c2b01e2ed76fef319ca077ee259be/test/pandas/test_pandas_tools.py#L29
Note that you need to make sure that the table is available beforehand with this function, but if it already exists then this is the fastest way to insert data, no chunking just means that
> I ran into the same issue. However, for us, it doesn't make too much of a difference and so we simply passed all dates and timestamps as strings. If needed, we can cast them back while presenting the data. Of course, we'd ideally like this to just work out of the box, but for now, this workaround is pretty okay.
@akshayi1, how do you pass the datetimes etc. as strings? I just tried that workaround (both casting the Unix timestamps and converting the datetimes to str in Python), but we still get NULLs 🤷 😞
> By using
> https://github.com/snowflakedb/snowflake-connector-python/blob/a7ac99cafd6c2b01e2ed76fef319ca077ee259be/test/pandas/test_pandas_tools.py#L29
> Note that you need to make sure that the table is available beforehand with this function, but if it already exists then this is the fastest way to insert data, no chunking just means that
Yep, I worked around the "table has to be available" issue by first calling df.to_sql to write just 1 row and then calling write_pandas to write the other millions of rows. This could actually be added to write_pandas. Also, a hint in the documentation would be great!
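A sketch of that pattern, assuming the SQLAlchemy engine and the raw connection point at the same schema; names are placeholders, and table-name casing/quoting may need to match how to_sql created the table:

```python
from snowflake.connector.pandas_tools import write_pandas

# Let to_sql create the table from the first row only...
df.iloc[:1].to_sql("my_table", engine, index=False, if_exists="replace")

# ...then bulk-load the remaining rows into the now-existing table.
success, num_chunks, num_rows, _ = write_pandas(con, df.iloc[1:], "my_table")
```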
However, with both the fastparquet and the "cast-to-str" work around we still get NULLs. 😞
Btw, I highly appreciate the tests using pytest as well as the type checks. Speaking of which: wouldn't it be a good idea to add all the above-mentioned cases as tests, along with some assertions that check for NULL values?
Maybe it would increase the motivation to work on this issue a bit if it actually showed up as worsened code coverage. I would really appreciate it if this issue got a higher priority! 🙏
I am also really up for contributing here in the ways described above.
PS: Is there already a branch for this issue for contributions?
> PS: Is there already a branch for this issue for contributions?
Could you check if the following PR would remedy your issue, @Zaubeerer? https://github.com/snowflakedb/snowflake-connector-python/pull/362
> PS: Is there already a branch for this issue for contributions?

> Could you check if the following PR would remedy your issue, @Zaubeerer? #362
Maybe :)
I will clone and test it right now.