hdfs3
How does this library access HDFS without any username or password?
Can you please explain how we can supply a username and password when connecting to HDFS, and which config currently allows this library to access all of HDFS without a username?
By the way, my dfs.permissions.enabled config is set to true.
Working with a similar setup myself, the critical component here seems to be using pandas > 1.0.5.
the critical component here seems to be using pandas > 1.0.5.
Do you know why this makes a difference? Perhaps we should drop the use of query
in favour of an explicit expression.
Sorry, I don't. I just know that, given two setups identical apart from the pandas
version, the one with pandas >= 1.1.0 throws this error for me.
Hm, I have pandas 1.1.0, and it still passes for me locally :|
Bisecting nixpkgs points at https://github.com/NixOS/nixpkgs/commit/2dafde493f153dba0eb4b34cd49763ee78eda3d9 as the first bad commit.
Indeed, if you simply revert pandas back to that prior version on an otherwise unmodified master, the error reoccurs.
@martindurant if you have Nix installed, we can guide you to a reproducible installation that demonstrates this.
@TomAugspurger , in case you are bored and fancy tracing a pandas thing
Nothing comes to mind immediately, and I won't have time to debug this short-term.
It seems like the difference is occurring in the generation of the file path: https://github.com/dask/fastparquet/blob/a8cb8d1a28eb2db4ada233052cbc01bf815c2551/fastparquet/writer.py#L952-L971
There is a difference in the behaviour of groupby for a multi-index, as can be seen in the following example:
import numpy as np
import pandas as pd
print(pd.DataFrame([(np.datetime64("2020-01-01"), 12345)]).groupby([0, 1]).indices)
A previous version used to preserve the type:
# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/69cb94ebb3193fc5077ee99ab2b50353151466ae.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.0.5
{(numpy.datetime64('2020-01-01T00:00:00.000000000'), 12345): array([0])}
but it now performs a conversion:
# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/2dafde493f153dba0eb4b34cd49763ee78eda3d9.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.1.0
{(Timestamp('2020-01-01 00:00:00'), 12345): array([0])}
Hm, I have pandas 1.1.0, and it still passes for me locally :|
@martindurant That might be because you've changed the compared value as a part of a8cb8d1a28eb2db4ada233052cbc01bf815c2551. That should have broken the test on older pandas versions such as 1.0.5.
Hm, reflexive coding. We could put in a pandas version-dependent block in the test, then. This is already a longer thread than I had thought this would cause!
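A version-dependent block along these lines could look like the sketch below. It is only an illustration: the variable names are hypothetical, the version check is a simple major/minor parse, and the 1.1 cutoff is taken from the pandas 1.0.5 vs 1.1.0 outputs shown earlier in the thread, not from the actual test code.

```python
import numpy as np
import pandas as pd

# Parse "major.minor" from the installed pandas version (simple sketch,
# ignores any pre-release suffix in later components).
major_minor = tuple(int(part) for part in pd.__version__.split(".")[:2])

# Assumption: group keys are Timestamp from pandas 1.1 on and
# np.datetime64 before that, as in the outputs quoted above.
if major_minor >= (1, 1):
    expected = pd.Timestamp("2005-01-02 00:00:00")
else:
    expected = np.datetime64("2005-01-02T00:00:00.000000000")

# Hypothetical filter in the same shape as the test matrix element.
filters = [("dtTrade", "==", expected)]
print(type(expected).__name__)
```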
OK, so changing the test matrix element to
[('dtTrade', '==',
Timestamp('2005-01-02 00:00:00'))]),
should fix it! I see this was already done for another element. The comparison with Timestamp should cast the value whether it's a string or a numpy value.
I had some random thoughts on the issue:
The names of the partitioning directories in the "hive" were changed because the dates were rendered to strings with the default format of the type. Would that be an issue?
Also, it seems like pandas has some aversion to storing np.datetime64 in the index, so it appears that the behaviour in 1.1.0 is not a bug.
The names of the partitioning directories in the "hive" were changed because the dates were rendered to strings with the default format of the type
Correct, we think this is what's going on
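To illustrate the rendering difference, here is a minimal sketch that only compares the default string forms of the two key types. The `dtTrade=` directory pattern is a hypothetical stand-in and is not taken from fastparquet's writer code.

```python
import numpy as np
import pandas as pd

# The same instant, once as a raw numpy value and once wrapped by pandas.
raw = np.datetime64("2020-01-01T00:00:00.000000000")
wrapped = pd.Timestamp(raw)

# Hypothetical directory names built from the default str() of each key.
dir_from_raw = f"dtTrade={raw}"
dir_from_wrapped = f"dtTrade={wrapped}"

print(dir_from_raw)      # ISO form with a "T" separator and nanoseconds
print(dir_from_wrapped)  # pandas form with a space separator
print(wrapped == raw)    # the values still compare equal
```

The values are equal, but any path derived from their default string form differs between the two types.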
it appears that the behaviour in 1.1.0 is not a bug
Well, it's a change in behaviour, hence the problem for us. Perhaps wrapping in Timestamp in the expected value solves this for all cases.
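As a rough sketch of the wrapping idea: passing the expected value through `pd.Timestamp` gives the same result whether it starts out as a string, a numpy value, or a Timestamp. The helper name `as_timestamp` is hypothetical and not part of fastparquet or its tests.

```python
import numpy as np
import pandas as pd


def as_timestamp(value):
    """Hypothetical helper: coerce a str, np.datetime64 or Timestamp to pd.Timestamp."""
    return pd.Timestamp(value)


candidates = [
    "2005-01-02",
    np.datetime64("2005-01-02"),
    pd.Timestamp("2005-01-02 00:00:00"),
]
normalized = [as_timestamp(v) for v in candidates]

# Every input form normalizes to the same Timestamp.
all_equal = all(ts == normalized[0] for ts in normalized)
print(all_equal)
```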