
how does this lib access the hdfs without any username pwd?

Open venkat-vs-id opened this issue 7 years ago • 1 comments

Can you please explain how we can add a username and password when connecting to hdfs, and what config currently allows this lib to access all of hdfs without a username?

btw my dfs.permissions.enabled config is set to true

venkat-vs-id avatar Nov 08 '17 01:11 venkat-vs-id

Working with a similar setup myself, the critical component here seems to be using pandas > 1.0.5.

risicle avatar Sep 23 '20 22:09 risicle

the critical component here seems to be using pandas > 1.0.5.

Do you know why this makes a difference? Perhaps we should drop the use of query in favour of an explicit expression.

martindurant avatar Sep 24 '20 13:09 martindurant

Sorry, I don't. I just know that, given two setups identical apart from the pandas version, it will throw this error for me with pandas >= 1.1.0

risicle avatar Sep 24 '20 18:09 risicle

Hm, I have pandas 1.1.0, and it still passes for me locally :|

martindurant avatar Sep 24 '20 21:09 martindurant

Bisecting nixpkgs points at https://github.com/NixOS/nixpkgs/commit/2dafde493f153dba0eb4b34cd49763ee78eda3d9 as the first bad commit.

veprbl avatar Sep 25 '20 14:09 veprbl

Indeed, if you simply revert pandas back to that prior version on an otherwise unmodified master, the error reoccurs.

@martindurant if you have Nix installed, we can guide you to a reproducible installation that demonstrates this.

risicle avatar Sep 25 '20 18:09 risicle

@TomAugspurger , in case you are bored and fancy tracing a pandas thing

martindurant avatar Sep 25 '20 18:09 martindurant

Nothing comes to mind immediately, and I won't have time to debug this short-term.

TomAugspurger avatar Sep 25 '20 20:09 TomAugspurger

It seems like the difference is occurring in the generation of the file path: https://github.com/dask/fastparquet/blob/a8cb8d1a28eb2db4ada233052cbc01bf815c2551/fastparquet/writer.py#L952-L971

There is a difference in the behaviour of groupby for a multi-index, which can be seen in the following example:

import numpy as np
import pandas as pd
print(pd.DataFrame([(np.datetime64("2020-01-01"), 12345)]).groupby([0, 1]).indices)

In a previous version it used to preserve the type

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/69cb94ebb3193fc5077ee99ab2b50353151466ae.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.0.5
{(numpy.datetime64('2020-01-01T00:00:00.000000000'), 12345): array([0])}

but it now performs a conversion

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/2dafde493f153dba0eb4b34cd49763ee78eda3d9.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.1.0
{(Timestamp('2020-01-01 00:00:00'), 12345): array([0])}
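As a rough sketch of how a check could be made independent of this change: wrapping the date part of the key in pd.Timestamp should give the same value whether groupby hands back a numpy.datetime64 (as in 1.0.5) or a Timestamp (as in 1.1.0). The variable names below are only for illustration and are not taken from fastparquet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(np.datetime64("2020-01-01"), 12345)])
indices = df.groupby([0, 1]).indices

# There is a single group, keyed by a (date, int) tuple.
(key,) = indices.keys()
date_key, int_key = key

# pd.Timestamp accepts both numpy.datetime64 and Timestamp inputs,
# so the normalized value does not depend on the pandas version.
normalized = pd.Timestamp(date_key)
print(type(date_key).__name__, normalized)
```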

Hm, I have pandas 1.1.0, and it still passes for me locally :|

@martindurant That might be because you've changed the compared value as a part of a8cb8d1a28eb2db4ada233052cbc01bf815c2551. That should have broken the test on older pandas versions such as 1.0.5.

veprbl avatar Sep 27 '20 16:09 veprbl

Hm, reflexive coding. We could put in a pandas version-dependent block in the test, then. This is already a longer thread than I had thought this would cause!
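A minimal sketch of what such a version-dependent block might look like, assuming the 1.1 cut-off observed earlier in this thread. The helper name expected_partition_value is hypothetical and does not exist in fastparquet or its tests.

```python
import numpy as np
import pandas as pd


def expected_partition_value(raw, pandas_version=pd.__version__):
    """Return the key type that groupby is expected to produce (hypothetical helper)."""
    # Only the major and minor components matter for this check.
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    if (major, minor) >= (1, 1):
        # Per the outputs above, pandas 1.1.0 yields Timestamp keys.
        return pd.Timestamp(raw)
    # pandas 1.0.5 preserved numpy.datetime64 keys.
    return np.datetime64(raw, "ns")


print(type(expected_partition_value("2005-01-02")).__name__)
```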

martindurant avatar Sep 28 '20 14:09 martindurant

OK, so changing the test matrix element to

                              [('dtTrade', '==',
                                Timestamp('2005-01-02 00:00:00'))]),

should fix it! I see this was already done for another element. The comparison with Timestamp should cast the value whether it's a string or a numpy value.
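A quick way to convince yourself of the idea is the sketch below. It casts each candidate explicitly with pd.Timestamp before comparing rather than relying on any implicit casting inside the comparison, so it illustrates the normalization only and is not the actual test code.

```python
import numpy as np
import pandas as pd

expected = pd.Timestamp("2005-01-02 00:00:00")

# The same date expressed as a string, a numpy value and a Timestamp.
candidates = [
    "2005-01-02",
    np.datetime64("2005-01-02"),
    pd.Timestamp("2005-01-02"),
]

# Casting through pd.Timestamp makes all three compare equal to the expected value.
matches = [pd.Timestamp(value) == expected for value in candidates]
print(matches)
```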

martindurant avatar Sep 29 '20 16:09 martindurant

I had some random thoughts on the issue:

The names of partitioning directories in the "hive" were changed because the dates were rendered to string with default format of the type. Would that be an issue?
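To illustrate the rendering difference, here is a small sketch. It assumes the directory name is simply "column=str(value)"; partition_dir is a hypothetical helper, and the exact formatting in fastparquet's writer may differ.

```python
import numpy as np
import pandas as pd


def partition_dir(column, value):
    # Hypothetical helper: hive-style "column=value" using the type's default str().
    return column + "=" + str(value)


old_key = np.datetime64("2020-01-01T00:00:00.000000000")  # key type seen with pandas 1.0.5
new_key = pd.Timestamp("2020-01-01")                      # key type seen with pandas 1.1.0

old_dir = partition_dir("dtTrade", old_key)
new_dir = partition_dir("dtTrade", new_key)
print(old_dir)  # dtTrade=2020-01-01T00:00:00.000000000
print(new_dir)  # dtTrade=2020-01-01 00:00:00
```

Both names refer to the same instant, but a reader that matches on the literal directory string would treat them as different partitions.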

Also, it seems like pandas has some aversion to storing np.datetime64 in the index, so it appears that the behaviour in 1.1.0 is not a bug.

veprbl avatar Sep 29 '20 17:09 veprbl

The names of partitioning directories in the "hive" were changed because the dates were rendered to string with default format of the type

Correct, we think this is what's going on

it appears that the behaviour in 1.1.0 is not a bug

Well, it's a change in behaviour, hence the problem for us. Perhaps wrapping in Timestamp in the expected value solves this for all cases.

martindurant avatar Sep 29 '20 19:09 martindurant