
libhdfs3 woes

Open nlevitt opened this issue 7 years ago • 4 comments

Some more documentation around libhdfs3 would be helpful. It's difficult to figure out which of these is most canonical, and how they relate to each other.

https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3
https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
https://github.com/martindurant/libhdfs3-downstream
https://github.com/ContinuumIO/libhdfs3-downstream
https://github.com/bdrosen96/libhdfs3

The readme points to pivotalrd-libhdfs3, but that one does not seem to work with this library (it is missing hdfsCreateDirectoryEx). I found that the function was added here https://github.com/martindurant/libhdfs3-downstream/commit/868cd49db7b56, cherry-picked from the bdrosen96 fork. So I tried building the head of https://github.com/martindurant/libhdfs3-downstream on mac, but I ran into problems (I don't remember exactly what).

I had had success on linux using the package supplied by anaconda, and I found that https://anaconda.org/conda-forge/libhdfs3/files was built "1 month and 29 days ago", so I looked for the commit on the martindurant fork that roughly corresponded to that date. Now I'm working from https://github.com/martindurant/libhdfs3-downstream/tree/7842951deab2d and I'm still getting build errors, but it feels like I'm getting close to success.
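For context, this is roughly the kind of usage I'm trying to get working; it's a minimal sketch, assuming a libhdfs3 build that exports the symbols hdfs3 expects (such as hdfsCreateDirectoryEx), and the namenode host/port are placeholders for my cluster:

```python
# Minimal hdfs3 smoke test. Assumes libhdfs3 is installed and exports the
# full symbol set hdfs3 expects (e.g. hdfsCreateDirectoryEx).
from hdfs3 import HDFileSystem

# Placeholder namenode host/port -- substitute your own cluster settings.
hdfs = HDFileSystem(host='namenode.example.com', port=8020)

hdfs.mkdir('/tmp/hdfs3-test')
with hdfs.open('/tmp/hdfs3-test/hello.txt', 'wb') as f:
    f.write(b'hello from hdfs3')

print(hdfs.ls('/tmp/hdfs3-test'))
```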

But this is crazy. It would be great if the readme could clarify and give some guidance on building or otherwise obtaining libhdfs3.

nlevitt avatar Oct 27 '17 18:10 nlevitt

Can you please cross-post on pandas? fastparquet certainly does handle this, so apparently the call is being made incorrectly, but I'm not sure exactly how.

(cc https://github.com/pandas-dev/pandas/issues/33452 )

martindurant avatar May 18 '20 18:05 martindurant

So pandas seems to assume that the first argument to the api.write function can be either a path or a buffer. In the case of an S3 file, it passes an S3File object (a buffer), not the filepath string. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended though.
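Roughly, the behavior described is something like the sketch below. This is illustrative only, not the actual pandas code; the point is just that for an s3:// URL the parquet writer receives an open S3File object, while a local path stays a plain string:

```python
# Illustrative sketch only -- not the real pandas implementation.
import s3fs

def get_filepath_or_buffer(path, mode="rb"):
    if isinstance(path, str) and path.startswith("s3://"):
        fs = s3fs.S3FileSystem()
        return fs.open(path, mode)   # file-like S3File buffer, not a str
    return path                      # local path: unchanged string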

However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a filepath and a File object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735). But the rest of the logic in the write function relies on the argument being a string.
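To make the mismatch concrete, here is a hedged sketch of the two code paths; this is not fastparquet's code, just an illustration of why one path tolerates a buffer and the other does not:

```python
def write_simple_like(fn, data, open_with=open):
    """Accepts either a path string or an already-open, writable file object."""
    if hasattr(fn, "write"):              # buffer case (e.g. an S3File)
        fn.write(data)
    else:                                 # path case
        with open_with(fn, "wb") as f:
            f.write(data)

def write_like(fn, data):
    """Path-only logic: string operations fail if fn is a buffer."""
    root = fn.rstrip("/")                 # AttributeError for a file object
    write_simple_like(root + "/part.0.parquet", data)
```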

Ideally, I suppose pandas should pass an argument to write that is always the same type of object with the same interface (so even when it's just a string, it should be wrapped by some class). That way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this in pandas would likely break other parts of that code, since the get_filepath_or_buffer function is used quite a lot in pandas (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
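Something along these lines is what I have in mind; the class and method names are hypothetical, not an existing pandas API:

```python
# Hedged sketch of the idea above: wrap whatever pandas resolved (path string
# or open buffer) in one object with a single interface, so downstream writers
# never need to branch on type. Hypothetical names, not a pandas API.
from contextlib import contextmanager

class FilepathOrBuffer:
    def __init__(self, obj):
        self.obj = obj

    @contextmanager
    def open(self, mode="wb"):
        if hasattr(self.obj, "write"):    # already an open buffer
            yield self.obj
        else:                             # plain path string
            with open(self.obj, mode) as f:
                yield f

# A writer then only ever does:
#     with FilepathOrBuffer(target).open("wb") as f:
#         f.write(payload)
```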

tammymendt avatar May 19 '20 08:05 tammymendt

I believe this should now be fixed, at least in pandas master (and probably in a released version too).

martindurant avatar Sep 08 '20 13:09 martindurant

@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.

tammymendt avatar Sep 16 '20 07:09 tammymendt