hdfs3
libhdfs3 woes
Some more documentation around libhdfs3 would be helpful. It's difficult to figure out which of these is most canonical, and how they relate to each other.
- https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3
- https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
- https://github.com/martindurant/libhdfs3-downstream
- https://github.com/ContinuumIO/libhdfs3-downstream
- https://github.com/bdrosen96/libhdfs3
The readme points to pivotalrd-libhdfs3, but that one does not seem to work with this library (it is missing hdfsCreateDirectoryEx). I found that the function was added here:
https://github.com/martindurant/libhdfs3-downstream/commit/868cd49db7b56
which was cherry-picked from the bdrosen96 fork. So I tried building the head of https://github.com/martindurant/libhdfs3-downstream on macOS, but I ran into problems (I don't remember exactly what). I had had success on Linux using the package supplied by Anaconda, and I found that https://anaconda.org/conda-forge/libhdfs3/files was built "1 month and 29 days ago", so I looked for the commit on the martindurant fork that roughly corresponded to that date. Now I'm working from https://github.com/martindurant/libhdfs3-downstream/tree/7842951deab2d and still getting build errors, but it feels like I'm getting close to success.
But this is crazy. It would be great if the readme could clarify which fork is canonical and give some guidance on building or otherwise obtaining libhdfs3.
Can you please cross-post on pandas? fastparquet certainly can handle this, so apparently the call is being made incorrectly, but I'm not sure exactly how.
(cc https://github.com/pandas-dev/pandas/issues/33452 )
So pandas seems to assume that the first argument to the api.write function can be either a path or a buffer. In the case of an S3 file, it passes an S3File object (a buffer), not the filepath string. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended, though.
However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a filepath and a file object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735), but the rest of the logic in write relies on the argument being a string.
Ideally, I suppose pandas should pass write an argument that is always the same type of object with the same interface (so even when it's just a string, it should be wrapped by some class). That way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this in pandas would likely break other parts of that code, since the get_filepath_or_buffer function is used quite a lot in pandas (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
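One shape such a wrapper could take, purely as an illustration (this is not pandas' actual abstraction, and the class name is made up), is a small class that normalizes a path or a buffer behind one interface, so the consumer never branches on type:

```python
import io

class FileLikeArg:
    """Wraps either a path string or an open buffer behind one interface.

    Hypothetical sketch of the idea; pandas' real internals differ.
    """

    def __init__(self, path_or_buf):
        self._arg = path_or_buf

    @property
    def is_path(self):
        return isinstance(self._arg, str)

    def open(self, mode="wb"):
        if hasattr(self._arg, "write"):
            return self._arg           # already an open buffer
        return open(self._arg, mode)   # a path: open it ourselves

def write(target, data):
    # Consumers always receive a FileLikeArg, so no isinstance checks here.
    f = target.open()
    f.write(data)
```

With something like this, fastparquet's write could treat every caller-supplied target uniformly, at the cost of pandas changing what it passes across the boundary.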
I believe this should now be fixed, at least in pandas master (and probably in a release too).
@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.