
fsspec.implementations.hdfs.PyArrowHDFS → fsspec.implementations.arrow.HadoopFileSystem

Open DimitriPapadopoulos opened this issue 3 years ago • 34 comments

This warning appears in tests:

fsspec/implementations/tests/test_hdfs.py::test_put_file_get_file
  /home/runner/work/filesystem_spec/filesystem_spec/fsspec/implementations/tests/test_hdfs.py:16: FutureWarning: pyarrow.hdfs.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
    hdfs = pyarrow.hdfs.HadoopFileSystem()

According to the pyarrow release history, current version is 6.0.1 and version 2.0.0 was released on 19 October 2020. The Arrow project is moving fast, perhaps it's time to require pyarrow >= 2.0.0 and use pyarrow.fs.HadoopFileSystem instead of pyarrow.hdfs.HadoopFileSystem.

DimitriPapadopoulos avatar Dec 29 '21 10:12 DimitriPapadopoulos

Please see the fsspec.implementations.arrow for the new version https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/arrow.py#L190-L241. Perhaps we should advertise it better in the docs?

isidentical avatar Dec 29 '21 10:12 isidentical

Thank you for pointing me to the new implementation, I hadn't noticed it.

~~Yet, I think the existence of the new version is not directly related to updating pyarrow.hdfs.HadoopFileSystem to pyarrow.fs.HadoopFileSystem in the old version.~~

DimitriPapadopoulos avatar Dec 29 '21 10:12 DimitriPapadopoulos

Ah, got it: pyarrow.hdfs.HadoopFileSystem is part of the legacy Filesystem Interface, while pyarrow.fs.HadoopFileSystem is part of the new Filesystem Interface. They really are different, not just aliases of one another.

Since pyarrow.hdfs.HadoopFileSystem is deprecated, shouldn't fsspec.implementations.hdfs.PyArrowHDFS be deprecated too?

Right now, fsspec.implementations.hdfs.PyArrowHDFS has extensive documentation: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.hdfs.PyArrowHDFS and fsspec.implementations.hdfs.HadoopFileSystem is mentioned: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.hdfs.HadoopFileSystem

alias of pyarrow.hdfs.

Meanwhile, fsspec.implementations.arrow.HadoopFileSystem is not documented as far as I can see, just fsspec.implementations.arrow.ArrowFSWrapper: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.arrow.ArrowFSWrapper

Yes, I guess the new version needs to be better advertised and the old version should point to the new version.

DimitriPapadopoulos avatar Dec 29 '21 11:12 DimitriPapadopoulos

fsspec.implementations.hdfs.PyArrowHDFS is used by default today when letting fsspec guess the filesystem from the prefix. Should we make fsspec.implementations.arrow.HadoopFileSystem the new default?

rom1504 avatar Feb 04 '22 09:02 rom1504

actually I'm not so sure: I tried using the arrow_hdfs prefix and am now getting an "only valid on seekable files" error

rom1504 avatar Feb 04 '22 10:02 rom1504

Agreed, we ought to move forward with arrow's new interface for "hdfs" if we are convinced it does all we need. So the issue @rom1504 faces should be investigated.

martindurant avatar Feb 04 '22 13:02 martindurant

I'm working on this at Wikimedia, and was just wondering the same thing! I was considering adding an entry_point to override the existing hdfs registry entry with arrow_hdfs. However, that feels a little too magic: merely installing my Python project would change the way fsspec works.

@rom1504 how did you get the error you saw?

The only difference I've noticed is that CLASSPATH and maybe JAVA_HOME need to be set properly for pyarrow.fs.HadoopFileSystem to work, whereas the old pyarrow.hdfs.HadoopFileSystem seems to be able to discover the proper classpaths and default Hadoop configs on its own.

This means that if we make the new fsspec.implementations.arrow.HadoopFileSystem the default for 'hdfs://', it may break some people's usage of fsspec + hdfs:// URLs unless they are aware of the need to set these env vars.

I'd much prefer to make it the default for hdfs:// URLs if we can, though.

ottomata avatar Feb 10 '22 20:02 ottomata

Note that we are planning to remove the deprecated pyarrow.hdfs.HadoopFileSystem shortly (probably the next release), so if you don't think you are ready to use the new implementation as the default in fsspec for hdfs://, please speak up / report concrete issues over at https://issues.apache.org/jira/projects/ARROW/issues

jorisvandenbossche avatar Feb 23 '22 09:02 jorisvandenbossche

Ah great, good to know. Perhaps we will just wait for that then.

ottomata avatar Feb 28 '22 13:02 ottomata

I would personally recommend to not wait (at least with giving it a try), so we can try to resolve any issues with the new implementation before it is the only option.

jorisvandenbossche avatar Mar 01 '22 09:03 jorisvandenbossche

Hm okay, then perhaps I will do it now. My Python project will change the way fsspec hdfs:// URLs work if it is installed (or imported? TBD).

Thanks.

ottomata avatar Mar 01 '22 13:03 ottomata

@ottomata you simply need to use the arrow_hdfs:// prefix instead of hdfs://

rom1504 avatar Mar 01 '22 13:03 rom1504

I saw that error with seek when trying to call file.seek on a file coming from the new hdfs filesystem implementation. Is seek supported in that implementation?

rom1504 avatar Mar 01 '22 13:03 rom1504

you simply need to use the arrow_hdfs:// prefix instead of hdfs://

Yes, but I'm writing a library that is trying to abstract that kind of detail away. Folks are used to using hdfs:// URLs.

ottomata avatar Mar 01 '22 13:03 ottomata

Then you need to either wait for arrow_hdfs to become the default under hdfs, or put file_path.replace("hdfs://", "arrow_hdfs://") in your lib.
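If you go the replace route, a tiny helper (the name is mine, not an fsspec API) keeps the rewrite explicit and avoids touching URLs that already use another scheme:

```python
def to_arrow_url(url: str) -> str:
    # Rewrite the legacy hdfs:// scheme to the new arrow_hdfs:// scheme;
    # leave any other URL (including arrow_hdfs:// itself) untouched.
    prefix = "hdfs://"
    if url.startswith(prefix):
        return "arrow_hdfs://" + url[len(prefix):]
    return url

print(to_arrow_url("hdfs://host/path/to/file"))  # arrow_hdfs://host/path/to/file
print(to_arrow_url("s3://bucket/key"))           # s3://bucket/key
```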

I would definitely be interested as well if you can try your lib with the new implementation to see if things are working for you. (and whether they're faster/slower as well)

rom1504 avatar Mar 01 '22 13:03 rom1504

So far my lib uses only a few fsspec API calls: mostly checking existence, open, read, write. I've tried with arrow_hdfs:// and it seems to work just fine, as long as I remember to set CLASSPATH and JAVA_HOME correctly first (which took me a while to realize I needed to do).

ottomata avatar Mar 01 '22 13:03 ottomata

I saw that error with seek when trying to call file.seek on a file coming from the new hdfs filesystem implementation. Is seek supported in that implementation?

@rom1504 Do you have a full error traceback?

I am not very familiar with HDFS myself, but normally "seek" operations should be supported (it should return a random access file object).

jorisvandenbossche avatar Mar 01 '22 13:03 jorisvandenbossche

as long as I remember to set CLASSPATH and JAVA_HOME correctly before (which it took me a while to realize I needed to do).

There is a bit of discussion about this at https://issues.apache.org/jira/browse/ARROW-13141, where we decided to not mimic the previous logic of "automatically" setting CLASSPATH.

The docs are now also updated to explicitly mention that the env variable needs to be specified: https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs
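For anyone landing here, the setup from those docs boils down to something like the following (the JAVA_HOME path is illustrative and depends on your system):

```shell
# Both must be set before pyarrow.fs.HadoopFileSystem is constructed.
export JAVA_HOME=/usr/lib/jvm/default-java    # adjust for your system
export CLASSPATH=$(hadoop classpath --glob)   # expands the Hadoop jar paths
```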

jorisvandenbossche avatar Mar 01 '22 13:03 jorisvandenbossche

I can confirm that the protocol string will be just "hdfs", but we will keep the old one too. It's good to get people testing it now before we make the switch - since "hdfs" has so far been pointing to the now deprecated interface, that's the one that most people will have been using.

martindurant avatar Mar 01 '22 13:03 martindurant

@rom1504 Do you have a full error traceback?

no I don't have it anymore, but I'll try it again and share what I get

rom1504 avatar Mar 01 '22 14:03 rom1504

I discovered this while researching the differences between each class. Is this intended behavior?

In [1]: import fsspec

In [2]: fsspec.core.url_to_fs("hdfs://host/path/to/file")
Out[2]: (<fsspec.implementations.hdfs.PyArrowHDFS at 0xXXXXXXXXX>,
         '/path/to/file')

In [2]: fsspec.core.url_to_fs("arrow_hdfs://host/path/to/file")
Out[2]: (<fsspec.implementations.arrow.HadoopFileSystem at 0xXXXXXXXXX>,
         'host/path/to/file')

As shown above, just by changing the protocol, even the returned path string changes.

After a little investigation, I noticed that in _strip_protocol(cls, path), fsspec.utils.infer_storage_options is only called by hdfs.PyArrowHDFS.

https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/hdfs.html https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/arrow.html

And at first glance, the hdfs.PyArrowHDFS behavior seems to be correct.

Any comments would be appreciated.
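For what it's worth, here is a stdlib-only sketch of the stripping behaviour that PyArrowHDFS gets from fsspec.utils.infer_storage_options, i.e. host[:port] dropped from the returned path (the function name is mine, not fsspec's):

```python
def strip_protocol(url: str) -> str:
    # Split off "<protocol>://", then drop the host[:port] segment:
    # everything before the first "/" of the remainder is the netloc,
    # and the path starts at that "/".
    for proto in ("hdfs", "arrow_hdfs"):
        prefix = proto + "://"
        if url.startswith(prefix):
            rest = url[len(prefix):]
            slash = rest.find("/")
            return rest[slash:] if slash >= 0 else "/"
    return url

print(strip_protocol("hdfs://host/path/to/file"))    # /path/to/file
print(strip_protocol("arrow_hdfs:///path/to/file"))  # /path/to/file
```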

reoono avatar Mar 26 '22 17:03 reoono

We will be migrating the old "hdfs" towards what is currently known as "arrow_hdfs", so indeed it is good to check this kind of behaviour out! I agree that host should probably not be part of the output path, and that therefore the _strip_protocol probably needs work.

martindurant avatar Mar 28 '22 18:03 martindurant

@isidentical , do you have a comment on which _strip_protocol behaviour is the right one here? I'm actually not sure what it means to have a "host" in the HDFS URL, given that the actual host(s) would be defined in the site.xml conf files.

martindurant avatar Mar 29 '22 17:03 martindurant

Perhaps the current fsspec.implementations.arrow.HadoopFileSystem implementation only works correctly when connecting to the default cluster. If the host isn't explicitly written, as in hdfs:///path (three slashes), the default host is loaded from the XML config, and the two implementations then agree:

In [1]: import fsspec

In [2]: fsspec.core.url_to_fs("hdfs:///path/to/file")
Out[2]: (<fsspec.implementations.hdfs.PyArrowHDFS at 0xXXXXXXXXX>,
         '/path/to/file')

In [2]: fsspec.core.url_to_fs("arrow_hdfs:///path/to/file")
Out[2]: (<fsspec.implementations.arrow.HadoopFileSystem at 0xXXXXXXXXX>,
         '/path/to/file')

reoono avatar Apr 01 '22 16:04 reoono

@isidentical Please let me know your thoughts and comments on this if you can find the time.

reoono avatar Apr 18 '22 15:04 reoono

I'm actually not sure what it means to have a "host" in the HDFS URL

I don't maintain that part of the code anymore on DVC, but this is how it does it right now. So we always include the host/port etc. in the path. There might be some edge cases that might not be handled properly in the original fsspec code. https://github.com/iterative/dvc/blob/55219e9089005ac15d668ecf735aeaf31a771d0b/dvc/fs/hdfs.py#L28-L32
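The idea can be sketched as a small helper (names and the port value are illustrative, and this is not the actual DVC code): when rendering a path back into a URL, keep the host and port rather than discarding them.

```python
def unstrip_protocol(host: str, port: int, path: str) -> str:
    # Re-attach the scheme and host[:port] to a bare HDFS path,
    # in the spirit of DVC's wrapper linked above.
    return f"hdfs://{host}:{port}{path}"

print(unstrip_protocol("namenode", 8020, "/path/to/file"))  # hdfs://namenode:8020/path/to/file
```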

isidentical avatar Apr 18 '22 16:04 isidentical

My apologies for the delay in replying. I am relieved that this matches my understanding of host and port. I would like to create a PR based on the code you pointed me to.

reoono avatar Apr 25 '22 18:04 reoono

You are welcome to contribute! I don't know that we'll have a way to test the solution, though.

martindurant avatar Apr 25 '22 18:04 martindurant

Ah ha! I am encountering an issue with arrow_hdfs.

When opening a nonexistent file with mode='wb', I get an AttributeError:

fsspec.open("arrow_hdfs:///tmp/a5", mode='wb').open()

AttributeError                            Traceback (most recent call last)
<ipython-input-10-8bdebba006ca> in <module>
----> 1 f1 = fsspec.open("arrow_hdfs:///tmp/a5", mode='wb').open()

/usr/lib/airflow/lib/python3.7/site-packages/fsspec/core.py in open(self)
    148             _close(fobjects, mode)  # call close on other dependent file-likes
    149
--> 150         out.close = close
    151         return out
    152

AttributeError: can't set attribute

Using hdfs:/// and using either mode 'w' or 'wt' does not throw this error.

I'm using fsspec 2022.01.0.

ottomata avatar Apr 27 '22 13:04 ottomata

I assume

with fsspec.open("arrow_hdfs:///tmp/a5", mode='wb') as f1:

or

fsspec.filesystem("arrow_hdfs").open("/tmp/a5", mode="wb")

are both fine?

martindurant avatar Apr 27 '22 14:04 martindurant