[Ray Tune/Train] Auth with aws_web_identity_token or use the provided file system provider in runtime config
Description
Support web identity tokens for AWS authentication. This is a common setup for most Kubernetes-based clusters.
Use case
Background:
- PyArrow doesn't accept an AWS web identity token file; it's implemented in C++ and hasn't been ported yet.
- Ray uses the pyarrow filesystem for almost everything.
- Ray allows specifying a FileSystemProvider, but doesn't use the provided one for all calls.
Suggestions:
- Support AWS web identity tokens.
- Don't use pyarrow unless it supports web identity tokens.
- Use the provided filesystem provider everywhere.
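For anyone trying to confirm they are in the same situation, here is a minimal sketch that checks whether a process is configured for web identity auth. It only looks at the two standard AWS SDK environment variables; the function name `uses_web_identity` is hypothetical and not part of Ray or pyarrow.

```python
import os

# Standard AWS SDK environment variables used for web identity auth
# (for example, IAM roles for service accounts on Kubernetes).
WEB_IDENTITY_VARS = ("AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_ROLE_ARN")


def uses_web_identity(env=None):
    """Return True if both web identity variables are set and non-empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return all(env.get(name) for name in WEB_IDENTITY_VARS)
```

Running `uses_web_identity()` inside the Ray worker (not only on the driver) shows whether the variables actually reach the process that performs the upload.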
(TunerInternal pid=498719)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/syncer.py", line 136, in entrypoint
(TunerInternal pid=498719)     result = self._fn(*args, **kwargs)
(TunerInternal pid=498719)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/storage.py", line 212, in _upload_to_fs_path
(TunerInternal pid=498719)     _pyarrow_fs_copy_files(local_path, fs_path, destination_filesystem=fs)
(TunerInternal pid=498719)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files
(TunerInternal pid=498719)     return pyarrow.fs.copy_files(
(TunerInternal pid=498719)   File "/home/ray/anaconda3/lib/python3.9/site-packages/pyarrow/fs.py", line 269, in copy_files
(TunerInternal pid=498719)     _copy_files_selector(source_fs, source_sel,
(TunerInternal pid=498719)   File "pyarrow/_fs.pyx", line 1616, in pyarrow._fs._copy_files_selector
(TunerInternal pid=498719)   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(TunerInternal pid=498719) OSError: When testing for existence of bucket 'aa.ray.storage': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 43, A libcurl function was given a bad argument
(TunerInternal pid=498719)
(TunerInternal pid=498719) Caught exception when creating directory at (, aa.ray.storage/ltv_revenue_first_90d_model_20240418_tune):
(TunerInternal pid=498719) Traceback (most recent call last):
(TunerInternal pid=498719)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/storage.py", line 281, in _create_directory
(TunerInternal pid=498719)     fs.create_dir(fs_path)
(TunerInternal pid=498719)   File "pyarrow/_fs.pyx", line 593, in pyarrow._fs.FileSystem.create_dir
(TunerInternal pid=498719)   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(TunerInternal pid=498719) OSError: When testing for existence of bucket 'aa.ray.storage': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 43, A libcurl function was given a bad argument
^^ This happens because Ray is using the pyarrow filesystem here instead of the filesystem provider that was passed in, so the credentials are expired.
Related issue: https://github.com/ray-project/ray/issues/41137
@ramannanda9 What is the FileSystemProvider that you're mentioning?
Yeah. We tried patching S3FileSystem by extending the class and re-authenticating, but it still didn't work. What does work for us is the following, so perhaps this should be made the default?
storage_filesystem = fs.PyFileSystem(fs.FSSpecHandler(s3fs.S3FileSystem()))
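For completeness, a minimal sketch of how that filesystem might be passed to a run, assuming a Ray version where `ray.train.RunConfig` accepts `storage_filesystem` and that `s3fs` and `pyarrow` are installed. The helper names `strip_scheme` and `build_run_config` are hypothetical. When a custom filesystem is supplied, the storage path is given relative to that filesystem (bucket and prefix, no `s3://`), which matches the path shown in the log above.

```python
def strip_scheme(uri):
    """Drop a leading 's3://' so the path is relative to the filesystem."""
    prefix = "s3://"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def build_run_config(storage_uri):
    """Build a RunConfig that stores results through an fsspec-backed filesystem."""
    # Imported here so strip_scheme stays usable without these packages.
    import pyarrow.fs
    import s3fs
    from ray import train

    # Wrap the fsspec filesystem so Ray can use it through the pyarrow interface.
    # s3fs resolves credentials itself rather than through pyarrow's S3 client.
    storage_filesystem = pyarrow.fs.PyFileSystem(
        pyarrow.fs.FSSpecHandler(s3fs.S3FileSystem())
    )
    return train.RunConfig(
        storage_path=strip_scheme(storage_uri),
        storage_filesystem=storage_filesystem,
    )
```

The result can then be passed as `run_config=build_run_config("s3://aa.ray.storage/some_prefix")` to a `Tuner` or trainer, where `some_prefix` is a placeholder.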
@ramannanda9 Good to see that you found a solution. pyarrow's default implementations do the job for most cases, and we felt that the flexibility of slotting in fsspec filesystems like you did is good enough for more advanced usage. Making fsspec the default would involve adding a few more required dependencies for all Ray Train users, which is not ideal.
@justinvyu perhaps it is better to document this as a special section highlighting where pyarrow would work, but feel free to close this issue out.