Getting 'GS' key error when reading a csv from GCS using gcsfs

Open ohashmi1 opened this issue 5 years ago • 10 comments

Hi, I upgraded gcsfs and now I get the error shown in the traceback below.

My code is pretty simple:

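```
import dask.dataframe as dd

def load_parse_file(file_path):
    # date_column is defined elsewhere in the poster's module (not shown here)
    data = dd.read_csv(file_path, parse_dates=[date_column]).compute()
    return data
```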

It used to work, but it suddenly stopped working.
file_path = "gs://mybuck/res.csv"

```
File "main.py", line 51, in run
    data = load_parse_file(file_path=args.input_file)
  File "/FbProphet/prophet_gcp/utils.py", line 15, in load_parse_file
    data = dd.read_csv(file_path, parse_dates=[date_column])\
  File "/work/miniconda/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 578, in read
    **kwargs
  File "/work/miniconda/lib/python3.7/site-packages/dask/dataframe/io/csv.py", line 405, in read_pandas
    **(storage_options or {})
  File "/work/miniconda/lib/python3.7/site-packages/dask/bytes/core.py", line 93, in read_bytes
    fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)
  File "/work/miniconda/lib/python3.7/site-packages/dask/bytes/core.py", line 425, in get_fs_token_paths
    fs, fs_token = get_fs(protocol, options)
  File "/work/miniconda/lib/python3.7/site-packages/dask/bytes/core.py", line 571, in get_fs
    cls = _filesystems[protocol]
KeyError: 'gs'
```

ohashmi1 avatar Jul 17 '19 17:07 ohashmi1

Ah yes, sorry - my fault. For now, you can replace "gs" with "gcs".

martindurant avatar Jul 17 '19 17:07 martindurant

I have tried both, still get the same issue

ohashmi1 avatar Jul 17 '19 17:07 ohashmi1

Hm, actually on second thoughts, you are not using the new code at all.

I don't know why you are seeing this; there has been no change in dask (master) or gcsfs (release) yet. Can you show the contents of `dask.bytes.core._filesystems`, try `import gcsfs` explicitly, or run `dask.bytes.core.get_fs('gs')`?
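For anyone reproducing this, those checks could be scripted roughly as follows. This is only a sketch: `registered_protocols` and `gcsfs_importable` are hypothetical helper names, and `_filesystems` is a private dask attribute that appears in the tracebacks in this thread but may be missing from other dask releases, so the code returns `None` instead of failing when it is absent.

```python
import importlib


def registered_protocols():
    """Return the protocol names in dask's private registry, or None if unavailable."""
    try:
        core = importlib.import_module("dask.bytes.core")
    except ImportError:
        return None  # dask (or one of its dependencies) is not importable
    # Private attribute seen in this thread's tracebacks; may not exist elsewhere.
    registry = getattr(core, "_filesystems", None)
    return sorted(registry) if registry is not None else None


def gcsfs_importable():
    """Return True if `import gcsfs` succeeds in this environment."""
    try:
        importlib.import_module("gcsfs")
    except ImportError:
        return False
    return True


print("registered protocols:", registered_protocols())
print("gcsfs importable:", gcsfs_importable())
```

If the first line prints a list without `gs` or `gcs` while gcsfs imports fine, the installed dask and gcsfs are probably not registering with each other.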

martindurant avatar Jul 17 '19 18:07 martindurant

Of course, the workaround for you may be simply to downgrade gcsfs until we have completed the transition to fsspec (which is the reason for a little turbulence right now).

martindurant avatar Jul 17 '19 18:07 martindurant

Seeing the same behavior with gcsfs 0.3.0:

[ins] In [2]: dd.read_csv('gs://gcp-public-data-landsat/index.csv.gz', compression=
         ...: 'gzip')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-2c32b0849045> in <module>
----> 1 dd.read_csv('gs://gcp-public-data-landsat/index.csv.gz', compression='gzip')

~/venvs/model/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580

~/venvs/model/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    403         compression=compression,
    404         include_path=include_path_column,
--> 405         **(storage_options or {})
    406     )
    407

~/venvs/model/lib/python3.7/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
     91
     92     """
---> 93     fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)
     94
     95     if len(paths) == 0:

~/venvs/model/lib/python3.7/site-packages/dask/bytes/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options)
    423         update_storage_options(options, storage_options)
    424
--> 425         fs, fs_token = get_fs(protocol, options)
    426
    427         if "w" in mode:

~/venvs/model/lib/python3.7/site-packages/dask/bytes/core.py in get_fs(protocol, storage_options)
    569             "    pip install gcsfs",
    570         )
--> 571         cls = _filesystems[protocol]
    572
    573     elif protocol in ["adl", "adlfs"]:

KeyError: 'gs'

Re: the questions you asked above:

[nav] In [7]: import dask.bytes.core
         ...: dask.bytes.core._filesystems
         ...:
Out[7]: {'file': dask.bytes.local.LocalFileSystem}


[nav] In [9]: import dask.bytes.core
         ...: dask.bytes.core.get_fs('gs')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-855d81a61db6> in <module>
      1 import dask.bytes.core
----> 2 dask.bytes.core.get_fs('gs')

~/venvs/model/lib/python3.7/site-packages/dask/bytes/core.py in get_fs(protocol, storage_options)
    569             "    pip install gcsfs",
    570         )
--> 571         cls = _filesystems[protocol]
    572
    573     elif protocol in ["adl", "adlfs"]:

KeyError: 'gs'

bnaul avatar Jul 19 '19 05:07 bnaul

I'm afraid you need to use the master version of dask to pick this up, following https://github.com/dask/dask/pull/5064

martindurant avatar Jul 19 '19 12:07 martindurant

Sounds good, seems like this is resolved then?

bnaul avatar Jul 19 '19 15:07 bnaul

C:\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580 

C:\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    403         compression=compression,
    404         include_path=include_path_column,
--> 405         **(storage_options or {})
    406     )
    407 

C:\Anaconda3\lib\site-packages\dask\bytes\core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
     91 
     92     """
---> 93     fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)
     94 
     95     if len(paths) == 0:

C:\Anaconda3\lib\site-packages\dask\bytes\core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options)
    423         update_storage_options(options, storage_options)
    424 
--> 425         fs, fs_token = get_fs(protocol, options)
    426 
    427         if "w" in mode:

C:\Anaconda3\lib\site-packages\dask\bytes\core.py in get_fs(protocol, storage_options)
    569             "    pip install gcsfs",
    570         )
--> 571         cls = _filesystems[protocol]
    572 
    573     elif protocol in ["adl", "adlfs"]:

KeyError: 'gcs'

I have the same issue now with dask 2.1.0 (py_0), dask-core 2.1.0 (py_0), and gcsfs 0.3.0 (py_0, conda-forge).

PoradaKev avatar Jul 25 '19 08:07 PoradaKev

@PoradaKev - you have a version of gcsfs that is too new for Dask. Either downgrade gcsfs, or install Dask from master.

martindurant avatar Jul 25 '19 12:07 martindurant

I tried that: dask==2.1.0 with gcsfs==0.2.3 works.
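A quick way to see whether an environment has the combination reported in this thread is to compare the installed versions. The sketch below assumes Python 3.8+ (for `importlib.metadata`); the helper names are hypothetical, and the rule in `looks_mismatched` is extrapolated from the two data points above (dask 2.1.0 fails with gcsfs 0.3.0 and works with gcsfs 0.2.3), not from any documented compatibility table.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def as_tuple(version_string):
    """Turn '0.2.3' into (0, 2, 3); non-numeric parts such as 'dev' are dropped."""
    return tuple(int(part) for part in version_string.split(".") if part.isdigit())


def looks_mismatched(dask_version, gcsfs_version):
    # Assumption drawn from this thread only: dask 2.1.0 fails with gcsfs 0.3.0
    # and works with gcsfs 0.2.3. Treating every dask <= 2.1.0 the same way is
    # an extrapolation, not something the thread confirms.
    return as_tuple(dask_version) <= (2, 1, 0) and as_tuple(gcsfs_version) >= (0, 3, 0)


dask_v, gcsfs_v = installed_version("dask"), installed_version("gcsfs")
if dask_v is None or gcsfs_v is None:
    print("dask and/or gcsfs is not installed:", dask_v, gcsfs_v)
elif looks_mismatched(dask_v, gcsfs_v):
    print(f"dask {dask_v} with gcsfs {gcsfs_v}: try gcsfs 0.2.3 or dask from master")
else:
    print(f"dask {dask_v} with gcsfs {gcsfs_v}: not the combination reported here")
```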

chwonghk01 avatar Jul 26 '19 01:07 chwonghk01