'Resolved to no files' error using dd.read_csv and a globstring
Hi!
dask 2.8.0, gcsfs 0.4.0. I get the following error when trying to use a glob string in a file path:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-14-559125deb7ec> in <module>
5 ) and clusterid !=32003
6 ''',
----> 7 compute=True)
<ipython-input-10-3ef47a2c7f4f> in df_from_bq(query, table, compute, output, dtype)
60 pass
61
---> 62 df = dd.read_csv(gs_path+'{0}'.format(destination_file), storage_options={'token': key}, dtype=dtype, low_memory=False)
63
64 if compute==True and output==True:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
576 storage_options=storage_options,
577 include_path_column=include_path_column,
--> 578 **kwargs
579 )
580
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
403 compression=compression,
404 include_path=include_path_column,
--> 405 **(storage_options or {})
406 )
407
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\bytes\core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
94
95 if len(paths) == 0:
---> 96 raise IOError("%s resolved to no files" % urlpath)
97
98 if blocksize is not None:
OSError: gs://BQ-Extracts/result_20191118_111323_*.csv resolved to no files
Everything works fine when I specify a file path without a globstring, like gs://BQ-Extracts/result_20191118_111323_00000.csv, but it fails with the above error when I use a globstring. I need to fix this issue, as the code is used inside a larger function.
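Roughly, the two calls are (the token path here is illustrative; in my function it is loaded beforehand):

import dask.dataframe as dd

key = "service-account.json"  # illustrative credentials token

# Works: an explicit object name
df = dd.read_csv(
    "gs://BQ-Extracts/result_20191118_111323_00000.csv",
    storage_options={"token": key},
)

# Fails with "resolved to no files" under gcsfs 0.4.0
df = dd.read_csv(
    "gs://BQ-Extracts/result_20191118_111323_*.csv",
    storage_options={"token": key},
)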
The issue appeared after upgrading gcsfs to the latest version. Everything works with the globstring if I downgrade gcsfs to 0.3.0.
UPD: I just ran the same query a few more times with gcsfs 0.3.0 and received the 'resolved to no files' error again. The newer version of dask, or fsspec 0.6.0, probably causes the issue.
Can you reproduce without dask?
Can you bisect to find the commit that broke it?
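For example, something like this (no dask involved; bucket/pattern taken from your report, token illustrative) would show whether glob resolution itself is the problem:

import gcsfs

fs = gcsfs.GCSFileSystem(token="service-account.json")  # illustrative token

# On a broken install this should print an empty list for the glob pattern
print(fs.glob("gs://BQ-Extracts/result_20191118_111323_*.csv"))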
I didn't understand your question regarding the commit.
I didn't try to reproduce it without dask, but I've just figured out that fsspec 0.6.0 and gcsfs 0.4.0 cause the issue.
Everything works fine with dask 2.8.0, gcsfs 0.3.0 and fsspec 0.5.2.
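For anyone comparing environments, the installed versions can be checked with:

import dask, fsspec, gcsfs

# Failing combination reported here: dask 2.8.0 + fsspec 0.6.0 + gcsfs 0.4.0
print(dask.__version__, fsspec.__version__, gcsfs.__version__)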
CSV resolving seems to be working on the latest master of gcsfs and fsspec:
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv")
In [3]: df = dd.read_csv("gcs://anaconda-public-data/airline/*.csv", storage_options={'token': 'anon'})
In [4]: df.npartitions
Out[4]: 196
This error happened for me when I created the gcsfs file system first and then modified/reloaded/... files.
After a file was modified, using the same gcsfs file system raised a FileNotFoundError.
The error existed in both pandas and Dask.
One solution I found is to set cache_timeout=0 when creating the gcsfs file system.
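A minimal sketch of that workaround (anonymous token just for illustration; the same option can also be passed through storage_options in dask):

import gcsfs
import dask.dataframe as dd

# cache_timeout=0 disables gcsfs's directory-listing cache, so files
# modified or re-created after the filesystem was built are still found
fs = gcsfs.GCSFileSystem(token="anon", cache_timeout=0)

# Equivalent when going through dask: pass the option in storage_options
df = dd.read_csv(
    "gcs://anaconda-public-data/airline/*.csv",
    storage_options={"token": "anon", "cache_timeout": 0},
)

Calling fs.invalidate_cache() after modifying files should have a similar effect if you'd rather keep the listing cache enabled.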