Remote HTTP data is truncated
It seems like data fetched over HTTP is being truncated. Is that intentional?
>>> x = intake.open_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv").read()
>>> y = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv")
>>> len(x), len(y)
(519, 1458)
If I download the data and read it with intake, the full dataset is read
!wget https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
len(intake.open_csv("airports.csv").read())
The trouble is with dask.dataframe, not intake. I suspect that if this were using fsspec's version of HTTPFileSystem, rather than the old one in dask, it would work as required.
I suspect that if this were using fsspec's version of HTTPFileSystem, rather than the old one in dask, it would work as required.
Thanks. Do you have a branch where dask uses fsspec to test this out, or am I misremembering?
You would do
import dask.bytes.core
from fsspec.implementations.http import HTTPFileSystem

dask.bytes.core._filesystems['https'] = HTTPFileSystem
but it does not seem to make any difference, so I don't know what's going on.
OK. I'll take a look to see if I can figure out where things go wrong.
Hmm, the Content-Length we're getting from requests looks fishy:
(Pdb) pp url
'https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv'
(Pdb) pp kwargs
{}
(Pdb) pp r.headers['Content-Length']
'38160'
According to wget and GitHub's HTML UI, it should be 105K:
wget --spider https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
Spider mode enabled. Check if remote file exists.
--2019-06-03 09:30:33-- https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
wget: /Users/taugspurger/.netrc:4: unknown token "method"
wget: /Users/taugspurger/.netrc:4: unknown token "interactive"
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 107218 (105K) [text/plain]
I don't immediately see anything wrong with file_size in dask/bytes/http.py...
As an aside, we seem to make three calls to file_size. I haven't looked at why that's done multiple times.
I bet the server is not respecting the "identity" keyword in the HEAD request headers, and is giving the compressed size.
Something like that seems likely. The ratio isn't some even number like 8
In [43]: len(r.content) / int(r.headers['Content-Length'])
Out[43]: 2.8096960167714884
In [9]: r = requests.get('https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv')
In [11]: r.headers['Content-Length']
Out[11]: '38160'
In [13]: len(r.text)
Out[13]: 107218
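A quick, untested way to probe that theory (just a sketch, not something from dask or fsspec) is to send a HEAD request that explicitly asks for the identity encoding and compare the reported Content-Length against the 107218 bytes that wget sees:

import requests

url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"

# Explicitly ask for the uncompressed representation; if the server honored
# this, Content-Length should be 107218 rather than the compressed 38160.
head = requests.head(url, headers={"Accept-Encoding": "identity"})
print(head.headers.get("Content-Length"), head.headers.get("Content-Encoding"))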
Right; so for dask it is always necessary to know the file size, unless you specify no chunking. The no-chunking option, blocksize=None, together with size_policy='none', should work, but that also fails in read_bytes (fsspec has a slightly different version of that too).
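As a rough sketch of what that no-chunking path looks like from the user side (untested here, and as noted it still fails in read_bytes at the moment):

import dask.dataframe as dd

url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"

# blocksize=None makes each file a single partition, so dask never needs the
# (possibly wrong) Content-Length to decide where to split the file.
df = dd.read_csv(url, blocksize=None)
print(len(df.compute()))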
Do you have a recommendation for where to move this issue? Is there anything Dask / fsspec can reasonably do here? The failure (silently not returning rows) is unfortunate.
The failure (silently not returning rows) is unfortunate.
Agreed on that.
So fsspec is not yet ready to be inserted into dask, although I have started to work on it again. Compatibility with s3/gcs and their releases is a necessary precondition from dask's point of view. However, it is the place where an issue like this ought to go.
A separate issue can be in Intake, to allow direct pandas reads upon request; or maybe automatically when the blocksize is None. That could be dasky (dask.delayed(pd.read_csv)(f) for f in files) or not use dask at all for a single file.
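Something along these lines for the delayed variant (just a sketch; the single-URL list is a stand-in for whatever the catalog resolves):

import dask
import dask.dataframe as dd
import pandas as pd

urls = [
    "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv",
]

# One delayed pandas read per file; pandas does the HTTP transfer itself,
# so no file-size or chunking logic is involved.
parts = [dask.delayed(pd.read_csv)(u) for u in urls]
df = dd.from_delayed(parts)
print(len(df.compute()))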
A separate issue can be in Intake, to allow direct pandas reads upon request;
I was a bit surprised that a non-glob URL still went through dask, but that's maybe not too meaningful feedback.
Given #381, we should be in a better place to fix this issue.