
Remote HTTP data is truncated

TomAugspurger opened this issue 6 years ago • 13 comments

It seems like data fetched over HTTP is being truncated. Is that intentional?

>>> import intake
>>> import pandas as pd
>>> x = intake.open_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv").read()
>>> y = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv")
>>> len(x), len(y)
(519, 1458)

If I download the data and read it with intake, the full dataset is read:

!wget https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
len(intake.open_csv("airports.csv").read())  # 1458

TomAugspurger avatar Jun 02 '19 20:06 TomAugspurger

The trouble is with dask.dataframe, not intake. I suspect that if this were using fsspec's version of HTTPFileSystem, rather than the old one in dask, it would work as required.
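
To check fsspec's HTTPFileSystem on its own, something like this (untested sketch; the URL is the one from the report) should do it:

from fsspec.implementations.http import HTTPFileSystem
import pandas as pd

url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"
fs = HTTPFileSystem()
with fs.open(url) as f:  # fsspec handles the HTTP transfer itself
    df = pd.read_csv(f)
print(len(df))  # 1458 would mean fsspec alone does not truncate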

martindurant avatar Jun 03 '19 13:06 martindurant

I suspect that if this were using fsspec's version of HTTPFileSystem, rather than the old one in dask, it would work as required.

Thanks. Do you have a branch where dask uses fsspec that I could test this out with, or am I misremembering?

TomAugspurger avatar Jun 03 '19 13:06 TomAugspurger

You would do

import dask.bytes.core
from fsspec.implementations.http import HTTPFileSystem

dask.bytes.core._filesystems['https'] = HTTPFileSystem

but it does not seem to make any difference, so I don't know what's going on.
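
That is, re-running the reproduction from the top of the thread after applying the patch (sketch of the check):

import intake
url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"
len(intake.open_csv(url).read())  # still 519 rows, i.e. still truncated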

martindurant avatar Jun 03 '19 14:06 martindurant

OK. I'll take a look to see if I can figure out where things go wrong.

TomAugspurger avatar Jun 03 '19 14:06 TomAugspurger

Hmm, the Content-Length we're getting from requests looks fishy:

(Pdb) pp url
'https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv'
(Pdb) pp kwargs
{}
(Pdb) pp r.headers['Content-Length']
'38160'

According to wget and GitHub's HTML UI, it should be 105 KB:

wget --spider https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
Spider mode enabled. Check if remote file exists.
--2019-06-03 09:30:33--  https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
wget: /Users/taugspurger/.netrc:4: unknown token "method"
wget: /Users/taugspurger/.netrc:4: unknown token "interactive"
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 107218 (105K) [text/plain]

I don't immediately see anything wrong with file_size in dask/bytes/http.py...

As an aside, we seem to make three calls to file_size. I haven't looked at why that's done multiple times.

TomAugspurger avatar Jun 03 '19 14:06 TomAugspurger

I bet the server is not respecting the "identity" encoding requested in the headers of the HEAD request, and is reporting the compressed size.
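
A sketch to test that hypothesis: compare the Content-Length a HEAD request reports with and without explicitly asking for the identity (uncompressed) encoding; the expected values are the numbers from above.

import requests

url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"
# requests defaults to compressed encodings; force identity to compare
ident = requests.head(url, headers={"Accept-Encoding": "identity"})
gz = requests.head(url, headers={"Accept-Encoding": "gzip"})
print(ident.headers.get("Content-Length"))  # 107218 if the server honors identity
print(gz.headers.get("Content-Length"))     # 38160, the compressed size seen above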

martindurant avatar Jun 03 '19 14:06 martindurant

Something like that seems likely. The ratio isn't a round number like 8:

In [43]: len(r.content) / int(r.headers['Content-Length'])
Out[43]: 2.8096960167714884

TomAugspurger avatar Jun 03 '19 14:06 TomAugspurger

In [9]: r = requests.get('https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv')
In [11]: r.headers['Content-Length']
Out[11]: '38160'
In [13]: len(r.text)
Out[13]: 107218

martindurant avatar Jun 03 '19 14:06 martindurant

Right; so for dask it is always necessary to know the file size, unless you specify no chunking. The no-chunking option, blocksize=None, together with size_policy='none', should work, but that also fails in read_bytes (fsspec has a slightly different version of that, too).
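
For reference, the no-chunking route would look roughly like this (sketch; per the above it currently still fails inside read_bytes):

import dask.dataframe as dd

url = "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"
# blocksize=None reads each file as a single partition, so no size probe is needed
df = dd.read_csv(url, blocksize=None)
print(len(df))  # should be 1458 once the read_bytes failure is fixed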

martindurant avatar Jun 03 '19 14:06 martindurant

Do you have a recommendation for where to move this issue? Is there anything Dask / fsspec can reasonably do here? The failure (silently not returning rows) is unfortunate.

TomAugspurger avatar Jun 03 '19 15:06 TomAugspurger

The failure (silently not returning rows) is unfortunate.

Agreed on that.

So fsspec is not yet ready to be inserted into dask, although I have started to work on it again. Compatibility with s3/gcs, and their releases, is a necessary precondition from dask's point of view. However, it is the place where an issue like this ought to go.

A separate issue can be in Intake, to allow direct pandas reads upon request; or maybe automatically when the blocksize is None. That could be dask-y (dask.delayed(pd.read_csv)(f) for f in files), or not use dask at all for a single file, as sketched below.
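
Spelled out, the dask-y version would be roughly (sketch):

import dask
import dask.dataframe as dd
import pandas as pd

files = ["https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv"]
# pandas performs each HTTP fetch itself, so no up-front file size is required
parts = [dask.delayed(pd.read_csv)(f) for f in files]
df = dd.from_delayed(parts)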

martindurant avatar Jun 03 '19 15:06 martindurant

A separate issue can be in Intake, to allow direct pandas reads upon request;

I was a bit surprised that a non-glob URL still went through dask, but maybe that's not very meaningful feedback.

TomAugspurger avatar Jun 03 '19 20:06 TomAugspurger

Given #381, we should be in a better place to fix this issue.

martindurant avatar Jul 11 '19 19:07 martindurant