ipwb icon indicating copy to clipboard operation
ipwb copied to clipboard

0 length response bodies cause IPFS add to fail

Open machawk1 opened this issue 5 years ago • 17 comments

This occurs in WARCs with WARC Response records containing HTTP responses with a 204 No Content status code. Sample provided in samples/warcs/HTTP204.warc.

When pushToIPFS() in the indexer calls pushBytesToIPFS(), which calls IPFS_API.add_bytes(), an ipfsapi.exceptions.ConnectionError exception occurs.

The start to a fix would be to check these values' length upon extraction.

machawk1 avatar Aug 09 '18 00:08 machawk1

Replication:

> import ipfsapi
> IPFS_API = ipfsapi.Client('localhost', 5001)
> IPFS_API.add_bytes('lorem')
u'QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N'
> IPFS_API.add_bytes('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/utils.py", line 150, in wrapper
    res = cmd(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/client.py", line 2087, in add_bytes
    data=body, headers=headers, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/http.py", line 37, in wrapper
    return func(self, *args, **merged)
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/http.py", line 170, in request
    files, headers, data)
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/http.py", line 100, in _request
    files=files, headers=headers, data=data)
  File "/usr/local/lib/python2.7/site-packages/ipfsapi/http.py", line 77, in _do_request
    six.raise_from(exceptions.ConnectionError(error), error)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/six.py", line 737, in raise_from
    raise value
ipfsapi.exceptions.ConnectionError: ConnectionError: [Errno 32] Broken pipe
>>> 

machawk1 avatar Aug 09 '18 00:08 machawk1

This arose in the WARC provided by @james-e-powell as generated by wget, which encountered a 204 No Content when archiving a GitHub status page.

machawk1 avatar Aug 09 '18 00:08 machawk1

Also reported https://github.com/ipfs/py-ipfs-api/issues/137 but was reminded that py-ipfs-api has deprecated Python 2 usage (I think...the README line about the deprecation is ambiguous).

machawk1 avatar Aug 09 '18 00:08 machawk1

Sample WARC to replicate this added to ipwb/samples/warcs/HTTP204.warc in c9166792c52cf64b4a92e5805fd259c3805121fe.

machawk1 avatar Sep 04 '18 19:09 machawk1

I performed some manual regression testing on this with the recent release of go-ipfs v0.4.18.

Using go-ipfs v0.4.17 via ipfs-update and ipfs daemon:

$ python3 --version
Python 3.7.0
$ python3
> import ipfsapi
> IPFS_API = ipfsapi.Client('localhost', 5001)
> IPFS_API.add_bytes(b'lorem')
'QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N'
> IPFS_API.add_bytes(b'')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/ipfsapi/utils.py", line 151, in wrapper
    return res[self.field]
TypeError: list indices must be integers or slices, not str

Using go-ipfs v0.4.18 after killing the daemon, updating via ipfs-update install v0.4.18 then again running the daemon via ipfs daemon:

> import ipfsapi
> IPFS_API = ipfsapi.Client('localhost', 5001)
> IPFS_API.add_bytes(b'lorem')
'QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N'
> IPFS_API.add_bytes(b'')
'QmaRwA91m9Rdfaq9u3FH1fdMVxw1wFPjKL38czkWMxh3KB'

This seems to indicate, with Python 3 at least, that the issue of zero-length byte strings has been resolved in go-ipfs 0.4.18. I have yet to test this in Python 2.7.

/cc @alexander255, related to https://github.com/ipfs/py-ipfs-api/issues/137

EDIT: Also related to #597, using ipfsapi 0.4.3 per pip3 freeze

machawk1 avatar Nov 05 '18 15:11 machawk1

@machawk1 did you also try to cat that hash to see if you are actually getting an empty bytesarray back?

ibnesayeed avatar Nov 05 '18 15:11 ibnesayeed

@ibnesayeed I just did via py-ipfs-api:

>>> IPFS_API.cat('QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N')
b'lorem'
>>> IPFS_API.cat('QmaRwA91m9Rdfaq9u3FH1fdMVxw1wFPjKL38czkWMxh3KB')
[]
>>> IPFS_API.cat(b'QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N')
b'lorem'
>>> IPFS_API.cat(b'QmaRwA91m9Rdfaq9u3FH1fdMVxw1wFPjKL38czkWMxh3KB')
[]

Odd that an empty list is returned for the multihashes of the empty string instead of a string or byte-string (resp).

machawk1 avatar Nov 05 '18 15:11 machawk1

For now, the safe bet would be to handle it like this:

payload = IPFS_API.cat(hash) or b''

ibnesayeed avatar Nov 05 '18 15:11 ibnesayeed

@ibnesayeed I agree and verified that an empty list will defer to the empty byte string.

I think we still have a mix of py2-style strings and explicit byte strings in replay. The .cat() function seems to return a string if a string is passed and a byte-string if a byte-string is passed. This should be normalized with #51 but needs verification to ensure it does not break anything down the line (e.g., other functions expecting a regular string but receiving b''.

EDIT: Disregard the above stricken text, I mis-remembered what was being returned per our in-person discussion.

machawk1 avatar Nov 05 '18 15:11 machawk1

@ibnesayeed The above indexing of 0-length payload now returns a hash using ipfs version 0.4.18:

>>> IPFS_API.add_bytes('lorem')
u'QmeHxdVgh5BVLL7S7AvjrXqum8977XcPcfyHYUkoCG624N'
>>> IPFS_API.add_bytes('')
u'QmaRwA91m9Rdfaq9u3FH1fdMVxw1wFPjKL38czkWMxh3KB'

Should we revert the code within pushToIPFS() in the indexer to remove

if len(payload) == 0:  # py-ipfs-api issue #137
  return

...and just let ipwb via py-ipfs-api submit the empty string to IPFS?

machawk1 avatar Nov 14 '18 15:11 machawk1

Sure, but we need to test the change using IPWB, not just the direct interaction with IPFS. Also, we should add a note in the README about the minimum IPFS version requirement.

ibnesayeed avatar Nov 14 '18 15:11 ibnesayeed

Ok. A sample WARC exhibiting just this ought to be created as well. The ones we have (redirect.warc and redirectRelative.warc) both have entity bodies. IAH-20080430204825-00000-blackbook.warc.gz has some 3XXs with 0-length payloads, which could be tested as well (the records are currently "skipped" as facilitated by the above logic).

  • [ ] Create sample WARC whose sole warc-response record is one with a 3XX redirect with 0-length payload
  • [ ] Make the change in ipwb and validate/evaluate both the generated CDXJ as well as the replay behavior.
  • [ ] Update README to require IPFS v0.4.18

machawk1 avatar Nov 14 '18 15:11 machawk1

It doesn't necessarily have to be a 3XX response code to have an empty payload. Redirects can have payload, but it is perfectly alright to not have one. Status code 204 is another way to have no content (though this response code is often returned for PUT requests, not GET). Even a 200 response can have an empty body and have Content-Length: 0 header.

ibnesayeed avatar Nov 14 '18 15:11 ibnesayeed

@ibnesayeed I know this per the original message of this issue. Should a WARC be created exhibiting a number of these 0-length response scenarios? I was hoping to establish a task list to finally close this ticket now that the issue that caused it has been resolved in the dependencies.

machawk1 avatar Nov 14 '18 15:11 machawk1

Creating a specific WARC for this case is not necessary as long as zero-length responses are present already with any status code in an existing WARC. However, we should note that a non-redirect code is easier to test in browsers.

ibnesayeed avatar Nov 14 '18 16:11 ibnesayeed

zero-length responses are present already with any status code in an existing WARC.

There are examples but they're amidst 100+ other records.

a non-redirect code is easier to test in browsers.

Can you explain this more?

machawk1 avatar Nov 14 '18 16:11 machawk1

Can you explain this more?

Redirects are not terminal responses as web browsers are wired to follow redirects automatically. We try to handle redirects manually in SW, but in that case we don't have much visibility in the response as it becomes opaque. So, in order to test an empty response from the client side (using web browser and not something like cURL) it would be easier to have a non-redirect response.

ibnesayeed avatar Nov 14 '18 16:11 ibnesayeed