filesystem_spec
HTTPFileSystem has a race condition on data size between the open and read calls if the content changes on the server between the two calls
The following script reproduces the issue.
The script spins up an HTTP server that makes the JSON content one character longer every second.
The client then hits the URL with a one-second sleep between the open and read calls. JSON parsing immediately fails because the terminating } gets chopped off.
For good measure, the same test URL is also hit with a requests.get call, which does not show any issue.
```python
import json
from time import sleep
import datetime as dt
import requests
from threading import Thread
from urllib.parse import urlparse

import fsspec
from fsspec.implementations.dirfs import DirFileSystem

dummy_base_url = 'http://localhost:8080/'
dummy_uri = 'foo/bar'
dummy_url = dummy_base_url + dummy_uri
protocol = urlparse(dummy_url).scheme
dummy_fs = fsspec.filesystem(protocol)
dummy_dirfs = DirFileSystem(path=dummy_base_url, fs=dummy_fs)


def start_dummy_http_server():
    """Spin up a dummy HTTP server to test the observed issue.

    The response to a GET request is JSON, and its size changes
    every second.
    """
    from http.server import HTTPServer, BaseHTTPRequestHandler

    class Serv(BaseHTTPRequestHandler):
        def do_GET(self):
            t = dt.datetime.now()
            s = t.second
            t_str = t.strftime('%Y-%m-%d %H:%M:%S')
            response = {'time': t_str, 'extra': 'x' * s}  # 'x' repeated s times
            response = json.dumps(response)
            self.send_response(200)
            self.send_header('Content-Length', str(len(response)))
            self.end_headers()
            self.wfile.write(bytes(response, 'utf-8'))

    httpd = HTTPServer(('localhost', 8080), Serv)
    httpd.serve_forever()


print("starting http server")
thread = Thread(target=start_dummy_http_server, daemon=True)
thread.start()
print("http server started")

count = 0
while count < 500:
    count += 1
    t = dt.datetime.now()
    old_r = requests.get(dummy_url).content  # this works fine
    with dummy_dirfs.open(dummy_uri) as f:
        open_size = f.size
        sleep(1)
        # the message on the server has changed between open and read;
        # the server returns the new message, but the client truncates it
        # to the size recorded at open time
        data = f.read()
    new_r = requests.get(dummy_url).content  # this works fine
    old_requests_size = len(old_r)
    new_requests_size = len(new_r)
    read_size = len(data)
    print(f"#{count} - t = {t}")
    print(f"open size = {open_size}")
    print(f"read size = {read_size}")
    print(f"orig requests size = {old_requests_size}")
    print(f"new requests size = {new_requests_size}")
    print(data)
    j = json.loads(data)  # this fails because the truncated JSON is not valid
    num_keys = len(j.keys())
    print("success")
    sleep(1)

thread.join()
print("script ended")
```
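The failure mode can be distilled without a server. The following stdlib-only sketch (no fsspec involved; the payloads are made up to mirror the dummy server above) shows what happens when a reader clamps the body to a size captured before the content grew:

```python
import json

# Size recorded at "open" time, from the shorter (old) version of the file.
old_body = json.dumps({"time": "2024-01-01 00:00:00", "extra": "x" * 3}).encode()
stale_size = len(old_body)

# Body actually served at "read" time: one character longer.
new_body = json.dumps({"time": "2024-01-01 00:00:01", "extra": "x" * 4}).encode()

# The reader clamps the new body to the stale size, chopping the closing brace.
data = new_body[:stale_size]
try:
    json.loads(data)
    ok = True
except json.JSONDecodeError:
    ok = False
print(ok)  # -> False: the truncated JSON is invalid
```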
I am not convinced that this should be an issue. When the server changes state during our operations, ideally we should raise an exception rather than silently change our state. The size of a file is needed to do any random access, so just reporting None (as in the linked PR) would limit us to streaming the response (HTTPStreamFile rather than HTTPFile). That might be OK for JSON, which must be parsed as a stream anyway, but is no good for many other formats.
In general, I agree that getting the size could be deferred at least until the first read, and that the first read could also return some data. I think all of the HTTP-based backends would support that, but it would take some reworking.
> When the server changes state during our operations, ideally we should raise an exception rather than silently change our state
I can see this would be consistent with the concept of a file on disk changing during a read, but one cannot help noticing that in the case of an HTTP "file", a requests.get would be able to download the full consistent state of the object.
> I agree that getting the size could be deferred at least until the first read
Yes, that would at least remove one round-trip to the server and reduce the probability of the race condition hitting the client.
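To illustrate the idea (this is not fsspec's actual implementation), a single GET can both report the size, via Content-Length, and carry the data, so there is no window between a size probe and the read. A stdlib-only sketch against a throwaway local server:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"msg": "hello"}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass


server = HTTPServer(("localhost", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://localhost:{server.server_port}/"

# One GET both reports the size and delivers the body: the size and the
# data come from the same response, so they cannot disagree.
resp = urllib.request.urlopen(url)
size = int(resp.headers["Content-Length"])
data = resp.read()
server.shutdown()
print(size, len(data))  # the two always match
```

Note that Content-Length is not guaranteed to be present (e.g. for chunked transfer encoding), so a real implementation would still need a streaming fallback.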
If you are interested, I would love to see an implementation. Otherwise, keep pinging me until I do something about it.
Any workflow like

```python
f = open()
f.seek()  # not with whence=2
f.read()  # read to end
```

should not need two requests. The server ought to tell us the total byte size at that point too, but I'm not sure this is guaranteed.
```
'Content-Range': 'bytes 3-30/31'
                             ^ total size
```
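For reference, the total size sits after the slash in the Content-Range value, and RFC 9110 allows a server to send `*` when the total is unknown. A minimal parsing sketch (not fsspec code; the function name is made up):

```python
def total_size_from_content_range(value):
    """Extract the total byte size from a Content-Range header value,
    e.g. 'bytes 3-30/31' -> 31. Returns None when the total is '*',
    meaning the server does not know the complete length.
    """
    unit, _, range_spec = value.partition(" ")
    if unit != "bytes":
        raise ValueError(f"unexpected range unit: {unit!r}")
    _, _, total = range_spec.partition("/")
    return None if total == "*" else int(total)


print(total_size_from_content_range("bytes 3-30/31"))  # -> 31
print(total_size_from_content_range("bytes 3-30/*"))   # -> None
```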
Unless you feel like trying to implement my suggestion, do keep pinging me until I have time to do something about it myself.
One way to avoid this race condition is to call `fs.open(path, block_size=0)`. This returns a non-seekable HTTPStreamFile, which will probably fit many use-cases. It fits ours, so we will go with that.
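In fsspec terms the workaround looks like the following sketch (assuming fsspec with its HTTP backend installed; the throwaway local server stands in for the dummy server from the script above). With `block_size=0` the body arrives in a single streaming GET with no up-front size probe, so a stale size cannot truncate the read:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import fsspec  # assumes fsspec and its HTTP backend (aiohttp) are installed


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"msg": "hello"}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass


server = HTTPServer(("localhost", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://localhost:{server.server_port}/data.json"

# block_size=0 yields a non-seekable HTTPStreamFile instead of HTTPFile:
# the whole body is streamed from one GET, avoiding the open/read race.
with fsspec.open(url, mode="rb", block_size=0) as f:
    data = f.read()

server.shutdown()
print(json.loads(data))
```

The trade-off is losing random access: seeking (other than forward reads) is not supported on the streaming file.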