smart_open [http/azure/s3] Avoid unnecessary GET when seeking within current buffer

Hello,

I think the smart_open http module could buffer better. Please look at this code sample:

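import smart_open
import pandas as pd
import http.client as http_client

# Print every outgoing HTTP request to stdout so the GETs can be counted.
http_client.HTTPConnection.debuglevel = 1

# Open the remote workbook as a binary stream and let pandas parse it.
fp = smart_open.open("https://github.com/airbytehq/airbyte/files/9280856/test.xlsx", mode="rb")
df = pd.read_excel(fp)
print(df)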

$ ./test.py | grep airbytehq/airbyte/files/9280856/test.xlsx
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8685-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8665-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=8045-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=0-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6478-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=1724-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=6694-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3832-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=2536-\r\n\r\n'
send: b'GET /airbytehq/airbyte/files/9280856/test.xlsx HTTP/1.1\r\nHost: github.com\r\nUser-Agent: python-requests/2.28.1\r\nAccept-Encoding: identity\r\nAccept: */*\r\nConnection: keep-alive\r\nrange: bytes=3099-\r\n\r\n'

pandas.read_excel reads the file in a random-access way: it makes a lot of seek and read calls. I expected that if the first HTTP request reads the whole file, subsequent read calls would be served from some internal buffer. Instead, I see that the library keeps making HTTP requests under the hood for byte ranges that the first HTTP request already covered.

Can we improve this? Can we skip the additional HTTP requests if we already have all the needed data from the first HTTP request?
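To make the question concrete, here is a minimal sketch of the check being asked for: given the byte window already held in memory, decide which seek targets would still need a new request. This is not how smart_open is implemented; the function name `offsets_needing_request` and the 16 KiB buffer size are hypothetical and chosen only for illustration. The offsets are the `range: bytes=N-` values from the debug log above.

```python
def offsets_needing_request(seek_offsets, buffer_start, buffer_length):
    """Return the seek targets that fall outside the buffered window.

    The window is the half-open range
    [buffer_start, buffer_start + buffer_length).
    """
    buffer_end = buffer_start + buffer_length
    return [off for off in seek_offsets
            if not (buffer_start <= off < buffer_end)]


# Distinct range offsets from the debug log, in order of first appearance.
logged_offsets = [0, 8685, 8665, 8045, 6478, 1724, 6694, 3832, 2536, 3099]

# Hypothetical assumption: the first response was kept in memory in full
# and the buffer holds 16 KiB starting at offset 0.
misses = offsets_needing_request(logged_offsets,
                                 buffer_start=0,
                                 buffer_length=16 * 1024)
print(misses)
```

Under that assumption every logged offset falls inside the buffer, so none of the follow-up GETs would be required.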

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

[x] Described the problem clearly
[x] Provided a minimal reproducible example, including any required data
[x] Provided the version numbers of the relevant software

Aug 09 '22 06:08 grubberr

smart_open's main use case is streaming. If your application does a lot of seeking, then it may be better for you to handle buffering separately (e.g. using tempfile).
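A minimal sketch of that workaround, assuming the whole object fits on local disk: copy the stream once into a temporary file, then do all seeking locally. The helper name `copy_to_tempfile` is hypothetical, and `io.BytesIO` stands in here for the object returned by `smart_open.open(..., mode="rb")` so that the example runs without network access.

```python
import io
import shutil
import tempfile


def copy_to_tempfile(stream):
    """Read a binary stream once, front to back, into a seekable temp file."""
    local = tempfile.TemporaryFile()  # opened in "w+b" mode by default
    shutil.copyfileobj(stream, local)
    local.seek(0)  # rewind so the caller starts reading from the beginning
    return local


# Stand-in for the remote stream; any binary file-like object works.
remote = io.BytesIO(b"0123456789" * 1000)

local = copy_to_tempfile(remote)

# Seeks now hit the local file and trigger no upstream reads.
local.seek(8685)
chunk = local.read(4)
print(chunk)
```

The trade-off is that the whole object is downloaded up front, even if the reader only needs a small part of it.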

Ideally, yes, smart_open would be smart enough to buffer the contents of the stream itself, but how do you determine the ideal size of the buffer? Automatically? Using some sort of parameter? It's a fair bit of work.

Aug 12 '22 07:08 mpenkov

In my opinion, it could be any buffer size with some LRU mechanism. The main idea is to avoid, as much as possible, re-reading data from upstream if it was already read recently.

Yes, I agree: it could be a pretty complex task that complicates the library too much and could introduce new errors.
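For illustration, the LRU idea could be sketched as a read-only wrapper that caches fixed-size blocks of a seekable stream. This is only a sketch and not part of smart_open: the class name `LRUBlockReader`, its parameters, and the `upstream_reads` counter are all hypothetical. It assumes that `raw.read(n)` returns fewer than `n` bytes only at end of stream, and it uses `io.BytesIO` as a stand-in for the remote object.

```python
import io
from collections import OrderedDict


class LRUBlockReader:
    """Read-only wrapper that keeps recently used blocks of `raw` in memory."""

    def __init__(self, raw, block_size=4096, max_blocks=64):
        self.raw = raw
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block index -> bytes, oldest first
        self.pos = 0
        self.upstream_reads = 0  # how many times we had to go upstream

    def _block(self, index):
        if index in self.blocks:
            self.blocks.move_to_end(index)  # mark as most recently used
            return self.blocks[index]
        self.raw.seek(index * self.block_size)
        data = self.raw.read(self.block_size)
        self.upstream_reads += 1
        self.blocks[index] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)  # evict least recently used
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:
            raise ValueError("this sketch supports only SEEK_SET and SEEK_CUR")
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size):
        out = bytearray()
        while size > 0:
            index, start = divmod(self.pos, self.block_size)
            chunk = self._block(index)[start:start + size]
            if not chunk:  # end of stream
                break
            out += chunk
            self.pos += len(chunk)
            size -= len(chunk)
        return bytes(out)


source = io.BytesIO(bytes(range(256)) * 64)  # 16384 bytes of sample data
reader = LRUBlockReader(source, block_size=4096, max_blocks=2)

reader.seek(8685)
first = reader.read(20)
reader.seek(8665)  # falls in the same cached block, so no upstream read
second = reader.read(20)
print(reader.upstream_reads)
```

Both reads land in the same 4096-byte block, so only one upstream read happens. Choosing `block_size` and `max_blocks` is exactly the open question raised above.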

Aug 12 '22 07:08 grubberr
