smart_open
Random seeking from S3 is very slow
Problem description
- What are you trying to achieve?
- I am trying to merge PDFs hosted on s3 using pikepdf
- What is the expected result?
- s3 files are read and result is saved as combined PDF
- What are you seeing instead?
- The code is stuck, I have to kill the process.
Steps/code to reproduce the problem
Add two small PDFs (200-300 KB) to s3 bucket. Reference them in the below script and run it. See that the script gets stuck. Now reference the same files from the local file system. See that it works fine.
```python
from contextlib import ExitStack

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    with ExitStack() as stack:
        pdf_fps = [stack.enter_context(open(path, "rb")) for path in input_pdfs]
        for fp in pdf_fps:
            src = Pdf.open(fp)
            version = max(version, src.pdf_version)
            out_pdf.pages.extend(src.pages)
    out_pdf.remove_unreferenced_resources()
    out_pdf.save(fout, min_version=version)
```
Versions
```pycon
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.3.18-lp152.75-default-x86_64-with-glibc2.3.4
>>> print("Python", sys.version)
Python 3.6.12 (default, Dec 02 2020, 09:44:23) [GCC]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0
```
Anyhow, the platform details seem irrelevant: I can reproduce this on my box as well as on AWS Lambda with Python 3.8.
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
What is the stack trace when you kill the process? What part is it getting stuck on?
@mpenkov
The thing is that if I wait long enough, keyboard interrupt stops working, so I have to kill the process externally and never get a stack trace. It seems to be happening in `out_pdf.pages.extend(src.pages)`.
I tested with `azure://` and `sftp://` and the issue persists. It works fine, however, if I mount the remotes with FUSE 🤔
Yeah, that is weird. Are you able to step through with a debugger to see where the problem is?
I cannot step into it; pikepdf is a wrapper around a C++ library, and that's where it gets stuck.
Not sure there's much we (smart_open) can do here at the moment. There could be a bug in smart_open or in one of its dependencies (the ones that read GCS, S3, Azure, SFTP, etc.), but it could also be in pikepdf or its wrapper. It's a pretty deep rabbit hole.
smart_open dumps the read buffer after every seek: https://github.com/RaRe-Technologies/smart_open/blob/f8e60da5a53e1e7ead9d8ca4d3f09cbea04fc337/smart_open/s3.py#L668
pikepdf (actually, its underlying C++ library, QPDF) seeks very often. When combined with smart_open, the entire file ends up being downloaded hundreds of times. I got a very simple test to pass with smart_open, but I expect the test above would take several minutes and run up a decent-sized AWS bill.
IMHO this would need to be resolved in smart_open. Many programs assume seeking is usually fast, especially small seeks to positions already in the active read buffer; that is why performance was fine for FUSE-mounted remotes. If I added a workaround in pikepdf, many other applications would still be affected. I'm sure you'd see similar problems if someone used smart_open with any other library that relies heavily on seeking, such as SQLite.
The easiest win would be to retain the read buffer when a seek lands within it. Ideally, you'd maintain a proper read cache.
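To illustrate the idea (a minimal sketch of my own, not smart_open's actual classes or API), a reader can keep its buffer across seeks and only issue a remote range request when a read falls outside it. Here `fetch(start, size)` is a hypothetical stand-in for a remote range GET:

```python
class BufferedRemoteReader:
    """Sketch: retain the read buffer when a seek lands inside it.

    `fetch(start, size)` is a hypothetical hook standing in for a remote
    range request (e.g. an S3 GET with a Range header).
    """

    def __init__(self, fetch, chunk=256 * 1024):
        self._fetch = fetch
        self._chunk = chunk
        self._buf = b""
        self._buf_start = 0  # file offset of self._buf[0]
        self._pos = 0

    def seek(self, pos):
        # Key point: do NOT discard the buffer, just move the cursor.
        # If the next read falls inside the buffered range, it is served
        # from memory with no remote round trip.
        self._pos = pos
        return pos

    def read(self, n):
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= self._pos < buf_end):
            # Cache miss: one remote fetch refills the buffer.
            self._buf = self._fetch(self._pos, self._chunk)
            self._buf_start = self._pos
        off = self._pos - self._buf_start
        data = self._buf[off:off + n]
        self._pos += len(data)
        return data
```

With a counting `fetch`, a read, a small backward seek, and another read trigger only a single remote request, whereas a reader that drops its buffer on every seek would issue two.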
Thank you for pointing out the problem. Are you interested in making a PR?
I'm afraid not. Someone who knows this project's codebase (i.e. not me) should check the other storage backends for the same issue and decide how to solve it project-wide.
I think that my PR #748 should fix some of this when the `seek()` calls are to the current position in the file. There is still outstanding work to do when the `seek()` is to a position contained within the read buffer.