smart_open
Random seeking from S3 is very slow
Problem description
- What are you trying to achieve?
- I am trying to merge PDFs hosted on s3 using pikepdf
- What is the expected result?
- s3 files are read and result is saved as combined PDF
- What are you seeing instead?
- The code is stuck, I have to kill the process.
Steps/code to reproduce the problem
Add two small PDFs (200-300 KB) to s3 bucket. Reference them in the below script and run it. See that the script gets stuck. Now reference the same files from the local file system. See that it works fine.
```python
from contextlib import ExitStack

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    with ExitStack() as stack:
        pdf_fps = [stack.enter_context(open(path, "rb")) for path in input_pdfs]
        for fp in pdf_fps:
            src = Pdf.open(fp)
            version = max(version, src.pdf_version)
            out_pdf.pages.extend(src.pages)
    out_pdf.remove_unreferenced_resources()
    out_pdf.save(fout, min_version=version)
```
Versions
```pycon
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.3.18-lp152.75-default-x86_64-with-glibc2.3.4
>>> print("Python", sys.version)
Python 3.6.12 (default, Dec 02 2020, 09:44:23) [GCC]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0
```
Anyhow, the platform details seem irrelevant: I can reproduce this on my box as well as on AWS Lambda with Python 3.8.
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
What is the stack trace when you kill the process? What part is it getting stuck on?
@mpenkov
The thing is that if I wait long enough, keyboard interrupt stops working, so I have to kill the process externally and never get a stack trace. It seems to be happening in `out_pdf.pages.extend(src.pages)`.
I tested with `azure://` and `sftp://` and the issue persists. It works fine, however, if I mount the remotes with FUSE 🤔
Yeah, that is weird. Are you able to step through with a debugger to see where the problem is?
I cannot step into it; pikepdf is a wrapper around a C++ library, and that's where it gets stuck.
Not sure there's much we (smart_open) can do here at the moment. There could be a bug in smart_open or in one of its dependencies (the ones that read GCS, S3, Azure, SFTP, etc.), but it could also be in pikepdf or its wrapper. It's a pretty deep rabbit hole.
smart_open dumps the read buffer after every seek: https://github.com/RaRe-Technologies/smart_open/blob/f8e60da5a53e1e7ead9d8ca4d3f09cbea04fc337/smart_open/s3.py#L668
pikepdf (actually, its underlying C++ library, QPDF) seeks very often. When combined with smart_open, the entire file ends up being downloaded hundreds of times. I got a very simple test to pass with smart_open, but I expect the test above would take several minutes and run up a decent-sized AWS bill.
IMHO this would need to be resolved in smart_open. Many programs assume seeking is usually fast, especially small seeks to positions already in the active read buffer; that is why performance was fine for FUSE-mounted remotes. If I added a workaround in pikepdf, many other applications would still be affected. I'm sure you'd see similar problems if someone used smart_open with any other library that relies heavily on seeking, such as SQLite.
The easiest win would be to retain the read buffer when a seek lands within it. Ideally, you'd maintain a proper read cache.
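To illustrate the idea (a minimal sketch of my own, not smart_open's actual classes or API), a reader can keep its buffer across seeks and only issue a remote range request when a read falls outside it. Here `fetch(start, size)` is a hypothetical stand-in for a remote range GET:

```python
class BufferedRemoteReader:
    """Sketch: retain the read buffer when a seek lands inside it.

    `fetch(start, size)` is a hypothetical hook standing in for a remote
    range request (e.g. an S3 GET with a Range header).
    """

    def __init__(self, fetch, chunk=256 * 1024):
        self._fetch = fetch
        self._chunk = chunk
        self._buf = b""
        self._buf_start = 0  # file offset of self._buf[0]
        self._pos = 0

    def seek(self, pos):
        # Key point: do NOT discard the buffer, just move the cursor.
        # If the next read falls inside the buffered range, it is served
        # from memory with no remote round trip.
        self._pos = pos
        return pos

    def read(self, n):
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= self._pos < buf_end):
            # Cache miss: one remote fetch refills the buffer.
            self._buf = self._fetch(self._pos, self._chunk)
            self._buf_start = self._pos
        off = self._pos - self._buf_start
        data = self._buf[off:off + n]
        self._pos += len(data)
        return data
```

With a counting `fetch`, a read, a small backward seek, and another read trigger only a single remote request, whereas a reader that drops its buffer on every seek would issue two.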
Thank you for pointing out the problem. Are you interested in making a PR?
I'm afraid not. Someone who knows this project's codebase (i.e. not me) should check the other storage backends for the same issue and decide how to solve it project-wide.
I think that my PR #748 should fix some of this when the `seek()` calls are to the current position in the file. There is still outstanding work to do when the `seek()` is to a position contained within the read buffer.