pdfquery
cache collision
Scraping two different PDFs yields the exact same results when using the FileCache.
The problem is that set_hash_key() always computes the same key because the file position is already at the end of the file when it is called (md5("") == "d41d8cd98f00b204e9800998ecf8427e"), so pdfquery ends up using the same cached data for both PDFs.
Adding file.seek(0) before computing the md5 seems to solve the issue.
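For anyone who wants to see the collision without pdfquery installed, here is a minimal sketch using only hashlib and io. It assumes the cache key is an md5 of whatever file.read() returns from the current position, as described above. The helper names and the in-memory buffers standing in for opened PDFs are hypothetical and not part of pdfquery.

```python
import hashlib
import io

# md5 of zero bytes: the key every file gets once it has been read to the end.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def md5_from_current_position(f):
    """Hash whatever is left to read (hypothetical stand-in for the buggy key)."""
    return hashlib.md5(f.read()).hexdigest()


def md5_from_start(f):
    """Rewind first, then hash the whole file (the proposed fix)."""
    f.seek(0)
    return hashlib.md5(f.read()).hexdigest()


# Two different "PDFs", held in memory for the sake of the example.
pdf_a = io.BytesIO(b"%PDF-1.4 first document")
pdf_b = io.BytesIO(b"%PDF-1.4 second document")

# Simulate something having already consumed both files.
pdf_a.read()
pdf_b.read()

# Without rewinding, both keys are the md5 of an empty string.
key_a = md5_from_current_position(pdf_a)
key_b = md5_from_current_position(pdf_b)
collide = key_a == key_b == EMPTY_MD5

# With seek(0), each file gets its own key.
fixed_a = md5_from_start(pdf_a)
fixed_b = md5_from_start(pdf_b)

print("collision without seek(0):", collide)
print("distinct keys with seek(0):", fixed_a != fixed_b)
```

If the assumption about how the key is computed holds, this is the same behavior the FileCache shows with real files.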
As a temporary workaround until the issue is fixed, define a custom cache class:
from pdfquery.cache import FileCache as _FileCache

class FileCache(_FileCache):
    def set_hash_key(self, file):
        # Rewind so the md5 covers the whole file instead of an empty read at EOF.
        file.seek(0)
        return super().set_hash_key(file)