pdfquery

cache collision

Open. patxoca opened this issue 5 years ago • 1 comment

Scraping two different PDFs yields exactly the same results when using the FileCache.

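The problem is that set_hash_key() always computes the same key because the file position is already at the end of the file when the hash is computed. Reading from there returns no data, so the key is always the MD5 of the empty string (md5("") == "d41d8cd98f00b204e9800998ecf8427e"), and pdfquery ends up using the same cached data for both PDFs.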

Adding file.seek(0) before computing the md5 seems to solve the issue.
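The effect can be reproduced without pdfquery. The following is a minimal sketch using only hashlib and in-memory files; the helper names (hash_from_current_position, hash_from_start) and the sample byte strings are made up for illustration and are not part of pdfquery, and the sketch assumes the cache key is derived by reading the file from its current position.

```python
import hashlib
import io

EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of zero bytes


def hash_from_current_position(file):
    """Hash whatever lies between the current position and EOF (hypothetical helper)."""
    return hashlib.md5(file.read()).hexdigest()


def hash_from_start(file):
    """Rewind first, so the whole file contributes to the hash (hypothetical helper)."""
    file.seek(0)
    return hashlib.md5(file.read()).hexdigest()


# Two stand-in "PDFs" with different contents.
pdf_a = io.BytesIO(b"%PDF-1.4 first document")
pdf_b = io.BytesIO(b"%PDF-1.4 second document")

# Simulate something having already read both files to the end.
pdf_a.read()
pdf_b.read()

# Both keys collapse to the hash of the empty string, hence the collision.
key_a = hash_from_current_position(pdf_a)
key_b = hash_from_current_position(pdf_b)

# After rewinding, each file gets its own key.
fixed_a = hash_from_start(pdf_a)
fixed_b = hash_from_start(pdf_b)
```

Here key_a and key_b are identical, while fixed_a and fixed_b differ, which matches the behaviour described in this report.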

patxoca avatar Feb 20 '20 10:02 patxoca

Temporary workaround until the issue is fixed: define a custom cache class that rewinds the file before the hash key is computed:

from pdfquery.cache import FileCache as _FileCache

class FileCache(_FileCache):

    def set_hash_key(self, file):
        # Rewind so the key is computed over the whole file,
        # not over the empty remainder after EOF.
        file.seek(0)
        return super().set_hash_key(file)

patxoca avatar Mar 11 '20 13:03 patxoca