[Bug] "Image not available" pages are downloaded and detected as valid pages

Open Darthagnon opened this issue 4 years ago • 7 comments

Some books have "Image not available" pages. These are detected as valid pages and are downloaded by the script, where it would be preferable for it to detect such pages as "unavailable" and retry on a different proxy, to try and get content, rather than blanks.

Example:

  • Attempt to download Google Book ID gPIpQg0lRbMC

  • The script will download a few hundred pages, most of which will be:

[Screenshot: "Image not available" page]

Darthagnon avatar Sep 22 '20 15:09 Darthagnon

GoBooDo uses Tesseract for this purpose. Can you please check whether it's correctly configured and working?

vaibhavk97 avatar Sep 22 '20 17:09 vaibhavk97

I downloaded Tesseract v5 alpha from here, and it's at the path specified in settings.json. I run it and it doesn't complain. I'm wondering if maybe they're left over from earlier unsuccessful attempts before I realised I needed tesseract, not just pytesseract.

EDIT: I tested by wiping and trying again. Far fewer Image Not Found pages, but a few have still appeared.

EDIT2: Once it got all the images it could via my regular IP, I connected via Wireguard to some random config to get a new IP address, and it started pulling loads of Image Not Found pages.

Darthagnon avatar Sep 22 '20 20:09 Darthagnon

Having to install Tesseract seems overkill for this purpose. Is there a way to get the bytecode of the standard "image not available" and compare it to the downloaded figure? It seems to be the approach taken in this script.
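A minimal sketch of that byte-comparison idea, assuming the server returns a byte-identical placeholder every time (if it is re-encoded or served at different sizes, an exact hash match would miss it). The names `sha256_hex` and `is_placeholder` are hypothetical and not part of GoBooDo:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def is_placeholder(image_bytes: bytes, known_digests: frozenset) -> bool:
    """True if the downloaded bytes exactly match a known placeholder image.

    known_digests is a set of hex digests computed once from saved copies
    of the "image not available" picture (an assumption: you have to
    collect those copies yourself).
    """
    return sha256_hex(image_bytes) in known_digests


# Stand-in bytes for illustration only; in practice these would be read
# from a placeholder image saved from an earlier download.
placeholder = b"fake-placeholder-bytes"
digests = frozenset({sha256_hex(placeholder)})

print(is_placeholder(placeholder, digests))
print(is_placeholder(b"some real page bytes", digests))
```

This avoids OCR entirely, at the cost of only catching placeholders whose bytes have been seen before.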

PuffingColly avatar Sep 29 '20 13:09 PuffingColly

I am also having this issue.

I think Tesseract seems to deliver a spurious character on the end of the relevant 'image not available'.

I adjusted the code in storeImages.pageEmpty to search for 'image not available' rather than test for equality. That seems to help, but I'm still only early testing.
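Roughly, the change amounts to a substring test on the OCR text instead of an equality test. The sketch below is a standalone illustration, not the actual GoBooDo code; `looks_like_placeholder` is a hypothetical helper, and the whitespace and case normalisation is an extra assumption on top of what I described:

```python
def looks_like_placeholder(ocr_text: str, phrase: str = "image not available") -> bool:
    """Return True if the OCR output contains the placeholder phrase.

    Lower-casing and collapsing whitespace means trailing newlines,
    form feeds or other stray characters around the phrase do not
    defeat the match the way a strict equality test would.
    """
    normalized = " ".join(ocr_text.lower().split())
    return phrase in normalized


# Trailing control characters no longer matter:
print(looks_like_placeholder("image not available\n\x0c"))
# Ordinary page text is not flagged:
print(looks_like_placeholder("Chapter 1\nIt was a dark and stormy night."))
```

A page that genuinely contains the phrase in its body text would be a false positive, so this is a trade-off rather than a complete fix.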

simon-20 avatar Dec 11 '20 17:12 simon-20

Any news regarding this issue? I'm trying to download book 63U8axvG8V0C and I'm getting a lot of pages with "image not available".

Jorzef avatar Feb 09 '21 15:02 Jorzef

same issue :/

liukliukliuk avatar Feb 22 '21 16:02 liukliukliuk

Change the pageEmpty function in storeImages.py as follows:

    def pageEmpty(self, image):
        # Downscale and binarise the page so Tesseract runs quickly.
        im = Image.open(BytesIO(image))
        width, height = im.size
        im = im.resize((int(width / 5), int(height / 5)))
        gray = im.convert('L')
        bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
        try:
            text = pytesseract.image_to_string(bw)
        except Exception:
            # Tesseract was not found on PATH: retry with the configured path.
            pytesseract.pytesseract.tesseract_cmd = self.tesserPath
            text = pytesseract.image_to_string(bw)
        # Substring match tolerates stray characters around the OCR text.
        return "image" in text

augustaklug avatar Mar 03 '21 15:03 augustaklug