GoBooDo
[Bug] "Image not available" pages are downloaded and detected as valid pages
Some books have "Image not available" pages. The script detects these as valid pages and downloads them. It would be preferable to detect such pages as "unavailable" and retry on a different proxy, to try to get content rather than blanks.
Example:
- Attempt to download Google Book ID gPIpQg0lRbMC
- The script will download a few hundred pages, most of which will be "Image not available" placeholders.
GoBooDo uses Tesseract for this purpose. Can you please check that it's correctly configured and working?
I downloaded Tesseract v5 alpha from here, and it's at the path specified in settings.json. I run it and it doesn't complain. I'm wondering if maybe they're left over from earlier unsuccessful attempts before I realised I needed tesseract, not just pytesseract.
EDIT: I tested by wiping and trying again. Far fewer Image Not Found pages, but a few have still appeared.
EDIT2: Once it had got all the images it could via my regular IP, I connected via Wireguard with a random config to get a new IP address, and it started pulling loads of Image Not Found pages.
Having to install Tesseract seems like overkill for this purpose. Is there a way to get the raw bytes of the standard "image not available" image and compare them to the downloaded image? It seems to be the approach taken in this script.
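For what it's worth, a byte-level comparison could be sketched roughly like this. This is only an illustration, not code from GoBooDo: the helper names `sha256_hex` and `is_known_placeholder` are made up, and it assumes the placeholder is served byte-for-byte identical each time, which may not hold if it is re-encoded or served at different sizes.

```python
import hashlib
from typing import Iterable


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def is_known_placeholder(image_bytes: bytes, placeholder_hashes: Iterable[str]) -> bool:
    """True if the downloaded bytes match a previously recorded placeholder.

    placeholder_hashes is a hypothetical collection of digests that you
    build yourself by hashing placeholder images you have already seen.
    """
    return sha256_hex(image_bytes) in set(placeholder_hashes)
```

You would hash one or more placeholder images you have actually downloaded, keep the digests, and check each new page against them before saving it. If the placeholder varies even slightly between requests, an exact hash will miss it and OCR (or a perceptual comparison) would still be needed.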
I am also having this issue.
I think Tesseract appends a spurious character to the end of the recognized 'image not available' text.
I adjusted the code in storeImages.pageEmpty to search for 'image not available' rather than test for equality. That seems to help, but I'm still only in early testing.
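A minimal sketch of that kind of check, separated from the OCR call so it can be tried on plain strings. The function names here are hypothetical and not part of GoBooDo, and the assumption is that the stray output is whitespace-like (for example a trailing newline or form feed) or extra characters around the phrase.

```python
import re

PLACEHOLDER_PHRASE = "image not available"


def normalize_ocr_text(text: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_marks_page_empty(ocr_text: str) -> bool:
    """True if the OCR output contains the placeholder phrase anywhere."""
    return PLACEHOLDER_PHRASE in normalize_ocr_text(ocr_text)
```

An equality test fails as soon as the OCR output carries one extra character, whereas a substring test on normalized text does not. The trade-off is that a real page containing the same phrase would also be flagged.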
Any news regarding this issue? I'm trying to download book 63U8axvG8V0C and I'm getting a lot of pages with "image not available".
same issue :/
Change the pageEmpty function in storeImages.py as follows:
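def pageEmpty(self, image):
    # Downscale to a fifth of the original size before running OCR.
    im = Image.open(BytesIO(image))
    width, height = im.size
    im = im.resize((int(width / 5), int(height / 5)))
    # Convert to grayscale, then threshold to pure black and white.
    gray = im.convert('L')
    bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
    try:
        text = pytesseract.image_to_string(bw)
    except Exception:
        # Retry with the explicitly configured Tesseract path.
        pytesseract.pytesseract.tesseract_cmd = self.tesserPath
        text = pytesseract.image_to_string(bw)
    # Substring match tolerates stray characters in the OCR output.
    return text.find("image") != -1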