"image not available" was not filtered out

Open augustaklug opened this issue 3 years ago • 6 comments

The "image not available" images were not being filtered out, so I looked at the code and changed it as follows (on storeImages.py):

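    def pageEmpty(self, image):
        # Downscale and binarize the page so OCR runs on a small, high-contrast image.
        im = Image.open(BytesIO(image))
        width, height = im.size
        im = im.resize((int(width / 5), int(height / 5)))
        gray = im.convert('L')
        bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
        try:
            text = pytesseract.image_to_string(bw)
        except Exception:
            # Most likely tesseract was not found on PATH; retry with the configured binary.
            pytesseract.pytesseract.tesseract_cmd = self.tesserPath
            text = pytesseract.image_to_string(bw)
        # The placeholder page reads "image not available"; treat it as empty.
        if text.find("image") == -1:
            return False
        else:
            return True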

Seems to be working as expected now.

augustaklug avatar Mar 03 '21 15:03 augustaklug

unindent does not match any outer indentation level

liukliukliuk avatar Mar 03 '21 15:03 liukliukliuk

Thanks, it works. But for a book of about 300 pages, up to about 160 pages can no longer be fetched: neither the page links nor the pages themselves can be obtained, and changing the IP and request headers does not help. So now I wonder if there is a problem with the page-link dictionary. When I delete the page-link dictionary (keeping only pagesFetched.pkl), it shows "Please delete the corresponding folder and start again or the book is not available for preview. Received invalid response". I think this part could be improved by re-fetching links that fail too often.

MODOUser avatar Mar 17 '21 05:03 MODOUser

First, thank you for the quick fix; it works great! Second, I think someone needs to close this issue. I don't think the devs are paying attention to this anymore. Too bad, it's a great program with lots of potential!

Yolakalemowa avatar Apr 15 '21 21:04 Yolakalemowa

~~In my environment, the following code also works (I found this issue after I wrote this code):~~

    def pageEmpty(self, image):
        im = Image.open(BytesIO(image))
        width, height = im.size
        im = im.resize((int(width / 5), int(height / 5)))
        gray = im.convert('L')
        bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
        try:
            text = pytesseract.image_to_string(bw)
        except Exception:
            # Most likely tesseract was not found on PATH; retry with the configured binary.
            pytesseract.pytesseract.tesseract_cmd = self.tesserPath
            text = pytesseract.image_to_string(bw)
        return text == 'image\nnot\navailable\n\x0c'

~~Only the last line is changed. This check seems stricter than if text.find("image") == -1:, and I don't know whether it works in other environments. Please report whether it works, along with your environment.~~ Run the following script and check:

    #!/usr/bin/env python3

    import sys

    import pytesseract
    from PIL import Image

    # Or hard-code the path to an image that shows "image not available".
    path = sys.argv[1]

    # Optional second argument: path to the tesseract executable, if it is not on PATH.
    if len(sys.argv) > 2:
        pytesseract.pytesseract.tesseract_cmd = sys.argv[2]

    # Same preprocessing as pageEmpty: downscale, grayscale, binarize.
    im = Image.open(path)
    width, height = im.size
    im = im.resize((int(width / 5), int(height / 5)))
    gray = im.convert('L')
    bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
    text = pytesseract.image_to_string(bw)

    # Show the raw OCR output with escapes, then the result of the strict comparison.
    print(text.encode('unicode_escape'))
    print(text == 'image\nnot\navailable\n\x0c')

By the way, should someone send a pull request to update the repository?

minamotorin avatar Jul 29 '21 14:07 minamotorin

text == 'image\nnot\navailable\n\x0c' sometimes doesn't work (text: image.\\nnot \\u2014\\navailable\\n\\x0c). text.find("image") == -1 is good!
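To illustrate the difference, here is a minimal sketch that runs three checks on the two OCR outputs reported in this thread. The helper names `words` and `looks_like_placeholder` are hypothetical and not part of GoBooDo; the word-based check is only one possible middle ground between the strict comparison and the substring search.

```python
import re

# The two OCR outputs reported in this thread for the placeholder image.
CLEAN = 'image\nnot\navailable\n\x0c'
NOISY = 'image.\nnot \u2014\navailable\n\x0c'


def words(text):
    """Return the lowercase alphabetic words in text, dropping punctuation and whitespace."""
    return re.findall(r'[a-z]+', text.lower())


def looks_like_placeholder(text):
    """Hypothetical helper: True if the OCR words are exactly 'image not available'."""
    return words(text) == ['image', 'not', 'available']


for sample in (CLEAN, NOISY):
    strict = sample == CLEAN              # exact comparison
    substring = sample.find('image') != -1  # substring search
    print(strict, substring, looks_like_placeholder(sample))
```

The strict comparison rejects the noisy sample, while the substring search and the word-based check accept both. The substring search would also match any real page whose OCR text contains "image", so the word-based check may be the safer choice if that turns out to be a problem.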

minamotorin avatar Sep 04 '21 19:09 minamotorin

I ran into the same issue. To track it down, I wrote a script very similar to the one recommended in https://github.com/vaibhavk97/GoBooDo/issues/41#issuecomment-889194688

When testing it on my png "image not available" files, it showed that the OCR text had a trailing space (after replacing newlines with spaces). So, I changed the test to return text.replace('\n', " ").strip() == 'image not available'
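A minimal sketch of that whitespace-normalized comparison, pulled out into a standalone function so it can be tried without running OCR. The function name `is_placeholder` and the first sample string (with a trailing space) are assumptions for illustration; the exact OCR output from those PNG files is not given in the thread.

```python
def is_placeholder(text):
    """Compare OCR text to 'image not available' after normalizing whitespace."""
    # Newlines become spaces; strip() then removes leading and trailing
    # whitespace, including the form feed (\x0c) that tesseract appends.
    return text.replace('\n', ' ').strip() == 'image not available'


# Assumed sample: all three words on one line, followed by a trailing space.
print(is_placeholder('image not available \n\x0c'))
# Output reported earlier in this thread: one word per line.
print(is_placeholder('image\nnot\navailable\n\x0c'))
```

This check still requires the three words to be recognized exactly, so OCR output with stray punctuation, such as the `image.\nnot \u2014\navailable` case reported earlier, would not match.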

sanjoymahajan avatar Feb 07 '24 09:02 sanjoymahajan