GoBooDo
"image not available" was not filtered out
The "image not available" images were not being filtered out, so I looked at the code and changed it as follows (on storeImages.py):
def pageEmpty(self, image):
    im = Image.open(BytesIO(image))
    width, height = im.size
    im = im.resize((int(width / 5), int(height / 5)))
    gray = im.convert('L')
    bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
    try:
        text = pytesseract.image_to_string(bw)
    except:
        pytesseract.pytesseract.tesseract_cmd = self.tesserPath
        text = pytesseract.image_to_string(bw)
    if text.find("image") == -1:
        return False
    else:
        return True
Seems to be working as expected now.
unindent does not match any outer indentation level
Thanks, it works. But for a book of about 300 pages, up to about 160 pages can no longer be obtained: neither the page link nor the page itself can be fetched, and changing the IP and request headers does not help. So now I wonder whether there is a problem with the page-link dictionary. When I delete the page-link dictionary (keeping only pagesFetched.pkl), it shows "Please delete the corresponding folder and start again or the book is not available for preview. Received invalid response". I think this part could be reworked so that links that fail too often are re-fetched.
First, thank you for the quick fix, it works great! Second, I think someone needs to close this issue; I don't think the devs are paying attention to this anymore... Too bad, it's a great program with lots of potential!
~~In my environment, the following code also works (I found this issue after I wrote this code):~~
def pageEmpty(self, image):
    im = Image.open(BytesIO(image))
    width, height = im.size
    im = im.resize((int(width / 5), int(height / 5)))
    gray = im.convert('L')
    bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
    try:
        text = pytesseract.image_to_string(bw)
    except:
        pytesseract.pytesseract.tesseract_cmd = self.tesserPath
        text = pytesseract.image_to_string(bw)
    return text == 'image\nnot\navailable\n\x0c'
~~Only the last line is changed. This check seems stricter than if text.find("image") == -1:, and I don't know whether it works in other environments. Please report whether it works, along with your environment.~~ Run the following script and check:
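#!/usr/bin/env python3
import sys

import pytesseract
from PIL import Image

path = sys.argv[1]  # path to an image that shows "image not available"
im = Image.open(path)
width, height = im.size
# Same preprocessing as pageEmpty: shrink, convert to grayscale, then threshold to black and white.
im = im.resize((int(width / 5), int(height / 5)))
gray = im.convert('L')
bw = gray.point(lambda x: 0 if x < 250 else 255, '1')
try:
    text = pytesseract.image_to_string(bw)
except Exception:
    # There is no self.tesserPath outside the class, so pass the path to the
    # tesseract executable as an optional second argument instead.
    pytesseract.pytesseract.tesseract_cmd = sys.argv[2]
    text = pytesseract.image_to_string(bw)
# Print the raw OCR output with escapes visible, then the result of the strict comparison.
print(text.encode('unicode_escape'))
print(text == 'image\nnot\navailable\n\x0c')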
By the way, should someone send a pull request to update the repository?
The check text == 'image\nnot\navailable\n\x0c' sometimes doesn't work (for example, when the OCR text is image.\\nnot \\u2014\\navailable\\n\\x0c). The check text.find("image") == -1 is good!
I ran into the same issue. To track it down, I had made a script very similar to the one recommended in https://github.com/vaibhavk97/GoBooDo/issues/41#issuecomment-889194688
When testing it on my png "image not available" files, it showed that the OCR text had a trailing space (after replacing newlines with spaces). So, I changed the test to
return text.replace('\n', " ").strip() == 'image not available'
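The variants reported in this thread (stray periods, a dash, trailing whitespace, a form feed) suggest normalizing the OCR text before comparing it. Below is a minimal sketch of that idea, not the project's actual code: the names PLACEHOLDER, normalize_ocr_text and looks_like_placeholder are hypothetical, and it assumes the placeholder text is always the English phrase "image not available".

```python
import re

# Assumption: the placeholder page always carries this exact English phrase.
PLACEHOLDER = "image not available"


def normalize_ocr_text(text):
    """Lowercase the text and turn every run of non-letters into a single space."""
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return letters_only.strip()


def looks_like_placeholder(text):
    """Return True only if the OCR text reduces to the placeholder phrase."""
    return normalize_ocr_text(text) == PLACEHOLDER


if __name__ == "__main__":
    # OCR outputs quoted earlier in this thread, plus an ordinary sentence.
    samples = [
        "image\nnot\navailable\n\x0c",
        "image.\nnot \u2014\navailable\n\x0c",
        "This image shows a map.",
    ]
    for sample in samples:
        print(repr(sample), looks_like_placeholder(sample))
```

Inside pageEmpty, the last line could then become return looks_like_placeholder(text). Compared with text.find("image"), this should not flag a real page that merely contains the word "image", while still tolerating the punctuation noise reported above.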