tesserocr

How to iterate inside an iterator level

Open shreyansh26 opened this issue 5 years ago • 5 comments

I have a problem with the following code snippet. I am extracting text at the PARA level, but since I also want to identify superscripts, I then descend to the symbol level within each PARA. However, the outer loop ends after the first PARA. How should I go about this?

api.SetImage(resized_patch)
api.Recognize()
# The level on which extraction is to be done. Choices: BLOCK, PARA, TEXTLINE, WORD, SYMBOL
level = RIL.PARA
# Get the iterator
iterator = api.GetIterator()
word_attr = []
for r in iterate_level(iterator, level):
	try:
		# Get text and font attributes of the current patch
		text = r.GetUTF8Text(level)
		font_attr = r.WordFontAttributes()

		for s in iterate_level(r, RIL.SYMBOL):
			print("SYMBOL: ", s.GetUTF8Text(RIL.SYMBOL))
			print("IsSuperscript: ", s.SymbolIsSuperscript())
		
		# print(font_attr)
	except:
		text = None
		font_attr = None

shreyansh26, Jul 18 '18 06:07

Take a look at PageIterator's IsAtFinalElement API method. You can try iterating manually while checking this condition, something like:

while not r.IsAtFinalElement(RIL.PARA, RIL.SYMBOL):
    r.Next(RIL.SYMBOL)

Just an idea. You may also find IsAtBeginningOf useful.

sirfz, Jul 18 '18 19:07

Thanks for the response. But this just tells me if there is an element (symbol) remaining in the para. How do I access the symbol, like print it or check for superscript etc.?

shreyansh26, Jul 19 '18 04:07

When you call r.Next(RIL.SYMBOL), you move the iterator to the next symbol, at which point you can call any other API method on the iterator, such as r.SymbolIsSuperscript() and others (check the code to view the available methods).
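
A minimal sketch of that idea, not a tested recipe: read from the current symbol, stop once IsAtFinalElement reports the paragraph's last symbol, and only then step at the paragraph level. The function names `paragraph_symbols` and `walk_page` are hypothetical, and the levels are passed in as arguments so the sketch does not import tesserocr itself; in real use they would be RIL.PARA and RIL.SYMBOL. It assumes that Next(RIL.PARA), called from a paragraph's last symbol, lands on the first symbol of the next paragraph.

```python
def paragraph_symbols(it, para_level, symbol_level):
    """Collect (text, is_superscript) for every symbol of the current paragraph.

    Leaves the iterator on the paragraph's last symbol instead of running
    past it, so the caller can still step at the paragraph level.
    """
    symbols = []
    while True:
        # All getters read from wherever the iterator currently points.
        symbols.append((it.GetUTF8Text(symbol_level), it.SymbolIsSuperscript()))
        if it.IsAtFinalElement(para_level, symbol_level):
            break  # last symbol of this paragraph: stop before crossing over
        if not it.Next(symbol_level):
            break  # end of page
    return symbols


def walk_page(it, para_level, symbol_level):
    """Return a list of (paragraph_text, symbols), using a single iterator."""
    results = []
    while True:
        # Read the paragraph text first, while the iterator is at its start.
        text = it.GetUTF8Text(para_level)
        results.append((text, paragraph_symbols(it, para_level, symbol_level)))
        if not it.Next(para_level):
            break  # no further paragraphs
    return results
```

With tesserocr this would be called as walk_page(api.GetIterator(), RIL.PARA, RIL.SYMBOL) after api.Recognize(). The original snippet stopped after the first PARA presumably because its inner iterate_level loop drove the shared iterator to the end of the page.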

sirfz, Jul 20 '18 13:07

This is somewhat related: is it possible to copy a PyResultIterator instance (the ResultIterator class is copy-constructible in the original)? I check for words' confidence (word level) and then want to go to character level in a separate iterator in case the confidence is not sufficient. Or is there perhaps a more efficient alternative?

prizmatic, Sep 07 '18 10:09

"I check for words' confidence (word level) and then want to go to character level in a separate iterator in case the confidence is not sufficient."

You don't have to. You can use the same result iterator at the lower level, too. It advances at the RIL.WORD level once you have called Next(RIL.SYMBOL) on all of the word's constituent characters. That works across all levels (except for ChoiceIterator), even several at once.
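
As a rough sketch of using one iterator for both levels: check each word's confidence and descend to symbols only for doubtful words. The function name `low_confidence_words` and the `min_conf` threshold of 80 are assumptions made for illustration. The levels are passed in as arguments (RIL.WORD and RIL.SYMBOL in real use) so the sketch does not import tesserocr.

```python
def low_confidence_words(it, word_level, symbol_level, min_conf=80.0):
    """Return (word, word_conf, [(symbol, symbol_conf), ...]) for doubtful words.

    The same iterator serves both levels: it steps symbol by symbol inside
    a low-confidence word, then continues word by word.
    """
    flagged = []
    while True:
        conf = it.Confidence(word_level)
        if conf < min_conf:
            word = it.GetUTF8Text(word_level)
            symbols = []
            while True:
                symbols.append(
                    (it.GetUTF8Text(symbol_level), it.Confidence(symbol_level))
                )
                # Stop on the word's last symbol, or at the end of the page.
                if it.IsAtFinalElement(word_level, symbol_level) or not it.Next(symbol_level):
                    break
            flagged.append((word, conf, symbols))
        # Works from the word's first or last symbol alike.
        if not it.Next(word_level):
            break
    return flagged
```

A call would look like low_confidence_words(api.GetIterator(), RIL.WORD, RIL.SYMBOL) after api.Recognize(). No copy of the iterator is made at any point.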

bertsky, Dec 18 '19 02:12