Fix for LSTM Diplopia issue

Open woodjohndavid opened this issue 4 years ago • 27 comments

This is a proposed fix for the LSTM diplopia problem where 2 characters are included in the LSTM output stream for the same physical position in the original image. According to my review of trace output for a limited number of test cases, the issue occurs when there are 2 possible characters essentially 'competing' for the same spot, where one of those characters is a better match in the earlier timesteps but the second character (usually the better eventual match) becomes the better choice in later timesteps. In this scenario, it is possible that there will be a beam which includes the first character choice and then adds the second character choice in the same beam after the first, once the first choice score has been reduced and it no longer appears in the TopN list.

This solution is limited to solving diplopia for 2 characters, but could be expanded to deal with a multiple character scenario.

This solution is also dependent upon assigning a value to variable kMinDiplopiaKey which is the minimum score (key value) for an output entry coming from the matrix which would be considered a likely valid character. See the code for further details.
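
To illustrate the mechanism described above, here is a minimal toy sketch in Python. It is not the code from this pull request and does not use Tesseract's data structures: the per-timestep candidate lists, the `decode` function, and the way `min_key` is applied are all assumptions made for illustration. `min_key` only plays a role loosely analogous to kMinDiplopiaKey. Two different characters that win consecutive timesteps are treated as competing for the same spot when the loser still scores at least `min_key`, and only the one with the higher peak score is kept.

```python
from typing import Dict, List, Optional, Tuple

NULL = ""  # stands in for the null (blank) output between characters

# Hypothetical per-timestep candidates as (character, score) pairs.
steps: List[List[Tuple[str, float]]] = [
    [("D", 0.90), (NULL, 0.10)],
    [(NULL, 0.90), ("D", 0.10)],
    [("0", 0.55), ("O", 0.45)],   # "0" is slightly ahead early on
    [("O", 0.80), ("0", 0.20)],   # "O" becomes the better match later
    [(NULL, 0.95), ("O", 0.05)],
    [("4", 0.90), (NULL, 0.10)],
]


def decode(steps: List[List[Tuple[str, float]]],
           min_key: Optional[float] = None) -> str:
    """Take the top candidate per timestep, collapse repeats, drop nulls.

    With min_key set, two different characters on consecutive timesteps are
    merged into one position if either was still a plausible candidate
    (score >= min_key) while the other was on top.
    """
    slots: List[List] = []            # [character, peak score] per position
    prev_top = NULL
    prev_scores: Dict[str, float] = {}
    for cands in steps:
        scores = dict(cands)
        top = max(scores, key=scores.get)
        if top != NULL:
            competing = (
                min_key is not None
                and prev_top != NULL
                and top != prev_top
                and (prev_scores.get(top, 0.0) >= min_key
                     or scores.get(prev_top, 0.0) >= min_key)
            )
            if top == prev_top or competing:
                # Same physical position: keep whichever character peaks higher.
                if slots[-1][0] == top:
                    slots[-1][1] = max(slots[-1][1], scores[top])
                elif scores[top] > slots[-1][1]:
                    slots[-1] = [top, scores[top]]
            else:
                slots.append([top, scores[top]])
        prev_top = top
        prev_scores = scores
    return "".join(ch for ch, _ in slots)


plain = decode(steps)                   # no diplopia check
checked = decode(steps, min_key=0.25)   # with the toy check
print(plain, checked)
```

In this toy input the plain decode emits both "0" and "O" for one position, while the checked decode keeps only "O". A threshold that is too low would also merge genuinely adjacent characters, which is why the choice of value matters.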

I have tested this only on a very small set of diplopia problems. I am assuming that you folks have a much more extensive set of test cases to run this proposed change on to ensure that it has no unexpected results.

woodjohndavid avatar Jun 29 '21 18:06 woodjohndavid

Stefan, I just committed the changes you suggested to branch JDWDIPLOPIA

Thanks,

Dave

woodjohndavid avatar Jun 29 '21 20:06 woodjohndavid

Stefan, I just committed the changes you suggested to branch JDWDIPLOPIA

I still don't see those latest changes. Did you push them to GitHub?

stweil avatar Jun 30 '21 05:06 stweil

Hi Stefan,

OK, apologies, newbie inexperience at this end. I have no previous Git experience. I am using GitHub Desktop for source control. I had committed the changes at my end, but had not done the 'push to origin'. Please check again now and hopefully you will find the changes there.

Sorry about that.

Dave

woodjohndavid avatar Jun 30 '21 19:06 woodjohndavid

I'm sorry for making this kind of comment, but what's the status on this?

MinmoTech avatar Jan 22 '22 11:01 MinmoTech

what's the status on this

Meanwhile there are some merge conflicts which must be resolved.

But the most important thing is that we need to test the changes to confirm that they improve the diplopia issue and don't introduce a regression. So it would help if anyone who has run OCR before and after those changes, and can confirm that they improve the OCR result without making other recognition worse, could report it here.

stweil avatar Jan 22 '22 13:01 stweil

It doesn't seem to resolve the issue for me:

Recognized text: けげんこうちょう…?知らない町だな。 (The first character is duplicated)

Original Image: screenshot(575)

After applying my optimizations: image

(I.e. tesseract only sees the second image)

To make sure I'm actually using the right build, here are my steps to compile:

  • checking out the repo
  • checking out the PR with gh pr checkout 3476
  • ./autogen.sh
  • ./configure
  • make

MinmoTech avatar Jan 23 '22 08:01 MinmoTech

Hello all:

Sorry, but I have no previous experience with open source development mechanisms, so not sure how to move things forward.

The changes I made in this pull request have been successfully tested by me, but admittedly only on a limited set of data and the fairly specific requirements I have, namely English text from computer screen contents with specific limited fonts being used. So indeed there is much more testing needed, but I have no mechanism for doing this.

For the test results reported by Juligreen, I would suggest that you investigate different values for the variable kMinDiplopiaKey. This value is critical to the identification of potential diplopia, and probably should be made into a configuration setting. But again, how to do that is beyond my level of experience.

OK, I think I have fixed the merge conflicts and committed the changes to my branch. Hopefully I did it properly. @stweil please let me know.

woodjohndavid avatar Jan 30 '22 22:01 woodjohndavid

With a dataset of around 4000 images that produced 103 "diplopia" affected results, this branch reduced it down to only 38.
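
As a quick check of what these figures imply, the arithmetic can be sketched as follows. The numbers are taken from the comment above, and "around 4000" is approximate, so the affected rates are only indicative.

```python
# Figures quoted in the comment above; total_images is approximate.
total_images = 4000
before = 103   # diplopia-affected results without the branch
after = 38     # diplopia-affected results with the branch

fixed = before - after
reduction = fixed / before

print(f"fixed {fixed} of {before} cases ({reduction:.0%})")
print(f"affected rate: {before / total_images:.2%} -> {after / total_images:.2%}")
```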

6A61736F6E206E61646572 avatar Feb 17 '22 04:02 6A61736F6E206E61646572

Thanks @ohk2kt3t4 that is good to know. Could you please attach a couple of the images you used where the diplopia was not successfully eliminated? I would like to take a look and see if I can figure out why.

woodjohndavid avatar Feb 17 '22 17:02 woodjohndavid

Unfortunately I cannot share the images, but I can try to help you investigate if it is trivial enough.

6A61736F6E206E61646572 avatar Feb 18 '22 04:02 6A61736F6E206E61646572

@ohk2kt3t4

With a dataset of around 4000 images that produced 103 "diplopia" affected results, this branch reduced it down to only 38.

Can you describe how you classified diplopia? Automatically with some heuristic rules (which ones?) or manually?

Then others can set up regression tests with license-free images.

wollmers avatar Feb 18 '22 10:02 wollmers

All my images are of part numbers which have the same number of characters and follow the same character placement pattern (e.g. the first two characters are always digits): image.

When "diplopia" occurs, the resulting OCR output is always longer than expected because tesseract has output two characters in the place of one (I see a lot of "0O", "B8" etc. pairs). So in that sense I am able to detect them automatically, but it might not be useful for other use cases. Perhaps I could try generating some images programmatically and see if I can end up with some that trigger diplopia; those could then be shared and used for testing.
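
A minimal sketch of this kind of length-based check might look like the following. The function name, the expected length of 6, and the sample strings are hypothetical. Only the "0O" and "B8" pairs come from the comment above, and the pair list would need extending for other data.

```python
from typing import List, Tuple

# Confusable pairs mentioned above; extend for other fonts or alphabets.
CONFUSABLE_PAIRS = {frozenset("0O"), frozenset("B8")}


def find_diplopia(text: str, expected_len: int) -> List[Tuple[int, str]]:
    """Return (index, pair) for suspicious pairs in an over-long result.

    Only results longer than the known fixed length are inspected, so a
    legitimate "0O" in a correctly sized result is not flagged.
    """
    chars = text.replace(" ", "")
    if len(chars) <= expected_len:
        return []
    hits = []
    for i in range(len(chars) - 1):
        pair = chars[i:i + 2]
        if frozenset(pair) in CONFUSABLE_PAIRS:
            hits.append((i, pair))
    return hits


# Hypothetical 6-character part numbers.
too_long = find_diplopia("120OAB7", expected_len=6)
correct = find_diplopia("12OAB7", expected_len=6)
print(too_long, correct)
```

This only works when the expected length is known in advance, which matches the fixed-format part numbers described here but not general text.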

6A61736F6E206E61646572 avatar Feb 21 '22 01:02 6A61736F6E206E61646572

I am also seeing this problem in some cases, where we try to automatically OCR a lot of text, store it in spreadsheets, and then manually verify the results. I can pull in this pull request, re-run that OCR, and check whether the diplopia problems are reduced or gone, if that's helpful. It'd be a narrow/specific set of mostly images like this:

image

Please let me know if that's helpful, and I'll try to do the evaluation. The images should be public, so I could share results where diplopia is (potentially) not eliminated.

MerlijnWajer avatar Feb 28 '22 13:02 MerlijnWajer

@MerlijnWajer it would be very helpful if you could test my diplopia fix as much as possible, and report your findings here. Also, if there are any diplopia examples which remain, and if you are able to share your images, please attach them here after your testing so I can take a further look.

I do know that the fix I posted was limited in scope, but it seemed to correct the diplopia cases we have generally encountered in our own use of Tesseract. Typically those cases involved situations where one particular character is fairly closely matched at the beginning of the given image segment, but then a second character turns out to be a better match as that image segment is further traversed. This can result in one or more beams which contain both characters, with the likelihood that such a beam will get a higher score and therefore be considered as the best match.

The fix as it currently stands does not address cases where there are more than 2 potential matching characters.

woodjohndavid avatar Mar 03 '22 17:03 woodjohndavid

@ohk2kt3t4 I understand that you are unable to share your images on this site, but perhaps you could just extract a few specific part numbers where diplopia is still occurring using my fix, and just attach partial images of those.

woodjohndavid avatar Mar 03 '22 17:03 woodjohndavid

As mentioned earlier on, I have no previous experience participating in an open source community. However, I am a very experienced developer and am interested in working on Tesseract issues, particularly those that are related to OCR of non-textual data, like part numbers, codes, etc. It is particularly in those cases that diplopia is a problem when it occurs.

I have asked earlier but need to ask again. How is it decided that things like the change that I have put forward in this pull request actually get included in the primary Tesseract release? What should I be doing to move this forward?

As indicated earlier in this thread, there have been some fairly large test runs done by @ohk2kt3t4 which seem to indicate that the fix in this pull request does fix a large percentage of diplopia cases (although not all) and does not seem to have negative side effects. So I am not sure why it hasn't moved forward, even though it is not a full solution to the diplopia issue.

In the meanwhile, I am continuing to work with the code, and trying to see if I can come up with a more universal solution.

woodjohndavid avatar May 09 '22 23:05 woodjohndavid

@woodjohndavid Hello, I came here to see whether this fixes my issues with wrongly positioned boxes, but it doesn't. I wrote a lot about it here: https://github.com/tesseract-ocr/tesseract/issues/3477

I am interested in this:

As I see it, therefore the LSTM matrix processing using the NetworkIO interface needs to add to its return values (in addition to the possible character and the likelihood score) the starting pixel location of the possible match, and the horizontal size of the potential match image from the train data. Once that is done, the rest should be relatively straightforward.

Can you point me to the code? I have a hard time navigating it.

exander77 avatar May 09 '22 23:05 exander77

Yes, the code is hard to follow. And I agree with you that the ultimate solution should be that the LSTM engine returns the match coordinates in some form. However, it does not at this point, and I do not understand the LSTM engine operation sufficiently well to figure out how to get it to return those coordinates.

So what I have been working on is the code that runs after the LSTM engine does its thing, to see if there is a way to solve the diplopia problem at that stage. This pull request is my initial attempt on this, which is partially but not entirely successful. It does NOT make any attempt to correct the inaccurate box dimensions.

One thing you can try (which I did also) is to reduce the "timestep" size. That does improve the box dimensions with the code as it is otherwise, although when diplopia occurs they are still messed up. However, it seems that reducing the "timestep" would require full re-training of the LSTM model.

woodjohndavid avatar May 09 '22 23:05 woodjohndavid

However, it does not at this point, and I do not understand the LSTM engine operation sufficiently well to figure out how to get it to return those coordinates.

Yeah, that pretty much summarizes my experiences so far. There is a tonne of comments in the code, but not really any explanation of the operation.

exander77 avatar May 09 '22 23:05 exander77

@woodjohndavid I am not even sure where the character positions are calculated.

exander77 avatar May 10 '22 00:05 exander77

@MerlijnWajer Thanks for guiding me to this PR! @woodjohndavid Hello, I tested the image below with the fixes from your PR, using different kMinDiplopiaKey values (0.25, 0.5, 0.75), but the result (DOT 0O4N 6VHPPC) remains unchanged, with an extra zero (the 0 in 0O4N). TesseractInputImageSingle

Could you please take a look if it is possible to improve on this case?

liuyl07 avatar Jun 13 '22 12:06 liuyl07

Updates for my question above.

The log below, printed from the function RecodeBeamSearch::ExtractBestPathAsWords(), shows that there are 2 possible characters (zero 0 and capital O) essentially 'competing' for the same spot. Both of them eventually appear in the final recognition result. image

However, it seems that various kMinDiplopiaKey values don't help for this particular case...

Another interesting clue, which may help us resolve the diplopia issue to some extent: before the recode beam search, the number of extracted blobs is exactly 12, but after it we get 13 characters in the recognition result, including the one spurious zero shown in my last comment.
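
That mismatch can be turned into a simple sanity check, sketched below in Python. The function name is hypothetical and this is not a Tesseract API. It also assumes one blob per character, which will not hold for touching or broken glyphs.

```python
def extra_characters(text: str, blob_count: int) -> int:
    """Return how many more non-space characters were recognized than blobs found."""
    recognized = sum(1 for ch in text if not ch.isspace())
    return recognized - blob_count


# Values from this comment: 12 blobs, recognized text "DOT 0O4N 6VHPPC".
surplus = extra_characters("DOT 0O4N 6VHPPC", blob_count=12)
print(surplus)
```

A positive surplus would only flag a line as suspicious; it does not say which character is the duplicate.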

liuyl07 avatar Jun 27 '22 10:06 liuyl07

@woodjohndavid Sorry to trouble you. From all your comments in the pull request, I think you would definitely be interested in cases where your fixes can be further improved. Could you take a look at my example, which meets the conditions of your fixes, i.e., diplopia for 2 characters (0 and O)?

liuyl07 avatar Jul 16 '22 09:07 liuyl07

Hello @liuyl07

Sorry, I have been busy with other things and don't get back to Tesseract very often. Anyway, it is not surprising that there are some diplopia cases that don't work with this fix, and some that do. If you look at the posts by @ohk2kt3t4 earlier in this thread, he found with a fairly big sample that this fix works for roughly 63% of the diplopia examples (103 cases reduced to 38), so it is an improvement, but not a cure.

However, I have been looking at this in more detail and have some additional ideas that might help further. So while I can't promise any particular timing, I will let you know if/when I have another attempt. I will use your sample as one test case.

woodjohndavid avatar Jul 18 '22 21:07 woodjohndavid

Another update @liuyl07

I did run your sample on my system, and did NOT get the same result as you did. I did not get the diplopia '0O'; I just got the 'O'.

woodjohndavid avatar Jul 18 '22 22:07 woodjohndavid

Is this code merged? I am getting a lot of diplopia in my projects lately. I was wondering if it can reduce it.

DesBw avatar Sep 08 '23 14:09 DesBw