Correct handling of TM sign
Copied from issue 59
[reply to @Shreeshrii]
@theraysmith commented
TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.
Copied from: issue https://github.com/tesseract-ocr/tesseract/issues/761
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/JvEF7f0KU8I/La50m7SzEgAJ
The trademark symbol is still not recognized properly. With the newly generated traineddata, the symbol is recognized as the letters TM.
Looking at Latin.unicharset, I see that the normalized form of these symbols is just the regular digits or the letters TM.
@theraysmith Does this need to be changed?
™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
² 0 3,192,209,255,50,248,0,105,0,293 Common 1090 2 1090 2 # ² [b2 ]
³ 0 0,192,209,255,48,268,0,99,0,293 Common 1091 2 1091 3 # ³ [b3 ]
After thinking about this carefully, I decided to undo a change I had made for the LSTM engine, and better solve tatweel.
The fi/fl ligatures will no longer be included in unicharsets, but will still be included in the training text, by replacing them with fi/fl pairs at the same time that tatweel is deleted. This allows the output to be un-normalized, shaped quotes to be brought back, and the TM symbol recognized as a single character. It doesn't help with the sub/superscript problem, and I have another idea that I want to try that is more important first...
-- Ray (by email, replying to Shreeshrii's comment of Fri, Mar 31, 2017, 2:10 AM)
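The label normalization Ray describes above (expand fi/fl ligatures to letter pairs, delete tatweel, leave ™ alone) can be illustrated with a minimal Python sketch. This is not the actual Tesseract training code; the function name `normalize_label_text` and the choice of `str.translate` are assumptions made purely for illustration.

```python
# Illustrative sketch only; NOT the actual Tesseract implementation.
# The function name and approach are hypothetical.

TATWEEL = "\u0640"  # ARABIC TATWEEL: stretches rendered text, carries no content
LIGATURES = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
}


def normalize_label_text(text: str) -> str:
    """Expand fi/fl ligatures to letter pairs and delete tatweel.

    Every other character, including the trademark sign U+2122,
    passes through unchanged.
    """
    table = {ord(lig): pair for lig, pair in LIGATURES.items()}
    table[ord(TATWEEL)] = None  # mapping to None deletes the character
    return text.translate(table)


if __name__ == "__main__":
    sample = "\ufb01lm re\ufb02ect \u2122"
    print(normalize_label_text(sample))
```

The point of the sketch is the asymmetry: the ligatures still yield output characters (as letter pairs), tatweel yields nothing, and ™ stays a single character.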
https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-294256255
theraysmith commented
Thanks for opening the new issues 62, 63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.
https://github.com/tesseract-ocr/tesseract/commit/b0ead95d64a366
@theraysmith
tesseract-ocr/tesseract@b0ead95 does not seem to solve this. Does it also require your newer language models?
I ran replace-a-layer training with the fonts FreeSerif and FreeSans until the error rate reached 0.01%. However, it still recognizes the ™ trademark sign as the letters TM rather than as the sign, even when tested on the same tif that was used for training.
A zip file with the training text, synthetic training images, generated traineddata, and OCR output with --oem 1 is attached.
tesseract -v
tesseract b0ead95
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.1.2
tesseract eng.FreeSerif.exp0.tif eng.FreeSerif.englayer -l englayer --oem 1 --psm 6 --tessdata-dir ../../tessdata
I notice that the unicharset still has TM as the normalized form instead of the sign. Does Latin.unicharset need updating?
™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM # ™ [2122 ]
· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 · # · [b7 ]p
℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM # ℠ [2120 ]
℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗ # ℗ [2117 ]
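To check entries like these without reading them by eye, a small Python sketch can split each line into fields and flag characters whose normalized form differs from the character itself. The field names follow my reading of the unicharset(5) layout (character, properties, glyph metrics, script, other case, direction, mirror, normalized form) and should be treated as an assumption; `parse_unicharset_line` and `is_normalized_away` are hypothetical helpers, not part of Tesseract.

```python
from typing import NamedTuple


class UnicharEntry(NamedTuple):
    # Field names are an assumption based on the unicharset(5) description.
    unichar: str
    properties: str
    glyph_metrics: str
    script: str
    other_case: int
    direction: int
    mirror: int
    normed: str


def parse_unicharset_line(line: str) -> UnicharEntry:
    """Parse one full-format unicharset entry (hypothetical helper)."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    # Anything after the eighth field ("# ... [hex ]") is a trailing comment.
    unichar, props, metrics, script, other, direction, mirror, normed = fields[:8]
    return UnicharEntry(
        unichar, props, metrics, script,
        int(other), int(direction), int(mirror), normed,
    )


def is_normalized_away(entry: UnicharEntry) -> bool:
    """True when the normalized form is not the character itself."""
    return entry.normed != entry.unichar


if __name__ == "__main__":
    lines = [
        "™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM # ™ [2122 ]",
        "℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗ # ℗ [2117 ]",
    ]
    for line in lines:
        entry = parse_unicharset_line(line)
        print(entry.unichar, entry.normed, is_normalized_away(entry))
```

With the entries quoted above, ™ and ℠ would be flagged (normalized to TM and SM), while · and ℗ normalize to themselves.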
No, there are still one or two commits to go before that will work. I might get them in today.
-- Ray (by email, replying to Shreeshrii's comment of Tue, Jul 25, 2017, 5:14 AM)
Right, try it now. You need commits b0ead95d..0e95e2ca and 1a0f501..3e32be3 (in langdata); I think they are everything you need. The new English model will contain TM.
-- Ray (by email, following up on his reply of Tue, Jul 25, 2017, 8:29 AM)
Ray, I updated langdata and tesseract and built tesseract again.
With the new traineddata, TM is not being recognized at all - it is getting dropped.
With eng.traineddata:
The trademark symbol (*), in Unicode U+2122 *~ trade mark sign (HTML ™ — ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark
is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),
With the new englayer.traineddata:
The trademark symbol (), in Unicode U+2122 " trade mark sign (HTML ™ · ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark
is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),
I used the old .lstmf files to do training - would that be a problem?
@theraysmith
I trained again after creating new box/tiff and lstmf files using the new code and new langdata.
TM sign is now being recognized correctly.
It is also NOT treating fl and fi as ligatures but as separate letters in words such as film, first, flounder, reflect etc.
Thanks!
Great! That is the objective with the fi and fl ligatures. They now have a similar status to tatweel: used for rendering, but not for output. The difference, of course, is that fi and fl still produce output characters, whereas tatweel disappears completely.
-- Ray (by email, replying to Shreeshrii's comment of Thu, Jul 27, 2017, 8:31 PM)