
Correct handling of TM sign

Open amitdo opened this issue 9 years ago • 11 comments

Copied from 59


[reply to @Shreeshrii]

@theraysmith commented

TM is also difficult, as it is in conflict with the needs of fi/fl, which should not appear in the output.

amitdo avatar Mar 31 '17 09:03 amitdo

Copied from: issue https://github.com/tesseract-ocr/tesseract/issues/761

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/JvEF7f0KU8I/La50m7SzEgAJ

the trademark symbol is still not recognized properly. With the newly generated traineddata the symbol is recognized as TM

Looking at Latin.unicharset I see that the normalized form of these is just the regular numbers or TM.

@theraysmith Does this need to be changed?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 1496 10 1496 TM # ™ [2122 ]
² 0 3,192,209,255,50,248,0,105,0,293 Common 1090 2 1090 2 # ² [b2 ]
³ 0 0,192,209,255,48,268,0,99,0,293 Common 1091 2 1091 3 # ³ [b3 ]

Shreeshrii avatar Mar 31 '17 09:03 Shreeshrii

After thinking about this carefully, I decided to undo a change I had made for the LSTM engine, and to solve the tatweel problem in a better way.

The fi/fl ligatures will no longer be included in unicharsets, but will still be included in the training text, by replacing them with fi/fl pairs at the same time that tatweel is deleted. This allows the output to be un-normalized, shaped quotes to be brought back, and the TM symbol recognized as a single character. It doesn't help with the sub/superscript problem, and I have another idea that I want to try that is more important first...
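The text-side effect of that change can be sketched in a few lines. This is only an illustration of the behaviour described above, not Tesseract's actual code: the function name `normalize_training_text` is hypothetical, and the sketch assumes the ligatures in question are U+FB01 and U+FB02 and that tatweel is U+0640.

```python
# Hypothetical sketch of the normalization described above; not Tesseract code.
LIGATURES = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
}
TATWEEL = "\u0640"  # ARABIC TATWEEL, used only for rendering


def normalize_training_text(text):
    """Expand fi/fl ligatures into letter pairs and delete tatweel.

    Every other character, including the trademark sign, is left untouched.
    """
    for ligature, pair in LIGATURES.items():
        text = text.replace(ligature, pair)
    return text.replace(TATWEEL, "")
```

With this kind of mapping the ligature glyphs can still be rendered in training images, while the expected output contains only plain letter pairs and the ™ sign survives as a single character.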

theraysmith avatar Mar 31 '17 22:03 theraysmith

https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-294256255

theraysmith commented

Thanks for opening the new issues 62, 63. I will continue to think about the best approach. I tried to include TM in the current round of training, but it is too infrequent to have made the cut line. I will have to add it to the desired_characters list.
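The interaction between a frequency cut and a list of forced characters might look roughly like the sketch below. The real selection logic in the training pipeline is not shown in this thread, so the function `select_charset`, its `min_count` threshold and the simple counting rule are all assumptions made for illustration.

```python
from collections import Counter


def select_charset(text, min_count, desired_characters=""):
    """Keep characters seen at least min_count times, plus any forced ones.

    Hypothetical illustration: min_count stands in for the "cut line" and
    desired_characters for a list of symbols to keep regardless of frequency.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    frequent = {ch for ch, n in counts.items() if n >= min_count}
    return frequent | set(desired_characters)
```

Under these assumptions a rare symbol such as ™ falls below the threshold on its own, and is kept only when it is listed explicitly.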

amitdo avatar Apr 15 '17 10:04 amitdo

https://github.com/tesseract-ocr/tesseract/commit/b0ead95d64a366

amitdo avatar Jul 24 '17 18:07 amitdo

@theraysmith

tesseract-ocr/tesseract@b0ead95 does not seem to solve this. Does it also require your newer language models?

I ran replace-a-layer training with the fonts FreeSerif and FreeSans until the error rate reached 0.01%. However, the model still recognizes the TM trademark sign as the letters TM and not as the sign, even when tested on the same tif that was used for training.

A zip file with the training text, synthetic training images, generated traineddata and OCR output with --oem 1 is attached.

eng.englayer.zip

tesseract -v
tesseract b0ead95
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.0 : libopenjp2 2.1.2

tesseract eng.FreeSerif.exp0.tif eng.FreeSerif.englayer -l englayer --oem 1 --psm 6 --tessdata-dir ../../tessdata

Shreeshrii avatar Jul 25 '17 12:07 Shreeshrii

I notice that the unicharset still has TM as the normalized form instead of the sign itself. Does Latin.unicharset need updating?

™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM	# ™ [2122 ]
· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 ·	# · [b7 ]p
℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM	# ℠ [2120 ]
℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗	# ℗ [2117 ]
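A quick way to spot such entries is to compare the first field of each line with the last field before the `#` comment. The sketch below assumes the layout visible in the lines above (character, properties, metrics, script, other case, direction, mirror, normalized form); the function name `normalized_mismatches` is hypothetical and the sample is copied from this comment.

```python
# Sample lines copied from the unicharset excerpt above.
SAMPLE = (
    "™ 0 63,201,209,255,101,273,0,59,104,293 Common 112 10 112 TM\t# ™ [2122 ]\n"
    "· 10 64,148,129,255,13,238,5,125,39,293 Common 113 10 113 ·\t# · [b7 ]p\n"
    "℠ 0 130,152,235,249,167,228,3,30,192,234 Common 114 10 114 SM\t# ℠ [2120 ]\n"
    "℗ 0 13,65,229,255,165,244,0,30,169,273 Common 115 10 115 ℗\t# ℗ [2117 ]\n"
)


def normalized_mismatches(text):
    """Return (character, normalized form) pairs where the two differ.

    Assumes whitespace-separated fields with the normalized form at index 7.
    """
    mismatches = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # header or malformed line
        char, normed = fields[0], fields[7]
        if char != normed:
            mismatches.append((char, normed))
    return mismatches
```

On the four lines above this flags ™ (normalized to TM) and ℠ (normalized to SM), while · and ℗ map to themselves.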

Shreeshrii avatar Jul 25 '17 12:07 Shreeshrii

No, there are still one or two commits to go before that will work. I might get them in today.

theraysmith avatar Jul 25 '17 15:07 theraysmith

Right, try it now. You need commits b0ead95d..0e95e2ca, and 1a0f501..3e32be3 in langdata. I think they are everything you need. The new English model will contain TM.

theraysmith avatar Jul 25 '17 16:07 theraysmith

Ray, I updated langdata and tesseract and built tesseract again.

With the new traineddata, TM is not being recognized at all - it is getting dropped.

with eng.traineddata

The trademark symbol (*), in Unicode U+2122 *~ trade mark sign (HTML ™ — ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),

with new englayer.traineddata

The trademark symbol (), in Unicode U+2122 " trade mark sign (HTML ™ · ™),
\texttrademark in LaTeX,[1] [2] is a symbol used to indicate an assertion that the preceding mark

is a trademark. Registered trademarks are indicated using the registered trademark symbol (®),

I used the old .lstmf files to do training - would that be a problem?

Shreeshrii avatar Jul 26 '17 03:07 Shreeshrii

@theraysmith

I trained again after creating new box/tiff and lstmf files using the new code and new langdata.

TM sign is now being recognized correctly.

It is also NOT treating fl and fi as ligatures but as separate letters in words such as film, first, flounder, reflect etc.

Thanks!

eng.FreeSerif.engTM.txt

Shreeshrii avatar Jul 28 '17 03:07 Shreeshrii

Great! That is the objective with the fi and fl ligatures. They now have a similar status to tatweel: used for rendering, but not for output. The difference, of course, is that fi and fl still produce output characters, whereas tatweel disappears completely.

theraysmith avatar Jul 28 '17 04:07 theraysmith