grobid-quantities

Value expressed with alphabetic characters

Open kermitt2 opened this issue 8 years ago • 4 comments

For example, "twenty kilos": recognition is currently very poor, and there is no normalization into numerical values.

We should:

  • add a matching feature for this very limited vocabulary in the quantity model (it will generalize the examples in the training data),
  • add a dedicated normalization.
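To make the first point concrete, here is a minimal sketch of what a closed-vocabulary matching feature could look like. It is written in Python for brevity and is not the project's code; the word list, the function names `number_word_feature` and `featurize`, and the `NUMWORD`/`O` labels are all hypothetical.

```python
# Hypothetical closed vocabulary of English number words (an assumption,
# not the lexicon actually used by grobid-quantities).
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "million", "billion",
}


def number_word_feature(token: str) -> bool:
    """Return True if the token is a number word or a hyphenated compound
    made only of number words (e.g. "twenty-one")."""
    parts = token.lower().split("-")
    return all(part in NUMBER_WORDS for part in parts)


def featurize(tokens):
    """Pair each token with a binary lexicon feature that a sequence
    labeller (such as a CRF) could consume alongside its other features."""
    return [
        (token, "NUMWORD" if number_word_feature(token) else "O")
        for token in tokens
    ]


if __name__ == "__main__":
    print(featurize(["about", "twenty", "kilos"]))
```

Because the feature fires on any word in the vocabulary, the model can treat "seventy" like "twenty" even if only one of them appears in the training data.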

kermitt2, Mar 22 '16 20:03

Basic word-to-number normalization (English only) added with commit 18379a25a49b17bf61e4d9bf0cdcd3504ad90cb3
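For readers unfamiliar with the technique, a basic English word-to-number normalization can be sketched as below. This is an illustration in Python under my own assumptions, not the code from the commit; the tables and the function name `words_to_number` are hypothetical.

```python
# Illustrative lookup tables (assumed, not taken from the commit).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def words_to_number(text: str) -> int:
    """Convert an English number phrase such as "twenty-one" to an int."""
    total = 0      # sum of the groups already closed by a scale word
    current = 0    # value of the group being built (below 1000)
    seen = False   # whether at least one number word was consumed
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # "one hundred and five"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # A bare "hundred" is read as "one hundred".
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
        seen = True
    if not seen:
        raise ValueError("no number word found")
    return total + current


if __name__ == "__main__":
    print(words_to_number("twenty"))
    print(words_to_number("two thousand three hundred and five"))
```

The sketch does no grammar checking, so ill-formed phrases such as "twenty thirty" are summed rather than rejected.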

kermitt2, Apr 07 '16 03:04

Added a number-word matching feature to the quantity model with commit 5e0b1b72be33e0a25bdff769ec1d5f1f182e6cdc -> detection works much better now

[screenshot from 2016-04-10 18 01 07]

kermitt2, Apr 10 '16 16:04

FYI http://stackoverflow.com/questions/3911966/how-to-convert-number-to-words-in-java

lfoppiano, Apr 14 '16 12:04

We're doing the opposite of that Stack Overflow question: we convert words to numbers ;)

But anyway, it's almost finished; we just need to cover a few irregular expressions (like "oh", "dozen", "half",...), both as a CRF feature and in the normalization.
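As a rough illustration of how such irregular expressions might be normalized, here is a small Python sketch. The interpretation is my own assumption and not necessarily what the project ends up doing: "oh" is read as the digit 0 in digit-by-digit sequences, while "dozen" and "half" act as multipliers (12 and 0.5). The function name `normalize_irregular` and both tables are hypothetical.

```python
# Assumed readings: "oh" is a spoken zero, used digit by digit.
DIGITS = {
    "oh": 0, "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}
# Assumed readings: these words multiply whatever precedes or follows them.
MULTIPLIERS = {"half": 0.5, "dozen": 12}


def normalize_irregular(text: str):
    """Normalize phrases such as "three oh five" or "half a dozen"."""
    # Articles carry no value here ("a dozen", "half a dozen").
    words = [w for w in text.lower().split() if w not in ("a", "an")]
    if not words:
        raise ValueError("no number word found")

    # Case 1: every word is a digit word, so read the digits in sequence.
    if all(w in DIGITS for w in words):
        return int("".join(str(DIGITS[w]) for w in words))

    # Case 2: multiply the factors together ("two dozen" -> 2 * 12).
    value = 1
    for w in words:
        if w == "oh":
            raise ValueError("'oh' is only handled in digit sequences")
        if w in DIGITS:
            value *= DIGITS[w]
        elif w in MULTIPLIERS:
            value *= MULTIPLIERS[w]
        else:
            raise ValueError(f"unsupported word: {w!r}")
    return value


if __name__ == "__main__":
    for phrase in ("three oh five", "two dozen", "half a dozen"):
        print(phrase, "->", normalize_irregular(phrase))
```

For the CRF side, the same words would simply be added to the matching vocabulary so that the feature also fires on them.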

kermitt2, Apr 14 '16 12:04