grobid-quantities

Unit parsing: full unit names, full names with inflections

Open kermitt2 opened this issue 8 years ago • 4 comments

Since we moved from "lexical mapping" (not to say rules!) to a CRF parser to process and normalize unit expressions, full unit names are no longer covered by the unit parser, e.g. "hours" in "2 hours".

[WARN ] org.grobid.core.engines.QuantityParser: Could not normalize the value: 2. 
org.grobid.core.data.normalization.NormalizationException: The unit Unit{rawName='hours', offsets=661   666, productBlock=null} cannot be normalized. It is either not a valid unit or it is not recognized from the available parsers.

kermitt2, Mar 17 '16 21:03

Should work now. The only glitch is that the value is normalized to seconds, which is not ideal for large units such as year/years.

See example: screen shot 2016-03-29 at 15 55 03
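
To make the glitch concrete, here is a minimal sketch of normalizing time values to seconds. It is not the grobid-quantities code: the `SECONDS_PER_UNIT` table and `normalize_to_seconds` are hypothetical names, and the year is assumed to be a Julian year (365.25 days).

```python
# Hypothetical conversion factors to seconds (not taken from grobid-quantities).
# Assumption: one year is a Julian year of 365.25 days.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "year": 31557600,
}


def normalize_to_seconds(value, unit):
    """Return `value` expressed in seconds for a known time unit."""
    return value * SECONDS_PER_UNIT[unit]


print(normalize_to_seconds(2, "hour"))   # 7200
print(normalize_to_seconds(30, "year"))  # 946728000
```

"2 hours" becomes a readable 7200 s, while "30 years" becomes 946728000 s, which is why a single base unit is awkward for large units.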

lfoppiano, Mar 29 '16 13:03

Unusual prefix-inflection combinations now also work in English.

screen shot 2016-04-12 at 11 15 01
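
A rough sketch of what resolving a prefix-inflection combination can look like, assuming a plain lexicon lookup. The `UNITS` and `PREFIXES` tables and the `to_symbol` function are hypothetical and only illustrate the idea; the actual parser may work differently.

```python
# Hypothetical lexicon: unit name -> (symbol, accepted inflected forms).
UNITS = {
    "hour": ("h", {"hour", "hours"}),
    "metre": ("m", {"metre", "metres", "meter", "meters"}),
    "second": ("s", {"second", "seconds"}),
    "gram": ("g", {"gram", "grams"}),
}

# Hypothetical subset of SI prefixes written in full.
PREFIXES = {"kilo": "k", "milli": "m", "mega": "M"}


def to_symbol(raw_name):
    """Map a full unit name (optionally prefixed and inflected) to its symbol.

    Returns None when the name is not in the lexicon.
    """
    name = raw_name.lower()
    # Try the bare name first, then each known prefix.
    for prefix, prefix_symbol in [("", "")] + sorted(PREFIXES.items()):
        if not name.startswith(prefix):
            continue
        stem = name[len(prefix):]
        for symbol, forms in UNITS.values():
            if stem in forms:
                return prefix_symbol + symbol
    return None


print(to_symbol("hours"))         # h
print(to_symbol("kilometres"))    # km
print(to_symbol("Milliseconds"))  # ms
print(to_symbol("parsecs"))       # None
```

Listing the inflected forms explicitly keeps the lookup predictable, at the cost of a larger lexicon.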

lfoppiano, Apr 12 '16 09:04

TODO: we need to migrate the lexicon to a JSON-based input file and add support for French and German (see #14).

lfoppiano, Apr 12 '16 09:04

The lexicon has been migrated. We still need to add support for French and German (to be checked whether the notation changes between languages).
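
For illustration only, one way a JSON lexicon could carry per-language names is sketched below. The JSON layout, `build_index`, and `lookup` are assumptions made for this example and do not describe the actual grobid-quantities lexicon file.

```python
import json

# Hypothetical lexicon layout: one entry per unit, with names per language.
LEXICON_JSON = """
[
  {"symbol": "h",
   "names": {"en": ["hour", "hours"],
             "fr": ["heure", "heures"],
             "de": ["Stunde", "Stunden"]}},
  {"symbol": "s",
   "names": {"en": ["second", "seconds"],
             "fr": ["seconde", "secondes"],
             "de": ["Sekunde", "Sekunden"]}}
]
"""


def build_index(lexicon_text):
    """Build a (language, lowercased name) -> symbol lookup table."""
    index = {}
    for entry in json.loads(lexicon_text):
        for lang, names in entry["names"].items():
            for name in names:
                index[(lang, name.lower())] = entry["symbol"]
    return index


INDEX = build_index(LEXICON_JSON)


def lookup(name, lang):
    """Return the unit symbol for a full name in the given language, or None."""
    return INDEX.get((lang, name.lower()))


print(lookup("hours", "en"))    # h
print(lookup("heures", "fr"))   # h
print(lookup("Stunden", "de"))  # h
print(lookup("heures", "en"))   # None
```

Keying the index by language keeps names from one language from matching in another, while the symbol stays shared across languages.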

lfoppiano, Apr 14 '16 13:04