grobid-quantities

Numerical value as exponent on 10s

Open kermitt2 opened this issue 8 years ago • 7 comments

The quantity CRF model recognizes numerical expressions with exponents on 10 (in particular distorted ones due to PDF text extraction):

example_exponent

However, we are not currently parsing them (in their "noisy" form) into actual BigDecimal values.

kermitt2 (Mar 17 '16 22:03)

Regarding this subject, I've started implementing a CRF model to parse these expressions.

The idea is to label the parts of a value, for example 5 x 10-5, as:

<val>5</val>
<operation>x</operation> --> although this could be just <other>?
<base>10</base>
<pow>-5</pow>

Regarding the resulting value, we need to accept that it will not be precise, as we have to approximate (we are talking about small values).

I would say we should save both forms: the "structured" parsed value following the schema above, and the parsed value as a BigDecimal, to have an approximate value.

Does it make sense?
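To make the proposal concrete, here is a minimal Java sketch of keeping both forms side by side. The class name `ParsedValue`, its fields, and the `evaluate` helper are hypothetical names chosen for illustration, not existing grobid-quantities code, and the sketch assumes the labelled parts have already been cleaned into plain ASCII digits and signs.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Hypothetical holder for the two forms discussed above:
// the labelled parts (val, base, pow) and the computed BigDecimal.
final class ParsedValue {
    final String val;   // text labelled <val>, e.g. "5"
    final String base;  // text labelled <base>, e.g. "10"
    final String pow;   // text labelled <pow>, e.g. "-5"
    final BigDecimal value;

    ParsedValue(String val, String base, String pow) {
        this.val = val;
        this.base = base;
        this.pow = pow;
        this.value = evaluate(val, base, pow, MathContext.DECIMAL128);
    }

    // Computes val * base^pow. Assumes pow is an integer literal.
    static BigDecimal evaluate(String val, String base, String pow, MathContext mc) {
        BigDecimal mantissa = new BigDecimal(val);
        BigDecimal b = new BigDecimal(base);
        int exponent = Integer.parseInt(pow);
        if (b.compareTo(BigDecimal.TEN) == 0) {
            // Powers of ten only shift the scale, so the result is exact.
            return mantissa.scaleByPowerOfTen(exponent);
        }
        // Other bases may need rounding, controlled by the MathContext.
        return mantissa.multiply(b.pow(exponent, mc), mc);
    }
}
```

With this sketch, `new ParsedValue("5", "10", "-5").value` holds 0.00005 exactly, since a power of ten only changes the scale of the `BigDecimal`; approximation would only come into play for other bases or for rounding applied afterwards.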

lfoppiano (Sep 18 '18 07:09)

Hello!

The value parser should be generic enough to cover several cases described in #13.

The tagset has to be relevant for the different cases:

  • <val> is misleading (this is the value parser); <number>, for instance, would be much better
  • <operation> is useless, keep it unannotated
  • <alpha> for alphabetic expression of numbers
  • <exp> for the value of the exponential function (E/e kept non annotated)
  • <time> for date and time expressions, a time parser more sophisticated than the existing date model of grobid will be needed

kermitt2 (Sep 19 '18 08:09)

Regarding the approximation, the idea behind introducing BigDecimal was to offer the possibility to set the precision, scale and rounding, so that we can avoid the usual issue where 1 becomes 1.000000001.
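As an illustration of that point, the following sketch shows the standard `java.math` calls for controlling precision, scale and rounding. The class `PrecisionSketch` and its method names are made up for this example; only `BigDecimal`, `MathContext` and `RoundingMode` are real JDK types.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

// Hypothetical helper showing explicit control over precision and scale.
final class PrecisionSketch {

    // Building from the extracted text keeps the decimal digits exactly,
    // unlike going through a binary double first.
    static BigDecimal fromText(String text) {
        return new BigDecimal(text);
    }

    // Rounds to a given number of significant digits.
    static BigDecimal toPrecision(BigDecimal v, int significantDigits) {
        return v.round(new MathContext(significantDigits, RoundingMode.HALF_EVEN));
    }

    // Rounds to a given number of digits after the decimal point.
    static BigDecimal toScale(BigDecimal v, int decimals) {
        return v.setScale(decimals, RoundingMode.HALF_EVEN);
    }
}
```

For instance, `fromText("0.1").add(fromText("0.2"))` equals `fromText("0.3")`, whereas `0.1 + 0.2 == 0.3` is false with `double`. A value that picked up noise, such as 1.000000001, can be brought back with `toPrecision(v, 3)`, which compares equal to 1.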

kermitt2 (Sep 19 '18 08:09)

It has been pointed out to me that 2.3E5 corresponds to E-notation (https://en.wikipedia.org/wiki/Scientific_notation#E-notation), meaning that it's a special case of a power of 10.

Was this the initial thought you had, @kermitt2?

What do you mean with exponential function? I think I have misunderstood that part...

lfoppiano (Jul 09 '19 05:07)

I would hypothesize that this notation is not relevant to scientific papers, where the e is indeed always the exponential function. Why would someone use the calculator/old program notation in a typeset scientific paper?

"<exp> for the value of the exponential function (E/e kept non annotated)" -> in this context we just annotate the value of the exponent, not the exponential function e itself, e.g. 10e5 -> <number>10</number>e<exp>5</exp> (note that we could maybe find a better tag name than <number>)
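For illustration, a small Java sketch of reading the labelled segments back out of an inline-tagged string like the one above. The inline XML-like form is only how examples are written in this thread; the actual output format of the CRF value parser is not specified here, and `TaggedValueReader` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for inline-tagged values such as
// <number>10</number>e<exp>5</exp>.
final class TaggedValueReader {

    // Matches <label>text</label>; unannotated text (like "e") is skipped.
    private static final Pattern SEGMENT = Pattern.compile("<(\\w+)>(.*?)</\\1>");

    // Returns label -> text in order of appearance.
    // A repeated label overwrites the earlier entry.
    static Map<String, String> readSegments(String tagged) {
        Map<String, String> segments = new LinkedHashMap<>();
        Matcher m = SEGMENT.matcher(tagged);
        while (m.find()) {
            segments.put(m.group(1), m.group(2));
        }
        return segments;
    }
}
```

Calling `readSegments("<number>10</number>e<exp>5</exp>")` yields `number` mapped to `10` and `exp` mapped to `5`, leaving the unannotated `e` out. How these two parts are then combined into a numeric value is deliberately left open in this sketch.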

kermitt2 (Jul 09 '19 05:07)

So this is the exponential function, right? But then what about other functions such as log? Since we don't have much data, maybe we could just support the exp "in the future"?

lfoppiano (Jul 09 '19 05:07)

I think this is the exponential function, yes.

I think we don't express a value with a log (we don't write 3log(5)); if we have a log, it's normally in an equation with variables, not as a value.

Hmm, I don't see why we would wait for the future. The exponent e is quite frequent as part of a value, and it is not complicated.

kermitt2 (Jul 09 '19 05:07)