grobid-quantities
valueMost followed by a valueLeast wrongly aggregated to the same measurement
The text: Before the 1920s the number of stages was usually 15 at most and the riders enjoyed at least one day of rest after each stage.
is labeled (correctly) as:
- "1920s" is the valueMost
- "15" (from "15 at most") is also valueMost
- "one" (from "one day") is valueLeast
and this is the labeled result:
Before before B Be Bef Befo e re ore fore INITCAP NODIGIT 0 NOPUNCT Xxxx Xx 0 0 <other>
the the t th the the e he the the NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
1920 1920 1 19 192 1920 0 20 920 1920 NOCAPS ALLDIGIT 0 NOPUNCT dddd d 0 0 I-<valueMost>
s s s s s s s s s s NOCAPS NODIGIT 1 NOPUNCT x x 1 0 <valueMost>
the the t th the the e he the the NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
number number n nu num numb r er ber mber NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
of of o of of of f of of of NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 <other>
stages stages s st sta stag s es ges ages NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
was was w wa was was s as was was NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
usually usually u us usu usua y ly lly ally NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
15 15 1 15 15 15 5 15 15 15 NOCAPS ALLDIGIT 0 NOPUNCT dd d 0 0 I-<valueMost>
at at a at at at t at at at NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 <other>
most most m mo mos most t st ost most NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
and and a an and and d nd and and NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
the the t th the the e he the the NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
riders riders r ri rid ride s rs ers ders NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
enjoyed enjoyed e en enj enjo d ed yed oyed NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
at at a at at at t at at at NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 <other>
least least l le lea leas t st ast east NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
one one o on one one e ne one one NOCAPS NODIGIT 0 NOPUNCT xxx x 0 1 I-<valueLeast>
day day d da day day y ay day day NOCAPS NODIGIT 0 NOPUNCT xxx x 1 0 I-<unitLeft>
of of o of of of f of of of NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 <other>
rest rest r re res rest t st est rest NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
after after a af aft afte r er ter fter NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
each each e ea eac each h ch ach each NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
stage stage s st sta stag e ge age tage NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
. . . . . . . . . . ALLCAPS NODIGIT 1 DOT . . 0 0 <other>
However, the measurements are not correctly reconstructed, as "15" and "one" are aggregated into the same measurement...
I have no idea how we can treat this case, to be honest...
The way interval measurements are built when one most and one least value follow each other is currently very basic: we just attach both to the same measurement. One way to tackle this would be to introduce a "barrier" to indicate that we move to another clause. Here, for instance, the fact that we have an "and" between two VP clauses could be exploited as a syntactic barrier, forcing two distinct measurements.
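As a rough illustration of the barrier idea, here is a minimal Python sketch. It is not the grobid-quantities implementation: the `Measurement` class, the `build_measurements` function, and the `BARRIERS` set are hypothetical names, and the input is assumed to be a simplified list of (text, label) pairs with one entry per labeled value.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Assumption: tokens that close the current measurement when encountered.
BARRIERS: Set[str] = {"and"}


@dataclass
class Measurement:
    least: Optional[str] = None
    most: Optional[str] = None


def build_measurements(
    tagged: List[Tuple[str, str]], barriers: Set[str] = BARRIERS
) -> List[Measurement]:
    """Group valueLeast/valueMost items into measurements, closing on barriers."""
    measurements: List[Measurement] = []
    current: Optional[Measurement] = None
    for token, label in tagged:
        if label == "other":
            # A barrier token closes the measurement being built, if any.
            if token.lower() in barriers and current is not None:
                measurements.append(current)
                current = None
            continue
        if label not in ("valueLeast", "valueMost"):
            continue  # units and other labels are ignored in this sketch
        slot = "least" if label == "valueLeast" else "most"
        # A second value of the same kind cannot belong to the same interval.
        if current is not None and getattr(current, slot) is not None:
            measurements.append(current)
            current = None
        if current is None:
            current = Measurement()
        setattr(current, slot, token)
    if current is not None:
        measurements.append(current)
    return measurements


sentence = [
    ("1920s", "valueMost"), ("the", "other"), ("number", "other"),
    ("15", "valueMost"), ("at", "other"), ("most", "other"),
    ("and", "other"), ("the", "other"), ("riders", "other"),
    ("at", "other"), ("least", "other"), ("one", "valueLeast"),
]
with_barrier = build_measurements(sentence)
without_barrier = build_measurements(sentence, barriers=set())
```

With an empty barrier set the sketch reproduces the reported behaviour ("15" and "one" end up in one measurement); with "and" as a barrier they are kept apart.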
In DeLFT I am thinking about having a sentence tokenizer and a predicate/clause tokenizer within the sentence, without going through complete sentence parsing, which would be very expensive.
@kermitt2 thanks! I've been thinking about one way or another to do it. The easiest component to add is indeed the sentence tokeniser, which would avoid fairly big mistakes.
However, I'm not sure how we can define the barrier. In this case "and" would work, but normal intervals also have "and" in the middle (e.g. "the temperature was between 10 and 11 celsius").
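One possible way to separate the two uses of "and" is sketched below. It is only a heuristic assumed for illustration, not something the thread settles on: an "and" directly flanked by two labeled values is taken to be part of an interval, and any other "and" is treated as a barrier. The function name `is_barrier` and the (text, label) input format are hypothetical.

```python
from typing import List, Tuple

VALUE_LABELS = {"valueLeast", "valueMost"}


def is_barrier(tagged: List[Tuple[str, str]], i: int) -> bool:
    """Return True if the token at index i is an 'and' that separates clauses.

    Assumed heuristic: an 'and' sitting directly between two labeled values
    ("between 10 and 11") belongs to an interval and is not a barrier.
    """
    token, label = tagged[i]
    if label != "other" or token.lower() != "and":
        return False
    prev_is_value = i > 0 and tagged[i - 1][1] in VALUE_LABELS
    next_is_value = i + 1 < len(tagged) and tagged[i + 1][1] in VALUE_LABELS
    return not (prev_is_value and next_is_value)


interval = [
    ("between", "other"), ("10", "valueLeast"), ("and", "other"),
    ("11", "valueMost"), ("celsius", "unitLeft"),
]
clauses = [
    ("15", "valueMost"), ("at", "other"), ("most", "other"),
    ("and", "other"), ("the", "other"), ("riders", "other"),
    ("at", "other"), ("least", "other"), ("one", "valueLeast"),
]
```

This check is deliberately narrow: an interval with a unit between the value and the "and" (for example "10 K and 11 K") would be treated as a barrier, so it would need refinement before real use.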
I've searched for a ready-made predicate/clause tokenizer and not much is around. Quick tests using a complete dependency parser weren't successful (https://lindat.mff.cuni.cz/services/udpipe/ for example).
The first task is then to plug in the sentence tokenizer.
Here is another use case:
. . . . . . . . . . ALLCAPS NODIGIT 1 DOT . . 0 0 <other>
High high H Hi Hig High h gh igh High INITCAP NODIGIT 0 NOPUNCT Xxxx Xx 0 0 <other>
T t T T T T T T T T ALLCAPS NODIGIT 1 NOPUNCT X X 1 0 <other>
c c c c c c c c c c NOCAPS NODIGIT 1 NOPUNCT x x 1 0 <other>
( ( ( ( ( ( ( ( ( ( ALLCAPS NODIGIT 1 OPENBRACKET ( ( 0 0 <other>
up up u up up up p up up up NOCAPS NODIGIT 0 NOPUNCT xx x 0 0 <other>
to to t to to to o to to to NOCAPS NODIGIT 0 NOPUNCT xx x 0 0 <other>
15 15 1 15 15 15 5 15 15 15 NOCAPS ALLDIGIT 0 NOPUNCT dd d 0 0 I-<valueMost>
K k K K K K K K K K ALLCAPS NODIGIT 1 NOPUNCT X X 1 0 I-<unitLeft>
for for f fo for for r or for for NOCAPS NODIGIT 0 NOPUNCT xxx x 0 0 <other>
a a a a a a a a a a NOCAPS NODIGIT 1 NOPUNCT x x 1 0 <other>
Re re R Re Re Re e Re Re Re INITCAP NODIGIT 0 NOPUNCT Xx Xx 0 0 <other>
content content c co con cont t nt ent tent NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
ranging ranging r ra ran rang g ng ing ging NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
from from f fr fro from m om rom from NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
25 25 2 25 25 25 5 25 25 25 NOCAPS ALLDIGIT 0 NOPUNCT dd d 0 0 I-<valueLeast>
to to t to to to o to to to NOCAPS NODIGIT 0 NOPUNCT xx x 0 0 <other>
62 62 6 62 62 62 2 62 62 62 NOCAPS ALLDIGIT 0 NOPUNCT dd d 0 0 I-<valueMost>
at at a at at at t at at at NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 I-<unitLeft>
% % % % % % % % % % ALLCAPS NODIGIT 1 NOPUNCT % % 0 0 <unitLeft>
) ) ) ) ) ) ) ) ) ) ALLCAPS NODIGIT 1 ENDBRACKET ) ) 0 0 <other>
has has h ha has has s as has has NOCAPS NODIGIT 0 NOPUNCT xxx x 1 0 <other>
been been b be bee been n en een been NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
reported reported r re rep repo d ed ted rted NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
in in i in in in n in in in NOCAPS NODIGIT 0 NOPUNCT xx x 1 0 <other>
literature literature l li lit lite e re ure ture NOCAPS NODIGIT 0 NOPUNCT xxxx x 0 0 <other>
Here is another example, from a paper with the following DOI: 10.1002_adma.201202328
There is a "lower than 1 T" and, in a separate sentence, a "larger than 1 T", which are merged together.
Figure 8 shows the initial‐magnetization curves of these composite thin films. The two‐step magnetization process can be clearly observed in the film with N = 0; such a step behavior gradually diminishes with larger N due to the presence of a soft‐magnetic phase and degraded microstructures as seen in Figure 6. By correlating the Lorentz images with the intial‐magnetization curve of the thin film with N = 14, it is clear that, when the extermal field is lower than 1 T, the high slope of the initial‐magnetization curve is due to the presence of the movable domain walls as observed in Figure 7. There is a resemblance of a step‐behavior on the initial‐magnetization curve when the external field is larger than 1 T in the thin film with N = 14, consistent with the presence of the pinned domain walls in Figure 7. The depinning field can be determined from the initial‐magnetization curves, and the coercivity dependence of the stack number is shown in the inset of Figure 8. Both the coercivity and the depinning field decrease with an increase in the stack period N. In the thin film with N = 0, the depinning field is lower than the coercivity. However, the depinning field becomes higher than the coercivity in other nanocomposite thin films. Such a change in the initial‐magnetization curves associated with the gradually diminished step‐behavior on the initial‐magnetization curves for the higher fraction of the soft phase indicates that the high coercivity of the DP‐Nd‐Fe‐B film (N = 0) decreases as the pinning force at the Nd‐rich grain bounary decreases by increasing the fraction of the soft‐magnetic phases.
In this case, closing the quantities at the end of the sentence could solve the problem.
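A minimal sketch of that idea in Python, assuming the labeled output is available as (text, label) pairs and that a token such as "." labeled as other marks a sentence end. The names `split_sentences` and `values_per_sentence` are hypothetical, and a real sentence tokenizer would be needed to handle abbreviations and similar cases.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]

# Assumption: a naive sentence boundary, standing in for a real tokenizer.
SENTENCE_END = {".", "!", "?"}
VALUE_LABELS = {"valueLeast", "valueMost"}


def split_sentences(tagged: Tagged) -> List[Tagged]:
    """Cut the labeled token stream after each sentence-final token."""
    sentences: List[Tagged] = []
    current: Tagged = []
    for token, label in tagged:
        current.append((token, label))
        if label == "other" and token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def values_per_sentence(tagged: Tagged) -> List[List[Tuple[str, str]]]:
    """Collect labeled values sentence by sentence, so none cross a boundary."""
    return [
        [(tok, lab) for tok, lab in sentence if lab in VALUE_LABELS]
        for sentence in split_sentences(tagged)
    ]


text = [
    ("lower", "other"), ("than", "other"), ("1", "valueMost"),
    ("T", "unitLeft"), (".", "other"),
    ("larger", "other"), ("than", "other"), ("1", "valueLeast"),
    ("T", "unitLeft"), (".", "other"),
]
grouped = values_per_sentence(text)
```

Each inner list can then be turned into measurements independently, so the most value of the first sentence is never paired with the least value of the second.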