grobid-quantities valueMost followed by a valueLeast wrongly aggregated to the same measurement

The text: Before the 1920s the number of stages was usually 15 at most and the riders enjoyed at least one day of rest after each stage.

is labeled (correctly) as

1920s is a valueMost
15 (from "15 at most") is also a valueMost
one (from "one day") is a valueLeast, with day as its unit

and this is the labeled result:

Before	before	B	Be	Bef	Befo	e	re	ore	fore	INITCAP	NODIGIT	0	NOPUNCT	Xxxx	Xx	0	0	<other>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
1920	1920	1	19	192	1920	0	20	920	1920	NOCAPS	ALLDIGIT	0	NOPUNCT	dddd	d	0	0	I-<valueMost>
s	s	s	s	s	s	s	s	s	s	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<valueMost>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
number	number	n	nu	num	numb	r	er	ber	mber	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
of	of	o	of	of	of	f	of	of	of	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
stages	stages	s	st	sta	stag	s	es	ges	ages	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
was	was	w	wa	was	was	s	as	was	was	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
usually	usually	u	us	usu	usua	y	ly	lly	ally	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
15	15	1	15	15	15	5	15	15	15	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
most	most	m	mo	mos	most	t	st	ost	most	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
and	and	a	an	and	and	d	nd	and	and	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
riders	riders	r	ri	rid	ride	s	rs	ers	ders	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
enjoyed	enjoyed	e	en	enj	enjo	d	ed	yed	oyed	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
least	least	l	le	lea	leas	t	st	ast	east	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
one	one	o	on	one	one	e	ne	one	one	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	1	I-<valueLeast>
day	day	d	da	day	day	y	ay	day	day	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	1	0	I-<unitLeft>
of	of	o	of	of	of	f	of	of	of	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
rest	rest	r	re	res	rest	t	st	est	rest	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
after	after	a	af	aft	afte	r	er	ter	fter	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
each	each	e	ea	eac	each	h	ch	ach	each	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
stage	stage	s	st	sta	stag	e	ge	age	tage	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
.	.	.	.	.	.	.	.	.	.	ALLCAPS	NODIGIT	1	DOT	.	.	0	0	<other>

However, the measurements are not reconstructed correctly: 15 and one are aggregated into the same measurement.

To be honest, I have no idea how we can treat this case...

Mar 26 '19 05:03 lfoppiano

The way interval measurements are built when a most value and a least value follow each other is currently very basic: we just attach both to the same measurement. One way to tackle this would be to introduce a "barrier" to indicate that we have moved to another clause. Here, for instance, the fact that we have an "and" between two VP clauses could be exploited as a syntactic barrier, forcing two distinct measurements.

In DeLFT I am thinking about having a sentence tokenizer and a predicate/clause tokenizer within the sentence, without going through complete sentence parsing, which would be very expensive.
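
A minimal sketch of the barrier idea, in Python rather than the project's own code. Everything here is an assumption made for illustration: the `Measurement` class, the `build_measurements` function, the simplified labels (one token per value, without the `I-` prefix) and the choice of "and" as the only barrier word. Without barriers it reproduces the aggregation described above; with a barrier, the open measurement is closed as soon as the barrier token is seen.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Token = Tuple[str, str]  # (text, label), labels simplified for this sketch
VALUE_LABELS = ("valueLeast", "valueMost")


@dataclass
class Measurement:
    """Hypothetical container: an interval with an optional least and most value."""
    least: Optional[str] = None
    most: Optional[str] = None


def build_measurements(tokens: List[Token],
                       barriers: FrozenSet[str] = frozenset()) -> List[Measurement]:
    """Attach consecutive least/most values to one measurement unless a barrier intervenes."""
    measurements: List[Measurement] = []
    current: Optional[Measurement] = None
    for text, label in tokens:
        if label not in VALUE_LABELS:
            # A barrier word closes whatever measurement is currently open.
            if current is not None and text.lower() in barriers:
                measurements.append(current)
                current = None
            continue
        slot = "least" if label == "valueLeast" else "most"
        # Start a new measurement if none is open or the slot is already taken.
        if current is None or getattr(current, slot) is not None:
            if current is not None:
                measurements.append(current)
            current = Measurement()
        setattr(current, slot, text)
    if current is not None:
        measurements.append(current)
    return measurements


# The sentence from this issue, with "1920s" kept as a single token for brevity.
words = ("Before the 1920s the number of stages was usually 15 at most and "
         "the riders enjoyed at least one day of rest after each stage .").split()
labels = {"1920s": "valueMost", "15": "valueMost", "one": "valueLeast", "day": "unitLeft"}
tokens = [(w, labels.get(w, "other")) for w in words]

merged = build_measurements(tokens)                      # 15 and one end up together
separated = build_measurements(tokens, frozenset({"and"}))  # "and" forces a split
```

In this toy version the barrier is a plain word list; a clause tokenizer would replace the `text.lower() in barriers` test with a real clause boundary check.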

Mar 26 '19 15:03 kermitt2

@kermitt2 thanks! I've been thinking about one way or another to do it. The easiest component to add is indeed the sentence tokeniser, which would avoid fairly big mistakes.

However, I'm not sure how we can define the barrier. In this case "and" would work, but normal intervals also have "and" in the middle (e.g. the temperature was between 10 and 11 celsius).

I've searched for a ready-made predicate/clause tokenizer and there is not much around. Quick tests using a complete dependency parser (https://lindat.mff.cuni.cz/services/udpipe/ for example) weren't successful.
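
To make the conflict concrete, here is a small Python sketch of one possible way to tell the two uses of "and" apart. It is only an untested assumption, not something implemented in grobid-quantities: "and" counts as a barrier only when other plain words also sit between the two values. The function names and the threshold of one word are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, label), labels simplified for this sketch


def gap_between_values(tokens: List[Token], first: int, second: int) -> List[str]:
    """Lower-cased <other> words strictly between two value token indices."""
    return [text.lower() for text, label in tokens[first + 1:second] if label == "other"]


def and_is_barrier(gap_words: List[str], max_interval_gap: int = 1) -> bool:
    """Treat "and" as a barrier only if the gap holds more than the conjunction itself."""
    return "and" in gap_words and len(gap_words) > max_interval_gap


# Regular interval: the only word between the two values is "and".
interval = [("between", "other"), ("10", "valueLeast"), ("and", "other"),
            ("11", "valueMost"), ("celsius", "unitLeft")]

# The case from this issue: many words besides "and" separate 15 and one.
stages = [("15", "valueMost"), ("at", "other"), ("most", "other"), ("and", "other"),
          ("the", "other"), ("riders", "other"), ("enjoyed", "other"),
          ("at", "other"), ("least", "other"), ("one", "valueLeast")]

interval_split = and_is_barrier(gap_between_values(interval, 1, 3))
stages_split = and_is_barrier(gap_between_values(stages, 0, 9))
```

The heuristic is fragile: an interval such as "between 10 and about 11" already has two words in the gap and would be split wrongly, so this only illustrates the problem rather than solving it.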

Mar 29 '19 04:03 lfoppiano

The first task is then to plug in the sentence tokenizer.

Apr 01 '19 08:04 lfoppiano

Here is another use case:

.	.	.	.	.	.	.	.	.	.	ALLCAPS	NODIGIT	1	DOT	.	.	0	0	<other>
High	high	H	Hi	Hig	High	h	gh	igh	High	INITCAP	NODIGIT	0	NOPUNCT	Xxxx	Xx	0	0	<other>
T	t	T	T	T	T	T	T	T	T	ALLCAPS	NODIGIT	1	NOPUNCT	X	X	1	0	<other>
c	c	c	c	c	c	c	c	c	c	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<other>
(	(	(	(	(	(	(	(	(	(	ALLCAPS	NODIGIT	1	OPENBRACKET	(	(	0	0	<other>
up	up	u	up	up	up	p	up	up	up	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
to	to	t	to	to	to	o	to	to	to	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
15	15	1	15	15	15	5	15	15	15	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
K	k	K	K	K	K	K	K	K	K	ALLCAPS	NODIGIT	1	NOPUNCT	X	X	1	0	I-<unitLeft>
for	for	f	fo	for	for	r	or	for	for	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
a	a	a	a	a	a	a	a	a	a	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<other>
Re	re	R	Re	Re	Re	e	Re	Re	Re	INITCAP	NODIGIT	0	NOPUNCT	Xx	Xx	0	0	<other>
content	content	c	co	con	cont	t	nt	ent	tent	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
ranging	ranging	r	ra	ran	rang	g	ng	ing	ging	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
from	from	f	fr	fro	from	m	om	rom	from	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
25	25	2	25	25	25	5	25	25	25	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueLeast>
to	to	t	to	to	to	o	to	to	to	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
62	62	6	62	62	62	2	62	62	62	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	I-<unitLeft>
%	%	%	%	%	%	%	%	%	%	ALLCAPS	NODIGIT	1	NOPUNCT	%	%	0	0	<unitLeft>
)	)	)	)	)	)	)	)	)	)	ALLCAPS	NODIGIT	1	ENDBRACKET	)	)	0	0	<other>
has	has	h	ha	has	has	s	as	has	has	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	1	0	<other>
been	been	b	be	bee	been	n	en	een	been	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
reported	reported	r	re	rep	repo	d	ed	ted	rted	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
in	in	i	in	in	in	n	in	in	in	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
literature	literature	l	li	lit	lite	e	re	ure	ture	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>

Aug 22 '19 06:08 lfoppiano

Here is another example, from a paper with the following DOI: 10.1002_adma.201202328

There is a "lower than 1 T" and, in a separate sentence, a "larger than 1 T", which are merged together.

Figure 8 shows the initial‐magnetization curves of these composite thin films. The two‐step magnetization process can be clearly observed in the film with N = 0; such a step behavior gradually diminishes with larger N due to the presence of a soft‐magnetic phase and degraded microstructures as seen in Figure 6. By correlating the Lorentz images with the intial‐magnetization curve of the thin film with N = 14, it is clear that, when the extermal field is lower than 1 T, the high slope of the initial‐magnetization curve is due to the presence of the movable domain walls as observed in Figure 7. There is a resemblance of a step‐behavior on the initial‐magnetization curve when the external field is larger than 1 T in the thin film with N = 14, consistent with the presence of the pinned domain walls in Figure 7. The depinning field can be determined from the initial‐magnetization curves, and the coercivity dependence of the stack number is shown in the inset of Figure 8. Both the coercivity and the depinning field decrease with an increase in the stack period N. In the thin film with N = 0, the depinning field is lower than the coercivity. However, the depinning field becomes higher than the coercivity in other nanocomposite thin films. Such a change in the initial‐magnetization curves associated with the gradually diminished step‐behavior on the initial‐magnetization curves for the higher fraction of the soft phase indicates that the high coercivity of the DP‐Nd‐Fe‐B film (N = 0) decreases as the pinning force at the Nd‐rich grain bounary decreases by increasing the fraction of the soft‐magnetic phases.

In this case, closing the quantities at the end of the sentence could solve the problem.
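
A rough Python sketch of what closing at the sentence end could look like. It assumes a naive splitter that ends a sentence on a ".", "!" or "?" token, standing in for a real sentence tokenizer, and simplified labels. The function names are hypothetical and the token list is a shortened version of the two sentences above.

```python
from typing import Dict, Iterator, List, Tuple

Token = Tuple[str, str]  # (text, label), labels simplified for this sketch


def split_on_sentence_end(tokens: List[Token]) -> Iterator[List[Token]]:
    """Naive stand-in for a sentence tokenizer: cut after ".", "!" or "?"."""
    sentence: List[Token] = []
    for token in tokens:
        sentence.append(token)
        if token[0] in {".", "!", "?"} and token[1] == "other":
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def pair_values(tokens: List[Token]) -> List[Dict[str, str]]:
    """Basic aggregation: a least and a most value that follow each other share a measurement."""
    measurements: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for text, label in tokens:
        if label not in ("valueLeast", "valueMost"):
            continue
        if label in current:  # slot already taken, so open a new measurement
            measurements.append(current)
            current = {}
        current[label] = text
    if current:
        measurements.append(current)
    return measurements


def pair_values_per_sentence(tokens: List[Token]) -> List[Dict[str, str]]:
    """Same aggregation, but no measurement may span a sentence boundary."""
    result: List[Dict[str, str]] = []
    for sentence in split_on_sentence_end(tokens):
        result.extend(pair_values(sentence))
    return result


tokens = [("lower", "other"), ("than", "other"), ("1", "valueMost"), ("T", "unitLeft"),
          (".", "other"),
          ("larger", "other"), ("than", "other"), ("1", "valueLeast"), ("T", "unitLeft"),
          (".", "other")]

across_sentences = pair_values(tokens)            # both values merged into one interval
within_sentences = pair_values_per_sentence(tokens)  # one measurement per sentence
```

A splitter this simple would break on abbreviations, so it is only meant to show where the measurement gets closed.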

Feb 07 '23 06:02 lfoppiano
