grobid-quantities

valueMost followed by a valueLeast wrongly aggregated to the same measurement

Open lfoppiano opened this issue 5 years ago • 4 comments

The text: Before the 1920s the number of stages was usually 15 at most and the riders enjoyed at least one day of rest after each stage.

is labeled (correctly) as

  • 1920s is the valueMost
  • 15 (from "15 at most") is also valueMost
  • one day is valueLeast

and this is the labeled result:

Before	before	B	Be	Bef	Befo	e	re	ore	fore	INITCAP	NODIGIT	0	NOPUNCT	Xxxx	Xx	0	0	<other>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
1920	1920	1	19	192	1920	0	20	920	1920	NOCAPS	ALLDIGIT	0	NOPUNCT	dddd	d	0	0	I-<valueMost>
s	s	s	s	s	s	s	s	s	s	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<valueMost>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
number	number	n	nu	num	numb	r	er	ber	mber	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
of	of	o	of	of	of	f	of	of	of	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
stages	stages	s	st	sta	stag	s	es	ges	ages	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
was	was	w	wa	was	was	s	as	was	was	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
usually	usually	u	us	usu	usua	y	ly	lly	ally	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
15	15	1	15	15	15	5	15	15	15	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
most	most	m	mo	mos	most	t	st	ost	most	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
and	and	a	an	and	and	d	nd	and	and	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
the	the	t	th	the	the	e	he	the	the	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
riders	riders	r	ri	rid	ride	s	rs	ers	ders	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
enjoyed	enjoyed	e	en	enj	enjo	d	ed	yed	oyed	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
least	least	l	le	lea	leas	t	st	ast	east	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
one	one	o	on	one	one	e	ne	one	one	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	1	I-<valueLeast>
day	day	d	da	day	day	y	ay	day	day	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	1	0	I-<unitLeft>
of	of	o	of	of	of	f	of	of	of	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
rest	rest	r	re	res	rest	t	st	est	rest	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
after	after	a	af	aft	afte	r	er	ter	fter	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
each	each	e	ea	eac	each	h	ch	ach	each	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
stage	stage	s	st	sta	stag	e	ge	age	tage	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
.	.	.	.	.	.	.	.	.	.	ALLCAPS	NODIGIT	1	DOT	.	.	0	0	<other>
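
To make the labels in a dump like the one above easier to read, the rows can be reduced to labeled spans. The following is only a small sketch: it assumes the first whitespace-separated column is the token, the last column is the label, and an `I-` prefix opens a new span while the bare label continues it (as with `1920` / `s` above). The function names `read_labels` and `labeled_spans` are made up for this illustration and are not part of grobid-quantities.

```python
from typing import List, Tuple


def read_labels(rows: str) -> List[Tuple[str, str]]:
    """Return (token, label) pairs from feature rows.

    Assumption: first column = token, last column = label.
    All feature columns in between are ignored.
    """
    pairs = []
    for line in rows.splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue  # skip blank or malformed lines
        pairs.append((cols[0], cols[-1]))
    return pairs


def labeled_spans(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive rows into (label, text) spans, dropping <other>."""
    spans = []
    open_label = None  # label of the span the previous row belonged to
    for token, label in pairs:
        if label == "<other>":
            open_label = None
            continue
        name = label[2:] if label.startswith("I-") else label
        if label.startswith("I-") or name != open_label:
            spans.append((name, [token]))  # start a new span
        else:
            spans[-1][1].append(token)  # continue the open span
        open_label = name
    # Tokens are joined without spaces, which suits "1920" + "s" but would
    # need adjusting for multi-word spans.
    return [(name, "".join(tokens)) for name, tokens in spans]
```

Applied to the rows above, this should yield the three values (1920s, 15, one) and the unit (day) in reading order, which is the input the measurement reconstruction has to work with.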

However, the measurements are not reconstructed correctly: 15 and one are aggregated into the same measurement...

To be honest, I have no idea how we can treat this case...

Mar 26 '19 05:03 lfoppiano

The way interval measurements are built when a most value and a least value follow one another is currently very basic: we just attach both to the same measurement. One way to tackle this would be to introduce a "barrier" indicating that we have moved to another clause. Here, for instance, the fact that we have an "and" between two VP clauses could be exploited as a syntactic barrier, forcing two distinct measurements.
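
As a rough illustration of the barrier idea, here is a minimal sketch. It is a simplified model and not the actual grobid-quantities code: the `Interval` class, the `aggregate` function and the plain `valueLeast` / `valueMost` / `other` labels are all assumptions made for the example, and units are ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass
class Interval:
    least: Optional[str] = None
    most: Optional[str] = None


def aggregate(tokens: List[Tuple[str, str]],
              barriers: FrozenSet[str] = frozenset()) -> List[Interval]:
    """Group least/most values into intervals.

    A value joins the open interval if its slot is still free. A token
    listed in `barriers` closes the open interval, so the next value
    starts a new one.
    """
    intervals: List[Interval] = []
    current: Optional[Interval] = None
    for text, label in tokens:
        if label == "other":
            if text.lower() in barriers:
                current = None  # barrier: close the open interval
            continue
        slot = "least" if label == "valueLeast" else "most"
        if current is None or getattr(current, slot) is not None:
            current = Interval()
            intervals.append(current)
        setattr(current, slot, text)
    return intervals


WORDS = ("Before the 1920s the number of stages was usually 15 at most "
         "and the riders enjoyed at least one day of rest after each stage .").split()
LABELS = {"1920s": "valueMost", "15": "valueMost", "one": "valueLeast"}
SENTENCE = [(w, LABELS.get(w, "other")) for w in WORDS]

naive = aggregate(SENTENCE)                                   # no barriers
with_barrier = aggregate(SENTENCE, barriers=frozenset({"and"}))
```

Without barriers, 15 and one end up in the same interval, which is the behaviour reported in this issue. With "and" as a barrier they come out as two separate measurements.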

In DeLFT I am thinking about having a sentence tokenizer and a predicate/clause tokenizer within the sentence, without going through complete sentence parsing, which would be very expensive.

Mar 26 '19 15:03 kermitt2

@kermitt2 thanks! I've been thinking about one way or another to do it. The easiest component to add is indeed the sentence tokeniser, which would avoid fairly big mistakes.

However, I'm not sure how we can define the barrier. In this case "and" would work, but ordinary intervals also have "and" in the middle (e.g. the temperature was between 10 and 11 celsius).
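
One possible way to separate the two uses of "and", purely as a hypothetical heuristic and not something proposed by the maintainers: treat "and" as a barrier only when it is not directly flanked by quantity tokens (values or units) on both sides. The sketch below assumes simplified labels (`valueLeast`, `valueMost`, `unit`, `other`) and assumes that "10" and "11" in the interval example are labeled as least and most. All names are made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (text, label)


def is_clause_and(tokens: List[Token], i: int) -> bool:
    """True if tokens[i] is an "and" that does not sit inside an interval."""
    text, label = tokens[i]
    if label != "other" or text.lower() != "and":
        return False
    before = tokens[i - 1][1] if i > 0 else "other"
    after = tokens[i + 1][1] if i + 1 < len(tokens) else "other"
    inside_interval = before != "other" and after != "other"
    return not inside_interval


def aggregate(tokens: List[Token]) -> List[Dict[str, str]]:
    """Group values into intervals, splitting at clause-level "and"."""
    intervals: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for i, (text, label) in enumerate(tokens):
        if is_clause_and(tokens, i):
            current = None  # barrier: the next value opens a new interval
        elif label in ("valueLeast", "valueMost"):
            if current is None or label in current:
                current = {}
                intervals.append(current)
            current[label] = text
    return intervals


def tag(sentence: str, labels: Dict[str, str]) -> List[Token]:
    """Attach labels to whitespace-separated words (example helper)."""
    return [(w, labels.get(w, "other")) for w in sentence.split()]


temperature = tag("the temperature was between 10 and 11 celsius",
                  {"10": "valueLeast", "11": "valueMost", "celsius": "unit"})
stages = tag("the number of stages was usually 15 at most and the riders "
             "enjoyed at least one day of rest",
             {"15": "valueMost", "one": "valueLeast", "day": "unit"})
```

Under these assumptions the temperature example stays a single interval while the stages example is split in two. The heuristic is deliberately shallow and would still fail wherever a clause-level "and" happens to sit between two quantity tokens.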

I've searched for a ready-made predicate/clause tokenizer and not much is around. Quick tests using a complete dependency parser weren't successful (https://lindat.mff.cuni.cz/services/udpipe/ for example).

Mar 29 '19 04:03 lfoppiano

The first task is then to plug in the sentence tokenizer.

Apr 01 '19 08:04 lfoppiano

Here is another use case:

.	.	.	.	.	.	.	.	.	.	ALLCAPS	NODIGIT	1	DOT	.	.	0	0	<other>
High	high	H	Hi	Hig	High	h	gh	igh	High	INITCAP	NODIGIT	0	NOPUNCT	Xxxx	Xx	0	0	<other>
T	t	T	T	T	T	T	T	T	T	ALLCAPS	NODIGIT	1	NOPUNCT	X	X	1	0	<other>
c	c	c	c	c	c	c	c	c	c	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<other>
(	(	(	(	(	(	(	(	(	(	ALLCAPS	NODIGIT	1	OPENBRACKET	(	(	0	0	<other>
up	up	u	up	up	up	p	up	up	up	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
to	to	t	to	to	to	o	to	to	to	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
15	15	1	15	15	15	5	15	15	15	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
K	k	K	K	K	K	K	K	K	K	ALLCAPS	NODIGIT	1	NOPUNCT	X	X	1	0	I-<unitLeft>
for	for	f	fo	for	for	r	or	for	for	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	0	0	<other>
a	a	a	a	a	a	a	a	a	a	NOCAPS	NODIGIT	1	NOPUNCT	x	x	1	0	<other>
Re	re	R	Re	Re	Re	e	Re	Re	Re	INITCAP	NODIGIT	0	NOPUNCT	Xx	Xx	0	0	<other>
content	content	c	co	con	cont	t	nt	ent	tent	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
ranging	ranging	r	ra	ran	rang	g	ng	ing	ging	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
from	from	f	fr	fro	from	m	om	rom	from	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
25	25	2	25	25	25	5	25	25	25	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueLeast>
to	to	t	to	to	to	o	to	to	to	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	0	0	<other>
62	62	6	62	62	62	2	62	62	62	NOCAPS	ALLDIGIT	0	NOPUNCT	dd	d	0	0	I-<valueMost>
at	at	a	at	at	at	t	at	at	at	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	I-<unitLeft>
%	%	%	%	%	%	%	%	%	%	ALLCAPS	NODIGIT	1	NOPUNCT	%	%	0	0	<unitLeft>
)	)	)	)	)	)	)	)	)	)	ALLCAPS	NODIGIT	1	ENDBRACKET	)	)	0	0	<other>
has	has	h	ha	has	has	s	as	has	has	NOCAPS	NODIGIT	0	NOPUNCT	xxx	x	1	0	<other>
been	been	b	be	bee	been	n	en	een	been	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
reported	reported	r	re	rep	repo	d	ed	ted	rted	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>
in	in	i	in	in	in	n	in	in	in	NOCAPS	NODIGIT	0	NOPUNCT	xx	x	1	0	<other>
literature	literature	l	li	lit	lite	e	re	ure	ture	NOCAPS	NODIGIT	0	NOPUNCT	xxxx	x	0	0	<other>

Aug 22 '19 06:08 lfoppiano