PatCit
Missing `title_*`
Around 10% of the `npl_publn` entries in the beta version have none of `title_j`, `title_m`, or `title_main_a`. Most of the time, part of these elements is wrongly parsed as `title_main_m`.
How to reproduce the behaviour
SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j IS NULL
    AND title_m IS NULL
    AND title_main_a IS NULL
) AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`
) AS tls214
ON
  tls214.id = parsing.npl_publn_id
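The query above lists the affected rows. To check the share they represent, a small helper can be run over a sample of parsed rows exported as dictionaries. This is only a sketch: `share_missing_all_titles` and the row layout are hypothetical, and the sample values are made up for illustration.

```python
# Hypothetical helper: fraction of rows where all three title fields are missing.
TITLE_FIELDS = ("title_j", "title_m", "title_main_a")


def share_missing_all_titles(rows):
    """Return the fraction of rows with no title_j, title_m, or title_main_a."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(
        1 for row in rows if all(row.get(field) is None for field in TITLE_FIELDS)
    )
    return missing / len(rows)


# Illustrative sample only; not real PatCit data.
sample = [
    {"npl_publn_id": 1, "title_j": "CHEMIETECHNIK", "title_m": None, "title_main_a": None},
    {"npl_publn_id": 2, "title_j": None, "title_m": None, "title_main_a": None},
    {"npl_publn_id": 3, "title_j": None, "title_m": "A BOOK", "title_main_a": None},
    {"npl_publn_id": 4, "title_j": None, "title_m": None, "title_main_a": "AN ARTICLE"},
]
print(share_missing_all_titles(sample))
```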
Ideas/ solution
There seems to be a common pattern in these citations: they are already highly structured (e.g., NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, 1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961).
At this stage, training the Grobid model on these examples seems to be the best option. Then, examples affected by this issue will be processed again.
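To illustrate how regular this pattern is, here is a minimal sketch of a regular expression that splits the example citation into a few fields. It is not a replacement for retraining Grobid. It might, however, help collect candidate training examples of this shape. The function name `parse_structured_citation` and the output keys are hypothetical, and the expression assumes the `AUTHORS: 'TITLE', JOURNAL, ... (YYYY-MM-DD)` layout seen in the example, which other citations may not follow.

```python
import re

# Assumed layout: AUTHORS: 'TITLE', JOURNAL, ... (YYYY-MM-DD), ...
STRUCTURED_CITATION = re.compile(
    r"""
    ^(?P<authors>[^:]+):\s*            # everything before the first colon
    '(?P<title>.+?)',\s*               # quoted title, closed by quote + comma
    (?P<journal>[^,]+),                # first comma-separated token after the title
    .*?                                # publisher, place, volume, issue, ...
    \((?P<date>\d{4}-\d{2}-\d{2})\)    # ISO date in parentheses
    """,
    re.VERBOSE,
)


def parse_structured_citation(npl_biblio):
    """Return a dict of fields if the citation follows the assumed layout, else None."""
    match = STRUCTURED_CITATION.match(npl_biblio.strip())
    return match.groupdict() if match else None


EXAMPLE = (
    "NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', "
    "CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, "
    "1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961"
)
print(parse_structured_citation(EXAMPLE))
```

Citations that do not match return `None`, so the same function could be used as a coarse filter over `npl_biblio` before anything is handed to Grobid.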