
Missing `title_*`

Open cverluise opened this issue 4 years ago • 0 comments

Around 10% of the npl_publn records in the beta version have none of title_j, title_m and title_main_a. Most of the time, parts of these elements have been wrongly parsed into title_main_m.

How to reproduce the behaviour


SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j IS NULL
    AND title_m IS NULL
    AND title_main_a IS NULL
) AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`
) AS tls214
ON
  tls214.id = parsing.npl_publn_id
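
For a quick check of the share of affected records on an exported sample, something like the following could be used. This is only a sketch: the `share_missing_titles` helper and the sample rows are hypothetical and not part of PatCit; only the three column names come from the query above.

```python
# Columns that are all NULL for the affected records (taken from the query above).
TITLE_FIELDS = ("title_j", "title_m", "title_main_a")


def share_missing_titles(rows):
    """Return the fraction of rows in which every field of TITLE_FIELDS is None."""
    if not rows:
        return 0.0
    missing = sum(
        all(row.get(field) is None for field in TITLE_FIELDS) for row in rows
    )
    return missing / len(rows)


# Hypothetical sample rows for illustration only, not real PatCit data.
sample = [
    {"npl_publn_id": 1, "title_j": "JOURNAL A", "title_m": None, "title_main_a": "ARTICLE A"},
    {"npl_publn_id": 2, "title_j": None, "title_m": "BOOK B", "title_main_a": None},
    {"npl_publn_id": 3, "title_j": None, "title_m": None, "title_main_a": None},
    {"npl_publn_id": 4, "title_j": "JOURNAL C", "title_m": None, "title_main_a": None},
]

print(f"{share_missing_titles(sample):.0%} of the sample rows have no title field")
```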

Ideas/ solution

These citations seem to share a common pattern: they are already highly structured (e.g. NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, 1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961).
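
To illustrate how regular this pattern is, here is a minimal sketch that splits such a citation with a regular expression. The pattern, the `parse_structured_citation` helper and the output keys are assumptions made for illustration, written against the single example above; this is not how PatCit or Grobid parse citations, and it would likely miss many variants.

```python
import re

# Hypothetical pattern: AUTHORS: 'TITLE', JOURNAL, ..., vol. N, ... (YYYY-MM-DD), pages A - B
STRUCTURED_CITATION = re.compile(
    r"^(?P<authors>[^:]+):\s*"                     # everything before the first colon
    r"'(?P<title>.+?)',\s*"                        # quoted article title
    r"(?P<journal>[^,]+),"                         # first comma-separated item after the title
    r".*?vol\.\s*(?P<volume>\d+)"                  # volume number
    r".*?\((?P<date>\d{4}-\d{2}-\d{2})\)"          # ISO date in parentheses
    r".*?pages?\s*(?P<pages>\d+(?:\s*-\s*\d+)?)"   # single page or page range
)


def parse_structured_citation(citation):
    """Return a dict of the captured fields, or None if the pattern does not match."""
    match = STRUCTURED_CITATION.match(citation)
    return match.groupdict() if match else None


EXAMPLE = (
    "NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', "
    "CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, "
    "1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961"
)

print(parse_structured_citation(EXAMPLE))
```

A rule like this could at most help to select or pre-label training examples; free-text citations simply return `None`.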

At this stage, training the Grobid model on these examples seems to be the best option. The examples affected by this issue will then be processed again.

cverluise, Nov 09 '19 11:11