ptr type="web" not detected

Open rodyoukai opened this issue 2 years ago • 11 comments

Hi

I was training the citation model, and everything is detected correctly except the URL. This is an example of my training data:

<bibl> <author>Azaola, Elena</author> (<date>2009</date>). <title level="a">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>. <title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>, <biblScope unit="volume"></biblScope>(<biblScope unit="issue" type="issue">6</biblScope>), <biblScope unit="page">115-122</biblScope>. <idno type="ISSN"> ISSN: 1390-3691</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr> </bibl> <bibl> <author>Trejo Nieto, Alejandra</author> (<date>2013</date>). <title level="a">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>. <title level="j">Estudios Demográficos y Urbanos</title>, <biblScope unit="volume">28</biblScope>(<biblScope unit="issue" type="issue">3</biblScope>), <biblScope unit="page">545-591</biblScope>. <idno type="ISSN"> ISSN: 0186-7210</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=31230011001</ptr> </bibl>

Maybe I am doing something wrong, but I can't find it.

rodyoukai avatar Dec 03 '21 19:12 rodyoukai

The same happens with the ISSN.

rodyoukai avatar Dec 03 '21 22:12 rodyoukai

Hi @rodyoukai !

Thanks for the issue,

I don't see anything wrong in the training examples for URL and ISSN.

<bibl> <author>Azaola, Elena</author> (<date>2009</date>). <title level="a">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>. <title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>, <biblScope unit="volume"></biblScope>(<biblScope unit="issue" type="issue">6</biblScope>), <biblScope unit="page">115-122</biblScope>. <idno type="ISSN"> ISSN: 1390-3691</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr> </bibl>

<bibl> <author>Trejo Nieto, Alejandra</author> (<date>2013</date>). <title level="a">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>. <title level="j">Estudios Demográficos y Urbanos</title>, <biblScope unit="volume">28</biblScope>(<biblScope unit="issue" type="issue">3</biblScope>), <biblScope unit="page">545-591</biblScope>. <idno type="ISSN"> ISSN: 0186-7210</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=31230011001</ptr> </bibl>

When I try them with the current citation model, the web URL and the ISSN are identified correctly:

Azaola, Elena (2009). El comercio con el dolor y la esperanza. La extorsión telefónica en México. URVIO, Revista Latinoamericana de Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. https://www.redalyc.org/articulo.oa?id=552656559008
<biblStruct >
    <analytic>
        <title level="a" type="main">El comercio con el dolor y la esperanza</title>
        <author>
            <persName>
                <forename type="first">Elena</forename>
                <surname>Azaola</surname>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />
    </analytic>
    <monogr>
        <title level="j">Revista Latinoamericana de Estudios de Seguridad</title>
        <idno type="ISSN">1390-3691</idno>
        <imprint>
            <biblScope unit="issue">6</biblScope>
            <biblScope unit="page" from="115" to="122" />
            <date type="published" when="2009" />
        </imprint>
    </monogr>
</biblStruct>
Trejo Nieto, Alejandra (2013). Las economías de las zonas metropolitanas de México en los albores del siglo xxi. Estudios Demográficos y Urbanos, 28(3), 545-591. ISSN: 0186-7210. https://www.redalyc.org/articulo.oa?id=31230011001
<biblStruct >
    <analytic>
        <title level="a" type="main">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>
        <author>
            <persName>
                <forename type="first">Trejo</forename>
                <surname>Nieto</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">Alejandra</forename>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=31230011001" />
    </analytic>
    <monogr>
        <title level="j">Estudios Demográficos y Urbanos</title>
        <idno type="ISSN">0186-7210</idno>
        <imprint>
            <biblScope unit="volume">28</biblScope>
            <biblScope unit="issue">3</biblScope>
            <biblScope unit="page" from="545" to="591" />
            <date type="published" when="2013" />
        </imprint>
    </monogr>
</biblStruct>

Are you sure that there are no XML parsing errors in your training files? Did you notice anything suspicious during training? How many examples are you using for training?

kermitt2 avatar Dec 04 '21 08:12 kermitt2
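
One way to rule out the XML parsing errors asked about above is to run each annotated `<bibl>` entry through a strict XML parser before training. The following is a minimal sketch using only the Python standard library; the helper name `check_bibl_fragments` and the `example.org` URL are hypothetical, and the check covers well-formedness only, not whether the annotations follow the GROBID guidelines. A common cause of failure in entries that contain URLs is an unescaped `&` in a query string.

```python
import xml.etree.ElementTree as ET


def check_bibl_fragments(fragments):
    """Return (index, parser message) for every fragment that is not well-formed XML."""
    problems = []
    for index, fragment in enumerate(fragments):
        try:
            ET.fromstring(fragment)
        except ET.ParseError as exc:
            problems.append((index, str(exc)))
    return problems


# Well-formed entry, shortened from the training data in this thread.
good = (
    '<bibl><author>Azaola, Elena</author> (<date>2009</date>). '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr></bibl>'
)

# Hypothetical entry: the raw "&" in the URL must be written as "&amp;" in XML.
bad = '<bibl><ptr type="web">https://example.org/item?id=1&lang=es</ptr></bibl>'

for index, message in check_bibl_fragments([good, bad]):
    print(f"entry {index}: {message}")
```

Only the second entry is reported, because the first one parses cleanly.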

Hi @kermitt2

Thanks for your answer. I ran a few tests; let me tell you about them:

I call the api/processCitation endpoint with a raw reference string and request the application/x-bibtex format.

I get this:

@article{-1, author = {Azaola, Elena}, title = {El comercio con el dolor y la esperanza. La extorsión telefónica en México}, journal = {URVIO, Revista Latinoamericana de Estudios de Seguridad}, date = {2009}, year = {2009}, pages = {115--122}, number = {6} }

But if I use application/xml I get this:

<biblStruct >
	<analytic>
		<title level="a" type="main">El comercio con el dolor y la esperanza. La extorsión telefónica en México</title>
		<author>
			<persName><forename type="first">Elena</forename><surname>Azaola</surname></persName>
		</author>
		<idno>1390-3691</idno>
		<ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />
	</analytic>
	<monogr>
		<title level="j">URVIO, Revista Latinoamericana de Estudios de Seguridad</title>
		<imprint>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="115" to="122" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

As you can see, the XML output has more fields, but the problem is that the **idno** tag does not have the **type="web"** attribute, the **ptr** tag has a **target** attribute instead of **type**, and the recovered URL is an attribute value instead of text between the tags.

By the way, my training data does not contain errors or unusual characters...

rodyoukai avatar Dec 08 '21 02:12 rodyoukai
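
The two calls described above differ only in the requested response format. The sketch below builds such a request with the Python standard library without sending it. It rests on assumptions that are not stated in this thread: a GROBID server at `http://localhost:8070`, the raw reference passed in a form field named `citations`, and the format selected through the HTTP `Accept` header. The helper name `build_citation_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local GROBID service; adjust to your own deployment.
GROBID_URL = "http://localhost:8070"


def build_citation_request(raw_reference, accept="application/xml", base_url=GROBID_URL):
    """Build (but do not send) a POST request for api/processCitation."""
    # Assumption: the raw reference string goes in the "citations" form field.
    body = urlencode({"citations": raw_reference}).encode("utf-8")
    return Request(
        f"{base_url}/api/processCitation",
        data=body,
        headers={
            "Accept": accept,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


reference = (
    "Azaola, Elena (2009). El comercio con el dolor y la esperanza. "
    "La extorsión telefónica en México. URVIO, Revista Latinoamericana de "
    "Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. "
    "https://www.redalyc.org/articulo.oa?id=552656559008"
)

bibtex_request = build_citation_request(reference, accept="application/x-bibtex")
xml_request = build_citation_request(reference, accept="application/xml")
print(bibtex_request.full_url, bibtex_request.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would send it to a running server.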

Hello !

The encoding of the results follows the TEI, so URLs are encoded like this by definition:

<ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" /> 

<ptr> has no type, and the target URL is defined by the @target attribute. Why do you think it is a problem?

I should stress that the encoding of the training data is different from the encoding of the final processed result. Grobid parsing results are metadata, so they are normalized and independent of any particular order, presentation, or serialization. This is the format expected by a catalogue, for instance.

Training data follow the input (for instance noisy token sequences from a PDF) and thus are not normalized. As they follow exactly the input string, the encoding is "inline", identifying spans to be extracted, so content is never in an attribute (XML attributes must be normalized to avoid XML failures).

To generate pre-annotated training data in this format, you can use the batch method createTraining, which produces inline annotations on the exact input reference strings.

kermitt2 avatar Dec 08 '21 10:12 kermitt2
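
The difference between the two encodings can be seen with a few lines of Python. This is a minimal sketch using shortened snippets from this thread: stripping the tags from an inline training entry gives back the exact input string, whereas in the processed result the URL has to be read from the `target` attribute of `<ptr>`. The snippets here carry no XML namespace, as in the thread; a full TEI document would normally need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Training data: inline annotation over the exact input string.
training = (
    '<bibl><author>Azaola, Elena</author> (<date>2009</date>). '
    '<idno type="ISSN">ISSN: 1390-3691</idno>. '
    '<ptr type="web">https://www.redalyc.org/articulo.oa?id=552656559008</ptr></bibl>'
)

# Processed result: normalized metadata, with the URL held in an attribute.
output = (
    '<biblStruct><analytic>'
    '<ptr target="https://www.redalyc.org/articulo.oa?id=552656559008" />'
    '</analytic></biblStruct>'
)

# Concatenating all text nodes recovers the raw reference string.
raw_reference = "".join(ET.fromstring(training).itertext())

# In the result, <ptr> has no text content; the URL is the @target value.
url = ET.fromstring(output).find(".//ptr").get("target")

print(raw_reference)
print(url)
```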

I understand; the ptr tag is clear to me now. I appreciate the explanation of the difference between input and output data.

But the ISSN type on the idno tag is not working for me...

<biblStruct >
    <analytic>
        <title level="a" type="main">Las economías de las zonas metropolitanas de México en los albores del siglo xxi</title>
        <author>
            <persName>
                <forename type="first">Trejo</forename>
                <surname>Nieto</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">Alejandra</forename>
            </persName>
        </author>
        <ptr target="https://www.redalyc.org/articulo.oa?id=31230011001" />
    </analytic>
    <monogr>
        <title level="j">Estudios Demográficos y Urbanos</title>
        <idno type="ISSN">0186-7210</idno>
        <imprint>
            <biblScope unit="volume">28</biblScope>
            <biblScope unit="issue">3</biblScope>
            <biblScope unit="page" from="545" to="591" />
            <date type="published" when="2013" />
        </imprint>
    </monogr>
</biblStruct>

In your example (above), the type attribute exists on the idno tag...

rodyoukai avatar Dec 08 '21 17:12 rodyoukai

In your example (above), the type attribute exists on the idno tag...

What is your input reference?

With the ISSN keyword (e.g. ISSN: 1390-3691.) it works normally, but without it (1390-3691.), the value is just recognized as a generic identifier. In general, the ISSN is presented with the prefix (ISSN: 1390-3691.); all the cases in the current training data are like that, I think.

kermitt2 avatar Dec 08 '21 17:12 kermitt2

This is my query:

Azaola, Elena (2009). El comercio con el dolor y la esperanza. La extorsión telefónica en México. URVIO, Revista Latinoamericana de Estudios de Seguridad, (6), 115-122. ISSN: 1390-3691. https://www.redalyc.org/articulo.oa?id=552656559008

rodyoukai avatar Dec 08 '21 18:12 rodyoukai

With this input reference, the ISSN type appears with the current system (https://grobid.science-miner.com). In your training data, did you systematically include the ISSN prefix inside the <idno> field, for example:

... <idno type="ISSN">ISSN: 0186-7210</idno> ...

This is what is required for the identifier type to be recognized.

kermitt2 avatar Dec 08 '21 18:12 kermitt2

Yes, I do. This is an example:

<bibl>
<author>Vargas Reyes, Bryan, Ariza Santamaría, Rosembert</author> (<date>2020</date>). <title level="a">Liberación de la madre tierra: entre la legitimidad y los usos sociales de la ilegalidad</title>. <title level="j">Revista Estudios Socio-Jurídicos</title>, <biblScope unit="volume">22</biblScope>(<biblScope unit="issue" type="issue">1</biblScope>), <biblScope unit="page">203-232</biblScope>. ISSN: <idno type="issn">0124-0579</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr>
</bibl>

rodyoukai avatar Dec 08 '21 18:12 rodyoukai

In this example, the ISSN: prefix is outside the <idno> mark-up?

It should be:

<bibl>
<author>Vargas Reyes, Bryan, Ariza Santamaría, Rosembert</author> (<date>2020</date>). <title level="a">Liberación de la madre tierra: entre la legitimidad y los usos sociales de la ilegalidad</title>. <title level="j">Revista Estudios Socio-Jurídicos</title>, <biblScope unit="volume">22</biblScope>(<biblScope unit="issue" type="issue">1</biblScope>), <biblScope unit="page">203-232</biblScope>. <idno type="issn">ISSN: 0124-0579</idno>. <ptr type="web">https://www.redalyc.org/articulo.oa?id=73362099007</ptr>
</bibl>

https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/#identifiers

kermitt2 avatar Dec 09 '21 07:12 kermitt2
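
The mistake pointed out above, an `ISSN:` label left just outside the `<idno>` element, can be searched for across a training file. The following is a minimal sketch, assuming each `<idno>` is a direct child of `<bibl>` as in the examples in this thread. The helper name `issn_prefix_outside_idno` and the regular expression are hypothetical and not part of GROBID.

```python
import re
import xml.etree.ElementTree as ET

# Matches an "ISSN" label (with optional colon) at the very end of a text run.
PREFIX_AT_END = re.compile(r"ISSN\s*:?\s*$", re.IGNORECASE)


def issn_prefix_outside_idno(bibl_xml):
    """Return the ISSN <idno> values whose label sits just before the element."""
    root = ET.fromstring(bibl_xml)
    flagged = []
    preceding = root.text or ""  # text between <bibl> and its first child
    for child in root:
        if child.tag == "idno" and child.get("type", "").lower() == "issn":
            if PREFIX_AT_END.search(preceding):
                flagged.append((child.text or "").strip())
        preceding = child.tail or ""  # text between this child and the next
    return flagged


# Shortened versions of the two annotations compared above.
wrong = (
    '<bibl><biblScope unit="page">203-232</biblScope>. '
    'ISSN: <idno type="issn">0124-0579</idno>.</bibl>'
)
fixed = (
    '<bibl><biblScope unit="page">203-232</biblScope>. '
    '<idno type="issn">ISSN: 0124-0579</idno>.</bibl>'
)

print(issn_prefix_outside_idno(wrong))
print(issn_prefix_outside_idno(fixed))
```

Entries that are flagged need the label moved inside the `<idno>` element, as in the corrected example.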

I understand. Sorry, English is not my native language, and sometimes I misunderstand things like this. I will retrain the model and check. Thanks for your time and patience.

rodyoukai avatar Dec 09 '21 16:12 rodyoukai