GROBID TEI to bibtex

Open stzellerhoff opened this issue 8 years ago • 5 comments

Hi,

I am using a dockerized version of GROBID to extract references from scientific PDFs. The available output format is TEI (direct BibTeX output is not possible with the Docker version). Converting the TEI with the teitobibtex script produces incorrect BibTeX files. Does anyone know how to solve this problem? Thank you!

Stephan

stzellerhoff avatar Oct 02 '17 07:10 stzellerhoff

Could you provide samples of the TEI and the bibtex output, and describe the features which are incorrect?

martindholmes avatar Oct 02 '17 16:10 martindholmes

Hi!

GROBID tei output:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML"> <teiHeader/>

<listBibl> <biblStruct xml:id="b0"> Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars <persName xmlns="http://www.tei-c.org/ns/1.0">KSoejima</persName> <persName xmlns="http://www.tei-c.org/ns/1.0">WgStevenson</persName> <persName xmlns="http://www.tei-c.org/ns/1.0">JlSapp</persName> <persName xmlns="http://www.tei-c.org/ns/1.0">ApSelwyn</persName> <persName xmlns="http://www.tei-c.org/ns/1.0">GCouper</persName> <persName xmlns="http://www.tei-c.org/ns/1.0">LmEpstein</persName> J Am Coll Cardiol <biblScope unit="volume">43</biblScope> <biblScope unit="page" from="1834" to="1842" /> </biblStruct> </listBibl>
</TEI>

Bibtex result:

@article{b0,
  title={{Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars}},
  author={{KSoejima} and {WgStevenson} and {JlSapp} and {ApSelwyn} and {GCouper} and {LmEpstein}},
  journal={{J Am Coll Cardiol}}43,
  year={}
}

Author forenames and surnames are merged, and the issue, pages, and publication year are empty. Thank you!

stzellerhoff avatar Oct 02 '17 17:10 stzellerhoff

Hi!

GROBID Tei output:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <teiHeader/>
  <text>
    <front/>
    <body/>
    <back>
      <div>
        <listBibl>
          <biblStruct xml:id="b0">
            <analytic>
              <title level="a" type="main">Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars</title>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">K</forename><surname>Soejima</surname></persName>
              </author>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Wg</forename><surname>Stevenson</surname></persName>
              </author>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Jl</forename><surname>Sapp</surname></persName>
              </author>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Ap</forename><surname>Selwyn</surname></persName>
              </author>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">G</forename><surname>Couper</surname></persName>
              </author>
              <author>
                <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Lm</forename><surname>Epstein</surname></persName>
              </author>
            </analytic>
            <monogr>
              <title level="j">J Am Coll Cardiol</title>
              <imprint>
                <biblScope unit="volume">43</biblScope>
                <biblScope unit="page" from="1834" to="1842" />
                <date type="published" when="2004" />
              </imprint>
            </monogr>
          </biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>

Bibtex result:

@article{b0,
  title={{Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars}},
  author={{KSoejima} and {WgStevenson} and {JlSapp} and {ApSelwyn} and {GCouper} and {LmEpstein}},
  journal={{J Am Coll Cardiol}}43,
  year={}
}

Author forenames and surnames are merged, and the issue, pages, and publication year are empty. Thank you!

Stephan

stzellerhoff avatar Oct 02 '17 17:10 stzellerhoff

Looks like the name problem is in https://github.com/TEIC/Stylesheets/blob/dev/bibtex/convertbib.xsl about line 176. It's looking for a tei:author/tei:surname and is presented with a tei:author/tei:persName/tei:surname instead.

Should be pretty straightforward to add conditional clauses for that (and the same for editor above), but I'm not in front of a suitable machine to code and test that right now.

The date is encoded purely as an attribute rather than as XML text, which is not really the TEI way, but could be handled about line 86 of the same file.

cheers stuart

stuartyeates avatar Oct 03 '17 07:10 stuartyeates
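
The path mismatch described above can be reproduced outside XSLT. What follows is only a minimal Python sketch using the standard library's `xml.etree.ElementTree`, not a patch for convertbib.xsl. The trimmed `SAMPLE` record and the helper name `find_surnames` are assumptions made for illustration; they are not part of the TEI Stylesheets.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed-down GROBID-style record (illustrative, based on the sample in this thread).
SAMPLE = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <author>
      <persName><forename type="first">K</forename><surname>Soejima</surname></persName>
    </author>
  </analytic>
  <monogr>
    <imprint>
      <date type="published" when="2004"/>
    </imprint>
  </monogr>
</biblStruct>
"""


def find_surnames(bibl, path):
    """Return the surname texts matched by `path` (hypothetical helper)."""
    return [el.text for el in bibl.findall(path, NS)]


bibl = ET.fromstring(SAMPLE)

# Path the stylesheet is reported to look for: surname directly under author.
direct = find_surnames(bibl, ".//tei:author/tei:surname")
# Path GROBID actually produces: surname wrapped in persName.
wrapped = find_surnames(bibl, ".//tei:author/tei:persName/tei:surname")

date = bibl.find(".//tei:imprint/tei:date", NS)
year_from_text = (date.text or "").strip()  # the element has no text content
year_from_attr = date.get("when", "")[:4]   # the year is carried by @when

print(direct, wrapped, repr(year_from_text), year_from_attr)
```

The direct path matches nothing while the `persName` path finds the surname, and the year is only available from the `when` attribute, which is consistent with the empty `year={}` in the BibTeX output.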

Hi!

I gave it a try, but could not get the output right, probably because I don't know exactly how to make the change...

stzellerhoff avatar Oct 06 '17 19:10 stzellerhoff
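
Until the stylesheet handles GROBID's structure, one possible workaround is to convert the records directly. The following is a rough sketch in Python (standard library only), assuming every reference is a journal article shaped like the sample in this thread; the function names `text_of` and `bibl_to_bibtex` and the shortened `SAMPLE` are hypothetical, and other entry types, editors, and BibTeX escaping are not handled.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened version of the record from this thread (two authors only).
SAMPLE = """<biblStruct xmlns="http://www.tei-c.org/ns/1.0" xml:id="b0">
  <analytic>
    <title level="a" type="main">Endocardial and epicardial radiofrequency ablation</title>
    <author><persName><forename type="first">K</forename><surname>Soejima</surname></persName></author>
    <author><persName><forename type="first">Wg</forename><surname>Stevenson</surname></persName></author>
  </analytic>
  <monogr>
    <title level="j">J Am Coll Cardiol</title>
    <imprint>
      <biblScope unit="volume">43</biblScope>
      <biblScope unit="page" from="1834" to="1842"/>
      <date type="published" when="2004"/>
    </imprint>
  </monogr>
</biblStruct>"""


def text_of(parent, path):
    """Stripped text of the first element matching `path`, or ''."""
    el = parent.find(path, NS)
    return (el.text or "").strip() if el is not None else ""


def bibl_to_bibtex(bibl):
    """Render one GROBID-style <biblStruct> element as a BibTeX @article entry."""
    key = bibl.get(XML_ID, "ref")

    # Keep surname and forename apart so they are not merged in the output.
    authors = []
    for pers in bibl.findall("tei:analytic/tei:author/tei:persName", NS):
        forenames = " ".join(
            (f.text or "").strip() for f in pers.findall("tei:forename", NS)
        )
        surname = text_of(pers, "tei:surname")
        authors.append(f"{surname}, {forenames}" if forenames else surname)

    # Pages may be given as @from/@to attributes rather than element text.
    page = bibl.find("tei:monogr/tei:imprint/tei:biblScope[@unit='page']", NS)
    pages = ""
    if page is not None:
        if page.get("from") and page.get("to"):
            pages = f"{page.get('from')}--{page.get('to')}"
        else:
            pages = (page.text or "").strip()

    # The year is read from @when because the date element has no text.
    date = bibl.find("tei:monogr/tei:imprint/tei:date", NS)
    year = date.get("when", "")[:4] if date is not None else ""

    fields = [
        ("title", text_of(bibl, "tei:analytic/tei:title")),
        ("author", " and ".join(authors)),
        ("journal", text_of(bibl, "tei:monogr/tei:title")),
        ("volume", text_of(bibl, "tei:monogr/tei:imprint/tei:biblScope[@unit='volume']")),
        ("pages", pages),
        ("year", year),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields if value)
    return f"@article{{{key},\n{body}\n}}"


entry = bibl_to_bibtex(ET.fromstring(SAMPLE))
print(entry)
```

For a full GROBID file, the same function could be applied to each element returned by `root.iter("{http://www.tei-c.org/ns/1.0}biblStruct")` after parsing the document with `ET.parse`.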