
Segment sentences coordinates.

Open • ayhama16 opened this issue 2 years ago • 1 comment

Hi, is there a way to get only the sentence or paragraph coordinates, without the rest of Grobid's parsing? I only need the plain text plus its coordinates in the PDF.

Thanks!

ayhama16 • Mar 20 '22 00:03

Hi @ayhama16 !

Not sure I have enough context to answer, but:

  • if you process scientific articles, the purpose of the XML output is to let you extract only the information you are interested in. You can use a simple XPath to get only the sentences (//s) or the paragraphs (//p) and ignore the other structures (filtering out the mark-up inside sentences and paragraphs), write an XML parser, or use XSLT style sheets.

  • if you want to process non-scholarly PDFs and get only the sentence/paragraph information, you would need to adapt the segmentation model to the new type of document. This is the first model applied, and it identifies the large zones of the document, in particular the text body where paragraphs and sentences are located. Otherwise, Grobid considers every document to be a scientific article and you might miss some text.
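
As a rough illustration of the first option, here is a minimal Python sketch that keeps only the sentence text and its coordinates from a TEI result. It rests on a few assumptions: the TEI was produced with sentence segmentation and coordinates enabled for the `s` element, each `coords` attribute holds `;`-separated boxes of the form `page,x,y,width,height`, and the elements live in the TEI namespace (so a bare `//s` needs a namespace prefix in most XPath engines). The helper names `parse_coords` and `extract_spans` and the `SAMPLE` document are made up for this example and are not part of Grobid.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def parse_coords(coords):
    """Parse 'page,x,y,w,h;page,x,y,w,h' into a list of tuples.

    Assumption: this is the layout of the coords attribute.
    """
    boxes = []
    for chunk in coords.split(";"):
        if not chunk.strip():
            continue  # empty or missing attribute
        page, x, y, w, h = chunk.split(",")
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def extract_spans(tei_xml, tag="s"):
    """Return (plain_text, boxes) for every <s> (or <p>) element."""
    root = ET.fromstring(tei_xml)
    spans = []
    for elem in root.iter(TEI_NS + tag):
        # itertext() drops inline mark-up such as <ref> but keeps its text;
        # split/join normalizes whitespace.
        text = " ".join("".join(elem.itertext()).split())
        spans.append((text, parse_coords(elem.get("coords", ""))))
    return spans


# Hand-written sample, only shaped like a TEI result (hypothetical values).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
<p><s coords="1,72.0,100.5,200.0,10.0;1,72.0,112.5,50.0,10.0">Grobid is a <ref type="bibr">[1]</ref> tool.</s>
<s coords="1,125.0,112.5,90.0,10.0">It parses PDFs.</s></p>
</div></body></text></TEI>"""

for text, boxes in extract_spans(SAMPLE):
    print(text, boxes)
```

Calling `extract_spans(tei_xml, tag="p")` would return paragraphs instead; a sentence or paragraph that wraps across lines typically carries several boxes, which is why each span maps to a list.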

Hope this helps !

kermitt2 • Mar 25 '22 12:03