grobid
Segment sentences coordinates.
Hi, is there a way to get only the sentence coords, or paragraph coords, without the rest of Grobid's parsing? I only need the plain text + coords from the PDF.
Thanks!
Hi @ayhama16 !
Not sure I have enough context to answer, but:
- If you process scientific articles, the purpose of the XML is to make it possible to extract only the information you are interested in. You can use a simple XPath to get only the sentences (`//s`) or the paragraphs (`//p`) and ignore the other structures (filtering out the mark-up inside sentences and paragraphs), write an XML parser, or use XSLT style sheets.
- If you want to process non-scholarly PDFs and get only the sentence/paragraph information, you would need to adapt the segmentation model to the new type of document. This is the first model applied; it identifies the large zones of the document, in particular the text body where paragraphs and sentences are located. Otherwise Grobid considers every document to be a scientific article, and you might miss some text.
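For the first option, here is a minimal sketch of the XPath approach in Python. It assumes the TEI was produced with sentence segmentation and coordinates enabled (for example `segmentSentences=1` and `teiCoordinates=s` on the full-text service), that elements sit in the TEI namespace, and that each `coords` attribute is a `;`-separated list of `page,x,y,width,height` boxes. The helper names `parse_coords` and `extract_spans`, and the sample document, are made up for illustration and are not part of Grobid.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by the output (assumption: default TEI namespace).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hand-written sample standing in for real Grobid output.
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    '<p coords="1,10.0,20.0,100.0,24.0">'
    '<s coords="1,10.0,20.0,100.0,12.0">First <ref type="bibr">[1]</ref> sentence.</s> '
    '<s coords="1,10.0,32.0,80.0,12.0;2,10.0,20.0,50.0,12.0">Second one.</s>'
    '</p></div></body></text></TEI>'
)


def parse_coords(value):
    """Turn 'page,x,y,w,h;page,x,y,w,h' into a list of tuples.

    The field order is an assumption; check it against your Grobid version.
    """
    boxes = []
    for chunk in value.split(";"):
        parts = chunk.split(",")
        if len(parts) != 5:
            continue  # skip empty or malformed entries
        page = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
        boxes.append((page, x, y, w, h))
    return boxes


def extract_spans(tei_xml, tag="s"):
    """Return (plain_text, boxes) for every <s> or <p> element."""
    root = ET.fromstring(tei_xml)
    spans = []
    for el in root.iterfind(f".//tei:{tag}", TEI_NS):
        # itertext() drops nested mark-up such as <ref> but keeps its text.
        text = " ".join("".join(el.itertext()).split())
        spans.append((text, parse_coords(el.get("coords", ""))))
    return spans


if __name__ == "__main__":
    for text, boxes in extract_spans(SAMPLE_TEI, "s"):
        print(text, boxes)
```

Passing `tag="p"` returns paragraphs instead of sentences. Elements without a `coords` attribute come back with an empty box list, so the same function works whether or not coordinates were requested for that element type.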
Hope this helps !