Any options for ignoring captions or footers (anything other than main text or section titles)?
Hi!
Thanks for this wonderful software! I tested it with some papers and was surprised to find that Grobid is much better than all the other software I tried for my project. I would like to ask: does Grobid always parse out the figures/tables' captions or the text embedded in the figures/tables? If so, is there an option that lets me get only the content text (excluding captions or footers)? Thank you so much!
Hi @com3dian and thank you for the kind words on the project!
Yes, Grobid always parses figures when using the processFulltextDocument service.
The result XML format is designed so that you can then select the structures of interest. There are comprehensive XML parsers and tools in every language. The usual way to proceed for your use case would be to use XPath.
For instance, to get all the text under <div> markup under the <body> part, with the command line tool xmllint it's something like this:
xmllint --xpath "//*[local-name()='body']/*[local-name()='div']//text()" file.xml
If you want to ignore figure stuff and footnotes:
xmllint --xpath "//*[local-name()='body']/*[not(local-name()='note') and not(local-name()='figure')]//text()" file.xml
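If you prefer to stay in Python, the same XPath can be run with lxml instead of the command line. A minimal sketch, assuming lxml is installed and file.xml is a Grobid TEI result (both names here are placeholders):

from lxml import etree

# Parse the Grobid TEI result
tree = etree.parse('file.xml')

# Same XPath as the xmllint example above: all text under <body>,
# skipping <note> and <figure> elements
texts = tree.xpath(
    "//*[local-name()='body']"
    "/*[not(local-name()='note') and not(local-name()='figure')]"
    "//text()"
)
print(' '.join(t.strip() for t in texts if t.strip()))

The local-name() trick avoids having to register the TEI namespace; registering it and using tei: prefixes in the XPath would work just as well.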
Thanks a lot! I'm not using command line tools for this part of my project (and I don't know how xmllint works either), so I wrote some Python scripts and they work fine for me. Do you mind if I post them here so they could be helpful to others as well :handshake:?
@com3dian Can I use your Python scripts? I want to get all the text under the <body> part.
Hi @TaichiLi, I used the xmltodict package in my own project to read the TEI XML data as a dictionary object. Below is my code for extracting the text data under ['head', '#text', 'p', 'surname'] tags while filtering out everything under ['ref', 'figure', 'idno', 'listBibl', 'note'] tags. Please make your own stopwordList and keywordList if you want something different. Also, I cannot guarantee these scripts are a hundred percent bug-free, but feel free to point them out if there are any errors.
def getIndex(inputData):
    '''
    Get the input index for lists and dicts.
    '''
    inputType = type(inputData)
    if inputType is dict:
        return list(inputData.keys())
    elif inputType is list:
        return range(len(inputData))
    return 0

def getSonNodes(nodeData, nodeName):
    '''
    Given a nodeData object and a nodeName string,
    returns a list of tuples containing the child
    nodes of the given node and their corresponding
    names.
    '''
    index = getIndex(nodeData)
    ans = []
    if type(nodeData) is list:
        for i in index:
            ans.append((nodeData[i], nodeName))
    elif type(nodeData) is dict:
        for i in index:
            ans.append((nodeData[i], i))
    return ans

def docRead(sonData, sonName):
    '''
    Given a sonData object and its corresponding
    sonName string, returns a string representation
    of the data.

    Returns:
    - If the sonData object is a string, its value
      will be returned.
    - If the sonData object is not a string, the
      recRead function will be called recursively
      to construct the string.
    '''
    ans = ''
    dataType = type(sonData)
    if dataType is str:
        ans += sonData + '\n'
    else:
        ans += recRead(sonData, sonName)
    return ans

def recRead(data, key):
    '''
    Notes:
    - This function assumes that the data object
      is a dictionary or a list.
    - This function is called recursively to traverse
      the nested structure of the data object and
      construct the string representation.
    - The function filters out certain stop words and
      keywords defined in the stopwordList and keywordList
      variables, respectively.
    - The docRead function is called to construct the
      string representation of each keyword object found.
    '''
    ans = ''
    stopwordList = ['ref', 'figure', 'idno', 'listBibl', 'note']
    keywordList = ['head', '#text', 'p', 'surname']
    if getIndex(data):
        for son, father in getSonNodes(data, key):
            if father in stopwordList:
                continue
            elif father in keywordList:
                ans += docRead(son, father)
            else:
                ans += recRead(son, father)
    return ans
To use the above functions, run the following code and you should be able to see your text:
import xmltodict

with open(filename, 'r', encoding='utf-8') as file:
    xmlData = file.read()
dataDict = xmltodict.parse(xmlData)
print(recRead(dataDict, 0))
If you're using Python, there are some command-line scripts available to transform Grobid TEI into JSON (some information is lost in the process, but it should cover what you want), like: https://github.com/softcite/software-mentions/blob/master/scripts/TEI2LossyJSON.py
python3 TEI2LossyJSON.py --tei-file ~/test/392_2010_Article_261.tei.xml
/home/lopez/test/392_2010_Article_261.json
or the AI2 https://github.com/allenai/s2orc-doc2json ("grobid2json").
The JSON produced by these two scripts is more or less the same (the format of the CORD-19 corpus).
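In case it helps, here is a minimal sketch of reading only the body text back out of that JSON, assuming the CORD-19-style layout where body_text is a list of paragraph objects with a text field (the exact keys may differ depending on the script version, so check the actual output first):

import json

with open('392_2010_Article_261.json', 'r', encoding='utf-8') as f:
    doc = json.load(f)

# Figure/table captions and footnotes live in separate structures in this
# layout, so iterating body_text yields only the running-text paragraphs
for paragraph in doc.get('body_text', []):
    print(paragraph['text'])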