Any options for ignoring captions or footers (anything other than main text or section titles)?

com3dian opened this issue 1 year ago · 5 comments

Hi!

Thanks for this wonderful software! I tested it with some papers and was surprised to find that Grobid is much better than all the other software I tried for my project. I would like to ask: does Grobid always parse out the figures/tables' captions or the text embedded in figures/tables? If not, is there an option that lets me get only the body text (excluding captions or footers)? Thank you so much!

com3dian avatar Feb 21 '23 17:02 com3dian

Hi @com3dian and thank you for the kind words on the project!

Yes, Grobid always parses figures when using the processFulltextDocument service.

The resulting XML format is designed so that you can then select the structures of interest. There are comprehensive XML parsers and tools in every language. The usual way to proceed for your use case would be to use XPath.

For instance, to get all the text under the <div> elements of the <body> part, with the command line tool xmllint it's something like this:

xmllint --xpath "//*[local-name()='body']/*[local-name()='div']//text()" file.xml

If you want to ignore figure content and footnotes:

xmllint --xpath "//*[local-name()='body']/*[not(local-name()='note') and not(local-name()='figure')]//text()" file.xml
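
The same selection can also be done from Python with lxml. A minimal sketch, assuming the TEI namespace that Grobid normally emits (http://www.tei-c.org/ns/1.0):

from lxml import etree

# Minimal sketch, assuming the Grobid TEI namespace;
# equivalent to the second xmllint call above.
tree = etree.parse("file.xml")
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
texts = tree.xpath(
    "//tei:body/*[not(self::tei:note) and not(self::tei:figure)]//text()",
    namespaces=ns,
)
print("".join(texts))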

kermitt2 avatar Feb 22 '23 11:02 kermitt2

Thanks a lot! I'm not using command line tools for this part of my project (and I don't know how xmllint works either), so I wrote some Python scripts and they work fine for me. Do you mind if I post them here so they could be helpful to others as well :handshake:?

com3dian avatar Mar 15 '23 09:03 com3dian

@com3dian Can I use your Python scripts? I want to get all the text under the <body> part.

TaichiLi avatar Mar 24 '23 11:03 TaichiLi

Hi @TaichiLi, I used the xmltodict package in my own project to read the XML data as a dictionary object. Below is my code for extracting text data under the ['head', '#text', 'p', 'surname'] tags while filtering out everything under the ['ref', 'figure', 'idno', 'listBibl', 'note'] tags. Please make your own stopwordList and keywordList if you want something different. Also, I cannot guarantee these scripts are a hundred percent bug-free, but feel free to speak up if you find any errors.

def getIndex(inputData):
    '''
    Return the child keys (for dicts) or indices
    (for lists) of the input; 0 for any other type.
    '''
    # isinstance also covers OrderedDict, which older
    # xmltodict versions return instead of a plain dict
    if isinstance(inputData, dict):
        return list(inputData.keys())
    elif isinstance(inputData, list):
        return range(len(inputData))
    return 0

def getSonNodes(nodeData, nodeName):
    '''
    Given a nodeData object and a nodeName string,
    returns a list of tuples containing the child
    nodes of the given node and their corresponding
    names.
    '''
    index = getIndex(nodeData)
    ans = []
    if isinstance(nodeData, list):
        for i in index:
            # list items inherit the parent tag's name
            ans.append((nodeData[i], nodeName))
    elif isinstance(nodeData, dict):
        for i in index:
            # dict children are named by their own key
            ans.append((nodeData[i], i))
    return ans

def docRead(sonData, sonName):
    '''
    Given a sonData object and its corresponding
    sonName string, returns a string representation
    of the data.
    Returns:
    - If the sonData object is a string, its value
      will be returned.
    - If the sonData object is not a string, the
      recRead function will be called recursively 
      to construct the string.
    '''
    ans = ''
    dataType = type(sonData)
    if dataType is str:
        ans += sonData + '\n'
    else:
        ans += recRead(sonData, sonName)
    return ans

def recRead(data, key):
    '''
    Recursively traverse the parsed XML dictionary
    and return the concatenated text found under the
    keyword tags.
    Notes:
    - This function assumes that the data object
      is a dictionary or a list.
    - It is called recursively to traverse the nested
      structure of the data object and construct the
      string representation.
    - Anything under a tag in stopwordList is skipped;
      only the text under tags in keywordList is kept.
    - The docRead function is called to construct the
      string representation of each keyword object found.
    '''
    ans = ''
    stopwordList = ['ref', 'figure', 'idno', 'listBibl', 'note']
    keywordList = ['head', '#text', 'p', 'surname']
    
    if getIndex(data):
        for son, father in getSonNodes(data, key):
            if father in stopwordList:
                continue
            elif father in keywordList:
                ans += docRead(son, father)
            else:
                ans += recRead(son, father)
    return ans

To use the above functions, run the following code (with filename set to the path of your Grobid XML file) and you should be able to see your texts.

import xmltodict

with open(filename, 'r', encoding='utf-8') as file:
    xmlData = file.read()
dataDict = xmltodict.parse(xmlData)
print(recRead(dataDict, 0))
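
As a quick self-check (a made-up snippet, not part of the original scripts), feeding a minimal XML string through the functions shows the filtering at work:

import xmltodict

sample = '<div><head>Intro</head><p>Some text.</p><note>skip me</note></div>'
print(recRead(xmltodict.parse(sample), 0))
# prints "Intro" and "Some text." on separate lines;
# the <note> content is dropped because 'note' is in stopwordList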

com3dian avatar Mar 24 '23 12:03 com3dian

If you're using Python, there are command line scripts available to transform Grobid TEI into JSON (some information is lost in the process, but it should cover what you want), like: https://github.com/softcite/software-mentions/blob/master/scripts/TEI2LossyJSON.py

python3 TEI2LossyJSON.py --tei-file ~/test/392_2010_Article_261.tei.xml
/home/lopez/test/392_2010_Article_261.json

or the AI2 https://github.com/allenai/s2orc-doc2json ("grobid2json").

The JSON produced by these two scripts is more or less the same (the format of the CORD-19 corpus).
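
To give an idea of what reading that JSON looks like, here is a rough sketch; the field names assume the CORD-19-style layout with a "body_text" list of paragraph entries, so double-check them against the actual output:

import json

# Rough sketch assuming the CORD-19-style layout: a "body_text"
# list whose entries carry "section" and "text" fields.
with open("392_2010_Article_261.json") as f:
    doc = json.load(f)

for para in doc.get("body_text", []):
    print(para.get("section", ""), "-", para["text"])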

kermitt2 avatar Mar 28 '23 14:03 kermitt2