Table Training
Hi @kermitt2,
Thank you for grobid.
Lately I have been working on the table model, based on the tags grobid originally uses for training tables.
I have followed the same tag architecture for tables, but when I was training the table model I saw that the err percentage across the epochs was running the opposite way (you can see the attached photo). Based on the uploaded image the err is 6.12% / 0.70%, but it should be something like 0.70% / 100%.
I think that because of this I am not able to produce the table tags.
It would be great to get some help from your side.
Thank you.
Hi @kermitt2 @lfoppiano,
This is vikhil again.
I wanted to get some information on a warning I am facing while training the table model.
The warning is: warning: maximum linesearch reached.
Can you elaborate more on this?
Hi @vikhil0609 !
I think there is a problem with your input training format. Obviously the number of instances (nb train) should be much higher, and you should have something like 10 labels for the table model.
And you're right about the err: we should have first a lower value (field error rate) and second a higher one (instance error rate).
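To illustrate with the values from your screenshot (reordered as they would normally appear; the exact layout of the Wapiti iteration line may differ slightly):

    err= 0.70% / 6.12%   <- field (token) error rate first, then instance (sequence) error rate

If the first number is the larger one, as in your run, something is probably wrong with the training instances themselves.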
Could you maybe share some of your new training data for table? Or more information?
6e33906268970482ee9979f1b237cebb.training.table.tei.txt
Hi @kermitt2, thank you for replying. This is one example of the table training data which I have been using for training the table model. Also, I am not able to understand the WARNING: maximum linesearch reached.
"maximum line search reached" indicate that the input for the training is ill-formed, features and instances are not properly segmented and apparently everything appear as a huge line/block.
The format of your TEI training file looks good !
Do you have the associated feature file (the corresponding "raw" .table
file)?
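For reference, a rough sketch of how a properly segmented input should look (the feature values are made up for illustration, and the real Grobid feature vector has many more columns): one token per line followed by its feature columns, with the label as the last column in the assembled training file, and an empty line between two table instances:

    Table    Table    T    ...    I-<content>
    1        1        1    ...    <content>
    Results  Results  R    ...    <content>

    Table    Table    T    ...    I-<content>
    2        2        2    ...    <content>

If the empty lines separating instances are missing, Wapiti sees the whole corpus as one huge sequence, which typically leads to this kind of warning.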
Last thing to watch is the generated Wapiti training file. When launching the training (for example with ./gradlew train_table), the generated file is indicated in the console:
lopez@work:~/grobid$ ./gradlew train_table
> Task :grobid-trainer:train_table
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
sourceTEIPathLabel: /home/lopez/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/table/corpus/tei
sourceRawPathLabel: /home/lopez/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/table/corpus/raw
trainingOutputPath: /home/lopez/grobid/grobid-trainer/../grobid-home/tmp/table6503145258199999081.train
evalOutputPath: null
21 tei files
The "actual" assembled training file is grobid-home/tmp/table6503145258199999081.train
and it's interesting to share it if possible,
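If you want to have a look yourself, checking the first lines of that generated file is usually enough to spot the problem, e.g. (the temporary file name will be different for each run):

    head -n 50 grobid-home/tmp/table6503145258199999081.train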
OK! I am attaching the associated raw file here: 6e33906268970482ee9979f1b237cebb.training.txt
This is the Wapiti training file: table9959709674626711143.txt
Thank you @vikhil0609. The training data has only one label (<content> for table content), so it cannot work.
The task of the table model in Grobid is to structure a table area, more precisely to identify the table content itself, the table caption, the table title, the table label (used for table cross-references in the text body) and the table notes. So you would need training input relevant to this task.
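As a rough sketch of what annotated training data for this task could look like (a minimal hand-made example, element names based on the usual Grobid table training conventions; check the existing files under grobid-trainer/resources/dataset/table/corpus/tei for the exact tag set and layout):

    <figure type="table">
        <label>Table 1</label>
        <head>Accuracy of the evaluated models<lb/></head>
        <figDesc>Accuracy is averaged over five runs on the test set.<lb/></figDesc>
        <table>model A 0.91<lb/> model B 0.87<lb/></table>
        <note>Bold values indicate the best result.<lb/></note>
    </figure>

The point is that the different zones (title, label, caption, content, notes) each get their own tag, instead of everything being wrapped in a single content element.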
Maybe you are more interested in structuring the table content itself (e.g. into rows, cells, header rows, ...)? This is not covered by Grobid for the moment; there is only a basic line-based algorithm by default (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/data/Table.java#L252), which is not machine learning (it does not work very well, but afaik not worse than tabula on this kind of scientific table).