Table Training
Hi @kermitt2,
Thank you for grobid.
Lately I have been working on the table model, based on the tags grobid originally uses for training tables.
I have followed the same tag architecture for tables, but when I was training the table model I saw that the err percentage across the epochs was running the opposite way (you can see the attached photo). Based on the uploaded image the err is 6.12% / 0.70%, but it should be something like 0.70% / 100%.
I think that because of this I am not able to produce the table tags.
It would be great to get some help from your side.
Thank you.
Hi @kermitt2 @lfoppiano,
This is vikhil again.
I wanted to get some information on a warning I am facing while training the table model.
The warning is: warning: maximum linesearch reached.
Can you elaborate more on this?
Hi @vikhil0609 !
I think there is a problem with your input training format. Obviously the number of instances (nb train) should be much higher, and you should have something like 10 labels for the table model.
And you're right about the err: we should have first a lower value (field error rate) and second a higher one (instance error rate).
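To illustrate with the values from your screenshot (reordered as they would normally appear; the exact layout of the Wapiti iteration line may differ slightly):

    err= 0.70% / 6.12%   <- field (token) error rate first, then instance (sequence) error rate

If the first number is the larger one, as in your run, something is probably wrong with the training instances themselves.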
Could you maybe share some of your new training data for table? Or more information?
6e33906268970482ee9979f1b237cebb.training.table.tei.txt
Hi @kermitt2, thank you for replying. This is one example of the table training data which I have been using for training the table model. Also, I am not able to understand the WARNING: maximum linesearch reached.
"maximum line search reached" indicate that the input for the training is ill-formed, features and instances are not properly segmented and apparently everything appear as a huge line/block.
The format of your TEI training file looks good !
Do you have the associated feature file (the corresponding "raw" .table
file)?
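For reference, a rough sketch of how a properly segmented input should look (the feature values are made up for illustration, and the real Grobid feature vector has many more columns): one token per line followed by its feature columns, with the label as the last column in the assembled training file, and an empty line between two table instances:

    Table    Table    T    ...    I-<content>
    1        1        1    ...    <content>
    Results  Results  R    ...    <content>

    Table    Table    T    ...    I-<content>
    2        2        2    ...    <content>

If the empty lines separating instances are missing, Wapiti sees the whole corpus as one huge sequence, which typically leads to this kind of warning.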
Last thing to watch is the generated Wapiti training file. When launching the training (for example with ./gradlew train_table), the generated file is indicated in the console:
lopez@work:~/grobid$ ./gradlew train_table
> Task :grobid-trainer:train_table
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
sourceTEIPathLabel: /home/lopez/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/table/corpus/tei
sourceRawPathLabel: /home/lopez/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/table/corpus/raw
trainingOutputPath: /home/lopez/grobid/grobid-trainer/../grobid-home/tmp/table6503145258199999081.train
evalOutputPath: null
21 tei files
The "actual" assembled training file is grobid-home/tmp/table6503145258199999081.train
and it's interesting to share it if possible,
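If you want to have a look yourself, checking the first lines of that generated file is usually enough to spot the problem, e.g. (the temporary file name will be different for each run):

    head -n 50 grobid-home/tmp/table6503145258199999081.train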
OK! I am attaching the associated raw file here: 6e33906268970482ee9979f1b237cebb.training.txt
This is the Wapiti training file: table9959709674626711143.txt
Thank you @vikhil0609. The training data has only one label (<content> for table content), so it cannot work.
The task of the table model in Grobid is to structure a table area, more precisely to identify the table content itself, the table caption, the table title, the table label (used for table cross-references in the text body) and the table notes. So you would need training input relevant to this task.
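As a rough sketch of what annotated training data for this task could look like (a minimal hand-made example, element names based on the usual Grobid table training conventions; check the existing files under grobid-trainer/resources/dataset/table/corpus/tei for the exact tag set and layout):

    <figure type="table">
        <label>Table 1</label>
        <head>Accuracy of the evaluated models<lb/></head>
        <figDesc>Accuracy is averaged over five runs on the test set.<lb/></figDesc>
        <table>model A 0.91<lb/> model B 0.87<lb/></table>
        <note>Bold values indicate the best result.<lb/></note>
    </figure>

The point is that the different zones (title, label, caption, content, notes) each get their own tag, instead of everything being wrapped in a single content element.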
Maybe you are more interested in structuring the table content itself (e.g. into rows, cells, header rows, ...)? This is not covered by Grobid for the moment; there is only a basic line-based algorithm by default (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/data/Table.java#L252), which is not machine learning (it does not work very well, but afaik not worse than tabula on this kind of scientific table).