Create holdout set

Open lfoppiano opened this issue 1 year ago • 1 comments

This PR selects some papers to form a holdout set. At the moment, since the dataset is small, we will still use all the documents to create the final models; however, we will keep a fixed holdout set to have a stricter and more precise evaluation. The exception is Units, where the evaluation set was borrowed from a different source.

The holdout set was created using an automatic script and then re-balanced based on the distribution of entities between the training and holdout sets.
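To illustrate the idea, here is a minimal sketch of one way such a split could be made and re-balanced. It is not the script shipped in this PR: the function names (`split_holdout`, `label_distribution`), the `holdout_fraction` parameter, the random-swap strategy, and the input format (a mapping from document name to per-label entity counts) are all assumptions made for the example.

```python
import random
from collections import Counter


def label_distribution(docs):
    """Return the relative frequency of each entity label over a set of documents."""
    total = Counter()
    for counts in docs.values():
        total.update(counts)  # adds the per-label counts of each document
    n = sum(total.values())
    return {label: c / n for label, c in total.items()} if n else {}


def distance(p, q):
    """L1 distance between two label distributions."""
    labels = set(p) | set(q)
    return sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def split_holdout(docs, holdout_fraction=0.2, seed=42, max_swaps=1000):
    """Pick a fixed holdout set, then swap documents to balance entity distributions.

    `docs` maps a document name to a dict of {entity_label: count} (hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed keeps the holdout set reproducible
    names = sorted(docs)
    rng.shuffle(names)
    k = max(1, round(len(names) * holdout_fraction))
    holdout, training = set(names[:k]), set(names[k:])

    def score(h, t):
        return distance(
            label_distribution({n: docs[n] for n in h}),
            label_distribution({n: docs[n] for n in t}),
        )

    best = score(holdout, training)
    for _ in range(max_swaps):
        if not training:
            break
        # Try exchanging one holdout document with one training document.
        a = rng.choice(sorted(holdout))
        b = rng.choice(sorted(training))
        new_holdout = (holdout - {a}) | {b}
        new_training = (training - {b}) | {a}
        s = score(new_holdout, new_training)
        if s < best:  # keep the swap only if the two distributions get closer
            holdout, training, best = new_holdout, new_training, s
    return sorted(training), sorted(holdout)
```

Called as `training, holdout = split_holdout(docs)`, this returns two disjoint, sorted lists of document names. Swapping one document for another keeps the holdout size fixed while the balance improves, and the fixed seed means the same input always yields the same holdout set.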

The Python scripts to reproduce the holdout dataset are located under scripts.

The statistics for the training and holdout sets can be found in:

  • https://github.com/kermitt2/grobid-quantities/tree/feature/holdout-set/resources/dataset/quantities
  • https://github.com/kermitt2/grobid-quantities/tree/feature/holdout-set/resources/dataset/units
  • https://github.com/kermitt2/grobid-quantities/tree/feature/holdout-set/resources/dataset/values

lfoppiano avatar Oct 24 '22 05:10 lfoppiano

Coverage Status

Coverage remained the same at 27.67% when pulling 06c7e11ab71fbff32a22f0e5ef47957945c4109e on feature/holdout-set into 0957bc631017ca3c603bb394d53af4b9643720d3 on master.

coveralls avatar Oct 24 '22 05:10 coveralls

I think this is ready to merge.

lfoppiano avatar Nov 07 '22 01:11 lfoppiano