
Need help with reproducing CORD results

martinkozle opened this issue 1 year ago

I tried to reproduce the CORD results given in the paper, but I only managed to get an F1 score of ~0.62 on the test dataset. Is there any special pre-processing done to the CORD dataset for it to work with LiLT, or am I making a mistake?

Currently what I am doing is changing the labels in /LiLTfinetune/data/datasets/xfun.py to the CORD label set, and changing the _generate_examples method to load from the CORD files.
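For illustration, the label swap described above can be sketched roughly as below. This is a hypothetical sketch, not the actual xfun.py code: the CORD category names shown are only a small subset of the real label set, and `bio_labels`/`tag_words` are helper names invented here.

```python
# Hypothetical sketch of swapping the XFUN labels for CORD-style ones in a
# datasets builder. Only a few illustrative CORD categories are listed.
CORD_CATEGORIES = ["menu.nm", "menu.price", "total.total_price"]

def bio_labels(categories):
    """Expand flat categories into the B-/I- tag set plus the 'O' tag."""
    labels = ["O"]
    for c in categories:
        labels += [f"B-{c.upper()}", f"I-{c.upper()}"]
    return labels

def tag_words(words, category):
    """Assign BIO tags to the words of one annotated line: B- on the
    first word of the entity, I- on the rest."""
    return [("B-" if i == 0 else "I-") + category.upper()
            for i, _ in enumerate(words)]
```

A `_generate_examples` override would then emit, per document, the word list, the bounding boxes, and these tags in place of the XFUN ones.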

The config that I used:

{
    "model_name_or_path": "models/lilt-infoxlm-base",
    "tokenizer_name": "roberta-base",
    "output_dir": "output/xfun_ser",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "lang": "en",
    "num_train_epochs": 10,
    "max_steps" : 2000,
    "per_device_train_batch_size": 1,
    "warmup_ratio": 0.1,
    "pad_to_max_length": "true",
    "return_entity_level_metrics": "true"
}

Is there another step that needs to be done for LiLT to work with a different dataset? With how many epochs/steps are the results in the paper achieved?

Update: With 20,000 steps I managed to get an overall F1 score of ~0.79, still far from the expected result. With 30,000 steps the score stays at ~0.79, so it no longer improves with more steps.

martinkozle · Jul 19 '22 15:07

Hi, in your config it seems that you use the roberta-base tokenizer with the lilt-infoxlm-base model, which is inconsistent; you need to use the xlm-roberta-base tokenizer. A per_device_train_batch_size of 1 is also too small.
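When VRAM caps the per-device batch size, gradient accumulation is the usual way to reach a larger effective batch size without more memory. A minimal sketch of the arithmetic (the numbers below are illustrative, not values from the paper):

```python
# Effective batch size seen by the optimizer when using gradient
# accumulation and/or multiple GPUs. Illustrative helper, not LiLT code.
def effective_batch_size(per_device: int, accum_steps: int = 1, n_gpus: int = 1) -> int:
    return per_device * accum_steps * n_gpus

# e.g. per_device_train_batch_size=12 with gradient_accumulation_steps=2
# on one GPU behaves like a batch of 24 for each optimizer step.
```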

jpWang · Jul 23 '22 06:07

Hi, we did more runs, this time with xlm-roberta-base as the tokenizer and per_device_train_batch_size increased to 12 (the most that fits in 24 GB of VRAM). Here is the exact config we tried this time:

{
    "model_name_or_path": "models/lilt-infoxlm-base",
    "tokenizer_name": "xlm-roberta-base",
    "output_dir": "output/xfun_ser",
    "do_train": "true",
    "do_eval": "true",
    "do_predict": "true",
    "lang": "en",
    "num_train_epochs": 10,
    "max_steps" : 20000,
    "per_device_train_batch_size": 12,
    "warmup_ratio": 0.1,
    "pad_to_max_length": "true",
    "return_entity_level_metrics": "true",
    "save_total_limit": 1
}

With this setup we got an F1 score of ~0.828, which is still lower than the 0.9616 you reported, and lower than all of the other architectures as well. Do you have the exact config that you used for CORD? And did you do any special preprocessing on the dataset?
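As a side note on the configs above: in the Hugging Face Trainer, a positive max_steps overrides num_train_epochs, so these runs train for far more than 10 epochs. A back-of-the-envelope check (assuming the commonly cited ~800-image CORD training split; the helper name is invented here):

```python
import math

# How many epochs a max_steps budget amounts to, given the training-set
# size and per-step batch size. Illustrative helper, not Trainer code.
def epochs_from_steps(max_steps: int, train_size: int, batch_size: int) -> float:
    steps_per_epoch = math.ceil(train_size / batch_size)
    return max_steps / steps_per_epoch

# With ~800 training images and batch size 12, 20,000 steps is roughly
# 300 epochs, so undertraining is unlikely to explain the gap.
```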

martinkozle · Jul 28 '22 14:07

We managed to find the issue and improve the F1 score to ~0.94.

martinkozle · Sep 14 '22 14:09