Need help with reproducing CORD results
I tried to reproduce the CORD results given in the paper, but I only managed to get an F1 score of ~0.62 on the test dataset. Is there any special pre-processing done to the CORD dataset for it to work with LiLT, or am I making a mistake?
Currently I am changing the labels in /LiLTfinetune/data/datasets/xfun.py to the labels of the CORD dataset, and changing the _generate_examples method to load from the CORD files (roughly as in the sketch below).
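For reference, a minimal sketch of that label swap; the CORD entity types listed here are only an illustrative subset (the real dataset defines roughly 30 of them), and the BIO construction mirrors the QUESTION/ANSWER/HEADER tags used for XFUN:

# Illustrative only: a few CORD entity types to substitute for the XFUN labels in xfun.py.
CORD_ENTITY_TYPES = [
    "menu.nm",
    "menu.cnt",
    "menu.price",
    "sub_total.subtotal_price",
    "total.total_price",
]

def build_bio_labels(entity_types):
    # Build BIO-style class names ("O", "B-...", "I-...") for the SER head.
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type.upper()}")
        labels.append(f"I-{entity_type.upper()}")
    return labels

if __name__ == "__main__":
    print(build_bio_labels(CORD_ENTITY_TYPES))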
The config that I used:
{
  "model_name_or_path": "models/lilt-infoxlm-base",
  "tokenizer_name": "roberta-base",
  "output_dir": "output/xfun_ser",
  "do_train": "true",
  "do_eval": "true",
  "do_predict": "true",
  "lang": "en",
  "num_train_epochs": 10,
  "max_steps": 2000,
  "per_device_train_batch_size": 1,
  "warmup_ratio": 0.1,
  "pad_to_max_length": "true",
  "return_entity_level_metrics": "true"
}
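For context, a JSON config like the one above is typically consumed through Hugging Face's HfArgumentParser. The sketch below shows the general pattern; the extra argument dataclass is only an illustrative stand-in for the ones defined in the LiLT fine-tuning script, and the filenames in the usage note are hypothetical.

# Minimal sketch of loading a JSON training config with HfArgumentParser.
import sys
from dataclasses import dataclass, field
from typing import Optional
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ExtraArguments:
    # Illustrative stand-in for the model/data argument classes in the repo.
    model_name_or_path: Optional[str] = field(default=None)
    tokenizer_name: Optional[str] = field(default=None)
    lang: str = field(default="en")
    pad_to_max_length: bool = field(default=True)
    return_entity_level_metrics: bool = field(default=False)

if __name__ == "__main__":
    parser = HfArgumentParser((ExtraArguments, TrainingArguments))
    # parse_json_file splits the JSON keys across the given dataclasses.
    extra_args, training_args = parser.parse_json_file(json_file=sys.argv[1])
    print(extra_args.tokenizer_name, training_args.per_device_train_batch_size)

Run as, e.g., python load_config.py xfun_ser_cord.json.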
Is there another step that needs to be done for LiLT to work with a different dataset? How many epochs/steps are needed to reach the results reported in the paper?
Update: with 20,000 steps I managed to get an overall F1 score of ~0.79, which is still far from the expected result. With 30,000 steps the score stays at ~0.79, so it no longer improves with more steps.
Hi,
in your config it seems that you are using the roberta-base tokenizer with the lilt-infoxlm-base model, which is inconsistent. You need to use the xlm-roberta-base tokenizer. A per_device_train_batch_size of 1 is also too small.
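To make the mismatch concrete, loading both tokenizers side by side shows they use entirely different vocabularies; a minimal check with the standard transformers API (the sample string is just an example):

# lilt-infoxlm-base is paired with an InfoXLM/XLM-R text stream, so its embedding
# indices only line up with the XLM-RoBERTa SentencePiece vocabulary.
from transformers import AutoTokenizer

xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

sample = "TOTAL 25,000"
print(xlmr_tok.tokenize(sample))     # SentencePiece pieces, ids match the checkpoint
print(roberta_tok.tokenize(sample))  # BPE pieces from a different, incompatible vocabulary
print(xlmr_tok.vocab_size, roberta_tok.vocab_size)  # 250002 vs 50265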
Hi,
We did more runs, this time with xlm-roberta-base as the tokenizer and per_device_train_batch_size increased to 12 (as much as we could fit in 24 GB of VRAM). Here is the exact config we tried this time:
{
  "model_name_or_path": "models/lilt-infoxlm-base",
  "tokenizer_name": "xlm-roberta-base",
  "output_dir": "output/xfun_ser",
  "do_train": "true",
  "do_eval": "true",
  "do_predict": "true",
  "lang": "en",
  "num_train_epochs": 10,
  "max_steps": 20000,
  "per_device_train_batch_size": 12,
  "warmup_ratio": 0.1,
  "pad_to_max_length": "true",
  "return_entity_level_metrics": "true",
  "save_total_limit": 1
}
With this setup we got an F1 score of ~0.828, which is still lower than what you reported (0.9616) and lower than all of the other architectures as well. Do you have the exact config that you used for CORD? And did you do any special preprocessing on the dataset?
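As an aside, when VRAM caps the per-device batch size, gradient accumulation is the usual way to raise the effective batch size without more memory; a minimal sketch with the standard Hugging Face TrainingArguments (the accumulation value of 4 is only an example, not something tuned for CORD):

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# * number of GPUs. With 12 * 4 on a single GPU this gives 48 samples per update.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/xfun_ser",
    max_steps=20000,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=4,  # example value, not from the thread
    warmup_ratio=0.1,
)
print(training_args.train_batch_size * training_args.gradient_accumulation_steps)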
We managed to find the issue and improve the F1 score to ~0.94.
Hello! @martinkozle, I faced the same problem. Can you give some tips or share your config? It would be very much appreciated.
Sorry for the late response. I doubt that you are having the same issue we had. Our problem was with words that get split into multiple tokens: we were only feeding the first token to the model instead of all of them.
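For anyone hitting the same issue, the usual fix is to propagate each word-level label (and bounding box) to every subword piece via the fast tokenizer's word_ids(); a minimal, self-contained sketch with illustrative words, labels, and boxes:

# Align word-level labels and boxes with all subword tokens, not just the first one.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["TOTAL", "25,000"]            # illustrative OCR words
word_labels = ["B-TOTAL", "I-TOTAL"]   # illustrative word-level tags
word_boxes = [[30, 40, 90, 55], [100, 40, 160, 55]]

encoding = tokenizer(words, is_split_into_words=True)

token_labels, token_boxes = [], []
for word_id in encoding.word_ids(batch_index=0):
    if word_id is None:                # special tokens (<s>, </s>)
        token_labels.append(-100)      # ignored by the loss
        token_boxes.append([0, 0, 0, 0])
    else:                              # every subword inherits its word's label and box
        token_labels.append(word_labels[word_id])
        token_boxes.append(word_boxes[word_id])

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(token_labels)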