tner
Issue with dataset concatenation
Hi,
First of all, there is a bug in https://github.com/asahi417/tner/blob/master/tner/tner_cl/train.py#L118. The `GridSearcher` call should be:
```python
trainer = GridSearcher(
    checkpoint_dir=opt.checkpoint_dir,
    dataset=opt.dataset,
    local_dataset=opt.local_dataset,
    dataset_name=opt.dataset_name,
    n_max_config=opt.n_max_config,
    epoch_partial=opt.epoch_partial,
    max_length_eval=opt.max_length_eval,
    dataset_split_train=opt.dataset_split_train,
    dataset_split_valid=opt.dataset_split_valid,
    model=opt.model,
    crf=opt.crf,
    max_length=opt.max_length,
    epoch=opt.epoch,
    batch_size=opt.batch_size,
    lr=opt.lr,
    random_seed=opt.random_seed,
    gradient_accumulation_steps=opt.gradient_accumulation_steps,
    weight_decay=[i if i != 0 else None for i in opt.weight_decay],
    lr_warmup_step_ratio=[i if i != 0 else None for i in opt.lr_warmup_step_ratio],
    max_grad_norm=[i if i != 0 else None for i in opt.max_grad_norm],
    use_auth_token=opt.use_auth_token
)
```
The `dataset_name` argument was missing.
Then, when I want to train a model over two different datasets, they are not properly concatenated. Here is a simple example to reproduce:
```shell
tner-train-search -m "xlm-roberta-base" -c "output/" -d "tner/wikiann" "tner/tweetner7" --dataset-name "ace" "tweetner7" -e 15 --epoch-partial 5 --n-max-config 3 -b 32 -g 2 4 --lr 1e-6 1e-5 --crf 0 1 --max-grad-norm 0 10 --weight-decay 0 1e-7
```
According to the logs we get:

```
encode all the data: 7111
```

7111 is the size of the `tner/tweetner7` dataset for the `train_all` split. The real size should be 100 + 7111, the former being the size of the `train` split of the `ace` subdataset of `tner/wikiann`.
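For clarity, here is a minimal sketch of the behaviour I would expect: examples from each dataset merged per split before encoding. The `concat_datasets` helper and the `split -> {"tokens": ..., "tags": ...}` layout are illustrative assumptions for this report, not TNER's actual internals (label-scheme alignment across datasets is a separate concern):

```python
def concat_datasets(datasets):
    """Merge a list of split dicts by concatenating their examples per split.

    Each dataset is assumed to be a dict mapping a split name to a dict of
    parallel feature lists, e.g. {"train": {"tokens": [...], "tags": [...]}}.
    """
    merged = {}
    for data in datasets:
        for split, features in data.items():
            if split not in merged:
                # Copy the lists so the inputs are not mutated.
                merged[split] = {k: list(v) for k, v in features.items()}
            else:
                for k, v in features.items():
                    merged[split][k].extend(v)
    return merged


# Toy stand-ins with the sizes from the logs above.
wikiann_ace = {"train": {"tokens": [["a"]] * 100, "tags": [[0]] * 100}}
tweetner7 = {"train": {"tokens": [["b"]] * 7111, "tags": [[1]] * 7111}}

merged = concat_datasets([wikiann_ace, tweetner7])
print(len(merged["train"]["tokens"]))  # expected: 7211, not 7111
```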
I don't know if this is an easy fix or not. I will be happy to help if needed.