
Issue with dataset concatenation

Open jplu opened this issue 2 years ago • 0 comments

Hi,

First of all, there is a bug in https://github.com/asahi417/tner/blob/master/tner/tner_cl/train.py#L118. The GridSearcher call should be:

trainer = GridSearcher(
    checkpoint_dir=opt.checkpoint_dir,
    dataset=opt.dataset,
    local_dataset=opt.local_dataset,
    dataset_name=opt.dataset_name,
    n_max_config=opt.n_max_config,
    epoch_partial=opt.epoch_partial,
    max_length_eval=opt.max_length_eval,
    dataset_split_train=opt.dataset_split_train,
    dataset_split_valid=opt.dataset_split_valid,
    model=opt.model,
    crf=opt.crf,
    max_length=opt.max_length,
    epoch=opt.epoch,
    batch_size=opt.batch_size,
    lr=opt.lr,
    random_seed=opt.random_seed,
    gradient_accumulation_steps=opt.gradient_accumulation_steps,
    weight_decay=[i if i != 0 else None for i in opt.weight_decay],
    lr_warmup_step_ratio=[i if i != 0 else None for i in opt.lr_warmup_step_ratio],
    max_grad_norm=[i if i != 0 else None for i in opt.max_grad_norm],
    use_auth_token=opt.use_auth_token
)

The dataset_name argument was missing.
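
To illustrate why the missing keyword matters, here is a minimal sketch using only argparse. The function fake_grid_searcher is a hypothetical stand-in, not tner's GridSearcher, and the parser below is an assumption that only mirrors the -d and --dataset-name options used on the command line. It shows that an option parsed from the command line never reaches the callee unless it is forwarded explicitly.

```python
import argparse


def fake_grid_searcher(dataset=None, dataset_name=None):
    # Hypothetical stand-in for GridSearcher: it only reports what it received.
    return {"dataset": dataset, "dataset_name": dataset_name}


# Assumed parser shape: both options accept one or more values.
parser = argparse.ArgumentParser()
parser.add_argument("-d", "--dataset", nargs="+", default=None)
parser.add_argument("--dataset-name", nargs="+", default=None)

opt = parser.parse_args(
    ["-d", "tner/wikiann", "tner/tweetner7", "--dataset-name", "ace", "tweetner7"]
)

# Without forwarding, the callee falls back to its default (None).
without_fix = fake_grid_searcher(dataset=opt.dataset)

# With forwarding, the parsed names reach the callee.
with_fix = fake_grid_searcher(dataset=opt.dataset, dataset_name=opt.dataset_name)

print(without_fix["dataset_name"])
print(with_fix["dataset_name"])
```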

Second, when I train a model over two different datasets, they are not properly concatenated. Here is a simple example to reproduce:

tner-train-search -m "xlm-roberta-base" -c "output/" -d "tner/wikiann" "tner/tweetner7" --dataset-name "ace" "tweetner7" -e 15 --epoch-partial 5 --n-max-config 3 -b 32 -g 2 4 --lr 1e-6 1e-5 --crf 0 1 --max-grad-norm 0 10 --weight-decay 0 1e-7

According to the logs we get:

encode all the data: 7111

7111 is the size of the tner/tweetner7 dataset for the train_all split. The real size should be 100 + 7111 = 7211, the former being the size of the train split of the ace subset of tner/wikiann.
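
For reference, the behaviour I would expect can be sketched in plain Python. This is only an illustration of the expected sizes, not tner's actual loading code: concat_splits is a hypothetical helper, and the two dictionaries are placeholder data built to match the split sizes quoted above.

```python
def concat_splits(datasets, split_names):
    """Concatenate the requested split of every dataset into one list."""
    merged = []
    for data, split in zip(datasets, split_names):
        # extend() appends the examples, so earlier datasets are preserved.
        merged.extend(data[split])
    return merged


# Placeholder examples; only the sizes (100 and 7111) come from the issue.
wikiann_ace = {"train": [{"tokens": [], "tags": []} for _ in range(100)]}
tweetner7 = {"train_all": [{"tokens": [], "tags": []} for _ in range(7111)]}

merged = concat_splits([wikiann_ace, tweetner7], ["train", "train_all"])
print(len(merged))
```

With both datasets included, the number of examples to encode is the sum of the two split sizes rather than the size of the last dataset alone.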

I don't know if this is an easy fix or not. I will be happy to help if needed.

jplu, Dec 28 '22 15:12