
model fine-tuning error

marcomameli1992 opened this issue on Feb 17, 2023 · 3 comments

System Info

Hi all, I fine-tuned a model on my dataset.

But I need some help with inference. I trained the model with a tokenizer without truncation, and I got the first error at inference time. So I retrained the model with truncation activated, as shown in the code. Still, I encountered a new error during training that only appeared after adding truncation to the tokenizer.

Now, if I try to train the network without truncation on the tokenizer, training does not work, and I need help understanding what is happening.

Who can help?

@ArthurZucker @sgugger

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Inference error:

RuntimeError: The size of tensor a (726) must match the size of tensor b (512) at non-singleton dimension 1

The function used for classification is:

def classification_infer(data, model_path):
    # device
    device = find_device()

    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)
    data = data.loc[~data['Classe'].isin([0, 1, 2])]

    # model loading
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=3)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    print("Model loaded")

    print("Classifying...")
    classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, device=device)
    tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    classifier_output = classifier(data['usefull'].tolist())
    print("Classification completed")
    data['Classe'] = [int(x['label']) for x in classifier_output]

    return data
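Editorial note: in the snippet above, tokenizer_kwargs is built but never passed to the pipeline call, so the texts are tokenized without truncation. On reasonably recent transformers versions the text-classification pipeline forwards extra call-time keyword arguments (such as truncation and max_length) to its tokenizer, so one hedged option is the sketch below; it is not the original code and reuses the classifier, data, model, tokenizer and device objects from the function above.

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, device=device)
# padding/truncation/max_length are forwarded to the tokenizer on each call;
# return_tensors is not passed here because the pipeline sets it itself
classifier_output = classifier(data['usefull'].tolist(), padding=True, truncation=True, max_length=512)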

Training function, with truncation active:

def classification_train(data):
    # metric
    metric = evaluate.load('accuracy')

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Load tokenizer
    tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased", tokenizer_kwargs=tokenizer_kwargs)

    # preprocessing function
    def preprocess_function(data):
        return tokenizer(data['text'])

    # device
    device = find_device()

    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)

    # dataset creation
    dataset = pd.DataFrame()
    dataset[['text', 'label']] = data.loc[data['Classe'].isin([0, 1, 2]), ['usefull', 'Classe']]
    dataset['label'] = dataset['label'].astype(int)

    # dataset equilibrium
    dataset = dataset.groupby('label').head(100)
    dataset['text'] = dataset['text'].map(lambda x: x.lower())

    # dataset split
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # huggingface dataset
    train = Dataset.from_pandas(train)
    train = train.map(preprocess_function, batched=True)
    test = Dataset.from_pandas(test)
    test = test.map(preprocess_function, batched=True)

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-italian-cased", num_labels=3)
    model.to(device)

    # data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # training arguments
    training_args = TrainingArguments(
        output_dir='./model_weight',  # output directory
        num_train_epochs=2,  # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,  # batch size for evaluation
        weight_decay=0.01,  # strength of weight decay
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model='eval_accuracy',
        greater_is_better=True,
        report_to="wandb",  # enable logging to W&B
        run_name="bert-base-italian-cased-fit-for-crm",  # name of the W&B run (optional)
        label_names=['0', '1', '2']
    )

    # trainer
    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=train,  # training dataset
        eval_dataset=test,  # evaluation dataset
        data_collator=data_collator,  # data collator
        tokenizer=tokenizer,  # tokenizer
        compute_metrics=compute_metrics,  # the callback that computes metrics of interest
    )

    # train
    trainer.train()
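Editorial aside (not part of the original report): passing tokenizer_kwargs to AutoTokenizer.from_pretrained does not turn truncation on at call time; truncation and max_length generally have to be supplied when the tokenizer is called, i.e. inside preprocess_function. A minimal sketch under that assumption, keeping the tokenizer and column names used above:

def preprocess_function(batch):
    # truncate every example to the 512 tokens the bert-base checkpoint supports
    return tokenizer(batch['text'], truncation=True, max_length=512)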

Error during the evaluation step, with the training function with truncation active:

IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
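One detail worth flagging (an observation, not a confirmed diagnosis): TrainingArguments.label_names lists the keys of the input dictionary that hold the labels, usually just ["labels"], not the class names. With label_names=['0', '1', '2'] the Trainer looks for label columns that do not exist, which can hand compute_metrics a 0-dimensional label array and raise exactly this kind of IndexError. A hedged sketch of the usual setup, reusing the arguments from the function above:

training_args = TrainingArguments(
    output_dir='./model_weight',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model='eval_accuracy',
    # omit label_names (it defaults to ["labels"]) or set it to the actual
    # label column key; it does not take the class ids
)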

Training function without truncation:

def classification_train(data):
    # metric
    metric = evaluate.load('accuracy')

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")

    # preprocessing function
    def preprocess_function(data):
        return tokenizer(data['text'], truncation=False)

    # device
    device = find_device()

    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)

    # dataset creation
    dataset = pd.DataFrame()
    dataset[['text', 'label']] = data.loc[data['Classe'].isin([0, 1, 2]), ['usefull', 'Classe']]
    dataset['label'] = dataset['label'].astype(int)

    # dataset equilibrium
    dataset = dataset.groupby('label').head(100)
    dataset['text'] = dataset['text'].map(lambda x: x.lower())

    # dataset split
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)

    # huggingface dataset
    train = Dataset.from_pandas(train)
    train = train.map(preprocess_function, batched=True)
    test = Dataset.from_pandas(test)
    test = test.map(preprocess_function, batched=True)

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-italian-cased", num_labels=3)
    model.to(device)

    # data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # training arguments
    training_args = TrainingArguments(
        output_dir='./model_weight',  # output directory
        num_train_epochs=2,  # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,  # batch size for evaluation
        weight_decay=0.01,  # strength of weight decay
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model='eval_accuracy',
        greater_is_better=True,
        report_to="wandb",  # enable logging to W&B
        run_name="bert-base-italian-cased-fit-for-crm",  # name of the W&B run (optional)
        label_names=['0', '1', '2']
    )

    # trainer
    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=train,  # training dataset
        eval_dataset=test,  # evaluation dataset
        data_collator=data_collator,  # data collator
        tokenizer=tokenizer,  # tokenizer
        compute_metrics=compute_metrics,  # the callback that computes metrics of interest
    )

    # train
    trainer.train()
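For context (an editorial note, not from the original report): bert-base checkpoints such as dbmdz/bert-base-italian-cased ship with 512 learned position embeddings, so any input longer than 512 tokens, whether because truncation is off or because max_length is raised above 512, produces a size mismatch like the RuntimeError reported above. The limit can be checked directly; a small sketch:

from transformers import AutoConfig

# bert-base checkpoints expose their position-embedding size in the config;
# inputs longer than this cannot be fed to the model without resizing the embeddings
config = AutoConfig.from_pretrained("dbmdz/bert-base-italian-cased")
print(config.max_position_embeddings)  # expected: 512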

Expected behavior

I expect to be able to train the network with the tokenizer arguments using the training function, and I expect inference to work when I try to classify a text.

marcomameli1992 avatar Feb 17 '23 18:02 marcomameli1992

I have found the solution. The problem was that the batch size for evaluation was greater than the number of rows in the evaluation dataset. Now I have set it to 16 and it works.

But now I have a problem with inference: I trained the model with the tokenizer max_length set to 1024 (and the model with the same setting), but when I use the model weights for inference I keep receiving the error about the tensor size:

RuntimeError: The size of tensor a (726) must match the size of tensor b (512) at non-singleton dimension 1

marcomameli1992 avatar Feb 18 '23 14:02 marcomameli1992
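Editorial aside (hedged, since the full traceback was never posted): setting the tokenizer's max_length to 1024 does not enlarge the checkpoint's 512 position embeddings, so a 726-token sequence still overflows the model. Truncating to 512 at inference time avoids the mismatch; a minimal sketch, assuming the model, tokenizer, data and device objects from classification_infer above and that the model has been moved to device:

import torch

# tokenize with truncation so no sequence exceeds the checkpoint's 512 positions,
# then run the sequence-classification model directly
enc = tokenizer(data['usefull'].tolist(), padding=True, truncation=True,
                max_length=512, return_tensors='pt').to(device)
with torch.no_grad():
    logits = model(**enc).logits
data['Classe'] = logits.argmax(dim=-1).tolist()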

Hey! Could you give more details on the exact traceback you are getting? I have no idea where it comes from, so I can't really help; it could be a problem with loading the checkpoint or anything else. Also, could you share a simple inference script that reproduces the issue? Thanks!

ArthurZucker avatar Feb 20 '23 07:02 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 20 '23 15:03 github-actions[bot]