
ValueError: A column mapping must be provided when the dataset does not contain the following columns: {'text', 'label'}

Open Cheril184 opened this issue 1 year ago • 12 comments

I am getting this ValueError in trainer.evaluate(), despite my dataset containing columns named 'text' and 'label'.

My dataset dict looks like this: Dataset({ features: ['text', 'reason', 'label'], num_rows: 2062 })

Cheril184 avatar Dec 10 '22 13:12 Cheril184

@LuketheDukeBates can you please help?

Cheril184 avatar Dec 10 '22 17:12 Cheril184

Hello @Cheril184,

Map the features like this, using column_mapping to map your column names onto the 'text' and 'label' columns. Please also share the full error for better understanding.

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"text": "text", "label": "label"},
)

theainerd avatar Dec 11 '22 13:12 theainerd

If the datasets contain the text and label columns, then a column mapping should not be needed.

I'm unable to reproduce your issue. I'm using the following dummy script in my attempt:

from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"], "reason": ["d", "e", "f"], "label": [0, 1, 2]})
print(ds)

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=ds,
    eval_dataset=ds,
    num_iterations=20,
)

trainer.train()
print(trainer.evaluate())

This outputs

Dataset({
    features: ['text', 'reason', 'label'],
    num_rows: 3
})
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
***** Running training *****
  Num examples = 120
  Num epochs = 1
  Total optimization steps = 8
  Total train batch size = 16
Epoch:   0%|                                                                                                                                                                    | 0/1 [00:00<?, ?it/s]
Iteration:   0%|                                                                                                                                                                | 0/8 [00:00<?, ?it/s] 
Iteration:  12%|███████████████████                                                                                                                                     | 1/8 [00:02<00:15,  2.27s/it]
Iteration:  25%|██████████████████████████████████████                                                                                                                  | 2/8 [00:02<00:07,  1.28s/it]
Iteration:  38%|█████████████████████████████████████████████████████████                                                                                               | 3/8 [00:02<00:04,  1.07it/s]
Iteration:  50%|████████████████████████████████████████████████████████████████████████████                                                                            | 4/8 [00:03<00:03,  1.31it/s]
Iteration:  62%|███████████████████████████████████████████████████████████████████████████████████████████████                                                         | 5/8 [00:03<00:01,  1.52it/s]
Iteration:  75%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                      | 6/8 [00:03<00:01,  1.70it/s]
Iteration:  88%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                   | 7/8 [00:04<00:00,  1.86it/s]
Iteration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:04<00:00,  1.84it/s]
Epoch: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.35s/it] 
***** Running evaluation *****
{'accuracy': 1.0}

Which is to be expected.

Could you provide a way for us to reproduce your issue, e.g. via your own script or by modifying mine?

  • Tom Aarsen

tomaarsen avatar Dec 12 '22 20:12 tomaarsen

@tomaarsen I have a CSV file which I am loading using the HF load_dataset function. I can share my CSV files if you want.

Cheril184 avatar Dec 15 '22 07:12 Cheril184

That might help. Furthermore, I am interested in how you initialize the SetFitTrainer. Perhaps the issue lies there somewhere?

tomaarsen avatar Dec 15 '22 08:12 tomaarsen

FWIW, one reason this might happen is that the default CSV loading script in datasets (version 2.8.0 at the time of writing) automatically assigns all rows to the training split if you don't specify one (unlike e.g. TensorFlow Datasets, where the default is no train-test split in the dataset):

import datasets
with open('example.csv', 'w') as example:
    example.writelines(["label,text\n","1,good\n","0,terrible\n"])
features = datasets.Features(
    { 
    'label': datasets.ClassLabel(num_classes=2, names=['positive','negative']),
    'text': datasets.Value('string')
    })
dataset = datasets.load_dataset("csv", data_files="./example.csv", features=features)
print(dataset)
# Prints
# DatasetDict({
#     train: Dataset({
#         features: ['label', 'text'],
#         num_rows: 2
#     })
# })

If this is the case, just pass the training split dataset['train'] to the trainer!

EDIT: The original example here did not pass features to load_dataset, but the behavior of automatically assigning the rows to the training split remains the same.
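
For instance, a minimal sketch (reusing the example.csv written above) of selecting the 'train' split before handing the data to SetFitTrainer:

from datasets import load_dataset
from setfit import SetFitModel, SetFitTrainer

dataset = load_dataset("csv", data_files="./example.csv")  # a DatasetDict with a single 'train' split
train_ds = dataset["train"]  # a plain Dataset with 'label' and 'text' columns

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,  # pass the split, not the whole DatasetDict
    eval_dataset=train_ds,
    num_iterations=20,
)
trainer.train()
print(trainer.evaluate())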

jaalu avatar Jan 26 '23 13:01 jaalu

@tomaarsen I'm also getting this issue. I attempted to fix it by passing column_mapping={"text": "text", "label": "label"}, but I'm still getting the error: ValueError: The following columns are missing from the dataset: {'text'}. Please provide a mapping for all required columns.

I am using the Dataset.from_pandas method.

lbluett avatar Jan 30 '23 04:01 lbluett

@lbluett Does this dummy script work?

from setfit import SetFitModel, SetFitTrainer
import datasets
import pandas as pd

df = pd.DataFrame([
    {'text': 'a', 'label': 0},
    {'text': 'b', 'label': 1},
    {'text': 'c', 'label': 2},
    ])
features = datasets.Features(
    { 
    'label': datasets.ClassLabel(num_classes=3, names=['a','b','c']),
    'text': datasets.Value('string')
    })
ds = datasets.Dataset.from_pandas(df, features=features)
print(ds)

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=ds,
    eval_dataset=ds,
    num_iterations=20,
)

trainer.train()
print(trainer.evaluate())

Also worth noting that from_pandas does not exhibit the same behavior as the CSV loading script, and will not set a training split (as expected):

import pandas as pd
import datasets

features = datasets.Features(
    { 
    'label': datasets.ClassLabel(num_classes=2, names=['positive','negative']),
    'text': datasets.Value('string')
    })
df = pd.DataFrame([{'label': 1, 'text': 'good'}, {'label': 0, 'text': 'terrible'}])
print(datasets.Dataset.from_pandas(df, features=features))
# Prints
# Dataset({
#     features: ['label', 'text'],
#     num_rows: 2
# })
# note: no training split

jaalu avatar Jan 30 '23 09:01 jaalu

I'm getting the same error. Would love to know how this can be solved.

EDIT: It worked eventually. It's a bit of a long shot, but here's what I did:

  1. I had train.csv and test.csv, each having 2 columns: "text" and "label"
  2. train_df = pd.read_csv("train.csv") and test_df = pd.read_csv("test.csv")
  3. train_text = train_df["text"] and train_label = train_df["label"]
  4. train_data = Dataset.from_dict({ "text" : train_text, "label" : train_label })
  5. Repeat steps 3 and 4 for test_data
  6. Finally, pass train_data and test_data to SetFitTrainer (see the consolidated sketch after this list)
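
Putting those steps together, a minimal sketch (assuming train.csv and test.csv each contain 'text' and 'label' columns, as in step 1):

import pandas as pd
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Read the CSVs and rebuild plain Dataset objects holding only 'text' and 'label'
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_data = Dataset.from_dict({"text": train_df["text"].tolist(), "label": train_df["label"].tolist()})
test_data = Dataset.from_dict({"text": test_df["text"].tolist(), "label": test_df["label"].tolist()})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    num_iterations=20,
)
trainer.train()
print(trainer.evaluate())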

nazianafis avatar Jan 30 '23 15:01 nazianafis

> @lbluett Does this dummy script work? […]

I managed to work around the issue by keeping my original column names and then using the column_mapping argument, for example: column_mapping={"Joined_description": "text", "Target Category": "label"}
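
For example, a minimal sketch of that workaround (assuming train_dataset and eval_dataset are Dataset objects that contain the original 'Joined_description' and 'Target Category' columns):

from setfit import SetFitModel, SetFitTrainer

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,  # assumed: keeps the original column names
    eval_dataset=eval_dataset,    # assumed: same schema as train_dataset
    num_iterations=20,
    # Map the original columns onto the 'text' and 'label' columns SetFit expects
    column_mapping={"Joined_description": "text", "Target Category": "label"},
)
trainer.train()
print(trainer.evaluate())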

lbluett avatar Jan 31 '23 02:01 lbluett

I think this might be more of a problem with Datasets' interface than with SetFit itself. The case where Datasets automatically assigns a training split for CSV files should be possible to detect and give an appropriate warning/error for, but without code to reproduce the issue, it's hard to tell which other gotchas should be warned about.
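
For instance, a minimal sketch of such a check (a hypothetical helper, not part of SetFit or Datasets) that flags the common case of passing a whole DatasetDict instead of a single split:

from datasets import DatasetDict

def ensure_single_split(ds):
    # Hypothetical helper: raise a clearer error when a DatasetDict is passed
    # where a single Dataset (e.g. dataset['train']) is expected.
    if isinstance(ds, DatasetDict):
        raise ValueError(
            f"Got a DatasetDict with splits {list(ds)}; "
            "pass a single split instead, e.g. dataset['train']."
        )
    return ds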

@lbluett That's good to hear! Still, it's odd that the column mapping is necessary. Does the script in the post you cited throw any errors when you run it?

@nazianafis Does it work if you first explicitly specify the features and then pick the training split? E.g.

features = datasets.Features(
    { 
    'label': datasets.ClassLabel(num_classes=2, names=['positive','negative']),
    'text': datasets.Value('string')
    })
train_data = datasets.load_dataset('csv', data_files="./train.csv", features=features)['train']
# Not a typo: unless we specify a split, everything is assigned to 'train'
test_data = datasets.load_dataset('csv', data_files="./test.csv", features=features)['train']

If not, does it work if you substitute steps 3 through 5 with

train_data = datasets.Dataset.from_pandas(train_df[['text', 'label']])
test_data = datasets.Dataset.from_pandas(test_df[['text', 'label']])

jaalu avatar Feb 01 '23 11:02 jaalu

Does it work if you first explicitly specify the features and then pick the training split?

@jaalu It worked. Thanks a lot!

nazianafis avatar Feb 05 '23 17:02 nazianafis