transformers
run_qa.py on custom datasets raises TypeError: __init__() got an unexpected keyword argument 'field'
System Info
Hello,
I'm trying to train the QA model on SageMaker following the instructions, but I get TypeError: __init__() got an unexpected keyword argument 'field' when I try to use my own dataset.
I'm using a SageMaker instance, so every dependency in requirements.txt is already installed.
I checked the datasets code, and it seems that it no longer supports "field"?
Please fix this issue or let me know if there's something I did wrong.
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Running load_dataset directly produces the same error.
Expected behavior
run_qa.py runs successfully on SageMaker.
It may be installing old versions of the library so you have to pick up the corresponding version of the example (cc @philschmid for the exact versions)
Thank you for your reply! That's also my assumption. I basically just used the training code from https://huggingface.co/deepset/roberta-base-squad2 under train/SageMaker. It could be that the datasets version on my instance is too new, but in that case, which datasets version would you recommend? Thanks!
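Since the discussion turns on which library versions the training container actually has, a quick check of the installed versions can help. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of transformers or datasets.

```python
from importlib import metadata


def installed_versions(packages=("transformers", "datasets")):
    """Return a dict mapping each package name to its installed version.

    Hypothetical helper for illustration: packages that are not installed
    map to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package, e.g. from inside the training script.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this at the start of the entry point would show in the job logs whether the versions in the container match the branch of the example script being used.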
@TongJiL could you share the exact code snippet?
@philschmid
import sagemaker
from sagemaker.huggingface import HuggingFace

# IAM role used by the training job
role = sagemaker.get_execution_role()

# Arguments forwarded to run_qa.py
hyperparameters = {
    'model_name_or_path': 'deepset/roberta-base-squad2',
    'output_dir': '/opt/ml/model',
    'train_file': '/opt/ml/input/data/train/qa_train_data.csv',
}

# Fetch the example script from the matching transformers release
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

huggingface_estimator = HuggingFace(
    entry_point='run_qa.py',
    source_dir='./examples/pytorch/question-answering',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
)

# S3 location of the "train" input channel
data = {
    'train': "s3://my_s3_path/qa_train_data.csv"
}

huggingface_estimator.fit(data)
Turns out the "field" argument works for JSON but not for CSV.
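One way to act on this observation is to pass "field" only when the input file is JSON. The sketch below assumes the script forwards field to load_dataset regardless of file type, which is what the error suggests; build_load_kwargs is a hypothetical helper, not something run_qa.py provides.

```python
from pathlib import Path


def build_load_kwargs(data_file, json_field="data"):
    """Build keyword arguments for datasets.load_dataset from a file path.

    Hypothetical helper for illustration: the file extension is used as the
    builder name, and `field` is only included for JSON files, since the CSV
    loader does not accept it.
    """
    extension = Path(data_file).suffix.lstrip(".").lower()
    kwargs = {"path": extension, "data_files": data_file}
    if extension == "json":
        kwargs["field"] = json_field
    return kwargs


# Intended use (requires the datasets library):
# from datasets import load_dataset
# raw_datasets = load_dataset(**build_load_kwargs("qa_train_data.csv"))
```

Only the "csv" and "json" extensions are considered here; other file types would need their own handling.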
Based on the following documentation: https://huggingface.co/docs/datasets/loading, the field="data" argument applies when loading a dataset with code such as the following:
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json", field="data")
In this case, the loader looks for a JSON file in the following format, where "data" is the name of the field in the JSON file where the data is stored:
{"version": "0.1.0",
"data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
{"a": 4, "b": -5.5, "c": null, "d": true}]
}
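So this is why CSV files won't work: "field" is specific to the JSON loader, and the CSV loader rejects it as an unexpected keyword argument.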
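If keeping the script unchanged is preferable, another option is to convert the CSV file into a JSON file that stores its rows under a top-level "data" key. This is a minimal standard-library sketch; csv_to_json_with_field is a hypothetical helper, and it makes no assumption about which columns run_qa.py expects.

```python
import csv
import json


def csv_to_json_with_field(csv_path, json_path, field="data"):
    """Copy the rows of a CSV file into a JSON file under a top-level key.

    Hypothetical helper for illustration. Each CSV row becomes one JSON
    object keyed by the header names; all values stay strings.
    Returns the number of rows written.
    """
    with open(csv_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump({field: rows}, dst, ensure_ascii=False)
    return len(rows)
```

Because CSV cells are read as plain strings, any nested or numeric columns would need to be parsed separately before training.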