transformers

run_qa.py on custom datasets raises TypeError: __init__() got an unexpected keyword argument 'field'

Open TongJiL opened this issue 1 year ago • 5 comments

System Info

Hello,

I'm trying to train the QA model on SageMaker following the instructions, but I get TypeError: __init__() got an unexpected keyword argument 'field' when I try to use my own dataset.

I used a SageMaker instance, so every dependency in requirements.txt is already installed.

I checked the datasets code, and it seems that it does not support "field" anymore?

Please fix this issue or let me know if there's something I did wrong.

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Running load_dataset directly gives the same error.

Expected behavior

run_qa.py runs successfully on SageMaker.

TongJiL, Mar 15 '23 19:03

It may be installing old versions of the library so you have to pick up the corresponding version of the example (cc @philschmid for the exact versions)

sgugger, Mar 15 '23 20:03

> It may be installing old versions of the library so you have to pick up the corresponding version of the example (cc @philschmid for the exact versions)

Thank you for your reply! That was also my assumption. I basically just used the training code from https://huggingface.co/deepset/roberta-base-squad2 under train/SageMaker. The datasets version on my instance could be too new; in that case, which datasets version would you recommend? Thanks!

TongJiL, Mar 15 '23 20:03

@TongJiL could you share the exact code snippet?

philschmid, Mar 15 '23 20:03

@philschmid

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Hyperparameters are passed to run_qa.py as command-line arguments.
hyperparameters = {
    'model_name_or_path': 'deepset/roberta-base-squad2',
    'output_dir': '/opt/ml/model',
    # SageMaker mounts the "train" channel under /opt/ml/input/data/train.
    'train_file': '/opt/ml/input/data/train/qa_train_data.csv',
}

# Check out the example scripts from the tag matching transformers_version below.
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

huggingface_estimator = HuggingFace(
    entry_point='run_qa.py',
    source_dir='./examples/pytorch/question-answering',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
)

# The training data is a CSV file on S3, uploaded as the "train" channel.
data = {
    'train': "s3://my_s3_path/qa_train_data.csv"
}
huggingface_estimator.fit(data)
```

TongJiL, Mar 15 '23 20:03

It turns out that "field" works for JSON but not for CSV.
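For anyone who needs to keep a CSV training file, one possible workaround is to pass field only when the file is JSON. The sketch below is not part of run_qa.py: load_kwargs_for is a hypothetical helper name, and it assumes the script otherwise passes field="data" to load_dataset regardless of file type.

```python
from pathlib import Path


def load_kwargs_for(train_file: str) -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    # "qa_train_data.csv" -> "csv", "train.json" -> "json"
    extension = Path(train_file).suffix.lstrip(".").lower()
    kwargs = {"path": extension, "data_files": {"train": train_file}}
    if extension == "json":
        # Only the JSON loader accepts `field`; passing it for CSV
        # is what triggers the TypeError reported in this issue.
        kwargs["field"] = "data"
    return kwargs


print(load_kwargs_for("qa_train_data.csv"))
print(load_kwargs_for("qa_train_data.json"))
```

With the datasets library installed, the result could then be unpacked as load_dataset(**load_kwargs_for(train_file)).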

TongJiL, Mar 16 '23 22:03

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Apr 15 '23 15:04

According to the documentation at https://huggingface.co/docs/datasets/loading, field="data" applies when loading a JSON dataset with code such as the following:

from datasets import load_dataset
dataset = load_dataset("json", data_files="my_file.json", field="data")

In that case the loader looks for a JSON file in the following format, where "data" is the name of the top-level field that holds the records:

{"version": "0.1.0",
 "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
          {"a": 4, "b": -5.5, "c": null, "d": true}]
}

This is why CSV files won't work: field is an option of the JSON loader, and the CSV loader does not accept it.
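If converting the data is an option, a CSV file can be rewritten into the JSON layout above using only the standard library. This is a minimal sketch under stated assumptions: csv_to_field_json is a hypothetical helper, the "version" value is simply copied from the example above, and every CSV value stays a string. run_qa.py may also expect specific columns (for example SQuAD-style answers), which this sketch does not construct.

```python
import csv
import json


def csv_to_field_json(csv_path: str, json_path: str, field: str = "data") -> int:
    """Write the rows of a CSV file under a top-level `field` key in a JSON file.

    Returns the number of rows written.
    """
    with open(csv_path, newline="", encoding="utf-8") as src:
        # One dict per row, keyed by the CSV header; all values are strings.
        rows = list(csv.DictReader(src))
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump({"version": "0.1.0", field: rows}, dst, ensure_ascii=False, indent=2)
    return len(rows)
```

The resulting file should then be loadable with load_dataset("json", data_files=json_path, field="data"), as in the snippet from the documentation.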

ngnatk, Oct 09 '23 12:10