
How can I use t-zero to train on my own dataset? I could not find an API for this. Is it enough to add the dataset to the cache_data_dir? I do not want to reproduce the results; I want to use this tool to train on my own datasets.

Open flyingwaters opened this issue 2 years ago • 4 comments

flyingwaters avatar Mar 03 '22 06:03 flyingwaters

Hi @flyingwaters, did you have a look at https://github.com/bigscience-workshop/t-zero/tree/master/examples? I am not sure what specifically you are trying to do, but it might be what you are looking for. The example lets you fine-tune a model on a given task/dataset.

VictorSanh avatar Mar 04 '22 20:03 VictorSanh

Hi @VictorSanh, I ran into some problems when reproducing your results. With t5==0.9.3, I train the model on GPUs in an offline environment, so I downloaded sentencepiece.model and added these flags to the command:
--gin_param="tsv_dataset_fn.vocabulary = SentencePieceVocabulary()"
--gin_param="get_sentencepiece_model_path = '/raid/yiptmp/huggingface-models/t5.1.1.lm100k.xxl'"
However, this fails with the following error:

SyntaxError: malformed node or string: <_ast.Name object at 0x7f42f90404d0>
Failed to parse token 'SentencePieceVocabulary'

I think T0 is nice. Can you fix this bug? I also think the t5 version you used may have been updated since; could you provide the requirements with version numbers? I think it would help many researchers reproduce the results and develop this technique further. Thanks for your work!

flyingwaters avatar Mar 08 '22 15:03 flyingwaters

@flyingwaters this seems related to the following issue, which is on the t5 codebase side: https://github.com/google-research/text-to-text-transfer-transformer/issues/513
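
For context on the message itself: the "malformed node or string: <_ast.Name object ...>" text matches what Python's `ast.literal_eval` raises when it is given a bare identifier instead of a literal. The sketch below only reproduces that Python-level behavior. It assumes, without confirmation from this thread, that the gin parser evaluates the value `SentencePieceVocabulary` this way. The helper name `try_literal` is made up for illustration.

```python
import ast


def try_literal(token: str):
    """Return (True, value) if token is a Python literal, else (False, error text)."""
    try:
        return True, ast.literal_eval(token)
    except (ValueError, SyntaxError) as err:
        return False, str(err)


# A quoted string is a literal, so it evaluates without error.
print(try_literal("'/path/to/sentencepiece.model'"))

# A bare identifier parses to an ast.Name node, which literal_eval rejects.
print(try_literal("SentencePieceVocabulary"))
```

Gin's own syntax for referring to a configurable is the `@name` or `@name()` form. Whether that form resolves the problem here depends on how the class is registered in t5, which this thread does not confirm.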

@lintangsutawika, maybe you have a suggestion on how to proceed?

Something hacky that worked for me is to modify the t5/data/utils.py file in the text-to-text-transfer-transformer codebase. The diff:

-DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"  # GCS
+DEFAULT_SPM_PATH = "LOCAL_PATH_TO_SENTENCEPIECE_MODEL"  # GCS
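
If you would rather script this edit than make it by hand, here is a minimal sketch that applies the same one-line change to the text of utils.py. The function name `patch_default_spm_path` and the path `/data/sentencepiece.model` are hypothetical. The sketch assumes the file contains a single top-level `DEFAULT_SPM_PATH = ...` assignment, as in the diff above.

```python
import re

# Matches the whole DEFAULT_SPM_PATH assignment line (assumed to be on one line).
_SPM_LINE = re.compile(r"^DEFAULT_SPM_PATH[ \t]*=.*$", flags=re.MULTILINE)


def patch_default_spm_path(source: str, local_path: str) -> str:
    """Return `source` with DEFAULT_SPM_PATH pointing at `local_path`."""
    new_line = f"DEFAULT_SPM_PATH = {local_path!r}  # local copy"
    # A callable replacement avoids backslash-escape handling in the path.
    patched, count = _SPM_LINE.subn(lambda _match: new_line, source, count=1)
    if count == 0:
        raise ValueError("DEFAULT_SPM_PATH assignment not found")
    return patched


# Example on the line from the diff; the local path is a placeholder.
sample = 'DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"  # GCS\n'
print(patch_default_spm_path(sample, "/data/sentencepiece.model"))
```

To apply it, read your installed copy of t5/data/utils.py, pass the text through the function, and write the result back. Keep a backup first, since this edits the installed package in place.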

VictorSanh avatar Mar 11 '22 16:03 VictorSanh

This is a long-standing issue on the t5 side; they have not fixed it or given any timeline for a fix.

I recommend switching to T5X to retrain on your own dataset, or using HF's trainer library for your use case.

lintangsutawika avatar Mar 14 '22 02:03 lintangsutawika