t-zero
How can I use t-zero to train on my own dataset? I could not find an API for this. Is it enough to add the dataset to the cache_data_dir? I do not want to reproduce the paper's results; I want to use this tool to train on my own datasets.
Hi @flyingwaters , did you have a look at https://github.com/bigscience-workshop/t-zero/tree/master/examples? I am not sure what it is specifically that you are trying to do, but it might be what you are looking for. The example lets you fine-tune a model on a given task/dataset.
Hi @VictorSanh, I ran into some problems while reproducing your results with t5==0.9.3.
I train the model on GPUs in an offline environment, so I downloaded sentencepiece.model manually
and passed these flags:
--gin_param="tsv_dataset_fn.vocabulary = SentencePieceVocabulary()"
--gin_param="get_sentencepiece_model_path = '/raid/yiptmp/huggingface-models/t5.1.1.lm100k.xxl'"
##################
but this fails with the following error:
SyntaxError: malformed node or string: <_ast.Name object at 0x7f42f90404d0>
Failed to parse token 'SentencePieceVocabulary'
######################
I think T0 is nice. Can you fix this bug? I also suspect the t5 version you used has since been updated; could you provide the requirements with pinned version numbers? I think that would help many researchers reproduce the work and develop this technique further.
Thanks for your work!
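The "malformed node or string" text in the error above is the message Python's `ast.literal_eval` raises when it is handed something that is not a plain literal. The following is a minimal sketch of that behaviour, assuming (the thread does not confirm this) that the gin parser evaluates the token `SentencePieceVocabulary` as a literal. The helper name `is_python_literal` is hypothetical and only used for illustration.

```python
import ast


def is_python_literal(token: str) -> bool:
    """Return True if `token` parses as a plain Python literal."""
    try:
        ast.literal_eval(token)
        return True
    except (ValueError, SyntaxError):
        # A bare identifier raises ValueError("malformed node or string ...").
        return False


# A bare name is not a literal, so literal evaluation rejects it.
print(is_python_literal("SentencePieceVocabulary"))  # False

# A quoted path is an ordinary string literal and is accepted.
print(is_python_literal("'/path/to/sentencepiece.model'"))  # True
```

This would explain why the quoted path in the second flag parses while the unquoted class name in the first one does not.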
@flyingwaters this seems related to the following issue, which is on the t5 codebase side: https://github.com/google-research/text-to-text-transfer-transformer/issues/513
maybe @lintangsutawika you've got a suggestion on how to proceed?
Something hacky that worked for me is to modify the t5/data/utils.py file in the text-to-text-transfer-transformer codebase. The diff:
-DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model" # GCS
+DEFAULT_SPM_PATH = "LOCAL_PATH_TO_SENTENCEPIECE_MODEL" # GCS
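For anyone who prefers to script the edit above rather than change the file by hand, here is a small sketch using only the standard library. It assumes the line in your installed copy of `t5/data/utils.py` matches the removed line of the diff exactly; the function name `patch_default_spm_path` is hypothetical and not part of t5.

```python
from pathlib import Path

# The assignment as it appears on the removed line of the diff above.
OLD_LINE = 'DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"'


def patch_default_spm_path(utils_py: Path, local_model: str) -> bool:
    """Point DEFAULT_SPM_PATH in `utils_py` at `local_model`.

    Returns True if the file was changed, False if the expected line
    was not found (for example, a different t5 version).
    """
    source = utils_py.read_text(encoding="utf-8")
    if OLD_LINE not in source:
        return False
    new_line = f'DEFAULT_SPM_PATH = "{local_model}"'
    utils_py.write_text(source.replace(OLD_LINE, new_line), encoding="utf-8")
    return True


# Example call (paths are placeholders, adjust to your checkout):
# patch_default_spm_path(Path("t5/data/utils.py"), "LOCAL_PATH_TO_SENTENCEPIECE_MODEL")
```

The function leaves the file untouched when the expected line is missing, so it is safe to run against a version of t5 where the default has already been changed.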
This is a long-standing issue that they haven't fixed, and they haven't given any timeline for a fix.
I recommend switching to T5X to retrain on your own dataset, or using HF's Trainer for your use case.