t-zero
                                
T0 (p=1) replicability
Hi @VictorSanh
Thanks for releasing the code and data. I am trying to retrain it in PyTorch and have some questions. Your paper reports results for p=1 and p=5.7.
Say for p=1 we take one random prompt per example of a dataset. That part is perfectly clear.
I have some doubts about the following:
1) Sampling strategy: "proportional to the number of examples in each dataset (we treated any dataset with over 500'000 examples as having 500'000/num_templates examples)".
Does this mean that for big datasets like gigaword you include 422661 examples instead of 3803957?
2) On Hugging Face, the T0 page says "Fine-tuning steps: 12'200", but your script says
export TRAIN_STEPS=1112200. Any idea how many epochs you trained for?
3) Can you tell me the total number of samples included for p=1, given the tasks ['commonsense_qa', 'dream', 'quail', 'quartz', 'social_i_qa', 'wiqa', 'cosmos_qa', 'qasc', 'quarel', 'sciq', 'wiki_hop', 'adversarial_qa_dbert', 'adversarial_qa_dbidaf', 'adversarial_qa_droberta', 'quoref', 'duorc_ParaphraseRC', 'duorc_SelfRC', 'ropes', 'wiki_qa', 'common_gen', 'wiki_bio', 'app_reviews', 'amazon_polarity', 'imdb', 'rotten_tomatoes', 'gigaword', 'cnn_dailymail', 'multi_news', 'samsum', 'xsum', 'ag_news', 'dbpedia_14', 'trec', 'paws_labeled_final', 'glue_mrpc', 'glue_qqp', 'yelp_review_full', 'kilt_tasks_hotpotqa']?
I get Num examples = 3068602, obtained by taking p=1 from the individual datasets and, for datasets bigger than 500k, dividing the number of samples by the number of prompts. If you have the files for T0 (p=1) or (p=5.7), would you mind sharing them?
4) Example grouping: "We use packing to combine multiple training examples into a single sequence to reach the maximum sequence length." I am not sure what this is. Is it necessary, and how can we do it?
Thanks for your patience, @tuhinjubcse.
- Sampling strategy: proportional to the number of examples in each dataset (we treated any dataset with over 500'000 examples as having 500'000/num_templates examples)
Does this mean that for big datasets like gigaword you include 422661 examples instead of 3803957?
I'll let @awebson confirm!
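Until that is confirmed, here is a small sketch of the two possible readings of the capping rule, so the difference is concrete. The function names are hypothetical, and the template count of 9 for gigaword is an assumption inferred from the question's own number (3803957 // 9 == 422661). This does not claim to be what the training code actually does.

```python
CAP = 500_000  # threshold quoted from the paper


def size_cap_divided(num_examples: int, num_templates: int, cap: int = CAP) -> int:
    """Literal reading: a dataset above the cap counts as cap / num_templates examples."""
    if num_examples > cap:
        return cap // num_templates
    return num_examples


def size_total_divided(num_examples: int, num_templates: int, cap: int = CAP) -> int:
    """Reading used in the question: a dataset above the cap counts as
    num_examples / num_templates examples."""
    if num_examples > cap:
        return num_examples // num_templates
    return num_examples


GIGAWORD_EXAMPLES = 3_803_957
ASSUMED_TEMPLATES = 9  # assumption, inferred from 3803957 // 9 == 422661

literal_reading = size_cap_divided(GIGAWORD_EXAMPLES, ASSUMED_TEMPLATES)
question_reading = size_total_divided(GIGAWORD_EXAMPLES, ASSUMED_TEMPLATES)
print(literal_reading, question_reading)
```

The two readings differ by almost an order of magnitude for gigaword, so a total such as Num examples = 3068602 depends heavily on which one is intended.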
- On Hugging Face, the T0 page says Fine-tuning steps: 12'200, but your script says export TRAIN_STEPS=1112200. Any idea how many epochs you trained for?
Yes, we trained for 12'200 steps (I don't think we ever reached even one epoch). The 1'112'200 comes from 1'000'000 steps of T5 pretraining + 100'000 LM steps to obtain t5-lm + 12'200 steps of multitask fine-tuning.
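As a quick check of that breakdown, the three step counts from the reply sum to the value in the script. The variable names below are only illustrative.

```python
# Step counts as stated in the reply above.
T5_PRETRAINING_STEPS = 1_000_000
LM_ADAPTATION_STEPS = 100_000        # the steps used to obtain t5-lm
MULTITASK_FINETUNING_STEPS = 12_200  # the only steps specific to T0

train_steps = (
    T5_PRETRAINING_STEPS + LM_ADAPTATION_STEPS + MULTITASK_FINETUNING_STEPS
)
print(train_steps)  # matches export TRAIN_STEPS=1112200
```

So a reimplementation that starts from an already LM-adapted T5 checkpoint would presumably only need the final 12'200 steps.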
- Can you tell me the total number of samples included for p=1, given the tasks ['commonsense_qa', 'dream', 'quail', 'quartz', 'social_i_qa', 'wiqa', 'cosmos_qa', 'qasc', 'quarel', 'sciq', 'wiki_hop', 'adversarial_qa_dbert', 'adversarial_qa_dbidaf', 'adversarial_qa_droberta', 'quoref', 'duorc_ParaphraseRC', 'duorc_SelfRC', 'ropes', 'wiki_qa', 'common_gen', 'wiki_bio', 'app_reviews', 'amazon_polarity', 'imdb', 'rotten_tomatoes', 'gigaword', 'cnn_dailymail', 'multi_news', 'samsum', 'xsum', 'ag_news', 'dbpedia_14', 'trec', 'paws_labeled_final', 'glue_mrpc', 'glue_qqp', 'yelp_review_full', 'kilt_tasks_hotpotqa']? I get Num examples = 3068602, obtained by taking p=1 from the individual datasets and, for datasets bigger than 500k, dividing the number of samples by the number of prompts. If you have the files for T0 (p=1) or (p=5.7), would you mind sharing them?
The mixtures t0_train_one_og_prompt and t0_train_all_og_prompts are what you need (see https://github.com/bigscience-workshop/t-zero/blob/master/training/README.md#data-preparation).
- Example grouping: We use packing to combine multiple training examples into a single sequence to reach the maximum sequence length. I am not sure what this is. Is it necessary, and how can we do it?
Since shapes in TF are fixed (not dynamic), we need to reduce padding as much as possible to make the best use of the compute. Packing means concatenating multiple inputs on the encoder side and predicting the concatenation of the targets. Code: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/dataset.py#L64
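For a PyTorch port, the idea can be sketched in plain Python as below. This is a minimal greedy packer written for illustration, not the Mesh TensorFlow implementation linked above. The function name, the dictionary keys, and the use of segment IDs (so that attention can later be masked between packed examples) are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[int], List[int]]  # (input token ids, target token ids)


def pack_examples(
    examples: List[Example], max_len: int, pad_id: int = 0
) -> List[Dict[str, List[int]]]:
    """Greedily concatenate examples until the next one would overflow max_len.

    Segment ids start at 1 within each packed sequence; 0 marks padding.
    """

    def new_pack() -> Dict[str, List[int]]:
        return {"inputs": [], "input_segments": [], "targets": [], "target_segments": []}

    def finalize(pack: Dict[str, List[int]]) -> Dict[str, List[int]]:
        # Pad every field to the fixed length max_len.
        padded = {}
        for key, values in pack.items():
            fill = 0 if key.endswith("segments") else pad_id
            padded[key] = values + [fill] * (max_len - len(values))
        return padded

    packed: List[Dict[str, List[int]]] = []
    current, segment = new_pack(), 0
    for inputs, targets in examples:
        # Truncate anything longer than a whole sequence.
        inputs, targets = inputs[:max_len], targets[:max_len]
        overflow = (
            len(current["inputs"]) + len(inputs) > max_len
            or len(current["targets"]) + len(targets) > max_len
        )
        if current["inputs"] and overflow:
            packed.append(finalize(current))
            current, segment = new_pack(), 0
        segment += 1
        current["inputs"] += inputs
        current["input_segments"] += [segment] * len(inputs)
        current["targets"] += targets
        current["target_segments"] += [segment] * len(targets)
    if current["inputs"]:
        packed.append(finalize(current))
    return packed


toy_examples = [([1, 2, 3], [4]), ([5, 6], [7, 8]), ([9, 10, 11, 12], [13])]
packed = pack_examples(toy_examples, max_len=6)
print(len(packed), packed[0]["inputs"], packed[0]["input_segments"])
```

Here the first two toy examples share one sequence and the third starts a new one. With dynamic padding in PyTorch, packing is probably an efficiency optimisation rather than a requirement, though that is an inference from the reply above and not something the authors state.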