
How to train a single model over multiple datasets

Open fenimi opened this issue 2 years ago • 10 comments

❓ Questions and Help

How can I build a model using multiple datasets sampled during training? By this I mean the model randomly samples data from DatasetA and DatasetB during training. The model should also sample one dataset more than the other. Can you point me to how to get that done in fairseq?

  • fairseq Version: 1.0.0a0+b5a039c
  • PyTorch Version: 1.10.0
  • OS (e.g., Linux): linux
  • How you installed fairseq (pip, source): git source
  • Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq && cd fairseq && pip install --editable ./
  • Python version: 3.8.10
  • CUDA/cuDNN version: cudacore/.11.0.2
  • GPU models and configuration: NVidia V100SXM2
  • Any other relevant information:

fenimi avatar May 26 '22 00:05 fenimi

I believe stock fairseq does not provide such a feature, at least not from the command line.

-- As for the implementation: if your data is not that huge ("huge" being judged by how much GPU memory the data consumes in total), one way is to write your own dataset class. Read fairseq/data/language_pair_dataset.py and copy, paste, and edit it into a new .py yourself. You then also have to write a custom task class that uses your dataset class. To import your custom code into fairseq at runtime, use --user-dir (you will need to look up how to use it; a minimal sketch follows).
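
Below is a minimal sketch of that --user-dir route, assuming a recent fairseq where TranslationTask is configured through TranslationConfig. The plugin directory name (my_plugin) and task name (ratio_translation) are made up for illustration; load_dataset is where you would swap in your copy-edited dataset class.

# my_plugin/__init__.py would contain:  from . import ratio_translation  # noqa
# my_plugin/ratio_translation.py:
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationConfig, TranslationTask


@register_task("ratio_translation", dataclass=TranslationConfig)
class RatioTranslationTask(TranslationTask):
    """Reuse TranslationTask, but build your own dataset (the copy-edited
    language_pair_dataset.py) inside load_dataset()."""

    def load_dataset(self, split, epoch=1, combine=False, **kwargs):
        # build the usual LanguagePairDataset first, then wrap or replace it
        # with your custom ratio-aware dataset here
        super().load_dataset(split, epoch=epoch, combine=combine, **kwargs)

# usage: fairseq-train ... --user-dir my_plugin --task ratio_translation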

Append every dataset into one file. Take note of which line starts each dataset, so that you get a region per dataset (e.g. lines 1-10000 are datasetA, lines 10001-15000 are datasetB, ...). In fairseq-preprocess everything gets preprocessed together, so that step is fine. In training, however, you have to write your own def collate (the method the dataset uses to create a batch) to make sure each batch contains the correct ratio of data across the regions; see the sketch below. You will also want to add two new command-line arguments to your custom task, one for the separating line indexes and one for the ratio among regions (or skip the arguments and hard-code them in your .py).
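
To make the region/ratio idea concrete, here is a rough standalone sketch (not fairseq API; region_bounds, ratio, and batch_size are illustrative names) of how indices for one batch could be drawn so that each region contributes its share:

import numpy as np

def sample_batch_indices(region_bounds, ratio, batch_size, rng=np.random):
    """region_bounds: [(start, end), ...] index ranges of each dataset region;
    ratio: e.g. [2/3, 1/3] means two thirds of the batch come from datasetA."""
    counts = (np.asarray(ratio) * batch_size).astype(int)
    counts[0] += batch_size - counts.sum()  # give any rounding remainder to the first region
    indices = []
    for (start, end), n in zip(region_bounds, counts):
        indices.append(rng.randint(start, end, size=n))
    return np.concatenate(indices)

# e.g. indices 0-9999 are datasetA, 10000-14999 are datasetB, sampled 2:1
batch_idx = sample_batch_indices([(0, 10000), (10000, 15000)], [2/3, 1/3], batch_size=32)

Your actual collate (or ordered_indices) would of course work on the preprocessed dataset rather than raw line numbers, but the bookkeeping is the same.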

-- If your data is huge, mixing it beforehand and splitting it into multiple folders is the only option. You sample epochs from datasetA and datasetB and write them into data_1_folder, data_2_folder, data_3_folder, ... yourself, outside of fairseq (you may also use C++ or another language to speed this up); see the sketch below. The data in each folder is a mixture of A and B at the correct ratio. Then you fairseq-preprocess them all.
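
As a rough sketch of that offline mixing (plain Python, nothing fairseq-specific; file names and the 2:1 repetition ratio are only examples), one pre-mixed epoch could be written like this, keeping the source/target pairs aligned:

import random

def write_mixed_epoch(a_prefix, b_prefix, out_prefix, src="src", tgt="tgt",
                      repeat_a=2, repeat_b=1, seed=0):
    """Read parallel files <prefix>.<src> / <prefix>.<tgt>, oversample A over B
    by simple repetition, shuffle the pairs, and write <out_prefix>.{src,tgt}."""
    def read_pairs(prefix):
        with open(f"{prefix}.{src}") as fs, open(f"{prefix}.{tgt}") as ft:
            return list(zip(fs, ft))

    pairs = read_pairs(a_prefix) * repeat_a + read_pairs(b_prefix) * repeat_b
    random.seed(seed)
    random.shuffle(pairs)
    with open(f"{out_prefix}.{src}", "w") as fs, open(f"{out_prefix}.{tgt}", "w") as ft:
        for s, t in pairs:
            fs.write(s)
            ft.write(t)

# one call per epoch folder, each with a different seed, then fairseq-preprocess each folder
write_mixed_epoch("datasetA/train", "datasetB/train", "data_1_folder/train", seed=1)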

In training, you can provide multiple data folders and fairseq will train over them in a round-robin fashion. Use a command like this: fairseq-train /path/data_1_folder:/path/data_2_folder:/path/data_3_folder --train-subset train --valid-subset valid .... When you do this, the model gets its first epoch from data_1_folder, its second epoch from data_2_folder, and so on; when the folders run out, the next epoch starts from data_1_folder again. Every folder must have its own train subset, but only the first folder needs the valid subset (the rest do not). This folder order is not shuffled by current fairseq, nor switched midway.

(The valid set is also a mixture of A and B.)

-- The first option requires a lot of fairseq extension, but once implemented, most of it can be driven from the fairseq command line. The second requires no fairseq extension, but a lot of code has to be run outside fairseq in advance.

You may also think of a third way: write a dataset that can use multiple data folders from the start. This is also a very good approach, but it requires a deeper understanding of fairseq/fairseq_cli/train.py, how it handles arguments, and how it creates datasets.

gmryu avatar May 26 '22 01:05 gmryu

Thank you. I will try it and let you know how it goes

fenimi avatar May 26 '22 22:05 fenimi

Hi, how is your experiment going? 🙂 @fenimi

yc1999 avatar Jun 23 '22 04:06 yc1999

@gmryu Hi! May I ask what you would suggest if I want to add new inputs (scalars) to the decoder at every time step and embed them with an embedding that is different from the one used for the text tokens of the target sentences? My current solution is to incorporate these scalars into the target sentences (the source sentences may be better) in the dataset, so the inputs can be sent to the model via prev_output_tokens. However, I have to change fairseq code all over the place. For example, binarization has to change so that the scalar part of a sentence is not encoded with the text dictionary, or uses a different dictionary.

martianmartina avatar Nov 12 '22 22:11 martianmartina

@martianmartina I guess your solution is not bad (though I do not understand what you mean by incorporating them into the target sentences). At first glance, I would add a new argument to the decoder's forward, say scalar_input_ids=None (or whatever default you want), and make LanguagePairDataset collate "scalar_input_ids" inside "net_input", alongside "src_tokens" and "prev_output_tokens"; a rough sketch is below. The code I would need to change is in LanguagePairDataset, the Transformer model, and the decoder implementation. About the binarization part: if this new input depends on the raw sentences, I guess it cannot be helped. If it depends on them little or not at all, you can build the inputs inside LanguagePairDataset's __init__; for a test run that is sufficient. If you want millions of new inputs, then you may have to write another way of loading them, too. (But extending binarization is cool.)
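
For the decoder side, a rough sketch of that signature change might look like the following. This is not fairseq API: the class name, the scalar_input_ids argument, and the 1000-entry embedding table are all assumptions, and the actual mixing of the scalar embedding into the token embeddings (inside extract_features) is left out.

import torch.nn as nn
from fairseq.models.transformer import TransformerDecoder


class ScalarDecoder(TransformerDecoder):
    def __init__(self, args, dictionary, embed_tokens, no_encoder_attn=False):
        super().__init__(args, dictionary, embed_tokens, no_encoder_attn)
        # a separate embedding table for the scalar inputs (size is a guess)
        self.scalar_embed = nn.Embedding(1000, embed_tokens.embedding_dim)

    def forward(self, prev_output_tokens, encoder_out=None,
                scalar_input_ids=None, **kwargs):
        # scalar_input_ids only arrives here if collate() put it inside
        # net_input and the model's forward passes it down to the decoder
        if scalar_input_ids is not None:
            scalar_emb = self.scalar_embed(scalar_input_ids)
            # ...add scalar_emb to the token embeddings inside extract_features();
            # that part is omitted in this sketch
        return super().forward(prev_output_tokens, encoder_out=encoder_out, **kwargs)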

gmryu avatar Nov 13 '22 15:11 gmryu

Thank you for your kind reply! My new inputs are features associated with every target token at every time step. Since there is one per target token, it is easier to extract them from prev_output_tokens than from src_tokens. So while creating my dataset, I interleaved the extra inputs with the target tokens. However, when calculating the loss, the target should no longer include those inputs. The workaround in my mind now is to hack LanguagePairDataset's collate so that it deletes all the extra inputs from target before returning the sample. I don't know whether there is something I haven't considered, because that keeps happening and I've already changed my implementation strategy so many times :(

This is all hacky and messy. A more elegant way might be to use an entire separate dataset (the same size as target) for the extra inputs, instead of adding them to the target sentences and extracting or deleting them when needed, but I think that would require a lot of code.

martianmartina avatar Nov 13 '22 16:11 martianmartina

@martianmartina I don't know if you still need my help; sorry, I am having trouble understanding your implementation. prev_output_tokens has the same content as target, but it is prev_output_tokens that gets passed to the decoder during training (along with the rest of "net_input"), not target, and only target is passed to the loss function. That is the part I do not understand: if you name your new input something other than target, wouldn't that solve it? Or maybe you mean you want this new input applied to the decoder only outside of training? In that case, you only need the training subset to not contain them, right?

gmryu avatar Nov 14 '22 03:11 gmryu

@gmryu I could still use your help, and sorry for the confusion about my implementation; it does seem weird. I added all my new inputs to the raw target text file. I guess a more normal way would be to pass a new FairseqDataset containing the new inputs into LanguagePairDataset's init, but I don't know how to create an IndexedCachedDataset (indexed_dataset.py), the same kind of dataset that target uses, from the raw text file of the new inputs, given that the new inputs don't need to be binarized.

martianmartina avatar Nov 14 '22 04:11 martianmartina

@martianmartina Okay, I had the same problem with IndexedCachedDataset and I chose to ignore it and use a plain list instead. It is very brave and cool of you to use those dataset classes, but I found that building a list inside LanguagePairDataset's __init__ is more direct and simple (though a little dirty from a coding point of view). Once you have set up the list (each entry is one of your new inputs), you can add them by editing LanguagePairDataset's __getitem__, like

# inside def __getitem__; new_ids is the list holding your new scalar inputs
if self.new_ids is not None and len(self.new_ids) > index:
    # you may not need this condition; I had it in case the original dataset is loaded
    example["new_ids"] = self.new_ids[index]
return example  # this is the last line of __getitem__

Then, you have access to this in def collate, like

if samples[0].get("new_ids", None) is not None:  # in case the original dataset is loaded
    new_ids = merge(
        "new_ids",
        left_pad=left_pad_source,
        # this depends on how long new_ids should be
        pad_to_length=pad_to_length["source"] if pad_to_length is not None else None,
    )
    new_ids = new_ids.index_select(0, sort_order)
    batch["net_input"]["new_ids"] = new_ids
    # "net_input" is what goes to the criterion and then to model(**net_input)

Well, I have not done anything further so this is all I have for now.
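
For completeness, here is one possible way (a sketch, not part of LanguagePairDataset; load_new_ids and the file format of one line of space-separated integers per target sentence are assumptions) to build the self.new_ids list that the snippets above rely on:

import torch

def load_new_ids(path):
    """Read one line of space-separated scalar ids per target sentence."""
    new_ids = []
    with open(path) as f:
        for line in f:
            new_ids.append(torch.tensor([int(tok) for tok in line.split()], dtype=torch.long))
    return new_ids

# inside your LanguagePairDataset subclass __init__:
#     self.new_ids = load_new_ids(new_ids_path) if new_ids_path is not None else None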

gmryu avatar Nov 16 '22 10:11 gmryu

@gmryu Hi thank you so much for your reply. I think it is a great workaround!

martianmartina avatar Nov 19 '22 22:11 martianmartina