Improve and rework GPT-tfjs
Here is a list of potential improvements for gpt-tfjs in Disco:
- [x] Create a compile method to initialize the optimizer (rather than initializing it when fitDataset is called). This ensures the optimizer state is persisted across multiple calls to fitDataset (see the optimizer sketch after this list)
- [x] Implement save and load methods to save and re-use a trained model (see the save/load sketch after this list)
- [x] Rename classes for better clarity and consistency, e.g. multiple classes and functions are called GPT
- [x] Assess whether we can use tf.CustomCallbackArgs rather than redefining an interface for TrainingCallbacks
- [x] Assess whether we can use TF.js' native fitDataset method rather than overriding it with a custom training loop -> TF.js only implements Adam while GPT-2 uses AdamW; the custom optimizer makes it possible to apply the weight decay used in the original GPT-2 (see the weight decay sketch after this list)
- [ ] Rework the GPT-tfjs config (learning rate, number of iterations) as Disco parameters rather than hard-coded values (see the config sketch after this list)
- [x] TF.js only supports reading text files line by line, which is not ideal for LLM inputs; implement a file reader that reads chunk by chunk rather than line by line (see the chunk reader sketch after this list)
- [ ] To use a trained model in Disco to generate text, we have to get the model instance through the aggregator. Implement a better interface to access the language generation API.
- [ ] Make sure pad tokens are ignored in the loss computation, similarly to PyTorch ignoring -100 as the padding token (see the masked loss sketch after this list). An example of how to do that can be found here.
- [ ] There is a memory leak in the model disposal: one tensor per attention layer is still not disposed after calling model.dispose (see the leak check sketch after this list). Edit: the federated/decentralized mechanism also allocates new tensors every round #683
- [x] Training with gpt2 has a NaN loss after the first step
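
A minimal sketch of the compile idea, assuming a hypothetical GPT-style wrapper class (names are illustrative, not the actual Disco classes): the optimizer is created once and reused, so its Adam state persists across repeated fitDataset calls.

```typescript
import * as tf from '@tensorflow/tfjs'

// Illustrative wrapper: the optimizer is created in compile() instead of
// inside fitDataset(), so its internal state (Adam moments) persists
// across multiple training calls.
class GPTWrapperSketch {
  private optimizer?: tf.Optimizer

  compile(learningRate: number): void {
    // Only create the optimizer if it does not exist yet
    if (this.optimizer === undefined) {
      this.optimizer = tf.train.adam(learningRate)
    }
  }

  trainStep(lossFn: () => tf.Scalar): number {
    if (this.optimizer === undefined) {
      throw new Error('call compile() before training')
    }
    // minimize() updates the trainable variables and the optimizer slots
    const loss = this.optimizer.minimize(lossFn, true) as tf.Scalar
    const value = loss.dataSync()[0]
    loss.dispose()
    return value
  }
}
```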
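For save/load, the simplest route works if the wrapper exposes a tf.LayersModel; this is a sketch under that assumption (the storage key disco-gpt is made up), and the actual gpt-tfjs model may need custom weight serialization.

```typescript
import * as tf from '@tensorflow/tfjs'

// Save to and load from IndexedDB in the browser (Node would use a
// 'file://...' URL instead). 'indexeddb://disco-gpt' is an arbitrary key.
async function saveGPT(model: tf.LayersModel): Promise<void> {
  await model.save('indexeddb://disco-gpt')
}

async function loadGPT(): Promise<tf.LayersModel> {
  return tf.loadLayersModel('indexeddb://disco-gpt')
}
```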
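Since TF.js only ships Adam, AdamW-like behaviour can be approximated by applying decoupled weight decay after the Adam update; a sketch of one training step (function and parameter names are illustrative):

```typescript
import * as tf from '@tensorflow/tfjs'

// AdamW-style step: run the regular Adam update, then shrink each weight
// towards zero with a decay term that is decoupled from the gradients.
function adamWStep(
  optimizer: tf.Optimizer,
  lossFn: () => tf.Scalar,
  weights: tf.Variable[],
  learningRate: number,
  weightDecay: number
): void {
  optimizer.minimize(lossFn) // standard Adam update of the variables
  tf.tidy(() => {
    for (const w of weights) {
      // w <- w - lr * weightDecay * w
      w.assign(w.sub(w.mul(learningRate * weightDecay)))
    }
  })
}
```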
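A possible shape for the parameters that should move from the hard-coded GPT config into Disco's task settings (field names are suggestions, not existing Disco fields):

```typescript
// Candidate training parameters to expose through Disco instead of
// hard-coding them in the gpt-tfjs config.
interface GPTTrainingConfig {
  learningRate: number  // e.g. 1e-3
  maxIterations: number // training iterations per round
  batchSize: number
  blockSize: number     // context length in tokens
}
```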
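A chunk-by-chunk reader can be built on Node streams instead of TF.js' line-based text reading; a sketch assuming a Node environment:

```typescript
import * as fs from 'node:fs'

// Yields fixed-size text chunks, which suits LLM tokenization better than
// reading line by line.
async function* readChunks(path: string, chunkSize = 1 << 16): AsyncGenerator<string> {
  const stream = fs.createReadStream(path, {
    encoding: 'utf8',
    highWaterMark: chunkSize, // bytes per chunk
  })
  for await (const chunk of stream) {
    yield chunk as string
  }
}

// Usage: feed each chunk to the tokenizer instead of splitting on newlines, e.g.
// for await (const chunk of readChunks('dataset.txt')) { /* tokenize(chunk) */ }
```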
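A sketch of a loss that ignores pad positions, mirroring PyTorch's ignore_index=-100; padTokenId is whatever id the tokenizer uses for padding (an assumption here, not an existing Disco constant):

```typescript
import * as tf from '@tensorflow/tfjs'

// labels: [batch, seq] integer token ids; logits: [batch, seq, vocab].
// Positions whose label equals padTokenId contribute nothing to the loss.
function maskedCrossEntropy(
  labels: tf.Tensor2D,
  logits: tf.Tensor3D,
  padTokenId: number
): tf.Scalar {
  return tf.tidy(() => {
    const vocabSize = logits.shape[2]
    const oneHot = tf.oneHot(labels.cast('int32'), vocabSize).cast('float32')
    const logProbs = tf.logSoftmax(logits)
    // Per-token negative log-likelihood, shape [batch, seq]
    const perToken = oneHot.mul(logProbs).sum(-1).neg()
    // 1 for real tokens, 0 for padding
    const mask = labels.notEqual(tf.scalar(padTokenId, 'int32')).cast('float32')
    // Average over non-padded positions only
    return perToken.mul(mask).sum().div(mask.sum().maximum(1)) as tf.Scalar
  })
}
```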
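To pin down the disposal leak, comparing tf.memory().numTensors before and after building and disposing the model gives the number of leaked tensors; a quick sketch, assuming the model exposes a dispose() method:

```typescript
import * as tf from '@tensorflow/tfjs'

// Returns the number of tensors that survive model.dispose(); a positive
// value (e.g. one per attention layer) indicates a disposal leak.
function countLeakedTensors(buildModel: () => { dispose(): void }): number {
  const before = tf.memory().numTensors
  const model = buildModel()
  model.dispose()
  return tf.memory().numTensors - before
}
```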
#656 and #657 should be addressed first