Improve and rework GPT-tfjs
Here is a list of potential improvements for gpt-tfjs in Disco:
- [x] Create a compile method to initialize the optimizer (rather than initializing it when fitDataset is called). This ensures the optimizer state is persisted across multiple calls to fitDataset (see the optimizer sketch after this list)
- [x] Implement save and load methods to save and re-use a trained model (see the save/load sketch after this list)
- [x] Rename classes for better clarity and consistency, e.g. multiple classes and functions are called GPT
- [x] Assess whether we can use tf.CustomCallbackArgs rather than redefining an interface for TrainingCallbacks
- [x] Assess whether we can use TF.js' native fitDataset method rather than overriding it with a custom training loop -> TF.js only implements Adam while GPT-2 uses AdamW; the custom optimizer makes it possible to apply the weight decay used in the original GPT-2 (see the weight decay sketch after this list)
- [ ] Rework the GPT-tfjs config (learning rate, number of iterations) as Disco parameters rather than hard-coded values (see the config sketch after this list)
- [x] TF.js only supports reading text files line by line, which is not ideal for LLM inputs; implement a file reader that reads chunk by chunk rather than line by line (see the chunk reader sketch after this list)
- [ ] To use a trained model in Disco to generate text, we have to get the model instance through the aggregator. Implement a better interface to access the language generation API.
- [ ] Make sure pad tokens are ignored in the loss computation, similarly to PyTorch ignoring -100 as the padding token (see the masked loss sketch after this list). An example of how to do that can be found here.
- [ ] There is a memory leak in the model disposal: one tensor per attention layer is still not disposed after calling model.dispose (see the leak check sketch after this list). Edit: the federated/decentralized mechanism also allocates new tensors every round #683
- [x] Training with gpt2 has a NaN loss after the first step
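
A minimal sketch of the compile idea, assuming a hypothetical GPT-style wrapper class (names are illustrative, not the actual Disco classes): the optimizer is created once and reused, so its Adam state persists across repeated fitDataset calls.

```typescript
import * as tf from '@tensorflow/tfjs'

// Illustrative wrapper: the optimizer is created in compile() instead of
// inside fitDataset(), so its internal state (Adam moments) persists
// across multiple training calls.
class GPTWrapperSketch {
  private optimizer?: tf.Optimizer

  compile(learningRate: number): void {
    // Only create the optimizer if it does not exist yet
    if (this.optimizer === undefined) {
      this.optimizer = tf.train.adam(learningRate)
    }
  }

  trainStep(lossFn: () => tf.Scalar): number {
    if (this.optimizer === undefined) {
      throw new Error('call compile() before training')
    }
    // minimize() updates the trainable variables and the optimizer slots
    const loss = this.optimizer.minimize(lossFn, true) as tf.Scalar
    const value = loss.dataSync()[0]
    loss.dispose()
    return value
  }
}
```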
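For save/load, the simplest route works if the wrapper exposes a tf.LayersModel; this is a sketch under that assumption (the storage key disco-gpt is made up), and the actual gpt-tfjs model may need custom weight serialization.

```typescript
import * as tf from '@tensorflow/tfjs'

// Save to and load from IndexedDB in the browser (Node would use a
// 'file://...' URL instead). 'indexeddb://disco-gpt' is an arbitrary key.
async function saveGPT(model: tf.LayersModel): Promise<void> {
  await model.save('indexeddb://disco-gpt')
}

async function loadGPT(): Promise<tf.LayersModel> {
  return tf.loadLayersModel('indexeddb://disco-gpt')
}
```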
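Since TF.js only ships Adam, AdamW-like behaviour can be approximated by applying decoupled weight decay after the Adam update; a sketch of one training step (function and parameter names are illustrative):

```typescript
import * as tf from '@tensorflow/tfjs'

// AdamW-style step: run the regular Adam update, then shrink each weight
// towards zero with a decay term that is decoupled from the gradients.
function adamWStep(
  optimizer: tf.Optimizer,
  lossFn: () => tf.Scalar,
  weights: tf.Variable[],
  learningRate: number,
  weightDecay: number
): void {
  optimizer.minimize(lossFn) // standard Adam update of the variables
  tf.tidy(() => {
    for (const w of weights) {
      // w <- w - lr * weightDecay * w
      w.assign(w.sub(w.mul(learningRate * weightDecay)))
    }
  })
}
```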
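A possible shape for the parameters that should move from the hard-coded GPT config into Disco's task settings (field names are suggestions, not existing Disco fields):

```typescript
// Candidate training parameters to expose through Disco instead of
// hard-coding them in the gpt-tfjs config.
interface GPTTrainingConfig {
  learningRate: number  // e.g. 1e-3
  maxIterations: number // training iterations per round
  batchSize: number
  blockSize: number     // context length in tokens
}
```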
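A chunk-by-chunk reader can be built on Node streams instead of TF.js' line-based text reading; a sketch assuming a Node environment:

```typescript
import * as fs from 'node:fs'

// Yields fixed-size text chunks, which suits LLM tokenization better than
// reading line by line.
async function* readChunks(path: string, chunkSize = 1 << 16): AsyncGenerator<string> {
  const stream = fs.createReadStream(path, {
    encoding: 'utf8',
    highWaterMark: chunkSize, // bytes per chunk
  })
  for await (const chunk of stream) {
    yield chunk as string
  }
}

// Usage: feed each chunk to the tokenizer instead of splitting on newlines, e.g.
// for await (const chunk of readChunks('dataset.txt')) { /* tokenize(chunk) */ }
```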
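A sketch of a loss that ignores pad positions, mirroring PyTorch's ignore_index=-100; padTokenId is whatever id the tokenizer uses for padding (an assumption here, not an existing Disco constant):

```typescript
import * as tf from '@tensorflow/tfjs'

// labels: [batch, seq] integer token ids; logits: [batch, seq, vocab].
// Positions whose label equals padTokenId contribute nothing to the loss.
function maskedCrossEntropy(
  labels: tf.Tensor2D,
  logits: tf.Tensor3D,
  padTokenId: number
): tf.Scalar {
  return tf.tidy(() => {
    const vocabSize = logits.shape[2]
    const oneHot = tf.oneHot(labels.cast('int32'), vocabSize).cast('float32')
    const logProbs = tf.logSoftmax(logits)
    // Per-token negative log-likelihood, shape [batch, seq]
    const perToken = oneHot.mul(logProbs).sum(-1).neg()
    // 1 for real tokens, 0 for padding
    const mask = labels.notEqual(tf.scalar(padTokenId, 'int32')).cast('float32')
    // Average over non-padded positions only
    return perToken.mul(mask).sum().div(mask.sum().maximum(1)) as tf.Scalar
  })
}
```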
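To pin down the disposal leak, comparing tf.memory().numTensors before and after building and disposing the model gives the number of leaked tensors; a quick sketch, assuming the model exposes a dispose() method:

```typescript
import * as tf from '@tensorflow/tfjs'

// Returns the number of tensors that survive model.dispose(); a positive
// value (e.g. one per attention layer) indicates a disposal leak.
function countLeakedTensors(buildModel: () => { dispose(): void }): number {
  const before = tf.memory().numTensors
  const model = buildModel()
  model.dispose()
  return tf.memory().numTensors - before
}
```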
#656 and #657 should be addressed first