
Distillation training for the Arabic language


System Info

I encountered two issues while running the binarized_data.py and train.py scripts from the Knowledge Distillation of BERT example, applied to an Arabic language model. The details of each issue are below:

  1. In the binarized_data.py script, I had to modify line 83 to make it work. The original line is:

    dp_file = f"{args.dump_file}.{args.tokenizer_name}.pickle"
    

    However, I had to remove the tokenizer_name variable and change the line to:

    dp_file = f"{args.dump_file}.pickle"
    

    This change was necessary because the Arabic BERT model name, "asafaya/bert-large-arabic", contains a forward slash ("/"), so interpolating it into the dump file name via the tokenizer_name variable produced an invalid file path.
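Rather than dropping the tokenizer name entirely, the slash could be replaced with a filesystem-safe character so the dump file still records which tokenizer produced it. A minimal sketch (the helper name `safe_dump_path` is hypothetical, not part of the script):

```python
def safe_dump_path(dump_file: str, tokenizer_name: str) -> str:
    # Hub model IDs such as "asafaya/bert-large-arabic" contain "/",
    # which the filesystem would treat as a directory separator.
    # Replace it with "_" before embedding the name in the file name.
    safe_name = tokenizer_name.replace("/", "_")
    return f"{dump_file}.{safe_name}.pickle"
```

With this helper, line 83 could become `dp_file = safe_dump_path(args.dump_file, args.tokenizer_name)` and would work for both canonical checkpoint names and Hub model IDs.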

  2. In the train.py script, I made a modification on line 258. The original line is:

    args.max_model_input_size = tokenizer.max_model_input_sizes[args.teacher_name]
    

    However, I had to change it to:

    args.max_model_input_size = tokenizer.max_model_input_sizes['bert-large-uncased']
    

    This modification was necessary because the tokenizer's max_model_input_sizes dictionary only lists canonical checkpoint names and has no entry for my teacher model. It would be helpful if the script handled arbitrary teacher checkpoints automatically, allowing for more flexibility.
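One way to avoid hard-coding a checkpoint name is to fall back to the tokenizer's own length metadata when the lookup fails. A sketch under the assumption that the tokenizer exposes `max_model_input_sizes` and `model_max_length` as in transformers (the helper name and the default of 512 are my own choices, not part of the script):

```python
def get_max_model_input_size(tokenizer, teacher_name, default=512):
    # max_model_input_sizes only covers canonical names such as
    # "bert-large-uncased"; Hub IDs like "asafaya/bert-large-arabic"
    # are usually missing from it.
    sizes = getattr(tokenizer, "max_model_input_sizes", None) or {}
    if teacher_name in sizes:
        return sizes[teacher_name]
    # model_max_length is set to a huge sentinel value when unknown,
    # so only trust it below a sanity threshold.
    max_len = getattr(tokenizer, "model_max_length", None)
    if max_len is not None and max_len < 1_000_000:
        return max_len
    return default
```

Line 258 could then read `args.max_model_input_size = get_max_model_input_size(tokenizer, args.teacher_name)`, which degrades gracefully instead of raising a KeyError for unlisted models.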

Apart from these script modifications, I also edited the config files to match the models I am using. This is expected, since my model's config differs from the one shipped in the folder, but perhaps the script could be extended to download and locate the necessary config file automatically.

Please let me know if there are any further clarifications needed or if you require additional information to address these issues.

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Here is a link to the Google Colab notebook that reproduces the problem: https://colab.research.google.com/drive/1OqSvRNMl0-Z7ScCd6hLbPHMO-ZXT3WEw?usp=sharing

Expected behavior

The model should start training smoothly, and the script should be able to handle model names that contain '/'.

muhammed-saeed avatar May 31 '23 01:05 muhammed-saeed

Please use the forums for such questions. This is not a maintained example.

sgugger avatar May 31 '23 13:05 sgugger

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 30 '23 15:06 github-actions[bot]