
[how-to-train] Link to a Google Colab version of the blogpost

julien-c opened this issue 5 years ago • 36 comments

cc @srush!

julien-c avatar Feb 15 '20 12:02 julien-c

@OP I’m working on it, will share when done. Thanks

aditya-malte avatar Feb 16 '20 15:02 aditya-malte

The config.json and the tokenizer config are missing.

djstrong avatar Feb 16 '20 20:02 djstrong

+1, because I'm really confused by the blogpost... In particular, I have no idea how to "combine" the tokenizer and dataset implemented in Python with the run_language_modeling.py script used for training, which seems to be intended to be run from the command line rather than from code... I'm admittedly a noob, but seeing how that is done would be extremely helpful.

Jazzpirate avatar Feb 21 '20 12:02 Jazzpirate

Check out this small example I have created: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb

aditya-malte avatar Feb 22 '20 13:02 aditya-malte

@julien-c , I have pruned the dataset to the first 200,000 samples so that the notebook runs quickly on Colab, as this is meant to be more of a quick tutorial on gluing several things together than a way to get SOTA performance. During actual training one could use the full data. Do share it with your network and star it if you find it useful 🤓.

aditya-malte avatar Feb 22 '20 14:02 aditya-malte

@aditya-malte Thanks a lot :) I'm still confused though. Both the original blog post and your notebook use ByteLevelBPETokenizer. If I save one of those (and rename the output files like your notebook does), I get two files, "merges.txt" and "vocab.json" (which in my case live in the folder "./tokenizer"). But if I point model_class.from_pretrained to the directory containing them (as your notebook does via the tokenizer_name flag), I get: OSError: Model name './tokenizer' was not found in tokenizers model name list (<long list of names>). We assumed './tokenizer' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

I originally thought that meant that the PreTrainedTokenizer class just isn't compatible with the way ByteLevelBPETokenizers are saved, but apparently it works in your notebook, so... what am I doing wrong? :(

Jazzpirate avatar Feb 22 '20 17:02 Jazzpirate

Hi, the easiest solution (and the one I have also used in my Colab notebook) is just to rename the files using !mv. I know this is a hack, but it currently seems to work.
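For reference, a minimal sketch of that rename hack in plain Python (the notebook does the same thing with !mv); the "-vocab.json" / "-merges.txt" names assume the tokenizer was saved with an empty name prefix:

```python
import os

# ByteLevelBPETokenizer.save(dir, "") writes "-vocab.json" and "-merges.txt";
# rename them to the file names transformers looks for.
os.rename("./tokenizer/-vocab.json", "./tokenizer/vocab.json")
os.rename("./tokenizer/-merges.txt", "./tokenizer/merges.txt")
```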

aditya-malte avatar Feb 22 '20 17:02 aditya-malte

@julien-c , this is another issue that I wanted to point out. While renaming does work, it is a bit confusing for the programmer and takes some time to figure out. Maybe the next release could also check for tokenizer files in the <model_name>-vocab.txt format, etc. Thanks

aditya-malte avatar Feb 22 '20 17:02 aditya-malte

I did rename them, as you did in the notebook, but I still get the error... If I interpret the error message correctly, it expects a vocab.txt, but your notebook uses vocab.json and merges.txt - and I don't think either of the two files corresponds to the vocab.txt it is looking for...?

Jazzpirate avatar Feb 22 '20 17:02 Jazzpirate

I’m not sure, I’ll have to see your code for that. Perhaps it’s just an incorrect path.

aditya-malte avatar Feb 22 '20 18:02 aditya-malte

It's the path to the folder containing the two files vocab.json and merges.txt - seemingly the same thing your notebook does, so I'm almost positive that's not it...

Do different models use different tokenizers? It's currently set to "bert", not "roberta" as in your notebook, but I'd be very surprised if that made a difference regarding the tokenizer file structure? :D

Jazzpirate avatar Feb 22 '20 18:02 Jazzpirate

Did you call from_pretrained using a BertTokenizer object or a PretrainedTokenizer object?

aditya-malte avatar Feb 22 '20 18:02 aditya-malte

@aditya-malte I'm doing it exactly like the script does... i.e. match on the model name and use

MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}

...so in my case, that would be BertTokenizer. The relevant part of my code is this:

    # Module-level imports this method relies on:
    #   import os
    #   from tokenizers import ByteLevelBPETokenizer
    #   from tokenizers.processors import BertProcessing
    def trainTokenizer(self, output_dir: str, file: str, tokenizer_class, vocab_size: int = 7000, min_frequency: int = 5):
        # Train a byte-level BPE tokenizer on a single text file.
        tokenizer = ByteLevelBPETokenizer()
        tokenizer.train(files=[file], vocab_size=vocab_size, min_frequency=min_frequency, special_tokens=[
            "<s>",
            "<pad>",
            "</s>",
            "<unk>",
            "<mask>",
        ])
        # Wrap every sequence in <s> ... </s>, RoBERTa-style.
        tokenizer._tokenizer.post_processor = BertProcessing(
            ("</s>", tokenizer.token_to_id("</s>")),
            ("<s>", tokenizer.token_to_id("<s>")),
        )
        tokenizer.enable_truncation(max_length=512)
        if not os.path.exists(output_dir + "/tokenizer"):
            os.makedirs(output_dir + "/tokenizer")
        # Saving with an empty name yields "-vocab.json" / "-merges.txt", so
        # rename them to the file names transformers expects.
        tokenizer.save(output_dir + "/tokenizer", "")
        os.rename(output_dir + "/tokenizer/-merges.txt", output_dir + "/tokenizer/merges.txt")
        os.rename(output_dir + "/tokenizer/-vocab.json", output_dir + "/tokenizer/vocab.json")
        return tokenizer_class.from_pretrained(output_dir + "/tokenizer", cache_dir=output_dir + "/cache")

Jazzpirate avatar Feb 22 '20 18:02 Jazzpirate

Hmm, that’s strange. What are your versions of Transformers and Tokenizers? Why use a cache_dir, btw, if you’re not downloading from S3?

aditya-malte avatar Feb 22 '20 18:02 aditya-malte

Freshly installed from a freshly upgraded version of pip on Thursday ;)

regarding cache_dir: no idea, just copied that from the script to see what ends up in there :D

Jazzpirate avatar Feb 22 '20 18:02 Jazzpirate

Wait, so you’re not running it on Colab, with all other things remaining the same? Then I think it might be an issue with your environment. Also, just yesterday (or the day before) I think there was a major change in the Transformers library.

aditya-malte avatar Feb 22 '20 19:02 aditya-malte

~Heh, so uninstalling with pip and pulling the git repo directly seems to have solved it. Thanks :)~ Huh, no, it actually didn't, I just overlooked that I commented out the offending line earlier :( Problem still stands...

Jazzpirate avatar Feb 22 '20 19:02 Jazzpirate

Okay, update: the problem really was the model type.

tokenization_roberta.py:

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}

tokenization_bert.py:

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

...I have no idea why they use different file conventions, and in particular why BERT doesn't let you use the files from ByteLevelBPETokenizer... :/
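In other words (a minimal sketch; "./tokenizer" is the folder holding the renamed vocab.json and merges.txt from above):

```python
from transformers import BertTokenizer, RobertaTokenizer

# RobertaTokenizer looks for vocab.json + merges.txt, which is exactly what
# ByteLevelBPETokenizer produces, so this loads fine:
tokenizer = RobertaTokenizer.from_pretrained("./tokenizer")

# BertTokenizer looks for a WordPiece vocab.txt, which a byte-level BPE
# tokenizer never writes, so this raises the OSError quoted above:
# tokenizer = BertTokenizer.from_pretrained("./tokenizer")
```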

Jazzpirate avatar Feb 22 '20 20:02 Jazzpirate

Great!

aditya-malte avatar Feb 23 '20 09:02 aditya-malte

I'm currently writing class abstractions to assemble a model in code, using the right tokenizer class depending on the model_type. My goal is to have something like this (yes, I'm heavily biased towards object orientation):

# for a new model
tokenizer = trainTokenizer(data_file, model_type="<somemodel>")
dataset = <something>
model = ModularModel(tokenizer=tokenizer, out_file="some/path", model_type="<somemodel>", non_default_parameters={})
model.train(train_dataset=dataset, eval_dataset=None)

...
# for a pre-saved model
loaded_model = ModularModel(out_file="some/path")
...

Would that be useful to anyone other than me?

Jazzpirate avatar Feb 23 '20 10:02 Jazzpirate

I strongly agree with you, and I too feel that the community should go in an OOP direction (rather than the CLI way; we’re all writing abstractions around it now anyway). Do share your code.

aditya-malte avatar Feb 23 '20 11:02 aditya-malte

@aditya-malte Here it is. Classes: https://pastebin.com/71N3gp7C - mostly copy-pasted from the run_language_modeling.py script, but with all shell parameters replaced by a dict with all entries optional. Example usage: https://pastebin.com/SQUf61aD

The default parameters used are questionable for sure. For CamemBERT, I couldn't find out what kind of tokenizer can generate the files the PreTrainedTokenizer subclass expects, so that one won't work, but afaik all the other ones work out of the box.

Jazzpirate avatar Feb 23 '20 11:02 Jazzpirate

@aditya-malte sorry, found a slight error: line 686 needs to be `sorted_checkpoints = self._sorted_checkpoints(output_dir + "/out")` in order for continuing from a checkpoint to work

Jazzpirate avatar Feb 23 '20 11:02 Jazzpirate

@Jazzpirate This is awesome! The only thing that's a bit unfortunate (though probably operator error) is that it only seems to run on cpu for me. Is there a way to specify gpu?

Gulp... spoke too soon: `no_cuda = False`... duh

jbmaxwell avatar Feb 25 '20 01:02 jbmaxwell

@jbmaxwell you can pass a dictionary to the ModularModel constructor with the CLI parameters you wish to use. By default, it sets no_cuda to True, because I couldn't get CUDA to run on my machine without compiling "old" Linux kernels myself :/
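A minimal sketch of what that looks like (hypothetical usage: ModularModel and non_default_parameters come from the pastebin classes above, and the dict keys mirror the run_language_modeling.py CLI flags):

```python
# Hypothetical sketch: ModularModel is the class from the pastebin linked above;
# `tokenizer` and `dataset` are assumed to have been built earlier.
model = ModularModel(
    tokenizer=tokenizer,
    out_file="some/path",
    model_type="roberta",
    non_default_parameters={"no_cuda": False},  # False = use CUDA/GPU if available
)
model.train(train_dataset=dataset, eval_dataset=None)
```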

One more "bug" I found: if you want to use evaluation during training, make sure to replace the calls to `evaluate` in the `train` and `_train` methods with `_evaluate` instead.

I honestly only did this to the point where I could use it for my own purposes. If I can get something out of this, I might be inclined to make a fork and pull-request with a more stable script at some point, but I honestly have no real idea what I'm doing (yet?) :D

Jazzpirate avatar Feb 25 '20 09:02 Jazzpirate

The Hugging Face library in general would massively benefit from keeping things in code rather than an unholy, messy blend of CLI scripts (a bit like how fast-bert does it: https://github.com/kaushaltrivedi/fast-bert).

ddofer avatar Feb 25 '20 10:02 ddofer

As a new user of the Transformers/Tokenizers libraries, I had trouble following the blogpost, too. I'm watching this thread for a clean notebook I can follow.

What I want to do is train a language model from scratch with a custom architecture, e.g., I want to play around with the BERT layers.

Suggestion: use the wikitext-2 dataset?

chaitjo avatar Feb 26 '20 10:02 chaitjo

Hi, just change the config variable in this Colab notebook to adjust the number of layers: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb Thanks
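For context, the config in question looks roughly like the following (a sketch loosely based on the blog post's values; shrink num_hidden_layers, num_attention_heads, or hidden_size to get a smaller, faster model):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Sketch of a small RoBERTa; adjust the architecture hyperparameters as needed.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
```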

aditya-malte avatar Feb 26 '20 12:02 aditya-malte

@Jazzpirate RoBERTa uses a byte-level BPE tokenizer (similar to what GPT-2 uses), whereas BERT uses a WordPiece tokenizer.

A WordPiece tokenizer is based only on a vocabulary of tokens, starting from whole words and decomposing them into smaller pieces, whereas the BPE algorithm merges tokens together according to merge pairs that are stored in a separate file – hence the serialization formats are different.
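A quick way to see the difference with the tokenizers library (a sketch assuming a tokenizers release recent enough to expose save_model; older releases used save(directory, name) instead):

```python
import os

from tokenizers import BertWordPieceTokenizer, ByteLevelBPETokenizer

files = ["corpus.txt"]  # any plain-text training corpus
os.makedirs("./bert-tokenizer", exist_ok=True)
os.makedirs("./roberta-tokenizer", exist_ok=True)

# WordPiece (BERT-style): serializes a single vocab.txt
wp = BertWordPieceTokenizer()
wp.train(files=files, vocab_size=7_000)
wp.save_model("./bert-tokenizer")       # -> vocab.txt

# Byte-level BPE (RoBERTa/GPT-2-style): serializes vocab.json + merges.txt
bpe = ByteLevelBPETokenizer()
bpe.train(files=files, vocab_size=7_000)
bpe.save_model("./roberta-tokenizer")   # -> vocab.json, merges.txt
```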

julien-c avatar Feb 27 '20 03:02 julien-c

@aditya-malte Thanks for sharing, this looks good and contains great ideas.

A few comments:

  • with transformers 2.5.1, there should not be a need to mv the tokenizer files anymore (as they are created with the "standard" file names by default)
  • You shouldn't need to store each sample in a different text file.
  • I was thinking of adding a (slightly closer to the blog post) notebook in /notebooks in this repo and linking to it with an "Open in Colab" badge on the blog post; maybe we can collaborate on it

julien-c avatar Feb 27 '20 04:02 julien-c