Please add a sequence labelling example using BERT and CRF
This repository is very helpful, and the movie review example is very interesting. Please add more examples, especially ones that use BERT with a CRF for sequence labelling tasks.
Hi, I've made a rough implementation of sequence labelling using this library and a CRF. Please have a look: https://github.com/DUTANGx/TF2-albert-NER
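For anyone who just wants the gist without opening the repository, a minimal sketch of such a BERT/ALBERT + CRF head could look like the following. It assumes `tensorflow_addons` for the CRF ops and the loader helpers from bert-for-tf2's README; `model_dir`, `max_seq_len` and `num_labels` are placeholders, and none of this is copied from the linked repo:

```python
import tensorflow as tf
import tensorflow_addons as tfa
import bert  # the bert-for-tf2 package

model_dir, max_seq_len, num_labels = "albert_base_v2", 128, 9   # placeholder values

# Build the (AL)BERT backbone from a pretrained checkpoint directory.
params = bert.params_from_pretrained_ckpt(model_dir)
l_bert = bert.BertModelLayer.from_params(params, name="bert")

input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32, name="input_ids")
seq_out = l_bert(input_ids)                              # [batch, seq_len, hidden]
logits = tf.keras.layers.Dense(num_labels)(seq_out)      # per-token emission scores
model = tf.keras.Model(inputs=input_ids, outputs=logits)

bert.load_albert_weights(l_bert, model_dir)              # load the pretrained weights

# CRF transition scores; when training with a custom loop, pass this variable
# to the optimizer as well so the transitions are learned jointly.
transitions = tf.Variable(tf.zeros((num_labels, num_labels)), name="crf_transitions")

def crf_neg_log_likelihood(labels, logits, seq_lens):
    # labels: [batch, seq_len] int tags; seq_lens: number of non-padding tokens per example
    ll, _ = tfa.text.crf_log_likelihood(logits, labels, seq_lens,
                                        transition_params=transitions)
    return -tf.reduce_mean(ll)

def crf_predict(logits, seq_lens):
    # Viterbi decoding of the most likely tag sequence
    tags, _ = tfa.text.crf_decode(logits, transitions, seq_lens)
    return tags
```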
@DUTANGx - thank you for your contribution. Have you made any comparison of how much improvement could be obtained by using CRF on top of (AL)BERT as compared to using (AL)BERT only?
@kpe Thanks for your really nice library :) I've tested on the Chinese corpus "MSRA". Adding a CRF brings around a 3%-5% improvement in F1 score with ALBERT. BTW, one thing that might be interesting to look at: when I added a bi-LSTM (with the default tanh hidden activation) after the ALBERT layer, a ValueError was raised saying the LSTM layer can't handle the float type of the input. This was fixed by changing the activation to elu.
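In code, the workaround amounts to something like the sketch below; the input layer just stands in for the ALBERT sequence output, and the sizes are placeholders:

```python
import tensorflow as tf

hidden_size, max_seq_len, num_labels = 768, 128, 9                  # placeholder values
seq_out = tf.keras.layers.Input(shape=(max_seq_len, hidden_size))   # stands in for the ALBERT output

# bi-LSTM with elu instead of the default tanh activation, as described above
lstm_out = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True, activation="elu"))(seq_out)
logits = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_labels))(lstm_out)
```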
I am testing ALBERT with non-CRF layers as decoders for a sequence labelling task in which tokens are classified into argumentation candidates. The repository is very rough at the moment, but you can take a look here if it helps: https://github.com/atreyasha/sentiment-argument-mining/tree/develop_argumentation.
I am using this module and adding decoder layers such as 1D convolutions, LSTMs, and time-distributed dense layers. So far the results are quite good.
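A rough sketch of such a non-CRF decoder stack; the layer sizes are illustrative and not the ones used in the repository, and the input layer again stands in for the (AL)BERT sequence output:

```python
import tensorflow as tf

hidden_size, max_seq_len, num_labels = 768, 128, 3                  # placeholder values
seq_out = tf.keras.layers.Input(shape=(max_seq_len, hidden_size))   # stands in for the (AL)BERT output

# 1D convolution over the token dimension, then a per-token softmax classifier
x = tf.keras.layers.Conv1D(256, kernel_size=3, padding="same", activation="relu")(seq_out)
probs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(num_labels, activation="softmax"))(x)
head = tf.keras.Model(inputs=seq_out, outputs=probs)
```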
However, I did notice one problem. As I have limited hardware (a single GPU with 12 GB of memory), I can only train with a small batch size (48 samples), and I also have to truncate my sequence lengths to the point that I lose a lot of data.
@kpe I was thinking of implementing a gradient accumulation optimizer, which could help save some memory and thereby allow for longer sequence lengths. Have you by chance already implemented a gradient accumulator in this library or in your code somewhere? This is also an open issue in TensorFlow (https://github.com/tensorflow/tensorflow/pull/32576).
If not, do you think it would be a good addition to this Python package? I could implement it and submit a pull request. This has already been done by some folks here (https://github.com/run-ai/runai); however, it only works with keras and not tensorflow.keras.
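For illustration, the idea boils down to something like the following tensorflow.keras sketch; it is not an existing feature of this library or of params-flow, and `accum_steps` plus the loop structure are just one way it could look:

```python
import tensorflow as tf

def train_with_accumulation(model, dataset, loss_fn, optimizer, accum_steps=4):
    """Sum the gradients of `accum_steps` small batches before one optimizer
    update, emulating a larger effective batch size at the cost of extra steps."""
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # divide by accum_steps so the accumulated gradient is a mean
            loss = loss_fn(y, model(x, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [a + (g if g is not None else tf.zeros_like(a))
                       for a, g in zip(accum_grads, grads)]
        if (step + 1) % accum_steps == 0:
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

# Example usage (placeholder model/dataset):
# train_with_accumulation(model, train_ds,
#                         tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
#                         tf.keras.optimizers.Adam(1e-5), accum_steps=8)
```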
@atreyasha - I have not done gradient accumulation yet. I think if you implement one, it would best fit https://github.com/kpe/params-flow, as it is not model specific, but feel free to open a PR wherever you like.
I'm not sure what's currently the best option for training with larger batches (check TFRC for getting access to a TPU pod and use it with the LAMB optimizer), but tweaking the optimizer might be the best approach (as Adam needs so much memory). There is, for example, https://github.com/openai/gradient-checkpointing (see https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9), but there might also be more recent developments.
AdaFactor also seems to be a good choice for transformer-based models.
Just as a note, I used bert-for-tf2 successfully in a sequence labelling task. This is detailed in the repository here. The relevant information on the training and pre-processing can be found here.
Due to the lack of better hardware and unreliable gradient-memory-handling solutions for TensorFlow 2, I had to stick to the naive approach of simply keeping a very small batch size.
Acknowledgments to @kpe are of course there as well :smile: