
[WIP] Albert

Open • tejasvaidhyadev opened this issue 4 years ago • 10 comments

Hi everyone, I am adding ALBERT [WIP]. Currently only raw code is given in the PR. Dependencies: Transformers.jl, WordTokenizers.jl

I am not exporting any functions yet; I am still deciding on the best way to use them. But I am adding some of the important code used for converting the pretrained checkpoints, and a demo file below.

Roadmap

  • [x] SentencePiece - contains the WordPiece as well as the unigram model (a Python wrapper for now, with a Julia implementation under development)
  • [x] tfckpt2bsonforalbert.jl - for converting TensorFlow checkpoints to BSON weights
  • [x] ALBERT transformer - not yet complete; based on the Transformers.jl transformer
  • [x] model file - kept inside the ALBERT folder for now, but it is just the general wrapper structure for loading ALBERT pretrained weights
  • [x] APIs - alberttokenizer, albertmasklm, albertforsequenceclassification, etc.
  • [x] our own hosted pretrained models, managed by DataDeps.jl (see the sketch after this list)
  • [x] Documentation, tests and tutorials
  • [x] code and APIs for fine-tuning and data loading; apart from the above, refactoring and cleaning of the code remain
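A minimal sketch of the DataDeps.jl registration/fetch pattern intended for the hosted pretrained models above; the dependency name, description, and URL are placeholders, not the actual hosted files:

```julia
# Minimal sketch of the DataDeps.jl pattern for hosted pretrained weights.
# The name, description, and URL are placeholders and do not point to the
# actual released artifact.
using DataDeps

register(DataDep(
    "ALBERT-base-v1",                               # hypothetical dependency name
    "ALBERT base (v1) weights converted to BSON",   # message shown before first download
    "https://example.com/albert_base_v1.bson",      # placeholder URL
))

# `datadep"..."` downloads the artifact on first use and returns its local directory.
weights_dir  = datadep"ALBERT-base-v1"
weights_path = joinpath(weights_dir, "albert_base_v1.bson")
```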

Important links

Pretrained weights: link.

  • The pretrained weights are converted from the TensorFlow checkpoints released by google-research.
  • The conversion code is given in tfckpt2bsonforalbert.jl.
  • Currently, pretrained weights for version 1 are provided; I will release version 2 soon.

For details, refer to this link.

Demo - link
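As a rough illustration of how the converted weights can be consumed (the file name below is a placeholder, not the released artifact), BSON.jl reads them back as a plain dictionary:

```julia
# Hedged sketch: read a converted checkpoint back with BSON.jl.
# "albert_base_v1.bson" is a placeholder file name, not the released artifact.
using BSON

weights = BSON.load("albert_base_v1.bson")   # Dict of the objects the conversion script saved

# Inspect what tfckpt2bsonforalbert.jl stored (the exact keys depend on that script).
for k in keys(weights)
    println(k)
end
```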

P.S. All suggestions are welcome.

tejasvaidhyadev avatar Mar 31 '20 09:03 tejasvaidhyadev

Sorry for closing the PR earlier. The git commit history is now updated.

News

Updated Demo

  • Contains a demo of embeddings from WordPiece and SentencePiece

  • Demo of converting a TensorFlow checkpoint to a BSON file (as required by Julia's Flux) - link

tejasvaidhyadev avatar Apr 02 '20 13:04 tejasvaidhyadev

Pretrained weights

Version 2 of the converted ALBERT BSON is released. It does not contain the 30k-clean.model file (from SentencePiece).

tejasvaidhyadev avatar Apr 18 '20 13:04 tejasvaidhyadev

@aviks any suggestions on the roadmap mentioned above? I am also thinking of adding a Tutorial folder (containing .ipynb tutorials).

tejasvaidhyadev avatar Apr 23 '20 10:04 tejasvaidhyadev

Added SentencePiece unigram support.

tejasvaidhyadev avatar Jun 27 '20 04:06 tejasvaidhyadev

Completed the trainable ALBERT structure.

tejasvaidhyadev avatar Jul 03 '20 20:07 tejasvaidhyadev

Fine-tuning training tutorial (GPU is not supported so far) - here
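As a rough sketch of what a CPU-only fine-tuning loop looks like in Flux (this is generic, not the tutorial's code; `albert`, `classifier_head`, and `train_batches` are hypothetical placeholders):

```julia
# Generic CPU fine-tuning loop in Flux. `albert`, `classifier_head`, and
# `train_batches` are hypothetical placeholders, not names exported by this PR.
using Flux

ps  = Flux.params(albert, classifier_head)   # parameters to update
opt = ADAM(1e-5)                             # small learning rate for fine-tuning

loss(x, y) = Flux.logitcrossentropy(classifier_head(albert(x)), y)

for epoch in 1:3
    Flux.train!(loss, ps, train_batches, opt)   # one pass over the data per epoch
end
```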

tejasvaidhyadev avatar Jul 17 '20 19:07 tejasvaidhyadev

The above code is pretty messy and not yet refactored (it was for the experiment). We can drop SentencePiece as soon as the ALBERT PR is merged. Apart from that, pretrain.jl is ready, and we can drop tfckpt2bsonforalbert.jl in the next push. I will refactor the code within the next week.

tejasvaidhyadev avatar Jul 18 '20 18:07 tejasvaidhyadev

Hi @tejasvaidhyadev can you move this PR to TextModels now please?

aviks avatar Nov 01 '20 21:11 aviks

Hi @tejasvaidhyadev can you move this PR to TextModels now please?

Hi @aviks, is it okay if I do it the coming weekend? I have exams this week.

tejasvaidhyadev avatar Nov 02 '20 05:11 tejasvaidhyadev

I will do it the coming weekend?

Yes, of course, whenever you have time.

aviks avatar Nov 02 '20 15:11 aviks