Matt Watson

Results 339 comments of Matt Watson

This will need to be a little exploratory, I don't think anyone has looked into this yet! It's definitely possible things will already mostly work today; I don't know the...

Just for browsing reference for others:
https://colab.sandbox.google.com/github/jessechancy/keras-nlp/blob/jesse-checkpoint-conversion/scripts/models/roberta/benchmark_roberta_checkpoints.ipynb
https://colab.sandbox.google.com/github/jessechancy/keras-nlp/blob/jesse-checkpoint-conversion/scripts/models/roberta/roberta_checkpoint_conversion.ipynb
https://colab.sandbox.google.com/github/jessechancy/keras-nlp/blob/jesse-checkpoint-conversion/scripts/models/roberta/benchmark_xlmr_checkpoints.ipynb
https://colab.sandbox.google.com/github/jessechancy/keras-nlp/blob/jesse-checkpoint-conversion/scripts/models/roberta/xlmr_checkpoint_conversion.ipynb

I think I will go ahead and land the checkpoint conversion scripts, so we have something to work off of, but leave the benchmarking scripts out for now, as we...

@jbischof the colabs moved location since I posted. These links are just paths into github. You can search in the colab UI, but it's kind of a pain. `https://colab.research.google.com/github/{fork}/keras-nlp/blob/{branch}/{path}` https://colab.research.google.com/github/jessechancy/keras-nlp/blob/jesse-checkpoint-conversion/tools/checkpoint_conversion/roberta_checkpoint_conversion.ipynb...
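As a rough illustration of the URL pattern above, here is a minimal sketch that fills in the template. The helper name `colab_url` and its `repo` default are made up for this example; only the template itself and the sample fork, branch, and path come from the comment.

```python
# Hypothetical helper (not part of keras-nlp): builds a Colab link that
# opens a notebook directly from a GitHub fork and branch.
COLAB_GITHUB_BASE = "https://colab.research.google.com/github"


def colab_url(fork: str, branch: str, path: str, repo: str = "keras-nlp") -> str:
    """Fill in the {fork}/{repo}/blob/{branch}/{path} template."""
    return f"{COLAB_GITHUB_BASE}/{fork}/{repo}/blob/{branch}/{path}"


url = colab_url(
    fork="jessechancy",
    branch="jesse-checkpoint-conversion",
    path="tools/checkpoint_conversion/roberta_checkpoint_conversion.ipynb",
)
print(url)
```

Swapping in a different fork, branch, or notebook path should give the matching link, as long as the notebook exists at that location on GitHub.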

Overall, sounds good to me! Definitely like adding `BertPreprocessor.from_preset("bert_base_uncased_en")`, that will really improve our end-to-end, mid-level usage pattern. I wonder if it's a little more correct to have `sequence_length=None` in...

Re namespaces, I do think we are signing up for a global namespace of IDs. We probably will store presets for different uses separately out of convenience, but the namespace...

Thanks! Will take a look! One note, it might be nice to add Jesse as a co-author on the commit, he did some incredible work on this and we should...

Forgot to leave a comment here, but from conversations with @jessechancy: for the cache, we want a way to go from a string word input to a token list output....
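To make the "string word in, token list out" idea concrete, here is a minimal pure-Python sketch of such a cache. Everything in it is an assumption for illustration: the class name `WordTokenCache`, the `lookup` method, and the toy tokenizer are hypothetical, and the design actually discussed may look quite different (for example, it may need to run inside a TensorFlow graph).

```python
from typing import Callable, Dict, List


class WordTokenCache:
    """Hypothetical cache: maps a word string to its list of token ids."""

    def __init__(self, tokenize_word: Callable[[str], List[int]]):
        # The (possibly expensive) function that tokenizes a single word.
        self._tokenize_word = tokenize_word
        self._cache: Dict[str, List[int]] = {}

    def lookup(self, word: str) -> List[int]:
        # Tokenize on a miss, then reuse the stored result on later hits.
        if word not in self._cache:
            self._cache[word] = self._tokenize_word(word)
        return self._cache[word]


calls = []


def toy_tokenize(word: str) -> List[int]:
    # Stand-in tokenizer for the demo: one "token id" per character.
    calls.append(word)
    return [ord(c) for c in word]


cache = WordTokenCache(toy_tokenize)
first = cache.lookup("hi")   # miss: calls toy_tokenize
second = cache.lookup("hi")  # hit: served from the cache
print(first, second, calls)
```

The point of the sketch is only that repeated words skip the tokenization step after the first lookup.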

Another issue we need to work through is that python regex and tf regex appear to handle certain whitespace characters differently, specifically non-breaking spaces. We need to fix this, probably with some...
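One plausible source of this kind of mismatch, shown as a small sketch: Python's `re` treats `\s` as Unicode whitespace for `str` patterns, so it matches a non-breaking space (U+00A0), while an ASCII-only `\s` does not. The example uses `re.ASCII` as a stand-in for an ASCII-only engine. It is an assumption here that the TensorFlow regex ops behave this way (they follow RE2 syntax, where `\s` covers only ASCII whitespace), and the actual cause of the difference in the tokenizer may be something else.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# Default str patterns use Unicode semantics: \s matches NBSP.
unicode_match = re.fullmatch(r"\s", NBSP)

# With re.ASCII, \s is limited to [ \t\n\r\f\v], so NBSP does not match.
# This approximates an ASCII-only \s such as the one in RE2 syntax.
ascii_match = re.fullmatch(r"\s", NBSP, flags=re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```

If this is the cause, a pattern that names the intended whitespace characters explicitly should behave the same in both engines, rather than relying on `\s`.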

@chenmoneygithub left a few brief comments above in that regard. The issue with different output is apparently with non-breaking space characters. And there are some things I would like to...