
Chunking and sentence detection?

Open SFosterQRA opened this issue 3 years ago • 2 comments

Hi there,

I work on a project that currently uses OpenNLP, but we're considering switching our code over to Rust, and rust-bert seems like a natural fit. We use OpenNLP for a few things, among them chunking/shallow parsing (i.e., extracting noun phrases, verb phrases, etc.) and sentence detection. Does rust-bert support either of those functions? I looked through the code and it seemed like they weren't supported, but I could easily have missed it.

Anyway, assuming they're not supported, we might pursue implementing them ourselves. If we went that route, do you have any advice on what we'd need to do, or even where to get started? Sorry, it's a bit of a vague question, I know - I've worked with NLP for a while, but this is pretty much my first time diving into the inner workings of it.

SFosterQRA · May 21 '21 17:05

Hi @SFosterQRA ,

You are correct that there are no pretrained models supporting chunking or sentence detection yet. To my knowledge, there are no transformer-based architectures in Hugging Face's model hub supporting these tasks either (https://huggingface.co/models).

I believe it would be possible to train token classification models for these tasks using an appropriate dataset (e.g., CoNLL-2000 for chunking). The easiest approach may be to use the numerous training utilities that exist in Python to finetune models on these tasks, and then to convert/import them into the library. A wide range of architectures is supported, which should make this a rather quick process.

However, I would like to highlight that most models for sentence boundary detection and parsing/chunking tend to be significantly lighter than typical transformer architectures.

  • I looked at the size of the chunker in OpenNLP and it weighs only ~2MB for English. Most transformer-based models are at least one, often two orders of magnitude larger and more computationally expensive - this may result in a significant increase in resource usage compared to the OpenNLP baseline you are working with.
  • For sentence boundary detection, I would recommend looking at NNSplit which seems to offer lightweight (~4MB) LSTM-based models for Rust.
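As a point of comparison for how light a baseline can be, a purely rule-based sentence splitter fits in a few lines of Rust. This is only an illustrative sketch (it is not NNSplit's approach, and a real detector would also need to handle abbreviations, quotes, ellipses, and so on):

```rust
/// Naive rule-based sentence splitter, shown only as an illustrative
/// baseline: split after '.', '!' or '?' when the next non-whitespace
/// character is uppercase (or the text ends).
fn split_sentences(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < chars.len() {
        if matches!(chars[i], '.' | '!' | '?') {
            // Look ahead past whitespace for an uppercase letter.
            let mut j = i + 1;
            while j < chars.len() && chars[j].is_whitespace() {
                j += 1;
            }
            if j >= chars.len() || chars[j].is_uppercase() {
                let s: String = chars[start..=i].iter().collect();
                sentences.push(s.trim().to_string());
                start = j;
                i = j;
                continue;
            }
        }
        i += 1;
    }
    if start < chars.len() {
        let s: String = chars[start..].iter().collect();
        let s = s.trim().to_string();
        if !s.is_empty() {
            sentences.push(s);
        }
    }
    sentences
}
```

A heuristic like this is obviously far weaker than a learned model, but it illustrates why a ~4MB LSTM can be a sensible middle ground between rules and a full transformer.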

If you would like to proceed with Transformer-based architectures for chunking and sentence boundary detection, I believe the easiest would be:

  1. Identify a target lightweight transformer architecture (e.g., DistilBERT, DistilRoBERTa, MobileBERT, TernaryBERT, SqueezeBERT...)
  2. Identify a training dataset for your target language / domain
  3. Train the model using the rich Python ecosystem (e.g., Transformers, PyTorch Lightning)
  4. Convert the model weights to C-tensors (using, for example, this crate's conversion utilities)
  5. Load it and use the TokenClassification pipeline directly, or a pipeline derived from it
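For step 5, the pipeline would hand back per-token labels; turning those into chunk spans is then a small post-processing step. As a sketch, assuming a CoNLL-2000-style IOB2 tag set (the actual label scheme depends on how the model is trained):

```rust
/// Group per-token IOB2 labels (e.g. "B-NP", "I-NP", "O") from a
/// token-classification chunker into (chunk_type, start, end) spans over
/// token indices, with `end` exclusive. The IOB2 scheme is an assumption
/// based on CoNLL-2000; adapt it to the tag set of the trained model.
fn iob_to_chunks(tags: &[&str]) -> Vec<(String, usize, usize)> {
    let mut chunks = Vec::new();
    let mut current: Option<(String, usize)> = None;
    for (i, tag) in tags.iter().enumerate() {
        let (prefix, kind) = match tag.split_once('-') {
            Some((p, k)) => (p, k),
            None => ("O", ""), // plain "O" carries no chunk type
        };
        // An "I-" tag of the same type extends the open chunk;
        // anything else ("B-", "O", or a type change) closes it.
        let continues = prefix == "I"
            && matches!(&current, Some((k, _)) if k.as_str() == kind);
        if !continues {
            if let Some((k, start)) = current.take() {
                chunks.push((k, start, i));
            }
            if prefix == "B" || prefix == "I" {
                current = Some((kind.to_string(), i));
            }
        }
    }
    if let Some((k, start)) = current {
        chunks.push((k, start, tags.len()));
    }
    chunks
}
```

Note that this leniently treats a stray "I-" tag with no open chunk as the start of a new chunk, which is a common way to handle slightly inconsistent model output.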

I believe the Rust community would benefit significantly from such lightweight taggers - I am not sure Transformers offer the best performance/efficiency ratio for these tasks. Perhaps porting some capabilities from OpenNLP to Rust would be a promising alternative? I hope this helps!

guillaume-be · May 21 '21 19:05

Was just skimming through the issues on this repo... @SFosterQRA You can try to phrase all your tasks as text generation. In the project I want to use rust-bert for, that's what I'm doing, for instance to extract "virtual assistants" as the topic of the sentence "Formulate a research question about virtual assistants."

paulbricman · May 24 '21 05:05