
Sharing a recent EMNLP work on task-adaptive tokenization with variable segmentation

Open lsy641 opened this issue 2 years ago • 4 comments

I found the earlier discussion in Multi-word segmentation #220 really interesting, and learned that your project members have experimented with segmentation beyond the word level on MT datasets without seeing significant improvement.

I think this is because the sub-word vocabulary was already trained on the MT data, so there was little room left to improve effectiveness by changing granularity, although increasing granularity can bring an efficiency boost. In the era of pretrained models, however, I have been rethinking changing the granularity and compositionality of generation in the downstream domain.


Recently, our work (https://arxiv.org/abs/2310.05317) provides a solution that lets a pretrained model adopt a task-adaptive tokenizer, which supports variable segmentation optimized on the downstream data. It then allows multiple coarser-grained segmentations (still at the sub-word level) to be sampled. This brings significant improvement in both generation effectiveness and efficiency for tasks where task-specific terminology frequently appears (e.g., medical, mental health). The improvement comes from two sources: 1. the gap between the pretraining vocabulary (for example, the BERT vocabulary is optimized on the GNMT benchmark, which may suit MT but not other tasks) and the downstream language style; 2. the potential of variable segmentation for efficiency.

To build a task-adaptive tokenizer, I currently sew the pretraining vocabulary and the downstream vocabulary together manually, using the protobuf APIs provided by sentencepiece_model_pb2.py and sentencepiece_pb2.py, and then build a new tokenizer compatible with HuggingFace. I was wondering if your project would be interested in providing a function for researchers to easily build a task-adaptive tokenizer.

lsy641 avatar Oct 24 '23 16:10 lsy641

I read the paper; is there any code available that showcases the algorithm?

RubyBit avatar Feb 05 '24 22:02 RubyBit

> I read the paper; is there any code available that showcases the algorithm?

@RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join the repository, please give me your GitHub account.

lsy641 avatar Feb 21 '24 23:02 lsy641

> I read the paper; is there any code available that showcases the algorithm?
>
> @RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join the repository, please give me your GitHub account.

Yes, that would be great (this is my GitHub account: RubyBit).

RubyBit avatar Feb 25 '24 13:02 RubyBit

@lsy641 I am so sorry, can you resend the invite? I didn't check my mail in time.

RubyBit avatar Mar 20 '24 12:03 RubyBit