
The tokenizer in bert.cpp is not good enough; how about `tokenizers-cpp`?

Open FFengIll opened this issue 2 years ago • 11 comments

As mentioned in the title, https://github.com/mlc-ai/tokenizers-cpp is a good tokenizer implementation. Some people may not want another dependency, but it is worth it.

FFengIll avatar Sep 20 '23 05:09 FFengIll

Using a mature implementation is helpful:

  • Both cased and uncased models should work.
  • CJK text will work.
  • It is efficient.
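
To make the first two points concrete, here is a minimal sketch of BERT-style pre-tokenization. It is an illustration only, not code from bert.cpp or tokenizers-cpp; the function names `is_cjk` and `pre_tokenize` are hypothetical, and it assumes the usual BERT convention of lowercasing for uncased models and isolating each CJK ideograph as its own token. Accent stripping, punctuation splitting, and the WordPiece step are left out.

```python
def is_cjk(ch):
    """Return True if ch falls in one of the main CJK ideograph blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0xF900 <= cp <= 0xFAFF     # Compatibility Ideographs
        or 0x20000 <= cp <= 0x2A6DF   # Extension B
    )


def pre_tokenize(text, lowercase=True):
    """Split text on whitespace, treating every CJK character as its own token.

    lowercase=True mimics an uncased model; lowercase=False mimics a cased one.
    """
    if lowercase:
        text = text.lower()
    pieces = []
    for ch in text:
        # Surround CJK characters with spaces so split() isolates them.
        pieces.append(" " + ch + " " if is_cjk(ch) else ch)
    return "".join(pieces).split()


if __name__ == "__main__":
    print(pre_tokenize("Hello 世界"))
    print(pre_tokenize("Hello 世界", lowercase=False))
```

A tokenizer that skips the CJK step would treat a run such as 世界 as one unknown word, which is the kind of gap a mature implementation avoids.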

FFengIll avatar Sep 20 '23 05:09 FFengIll

I believe it works (demo below).

image

But it won't be done right away.

FFengIll avatar Sep 20 '23 08:09 FFengIll

Great!

redthing1 avatar Sep 21 '23 02:09 redthing1

After review and testing, I found that the current tokenizer implementation is not efficient enough. Bringing in tokenizers-cpp speeds it up considerably. (just for convenience)

Below is the benchmark on my laptop (M2 Max).

image

FFengIll avatar Sep 21 '23 13:09 FFengIll

E5 or m3e? that's great!

cgisky1980 avatar Sep 21 '23 17:09 cgisky1980

This is a very exciting direction, and huge props to FFengIll for getting this working. Unfortunately I don't have much time to work on this project right now, but I think people would enjoy the changes you've made recently.

The use case I originally had for this project is no longer valid, so I'm not as invested in making this library "production quality".

So I have 2 suggestions on how to share your changes:

  1. Make a new repo with the updates and I'll add a link to it at the beginning of the readme, saying that version is more up to date
  2. I can give you contributor rights to this repo and the huggingface storage for the pre-converted models. I can still do code reviews and answer any questions, but you'd be free to set the direction for the project to your liking (e.g. changing the tokenizer)

skeskinen avatar Sep 22 '23 12:09 skeskinen

@skeskinen Thanks for your suggestions.

After thinking it over, I would like to create a new repo (actually still a fork) named embedding.cpp.

Of course, the original info for bert.cpp will be kept.

The main reason is that what I want is simply an efficient text embedding tool that can be deployed standalone.

I've worked in this area for some time, and I'm very glad to have found bert.cpp and joined in.

Some other, minor reasons:

  • I am eager for efficiency, but tokenizers-cpp brings Rust into the build, which some people may not like.
  • The changes are large and happen frequently, so keeping them in a fork might be better.

FFengIll avatar Sep 26 '23 08:09 FFengIll

@cgisky1980 both of them work well.

FFengIll avatar Sep 26 '23 08:09 FFengIll

Thanks. Where is the new repo?

cgisky1980 avatar Oct 04 '23 03:10 cgisky1980

@FFengIll need embedding.cpp

cgisky1980 avatar Oct 08 '23 23:10 cgisky1980

Here is the repo: https://github.com/FFengIll/embedding.cpp

Please note that it is a WIP and not stable enough yet.

FFengIll avatar Oct 11 '23 03:10 FFengIll