bert.cpp
The tokenizer in bert.cpp is not good enough, how about `tokenizers-cpp`?
As mentioned in the title, https://github.com/mlc-ai/tokenizers-cpp is a good tokenizer implementation.
Some people may not want yet another dependency, but it is worth it.
Using a mature implementation is helpful:
- cased / uncased models both work.
- CJK text will work.
- it is efficient.
I believe it works (demo below).
But it won't be done right away.
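For reference, here is a minimal sketch of what the usage could look like, following the factory-style API shown in the tokenizers-cpp README; the `tokenizer.json` path and the sample text are placeholders:

```cpp
// Minimal sketch: tokenize text with tokenizers-cpp (mlc-ai/tokenizers-cpp).
// Assumes the library is built and linked as described in its README, and that a
// HuggingFace tokenizer.json for the BERT model is available (path is a placeholder).
#include <tokenizers_cpp.h>

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Read the whole tokenizer.json blob into memory; the factory API takes an in-memory blob.
static std::string LoadBytesFromFile(const std::string& path) {
    std::ifstream fs(path, std::ios::in | std::ios::binary);
    std::ostringstream os;
    os << fs.rdbuf();
    return os.str();
}

int main() {
    auto tok = tokenizers::Tokenizer::FromBlobJSON(
        LoadBytesFromFile("models/tokenizer.json"));  // placeholder path

    // Handles plain English and CJK text alike.
    auto ids = tok->Encode("深度学习 text embeddings with bert.cpp");
    for (auto id : ids) std::cout << id << " ";
    std::cout << "\n" << tok->Decode(ids) << std::endl;
    return 0;
}
```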
Great!
After review and testing, I found the current tokenizer implementation is not efficient enough. Importing tokenizers-cpp gives a big boost. (just for convenience)
Below is the benchmark on my laptop (M2 Max).
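As a rough illustration of how such a comparison can be timed, here is a sketch of a simple throughput loop (the tokenizer path, sample sentence, and iteration count are arbitrary choices, not the original benchmark setup):

```cpp
// Rough sketch: time tokenizers-cpp Encode() throughput with a simple loop.
// The tokenizer path, sample sentence, and iteration count are arbitrary choices.
#include <tokenizers_cpp.h>

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

static std::string LoadBytesFromFile(const std::string& path) {
    std::ifstream fs(path, std::ios::in | std::ios::binary);
    std::ostringstream os;
    os << fs.rdbuf();
    return os.str();
}

int main() {
    auto tok = tokenizers::Tokenizer::FromBlobJSON(
        LoadBytesFromFile("models/tokenizer.json"));  // placeholder path

    const std::string sample = "The quick brown fox jumps over the lazy dog.";
    const int iters = 10000;

    auto start = std::chrono::steady_clock::now();
    std::size_t total_tokens = 0;
    for (int i = 0; i < iters; ++i) {
        total_tokens += tok->Encode(sample).size();  // encode the same sentence repeatedly
    }
    auto end = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(end - start).count();
    std::cout << "avg " << us / iters << " us per Encode, "
              << total_tokens / iters << " tokens each" << std::endl;
    return 0;
}
```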
E5 or m3e? that's great!
This is a very exciting direction, and huge props to FFengIll for getting this working. Unfortunately I don't have much time to work on this project right now, but I think people would enjoy the changes you've made recently.
The use case I originally had for this project is no longer valid, so I'm not as invested in making this library "production quality".
So I have 2 suggestions on how to share your changes:
- Make a new repo with the updates, and I'll add a link to it at the beginning of the README saying that version is more up to date.
- I can give you contributor rights to this repo and the Hugging Face storage for the pre-converted models. I can still do code reviews and answer any questions, but you'd be free to set the direction for the project to your liking (e.g. changing the tokenizer).
@skeskinen Thanks for your suggestions.
After thinking it over, I'd like to build a new repo (actually still a fork) named `embedding.cpp`. Of course, the original info for `bert.cpp` is kept.
The major reason is that what I want is just an efficient text embedding tool which can be deployed standalone.
I've worked in this area for some time, and I'm very glad to see `bert.cpp` and join in.
Some other minor reasons might be:
- I am eager for efficiency, but `tokenizers-cpp` will bring Rust into the build, and some people may not like that.
- Changes are large and happen frequently, so holding them in a fork might be better.
@cgisky1980 both of them work well.
THX. Where is the new repo?
@FFengIll need embedding.cpp
Here is the repo: https://github.com/FFengIll/embedding.cpp
And I must remind you that it is WIP and not yet stable.
