text-splitter
Release wasm version
Hey Ben,
I would love to use a wasm version of text-splitter in the web application https://github.com/do-me/SemanticFinder. Currently it only supports chars, words, sentences, regex and tokens but all of these separators are too "stiff". I found that your unicode-based approach generally works quite well which would give users more flexibility and hopefully even better results.
Do you think you could release a wasm-compiled version for the web?
Hi @do-me, cool project! I would definitely love to support this.
Are you ok if it only supports character-based chunking? The reason is that I'd likely need some workarounds, or would first have to check whether it is even possible to use tokenizer libs in wasm...
If character-based is fine, then I think it could be possible. I would also need to check whether markdown can be supported, but I guess anything is better than nothing for your use case.
Yes absolutely! Token-based chunking is absolute overkill for my use case.
However, if you'd still want to offer a way to include it for some reason, transformers.js offers a very convenient tokenizing API out of the box. See here for example: https://huggingface.co/docs/transformers.js/api/tokenizers
```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
//   data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
//   dims: [1, 6],
//   type: 'int64',
//   size: 6,
// }
```
So shifting the task of calculating tokens to the user, rather than including it directly in Rust/wasm, might make the most sense. But again, for me it's not really necessary.
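To illustrate the idea of shifting token counting to the user, here is a minimal sketch of a greedy chunker that takes a user-supplied size callback. The names `chunkBySize` and `sizeFn` are illustrative only, not text-splitter's actual API; in practice the callback could wrap the transformers.js tokenizer above and return the token count instead of the character count.

```javascript
// Sketch only: a greedy capacity-based chunker whose size measurement is a
// user-supplied callback, so the caller can bring their own tokenizer
// instead of one being compiled into the wasm module.
function chunkBySize(text, capacity, sizeFn) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    // Grow the candidate chunk until adding one more character would
    // exceed the capacity reported by the callback.
    let end = start + 1;
    while (end < text.length && sizeFn(text.slice(start, end + 1)) <= capacity) {
      end++;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

// With a plain character counter, the callback degenerates to string length;
// a token-based caller would return the tokenizer's token count instead.
const chunks = chunkBySize("the quick brown fox", 7, (s) => s.length);
console.log(chunks); // ["the qui", "ck brow", "n fox"]
```

A real splitter would of course also prefer semantic boundaries (sentences, words) over fixed cuts; the sketch only shows the callback contract.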
> If character-based is fine, then I think it could be possible. I would also need to check whether markdown can be supported, but I guess anything is better than nothing for your use case.
For me certainly, whatever is feasible for you. If markdown were supported, that would allow for a really great pipeline: I just discovered https://r.jina.ai/, which converts any web input to LLM-ready markdown. So pairing that tool with your performant chunking and SemanticFinder would deliver a great user experience :)
Awesome. Yeah, I think I'd likely do something similar to what I have in the Python bindings and accept a callback/lambda function so the user can bring custom logic that isn't compiled in. It has the downside of needing an FFI call quite often, which isn't always performant, but at least it provides the functionality.
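One cheap way to soften the FFI cost mentioned above is to memoize repeated measurements on the JS side, since a splitter may probe the same candidate span more than once while searching for a boundary. This is a sketch under that assumption; `memoizeSizeFn` is a hypothetical helper, not part of text-splitter.

```javascript
// Sketch only: wrap a (potentially expensive) size callback in a cache so
// repeated probes of the same text pay the callback cost only once.
function memoizeSizeFn(sizeFn) {
  const cache = new Map();
  const memoized = (text) => {
    if (!cache.has(text)) {
      cache.set(text, sizeFn(text));
      memoized.misses++; // count actual invocations of the wrapped callback
    }
    return cache.get(text);
  };
  memoized.misses = 0;
  return memoized;
}

// A character counter stands in for a real tokenizer callback here.
const counted = memoizeSizeFn((s) => s.length);
counted("hello world"); // miss: callback invoked
counted("hello world"); // hit: served from cache
counted("hello");       // miss: callback invoked
console.log(counted.misses); // 2
```

Whether caching pays off depends on how often the splitter re-measures identical spans; for a strictly forward, single-pass splitter it would add overhead instead.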
Well cool, assuming the markdown crate works, I think it should be quite easy to support a wasm target for this use case. It would also enable building a playground of sorts so people can play with the effect of different chunk settings and see it visually, which is something I've been wanting to do anyway.