Chenghao Mou
Based on some of the suggestions above, I tested it with the following MVP code, which works, at least locally, with a ws server and client: server.py ```python import...
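The snippet above is truncated. As a rough stand-in only (not the original code, and using plain asyncio TCP streams rather than an actual WebSocket library), a minimal server/client round trip looks something like:

```python
import asyncio

async def handle(reader, writer):
    # Echo a single line back to the client, then close the connection.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # Start the server and connect a client to it in the same event loop.
    server = await asyncio.start_server(handle, "127.0.0.1", 8765)
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", 8765)
        writer.write(b"hello\n")
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply

reply = asyncio.run(main())
```

A real WebSocket version would swap the stream calls for a library such as `websockets`, but the connect/send/receive shape is the same.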
Arabic text might require a different tokenisation method. Feel free to change the source file you are using. e.g. https://github.com/ChenghaoMou/text-dedup/blob/85dd9272e2cc0e873b1abb556807e1596f722284/text_dedup/minhash.py#L121 for minhash.py.
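One hedged sketch of what a different tokenisation method for Arabic could look like (this is an illustration, not the tokenizer in minhash.py): split on runs of characters in the Arabic Unicode block instead of plain whitespace, so punctuation and Latin noise are dropped before n-grams are built.

```python
import re

# Matches runs of characters in the basic Arabic Unicode block (U+0600-U+06FF).
ARABIC_TOKEN = re.compile(r"[\u0600-\u06FF]+")

def tokenize_arabic(text: str) -> list[str]:
    # Return only Arabic-script token runs; everything else is a separator.
    return ARABIC_TOKEN.findall(text)

tokens = tokenize_arabic("مرحبا بالعالم! hello")
```

For production use you would likely extend the character class (presentation forms, diacritics) or use a proper Arabic NLP tokenizer, but a function with this signature is easy to drop in where the source file builds its tokens.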
I remember a PR from not long ago: the content would be empty if the audio was not played. And the warning "_SegmentSynchronizerImpl.playback_finished called before text/audio input is done" seems to...
The easiest way I can think of to debug this is just to put print/raise/set_trace in your installed livekit code for that `mark_playback_finished` function. Check where the caller is from (stack...
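As a rough illustration of that stack-inspection trick (the function name here is only a hypothetical stand-in for the real livekit method, which you would edit in place in your installed package):

```python
import traceback

def mark_playback_finished_debug():
    # Temporary instrumentation: capture the current call stack as strings
    # so you can see which code path invoked this function.
    return traceback.format_stack()

def some_caller():
    # Stands in for whatever livekit internals call the instrumented function.
    return mark_playback_finished_debug()

stack = some_caller()
# Each entry names the file and function of one frame; scan the list (or just
# print it) to find the unexpected caller.
```

`traceback.print_stack()` does the same thing directly to stderr if you only want a quick look rather than a value to inspect.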
Thanks for the question. The two projects use the same underlying deduplication algorithms, so the results should depend on the specific parameters. For this project's performance, see the Benchmark results in the README. Most of this project's code comes from my experiments in BigScience and BigCode, and its main focus is deduplication, whereas datatrove includes data-cleaning logic beyond deduplication, so it comes down to personal preference. I have not done an actual comparison of the two.
Hi @alielfilali01 Thanks for reaching out. 313 billion tokens sounds doable with a decent cluster, based on my experience. For reference, the Spark script was tested with a TB-level dataset with...
Thanks for the details. In this case, you might have at least two options: 1. Try datatrove with its [slurm pipeline executor](https://github.com/huggingface/datatrove?tab=readme-ov-file#slurmpipelineexecutor) for deduplication with minimal HPC configuration and knowledge....