Chenghao Mou
Based on some of the suggestions above, I tested it with the following MVP code, which works, at least locally, with a ws server and client: server.py ```python import...
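The snippet above is truncated. As a rough stand-in only (not the original code, and using plain asyncio TCP streams rather than an actual WebSocket library), a minimal server/client round trip looks something like:

```python
import asyncio

async def handle(reader, writer):
    # Echo a single line back to the client, then close the connection.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # Start the server and connect a client to it in the same event loop.
    server = await asyncio.start_server(handle, "127.0.0.1", 8765)
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", 8765)
        writer.write(b"hello\n")
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply

reply = asyncio.run(main())
```

A real WebSocket version would swap the stream calls for a library such as `websockets`, but the connect/send/receive shape is the same.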
Arabic text might require a different tokenisation method. Feel free to change the source file you are using. e.g. https://github.com/ChenghaoMou/text-dedup/blob/85dd9272e2cc0e873b1abb556807e1596f722284/text_dedup/minhash.py#L121 for minhash.py.
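One hedged sketch of what a different tokenisation method for Arabic could look like (this is an illustration, not the tokenizer in minhash.py): split on runs of characters in the Arabic Unicode block instead of plain whitespace, so punctuation and Latin noise are dropped before n-grams are built.

```python
import re

# Matches runs of characters in the basic Arabic Unicode block (U+0600-U+06FF).
ARABIC_TOKEN = re.compile(r"[\u0600-\u06FF]+")

def tokenize_arabic(text: str) -> list[str]:
    # Return only Arabic-script token runs; everything else is a separator.
    return ARABIC_TOKEN.findall(text)

tokens = tokenize_arabic("مرحبا بالعالم! hello")
```

For production use you would likely extend the character class (presentation forms, diacritics) or use a proper Arabic NLP tokenizer, but a function with this signature is easy to drop in where the source file builds its tokens.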
I remember a PR from not long ago: the content would be empty if the audio was not played. And the warning "_SegmentSynchronizerImpl.playback_finished called before text/audio input is done" seems to...
The easiest way I can think of to debug this is just to put print/raise/set_trace in your installed livekit code for that `mark_playback_finished` function. Check where the caller is from (stack...
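As a rough illustration of that stack-inspection trick (the function name here is only a hypothetical stand-in for the real livekit method, which you would edit in place in your installed package):

```python
import traceback

def mark_playback_finished_debug():
    # Temporary instrumentation: capture the current call stack as strings
    # so you can see which code path invoked this function.
    return traceback.format_stack()

def some_caller():
    # Stands in for whatever livekit internals call the instrumented function.
    return mark_playback_finished_debug()

stack = some_caller()
# Each entry names the file and function of one frame; scan the list (or just
# print it) to find the unexpected caller.
```

`traceback.print_stack()` does the same thing directly to stderr if you only want a quick look rather than a value to inspect.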
Thanks for the question. The two projects use the same underlying deduplication algorithms, so the results should depend on the specific parameters. For this project's performance, see the Benchmark results in the README. Most of this project's code comes from my experiments in BigScience and BigCode, and its main focus is deduplication, whereas datatrove includes data-cleaning logic beyond deduplication, so it comes down to personal preference. I have not done an actual comparison of the two.
Hi @alielfilali01 Thanks for reaching out. 313 billion tokens sounds doable with a decent cluster, based on my experience. For reference, the Spark script was tested with a TB-level dataset with...
Thanks for the details. In this case, you might have at least two options: 1. Try datatrove with its [slurm pipeline executor](https://github.com/huggingface/datatrove?tab=readme-ov-file#slurmpipelineexecutor) for deduplication with minimal HPC configuration and knowledge....