Ayush Dattagupta
Thanks for raising @tiraldj, it looks like the SQL query creates a column called `sum`, but it cannot be accessed via `result.sum` since that name is reserved for the function `sum`....
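For illustration, here's a minimal sketch of the distinction (plain pandas here, but cuDF / dask_cudf DataFrames shadow method names the same way):

```python
import pandas as pd  # cuDF / dask_cudf behave the same way for this

# A column named "sum" collides with the built-in DataFrame.sum method.
result = pd.DataFrame({"sum": [10, 20, 30]})

print(result.sum)     # attribute access resolves to the DataFrame.sum method
print(result["sum"])  # bracket access unambiguously returns the column
```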
Thanks for opening @ByteWrite. To pass the DCO and merge requirements, NeMo-Curator requires all commits to be signed and signed-off. More information on how to do so is described in...
As a workaround, could you try adding `dataset.df = dataset.df.reset_index(drop=True)` in the `fuzzy_dedupe` method before calling `fuzzy_dup`? (`reset_index` returns a new DataFrame rather than modifying it in place, so it needs to be assigned back.) My best guess is it's related to #48, since after the first removal the...
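A rough sketch of where that line would go (hypothetical placement; `fuzzy_dup` stands in for however the fuzzy dedup call is invoked in your script):

```python
# Hypothetical sketch of the workaround. reset_index(drop=True) returns a
# new DataFrame rather than mutating in place, so assign it back.
dataset.df = dataset.df.reset_index(drop=True)
duplicates = fuzzy_dup(dataset)  # the existing fuzzy dedup call
```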
Thanks for checking. I'll investigate further.
Thanks for raising @HuaYZhao. In our current setup I don't think there's an easy way to pipeline or overlap the tokenization with the inference, but we are looking into other...
Algorithmically, the shuffle is needed to ensure all duplicates are found. I'll try to illustrate with an example:

```python
# Assume input data is divided into 2 partitions
Partition 0...
```
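Since the snippet above got cut off, here's a self-contained sketch of the same idea using plain Dask and a hypothetical `bucket` column (the real pipeline operates on minhash buckets with dask-cuDF, but the mechanics are the same):

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for LSH output: documents 0 and 2 share bucket "a"
# but land in different partitions.
pdf = pd.DataFrame({"doc_id": [0, 1, 2, 3], "bucket": ["a", "b", "a", "c"]})
df = dd.from_pandas(pdf, npartitions=2)  # partition 0: docs 0,1; partition 1: docs 2,3

# Partition-local dedup never compares docs 0 and 2, so the duplicate survives.
naive = df.map_partitions(lambda p: p.drop_duplicates(subset="bucket"))
print(len(naive))  # 4 -- the cross-partition duplicate is missed

# Shuffling on the bucket column co-locates rows with equal buckets first,
# after which the same partition-local dedup is globally correct.
deduped = df.shuffle(on="bucket").map_partitions(
    lambda p: p.drop_duplicates(subset="bucket")
)
print(len(deduped))  # 3 -- docs 0 and 2 met in one partition and one was dropped
```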
`max(1, hash_df.npartitions)` is theoretically okay. `hash_df.npartitions//3` is just an optimization for improved performance. In our experience, the input document dataset is much larger in size than the `hash_df`, which only contains...
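As a sketch of that heuristic (assuming a Dask / dask-cuDF `hash_df`):

```python
# hash_df is much smaller than the document dataset, so shrinking its
# partition count reduces per-partition overhead in the later shuffle/join.
# max(1, ...) only guards against the integer division yielding 0 partitions.
target = max(1, hash_df.npartitions // 3)
hash_df = hash_df.repartition(npartitions=target)
```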
Marking as closed. @simplew2011 Feel free to reopen if you have any issues!
NeMo-Curator should be able to run both single- and multi-node on SLURM clusters with both [NeMo-Run](https://github.com/NVIDIA/NeMo-Curator/tree/main/examples/nemo_run), which wraps some [bash scripts](https://github.com/NVIDIA/NeMo-Curator/tree/main/examples/slurm) that could be used to set up the cluster...
Thanks for raising the issue @chenrui17. For 8TB of input data on 5 A100 GPUs (~400GB of total GPU memory), the memory requirements to hold intermediates during stages like LSH might lead...
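As rough, assumption-laden arithmetic on those numbers (not a measured figure):

```python
input_bytes = 8 * 1024**4         # 8 TB of input data
gpu_mem_bytes = 5 * 80 * 1024**3  # 5 x A100-80GB ~= 400 GB pooled GPU memory

# Even before counting LSH intermediates, the input oversubscribes GPU
# memory by roughly 20x, so stages must spill or process data in chunks.
print(input_bytes / gpu_mem_bytes)  # ~20.5
```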