Ayush Dattagupta
Thanks for raising @tiraldj, it looks like the SQL query creates a column called `sum`, but it cannot be accessed via `result.sum` since that name is reserved for the function `sum`....
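For illustration, here's a minimal sketch of the distinction (plain pandas here, but cuDF / dask_cudf DataFrames shadow method names the same way):

```python
import pandas as pd  # cuDF / dask_cudf behave the same way for this

# A column named "sum" collides with the built-in DataFrame.sum method.
result = pd.DataFrame({"sum": [10, 20, 30]})

print(result.sum)     # attribute access resolves to the DataFrame.sum method
print(result["sum"])  # bracket access unambiguously returns the column
```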
Thanks for opening @ByteWrite. To pass the DCO and merge requirements, NeMo-Curator requires all commits to be signed and signed-off. More information on how to do so is described in...
As a workaround, could you try adding `dataset.df = dataset.df.reset_index(drop=True)` in the `fuzzy_dedupe` method before calling `fuzzy_dup`? (`reset_index` returns a new DataFrame rather than modifying it in place, so it needs to be assigned back.) My best guess is it's related to #48, since after the first removal the...
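A rough sketch of where that line would go (hypothetical placement; `fuzzy_dup` stands in for however the fuzzy dedup call is invoked in your script):

```python
# Hypothetical sketch of the workaround. reset_index(drop=True) returns a
# new DataFrame rather than mutating in place, so assign it back.
dataset.df = dataset.df.reset_index(drop=True)
duplicates = fuzzy_dup(dataset)  # the existing fuzzy dedup call
```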
Thanks for checking. I'll investigate further.
Thanks for raising @HuaYZhao. In our current setup I don't think there's an easy way to pipeline or overlap the tokenization with the inference, but we are looking into other...
Algorithmically, the shuffle is needed to ensure all duplicates are found. I'll try to illustrate with an example:

```python
# Assume input data is divided into 2 partitions
Partition 0...
```
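Since the snippet above got cut off, here's a self-contained sketch of the same idea using plain Dask and a hypothetical `bucket` column (the real pipeline operates on minhash buckets with dask-cuDF, but the mechanics are the same):

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for LSH output: documents 0 and 2 share bucket "a"
# but land in different partitions.
pdf = pd.DataFrame({"doc_id": [0, 1, 2, 3], "bucket": ["a", "b", "a", "c"]})
df = dd.from_pandas(pdf, npartitions=2)  # partition 0: docs 0,1; partition 1: docs 2,3

# Partition-local dedup never compares docs 0 and 2, so the duplicate survives.
naive = df.map_partitions(lambda p: p.drop_duplicates(subset="bucket"))
print(len(naive))  # 4 -- the cross-partition duplicate is missed

# Shuffling on the bucket column co-locates rows with equal buckets first,
# after which the same partition-local dedup is globally correct.
deduped = df.shuffle(on="bucket").map_partitions(
    lambda p: p.drop_duplicates(subset="bucket")
)
print(len(deduped))  # 3 -- docs 0 and 2 met in one partition and one was dropped
```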
`max(1, hash_df.npartitions)` is theoretically okay. `hash_df.npartitions//3` is just an optimization for improved performance. In our experience, the input document dataset is much larger in size than the `hash_df`, which only contains...
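As a sketch of that heuristic (assuming a Dask / dask-cuDF `hash_df`):

```python
# hash_df is much smaller than the document dataset, so shrinking its
# partition count reduces per-partition overhead in the later shuffle/join.
# max(1, ...) only guards against the integer division yielding 0 partitions.
target = max(1, hash_df.npartitions // 3)
hash_df = hash_df.repartition(npartitions=target)
```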
Marking as closed. @simplew2011 Feel free to reopen if you have any issues!
NeMo-Curator should be able to run both single- and multi-node on SLURM clusters with both [NeMo-Run](https://github.com/NVIDIA/NeMo-Curator/tree/main/examples/nemo_run), which wraps some [bash scripts](https://github.com/NVIDIA/NeMo-Curator/tree/main/examples/slurm) that could be used to set up the cluster...
Thanks for raising the issue @chenrui17. For 8TB of input data on 5 A100 GPUs (~400GB of total GPU memory), the memory requirements to hold intermediates during stages like LSH might lead...
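As rough, assumption-laden arithmetic on those numbers (not a measured figure):

```python
input_bytes = 8 * 1024**4         # 8 TB of input data
gpu_mem_bytes = 5 * 80 * 1024**3  # 5 x A100-80GB ~= 400 GB pooled GPU memory

# Even before counting LSH intermediates, the input oversubscribes GPU
# memory by roughly 20x, so stages must spill or process data in chunks.
print(input_bytes / gpu_mem_bytes)  # ~20.5
```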