NeMo-Curator
Zyda2 tutorial - key error when running compute_counts script
Describe the bug
The 2_compute_counts.py script fails when run, raising the error Exception: 'KeyError("[\'size\'] not in index")'
Steps/Code to reproduce bug
- Follow the steps in the tutorial
- Run python3 2_dupes_removal/2_compute_counts.py
- The script fails with the following error:
NeMo-Curator/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py", line 55, in group_partition
return result[
File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['size'] not in index"
Expected behavior
The script runs successfully and the size column is calculated correctly.
Environment overview (please complete the following information)
Environment location: Slurm
Method of NeMo-Curator install: docker container, dev image from nvcr.io/nvidia/nemo:dev
Additional context
Adding the line sizes = sizes.rename(columns={0: 'size'}) immediately after sizes = partition.groupby("group").size().reset_index() appears to rename the column correctly and fixes the error.
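
A minimal sketch of the behaviour, assuming the partition is a plain pandas DataFrame (the traceback above goes through pandas). The sample data and the variable names renamed and named are hypothetical and not taken from the tutorial script. It shows that reset_index() on the unnamed result of size() produces a column labelled 0, and compares the rename workaround with passing name="size" to reset_index(), which may be a slightly tidier alternative.

```python
import pandas as pd

# Hypothetical stand-in for one partition handled by group_partition.
partition = pd.DataFrame({"group": ["a", "a", "b"], "id": [10, 11, 12]})

# size() returns an unnamed Series, so reset_index() labels the count column 0.
sizes = partition.groupby("group").size().reset_index()
print(list(sizes.columns))

# Workaround from this report: rename the 0 column to "size".
renamed = sizes.rename(columns={0: "size"})

# Alternative: name the count column while resetting the index.
named = partition.groupby("group").size().reset_index(name="size")

# Both approaches should yield the same frame with a "size" column.
print(renamed.equals(named))
```

Either form leaves a column named size that later indexing can select. Whether the tutorial script should adopt one of them is for the maintainers to decide.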