Zyda2 tutorial - KeyError when running compute_counts script
Describe the bug
Running the 2_compute_counts.py script fails with the error KeyError: "['size'] not in index".
Steps/Code to reproduce bug
- Follow the steps in the tutorial
- Run python3 2_dupes_removal/2_compute_counts.py
- The script fails with the following error:
NeMo-Curator/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py", line 55, in group_partition
return result[
File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['size'] not in index"
Expected behavior
A successful run with the sizes calculated correctly.
Environment overview
- Environment location: Slurm
- Method of NeMo-Curator install: docker container, dev image from nvcr.io/nvidia/nemo:dev
Additional context
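Adding the line sizes = sizes.rename(columns={0: 'size'}) after sizes = partition.groupby("group").size().reset_index() appears to rename the count column correctly and fixes the error. In pandas, size() returns an unnamed Series, so reset_index() labels the count column 0 rather than 'size'.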
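For reference, here is a minimal sketch of the workaround on a toy pandas DataFrame. The partition data below is made up for illustration and is not taken from the tutorial; only the groupby/size/reset_index chain mirrors the script. It also shows a possible alternative, passing name="size" to Series.reset_index, which has not been tested against the tutorial itself.

```python
import pandas as pd

# Hypothetical stand-in for one partition; only the "group" column matters here.
partition = pd.DataFrame(
    {"group": ["a", "a", "b", "c", "c", "c"], "id": list(range(6))}
)

# size() returns an unnamed Series, so reset_index() labels the count column 0.
sizes = partition.groupby("group").size().reset_index()

# Workaround from this report: rename the 0 column to "size".
sizes_renamed = sizes.rename(columns={0: "size"})

# Possible alternative (an assumption, not verified in the tutorial):
# name the count column while resetting the index.
sizes_named = partition.groupby("group").size().reset_index(name="size")

print(sizes_renamed[["group", "size"]])
```

Either form yields a "size" column, so a later selection such as result[["group", "size"]] should no longer raise the KeyError.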
@ronjer30 Please share the latest updates
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.