Zyda2 tutorial - key error when running compute_counts script

Open • ronjer30 opened this issue 1 year ago • 1 comment

Describe the bug

Running the 2_compute_counts.py script fails with the error Exception: 'KeyError("[\'size\'] not in index")'

Steps/Code to reproduce bug

  1. Follow the steps in the tutorial
  2. Run python3 2_dupes_removal/2_compute_counts.py
  3. The script fails with the following error:
NeMo-Curator/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py", line 55, in group_partition
    return result[
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['size'] not in index"

Expected behavior

The script runs successfully and the size column is calculated correctly.

Environment overview (please complete the following information)

Environment location: Slurm
Method of NeMo-Curator install: docker container, dev image from nvcr.io/nvidia/nemo:dev

Additional context

Adding the line sizes = sizes.rename(columns={0: 'size'}) immediately after sizes = partition.groupby("group").size().reset_index() renames the count column from 0 to 'size' and appears to fix the error.
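
A minimal sketch of what seems to be going on, assuming plain pandas (the traceback points into pandas/core/indexes/base.py). The partition DataFrame below is a hypothetical stand-in, not the tutorial's data, and the reset_index(name="size") variant is an alternative I am suggesting, not something taken from the script:

```python
import pandas as pd

# Hypothetical stand-in for one partition; only the "group" column matters here.
partition = pd.DataFrame({"group": ["a", "a", "b"], "id": [1, 2, 3]})

# Same expression as in the script: size() returns an unnamed Series,
# so reset_index() labels the count column with the integer 0.
sizes = partition.groupby("group").size().reset_index()
default_columns = list(sizes.columns)

# Selecting "size" at this point raises a KeyError like the one in the traceback.
try:
    sizes[["group", "size"]]
    selection_failed = False
except KeyError:
    selection_failed = True

# Workaround from this issue: rename column 0 to "size".
renamed = sizes.rename(columns={0: "size"})

# Possible alternative: name the count column while resetting the index.
named = partition.groupby("group").size().reset_index(name="size")
```

Both renamed and named end up with the columns group and size, so either form should let the column selection in group_partition succeed. The second avoids depending on the default column label 0.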

ronjer30, Nov 05 '24 21:11