
Update single_gpu_tutorial.ipynb to use a recent snapshot

Open ronjer30 opened this issue 1 year ago • 0 comments

Describe the bug

The single GPU tutorial notebook refers to a Wikipedia dump snapshot that is no longer available, and the snapshot date is hardcoded in several cells of the notebook.

Steps/Code to reproduce bug

  1. Launch notebook
  2. Run all steps in 0.Env Setup section
  3. In the 1.Download section, run the code below:
#Output
download_base_directory= os.path.join(data_dir,"wiki_downloads")
download_output_directory = os.path.join(download_base_directory,"data")

#Relevant parameters
dump_date = "20240201"
language = 'th'
url_limit = 1

res = download_wikipedia(download_output_directory,
                   language=language, 
                   dump_date=dump_date,
                   url_limit=url_limit).df.compute()

This returns the following error: ValueError: No wikipedia dump found for 20240201

When dump_date is changed to a valid, currently available snapshot, subsequent notebook cells still fail because the snapshot date is hardcoded in them. For example:

! ls {download_output_directory}
! wc -l  {download_output_directory}/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl
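One way to pick a valid dump_date is to read the date directories listed on the dump index page for the language. The sketch below is only an illustration: parse_dump_dates is a hypothetical helper, not part of NeMo-Curator, and it assumes the index page (for example https://dumps.wikimedia.org/thwiki/) lists each snapshot as a link of the form href="YYYYMMDD/". It runs on a small sample string, so nothing is fetched over the network.

```python
import re


def parse_dump_dates(index_html: str) -> list[str]:
    """Return the YYYYMMDD directory names found in a dump index page, oldest first.

    Hypothetical helper: assumes snapshots appear as links like href="20240801/".
    """
    return sorted(set(re.findall(r'href="(\d{8})/"', index_html)))


# Sample index content (made up for illustration, not real dump listings)
sample_index = (
    '<a href="../">../</a>\n'
    '<a href="20240720/">20240720/</a>\n'
    '<a href="20240801/">20240801/</a>\n'
    '<a href="latest/">latest/</a>\n'
)

available_dates = parse_dump_dates(sample_index)
# Use the most recent snapshot in the listing as the dump date
dump_date = available_dates[-1]
```

In the notebook, the resulting dump_date would then be passed to download_wikipedia in place of the hardcoded value.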

Expected behavior

When the snapshot (dump_date) is changed, subsequent notebook cells should refer to the dump_date variable rather than to a hardcoded value.

Environment overview

Environment location: Bare-metal
Method of NeMo-Curator install: Docker

docker run \
   --rm \
   -it \
   --gpus '"device=1"' \
   --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
   -p 8888:8888 \
   -p 8787:8787 \
   nvcr.io/nvidia/nemo:dev

Additional context

Please replace all references to the hardcoded 20240201 with the variable dump_date:

! ls {download_output_directory}
! wc -l  {download_output_directory}/thwiki-{dump_date}-pages-articles-multistream.xml.bz2.jsonl
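The same idea can be expressed in plain Python, which may help in cells that open the file rather than shell out. This is a minimal sketch: wiki_jsonl_path is a hypothetical helper, and the filename pattern is taken from the cell above rather than from the NeMo-Curator API, so it is an assumption that the downloader always names its output this way.

```python
import os


def wiki_jsonl_path(output_dir: str, language: str, dump_date: str) -> str:
    """Build the expected JSONL path for a downloaded Wikipedia dump.

    Hypothetical helper: the filename pattern is assumed from the notebook cell.
    """
    filename = f"{language}wiki-{dump_date}-pages-articles-multistream.xml.bz2.jsonl"
    return os.path.join(output_dir, filename)


# Example values matching the parameters used in the issue
example_path = wiki_jsonl_path("wiki_downloads/data", "th", "20240201")
```

Because both language and dump_date come from the variables set in the download cell, changing either one no longer requires editing later cells.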

ronjer30, Aug 05 '24 22:08