NeMo-Curator
Update single_gpu_tutorial.ipynb to use a recent snapshot
Describe the bug
The single-GPU tutorial notebook (single_gpu_tutorial.ipynb) refers to a Wikipedia dump snapshot, 20240201, that is no longer available, and that date is hardcoded in several cells of the notebook.
Steps/Code to reproduce bug
- Launch notebook
- Run all steps in 0.Env Setup section
- In the 1.Download section, while running the below code
# Output
download_base_directory = os.path.join(data_dir, "wiki_downloads")
download_output_directory = os.path.join(download_base_directory, "data")
# Relevant parameters
dump_date = "20240201"
language = 'th'
url_limit = 1
res = download_wikipedia(download_output_directory,
                         language=language,
                         dump_date=dump_date,
                         url_limit=url_limit).df.compute()
Returns the following error
ValueError: No wikipedia dump found for 20240201
When dump_date is changed to a valid snapshot, subsequent notebook cells still fail because the dump date is hardcoded in those cells, for example:
! ls {download_output_directory}
! wc -l {download_output_directory}/thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl
Expected behavior
When the snapshot/dump-date is changed, subsequent notebook cells should refer to the variable rather than the hardcoded value
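One possible way to get this behavior, sketched under assumptions: build the file name from language and dump_date once and reuse it in later cells. The helper name wiki_jsonl_path is hypothetical, and the "{language}wiki-{dump_date}-..." pattern is inferred from the single thwiki-20240201-... file name in this report, not from the NeMo-Curator documentation.

```python
import os


def wiki_jsonl_path(output_dir: str, language: str, dump_date: str) -> str:
    """Build the expected JSONL path for a downloaded Wikipedia dump.

    Hypothetical helper: the file name pattern is inferred from the
    'thwiki-20240201-pages-articles-multistream.xml.bz2.jsonl' name in this
    report and may not cover every file the downloader writes.
    """
    file_name = f"{language}wiki-{dump_date}-pages-articles-multistream.xml.bz2.jsonl"
    return os.path.join(output_dir, file_name)


# Example values: this is the date from the report, which is no longer
# available, so substitute a snapshot that currently exists.
example_path = wiki_jsonl_path("wiki_downloads/data", "th", "20240201")
print(example_path)
```

Later cells could then refer to the returned path instead of repeating the date, so changing dump_date in one place would be enough.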
Environment overview
Environment location: Bare-metal
Method of NeMo-Curator install: Docker
docker run \
--rm \
-it \
--gpus '"device=1"' \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 \
-p 8787:8787 \
nvcr.io/nvidia/nemo:dev
Additional context
Please update all references to the hardcoded 20240201 with the variable dump_date, for example:
! ls {download_output_directory}
! wc -l {download_output_directory}/thwiki-{dump_date}-pages-articles-multistream.xml.bz2.jsonl
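As an alternative that avoids embedding the date in a file name at all, the line count could be taken over whatever .jsonl files the download step wrote. This is only a sketch: count_jsonl_lines is a hypothetical helper, not part of NeMo-Curator, and it assumes the output directory holds the .jsonl files directly.

```python
import glob
import os


def count_jsonl_lines(output_dir: str) -> dict:
    """Return {file name: line count} for every .jsonl file in output_dir.

    Hypothetical helper: it only looks at the top level of output_dir.
    """
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "*.jsonl"))):
        # Binary mode avoids decoding errors; iteration splits on newlines.
        with open(path, "rb") as handle:
            counts[os.path.basename(path)] = sum(1 for _ in handle)
    return counts
```

In the notebook this would be called as count_jsonl_lines(download_output_directory). Unlike wc -l, it also counts a final line that has no trailing newline.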