HugeCTR
How can I use NVTabular to generate Norm data?
I want to train the WDL model using Embedding Training Cache like this:
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["/hugectr/tools/criteo_data_500k/file_list."+str(i)+".txt" for i in range(1)],
                                  keyset = ["/hugectr/tools/criteo_data_500k/file_list."+str(i)+".keyset" for i in range(1)],
                                  eval_source = "/hugectr/tools/criteo_data_500k/file_list_test.0.txt",
                                  check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
hc_cnfg = hugectr.CreateHMemCache(num_blocks = 2, target_hit_rate = 0.5, max_num_evict = 0)
etc = hugectr.CreateETC(ps_types = [hugectr.TrainPSType_t.Staged, hugectr.TrainPSType_t.Cached],
                        sparse_models = ["/hugectr/samples/wdl_1gpu_ps/wdl0_sparse_20000.model", "/hugectr/samples/wdl_1gpu_ps/wdl1_sparse_20000.model"],
                        local_paths = ["/hugectr/samples/wdl_1gpu_ps"],
                        hmem_cache_configs = [hc_cnfg])
When I use the command "bash preprocess.sh 1 criteo_data_500k pandas 1 1 100" to process part of the Criteo dataset, it works, and I can use the output data to run the training. But if I use pandas to process a whole day of the Criteo dataset, it raises an "out of memory" error because my computer does not have enough memory. So I switched to NVTabular and used the command "bash preprocess.sh 1 criteo_data_nvtnorm nvt 0 1 1" to process the Criteo dataset. I tried to replace the hugectr.DataReaderParams call with the following, but it fails with the error "DataHeaderError /hugectr/HugeCTR/include/common.hpp:259[HUGECTR][08:57:59][ERROR][RANK0]":
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["/hugectr/tools/criteo_data_nvtnorm_500k/train/_file_list.txt"],
                                  keyset = ["/hugectr/tools/criteo_data_nvtnorm_500k/train/_hugectr.keyset"],
                                  eval_source = "/hugectr/tools/criteo_data_nvtnorm_500k/val/_file_list.txt",
                                  check_type = hugectr.Check_t.Non)
The problem is that to use Embedding Training Cache you must provide the 'keyset' parameter in hugectr.DataReaderParams. So, are there any samples of "NVTabular + hugectr.CreateHMemCache"?
Hi @dulvqingyunLT, thank you for your question! Currently NVT doesn't support Norm format output. But we are working on a notebook to demonstrate how to generate a keyset with NVT. I believe you will find it in a few weeks. @jershi425 to keep @dulvqingyunLT posted.
Thanks a lot!
We now have a notebook for generating keyset files. Please refer to https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/notebooks/embedding_training_cache_example.ipynb
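For readers who want a rough idea of what generating a keyset involves before opening the notebook, here is a minimal sketch in plain Python. It rests on an assumption that should be checked against the notebook: that a keyset file is a flat binary array of the unique categorical keys found in the matching dataset file, stored as 64-bit integers when 64-bit input keys are used and as 32-bit unsigned integers otherwise. The helper names `collect_unique_keys`, `write_keyset`, and `read_keyset` are hypothetical and are not part of HugeCTR or NVTabular.

```python
import struct
from typing import Iterable, List, Set


def collect_unique_keys(rows: Iterable[Iterable[int]]) -> Set[int]:
    """Gather every distinct categorical key seen across all rows."""
    unique: Set[int] = set()
    for row in rows:
        unique.update(row)
    return unique


def write_keyset(keys: Iterable[int], path: str, use_int64: bool = True) -> int:
    """Write the unique keys as a flat little-endian binary array.

    ASSUMPTION: int64 vs. uint32 is meant to mirror the reader's key width;
    verify the expected layout against the linked notebook.
    """
    fmt = "<q" if use_int64 else "<I"
    ordered = sorted(set(keys))
    with open(path, "wb") as f:
        for key in ordered:
            f.write(struct.pack(fmt, key))
    return len(ordered)


def read_keyset(path: str, use_int64: bool = True) -> List[int]:
    """Read a keyset file back, mainly to sanity-check what was written."""
    fmt = "<q" if use_int64 else "<I"
    with open(path, "rb") as f:
        data = f.read()
    return [value[0] for value in struct.iter_unpack(fmt, data)]
```

In practice the rows would come from the categorical columns of the preprocessed training data, for example read chunk by chunk so that the whole dataset never has to fit in memory. The resulting path would then be passed through the `keyset` argument of `hugectr.DataReaderParams`, as in the snippets above.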