newbietuan
> "accuracy" during training probably meant the proportion of the training examples that had correctly predicted the contrastive label, e.g.: > > ``` > contrastive_label = torch.arange(batch_size) > > image_loss...
> Hello, how much disk space will be needed? About 100 TB?
Thank you very much. I will try it. When I run the code `(demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6...
> Hello @ladit, I'm working on the zh language. It seems I should download the wiki (zh) and use part of the pipeline to preprocess it. While there are similar questions...
@ladit Thanks for your reply. What I actually want to do is input a paragraph and output its quality score. If that means I just need to load...
> @newbietuan No. I am still waiting for instructions from the contributors. And most importantly, I noticed the code of ` def get_tokenizer(self, lang: str) -> Optional[RobustTokenizer]: cache =...
> Thank you very much, @mauriceweber. Sorry for the late reply. Will the wet_cache be deleted automatically during the pipeline run? When I ran a test, it did not seem to be deleted. So...
> Hi @newbietuan -- the ccnet pipeline processes the warc files on the fly, so you won't need to store an entire cc dump on disk. I cannot say how...
@aerkalov Hello, do you know how to fix this bug? The file can't be opened, so how should I correct this typo?