newbietuan comments

9 comments by newbietuan

How can I get accuracy metrics when training?

> "accuracy" during training probably meant the proportion of the training examples that had correctly predicted the contrastive label, e.g. `contrastive_label = torch.arange(batch_size)`, `image_loss...`
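
The proportion described in the quoted reply can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from the thread: the function name `contrastive_accuracy` and the toy similarity matrix are assumptions made up for this example. In a PyTorch training loop the same idea would likely be written as `(logits.argmax(dim=1) == contrastive_label).float().mean()`.

```python
def contrastive_accuracy(logits):
    """Fraction of rows whose largest logit sits on the diagonal.

    logits[i][j] is the similarity between image i and text j in a batch,
    so the correct ("contrastive") label for row i is i itself.
    Hypothetical helper, written for illustration only.
    """
    batch_size = len(logits)
    # Plain-Python equivalent of torch.arange(batch_size):
    # sample i is expected to match caption i.
    contrastive_label = list(range(batch_size))
    # Predicted label per row = index of the highest similarity.
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == t for p, t in zip(predicted, contrastive_label))
    return correct / batch_size


# Toy 3x3 image-to-text similarity matrix (made-up numbers).
logits = [
    [0.9, 0.1, 0.0],  # image 0 -> text 0 (correct)
    [0.2, 0.7, 0.1],  # image 1 -> text 1 (correct)
    [0.6, 0.3, 0.1],  # image 2 -> text 0 (wrong, should be text 2)
]
image_accuracy = contrastive_accuracy(logits)
print(f"image-to-text accuracy: {image_accuracy:.3f}")
```

Transposing the matrix and calling the same function gives the text-to-image accuracy, and the two can be averaged if a single number is wanted.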

Expected finish time for processing one single index of commoncrawl?

> Hello, how much disk space will be needed? About 100 TB?

About downloading a small portion of CC

Thank you very much. I will try it. When I run the code `(demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6...

Questions about the quality classifier in common crawl

> Hello @ladit, I'm working on the zh language. It seems I should download the wiki (zh) and use part of the pipeline to preprocess it. While there are similar questions...

Questions about the quality classifier in common crawl

@ladit Thanks for your reply. Indeed, what I want to do is input a paragraph and output its quality score, if that means I just need to load...

Questions about the quality classifier in common crawl

> @newbietuan No. I am still waiting for instructions from the contributors. And, most importantly, I noticed the code of `def get_tokenizer(self, lang: str) -> Optional[RobustTokenizer]: cache =...

How much disk memory will be used?

> Thank you very much, @mauriceweber. Sorry for the late reply. During the pipeline run, will the wet_cache be deleted automatically? When I ran a test, it did not seem to be deleted. So...

How much disk memory will be used?

> Hi @newbietuan -- the ccnet pipeline processes the warc files on the fly, so you won't need to store an entire cc dump on disk. I cannot say how...

read fail AttributeError: 'NoneType' object has no attribute 'nsmap'

@aerkalov Hello, do you know how to fix this bug? The file can't be opened, so how should I fix this typo?