Hugo Laurençon comments

Results: 11 comments of Hugo Laurençon

Return an error when trying to decompose a node with `html` tag

Sure, here is my example that results in an infinite loop:

```
from selectolax.parser import HTMLParser

html_str = """ """

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        ...
```

add files to compute basic stats on pseudo crawl dataset

@SaulLu can you edit and merge? Thanks!

add files to compute basic stats on pseudo crawl dataset

@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas

Reason for not applying remove_non_printing_characters normalization

Hi, thank you for your comment! The `remove_non_printing_characters` function was not used during the normalization of the documents:
https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L357

However, it was used just before the tokenization step:
https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L213
and...
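To illustrate the distinction drawn in this comment, here is a minimal sketch in which non-printing characters are stripped only from a temporary copy used for word-level processing, while the document itself is left untouched. The character ranges (C0 controls, DEL, and C1 controls) and the helper `count_words_for_filtering` are assumptions made for illustration; the repository's actual implementation may differ.

```python
import re

# Assumption: "non-printing" covers code points 0-31 and 127-159.
# The real repository may define this set differently.
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def remove_non_printing_characters(text: str) -> str:
    """Return a copy of `text` with the assumed non-printing characters removed."""
    return NON_PRINTING_RE.sub("", text)


def count_words_for_filtering(document: str) -> int:
    """Hypothetical helper: clean a temporary copy, then split it into words.

    The caller's document is never modified, which mirrors the idea of
    applying the cleanup before tokenization rather than as a normalization.
    """
    cleaned = remove_non_printing_characters(document)
    return len(cleaned.split())


doc = "hello\x00 wor\x07ld\n"
n_words = count_words_for_filtering(doc)
```

Note that under these assumed ranges, tabs and newlines are also removed from the temporary copy, so any step that relies on line structure would need to run on the original text.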

Collect data from Data Catalog

Thanks!

Releasing trained topic models?

Hi, thanks! I don't think I still have it, but it didn't take long to train: I ran it on my personal computer for 1 day, so it should...

nsfw filtered texts only file missing at step 08_01

Hi, thanks! I think the output of 07_03 is rather

```
PATH_SAVE_S3_WEB_DOCS_NSFW_FILTERED = os.path.join(
    "s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered", str(IDX_JOB)
)
```

In this case you can just replace the images by None to...

Search engine over the training data

Missing TextMediaPairsExtractor from the repo

Is the tot_counter saved twice in this code snippet?