Hugo Laurençon

11 comments by Hugo Laurençon

Sure, here is my example that results in an infinite loop:

```
from selectolax.parser import HTMLParser

html_str = """ """

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        ...
```

@SaulLu can you edit and merge? Thanks!

@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas

Hi, thank you for your comment! The `remove_non_printing_characters` function was not used during the normalization of the documents:
https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L357

However, it was used just before the tokenization step:
https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L213
and...
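For readers unfamiliar with this kind of step, here is a minimal sketch of what stripping non-printing characters before tokenization can look like. It is not the code from `filtering.py`: the name `strip_non_printing` and the choice of character ranges (the C0 and C1 control blocks) are assumptions made for illustration only.

```python
import re

# Assumption: "non-printing" means the C0 controls (U+0000-U+001F),
# DEL (U+007F) and the C1 controls (U+0080-U+009F).
# Note that this range also covers tab and newline.
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_non_printing(document: str) -> str:
    """Hypothetical helper: delete every character matched by NON_PRINTING_RE."""
    return NON_PRINTING_RE.sub("", document)


if __name__ == "__main__":
    raw = "some\x00 text\x7f with\x1f debris"
    print(strip_non_printing(raw))
```

Applying such a helper only right before tokenization, rather than during normalization, leaves the stored documents untouched while keeping control characters out of the tokenizer input.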

Hi, thanks! I don't think I still have it, but it didn't take long to train: I ran it on my personal computer for 1 day, so it should...

Hi, thanks! I think the output of 07_03 is rather:

```
PATH_SAVE_S3_WEB_DOCS_NSFW_FILTERED = os.path.join(
    "s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered", str(IDX_JOB)
)
```

In this case you can just replace the images with None to...

No, but we have a Nomic map: https://atlas.nomic.ai/map/obelics

Hi, it was not very useful in the end, so I would recommend commenting out the parts where it's mentioned in `global_visualization`.