Hugo Laurençon
Sure, here is my example that results in an infinite loop:
```
from selectolax.parser import HTMLParser

html_str = """ """

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False...
```
@SaulLu can you edit and merge? Thanks!
@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas
Hi, thank you for your comment! The `remove_non_printing_characters` function was not used during the normalization of the documents (https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L357). However, it was used just before the tokenization step (https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/filtering.py#L213) and...
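To illustrate the ordering described above (the stored text is left untouched, and non-printing characters are stripped only just before tokenization), here is a minimal sketch. The character ranges in the regex and the `words_for_filtering` helper are assumptions made for illustration; they are not copied from `filtering.py`.

```python
import re

# Assumption: "non-printing" means the C0 control characters (0x00-0x1F),
# DEL (0x7F) and the C1 control characters (0x80-0x9F).
NON_PRINTING_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def remove_non_printing_characters(text: str) -> str:
    """Delete every character matched by NON_PRINTING_RE."""
    return NON_PRINTING_RE.sub("", text)


def words_for_filtering(text: str) -> list[str]:
    """Hypothetical helper: clean the text, then split it on whitespace.

    The cleaned copy is only used to compute the word list; the caller's
    original string is not modified.
    """
    cleaned = remove_non_printing_characters(text)
    return cleaned.split()


document = "hello\x00 world\x9f"
words = words_for_filtering(document)
```

Note that with these ranges, tabs and newlines are also removed, so words separated only by a newline would be merged in the cleaned copy.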
Thanks!
Hi, thanks! I don't think I still have it, but it didn't take long to train: I ran it on my personal computer for 1 day, so it should...
Hi, thanks! I think the output of 07_03 is rather
```
PATH_SAVE_S3_WEB_DOCS_NSFW_FILTERED = os.path.join(
    "s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered", str(IDX_JOB)
)
```
In this case you can just replace the images with None to...
No but we have a Nomic map: https://atlas.nomic.ai/map/obelics
Hi, it was not very useful in the end, so I would recommend commenting out the parts where it's mentioned in `global_visualization`.
From which file did you get this?