Re-add `test_uneven_common_crawl_range` PyTest

Open sarahyurick opened this issue 1 year ago • 0 comments

PR https://github.com/NVIDIA/NeMo-Curator/pull/235 skips `test_uneven_common_crawl_range` because the test is flaky. In the future, we may want to debug it and re-add it.

def test_uneven_common_crawl_range(self):
    start_snapshot = "2021-03"
    end_snapshot = "2021-11"
    urls = get_common_crawl_urls(start_snapshot, end_snapshot)

    assert (
        urls[0]
        == "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz"
    )
    assert (
        urls[-1]
        == "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-04/segments/1610704847953.98/warc/CC-MAIN-20210128134124-20210128164124-00799.warc.gz"
    )
    assert len(urls) == 143840
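
For anyone debugging this, the following is a rough sketch of what the "uneven" range appears to exercise: neither `2021-03` nor `2021-11` is itself a crawl ID in the asserted URLs, which come from `CC-MAIN-2021-10` (first) and `CC-MAIN-2021-04` (last). The helper `snapshots_in_range` and the `available` list are hypothetical and only for illustration. This is not how `get_common_crawl_urls` is implemented, and the year-week reading of the snapshot strings and the newest-first ordering are assumptions inferred from the assertions above.

```python
def snapshots_in_range(available, start, end):
    """Hypothetical helper: select crawl IDs whose (year, week) lies in [start, end].

    Assumes snapshot strings look like "YYYY-WW" and crawl IDs look like
    "CC-MAIN-YYYY-WW". Returns the matches newest first, which is the order
    the asserted URLs in the test suggest (an inference, not a documented fact).
    """

    def key(snapshot):
        # Turn "2021-03" into (2021, 3) so comparisons are numeric.
        year, week = snapshot.split("-")
        return int(year), int(week)

    def crawl_key(crawl_id):
        # Strip the "CC-MAIN-" prefix before parsing.
        return key(crawl_id.replace("CC-MAIN-", ""))

    lo, hi = key(start), key(end)
    selected = [c for c in available if lo <= crawl_key(c) <= hi]
    return sorted(selected, key=crawl_key, reverse=True)


# Illustrative list only; the real set of crawls comes from Common Crawl.
available = [
    "CC-MAIN-2020-50",
    "CC-MAIN-2021-04",
    "CC-MAIN-2021-10",
    "CC-MAIN-2021-17",
]

result = snapshots_in_range(available, "2021-03", "2021-11")
print(result)
```

A check like this runs without network access, so it could help separate range-selection bugs from whatever makes the full test flaky. The issue does not state the cause of the flakiness.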

sarahyurick, Sep 09 '24 18:09