NeMo-Curator
Re-add `test_uneven_common_crawl_range` PyTest
PR https://github.com/NVIDIA/NeMo-Curator/pull/235 skips `test_uneven_common_crawl_range` because it is flaky. In the future, we may want to debug it and re-add it. The skipped test:
def test_uneven_common_crawl_range(self):
    start_snapshot = "2021-03"
    end_snapshot = "2021-11"
    urls = get_common_crawl_urls(start_snapshot, end_snapshot)

    assert (
        urls[0]
        == "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz"
    )
    assert (
        urls[-1]
        == "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-04/segments/1610704847953.98/warc/CC-MAIN-20210128134124-20210128164124-00799.warc.gz"
    )
    assert len(urls) == 143840
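
One possible starting point for debugging, sketched under the assumption that the flakiness comes from transient network errors while fetching the URL list (the cause has not been confirmed here). The `retry` decorator and `flaky_fetch` below are hypothetical names for illustration and are not part of NeMo-Curator; `flaky_fetch` only stands in for a network-bound call such as `get_common_crawl_urls`.

```python
import functools
import time


def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Hypothetical helper: call a function up to `times` times in total."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: surface the real failure
                    time.sleep(delay)

        return wrapper

    return decorator


# Counts how many times the stand-in below has been invoked.
calls = {"count": 0}


@retry(times=3, exceptions=(ConnectionError,))
def flaky_fetch():
    # Stand-in for a network call; fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return ["url-a", "url-b"]
```

Only the listed exception types are retried, so a genuine `AssertionError` from the test body would still fail immediately. If the test turns out to fail for another reason (for example, the expected URLs or count no longer match), retrying would not help and the expected values would need to be revisited instead.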