cc-index-table

Replace int96 timestamps in index partitions before CC-MAIN-2020

Open sebastian-nagel opened this issue 2 years ago • 0 comments

See #7 and the announcement of the January 2020 crawl.

Recent Parquet library versions (1.12.2) have started to complain about the int96 timestamps:

$> parquet-cli cat -c fetch_time -n 5 s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/subset=warc/part-00247-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet
Argument error: INT96 is deprecated. As interim enable READ_INT96_AS_FIXED  flag to read as byte array.

No complaints for data from 2020 and newer:

$> parquet-cli cat -c fetch_time -n 5 s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-05/subset=warc/part-00243-2224c996-15d6-400a-8ae4-2d0740e74c18.c000.gz.parquet
1579483394000
1580078106000
1580035997000
1579264777000
1579422799000
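
For context, the values printed above are int64 timestamps in milliseconds since the Unix epoch. The sketch below shows how a legacy int96 value could map onto the same representation. It assumes the commonly used Impala/Hive layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian). The function names are hypothetical and not part of any Parquet library.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
MILLIS_PER_DAY = 86_400_000


def int96_to_epoch_millis(raw: bytes) -> int:
    """Decode a 12-byte int96 timestamp into epoch milliseconds.

    Assumed layout: little-endian int64 nanoseconds of day,
    then little-endian int32 Julian day number.
    """
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-millisecond precision is truncated.
    return days_since_epoch * MILLIS_PER_DAY + nanos_of_day // 1_000_000


def epoch_millis_to_int96(millis: int) -> bytes:
    """Encode epoch milliseconds in the assumed int96 layout (for round-trip checks)."""
    days_since_epoch, millis_of_day = divmod(millis, MILLIS_PER_DAY)
    return struct.pack(
        "<qi",
        millis_of_day * 1_000_000,
        days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH,
    )


if __name__ == "__main__":
    # Round-trip the first fetch_time value from the CC-MAIN-2020-05 output.
    sample = 1579483394000
    assert int96_to_epoch_millis(epoch_millis_to_int96(sample)) == sample
```

Rewriting a partition would normally be left to the Parquet writer's own timestamp settings rather than byte-level conversion. The sketch is only meant to illustrate why both encodings can carry the same `fetch_time` values.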

Tasks:

  • pin the usage of int64 timestamps (shouldn't be implemented by passing a configuration parameter as done in 500d454)
  • rewrite pre-2020 index partitions

sebastian-nagel, Dec 10 '21 15:12