cc-index-table

How to use AWS Athena to query CC-NEWS data?

Open: vansenic opened this issue 1 year ago • 1 comment

Overview:

I want to query the CC-NEWS dataset, but the blog post https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/ only describes the data under s3://commoncrawl/cc-index/table/cc-main/warc/.

My Question:

How can I use AWS Athena to query CC-NEWS data?

Or, how can I tell news records apart from the rest of the data in s3://commoncrawl/cc-index/table/cc-main/warc/?

vansenic avatar Feb 18 '23 10:02 vansenic

Unfortunately, there is no index for the news dataset yet.

sebastian-nagel avatar Feb 19 '23 21:02 sebastian-nagel