cc-index-table

Index Common Crawl archives in tabular format

Results 8 cc-index-table issues

(depends on #10) [Zstandard compression](https://en.wikipedia.org/wiki/Zstandard) is directly integrated into Parquet ([PARQUET-1866](https://issues.apache.org/jira/browse/PARQUET-1866)). Athena now also [supports Zstd as Parquet compression](https://docs.aws.amazon.com/athena/latest/ug/release-note-2021-11-24.html). Time to explore whether switching from gzip to zstd brings improvements...

See #7 and the [announcement of the January 2020 crawl](https://commoncrawl.org/2020/02/january-2020-crawl-archive-now-available/). Recent Parquet library versions (1.12.2) have started to complain about the int96 timestamps:

```
$> parquet-cli cat -c fetch_time -n 5 s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-43/subset=warc/part-00247-f47c372a-e3d4-4f2b-b7a0-a939c04fd01e.c000.gz.parquet
Argument...
```

Historical hostname -> IP and IP -> hostname (reverse IP) datasets are currently quite hard to come by (see https://opendata.stackexchange.com/questions/1951/dataset-of-domain-names). The only really convenient options are websites such as https://viewdns.info/reverseip/ which...

Bumps [guava](https://github.com/google/guava) from 31.1-jre to 32.0.0-jre. Release notes Sourced from guava's releases. 32.0.0 Maven <dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>32.0.0-jre</version> <!-- or, for Android: --> <version>32.0.0-android</version> </dependency> Jar files 32.0.0-jre.jar 32.0.0-android.jar Guava...

dependencies

Bumps spark-core_2.12 from 3.3.2 to 3.4.0.

dependencies

- no host name is extracted in the following situations:
  - URL contains 4 slashes after the protocol: https:////example.org/
    - while [java.net.URL](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html) extracts an empty hostname, Nutch's OkHTTP-based protocol...
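
The `java.net.URL` behaviour described in this issue can be checked with a minimal sketch like the one below. It uses only the JDK; the class name `HostExtractionDemo` and the helper `hostOf` are hypothetical names chosen for illustration and are not part of cc-index-table or Nutch. It does not reproduce the OkHTTP-based protocol side of the issue.

```java
import java.net.MalformedURLException;
import java.net.URL;

class HostExtractionDemo {

    /**
     * Hypothetical helper: returns the host as parsed by java.net.URL,
     * or null if the string cannot be parsed as a URL at all.
     * The returned host may be an empty string.
     */
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs but still available
    static String hostOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://example.org/",   // regular URL
            "https:////example.org/"  // four slashes after the protocol
        };
        for (String u : urls) {
            String host = hostOf(u);
            boolean missing = (host == null || host.isEmpty());
            System.out.println(u + " -> host=\"" + host + "\" missing=" + missing);
        }
    }
}
```

A caller that needs a host name for every record would probably want to treat both `null` and the empty string as "no host", as the `missing` flag does here.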

Overview: I want to query the CC-NEWS data, but according to this post, `https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/`, all data is in `s3://commoncrawl/cc-index/table/cc-main/warc/`. My question: how can I use AWS Athena to query CC-NEWS data?...

The example queries in [src/sql/examples/cc-index/](/commoncrawl/cc-index-table/tree/main/src/sql/examples/cc-index) were developed using Athena engine v1 or v2. There might be issues when engine v3 ([based on Trino instead of PrestoDB](https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-athena-announces-upgraded-query-engine/)) is used. E.g. `(num_pages/total_pages_host)...