Hacker News Dataset
- Download data from the official API:
https://github.com/HackerNews/API
# Fetch ~30 million items in 2,991 chunks of 10,000, using 100 parallel workers
seq 0 2990 | xargs -P100 -I{} bash -c '
BEGIN=$(({} * 10000));
END=$((({} + 1) * 10000 - 1));
echo $BEGIN $END;
curl -sS --retry 100 "https://hacker-news.firebaseio.com/v0/item/[${BEGIN}-${END}].json" | pv > "hn{}.json"'
Downloading will take about a day. The total size of the files is 12.8 GB.
As an alternative, you can download prepared files from http://files.pushshift.io/hackernews/, but this source has been abandoned and is no longer updated.
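As a quick sanity check (an optional step, not part of the original procedure), you can confirm that all 2,991 chunks arrived and check their combined size; the commands assume the hn{}.json naming used above:
ls hn*.json | wc -l        # expect 2991 files
du -ch hn*.json | tail -1  # combined size, roughly 12.8 GB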
- Clean up the downloaded data:
# Split each JSON object onto its own line and drop responses that are just "null"
for i in *.json; do echo $i; sed 's/{/\n{/g' $i | grep -v -P '^null$' > ${i}.tmp && mv ${i}.tmp ${i}; done
# Remove chunks of exactly 40,000 bytes: these are 10,000 consecutive four-byte "null" responses and contain no items
find . -size 40000c | xargs rm
# Strip "null" responses that remained glued to the end of a preceding object
grep -l -o -F '}null' *.json | xargs sed -i -r 's/}(null)+/}/g'
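Optionally, before loading, you can check that the cleaned files parse as JSONEachRow using clickhouse-local. This is a sketch rather than part of the original procedure; the column list is a small subset chosen only for the check:
clickhouse-local --query "
    SELECT count()
    FROM file('hn*.json', JSONEachRow, 'id UInt32, type String')
" --input_format_skip_unknown_fields=1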
- Create table:
CREATE TABLE hackernews
(
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = MergeTree ORDER BY id
- Insert data:
clickhouse-client --query "INSERT INTO hackernews FROM INFILE '*.json' FORMAT JSONEachRow" --progress
The insert takes about 24 seconds at 1,202,257 rows/sec.
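To verify the load, a simple aggregation over the new table works well, for example counting items per type:
SELECT type, count() AS c
FROM hackernews
GROUP BY type
ORDER BY c DESC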
- The data is available in the ClickHouse Playground: https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUIHRvWWVhcih0aW1lKSBBUyBkLCBjb3VudCgpIEFTIGMsIGJhcihjLCAwLCAxMDAwMDAwMCwgMTAwKSBGUk9NIGhhY2tlcm5ld3MgR1JPVVAgQlkgZCBPUkRFUiBCWSBk
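The query encoded in that link counts items per year and draws a bar chart:
SELECT toYear(time) AS d, count() AS c, bar(c, 0, 10000000, 100)
FROM hackernews
GROUP BY d
ORDER BY d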