Hacker News Dataset
- Download data from the official API:
https://github.com/HackerNews/API
# Fetch ~30 million items in 2,991 chunks of 10,000, using 100 parallel workers
seq 0 2990 | xargs -P100 -I{} bash -c '
BEGIN=$(({} * 10000));
END=$((({} + 1) * 10000 - 1));
echo $BEGIN $END;
curl -sS --retry 100 "https://hacker-news.firebaseio.com/v0/item/[${BEGIN}-${END}].json" | pv > "hn{}.json"'
Downloading will take about a day. The total size of the files is 12.8 GB.
As an alternative, you can download prepared files from http://files.pushshift.io/hackernews/, but this source has been abandoned and is no longer updated.
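As a quick sanity check (an optional step, not part of the original procedure), you can confirm that all 2,991 chunks arrived and check their combined size; the commands assume the hn{}.json naming used above:
ls hn*.json | wc -l        # expect 2991 files
du -ch hn*.json | tail -1  # combined size, roughly 12.8 GB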
- Clean up the downloaded data:
# Split each JSON object onto its own line and drop responses that are just "null"
for i in *.json; do echo $i; sed 's/{/\n{/g' $i | grep -v -P '^null$' > ${i}.tmp && mv ${i}.tmp ${i}; done
# Remove chunks of exactly 40,000 bytes: these are 10,000 consecutive four-byte "null" responses and contain no items
find . -size 40000c | xargs rm
# Strip "null" responses that remained glued to the end of a preceding object
grep -l -o -F '}null' *.json | xargs sed -i -r 's/}(null)+/}/g'
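Optionally, before loading, you can check that the cleaned files parse as JSONEachRow using clickhouse-local. This is a sketch rather than part of the original procedure; the column list is a small subset chosen only for the check:
clickhouse-local --query "
    SELECT count()
    FROM file('hn*.json', JSONEachRow, 'id UInt32, type String')
" --input_format_skip_unknown_fields=1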
- Create table:
CREATE TABLE hackernews
(
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = MergeTree ORDER BY id
- Insert data:
clickhouse-client --query "INSERT INTO hackernews FROM INFILE '*.json' FORMAT JSONEachRow" --progress
The insert takes about 24 seconds at 1,202,257 rows/sec.
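To verify the load, a simple aggregation over the new table works well, for example counting items per type:
SELECT type, count() AS c
FROM hackernews
GROUP BY type
ORDER BY c DESC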
- The data is available in the ClickHouse Playground: https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUIHRvWWVhcih0aW1lKSBBUyBkLCBjb3VudCgpIEFTIGMsIGJhcihjLCAwLCAxMDAwMDAwMCwgMTAwKSBGUk9NIGhhY2tlcm5ld3MgR1JPVVAgQlkgZCBPUkRFUiBCWSBk
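The query encoded in that link counts items per year and draws a bar chart:
SELECT toYear(time) AS d, count() AS c, bar(c, 0, 10000000, 100)
FROM hackernews
GROUP BY d
ORDER BY d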