Pandemic-Knowledge
A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.
- Contamination figures
- Vaccination figures
- Death figures
- COVID-19-related news (Google News, Twitter)
What you can achieve
Example dashboards (screenshots):
- Live contaminations map + latest news
- Last 7 days of news
- France 3-weeks live map (Kibana Canvas)
- Live vaccinations map
Context
This project was realized over 4 days as part of an MSc hackathon at ETNA, a French computer science school.
The incentives were both to experiment with and prototype a big data pipeline, and to contribute to an open source project.
Install
Below, you'll find the procedure to process COVID-related files and news into the Pandemic-Knowledge database (Elasticsearch).
The process is scheduled to run every 24 hours, so you can update the files and obtain the latest news.
- Pandemic-Knowledge
  - What you can achieve
  - Context
  - Install
    - Env file
    - Initialize Elasticsearch
    - Initialize Prefect
    - Run Prefect workers
    - COVID-19 data
    - News data
    - News web app
Env file
Running this project on your local computer? Just copy the .env.example file:
cp .env.example .env
Open this .env file and edit the password-related variables.
Initialize Elasticsearch
Raise your host's vm.max_map_count limit so Elasticsearch can handle high I/O:
sudo sysctl -w vm.max_map_count=500000
Then:
docker-compose -f create-certs.yml run --rm create_certs
docker-compose up -d es01 es02 es03 kibana
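Once the containers are up, you can quickly check that the cluster is healthy. This is a minimal sketch, assuming the built-in elastic user and the password you set in .env; certificate verification is disabled because the certificates are self-signed:

```python
# Minimal cluster health check (sketch). The certificates generated by
# create-certs.yml are self-signed, hence verify=False.
import requests

resp = requests.get(
    "https://localhost:9200/_cluster/health",
    auth=("elastic", "CHANGE_ME"),  # password defined in your .env
    verify=False,
)
print(resp.json()["status"])  # "green" or "yellow" means the cluster is ready
```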
Initialize Prefect
Create a ~/.prefect/config.toml
file with the following content :
# debug mode
debug = true
# base configuration directory (typically you won't change this!)
home_dir = "~/.prefect"
backend = "server"
[server]
host = "http://172.17.0.1"
port = "4200"
host_port = "4200"
endpoint = "${server.host}:${server.port}"
Run Prefect:
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
We need to create a tenant. Execute on your host:
pip3 install prefect
prefect backend server
prefect server create-tenant --name default --slug default
Access the web UI at localhost:8081
Run Prefect workers
Agents are services that run your scheduled flows.
- Open and optionally edit the agent/config.toml file.
- Let's instantiate 3 workers:
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent
:information_source: You can run the agents on another machine than the one hosting the Prefect server. Edit the agent/config.toml file for that.
COVID-19 data
Injection scripts are scheduled in Prefect so that they automatically re-inject the latest data (delete + inject).
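For reference, here is a minimal sketch of what such a scheduled flow could look like with Prefect 1.x; the task body, flow name and project name are illustrative, not the exact code shipped in ./flow:

```python
# Sketch of a daily "delete + inject" flow (Prefect 1.x, server backend).
from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import IntervalSchedule

@task
def delete_and_inject():
    # 1. delete the previous Elasticsearch index
    # 2. download / read the latest CSV
    # 3. bulk-index the fresh documents
    ...

# run once every 24 hours, matching the project's refresh cycle
schedule = IntervalSchedule(interval=timedelta(hours=24))

with Flow("insert_owid", schedule=schedule) as flow:
    delete_and_inject()

if __name__ == "__main__":
    # registering makes the flow visible to the Prefect agents
    flow.register(project_name="pandemic-knowledge")
```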
There are several data sources supported by Pandemic-Knowledge:
- Our World In Data (used by Google)
  - docker-compose slug: insert_owid
  - MinIO bucket: contamination-owid
  - Format: CSV
- OpenCovid19-Fr
  - docker-compose slug: insert_france
  - Format: CSV (downloaded from the Internet)
- Public Health France - Virological test results (official source)
  - docker-compose slug: insert_france_virtests
  - Format: CSV (downloaded from the Internet)
- Start MinIO and import your files according to the buckets mentioned above. For Our World In Data, create the contamination-owid bucket and import the CSV file inside (a scripted alternative is sketched after this list).
docker-compose up -d minio
MinIO is available at localhost:9000
- Download dependencies and start the injection service of your choice. For instance:
pip3 install -r ./flow/requirements.txt
docker-compose -f insert.docker-compose.yml up --build insert_owid
- In Kibana, create an index pattern contamination_owid_*
- Once injected, we recommend adjusting the number of replicas in the Dev Tools console:
PUT /contamination_owid_*/_settings
{ "index" : { "number_of_replicas" : "2" } }
- Start making your dashboards in Kibana!
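If you prefer scripting the upload rather than using the MinIO console, here is a minimal sketch with the minio Python client (pip3 install minio); the credentials and file name are placeholders, not values from this repository:

```python
# Sketch: create the contamination-owid bucket and upload the OWID CSV.
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="CHANGE_ME",  # use the MinIO credentials defined in your .env
    secret_key="CHANGE_ME",
    secure=False,
)

if not client.bucket_exists("contamination-owid"):
    client.make_bucket("contamination-owid")

client.fput_object(
    "contamination-owid",
    "owid-covid-data.csv",    # object name inside the bucket
    "./owid-covid-data.csv",  # local path to the CSV you downloaded
)
```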
News data
There are two sources for news:
- Google News (Elasticsearch index: news_googlenews)
- Twitter (Elasticsearch index: news_tweets)
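Once the crawlers below have run, you can verify what was indexed. A minimal sketch, assuming the elasticsearch 7.x Python client, the built-in elastic user and the password from your .env:

```python
# Sketch: list a few crawled news documents from the news_* indices.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://localhost:9200",
    http_auth=("elastic", "CHANGE_ME"),  # password defined in your .env
    verify_certs=False,                  # self-signed certificates
)

resp = es.search(index="news_*", body={"query": {"match_all": {}}, "size": 5})
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_source"].get("link"))
```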
- Run the Google News crawler:
docker-compose -f crawl.docker-compose.yml up --build crawl_google_news # and/or crawl_tweets
- In Kibana, create a news_* index pattern
- Edit the index pattern fields:
Name | Type | Format |
---|---|---|
img | string | Url (Type: Image, empty URL template) |
link | string | Url |
- Create your visualisation
News web app
Browse through the news with our web application.
- Make sure you've accepted the self-signed certificate of Elasticsearch at https://localhost:9200
- Start up the app:
docker-compose -f news_app/docker-compose.yml up --build -d
- Discover the app at localhost:8080
TODOs
Possible improvements:
- [ ] Using Dask to parallelize the processing of CSV lines in batches of 1000
- [ ] Removing indices only when the source process is successful (add the new index, then remove the old one)
- [ ] Removing indices only when crawling is successful (add the new index, then remove the old one)
Useful commands
To stop everything:
docker-compose down
docker-compose -f agent/docker-compose.yml down
docker-compose -f insert.docker-compose.yml down
docker-compose -f crawl.docker-compose.yml down
To start each service, step by step:
docker-compose up -d es01 es02 es03 kibana
docker-compose up -d minio
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent