
Process area: Airflow and workflow implementation

vincentnam opened this issue on Jan 22, 2021 • 0 comments

Airflow is one of the most important services in the datalake architecture, and there is a lot of work to do on it. Airflow handles all the workflows / pipelines for data management, data transformation and data analysis. The work has two parts: an engineering part and a research part.

At this moment, the research part is linked to the semantic web, ontologies and metadata management. It is closely tied to the DataNoos research group at IRIT (https://www.irit.fr/datanoos/):

  • How to handle interoperability between workflows from already developed projects?
  • How to create workflows with a generic method? (https://github.com/Barski-lab/cwl-airflow could be (part of) the solution)
  • How to keep information about workflows for the data life cycle?
  • What metadata format / metadata architecture should be used in the datalake?

The engineering part is split into two parts:

Workflow structure implementation

  • What should the software architecture be to handle an increasing number of different workflows?

  • How to generate workflows from data?

The idea of the project is to create an Airflow workflow implementation from a graph database and a document-oriented database. Since Airflow DAGs and Neo4j are both graph-based, the main idea is to build the DAG from Neo4j, with each task defined in a MongoDB document. (See the Airflow documentation for the concepts: https://airflow.apache.org/)
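A minimal sketch of what this could look like, assuming a Neo4j graph where (:Task) nodes are linked by [:NEXT] relationships and carry a mongo_id property pointing to a MongoDB document that describes the callable. The node labels, document fields and connection settings below are assumptions, not an existing schema:

```python
# Sketch: build an Airflow DAG from a Neo4j graph, with task bodies stored in MongoDB.
# The graph schema ((:Task)-[:NEXT]->(:Task), "mongo_id" property) is an assumption.
import importlib
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from neo4j import GraphDatabase
from pymongo import MongoClient

NEO4J_URI = "bolt://neo4j:7687"      # assumed service names / credentials
MONGO_URI = "mongodb://mongo:27017"

def load_graph():
    """Return (tasks, edges) read from Neo4j."""
    driver = GraphDatabase.driver(NEO4J_URI, auth=("neo4j", "password"))
    with driver.session() as session:
        nodes = session.run(
            "MATCH (t:Task {dag_id: 'generated_dag'}) "
            "RETURN t.name AS name, t.mongo_id AS mongo_id")
        tasks = {r["name"]: r["mongo_id"] for r in nodes}
        rels = session.run(
            "MATCH (a:Task)-[:NEXT]->(b:Task) RETURN a.name AS up, b.name AS down")
        edges = [(r["up"], r["down"]) for r in rels]
    driver.close()
    return tasks, edges

def make_callable(mongo_id):
    """Build the Python callable described by a MongoDB task document."""
    def _run(**context):
        doc = MongoClient(MONGO_URI).datalake.tasks.find_one({"_id": mongo_id})
        # "module" / "callable" / "params" is an assumed document layout
        module = importlib.import_module(doc["module"])
        getattr(module, doc["callable"])(**doc.get("params", {}))
    return _run

tasks, edges = load_graph()

with DAG("generated_dag", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    operators = {
        name: PythonOperator(task_id=name, python_callable=make_callable(mongo_id))
        for name, mongo_id in tasks.items()
    }
    for upstream, downstream in edges:
        operators[upstream] >> operators[downstream]
```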

The objective is to design and implement tools to:
  • Read the workflow data
  • Read the tasks
  • Automatically implement an Airflow workflow for new data

Logs of every workflow have to be saved and queryable, and they must make it possible to recreate the whole process later from the logs (the whole workflow, the run and the processed data).
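As a rough illustration of what "queryable and replayable" could mean, each run could write a document like the one below into MongoDB; the collection name and all field names are only an assumption:

```python
# Sketch of a per-run log document that would make a workflow run replayable.
# The "workflow_runs" collection and every field name are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

run_log = {
    "dag_id": "generated_dag",
    "run_id": "manual__2021-01-22T15:01:00",
    "started_at": datetime.now(timezone.utc),
    "dag_definition": {},     # snapshot of the Neo4j graph used for this run
    "task_documents": [],     # snapshot of the MongoDB task documents
    "input_data": "raw-zone/object-id",   # reference to the processed data (assumed layout)
    "task_logs": [],          # appended as tasks finish
}
MongoClient("mongodb://mongo:27017").datalake.workflow_runs.insert_one(run_log)
```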

Workflow content implementation

  • Implement generic data transformations as defaults in the datalake, so that users outside the data analysis domain still get a minimum level of service.

The real challenge is to define a workflow for each data type when the data is unknown. Take a CSV file as an example: most of the time it contains one of the following (a detection sketch follows these examples).

  • Time series

    Timestamp    Value  Type
    1613988555   14.2   "Celsius"
    1613988655   14.3   "Celsius"
    ...          ...    ...

  • Relational data (with an underlying structure)

    Market      Vendor      Product     Number
    "Market_1"  "Philippe"  "142D54S7"  1
    "Market_2"  "Erik"      "142D54S7"  14
    ...         ...         ...         ...

(Other kinds of data could also be defined.)
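A minimal detection heuristic for the CSV case, assuming pandas is available and that "time series" simply means one column that parses as a timestamp plus numeric values; the thresholds and the plausibility cut-off are illustrative assumptions:

```python
# Sketch: classify a parsed CSV as "timeseries" or "relational".
# The heuristics (one timestamp-like column, 90% threshold, year-2000 cut-off) are assumptions.
import pandas as pd

def detect_csv_kind(path: str) -> str:
    df = pd.read_csv(path)
    for column in df.columns:
        # Treat integer columns as epoch seconds, everything else as date strings.
        unit = "s" if df[column].dtype.kind in "iu" else None
        parsed = pd.to_datetime(df[column], errors="coerce", unit=unit)
        plausible = parsed.notna() & (parsed > pd.Timestamp("2000-01-01"))
        if plausible.mean() > 0.9 and df.select_dtypes("number").shape[1] >= 1:
            return "timeseries"
    return "relational"

print(detect_csv_kind("sample.csv"))  # "sample.csv" is a placeholder path
```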

The workflow should (see the sketch after this list):

  • Read the data (CSV, JSON, XML, etc.)
  • Detect the data type
    • Time series:

      • Concatenate all the points into a time series
      • Insert the time series into InfluxDB
    • Relational data:

      • Detect the data architecture / create a data architecture
      • Format the data
      • Insert the data into a relational database
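A sketch of the two insertion branches, assuming the influxdb-client and pandas/SQLAlchemy libraries; the bucket, org, token, connection strings and table name are placeholders, and the column names reuse the CSV example above:

```python
# Sketch of the two branches after type detection.
# Connection settings, bucket/org/token and table names are placeholders.
import pandas as pd
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from sqlalchemy import create_engine

def insert_timeseries(df: pd.DataFrame) -> None:
    """Concatenate the rows into points and write them to InfluxDB."""
    client = InfluxDBClient(url="http://influxdb:8086", token="TOKEN", org="datalake")
    write_api = client.write_api(write_options=SYNCHRONOUS)
    points = [
        Point("measurement")
        .tag("type", row["Type"])
        .field("value", float(row["Value"]))
        .time(int(row["Timestamp"]), write_precision=WritePrecision.S)
        for _, row in df.iterrows()
    ]
    write_api.write(bucket="raw", record=points)
    client.close()

def insert_relational(df: pd.DataFrame, table: str) -> None:
    """Format the data and push it into a relational database."""
    engine = create_engine("postgresql://user:password@postgres:5432/datalake")
    df.columns = [c.strip().lower() for c in df.columns]   # minimal formatting step
    df.to_sql(table, engine, if_exists="append", index=False)
```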

The data can come as a file or as a stream. For a stream, the file extension cannot be used, so a solution has to be designed (probably with Kafka) to extract the format type from the data itself.
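One possible approach, sketched below, is to sniff the first bytes of the payload instead of relying on an extension. The magic numbers are standard, but the set of formats, the fallbacks and where this sits relative to Kafka (e.g., as a consumer-side step) are assumptions:

```python
# Sketch: guess the format of a payload received from a stream (no file extension).
# Magic numbers are standard; the chosen formats and fallbacks are assumptions.
import csv
import json

MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"PK\x03\x04": "zip/xlsx",   # .xlsx and .zip share the same container format
    b"\x1f\x8b": "gzip",
    b"%PDF": "pdf",
}

def sniff_format(payload: bytes) -> str:
    for magic, fmt in MAGIC.items():
        if payload.startswith(magic):
            return fmt
    text = payload[:4096].decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        return "xml/svg"
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        csv.Sniffer().sniff(text)
        return "csv"
    except csv.Error:
        return "txt"
```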

At this point, the file types that need to be processed are: .csv, .json, .xml, .txt, .jpg, .png, .zip, .rar, .tar, .tar.gz (and others), .xls, .xlsx, .svg and .pdf.
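For the file-based path, where the extension is available, a simple dispatch table could route each extension to a processing branch; the branch names below are placeholders, not existing workflows:

```python
# Sketch: route incoming files by extension; branch names are placeholders.
from pathlib import Path

EXTENSION_BRANCH = {
    ".csv": "tabular", ".xls": "tabular", ".xlsx": "tabular",
    ".json": "semi_structured", ".xml": "semi_structured", ".txt": "text",
    ".jpg": "image", ".png": "image", ".svg": "image", ".pdf": "document",
    ".zip": "archive", ".rar": "archive", ".tar": "archive", ".gz": "archive",
}

def branch_for(filename: str) -> str:
    suffixes = Path(filename).suffixes      # handles ".tar.gz" and friends
    ext = suffixes[-1].lower() if suffixes else ""
    return EXTENSION_BRANCH.get(ext, "unknown")

print(branch_for("sensor_data.csv"))   # tabular
print(branch_for("backup.tar.gz"))     # archive
```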


Image recommendation tool:

As part of the Airflow workflow implementation, one project is to build a metadata database over the content in order to create a recommendation tool for building image datasets. The goal is to detect (extract) the elements present in each image, build a Neo4j-based recommendation tool and retrieve the list of images that match given criteria.
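A sketch of the Neo4j side, assuming an object-detection step already returns a list of labels per image (detect_labels() below is hypothetical); the node labels, relationship name and Cypher queries are illustrative:

```python
# Sketch: store detected image elements in Neo4j and query images by criteria.
# detect_labels() is hypothetical; node labels and relationship names are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://neo4j:7687", auth=("neo4j", "password"))

def index_image(image_id: str, labels: list[str]) -> None:
    """Create (Image)-[:CONTAINS]->(Label) relationships for one image."""
    with driver.session() as session:
        session.run("MERGE (i:Image {id: $id})", id=image_id)
        for label in labels:
            session.run(
                "MATCH (i:Image {id: $id}) "
                "MERGE (l:Label {name: $label}) "
                "MERGE (i)-[:CONTAINS]->(l)",
                id=image_id, label=label,
            )

def images_matching(required_labels: list[str]) -> list[str]:
    """Return ids of images that contain every requested label."""
    with driver.session() as session:
        result = session.run(
            "MATCH (i:Image)-[:CONTAINS]->(l:Label) "
            "WHERE l.name IN $labels "
            "WITH i, count(DISTINCT l) AS hits "
            "WHERE hits = size($labels) "
            "RETURN i.id AS id",
            labels=required_labels,
        )
        return [record["id"] for record in result]

# index_image("img_001", detect_labels("img_001.jpg"))  # detect_labels() is hypothetical
# images_matching(["dog", "car"])
```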

Based on the same concept, this could also be done for audio and video data.

AI, machine learning and deep learning tools could be used in this project.
