docker_datalake
Data stream project
Data management:
Data management is composed of two parts: batch data and stream data. The stream part is not yet well implemented in the architecture; additional tools need to be inserted into it.
Goals:
- Insert data as stream (use case : IoT sensors every x seconds)
- Consume data as a stream (use case : online machine learning)
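The first goal (inserting IoT samples as a stream) implies that each sample arrives as a small, self-describing message. A minimal sketch, assuming JSON payloads; the field names (`stream_id`, `ts`, `value`) are illustrative, not a fixed schema:

```python
import json
import time

def make_sample(stream_id: str, value: float) -> bytes:
    """Serialize one sensor reading as a self-describing JSON payload.

    Field names (stream_id, ts, value) are illustrative assumptions,
    not a decided schema.
    """
    record = {"stream_id": stream_id, "ts": time.time(), "value": value}
    return json.dumps(record).encode("utf-8")

# One reading from a hypothetical temperature sensor, emitted every x seconds:
payload = make_sample("sensor-42/temperature", 21.5)
```

Such a payload is what the insertion tool (e.g. a Kafka producer) would carry; the sketch deliberately stays tool-agnostic.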
At a finer granularity, stream data is just a series of batch data. In the raw data area, each sample coming from a data stream could be seen as a dataset. But it has to be thought about at a higher level:
- How to handle data stream metadata?
- How to handle a data stream in OpenStack Swift / the raw data area?
- Should all the samples be concatenated every x period of time?
- Should all the samples be considered as batch data?
- How to handle the whole stream? (1 source = 1 stream? What if the source is a complex one, e.g. a multi-sensor or energy-monitoring device?)
- How to automatically handle a stream? (On ID? On ID + type of data? etc.)
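One possible answer to the concatenation and automatic-handling questions above is time-windowed grouping: samples sharing the same stream ID and time window are concatenated into one batch dataset for the raw data area. A minimal sketch, where the window length and the `stream_id/window_start` naming convention are assumptions, not a decided design:

```python
from collections import defaultdict

def group_into_batches(samples, window_seconds=60):
    """Group (stream_id, timestamp, value) samples into per-stream,
    per-window batches; each batch could become one raw-area dataset.

    The window length and the naming convention are assumptions.
    """
    batches = defaultdict(list)
    for stream_id, ts, value in samples:
        # Align the timestamp to the start of its time window.
        window_start = int(ts // window_seconds) * window_seconds
        # Illustrative dataset naming: stream ID + window start time.
        batches[f"{stream_id}/{window_start}"].append((ts, value))
    return dict(batches)

samples = [("sensor-1", 0.0, 1.0), ("sensor-1", 30.0, 2.0),
           ("sensor-1", 70.0, 3.0), ("sensor-2", 10.0, 4.0)]
batches = group_into_batches(samples)
# → {'sensor-1/0': [(0.0, 1.0), (30.0, 2.0)],
#    'sensor-1/60': [(70.0, 3.0)],
#    'sensor-2/0': [(10.0, 4.0)]}
```

With this convention, "1 source = 1 stream" maps to one key prefix per stream, and a complex multi-sensor device could simply emit several stream IDs.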
Streaming data insertion:
~~The first step is to deploy the data insertion tool: Kafka.~~
- [ ] ~~Deploy Kafka for data insertion~~
- [ ] Design the metadata and data management services
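As a starting point for the metadata design task above, the metadata of a whole stream could be modeled separately from per-sample metadata. The fields below are illustrative assumptions, not a decided schema:

```python
from dataclasses import dataclass, field

@dataclass
class StreamMetadata:
    """Describes a whole stream (one logical source), as opposed to the
    metadata of a single sample. All fields are illustrative assumptions."""
    stream_id: str
    data_type: str          # e.g. "temperature", "energy"
    source: str             # device or gateway identifier
    period_seconds: float   # expected interval between samples
    tags: dict = field(default_factory=dict)

meta = StreamMetadata("sensor-42/temperature", "temperature", "gateway-1", 5.0)
```

Registering one such record per stream would let the automatic handling questions (on ID, on ID + type of data) be answered by a metadata lookup instead of by convention.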
Streaming data consumption tools:
Once done, there are two parts: streaming data consumption and real-time streaming data consumption. Real-time consumption is a specific case of streaming consumption: data are still consumed as a stream, but the time between a datum's creation and its consumption is constrained by a maximum deadline.
The real-time architecture has to be fully designed around that use case: is it possible to deliver a real-time consumption service within a data lake? The question has to be discussed. For the moment, real-time data processing development is paused until an answer is found.
At this stage, Kafka could be a good answer, but architecture design discussions and benchmarks have to be done.
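The real-time constraint described above (a maximum deadline between data creation and data consumption) can be expressed as a simple check on the consumer side. A minimal sketch, with the deadline value as an assumption:

```python
def split_on_deadline(samples, now, deadline_seconds):
    """Split consumed samples into on-time and late, given the maximum
    allowed delay between creation and consumption (the real-time
    constraint). The deadline value is an illustrative assumption."""
    on_time, late = [], []
    for created_ts, payload in samples:
        (on_time if now - created_ts <= deadline_seconds else late).append(payload)
    return on_time, late

# With a 0.5 s deadline, a sample created 0.3 s ago is on time,
# while one created 2.0 s ago has missed its deadline:
on_time, late = split_on_deadline(
    [(99.7, "a"), (98.0, "b")], now=100.0, deadline_seconds=0.5)
# → on_time == ["a"], late == ["b"]
```

Whether late samples should be dropped, flagged, or demoted to the batch path is exactly the kind of design decision the benchmarks would have to settle.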
- [ ] State of the art of data streaming handling tools
- [ ] Design the solution for data streaming consumption (Kafka?)