
Data stream project

Open · vincentnam opened this issue on Jan 25, 2021 · 0 comments

Data management:

Data management is composed of two parts: batch data and stream data.

The stream part is not yet well implemented in the architecture; new tools need to be integrated to support it.

Goals:

  • Insert data as a stream (use case: IoT sensors emitting a sample every x seconds; see the producer sketch after this list)
  • Consume data as a stream (use case: online machine learning)
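
As a concrete illustration of the insertion goal, here is a minimal sketch of a sensor-side producer. It assumes a Kafka broker reachable at localhost:9092 and a topic named iot-samples; the broker address, topic name, and payload fields are all hypothetical, and kafka-python is only one possible client library.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; adjust to the actual deployment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Hypothetical payload: one sample from an IoT sensor.
    sample = {"sensor_id": "sensor-42", "timestamp": time.time(), "value": 21.5}
    producer.send("iot-samples", sample)
    producer.flush()  # Make sure the sample leaves the client buffer.
    time.sleep(5)     # "every x seconds" from the use case, here x = 5.
```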

At a finer granularity, a data stream is just a sequence of small batches. For the raw data area, each sample coming from a data stream could be seen as a dataset, but this has to be thought through at a higher level:

  • How to handle data stream metadata?
  • How to handle a data stream in OpenStack Swift / the raw data area?
    • Should all the samples be concatenated every x period of time? (see the sketch after this list)
    • Should each sample be considered as a batch of data?
      • How to handle the whole stream? (1 source = 1 stream? What if the source is complex, e.g. a multi-sensor or energy-monitoring device?)
      • How to automatically handle a stream? (On ID? On ID + type of data? etc.)
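
To make the concatenation option concrete, below is a minimal sketch that buffers incoming samples and writes one object per time window into the raw data area. It assumes python-swiftclient with hypothetical credentials, container name (raw-data), object naming scheme, and window length; none of these come from the actual architecture.

```python
import json
import time

from swiftclient.client import Connection  # pip install python-swiftclient

# Hypothetical Swift credentials and endpoint; replace with the real ones.
conn = Connection(
    authurl="http://swift.example:8080/auth/v1.0",
    user="test:tester",
    key="testing",
)

WINDOW_SECONDS = 60  # Hypothetical "x period of time".
buffer, window_start = [], time.time()

def on_sample(sample: dict) -> None:
    """Buffer one stream sample (e.g. from a Kafka consumer loop);
    flush the whole window as a single object when it expires."""
    global buffer, window_start
    buffer.append(sample)
    if time.time() - window_start >= WINDOW_SECONDS:
        # One Swift object per window: the concatenated samples become
        # a single "batch" dataset in the raw data area.
        name = f"stream-42/window-{int(window_start)}.json"  # hypothetical naming
        conn.put_object("raw-data", name, contents=json.dumps(buffer))
        buffer, window_start = [], time.time()
```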

Streaming data insertion:

~~The first step is to deploy the data insertion tool: Kafka.~~

  • [ ] ~~Deploy Kafka for data insertion~~
  • [ ] Design the metadata and data management services (a possible metadata shape is sketched below)
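
As a starting point for that design task, here is one possible shape for per-stream metadata, expressed as a Python dataclass. Every field name here is an assumption, not an existing schema of the project.

```python
from dataclasses import dataclass, field


@dataclass
class StreamMetadata:
    """One possible metadata record per data stream (all fields hypothetical)."""
    stream_id: str        # e.g. "sensor-42"
    source: str           # physical or logical origin of the stream
    data_type: str        # e.g. "temperature", "multi-sensor"
    sample_period_s: float  # expected seconds between samples
    topic: str            # Kafka topic carrying the stream
    raw_container: str    # Swift container holding the windowed batches
    tags: dict = field(default_factory=dict)  # free-form extra metadata
```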

Streaming data consumption tools:

Once that is done, there are two parts: streaming data consumption and real-time streaming data consumption. Real-time consumption is a specific case of streaming data: data are still consumed as a stream, but the time between data creation and data consumption is constrained by a maximum deadline. A real-time architecture has to be fully designed around that use case: is it even possible to deliver a real-time consumption service on top of a datalake? That question has to be discussed; for the moment, real-time data processing development is paused until an answer is found.
At this stage, Kafka could be a good answer, but architecture design discussions and benchmarks still have to be done.

  • [ ] State of the art of data streaming handling tools
  • [ ] Design the solution for data streaming consumption (Kafka? See the consumer sketch below)
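
For the online machine learning use case, a consumer could look like the following minimal sketch. It assumes the hypothetical iot-samples topic from above, kafka-python as the client, and the river library as one example of an online learner; none of these choices are decided by this issue.

```python
import json

from kafka import KafkaConsumer   # pip install kafka-python
from river import linear_model    # pip install river

# Hypothetical topic and broker; must match the producer side.
consumer = KafkaConsumer(
    "iot-samples",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

model = linear_model.LinearRegression()  # any incremental model would do
last_value = None

for message in consumer:
    sample = message.value
    if last_value is not None:
        # Toy online task: predict the next sensor value from the previous one,
        # then update the model with the observed value (learn-one-at-a-time).
        x = {"prev": last_value}
        y = sample["value"]
        print("prediction:", model.predict_one(x), "actual:", y)
        model.learn_one(x, y)
    last_value = sample["value"]
```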
