inlong icon indicating copy to clipboard operation
inlong copied to clipboard

[Umbrella] InLong offline synchronization feature

Open aloyszhang opened this issue 11 months ago • 0 comments

Motivation

Currently, InLong provides real-time data synchronization based on the Flink engine, which has the advantage of low latency. Compared to real-time synchronization, offline data synchronization(not supported yet) pays more attention to synchronization throughput and efficiency.

To enhance the usage scenarios of InLong, we plan to add support for offline data synchronization capability in InLong. The implementation is based on the Flink computing engine uniformly. Real-time synchronization tasks run in the manner of Flink stream tasks, while offline synchronization runs in the manner of Flink batch tasks. This approach can ensure the consistency of real-time and offline synchronization tasks' code as much as possible, reducing maintenance costs.

Solution

The offline synchronization feature of the InLong dataset integration provides sources and sinks for processing data, corresponding to data sources and destinations, and combines with the scheduling system to synchronize full or incremental data from the data source to the data target.

InLong supports scheduling offline synchronization tasks by setting specific trigger times(including year, month, day, hour, and minute) through the scheduling system.

Offline synchronization tasks are created by the Manager (including scheduling information), and the specific data synchronization logic is implemented through the InLong Sort module.

Logical Architecture

image

Key Competency

Job Configuration: Support Wizard Mode(Configuration through page wizard) and OpenAPI mode.

Scheduling Configuration: Support Wizard Mode(Configuration through page wizard) and OpenAPI mode

Job Type: Support Periodic Incremental Synchronization and Periodic Full Synchronization

Scheduling: Built-in simple periodic scheduling capability, complex capabilities such as task dependencies are supported by third-party scheduling systems.

Data Source: RMDB, Message Queue and Big data storage(Hive,StarRocks,Iceberg etc.)

Data Sink: RMDB, Message Queue and Big data storage(Hive,StarRocks,Iceberg etc.)

Compute Engine: Flink

Offline Job Operation and Maintenance: Job start,stop and running status monitoring

Special Handling: Dirty Data Processing Capability

Data Flow Architecture

image

  1. The user creates an offline synchronization task.
  2. The manager saves task information and scheduling information in the DB.
  3. After task approval, the offline synchronization task information is encapsulated.
  4. Register scheduling information with the scheduling system; InLong has a built-in simple scheduling solution (Quartz), while complete scheduling capabilities rely on third-party scheduling systems (DolphinScheduler, US, etc.).
  5. The scheduling system regularly generates scheduling instances.
  6. For the initial run, the manager constructs a Flink batch job.
  7. Submit the Flink batch job to the Flink cluster.

Task list

new dev branch

Since this is a big feature for InLong, so, create a new branch for development, and after development and testing are completed, merge it back to master.

  • [x] create new dev branch https://github.com/apache/inlong/tree/dev-offline-sync

Manager

Offline Synchronization Task Management: Definition and Management of Offline Synchronization Tasks

  • [x] #9781
  • [x] #9813
    • [ ] Page wizard mode
    • [x] OpenAPI mode(already covered by existed OpenAPI)
  • [x] #9862
  • [x] #10353

Scheduling Management: Scheduling task definition, scheduling instance definition, scheduling task management (CRUD)

  • [x] #10247
    • [ ] Page wizard mode
    • [x] OpenAPI mode
  • [x] #10437
  • [x] #10360
  • [x] #10566
  • [ ] Support for periodic scheduling capability
    • [ ] Scheduling instance definition
    • [ ] #10395
    • [ ] #10561
    • [ ] Plugin-based scheduling framework support
      • [x] #10396
      • [x] #10514
      • [ ] DolphinScheduler, US, etc.

Offline Task Submission

  • [x] #10459

Offline Task Operation and Maintenance

  • [ ] Start (task submission), stop
  • [ ] Retrieve running status
  • [ ] Task logs, exceptions

Sort

Flink Task Encapsulation: Add support for Flink environment in batch mode

Flink Batch Capability Support

  • [x] #10054
  • [x] #10053
  • [x] #10055
  • [x] #10069
  • [x] #10056
  • [x] #10057
  • [x] #10091
  • [x] #10560

InLong Component

Other for not specified component

Are you willing to submit PR?

  • [X] Yes, I am willing to submit a PR!

Code of Conduct

aloyszhang avatar Mar 05 '24 10:03 aloyszhang