texera Amber Fault Tolerance: Logging

Open shengquan-ni opened this issue 2 years ago • 0 comments

This PR includes a full lifecycle of log-based fault tolerance:

Added LogStorage, an abstraction over where the log is persisted. LocalFS and HDFS backends are currently implemented.
Added AsyncLogWriter, a thread that pushes serialized log records into LogStorage.
Added SerializationManager, which compresses determinants and converts them from their in-memory format to byte arrays. It is driven by the AsyncLogWriter.
Added LogManager, which handles logging for both data and control determinants and manages the lifecycle of AsyncLogWriter and LogStorage.
Added TimeService and OperatorContext. An operator can access the time service through the context in its own logic to obtain timestamps, which the log manager logs. The same context could also hold the pause manager if we want to let an operator pause itself. Right now, due to some Scala-Java incompatibility issues with traits, I cannot put a member variable inside IOperatorExecutor, so this needs a code refactoring.
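
To make the relationship between these components easier to follow, here is a minimal sketch of the write path. It is written in Python only for brevity and is not the Scala code in this PR. The class names mirror the components listed above, but every method name, the in-memory storage backend, and the use of pickle plus zlib for serialization are assumptions made for illustration.

```python
import pickle
import queue
import threading
import zlib
from abc import ABC, abstractmethod


class LogStorage(ABC):
    """Abstraction over where log records are persisted (hypothetical interface)."""

    @abstractmethod
    def append(self, record: bytes) -> None: ...

    @abstractmethod
    def read_all(self) -> list: ...


class InMemoryLogStorage(LogStorage):
    """Stand-in for the LocalFS/HDFS backends; keeps records in a list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_all(self):
        return list(self._records)


def serialize(determinant) -> bytes:
    # Role of SerializationManager: in-memory object -> compressed bytes.
    return zlib.compress(pickle.dumps(determinant))


def deserialize(record: bytes):
    return pickle.loads(zlib.decompress(record))


class AsyncLogWriter(threading.Thread):
    """Background thread that drains a queue and writes to LogStorage."""

    _STOP = object()  # sentinel telling the thread to exit

    def __init__(self, storage):
        super().__init__(daemon=True)
        self._storage = storage
        self._queue = queue.Queue()

    def put(self, determinant):
        self._queue.put(determinant)

    def run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._storage.append(serialize(item))

    def shutdown(self):
        # Everything queued before the sentinel is written before the thread exits.
        self._queue.put(self._STOP)
        self.join()


class LogManager:
    """Owns the writer and storage; entry point for data and control determinants."""

    def __init__(self, storage):
        self._storage = storage
        self._writer = AsyncLogWriter(storage)
        self._writer.start()

    def log_determinant(self, kind, payload):
        self._writer.put((kind, payload))

    def close(self):
        self._writer.shutdown()

    def replay(self):
        return [deserialize(r) for r in self._storage.read_all()]


# Usage: log two determinants, flush, then read them back in order.
manager = LogManager(InMemoryLogStorage())
manager.log_determinant("control", {"command": "pause"})
manager.log_determinant("data", {"sender": "worker-1", "seq": 0})
manager.close()
recovered = manager.replay()
```

The point of the sketch is the division of responsibility: the caller only enqueues determinants, while serialization and persistence happen off the processing thread.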

Current Design: [diagram: Texera Overall Infrastructure, Messaging module]

Proposed Design: [diagram: Texera Overall Infrastructure, Messaging module]

Logging-related TODOs:

~~For better integration on the Python side, the serialization of control messages transforms the Scala control payload into protobuf objects. However, not all control messages can be transformed. This issue needs to be addressed. I'm thinking of letting the Scala side also use protobuf objects directly.~~ To do this, we would need to refactor a lot of the controller's objects to protobuf, which is not our focus right now. So I decided to leave this integration aside for now.

Future TODOs:

Zuozhi and I decided to keep all sent data in the sender's memory until a checkpoint is made. This part will be added in the checkpoint implementation.
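
As a rough illustration of that idea (not part of this PR), a sender-side buffer could look like the sketch below. It is in Python for brevity. The class name, the sequence numbers, the `transport` callback, and the resend step are all hypothetical, and the real checkpoint implementation may differ.

```python
from collections import deque


class SenderBuffer:
    """Keeps every sent message in memory until a checkpoint covers it (hypothetical)."""

    def __init__(self):
        self._next_seq = 0
        self._retained = deque()  # (seq, payload) pairs not yet covered by a checkpoint

    def send(self, payload, transport):
        seq = self._next_seq
        self._next_seq += 1
        self._retained.append((seq, payload))  # retain before handing off
        transport(seq, payload)
        return seq

    def on_checkpoint(self, checkpointed_seq):
        # Messages up to and including checkpointed_seq no longer need to be kept.
        while self._retained and self._retained[0][0] <= checkpointed_seq:
            self._retained.popleft()

    def resend(self, transport):
        # After a failure, replay whatever was sent since the last checkpoint.
        for seq, payload in self._retained:
            transport(seq, payload)


# Usage: send three messages, checkpoint through seq 1, then replay the rest.
delivered = []
buf = SenderBuffer()
for p in ["a", "b", "c"]:
    buf.send(p, lambda s, x: delivered.append((s, x)))
buf.on_checkpoint(1)
replayed = []
buf.resend(lambda s, x: replayed.append((s, x)))
```

The trade-off is memory: the buffer grows with the volume sent between checkpoints, which is why its release is tied to checkpoint completion.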

Jun 28 '22 01:06 shengquan-ni