Support for Incremental Backup and Restore

Open huanghaoyuanhhy opened this issue 5 months ago • 6 comments

Background

We are considering adding an incremental backup feature to Milvus Backup.

A key design question is whether we should support:

  1. Point-in-time recovery (PITR) — restore to any arbitrary timestamp within the backup window.
  2. Restore-to-backup-time — restore only to the specific timestamp(s) of available backups.

Since Milvus is not a transaction processing (TP) database, the demand for PITR may be lower than in OLTP systems. Many Milvus use cases might only require restoring to the exact backup creation time, which would simplify both backup and restore logic.

The choice between these approaches will affect the implementation strategy:

  • PITR typically requires continuous backups or binlog replay.
  • Restore-to-backup-time can be achieved with periodic incremental snapshot backups.
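To make the difference between the two options concrete, here is a minimal sketch of how a restore target might be resolved under each model. It is an illustration only, not the Milvus Backup implementation: the function names, the integer timestamps, and the idea of a single continuous log range `[log_start, log_end]` are all assumptions made for this example.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def restore_to_backup_time(backup_times: Sequence[int], target: int) -> Optional[int]:
    """Option 2: return the latest backup timestamp at or before `target`.

    Returns None when no backup is old enough. (Hypothetical helper.)
    """
    times = sorted(backup_times)
    i = bisect_right(times, target)
    return times[i - 1] if i else None


def pitr_plan(
    backup_times: Sequence[int], log_start: int, log_end: int, target: int
) -> Optional[Tuple[int, int]]:
    """Option 1: return (base_backup, replay_until) for an arbitrary `target`.

    Assumes change logs were captured without gaps over [log_start, log_end].
    Returns None when the target lies outside that window, or when no base
    backup inside the window precedes it. (Hypothetical helper.)
    """
    if not (log_start <= target <= log_end):
        return None
    base = restore_to_backup_time(backup_times, target)
    if base is None or base < log_start:
        return None
    return (base, target)
```

With backups taken at 100, 200, and 300, a request for time 250 resolves to the backup at 200 under option 2, while under option 1 it resolves to "restore 200, then replay logs up to 250".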

Key Questions for Discussion

  • Do we need arbitrary point-in-time recovery for Milvus workloads, or is restoring to a backup’s creation time sufficient?

Looking forward to feedback from the community.

huanghaoyuanhhy • Aug 11 '25 08:08

@huanghaoyuanhhy How do you define a backup window? Are you referring to the time between two consecutive backup jobs?

Andy6132024 • Aug 15 '25 09:08

@Andy6132024 If you are referring to PITR (point-in-time recovery), then the backup tool needs to run continuously, since changes must be captured without gaps. In this case, there isn’t really a concept of “between two backup jobs.” Instead, the backup window refers to the entire period during which the backup tool is running and collecting logs.

huanghaoyuanhhy • Aug 18 '25 03:08

Point-in-time recovery (PITR) is preferred, provided it does not introduce significant resource overhead, degrade system performance, or interfere with other critical functions.

ayushk5-ai • Aug 18 '25 08:08

Thanks for explaining it. In that case, PITR sounds like the better choice. But if it's too complex to implement, we are good with the second option as well.

Andy6132024 • Aug 20 '25 08:08

@Andy6132024 @ayushk5-ai

Thanks for sharing your preference. With the current design, PITR would introduce very high costs, both in performance overhead (the backup tool would need to process traffic similar to a datanode) and in implementation complexity (e.g., handling imports, ensuring consistency across logs and snapshots, and covering corner cases).

Our current inclination is to adopt snapshot-based incremental backups. This approach already satisfies the majority of backup and restore needs, especially considering that Milvus is designed for approximate search rather than strict transactional guarantees. Snapshot-based backups are also simpler to implement, easier to test, lighter on system resources, and help reduce storage usage compared to full backups.

This is also the reason why we opened this issue for discussion — we want to better understand if there are strong real-world scenarios where the additional cost of PITR would be justified.
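As a rough illustration of the snapshot-based incremental idea described above, the sketch below models each snapshot as a set of immutable file paths, so that an incremental backup only needs to record what was added or removed since the previous snapshot. This is an assumption made for the example, not the actual Milvus Backup design; the type alias, function names, and file names are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Assumption: a snapshot is fully described by the set of immutable files it references.
Snapshot = FrozenSet[str]
Delta = Dict[str, FrozenSet[str]]


def incremental_delta(previous: Snapshot, current: Snapshot) -> Delta:
    """Record only the files added or removed since the previous snapshot."""
    return {"added": current - previous, "removed": previous - current}


def restore(full: Snapshot, deltas: List[Delta]) -> Snapshot:
    """Rebuild a later snapshot by applying deltas, in order, to a full backup."""
    state = set(full)
    for delta in deltas:
        state -= delta["removed"]
        state |= delta["added"]
    return frozenset(state)


# Hypothetical file names; "seg_b" disappears in s2, e.g. after a compaction.
s0 = frozenset({"seg_a", "seg_b"})
s1 = frozenset({"seg_a", "seg_b", "seg_c"})
s2 = frozenset({"seg_a", "seg_c", "seg_d"})

d1 = incremental_delta(s0, s1)
d2 = incremental_delta(s1, s2)
```

In this toy model, three full backups would copy 2 + 3 + 3 = 8 files, while one full backup plus two incrementals copies 2 + 1 + 1 = 4. Restores are only possible to the times of `s0`, `s1`, or `s2`, which is exactly the restore-to-backup-time trade-off.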

huanghaoyuanhhy • Aug 28 '25 08:08

Thanks. Is there a timeline for implementation?

Andy6132024 • Sep 02 '25 07:09