delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

test: add test for concurrent checkpoint during table load

Open alexwilcoxson-rel opened this issue 1 year ago • 8 comments

Description

This draft PR adds a unit test to demonstrate a concurrency issue with checkpoints and an object store with slow list_with_offset, e.g. Azure when the log directory is large

This test outputs the following (with some comments for explanation)

### Initial table load, one checkpoint exists at version 2

Some("kernel::snapshot::tests::test_concurrent_checkpoint") - cp: CheckpointMetadata {
    version: 2,
    size: 1,
    parts: None,
    size_in_bytes: None,
    num_of_add_files: None,
    checkpoint_schema: None,
    checksum: None,
}, checkpoint_files: [
    ObjectMeta {
        location: Path {
            raw: "_delta_log/00000000000000000002.checkpoint.parquet",
        },
        last_modified: 2024-01-31T04:47:54.160159326Z,
        size: 12712,
        e_tag: Some(
            "3d46c38-6103694fd241f-31a8",
        ),
        version: None,
    },
]

### tokio task loading table with slow list

Some("tokio-runtime-worker") - getting: Path { raw: "_delta_log/_last_checkpoint" }
Some("tokio-runtime-worker") - slow list: Some(Path { raw: "_delta_log" })

### main thread loading table during checkpoint

Some("kernel::snapshot::tests::test_concurrent_checkpoint") - cp: CheckpointMetadata {
    version: 2,
    size: 1,
    parts: None,
    size_in_bytes: None,
    num_of_add_files: None,
    checkpoint_schema: None,
    checksum: None,
}, checkpoint_files: [
    ObjectMeta {
        location: Path {
            raw: "_delta_log/00000000000000000002.checkpoint.parquet",
        },
        last_modified: 2024-01-31T04:47:54.160159326Z,
        size: 12712,
        e_tag: Some(
            "3d46c38-6103694fd241f-31a8",
        ),
        version: None,
    },
]

### checkpoint created on main thread

Some("kernel::snapshot::tests::test_concurrent_checkpoint") - checkpoint created

### table load complete on tokio thread, the checkpoint in memory is version 2, but after the slow list the checkpoint file list contains checkpoint 2 and 3, which causes the panic

Some("tokio-runtime-worker") - cp: CheckpointMetadata {
    version: 2,
    size: 1,
    parts: None,
    size_in_bytes: None,
    num_of_add_files: None,
    checkpoint_schema: None,
    checksum: None,
}, checkpoint_files: [
    ObjectMeta {
        location: Path {
            raw: "_delta_log/00000000000000000003.checkpoint.parquet",
        },
        last_modified: 2024-01-31T16:22:45.874545313Z,
        size: 21472,
        e_tag: Some(
            "3d46c3a-610404a023771-53e0",
        ),
        version: None,
    },
    ObjectMeta {
        location: Path {
            raw: "_delta_log/00000000000000000002.checkpoint.parquet",
        },
        last_modified: 2024-01-31T04:47:54.160159326Z,
        size: 12712,
        e_tag: Some(
            "3d46c38-6103694fd241f-31a8",
        ),
        version: None,
    },
]


thread 'tokio-runtime-worker' panicked at crates/core/src/kernel/snapshot/log_segment.rs:462:5:
assertion `left == right` failed
  left: 2
 right: 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: JoinError::Panic(Id(17), ...)
test kernel::snapshot::tests::test_concurrent_checkpoint ... FAILED

alexwilcoxson-rel avatar Jan 31 '24 16:01 alexwilcoxson-rel

Thanks for the reproduction case!

rtyler avatar Jan 31 '24 18:01 rtyler

@alexwilcoxson-rel thanks for taking the time to create a reproduction case!

@roeap IIRC you two were chatting in Slack about this, do you think you'll take a look at this or should I add it to my backlog :smile:

rtyler avatar Feb 13 '24 16:02 rtyler

FWIW what we have done so far is just propagate up an actual Err (within our fork) and have retries for this case in our concurrent writers. When it does come up it is typically resolved within one try.

We're close to reducing though to one committer per table where is should be a non issue for us.

alexwilcoxson-rel avatar Feb 13 '24 17:02 alexwilcoxson-rel

@alexwilcoxson-rel can you check against main if this is still a problem?

ion-elgreco avatar Mar 25 '24 07:03 ion-elgreco

@ion-elgreco it does not appear to be an issue after pulling in latest main? do we have an idea of what changed? I notice the assert_eq! and TODO to throw a proper error is still in the code.

edit: perhaps #2270 ?

alexwilcoxson-rel avatar Apr 01 '24 18:04 alexwilcoxson-rel

@ion-elgreco it does not appear to be an issue after pulling in latest main? do we have an idea of what changed? I notice the assert_eq! and TODO to throw a proper error is still in the code.

edit: perhaps #2270 ?

Yeah we are just skipping checkpoint files now if they are higher than the _latest_version

ion-elgreco avatar Apr 01 '24 18:04 ion-elgreco

@rtyler @ion-elgreco given this is resolved and the reproduction test is passing do we want to keep the test?

alexwilcoxson-rel avatar May 09 '24 15:05 alexwilcoxson-rel

Personally, I always vote yes on more meaningful tests :).

roeap avatar May 09 '24 16:05 roeap