Sync WAL during db Close()
Context/Summary: The crash test below found that we don't sync the WAL upon DB close, which can lead to unsynced data loss. This PR syncs the WAL during Close().
./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100
Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound:
Verification failed :(
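For context, here is a minimal standalone sketch of the failure mode (not code from this PR; the DB path and keys are made up): a write acknowledged with WriteOptions::sync == false is only durable once the WAL is synced, so a crash right after a Close() that skips the sync can lose it.

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/wal_close_example", &db);
  assert(s.ok());

  rocksdb::WriteOptions wo;
  wo.sync = false;  // acknowledged once it reaches the OS page cache
  s = db->Put(wo, "key", "value");
  assert(s.ok());

  // Before this fix, Close() released the WAL without syncing it, so the
  // write above could still be sitting only in OS buffers here; a power
  // loss at this point could drop it even though Put() and Close() both
  // returned OK.
  s = db->Close();
  assert(s.ok());
  delete db;
  return 0;
}
```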
Test:
- New UT
- The same stress test command failed before this fix but passes after it
- CI
Can we sync the WAL during DB close?
Good point - updated.
Working on a UT
Looking into the ASAN test failure. The other one seems flaky.
@hx235 merged this pull request in facebook/rocksdb@d7b938882e4e81892721d5787c446348ffb782c8.
Sorry for missing the earlier discussion. I have some issues with this change:
- Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures.
- As such, whether to wait for sync should be a close-time option. The meaning of Close() should be apparent from local information, not from far-off info/state.
- "
avoid_sync_during_shutdown
. If set true, it will sync WALs" Release note typo - avoid_xxx option names are awkward if xxx also makes sense as an option name. Especially so if the default is true, which is considered the unnatural default for boolean options.
- The unit test seems to rely on a bug in FaultInjectionTestFS. It does not explicitly invoke any features of it, and the test fails with a normal env. I believe an intended fault_fs->DropUnsyncedFileData() call is missing; see the sketch below.
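Roughly, the kind of test flow I'd expect (a sketch, not the PR's actual unit test, assuming a DBTestBase-style fixture such as DBWALTest and the usual FaultInjectionTestFS helpers from RocksDB's test utilities):

```cpp
TEST_F(DBWALTest, SyncWALOnCloseSketch) {
  auto fault_fs =
      std::make_shared<FaultInjectionTestFS>(FileSystem::Default());
  std::unique_ptr<Env> fault_env(
      new CompositeEnvWrapper(Env::Default(), fault_fs));
  Options options = CurrentOptions();
  options.env = fault_env.get();
  Reopen(options);

  // Unsynced write: it lives only in the WAL and the OS page cache.
  ASSERT_OK(Put("k", "v"));
  ASSERT_OK(db_->Close());

  // Simulate a crash after Close(): everything that was never synced is
  // dropped. Without this call, nothing distinguishes synced from unsynced
  // data, which is the concern raised in the bullet above.
  ASSERT_OK(fault_fs->DropUnsyncedFileData());

  // With Close() syncing the WAL, the write must survive the "crash".
  Reopen(options);
  ASSERT_EQ("v", Get("k"));
  Close();
}
```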
- Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures.
We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence, at least for disableWAL writes. I know they weren't getting that persistence when memtable flush failed, but I think there's a difference between not getting that persistence in a scenario most users probably never encountered, and never getting that persistence.
It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.
@pdillinger @tedyu @ajkr After more discussion, we decided to revert the related changes https://github.com/facebook/rocksdb/commit/733150f6aaf52c8dcedb843fc329b472bba4ffd7 and https://github.com/facebook/rocksdb/pull/12556.
We will move the decision of whether to sync the WAL into the internals by adding another Close() function that users can call. Eventually we want to move avoid_flush_during_shutdown to that new Close() as well and have no blocking operations in CloseImpl()/the DB destructor.
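For illustration, the new entry point could look something like the following sketch. CloseOptions and CloseDB are placeholder names, and it is written here as a user-side wrapper over today's public API rather than the final design:

```cpp
#include "rocksdb/db.h"

// Hypothetical close-time knob: the caller decides whether to pay for a
// blocking WAL sync, so the destructor and CloseImpl() stay non-blocking.
struct CloseOptions {
  bool sync_wal = true;  // persist unsynced WAL data before closing
};

rocksdb::Status CloseDB(rocksdb::DB* db, const CloseOptions& opts) {
  if (opts.sync_wal) {
    // Blocking, but only because the caller explicitly asked for it.
    rocksdb::Status s = db->SyncWAL();
    if (!s.ok()) {
      return s;
    }
  }
  return db->Close();
}
```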
- Destructors should involve blocking operations as minimally as possible
We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence
Yeah, that's probably a major version with advance notice kind of a change.
It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.
Also, in debug builds, ~DB without a prior Close() could assert that there's an uncaught exception (https://en.cppreference.com/w/cpp/error/uncaught_exception). This shouldn't have any false positives (when there was no opportunity to call Close() manually, an exception is in flight and the assert stays quiet) and should have minimal false negatives (e.g. your whole program is in a destructor triggered from an exception).
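A minimal sketch of that idea, using a stand-in class rather than rocksdb::DB:

```cpp
#include <cassert>
#include <exception>

// The destructor tolerates a missing Close() only while an exception is
// propagating (C++17 std::uncaught_exceptions()).
class DBLike {
 public:
  void Close() { closed_ = true; }

  ~DBLike() {
    // Either Close() was called, or we are being destroyed during stack
    // unwinding and there was no opportunity to call it.
    assert(closed_ || std::uncaught_exceptions() > 0);
  }

 private:
  bool closed_ = false;
};
```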
@ajkr @hx235
It would be nice if there's some programmatic way to encourage Close() to be called.
It goes waaay beyond "encouraging" but we can make the destructor private (and add necessary classes as friends if any). This might quite possibly be unrealistic though in terms of effort needed to upgrade all clients.
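For illustration only, a sketch of what the private-destructor approach could look like with a stand-in class (not a proposal for the actual rocksdb::DB interface):

```cpp
#include <memory>

// With a private destructor, `delete db;` no longer compiles in client
// code, so every destruction path also goes through Close().
class DBHandle {
 public:
  // The nested deleter can reach the private destructor.
  struct Closer {
    void operator()(DBHandle* h) const {
      h->Close();
      delete h;
    }
  };
  using Ptr = std::unique_ptr<DBHandle, Closer>;

  static Ptr Open() { return Ptr(new DBHandle()); }
  void Close() { /* sync WAL, flush memtables, release files, ... */ }

 private:
  ~DBHandle() = default;
};
```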
Should we at least update the wiki / home page instructions to encourage Close() to be called for shutdown?
https://github.com/facebook/rocksdb/wiki/Basic-Operations#closing-a-database
and
https://github.com/facebook/rocksdb/blob/main/docs/_docs/getting-started.md