Sync WAL during db Close()
Context/Summary: The crash test below found that we don't sync the WAL upon DB close, which can lead to unsynced data loss. This PR syncs the WAL during Close().
./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100
Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound:
Verification failed :(
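For context, here is a minimal standalone sketch of the failure mode (not code from this PR; the DB path and keys are made up): a write acknowledged with WriteOptions::sync == false is only durable once the WAL is synced, so a crash right after a Close() that skips the sync can lose it.

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/wal_close_example", &db);
  assert(s.ok());

  rocksdb::WriteOptions wo;
  wo.sync = false;  // acknowledged once it reaches the OS page cache
  s = db->Put(wo, "key", "value");
  assert(s.ok());

  // Before this fix, Close() released the WAL without syncing it, so the
  // write above could still be sitting only in OS buffers here; a power
  // loss at this point could drop it even though Put() and Close() both
  // returned OK.
  s = db->Close();
  assert(s.ok());
  delete db;
  return 0;
}
```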
Test:
- New UT
- The same stress test command failed before this fix but passes after it
- CI
Can we sync the WAL during DB close?
Good point - updated.
Working on a UT
Looking into the ASAN test failure. The other one seems flaky.
@hx235 merged this pull request in facebook/rocksdb@d7b938882e4e81892721d5787c446348ffb782c8.
Sorry for missing the earlier discussion. I have some issues with this change:
- Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures.
- As such, whether to wait for sync should be a close-time option. The meaning of Close() should be apparent from local information, not from far-off info/state.
- "
avoid_sync_during_shutdown
. If set true, it will sync WALs" Release note typo - avoid_xxx option names are awkward if xxx also makes sense as an option name. Especially so if the default is true, which is considered the unnatural default for boolean options.
- The unit test seems to rely on a bug in FaultInjectionTestFS. It does not explicitly invoke any features of it, and the test fails with a normal env. I believe an intended fault_fs->DropUnsyncedFileData() call is missing; see the sketch below.
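Roughly, the kind of test flow I'd expect (a sketch, not the PR's actual unit test, assuming a DBTestBase-style fixture such as DBWALTest and the usual FaultInjectionTestFS helpers from RocksDB's test utilities):

```cpp
TEST_F(DBWALTest, SyncWALOnCloseSketch) {
  auto fault_fs =
      std::make_shared<FaultInjectionTestFS>(FileSystem::Default());
  std::unique_ptr<Env> fault_env(
      new CompositeEnvWrapper(Env::Default(), fault_fs));
  Options options = CurrentOptions();
  options.env = fault_env.get();
  Reopen(options);

  // Unsynced write: it lives only in the WAL and the OS page cache.
  ASSERT_OK(Put("k", "v"));
  ASSERT_OK(db_->Close());

  // Simulate a crash after Close(): everything that was never synced is
  // dropped. Without this call, nothing distinguishes synced from unsynced
  // data, which is the concern raised in the bullet above.
  ASSERT_OK(fault_fs->DropUnsyncedFileData());

  // With Close() syncing the WAL, the write must survive the "crash".
  Reopen(options);
  ASSERT_EQ("v", Get("k"));
  Close();
}
```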
- Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures.
We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence, at least for disableWAL writes. I know they weren't getting that persistence when memtable flush failed, but I think there's a difference between not getting that persistence in a scenario most users probably never encountered, and never getting that persistence.
It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.
@pdillinger @tedyu @ajkr After more discussion, we decided to revert the related changes https://github.com/facebook/rocksdb/commit/733150f6aaf52c8dcedb843fc329b472bba4ffd7 and https://github.com/facebook/rocksdb/pull/12556.
We will move the decision of whether to sync the WAL into the internals by adding another Close() function that users can call. Eventually we want to move avoid_flush_during_shutdown to that new Close() as well and have no blocking operations in CloseImpl()/the DB destructor.
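For illustration, the new entry point could look something like the following sketch. CloseOptions and CloseDB are placeholder names, and it is written here as a user-side wrapper over today's public API rather than the final design:

```cpp
#include "rocksdb/db.h"

// Hypothetical close-time knob: the caller decides whether to pay for a
// blocking WAL sync, so the destructor and CloseImpl() stay non-blocking.
struct CloseOptions {
  bool sync_wal = true;  // persist unsynced WAL data before closing
};

rocksdb::Status CloseDB(rocksdb::DB* db, const CloseOptions& opts) {
  if (opts.sync_wal) {
    // Blocking, but only because the caller explicitly asked for it.
    rocksdb::Status s = db->SyncWAL();
    if (!s.ok()) {
      return s;
    }
  }
  return db->Close();
}
```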
- Destructors should involve blocking operations as minimally as possible
We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence
Yeah, that's probably a major version with advance notice kind of a change.
It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.
Also, in debug builds, ~DB without a prior Close() could assert that there's an uncaught exception (https://en.cppreference.com/w/cpp/error/uncaught_exception). This shouldn't have any false positives (when there was no opportunity to call Close() manually, an exception is in flight and the assert stays quiet) and should have minimal false negatives (e.g. your whole program is in a destructor triggered from an exception).
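A minimal sketch of that idea, using a stand-in class rather than rocksdb::DB:

```cpp
#include <cassert>
#include <exception>

// The destructor tolerates a missing Close() only while an exception is
// propagating (C++17 std::uncaught_exceptions()).
class DBLike {
 public:
  void Close() { closed_ = true; }

  ~DBLike() {
    // Either Close() was called, or we are being destroyed during stack
    // unwinding and there was no opportunity to call it.
    assert(closed_ || std::uncaught_exceptions() > 0);
  }

 private:
  bool closed_ = false;
};
```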
@ajkr @hx235
It would be nice if there's some programmatic way to encourage Close() to be called.
It goes waaay beyond "encouraging" but we can make the destructor private (and add necessary classes as friends if any). This might quite possibly be unrealistic though in terms of effort needed to upgrade all clients.
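For illustration only, a sketch of what the private-destructor approach could look like with a stand-in class (not a proposal for the actual rocksdb::DB interface):

```cpp
#include <memory>

// With a private destructor, `delete db;` no longer compiles in client
// code, so every destruction path also goes through Close().
class DBHandle {
 public:
  // The nested deleter can reach the private destructor.
  struct Closer {
    void operator()(DBHandle* h) const {
      h->Close();
      delete h;
    }
  };
  using Ptr = std::unique_ptr<DBHandle, Closer>;

  static Ptr Open() { return Ptr(new DBHandle()); }
  void Close() { /* sync WAL, flush memtables, release files, ... */ }

 private:
  ~DBHandle() = default;
};
```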
Should we at least update the wiki / home page instructions to encourage Close() to be called for shutdown?
https://github.com/facebook/rocksdb/wiki/Basic-Operations#closing-a-database
and
https://github.com/facebook/rocksdb/blob/main/docs/_docs/getting-started.md