Sync WAL during db Close()

hx235 opened this pull request 10 months ago • 20 comments

Context/Summary: The crash test below found that we don't sync the WAL upon DB close, which can lead to unsynced data loss. This PR syncs the WAL during close.

./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100

 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound: 
Verification failed :(
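
To make the failure mode concrete, here is a minimal sketch of the scenario the stress test exercises (a hypothetical standalone program, not the actual db_stress or unit-test code; the path is illustrative):

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/wal_sync_close_demo", &db);
  assert(s.ok());

  rocksdb::WriteOptions wo;
  wo.sync = false;  // the write goes to the WAL, but this Put() does not fsync it
  s = db->Put(wo, "key", "value");
  assert(s.ok());

  // Before this fix, closing the DB did not sync the WAL either, so if the
  // unsynced file data is then lost (which sync_fault_injection=1 simulates),
  // the key is gone after reopen -- exactly the NotFound reported above.
  s = db->Close();
  delete db;
  return s.ok() ? 0 : 1;
}
```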

Test:

  • New UT
  • Same stress test command failed before this fix but passes after it
  • CI

hx235 avatar Apr 17 '24 21:04 hx235

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 17 '24 21:04 facebook-github-bot

Can we sync the WAL during DB close?

ajkr avatar Apr 17 '24 21:04 ajkr

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar Apr 18 '24 08:04 facebook-github-bot

Can we sync the WAL during DB close?

Good point - updated.

hx235 avatar Apr 18 '24 08:04 hx235

Working on a UT

hx235 avatar Apr 24 '24 18:04 hx235

Looking into the ASAN test failure. The other one seems flaky.

hx235 avatar Apr 25 '24 01:04 hx235

@hx235 merged this pull request in facebook/rocksdb@d7b938882e4e81892721d5787c446348ffb782c8.

facebook-github-bot avatar May 21 '24 00:05 facebook-github-bot

Sorry for missing the earlier discussion. I have some issues with this change:

  • Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures (see the sketch after this list).
  • As such, whether to wait for sync should be a close-time option. The meaning of Close() should be apparent from local information, not from far-off info/state.
  • "avoid_sync_during_shutdown. If set true, it will sync WALs" Release note typo
  • avoid_xxx option names are awkward if xxx also makes sense as an option name. Especially so if the default is true, which is considered the unnatural default for boolean options.
  • The unit test seems to rely on a bug in FaultInjectionTestFS. It does not explicitly invoke any features of it and the test fails with a normal env. I believe an intended fault_fs->DropUnsyncedFileData() is missing.
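
A minimal sketch of the usage pattern the first two points argue for (my own illustration using the existing public API, not a proposed change): make Close() the explicit persistence point and keep destruction free of persistence-critical blocking work.

```cpp
#include <cstdio>
#include "rocksdb/db.h"

// Hypothetical helper: close the DB explicitly and surface any final
// sync/flush failure to the caller; the subsequent delete then has nothing
// persistence-critical left to do.
rocksdb::Status CloseAndDestroy(rocksdb::DB* db) {
  rocksdb::Status s = db->Close();  // the only place final failures are visible
  if (!s.ok()) {
    std::fprintf(stderr, "DB::Close() failed: %s\n", s.ToString().c_str());
  }
  delete db;
  return s;
}
```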

pdillinger avatar May 31 '24 21:05 pdillinger

  • Destructors should involve blocking operations as minimally as possible, as they might be run when an exception is thrown, and a sync might be waiting minutes for a network request to timeout. In my opinion the DB destructor shouldn't do any blocking operations to guarantee data persistence any more than you would get from a process crash. Calling Close() is the way to ensure data is persisted, as it's the only way to see any final failures.

We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence, at least for disableWAL writes. I know they weren't getting that persistence when memtable flush failed, but I think there's a difference between not getting that persistence in a scenario most users probably never encountered, and never getting that persistence.

It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.

ajkr avatar Jun 04 '24 18:06 ajkr

@pdillinger @tedyu @ajkr After more discussion, we decided to revert the related changes https://github.com/facebook/rocksdb/commit/733150f6aaf52c8dcedb843fc329b472bba4ffd7 and https://github.com/facebook/rocksdb/pull/12556.

We will move the decision of whether to sync the WAL into the DB internals by adding another Close() function that users can call. Eventually we want to move avoid_flush_during_shutdown into that new Close() as well, and have no blocking operations in CloseImpl()/the DB destructor.
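
Purely to illustrate that direction (hypothetical names; no such API exists in RocksDB today), the idea is roughly a user-invoked close entry point that owns the blocking persistence decisions:

```cpp
#include "rocksdb/db.h"

// Hypothetical options for an explicit, user-called close.
struct CloseOptions {
  bool sync_wal = true;          // the WAL syncing reverted from CloseImpl()/~DB
  bool flush_memtables = false;  // would eventually subsume avoid_flush_during_shutdown
};

// Hypothetical entry point; the existing no-argument DB::Close() and the
// destructor would then do no blocking work.
rocksdb::Status CloseWithOptions(rocksdb::DB* db, const CloseOptions& opts);
```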

hx235 avatar Jun 04 '24 22:06 hx235

  • Destructors should involve blocking operations as minimally as possible

  We could try that. I'd be slightly worried about users of delete db; who were previously getting persistence

Yeah, that's probably a major version with advance notice kind of a change.

It would be nice if there's some programmatic way to encourage Close() to be called. Maybe we can log a warning after recovery from non-Close() shutdown. We just need to make Close() log a marker.

Also, in debug builds, ~DB without a prior Close() could assert that there's an uncaught exception in flight (https://en.cppreference.com/w/cpp/error/uncaught_exception). This shouldn't have any false positives (if an exception is propagating, there was no opportunity to call Close() manually) and should have minimal false negatives (e.g. your whole program runs inside a destructor triggered by an exception).
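
A sketch of that debug-build check (illustrative only; the class and the closed_ member are assumptions, not RocksDB's actual implementation):

```cpp
#include <cassert>
#include <exception>

class DbHandle {  // stand-in for the real DB implementation class
 public:
  void Close() { closed_ = true; /* real close/persistence work here */ }

  ~DbHandle() {
    // Complain about a missing Close() only when we are NOT being destroyed
    // during stack unwinding from an exception, where Close() could not have
    // been called anyway. assert() compiles out in release builds.
    assert(closed_ || std::uncaught_exceptions() > 0);
  }

 private:
  bool closed_ = false;
};
```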

@ajkr @hx235

pdillinger avatar Jun 05 '24 20:06 pdillinger

It would be nice if there's some programmatic way to encourage Close() to be called.

It goes waaay beyond "encouraging" but we can make the destructor private (and add necessary classes as friends if any). This might quite possibly be unrealistic though in terms of effort needed to upgrade all clients.
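
A tiny illustration of the private-destructor idea (hypothetical class; RocksDB's DB has a public destructor today):

```cpp
class Handle {
 public:
  static Handle* Open() { return new Handle(); }

  // The only sanctioned way to dispose of the handle: persist, then self-delete.
  void Close() {
    // ... sync/flush work ...
    delete this;  // allowed here because the destructor is accessible in-class
  }

 private:
  Handle() = default;
  ~Handle() = default;  // `delete handle;` outside the class no longer compiles
};
```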

ltamasi avatar Jun 05 '24 20:06 ltamasi

Should we at least update the wiki / home page instructions to encourage calling Close() for shutdown?

https://github.com/facebook/rocksdb/wiki/Basic-Operations#closing-a-database

and

https://github.com/facebook/rocksdb/blob/main/docs/_docs/getting-started.md

jaykorean avatar Jul 29 '24 17:07 jaykorean