DB::ReplicatedMergeTreeQueue::createLogEntriesToFetchBrokenParts encounters error Code: 84. DB::Exception: Directory already exists and is not empty.
Our ClickHouse server is encountering this error:
2022.09.02 20:46:03.548731 [ 549 ] {7f8a252c-6df9-430e-a46d-21d4e7705bf6} <Warning> ClusterProxy::SelectStreamFactory: Local replica of shard 1 is stale (delay: 1662151563s.)
2022.09.02 20:46:35.876197 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (ReplicatedMergeTreeRestartingThread): Table was in readonly mode. Will try to activate it.
2022.09.02 20:46:35.958054 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958083 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try1 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958097 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try2 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958109 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try3 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958122 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try4 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958135 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try5 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958147 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try6 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958160 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try7 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958173 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try8 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.958185 [ 178 ] {} <Warning> bagdb.pipeline_jobs_lo (6d2e1e13-aea4-4e86-ad2e-1e13aea4be86): Directory covered-by-broken_202209_14216_15635_615_try9 (to detach to) already exists. Will detach to directory with '_tryN' suffix.
2022.09.02 20:46:35.962149 [ 178 ] {} <Error> bagdb.pipeline_jobs_lo (ReplicatedMergeTreeRestartingThread): void DB::ReplicatedMergeTreeRestartingThread::run(): Code: 84. DB::Exception: Directory /var/lib/clickhouse/store/6d2/6d2e1e13-aea4-4e86-ad2e-1e13aea4be86/detached/covered-by-broken_202209_14216_15635_615_try9 already exists and is not empty. (DIRECTORY_ALREADY_EXISTS), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xb8a20fa in /usr/bin/clickhouse
1. DB::localBackup(std::__1::shared_ptr<DB::IDisk> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, std::__1::optional<unsigned long>, bool) @ 0x16ddc380 in /usr/bin/clickhouse
2. DB::IMergeTreeDataPart::makeCloneInDetached(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&) const @ 0x16b1a545 in /usr/bin/clickhouse
3. DB::StorageReplicatedMergeTree::removePartAndEnqueueFetch(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x168e3093 in /usr/bin/clickhouse
4. DB::ReplicatedMergeTreeQueue::createLogEntriesToFetchBrokenParts() @ 0x16d7a24b in /usr/bin/clickhouse
5. DB::ReplicatedMergeTreeRestartingThread::tryStartup() @ 0x16dbb4f0 in /usr/bin/clickhouse
6. DB::ReplicatedMergeTreeRestartingThread::runImpl() @ 0x16db2045 in /usr/bin/clickhouse
7. DB::ReplicatedMergeTreeRestartingThread::run() @ 0x16dafd5e in /usr/bin/clickhouse
8. DB::BackgroundSchedulePoolTaskInfo::execute() @ 0x15569778 in /usr/bin/clickhouse
9. DB::BackgroundSchedulePool::threadFunction() @ 0x1556ca36 in /usr/bin/clickhouse
10. ? @ 0x1556d8ae in /usr/bin/clickhouse
11. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xb94d577 in /usr/bin/clickhouse
12. ? @ 0xb95099d in /usr/bin/clickhouse
13. ? @ 0x7fc6e142a609 in ?
14. __clone @ 0x7fc6e1351293 in ?
(version 22.6.3.35 (official build))
Please advise how to recover from this kind of error.
Related rows in system.detached_parts
database|table |partition_id|name |disk |reason |min_block_number|max_block_number|level|
--------+----------------+------------+---------------------------------------------+-------+-----------------+----------------+----------------+-----+
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try1|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try9|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try2|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try6|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try5|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try4|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try7|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try8|default| | | | |
bagdb |pipeline_jobs_lo| |covered-by-broken_202209_14216_15635_615_try3|default| | | | |
bagdb |pipeline_jobs_lo|202209 |covered-by-broken_202209_14216_15635_615 |default|covered-by-broken| 14216| 15635| 615|
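For reference, the rows above can be listed with a query along these lines (a sketch; the database and table names are the ones from the logs):

SELECT database, table, partition_id, name, disk, reason,
       min_block_number, max_block_number, level
FROM system.detached_parts
WHERE database = 'bagdb' AND table = 'pipeline_jobs_lo'
ORDER BY name;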
I have tried to remove all of the /var/lib/clickhouse/store/6d2/6d2e1e13-aea4-4e86-ad2e-1e13aea4be86/detached/covered-by-broken_202209_14216_15635_615_try* directories, but it doesn't work. The directories are recreated by the server itself.
We also saw this several times (we don't have more details).
The root cause is that replication fails to start multiple times in a row, creating a new detached part each time. After 10 attempts to start replication, it fails with this error. We need to find out why the first 10 attempts were unsuccessful.
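As a starting point for that investigation, a query against system.replication_queue can show which entries keep failing and with what exception. A rough sketch (the exact set of columns may differ between versions):

SELECT node_name, type, new_part_name, num_tries,
       last_exception, last_attempt_time, postpone_reason
FROM system.replication_queue
WHERE database = 'bagdb' AND table = 'pipeline_jobs_lo'
  AND last_exception != ''
ORDER BY num_tries DESC
FORMAT Vertical;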
Before the issue happened, the server node was not responding, so we force-rebooted the node. If you could tell me which logs to collect, I can try. To work around the readonly issue, we can re-create the table, migrate the data to the new table, and drop the original one that failed.
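For the record, a minimal SQL sketch of that rebuild-and-migrate workaround, assuming a hypothetical replacement table bagdb.pipeline_jobs_lo_new that has already been created with the same schema but a fresh ZooKeeper path (all names other than the original table are illustrative):

-- copy the data into the new table
INSERT INTO bagdb.pipeline_jobs_lo_new SELECT * FROM bagdb.pipeline_jobs_lo;
-- (alternatively, copy partition by partition, e.g.
--  ALTER TABLE bagdb.pipeline_jobs_lo_new ATTACH PARTITION 202209 FROM bagdb.pipeline_jobs_lo;)
-- once the data is verified, swap the names and drop the broken table
RENAME TABLE bagdb.pipeline_jobs_lo TO bagdb.pipeline_jobs_lo_broken,
             bagdb.pipeline_jobs_lo_new TO bagdb.pipeline_jobs_lo;
DROP TABLE bagdb.pipeline_jobs_lo_broken;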
BTW, it looks like 21.11 does not have those try1 ... try9 suffixes, and the replica just cannot start, with a message like:
SELECT * FROM system.replicas WHERE is_readonly FORMAT Vertical;
last_queue_update_exception: Code: 84. DB::Exception: Directory /var/lib/clickhouse/data/dbname/table_name/detached/covered-by-broken_a23b5b41d58ecb61543883e9831fb115_0_26725_25001 already exists and is not empty. (DIRECTORY_ALREADY_EXISTS) (version 21.11.5.33 (official build))
If I detach the table, remove the files from detached, and attach the table again, the same thing repeats.
A workaround is to detach the table, manually remove the parts that are being detached, and attach the table back.
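A minimal sketch of those steps for the table in this thread (the store path is the one from the error message above; adjust to your own layout):

DETACH TABLE bagdb.pipeline_jobs_lo;
-- then, on the server's filesystem, remove the leftover detached directories, e.g.:
--   rm -rf /var/lib/clickhouse/store/6d2/6d2e1e13-aea4-4e86-ad2e-1e13aea4be86/detached/covered-by-broken_202209_14216_15635_615*
ATTACH TABLE bagdb.pipeline_jobs_lo;

Detaching the table first matters here: as noted above, removing the directories while the table is attached doesn't help, because the server recreates them.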
Hello @SaltTan! We encountered this issue with version 22.3.3.44 after adding new nodes to the cluster (they share the same ZooKeeper cluster), but the issue only occurs on the new nodes. Could you please advise on how to manually remove the parts that are being detached? Based on what is mentioned in this issue, does it mean we need to run rm store/6d2/6d2e1e13-aea4-4e86-ad2e-1e13aea4be86/detached ?
See here: https://github.com/ClickHouse/ClickHouse/issues/58126#issuecomment-1866588458