[maybe BUG] 4.57.7: replication may not always be safe
As per #709, I had millions of chunks moved to a dedicated chunkserver (on a stable, reliable server with redundant storage). During replication, for an unrelated reason, I performed a normal reboot of the machine hosting the destination chunkserver. After the reboot, one missing chunk appeared, from an (unimportant) dataset that was in transit to the rebooted machine. The chunkserver is configured with HDD_FSYNC_BEFORE_CLOSE = 1, so no chunks should have been lost due to a normal reboot.
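For reference, the relevant setting looks like this; the config path below is the common default and may differ on other installs:

```
# mfschunkserver.cfg (commonly /etc/mfs/mfschunkserver.cfg)
# fsync every chunk file before closing it, so a clean reboot
# should not lose freshly written chunk data
HDD_FSYNC_BEFORE_CLOSE = 1
```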
Perhaps replication is not perfectly atomic/transactional, since apparently a chunk that is being replicated can be lost like this?
P.S. The replication followed a goal reduction from two replicas to one.
A second missing chunk has appeared; both have "type of missing chunk" == "Invalid copies".
Another missing chunk appeared during replication, making 3 in total. Is this because the number of copies/replicas was reduced from 2 to 1 before the relocation of data? If so, moving the only remaining copy could be quite fragile. I doubt a chunk would have gone missing if replication first ensured compliance with the destination goal and only then removed the redundant overgoal copies.
P.S. Confirmed by mfsfileinfo: the number of replicas was indeed reduced before correct data placement according to the strict storage class definition was ensured. Therefore it is a bug, IMHO.
P.P.S. The data was migrating from less to more reliable storage, so all errors happened at the source. At least two chunks were lost due to minor read errors from an ageing HDD.
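For anyone who wants to repeat the placement check mentioned above: per-file chunk locations and copy counts can be listed with mfsfileinfo (the mount point and path here are just placeholders):

```
# show, for every chunk of the file, which chunkservers hold a copy
mfsfileinfo /mnt/mfs/path/to/affected/file
```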
So...
A properly shut down chunkserver process (not killed, no hardware failure) always finishes all pending write operations before it exits. If you have HDD_FSYNC_BEFORE_CLOSE on, chunks will be fsynced before being closed. Information about a successful replication is sent by the chunkserver to the master only when the whole chunk has been replicated. If this information is not sent, the master does not instruct the other chunkserver to delete the "old" copy.
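To be concrete, by "properly shut down" I mean a clean stop of the daemon, for example (the systemd unit name depends on your packaging):

```
# clean stop: pending chunk writes are finished (and fsynced, with
# HDD_FSYNC_BEFORE_CLOSE = 1) before the process exits
mfschunkserver stop
# or, on systemd-based installs (unit name may differ):
systemctl stop moosefs-chunkserver
```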
That means replicating a chunk with only one copy is perfectly safe, as long as there are absolutely no hardware issues on either machine holding a copy of that chunk ("old" or "new").
But yes, if the "old" copy has deteriorated and the chunkserver hasn't yet got around to it in its scrubbing routine, then when the system tries to replicate it, it will realise that the only available copy is broken, and you will be left with an invalid chunk and a missing file.
In your case, some of your two-copy chunks had deteriorated and you were unlucky in that the master randomly decided to delete the "good" copy (or maybe both copies had deteriorated? who knows?).
I know that, at first glance, it looks like a "bug" that MooseFS doesn't check which copy is the better one to keep when moving to this new 1-goal scenario.
The thing is, MooseFS doesn't work like that. These are 2 separate issues for MooseFS:
- problem 1: chunk XXX is overgoal
- problem 2: chunk XXX is on a wrong label
These "issues" (like other issues with chunks: endangered, undergoal) have priorities. It so happens that overgoal has a higher priority than wrong label.
The algorithm will take a chunk, look at it and ask: what is potentially wrong with this chunk? Of all the issues it "sees" with the chunk, it will take the one with the highest priority and try to solve it. If, for some reason, it can't be solved (right now or at all), it will move on to the issue with the next-highest priority. And so on.
In your case, the overgoal issue will be solved first most of the time, but not always: if deletion limits are reached on some chunkservers, then some chunks may first be moved to the correct label, i.e. the lower-priority issue gets solved because the higher-priority one is temporarily unsolvable. So even swapping the priorities would not guarantee that all chunks get replicated before their overgoal copies are deleted.
And we have to try to perform lower-priority jobs when a higher-priority one is impossible to perform; otherwise, if a high-priority job got stuck for some reason, it would block all other jobs across MooseFS.
Historically, I think, wrong label had a higher priority than overgoal, but then lots of users ended up in a situation where their disks were full and they wanted to downgrade some data from, let's say, 3 copies to 2, yet because some of the data was also on wrong labels, the deletions were not performed (or were performed very slowly, because MooseFS stubbornly tried to replicate first). We changed this and these kinds of issues stopped being reported. So this priority choice is not random.
We absolutely do not recommend keeping chunks in one copy, in any scenario. We have recurring discussions about banning this storage level from MooseFS and always starting with 2 copies. But people always say "we need 1 copy for temporary data". OK, use it, but only for temporary data, meaning data that you will absolutely never cry over when it's lost: temporary files for a calculation that has to be restarted anyway if it's interrupted, cache files, stuff like that. But never for any even remotely important data.
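If you do use a single copy for scratch data, it is worth confining it to a dedicated subtree; a minimal sketch with placeholder paths, using the classic goal tools (storage classes can be scoped the same way):

```
# keep 1-copy data strictly under a scratch directory (example path)
mfssetgoal -r 1 /mnt/mfs/scratch
# verify what is actually set
mfsgetgoal /mnt/mfs/scratch
```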
Apart from data safety/reliability, another reason to postpone the goal reduction until the relocation of data required by the storage class change has completed is performance: replicating from two chunkservers is probably faster than from one.
I trust that you know best which technical approach can ensure that. But please, let's spare readers the lecture about the low safety of a single data copy. People should know the risk, and frankly one data copy on mirrored RAID, RAID-6, ZFS with redundancy, etc. is not that bad. Nothing warrants removing the configuration flexibility that allows a single data copy.
As for this particular incident, I have come to think that the reboot probably did not cause the loss of the chunk. The reboot appears merely to have coincided with a minor read error on the chunkserver holding the only remaining data copy. That error was followed by two more like it, all on the same HDD.
I would dispute the "not-a-bug" verdict with another argument: if a STRICT storage class means anything, it should mean that data placement is honoured with priority. Isn't that what the STRICT storage class definition is about? That's why I insist that for STRICT storage classes correct data placement should be ensured first, followed by compliance with the number of replicas. In other words, the data should be in the right place first, before any clean-sweep of redundant copies.
The non-randomness of the priority choice is not the point of my critique. It stands to reason that with normal storage classes deletion before replication might be valid, while for strict storage classes deletion after replication should be expected.
Thanks.
Is this issue what the recent blog entry is based on?
Would it have been difficult to mention a link to the blog post??
I'm guessing it is this one: https://moosefs.com/blog/when-replication-isnt-atomic-lessons-from-moosefs-chunk-migration/
I see you were able to find it quite easily. See, that wasn't so hard, was it??
Listen, @edrock200:
- "Sorry", "duly noted" or "I'll remember to do so" would have been a more appropriate response when your neglect to cite the source was pointed out.
- Your calculation of my effort may be inaccurate. Regardless, even if I wasted only a little time discovering content that you should have mentioned, time was still wasted.
- All readers would have to waste their time on a redundant search for context that you should have cited.
- For future readers it will become increasingly difficult to find the unmentioned blog post, because it will no longer be so "recent"; the more time passes, the more effort discovery will take.
Listen @onlyjob: first, I didn't "calculate" anything. I'm not part of the MFS team, nor did I write the blog. Second, the MFS team are saints for dealing with you, your horrific attitude, and nonsensical posts about your poor setup that doesn't follow IT 101 principles. Regardless, since you seem determined to add unhelpful content, I won't engage further. Good day to you, sir.
Indeed, the MooseFS team is amazing, honestly. However, the unwarranted hostility from @edrock200 is completely inappropriate. "Poor setup that doesn't follow IT 101 principles" is nonsense. Nothing in this issue was a mistake; rather, it was an experiment to test MooseFS behaviour, develop better understanding, expose potential bugs, and discuss the experience.
> We absolutely do not recommend keeping chunks in one copy, in any scenario. We have recurring discussions about banning this storage level from MooseFS and always starting with 2 copies.
This is standard practice on all distributed filesystems. 1 copy is simply asking for trouble. Data loss is almost guaranteed by a lot of factors, including bad hardware. In the Ceph community you have to manually set a flag with a gigantic warning (something like "--yes-i-really-mean-it-let-me-do-it") if you try to create a single replica pool. A giant flag like that could be added to MooseFS in case the user requires it.
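For the record, in recent Ceph releases the guard rail looks roughly like this (the pool name is a placeholder; exact flags depend on the Ceph version):

```
# Ceph (Octopus and later) refuses a size-1 pool unless you opt in explicitly
ceph config set global mon_allow_pool_size_one true
ceph osd pool set scratchpool size 1 --yes-i-really-mean-it
```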
One copy of disposable data on reliable storage is not "asking for trouble" but a calculated risk -- a perfectly valid option in some circumstances, when the admin knows what they are doing. Let's not digress please, @JoaGamo -- this issue is about the safety of replication and the priority of goal reduction versus data placement/relocation.
P.S. Ceph is rubbish. Yours truly is a Ceph refugee. I would probably have never discovered MooseFS if Ceph was any good.
I'm also a Ceph refugee; that's why I'm participating here. The lack of proper fsync() for data safety is a problem in MooseFS, I agree. It was discussed years ago in #115 and is still open, awaiting a PR.