opentelemetry-collector-contrib
opentelemetry-collector-contrib copied to clipboard
[extension/storage/filestorage] Default fsync to true
Description:
See #31200, setting fsync to false by default is dangerous and will lead to data loss.
Link to tracking Issue: #31200
I'm in support of this change.
If we do make the change, we'll need to roll it out with a feature gate.
I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?
I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?
I think the benefit is to give users warning that the default will change, but maybe it's just more hassle then it's worth.
I think we should clearly list it as a breaking change in the changelog, but I don't think we need to feature flag it. It won't break anything, just potentially slow things down slightly.
I'd like to see some benchmarks before we merge this. The storage extension is used by the persistent queue in core, which writes each request to the queue. Enabling or disabling fsync - which in this case only protects against hardware failure and similar events - should be a decision the user makes consciously, as it has potentially major performance implications depending on the storage medium in question.
@swiatekm-sumo you can see some benchmarks on the original file storage PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/3087
@swiatekm-sumo you can see some benchmarks on the original file storage PR: #3087
Based on this comment, we observed a 3x difference in performance.
We had a longer discussion about the risks of this setting on that thread as well. Based on rereading, it's not clear to me that this change should happen. Perhaps we should consider a third option which is to add more detailed documentation about the parameter and remove the default value, forcing users to make a decision about it.
I will say that in the absence of a clear consensus I think we need to default to no breaking change.
Part of the problem is that the value of this setting should likely depend on what the extension is used for. The persistent queue use case is performance-sensitive, but use cases where we just periodically store some relatively small state, like in filelog receiver, would probably be better off with fsync on.
There's also some effort to make the persistent queue less taxing on the storage, so once that lands we could revisit whether enabling fsync remains painful there.
It's a shame that bbolt panics on corruption instead of allowing a graceful way to truncate the database
We could also consider other storage engines. Our API is very simple and maps readily to any embedded transactional key-value store. I once ran some benchmarks across a variety of them and didn't need to do much work to replace bbolt.
Bbbolt also has some other properties which are undesirable from our perspective, like not dealing with a full disk very gracefully.
@swiatekm-sumo, I'd be interested to see your findings on other storage engines. I seem to recall writing up a comparison vs other options when first implementing the extension but can't seem to find it now.
@djaglowski I'll see if I can dig it up. Even if I can't, I don't think it'll be difficult to reproduce.
It was pointed out in SIG today that the performance impact of enabling fsync was original stated as "3 orders of magnitude", rather than the 3x I stated above.
In any case, we agreed that given the tradeoffs here, we should serious consider switching away from bbolt. I suggest we should wait until @swiatekm-sumo has a chance to look for his previous analysis. Otherwise, if anyone has potential alternatives in mind, please feel free to call them out.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.