opentelemetry-collector-contrib icon indicating copy to clipboard operation
opentelemetry-collector-contrib copied to clipboard

[extension/storage/filestorage] Default fsync to true

Open bouk opened this issue 1 year ago • 2 comments

Description:

See #31200, setting fsync to false by default is dangerous and will lead to data loss.

Link to tracking Issue: #31200

bouk avatar Feb 13 '24 08:02 bouk

I'm in support of this change.

If we do make the change, we'll need to roll it out with a feature gate.

I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?

andrzej-stencel avatar Feb 13 '24 15:02 andrzej-stencel

I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?

I think the benefit is to give users warning that the default will change, but maybe it's just more hassle then it's worth.

djaglowski avatar Feb 13 '24 15:02 djaglowski

I think we should clearly list it as a breaking change in the changelog, but I don't think we need to feature flag it. It won't break anything, just potentially slow things down slightly.

bouk avatar Feb 13 '24 16:02 bouk

I'd like to see some benchmarks before we merge this. The storage extension is used by the persistent queue in core, which writes each request to the queue. Enabling or disabling fsync - which in this case only protects against hardware failure and similar events - should be a decision the user makes consciously, as it has potentially major performance implications depending on the storage medium in question.

swiatekm avatar Feb 16 '24 13:02 swiatekm

@swiatekm-sumo you can see some benchmarks on the original file storage PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/3087

bouk avatar Feb 16 '24 16:02 bouk

@swiatekm-sumo you can see some benchmarks on the original file storage PR: #3087

Based on this comment, we observed a 3x difference in performance.

We had a longer discussion about the risks of this setting on that thread as well. Based on rereading, it's not clear to me that this change should happen. Perhaps we should consider a third option which is to add more detailed documentation about the parameter and remove the default value, forcing users to make a decision about it.

I will say that in the absence of a clear consensus I think we need to default to no breaking change.

djaglowski avatar Feb 16 '24 16:02 djaglowski

Part of the problem is that the value of this setting should likely depend on what the extension is used for. The persistent queue use case is performance-sensitive, but use cases where we just periodically store some relatively small state, like in filelog receiver, would probably be better off with fsync on.

There's also some effort to make the persistent queue less taxing on the storage, so once that lands we could revisit whether enabling fsync remains painful there.

swiatekm avatar Feb 16 '24 16:02 swiatekm

It's a shame that bbolt panics on corruption instead of allowing a graceful way to truncate the database

bouk avatar Feb 16 '24 16:02 bouk

We could also consider other storage engines. Our API is very simple and maps readily to any embedded transactional key-value store. I once ran some benchmarks across a variety of them and didn't need to do much work to replace bbolt.

Bbbolt also has some other properties which are undesirable from our perspective, like not dealing with a full disk very gracefully.

swiatekm avatar Feb 16 '24 17:02 swiatekm

@swiatekm-sumo, I'd be interested to see your findings on other storage engines. I seem to recall writing up a comparison vs other options when first implementing the extension but can't seem to find it now.

djaglowski avatar Feb 16 '24 17:02 djaglowski

@djaglowski I'll see if I can dig it up. Even if I can't, I don't think it'll be difficult to reproduce.

swiatekm avatar Feb 16 '24 17:02 swiatekm

It was pointed out in SIG today that the performance impact of enabling fsync was original stated as "3 orders of magnitude", rather than the 3x I stated above.

In any case, we agreed that given the tradeoffs here, we should serious consider switching away from bbolt. I suggest we should wait until @swiatekm-sumo has a chance to look for his previous analysis. Otherwise, if anyone has potential alternatives in mind, please feel free to call them out.

djaglowski avatar Feb 21 '24 17:02 djaglowski

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Mar 07 '24 05:03 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Mar 21 '24 05:03 github-actions[bot]