Add ZFS.md
Overview
Add documentation on how to run a DA node with ZFS compression.
Summary by CodeRabbit
- New Features
- Introduced a new documentation section for setting up a bridge node with optional on-the-fly compression using ZFS.
- Added a comprehensive guide on configuring a DA node to utilize ZFS compression, including requirements, step-by-step instructions, and command examples for various network environments.
Walkthrough
The documentation has been enhanced with two significant updates. An additional section titled "Optional: enable on-fly compression with ZFS" has been added to the existing how-to-guides/bridge-node.md file. Furthermore, a new guide has been introduced in how-to-guides/zfs.md, detailing the setup of a Data Availability (DA) node with on-the-fly compression using ZFS. This guide includes hardware requirements, installation steps, and advanced tuning options for users.
Changes
| Files | Change Summary |
|---|---|
| how-to-guides/bridge-node.md | New section added: "Optional: enable on-fly compression with ZFS." |
| how-to-guides/zfs.md | New document added providing a comprehensive guide for setting up a DA node with ZFS compression, including requirements and detailed instructions. |
hmm @Wondertan I think we'd probably advise not to include this hack in docs?
@jcstein, mentioning that in the docs may be helpful, especially if storage issues are pressing for node runners, as it's ~ a 2x reduction.
@kinrokinro, for how long have you been running the node with compression on? Do you see it running stable?
For a month, and yes, it's running stable. Our bridge is always in sync and everything looks pretty good to me; that's why I recommend adding this option to the docs.
Anyway, we can always mark it as "not recommended" if there is any concern from your side.
BTW, zstd-3 is a light compression level; the max is zstd-19.
Currently we're running both our bridges, on testnet and mainnet, on ZFS, so our metrics are available in your OTEL.
Thank you for the context and feedback! I think it is safe to recommend this if @Wondertan approves, and then I will work it into the sidebar menu.
I think zstd-3 is optimal: higher zstd levels take more time to compress, so under a bigger load the node could be behind sync all the time.
confirming with @kinrokinro that this uses zstd-3?
According to the doc that @kinrokinro wrote, yes, it uses zstd-3, and with this compression he achieved a 2.05x ratio. I found an interesting discussion regarding ZFS compression levels that gives a good overview: https://www.reddit.com/r/zfs/comments/sxx9p7/a_simple_real_world_zfs_compression_speed_an/
According to it, levels above zstd-10 lead to a crazy slowdown, and the ratio gain isn't big enough to sacrifice performance for. I'm going to try zstd-5 with my Mocha bridge node to see whether there are any performance issues and what compression ratio I get compared to zstd-3; after the full sync I'll probably ask @kinrokinro to share his numbers and compare.
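For anyone else experimenting: switching the level is just a dataset property change, and it only applies to newly written data. A minimal sketch, assuming the $ZFS_POOL_NAME/$ZFS_DATASET_NAME placeholders from the guide:
# try zstd-5 on the dataset (affects newly written data only)
zfs set compression=zstd-5 $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# check the active level and the achieved ratio
zfs get compression,compressratio $ZFS_POOL_NAME/$ZFS_DATASET_NAME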
Yes, we use zstd-3 compression level
@kinrokinro do you want to add this to the menu? or think that it is okay as is, linked in the "optional" section?
This should be optional, of course.
It would be great to add here that the processor is more important than RAM. I tried with 2x Intel Xeon Silver 4108 but it doesn't fly; I had to request a more powerful one. The bridge node just syncs too slowly, and the CPU is always above 90% while RAM (64GB) sits at about 50%.
Perhaps it's because of the more powerful compression level you chose.
Will provide much more info here on the issues I ran into, with charts; the node is still syncing. But after switching from 2x Intel Xeon Silver 4108 to 2x Intel Xeon Gold 6130, my node is flying; I just want to see the final compression ratio with the level we set (zstd-5).
so this https://github.com/celestiaorg/docs/pull/1694/commits/baead03109e63825a544a0cb0d249cb1068bb49b should resolve the issue you were facing @mogoll92 ?
Not completely, unfortunately.
With more data synced, the performance of the DA node has been decreasing. I thought the 2x Intel Xeon Gold 6130 had solved that issue, as it is more powerful and I didn't see a huge load on it, around 50% with +/- 10%. On the screenshot you can see that the node reached 1.5M blocks very fast, but after that it started degrading in performance. I assume it is now because the NVMe disks are on PCIe-3, so the node gets stuck on I/O, which puts load on the processor and the system overall. I should mention that I kept the testnet bridge on the server with the 2x Intel Xeon Silver 4108 and PCIe-3 disks, without ZFS, and that node was syncing and working just fine. I have now rented an AMD EPYC 7313P with PCIe-4 drives and will try that setup.
Adding to the performance degradation, on the screenshot below you can see that the node now progresses by only 20k blocks every 3 hours (sadly). Hope more capable drives will solve it.
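If it helps with diagnosing this, the "waiting on disk" theory is easy to confirm from the iowait column of iostat. A small sketch, assuming a Debian/Ubuntu box where the sysstat package provides iostat:
sudo apt install sysstat   # provides iostat
iostat -x 5                # %iowait and per-device utilization, refreshed every 5 seconds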
Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add it.
Agree. The 4 days have almost passed and the node is still syncing after switching to the new hardware. Less than 1M blocks are left, at a syncing speed of about 40k blocks per 12 hours, though the speed depends on how much data the blocks carry, so it's faster sometimes. Let's see how it goes after reaching 2M blocks.
I'll drop all insights on the issues here once the node is synced and I've confirmed that metrics are being sent correctly. This testing might be worth converting into a forum post for users who decide to run a DA node with ZFS later.
Could you try with zstd-3, not zstd-5?
Yeah, I dropped the idea of syncing with zstd-5, as it requires more of the processor's resources and the most I gained compared to zstd-3 was about 5% in storage savings. So I would not recommend anything higher than zstd-3, or do it at your own risk.
The node is now on zstd-3 compression from the beginning.
(Mocha bridge node) Alright, I did some investigation over the previous month, and here is what I found...
Jumping ahead: the heavy CPU and disk usage happens ONLY while the bridge node is syncing; after that it returns to normal, with only some really small spikes. ZFS uses a certain amount of CPU to process the IOPS, most of which is compression and checksumming. I used the following hardware setup for my node:
- CPU: AMD EPYC 7313P
- RAM: ECC DDR4 128GB 3200MHz
- Disks: 3x Micron 7450 MTFDKBG3T8TFR
The node took about 9 days to get fully synced; here is some of the hardware resource consumption:
Below I would like to list recommendations for starting to use ZFS.
- CPU: From the screenshot above, you can see that the CPU is really crucial, but I would say only during synchronization. Once the node is synced, CPU usage drops to 5-10%, with some spikes up to 100%, which is okay for bridge nodes, even without ZFS. I would recommend selecting the processor carefully for a server and using the one shared above as a starting point. The more powerful the processor, the less time it will take for your node to sync. However, if you have enough time, you can opt for something similar to what I have.
- RAM: Huge RAM isn't necessary here; 64GB DDR4 should be fine for the sync and for running the node afterwards.
- Disk(s): The disk is a bit tricky here, as low throughput could lead to higher iowait and, as a result, increase the CPU load, which is already stressed by ZFS. I would recommend using only NVMe drives with PCIe-4 support (which usually provides acceptable I/O speed). Also, check whether the disk has a 4096-byte (4KB) physical sector size, as that will be important for tuning the ZFS pool correctly. Additionally, the larger the I/O sizes the disk supports (minimum/optimal), the better, though it's not strictly necessary; I've used disks with a 4096-byte minimum/optimal size and it's been fine. I know Samsung disks offer larger I/O sizes, but that's up to your preference. You can check whether your disks have a 4096-byte physical sector size with the following command:
sudo fdisk -l
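If the fdisk output is noisy, the sector sizes can also be read per device; a small sketch, assuming an NVMe drive at /dev/nvme0n1:
# physical and logical sector sizes for every block device
lsblk -o NAME,PHY-SEC,LOG-SEC
# or read it directly from sysfs for a specific drive
cat /sys/block/nvme0n1/queue/physical_block_size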
- ZFS: I've found a couple of things that are important for ZFS pool creation and dataset tuning.
- ashift property. The ashift property determines the block allocation size that ZFS uses per vdev. Ideally this value should match the sector size of the underlying physical device (the sector size being the smallest physical unit that can be read from or written to that device); that's why it's important for the disk to have a 4KB physical sector size, to avoid I/O bottlenecks. So once you are sure your disks have the recommended physical sector size, I recommend setting ashift=12 at pool creation. This property is immutable and can't be changed later. Ex:
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
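Since ashift can't be fixed after the fact, it's worth double-checking it right after creating the pool; a sketch with the same placeholder pool name:
# pool property (12 corresponds to 4096-byte allocations)
zpool get ashift $ZFS_POOL_NAME
# ashift actually recorded on the vdev, from the cached pool config
sudo zdb -C $ZFS_POOL_NAME | grep ashift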
Below I would like to show the I/O wait and disk load with ashift=9 and the corresponding 512-byte disk physical sector size. Yes, an NVMe disk with PCIe-4 support can still have this :)
And here it is with the zpool ashift property configured correctly.
The I/O wait in the first case is terrible: max iowait reaches 30% and the average is almost 3%, which causes additional load on the CPU and results in the node frequently getting stuck.
- zstd compression algorithm. I wouldn't recommend going higher than zstd-3. The higher the compression level, the more CPU is consumed during sync. Testing with zstd-5 showed that the maximum compression ratio difference compared to zstd-3 was only about 5%, which is not worth the performance loss just to save 5% of storage. Therefore the recommendation, same as in the guide, is to use zstd-3.
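If you want a rough, out-of-band feel for how the CPU cost grows with the level before committing a pool to it, the standalone zstd CLI has a benchmark mode. This is only an approximation (ZFS's in-kernel zstd won't have identical timings) and the file path is just a placeholder:
# benchmark levels 3 through 19 on a representative sample file
zstd -b3 -e19 /path/to/sample-data-file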
- recordsize property. ZFS splits files into blocks before writing them to disk; recordsize defines the maximum size of these blocks. By default it's 128KB, but you can adjust it depending on your workload and performance needs. I've set recordsize=256K for the dataset, considering that Celestia DA nodes store a significant amount of data that can benefit from larger blocks. For mainnet, you could even increase the recordsize to 512K, as it handles a much larger volume of data. This property can be set at any time, but it's recommended to set it from the beginning, as changing it later won't affect data that's already stored.
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
It gives me a better compression ratio and seems to reduce the load, as there's no need to split data into 128KB blocks that could instead be stored as 256KB blocks, resulting in fewer I/O operations. The numbers below look a bit odd, since the 256KB bucket shows huge compression, but overall I got ~5% more storage savings compared to a 128KB recordsize (compared with @kinrokinro, who has 1.97x).
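For anyone wanting to reproduce this kind of output: as far as I remember it comes from zdb's block statistics, so treat the exact flag count below as an assumption and expect it to take a while on a large pool:
# block statistics, including the block size histogram
sudo zdb -bbb $ZFS_POOL_NAME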
Block Size Histogram
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 21 10.5K 10.5K 21 10.5K 10.5K 0 0 0
1K: 8.95K 8.95M 8.96M 8.95K 8.95M 8.96M 0 0 0
2K: 8.94K 31.3M 40.3M 8.94K 31.3M 40.3M 0 0 0
4K: 3.77M 15.1G 15.1G 52.6K 210M 251M 572K 2.24G 2.24G
8K: 1.06M 10.3G 25.4G 90.8K 1.37G 1.62G 4.22M 35.4G 37.6G
16K: 1.12M 25.5G 50.9G 621K 9.93G 11.6G 1.18M 26.7G 64.4G
32K: 1.85M 84.5G 135G 262K 15.3G 26.8G 1.86M 84.9G 149G
64K: 15.8M 1.42T 1.55T 452K 30.6G 57.4G 15.8M 1.42T 1.57T
128K: 14.6M 2.74T 4.29T 3.18M 432G 490G 14.6M 2.74T 4.31T
256K: 34.6K 8.64G 4.30T 33.7M 8.42T 8.90T 36.0K 9.07G 4.31T
And the overall compression:
zfs get compressratio celestia_main && du -sh /celestia_main/bridge/.celestia-bridge-mocha-4/
NAME PROPERTY VALUE SOURCE
celestia_main compressratio 2.06x -
6.4T /celestia_main/bridge/.celestia-bridge-mocha-4/
To summarize: considering all of the above, I suggest the following tuning for the ZFS pool and dataset, taking the hardware recommendations into account.
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
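Putting it all together, an end-to-end sketch might look roughly like this; the pool/dataset names and the device path are placeholders, and the compression level is the zstd-3 recommended above:
# create the pool with 4K-aligned allocations (ashift is immutable after creation)
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
# create a dedicated dataset for the node data
zfs create $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# light compression plus larger records for the DA store
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# sanity check
zfs get compression,recordsize,compressratio $ZFS_POOL_NAME/$ZFS_DATASET_NAME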
After syncing, hardware resource consumption is fine and the node works well. I know the Celestia team released code that significantly reduces the disk space required for storing data, but considering the 1GB blocks feature that Celestia announced recently, this approach could become relevant again. In that case I plan to use it in the future and recommend it to others.
@mogoll92 @jcstein @kinrokinro
After many days of testing, we finally found the solution. I’ve shared all the key learnings here: Celestia Testnet - Bridge DA Node ZFS Optimizations.
Initially, we were stuck grinding through 20,000 blocks in four hours. However, after testing and optimization, we managed to complete the sync in just a couple of hours; the last known rate was around 250,000 blocks every 2 hours or less.
One major takeaway was that the high CPU usage we saw was largely due to IOWAIT, which is essentially the system waiting for disk I/O. This can often look like high CPU load on graphs but isn’t actual CPU work. The IOWAIT came from a disk backlog, with tasks sometimes taking 20 to 60 seconds or even longer to process/complete. While upgrading to Gen5 NVMe could have helped, we didn’t want to rely solely on more hardware.
Through various tweaks, we reduced the backlog to just 100ms to 4 seconds, leading to much faster sync times. At one point, we tried increasing the recordsize, which improved sync speed even more, but it came at the cost of higher RAM usage and increased disk space. To strike a balance, we decided to keep the default 128K recordsize.
It was a long process, but the results speak for themselves! Tested on Gen4 PCIe NVMe and DDR4 RAM with an Intel Xeon. In short (the corresponding commands are sketched right after this list):
- Enable compression
- Disable ZFS Auto trimming
- Disable ZFS sync
- Disable ZFS prefetch
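For reference, the first two tweaks map onto standard OpenZFS properties, and prefetch is a Linux kernel module parameter; a sketch with the placeholder names from the guide, using the zstd-3 level recommended earlier (the sync toggle is shown further down, after the data-integrity caveat):
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME          # enable compression
zpool set autotrim=off $ZFS_POOL_NAME                                # disable ZFS auto-trim
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable    # disable ZFS prefetch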
I would strongly advise against disabling sync because of data integrity: we are working with a database where data integrity is the main point, and if something goes wrong the database will be corrupted. You should not disable sync no matter what, IMO.
All of these tuning steps (the other content is mostly the same, as far as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the initial sync).
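In other words, something along these lines (placeholder names again; sync=disabled trades durability for speed, so use it only during the initial sync):
# before starting the initial sync
zfs set sync=disabled $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# once the node is fully synced, restore the default behaviour
zfs set sync=standard $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs get sync $ZFS_POOL_NAME/$ZFS_DATASET_NAME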
Agree with @kinrokinro that disabling sync may lead to corruption, so you need to be careful. Also, could you drop some stats and charts: how long it took, iowait during sync, space taken, etc.?
Could you also add info on ashift and hardware to your doc, please, before the pool-creation step? I dropped it here: https://github.com/celestiaorg/docs/pull/1694#issuecomment-2396654759
It's crucial to create the pool with ashift=12, as lower values could lead to performance degradation and huge iowait.
Added. No concerns from my side about these settings (we have ashift=12 and recordsize=128K by default).
I agree that sync should be re-enabled after the initial sync is completed.