Add ZFS.md
Overview
Add documentation on how to run a DA node with ZFS compression.
Summary by CodeRabbit
- New Features
- Introduced a new documentation section for setting up a bridge node with optional on-the-fly compression using ZFS.
- Added a comprehensive guide on configuring a DA node to utilize ZFS compression, including requirements, step-by-step instructions, and command examples for various network environments.
Walkthrough
The documentation has been enhanced with two significant updates. An additional section titled "Optional: enable on-fly compression with ZFS" has been added to the existing how-to-guides/bridge-node.md file. Furthermore, a new guide has been introduced in how-to-guides/zfs.md, detailing the setup of a Data Availability (DA) node with on-the-fly compression using ZFS. This guide includes hardware requirements, installation steps, and advanced tuning options for users.
Changes
| Files | Change Summary |
|---|---|
| how-to-guides/bridge-node.md | New section added: "Optional: enable on-fly compression with ZFS." |
| how-to-guides/zfs.md | New document added providing a comprehensive guide for setting up a DA node with ZFS compression, including requirements and detailed instructions. |
hmm @Wondertan I think we'd probably advise not to include this hack in docs?
@jcstein, mentioning that in the docs may be helpful, especially if storage issues are pressing for node runners, as it's ~ a 2x reduction.
@kinrokinro, for how long have you been running the node with compression on? Do you see it running stable?
For a month, and yes, it's running stable. Our bridge is always in sync and everything looks pretty good to me; that's why I recommend adding this option to the docs.
Anyway, we can always mark it as "not recommended" if there is any concern from your side.
BTW, zstd-3 is a light compression level; the max is zstd-19.
Currently we're running both our bridges, on testnet and mainnet, on ZFS, so our metrics are available in your OTEL.
Thank you for the context and feedback! I think it is safe to recommend this if @Wondertan approves, and then I will work it into the sidebar menu.
I think zstd-3 is optimal: higher zstd levels take more time to compress, so under a bigger load the node could be behind sync all the time.
confirming with @kinrokinro that this uses zstd-3?
According to the doc that @kinrokinro wrote, yes, it uses zstd-3, and with this compression he achieved a 2.05x ratio. I found an interesting discussion regarding ZFS compression levels that gives a good overview: https://www.reddit.com/r/zfs/comments/sxx9p7/a_simple_real_world_zfs_compression_speed_an/
According to it, levels above zstd-10 lead to a crazy slowdown, and the ratio gain isn't big enough to sacrifice performance for. I'm going to try zstd-5 with my Mocha bridge node to see whether there are any performance issues and what compression ratio I get compared to zstd-3; after the full sync I'll probably ask @kinrokinro to share his numbers and compare.
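For anyone else experimenting: switching the level is just a dataset property change, and it only applies to newly written data. A minimal sketch, assuming the $ZFS_POOL_NAME/$ZFS_DATASET_NAME placeholders from the guide:
# try zstd-5 on the dataset (affects newly written data only)
zfs set compression=zstd-5 $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# check the active level and the achieved ratio
zfs get compression,compressratio $ZFS_POOL_NAME/$ZFS_DATASET_NAME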
Yes, we use zstd-3 compression level
@kinrokinro do you want to add this to the menu? or think that it is okay as is, linked in the "optional" section?
This should be optional, of course.
It would be great to add here that the processor is more important than RAM. I tried with 2x Intel Xeon Silver 4108 but it doesn't fly; I had to request a more powerful one. The bridge node just syncs too slowly, and the CPU is always above 90% while RAM (64GB) sits at about 50%.
Perhaps it's because of the more powerful compression level you chose.
Will provide much more info here on the issues I ran into, with charts; the node is still syncing. But after switching from 2x Intel Xeon Silver 4108 to 2x Intel Xeon Gold 6130, my node is flying; I just want to see the final compression ratio with the level we set (zstd-5).
so this https://github.com/celestiaorg/docs/pull/1694/commits/baead03109e63825a544a0cb0d249cb1068bb49b should resolve the issue you were facing @mogoll92 ?
Not completely, unfortunately.
With more data synced, the performance of the DA node has been decreasing. I thought the 2x Intel Xeon Gold 6130 had solved that issue, as it is more powerful and I didn't see a huge load on it, around 50% with +/- 10%. On the screenshot you can see that the node reached 1.5M blocks very fast, but after that it started degrading in performance. I assume it is now because the NVMe disks are on PCIe-3, so the node gets stuck on I/O, which puts load on the processor and the system overall. I should mention that I kept the testnet bridge on the server with the 2x Intel Xeon Silver 4108 and PCIe-3 disks, without ZFS, and that node was syncing and working just fine. I have now rented an AMD EPYC 7313P with PCIe-4 drives and will try that setup.
Adding to the performance degradation, on the screenshot below you can see that the node now progresses by only 20k blocks every 3 hours (sadly). Hope more capable drives will solve it.
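If it helps with diagnosing this, the "waiting on disk" theory is easy to confirm from the iowait column of iostat. A small sketch, assuming a Debian/Ubuntu box where the sysstat package provides iostat:
sudo apt install sysstat   # provides iostat
iostat -x 5                # %iowait and per-device utilization, refreshed every 5 seconds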
Based on this, we'll have to recommend not using this yet, but I'm happy to leave the PR open until it makes sense to add it.
Agree. The 4 days have almost passed and the node is still syncing after switching to the new hardware. Less than 1M blocks are left, at a syncing speed of about 40k blocks per 12 hours, though the speed depends on how much data the blocks carry, so it's faster sometimes. Let's see how it goes after reaching 2M blocks.
I'll drop all insights on the issues here once the node is synced and I've confirmed that metrics are being sent correctly. This testing might be worth converting into a forum post for users who decide to run a DA node with ZFS later.
Could you try with zstd-3, not zstd-5?
Yeah, I dropped the idea of syncing with zstd-5, as it requires more of the processor's resources and the most I gained compared to zstd-3 was about 5% in storage savings. So I would not recommend anything higher than zstd-3, or do it at your own risk.
The node is now on zstd-3 compression from the beginning.
(Mocha bridge node) Alright, I did some investigation over the previous month, and here is what I found...
Jumping ahead: the heavy CPU and disk usage happens ONLY while the bridge node is syncing; after that it returns to normal, with only some really small spikes. ZFS uses a certain amount of CPU to process the IOPS, most of which is compression and checksumming. I used the following hardware setup for my node:
- CPU: AMD EPYC 7313P
- RAM: ECC DDR4 128GB 3200MHz
- Disks: 3x Micron 7450 MTFDKBG3T8TFR
The node took about 9 days to get fully synced; here is some of the hardware resource consumption:
Below I would like to list recommendations for starting to use ZFS.
- CPU: From the screenshot above, you can see that the CPU is really crucial, but I would say only during synchronization. Once the node is synced, CPU usage drops to 5-10%, with some spikes up to 100%, which is okay for bridge nodes, even without ZFS. I would recommend selecting the processor carefully for a server and using the one shared above as a starting point. The more powerful the processor, the less time it will take for your node to sync. However, if you have enough time, you can opt for something similar to what I have.
- RAM: Huge RAM isn't necessary here; 64GB DDR4 should be fine for the sync and for running the node afterwards.
- Disk(s): The disk is a bit tricky here, as low throughput could lead to higher iowait and, as a result, increase the CPU load, which is already stressed by ZFS. I would recommend using only NVMe drives with PCIe-4 support (which usually provides acceptable I/O speed). Also, check whether the disk has a 4096-byte (4KB) physical sector size, as that will be important for tuning the ZFS pool correctly. Additionally, the larger the I/O sizes the disk supports (minimum/optimal), the better, though it's not strictly necessary; I've used disks with a 4096-byte minimum/optimal size and it's been fine. I know Samsung disks offer larger I/O sizes, but that's up to your preference. You can check whether your disks have a 4096-byte physical sector size with the following command:
sudo fdisk -l
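If the fdisk output is noisy, the sector sizes can also be read per device; a small sketch, assuming an NVMe drive at /dev/nvme0n1:
# physical and logical sector sizes for every block device
lsblk -o NAME,PHY-SEC,LOG-SEC
# or read it directly from sysfs for a specific drive
cat /sys/block/nvme0n1/queue/physical_block_size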
- ZFS: I've found a couple of things that are important for ZFS pool creation and dataset tuning.
- ashift property. The ashift property determines the block allocation size that ZFS uses per vdev. Ideally this value should match the sector size of the underlying physical device (the sector size being the smallest physical unit that can be read from or written to that device); that's why it's important for the disk to have a 4KB physical sector size, to avoid I/O bottlenecks. So once you are sure your disks have the recommended physical sector size, I recommend setting ashift=12 at pool creation. This property is immutable and can't be changed later. Ex:
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
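Since ashift can't be fixed after the fact, it's worth double-checking it right after creating the pool; a sketch with the same placeholder pool name:
# pool property (12 corresponds to 4096-byte allocations)
zpool get ashift $ZFS_POOL_NAME
# ashift actually recorded on the vdev, from the cached pool config
sudo zdb -C $ZFS_POOL_NAME | grep ashift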
Below I would like to show the I/O wait and disk load with ashift=9 and the corresponding 512-byte disk physical sector size. Yes, an NVMe disk with PCIe-4 support can still have this :)
And here it is with the zpool ashift property configured correctly.
The I/O wait in the first case is terrible: max iowait reaches 30% and the average is almost 3%, which causes additional load on the CPU and results in the node frequently getting stuck.
- zstd compression algorithm. I wouldn't recommend going higher than zstd-3. The higher the compression level, the more CPU is consumed during sync. Testing with zstd-5 showed that the maximum compression ratio difference compared to zstd-3 was only about 5%, which is not worth the performance loss just to save 5% of storage. Therefore the recommendation, same as in the guide, is to use zstd-3.
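If you want a rough, out-of-band feel for how the CPU cost grows with the level before committing a pool to it, the standalone zstd CLI has a benchmark mode. This is only an approximation (ZFS's in-kernel zstd won't have identical timings) and the file path is just a placeholder:
# benchmark levels 3 through 19 on a representative sample file
zstd -b3 -e19 /path/to/sample-data-file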
- recordsize property. ZFS splits files into blocks before writing them to disk; recordsize defines the maximum size of these blocks. By default it's 128KB, but you can adjust it depending on your workload and performance needs. I've set recordsize=256K for the dataset, considering that Celestia DA nodes store a significant amount of data that can benefit from larger blocks. For mainnet, you could even increase the recordsize to 512K, as it handles a much larger volume of data. This property can be set at any time, but it's recommended to set it from the beginning, as changing it later won't affect data that's already stored.
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
It gives me a better compression ratio and seems to reduce the load, as there's no need to split data into 128KB blocks that could instead be stored as 256KB blocks, resulting in fewer I/O operations. The numbers below look a bit odd, since the 256KB bucket shows huge compression, but overall I got ~5% more storage savings compared to a 128KB recordsize (compared with @kinrokinro, who has 1.97x).
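For anyone wanting to reproduce this kind of output: as far as I remember it comes from zdb's block statistics, so treat the exact flag count below as an assumption and expect it to take a while on a large pool:
# block statistics, including the block size histogram
sudo zdb -bbb $ZFS_POOL_NAME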
Block Size Histogram
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 21 10.5K 10.5K 21 10.5K 10.5K 0 0 0
1K: 8.95K 8.95M 8.96M 8.95K 8.95M 8.96M 0 0 0
2K: 8.94K 31.3M 40.3M 8.94K 31.3M 40.3M 0 0 0
4K: 3.77M 15.1G 15.1G 52.6K 210M 251M 572K 2.24G 2.24G
8K: 1.06M 10.3G 25.4G 90.8K 1.37G 1.62G 4.22M 35.4G 37.6G
16K: 1.12M 25.5G 50.9G 621K 9.93G 11.6G 1.18M 26.7G 64.4G
32K: 1.85M 84.5G 135G 262K 15.3G 26.8G 1.86M 84.9G 149G
64K: 15.8M 1.42T 1.55T 452K 30.6G 57.4G 15.8M 1.42T 1.57T
128K: 14.6M 2.74T 4.29T 3.18M 432G 490G 14.6M 2.74T 4.31T
256K: 34.6K 8.64G 4.30T 33.7M 8.42T 8.90T 36.0K 9.07G 4.31T
And the overall compression:
zfs get compressratio celestia_main && du -sh /celestia_main/bridge/.celestia-bridge-mocha-4/
NAME PROPERTY VALUE SOURCE
celestia_main compressratio 2.06x -
6.4T /celestia_main/bridge/.celestia-bridge-mocha-4/
To summarize: considering all of the above, I suggest the following tuning for the ZFS pool and dataset, taking the hardware recommendations into account.
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
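Putting it all together, an end-to-end sketch might look roughly like this; the pool/dataset names and the device path are placeholders, and the compression level is the zstd-3 recommended above:
# create the pool with 4K-aligned allocations (ashift is immutable after creation)
zpool create -o ashift=12 $ZFS_POOL_NAME /dev/nvme0n1
# create a dedicated dataset for the node data
zfs create $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# light compression plus larger records for the DA store
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs set recordsize=256K $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# sanity check
zfs get compression,recordsize,compressratio $ZFS_POOL_NAME/$ZFS_DATASET_NAME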
After syncing, hardware resource consumption is fine and the node works well. I know the Celestia team released code that significantly reduces the disk space required for storing data, but considering the 1GB blocks feature that Celestia announced recently, this approach could become relevant again. In that case I plan to use it in the future and recommend it to others.
@mogoll92 @jcstein @kinrokinro
After many days of testing, we finally found the solution. I’ve shared all the key learnings here: Celestia Testnet - Bridge DA Node ZFS Optimizations.
Initially, we were stuck grinding through 20,000 blocks in four hours. However, after testing and optimization, we managed to complete the sync in just a couple of hours; the last known rate was around 250,000 blocks every 2 hours or less.
One major takeaway was that the high CPU usage we saw was largely due to IOWAIT, which is essentially the system waiting for disk I/O. This can often look like high CPU load on graphs but isn’t actual CPU work. The IOWAIT came from a disk backlog, with tasks sometimes taking 20 to 60 seconds or even longer to process/complete. While upgrading to Gen5 NVMe could have helped, we didn’t want to rely solely on more hardware.
Through various tweaks, we reduced the backlog to just 100ms to 4 seconds, leading to much faster sync times. At one point, we tried increasing the recordsize, which improved sync speed even more, but it came at the cost of higher RAM usage and increased disk space. To strike a balance, we decided to keep the default 128K recordsize.
It was a long process, but the results speak for themselves! Tested on Gen4 PCIe NVMe and DDR4 RAM with an Intel Xeon. In short (the corresponding commands are sketched right after this list):
- Enable compression
- Disable ZFS Auto trimming
- Disable ZFS sync
- Disable ZFS prefetch
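For reference, the first two tweaks map onto standard OpenZFS properties, and prefetch is a Linux kernel module parameter; a sketch with the placeholder names from the guide, using the zstd-3 level recommended earlier (the sync toggle is shown further down, after the data-integrity caveat):
zfs set compression=zstd-3 $ZFS_POOL_NAME/$ZFS_DATASET_NAME          # enable compression
zpool set autotrim=off $ZFS_POOL_NAME                                # disable ZFS auto-trim
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable    # disable ZFS prefetch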
I would strongly advise against disabling sync because of data integrity: we are working with a database where data integrity is the main point, and if something goes wrong the database will be corrupted. You should not disable sync no matter what, IMO.
All of these tuning steps (the other content is mostly the same, as far as I can see) have been added to the documentation, along with remarks and instructions on how to use the sync=disabled setting properly (only during the initial sync).
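In other words, something along these lines (placeholder names again; sync=disabled trades durability for speed, so use it only during the initial sync):
# before starting the initial sync
zfs set sync=disabled $ZFS_POOL_NAME/$ZFS_DATASET_NAME
# once the node is fully synced, restore the default behaviour
zfs set sync=standard $ZFS_POOL_NAME/$ZFS_DATASET_NAME
zfs get sync $ZFS_POOL_NAME/$ZFS_DATASET_NAME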
Agree with @kinrokinro that disabling sync may lead to corruption, so you need to be careful. Also, could you drop some stats and charts: how long it took, iowait during sync, space taken, etc.?
Could you also add info on ashift and hardware to your doc, please, before the pool-creation step? I dropped it here: https://github.com/celestiaorg/docs/pull/1694#issuecomment-2396654759
It's crucial to create the pool with ashift=12, as lower values could lead to performance degradation and huge iowait.
Added. No concerns from my side about these settings (we have ashift=12 and recordsize=128K by default).
I agree that sync should be re-enabled after the initial sync is completed.