nydus Speed up Nydus-image create

For a standard Linux kernel source tree, nydus-image create currently takes about 8s with the default configuration.

There are 65119 files in this kernel source, and the timings over three runs are:

# 1st run
$ time ./target-fusedev/release/nydus-image create -B liubo/src.bootstrap -D liubo liubo/kernel_src
[2022-05-19 02:08:28.078232 +08:00] INFO [rafs/src/metadata/md_v5.rs:28] rafs superblock features: COMPRESS_LZ4_BLOCK | DIGESTER_BLAKE3 | EXPLICIT_UID_GID
[2022-05-19 02:08:28.340749 +08:00] INFO [src/bin/nydus-image/main.rs:622] build successfully: BuildOutput { artifacts: [BuildOutputArtifact { bootstrap_name: "", blobs: [] }], blobs: ["95922bb8a3ed6bd320f1c8376c26197ef2107cb8ce6d9094a3a17da0dbe44810"], last_blob_size: Some(301576727), last_bootstrap_name: "" }

real    0m8.099s
user    0m6.667s
sys     0m1.345s

# 2nd run
real    0m7.999s
user    0m6.682s
sys     0m1.315s

# 3rd run
real    0m7.970s
user    0m6.702s
sys     0m1.267s

The above process involves:

iterating directories recursively
reading, compressing and writing data to the final blob
writing bootstrap
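
To see which of these steps dominates before reaching for a full profiler, each phase can be wrapped in a coarse timer. The following is only a toy sketch of that idea, not nydus-image's actual code: `build`, `phase`, and the `index` list (a stand-in for the bootstrap metadata) are hypothetical names, and `zlib` stands in for the real lz4/zstd compressors.

```python
import os
import time
import zlib
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name.
timings = defaultdict(float)

@contextmanager
def phase(name):
    """Add the time spent inside the `with` block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

def build(source_dir, blob_path):
    """Toy builder: walk the tree, then read, compress and append each file."""
    with phase("walk"):
        files = []
        for root, _dirs, names in os.walk(source_dir):
            for name in sorted(names):
                path = os.path.join(root, name)
                # Regular files only; symlinks are skipped in this sketch.
                if os.path.isfile(path) and not os.path.islink(path):
                    files.append(path)

    index = []  # hypothetical stand-in for bootstrap metadata
    with open(blob_path, "wb") as blob:
        for path in files:
            with phase("read"):
                with open(path, "rb") as f:
                    data = f.read()
            with phase("compress"):
                packed = zlib.compress(data, 1)  # stand-in for lz4/zstd
            with phase("write"):
                rel = os.path.relpath(path, source_dir)
                index.append((rel, blob.tell(), len(packed)))
                blob.write(packed)
    return index
```

After calling `build(...)`, printing `dict(timings)` gives a rough per-phase breakdown; the blob path should live outside the source directory.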

It looks like the bottleneck is not IO: I tried running the whole process in tmpfs, which gave only a 5% improvement.
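
The `time` output above points the same way. As a rough heuristic (not a substitute for profiling), the ratio (user + sys) / real shows how much of the wall-clock time was spent on-CPU: a value close to 1.0 suggests a single busy thread that rarely waits, while a value well above 1.0 would indicate work spread over several cores. A small sketch using the three runs reported above:

```python
# Timings copied from the three `time` runs above (seconds).
runs = [
    {"real": 8.099, "user": 6.667, "sys": 1.345},
    {"real": 7.999, "user": 6.682, "sys": 1.315},
    {"real": 7.970, "user": 6.702, "sys": 1.267},
]

def cpu_share(run):
    """Fraction of wall-clock time spent on-CPU: (user + sys) / real."""
    return (run["user"] + run["sys"]) / run["real"]

shares = [cpu_share(r) for r in runs]
for i, share in enumerate(shares, start=1):
    print(f"run {i}: cpu share = {share:.3f}")
```

All three ratios come out just under 1.0, which is consistent with a CPU-bound and largely single-threaded build.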

IMO, there is room to make it faster; the ideal goal is under 1s.

May 18 '22 18:05 liubogithub

We can use tracing to find where most of the time goes, and then optimize. The points I can think of so far:

Do blake3/sha256 hash calculation and lz4/zstd chunk compression concurrently; can we also write blob data concurrently?
Do not generate a tree structure for a single layer, to reduce the overhead of traversing nodes.
Disable rafs format and digest validation for bootstrap checking and parent bootstrap loading.
Use cached mode instead of direct mode to load the bootstrap on merge operations.
Improve tree.apply performance.
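
The first point, concurrent digest and compression, could look roughly like the sketch below. It is an illustration in Python rather than nydus-image's Rust code, and it rests on assumptions: `CHUNK_SIZE`, `process_chunk`, and `build_blob` are hypothetical names, and `zlib` stands in for lz4/zstd. The CPU-heavy work runs in a thread pool (both `hashlib` and `zlib` release the GIL on large buffers), while the blob is still written sequentially so that chunk offsets stay deterministic.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size, chosen for illustration

def split_chunks(data, size=CHUNK_SIZE):
    """Cut the input into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    """CPU-bound work for one chunk: digest plus compression."""
    digest = hashlib.sha256(chunk).hexdigest()
    compressed = zlib.compress(chunk, 1)  # stand-in for lz4/zstd
    return digest, compressed

def build_blob(data, workers=4):
    """Hash and compress chunks in parallel, then lay them out in order."""
    chunks = split_chunks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so offsets are stable.
        results = list(pool.map(process_chunk, chunks))

    blob = bytearray()
    table = []  # (digest, offset, compressed_size) per chunk
    for digest, compressed in results:
        table.append((digest, len(blob), len(compressed)))
        blob += compressed
    return table, bytes(blob)
```

Writing concurrently as well would require knowing each chunk's offset before its compressed size is known, which is why this sketch keeps the write step serial.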

May 19 '22 03:05 imeoer

@yawqi will help profile image conversion with nydus-image using flamegraph, many thanks!

@jiangliu what do you think?

Nov 07 '22 05:11 liubogithub

[Flamegraph image: test] This is a simple flamegraph I generated with flamegraph-rs using the following command; other use cases still need to be tested.

flamegraph -o ./test.svg -- ./nydus-image create -B blobs-v6/wq.bootstrap -D blobs-v6 -v 6 linux-6.0

Nov 09 '22 08:11 yawqi

As discussed offline, nydus-image spends most of its time in sha256 compression, which is good and matches our expectations.

Many thanks for the effort, @yawqi. Could you also do another flamegraph run with a relatively large parent bootstrap, to see the typical bootstrap-loading cost?

Nov 09 '22 19:11 liubogithub

[Flamegraph images: parent, son]

flamegraph -o ./parent.svg -- ./nydus-image create -B blobs-v6/wq.bootstrap -D blobs-v6 -v 6 ../workplace/github.com/yawqi/image-service
flamegraph -o ./son.svg -- ./nydus-image create --parent-bootstrap blobs-v6/wq.bootstrap -B blobs-v6-son/wq.bootstrap -D blobs-v6 -v 6 ./linux-6.0

I am not sure whether I am doing this the right way. The first flamegraph is from building the parent bootstrap, and the second is from building a bootstrap whose parent is that previous bootstrap.

The nydus-image version is v2.1.1. The parent bootstrap is 2.2MB, and the son bootstrap is 18MB.

Nov 10 '22 08:11 yawqi

It seems we can disable bootstrap/digest validation first, and then improve the speed of tree.apply.

Nov 11 '22 02:11 imeoer

I ran the same operations with a release build of master. Note that master uses zstd and sha256 by default, while v2.1.1 uses lz4 by default.

flamegraph -o ./parent-master.svg -- ./nydus-image-master create -B blobs-v6-master/wq.bootstrap -D blobs-v6-master -v 6 ../workplace/github.com/yawqi/image-service
flamegraph -o ./child-master.svg -- ./nydus-image-master create --parent-bootstrap blobs-v6-master/wq.bootstrap -B blobs-v6-master/wq-child.bootstrap -D blobs-v6-master -v 6 ./linux-6.0

Here are the results; the upper one is the parent, the lower one is the child. [Flamegraph images: parent-master, child-master]

The parent's source (./workplace/github.com/yawqi/image-service) is 4.2GB, and the child's source (linux-6.0) is 1.4GB. [Screenshot 2022-11-11 10 58 51]

Nov 11 '22 02:11 yawqi

[Screenshots 2022-11-11 13 17 24 and 2022-11-11 13 17 10] lz4+blake3 takes much less time than zstd+sha256 when creating the nydus image of my linux-6.0 repo.

zstd+blake3: [Screenshot 2022-11-11 15 51 27]

lz4+sha256: [Screenshot 2022-11-11 15 48 57]

Nov 11 '22 05:11 yawqi

[Screenshots 2022-11-11 13 26 58 and 2022-11-11 13 26 53] lz4+blake3 also takes much less time than zstd+sha256 when creating the nydus image of my nydus repo.
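
To compare digester/compressor combinations outside a full image build, a small harness like the one below could time every pairing over the same chunks. This is a sketch under stated assumptions: blake3, lz4, and zstd are not in the Python standard library, so `blake2b` and two `zlib` levels are used purely as stand-ins, and their timings say nothing about the real algorithms. To reproduce the comparison above, the entries in `DIGESTERS` and `COMPRESSORS` (both hypothetical names) would need to be swapped for real bindings.

```python
import hashlib
import itertools
import time
import zlib

# Stand-ins only: replace with real blake3 / lz4 / zstd bindings to get
# numbers that are comparable to the screenshots above.
DIGESTERS = {
    "sha256": lambda b: hashlib.sha256(b).digest(),
    "blake2b": lambda b: hashlib.blake2b(b).digest(),
}
COMPRESSORS = {
    "zlib-1": lambda b: zlib.compress(b, 1),
    "zlib-9": lambda b: zlib.compress(b, 9),
}

def bench(chunks):
    """Return wall-clock seconds per 'compressor+digester' combination."""
    results = {}
    pairs = itertools.product(DIGESTERS.items(), COMPRESSORS.items())
    for (digest_name, digest), (comp_name, compress) in pairs:
        start = time.perf_counter()
        for chunk in chunks:
            digest(chunk)
            compress(chunk)
        results[f"{comp_name}+{digest_name}"] = time.perf_counter() - start
    return results
```

Feeding `bench` with chunks read from the actual source tree, rather than synthetic data, keeps the compression ratios and timings representative.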