lotus Infra improvements for CC Sector sealing

Improvement Suggestion:

Lotus Miner software is tested on datacenter hardware prior to public releases. Previously released Lotus versions did not always exercise full miner workflows, from sealing to window posts. The infra team should advocate for and support creating automation workflows and providing hardware to test these flows prior to releases.

Success criteria:

[ ] Upgrade infra (1 mainnet bootstrap node, testnet infra (scratch nodes), and 1 miner on testnet) whenever there is a new rc/final release. (P1)
- Health check is out of scope for this ticket
[ ] Investigate the sealing on intel/cloud - what's the feasible solution? (P2)
[ ] CC sector is sealed and proven on calibnet on every RC (P3 - blocked by item 2)
[ ] CC sector is sealed and proven on every major release (P3 - blocked by item 2)
[ ] Successfully making one deal (P2)
[ ] A collective grafana board for metrics

Jul 26 '21 16:07 BlocksOnAChain

@travisperson we need to upgrade nodes for internal infra and pass the checks listed in our release process, for example this list: https://github.com/filecoin-project/lotus/issues/6270

Jul 26 '21 16:07 BlocksOnAChain

@travisperson any estimates on how much work we have left on this issue? It would be great if we could test it out with 1.11.1 or the follow-up releases. FYI: @jennijuju @BigLep

Aug 05 '21 15:08 BlocksOnAChain

I don't think the automation for this will occur for v1.11.1.

Upgrade infra (1 mainnet bootstrap node, testnet infra (scratch nodes), and 1 miner on testnet) whenever there is a new rc/final release. (P1)

This work is basically finished. There is a PR in the lotus-infra repo with some documentation in its description (https://github.com/filecoin-project/lotus-infra/pull/420). The integration steps into lotus still need to be done. I do not have permissions to set this up.

Investigate the sealing on intel/cloud - what's the feasible solution? (P2)

This is still ongoing. It is broken out into 3 tasks right now:

1. Get a successful 512MiB benchmark (lotus-bench) run on a p3 aws ec2 machine using the GPU (current)
2. Run a 32GiB sealing benchmark on a p3 aws ec2 machine using the GPU
3. Report sealing times

CC sector is sealed and proven on calibnet on every RC (P3 - blocked by item 2)

Pick integration path for p3 machines (one-off spin-up for each task, or keep them running and add them to our k8s clusters)
Write job in lotus-infra deployment code
Write benchmark reporting
Execute job as part of lotus-release-automation (this will piggyback on the upgrade work above)

I tried many things to get a successful 512MiB benchmark run with a GPU, but I'm having to resort to what I know might work, which is running through a complete network setup. Any time I've tried to set up a single instance, benchmarks have always failed with a PoRep verification issue.

Aug 06 '21 14:08 travisperson

Get a successful 512MiB benchmark (lotus-bench) run on a p3 aws ec2 machine using the GPU (current)

Done!

Our automation for setting up networks still works. I'm still unsure of what I was doing wrong when I was manually setting up this test. I'm going to kick off three 32GiB seal operations with some different configurations to get a good idea of the sealing time on the available servers to use for automation. Should have results Monday of next week.

BELLMAN_CUSTOM_GPU=Tesla V100-SXM2-16GB:5120
----
results (v28) SectorSize:(536870912), SectorNumber:(1)
seal: addPiece: 2.243073341s (228.3 MiB/s)
seal: preCommit phase 1: 4m11.731171834s (2.034 MiB/s)
seal: preCommit phase 2: 52.990231854s (9.662 MiB/s)
seal: commit phase 1: 30.014666ms (16.66 GiB/s)
seal: commit phase 2: 1m4.786643271s (7.903 MiB/s)
seal: verify: 4.832282ms
unseal: 4m17.351049108s  (1.99 MiB/s)

generate candidates: 270.839µs (1.803 TiB/s)
compute winning post proof (cold): 2.465105582s
compute winning post proof (hot): 2.255274389s
verify winning post proof (cold): 89.412825ms
verify winning post proof (hot): 5.513176ms

compute window post proof (cold): 1.993851407s
compute window post proof (hot): 1.887036497s
verify window post proof (cold): 28.828929ms
verify window post proof (hot): 5.066489ms

Aug 06 '21 17:08 travisperson

benchmark-0

environment variable list:
BELLMAN_CUSTOM_GPU=Tesla V100-SXM2-16GB:5120
----
results (v28) SectorSize:(34359738368), SectorNumber:(1)
seal: addPiece: 3m20.716164372s (163.3 MiB/s)
seal: preCommit phase 1: 26h22m18.238850173s (353.4 KiB/s)
seal: preCommit phase 2: 2h47m53.421897627s (3.253 MiB/s)
seal: commit phase 1: 2.780160389s (11.51 GiB/s)
seal: commit phase 2: 27m25.520509263s (19.91 MiB/s)
seal: verify: 10.230417ms
unseal: 26h23m40.838269043s  (353.1 KiB/s)

generate candidates: 16.586315ms (1.884 TiB/s)
compute winning post proof (cold): 2.950296129s
compute winning post proof (hot): 2.609653456s
verify winning post proof (cold): 127.358636ms
verify winning post proof (hot): 5.458753ms

compute window post proof (cold): 7m23.073211503s
compute window post proof (hot): 3m35.019365705s
verify window post proof (cold): 13.782046217s
verify window post proof (hot): 19.892327ms

benchmark-1

environment variable list:
FIL_PROOFS_MAXIMIZE_CACHING=1
FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1
FIL_PROOFS_USE_GPU_TREE_BUILDER=1
BELLMAN_CUSTOM_GPU=Tesla V100-SXM2-16GB:5120
----
results (v28) SectorSize:(34359738368), SectorNumber:(1)
seal: addPiece: 3m54.426498669s (139.8 MiB/s)
seal: preCommit phase 1: 26h16m52.989693002s (354.6 KiB/s)
seal: preCommit phase 2: 37m40.310534965s (14.5 MiB/s)
seal: commit phase 1: 3.834395112s (8.346 GiB/s)
seal: commit phase 2: 1h29m57.533951208s (6.071 MiB/s)
seal: verify: 10.191527ms
unseal: 26h20m45.232831306s  (353.8 KiB/s)

generate candidates: 27.168556ms (1.15 TiB/s)
compute winning post proof (cold): 7.301135963s
compute winning post proof (hot): 6.149815308s
verify winning post proof (cold): 100.044665ms
verify winning post proof (hot): 5.193694ms

compute window post proof (cold): 21m6.689347137s
compute window post proof (hot): 17m15.227218786s
verify window post proof (cold): 13.787953841s
verify window post proof (hot): 19.887303ms

benchmark-2

environment variable list:
FIL_PROOFS_MAXIMIZE_CACHING=1
FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1
FIL_PROOFS_USE_GPU_TREE_BUILDER=1
FIL_PROOFS_USE_MULTICORE_SDR=1
BELLMAN_CUSTOM_GPU=Tesla V100-SXM2-16GB:5120
----
results (v28) SectorSize:(34359738368), SectorNumber:(1)
seal: addPiece: 3m49.48970764s (142.8 MiB/s)
seal: preCommit phase 1: 22h19m40.140909032s (417.4 KiB/s)
seal: preCommit phase 2: 41m36.197958093s (13.13 MiB/s)
seal: commit phase 1: 3.351899219s (9.547 GiB/s)
seal: commit phase 2: 1h28m51.982620733s (6.146 MiB/s)
seal: verify: 9.467024ms
unseal: 22h2m14.279382477s  (422.9 KiB/s)

generate candidates: 1.112349ms (28.09 TiB/s)
compute winning post proof (cold): 6.510175138s
compute winning post proof (hot): 6.214723987s
verify winning post proof (cold): 92.523442ms
verify winning post proof (hot): 5.248734ms

compute window post proof (cold): 21m12.663614077s
compute window post proof (hot): 17m19.645446494s
verify window post proof (cold): 13.77662879s
verify window post proof (hot): 19.893733ms

Aug 09 '21 17:08 travisperson

@travisperson:

This work is basically finished. There is a PR in the lotus-infra repo with some documentation in its description (filecoin-project/lotus-infra#420). The integration steps into lotus still need to be done. I do not have permissions to set this up.

Thanks for the updates.

So what specifically do we need to do to fully land the P1? Do you need lotus permissions for "A webhook will need to be configured in the lotus project repo"? Does this mean needing Admin permissions? We can get you added.

Aug 14 '21 00:08 BigLep

The P1 work is now finished, and everything is integrated.

Aug 23 '21 22:08 travisperson

@jennijuju did we test this flow with the latest releases? Can we close this issue, based on the comments above from @travisperson?

Oct 14 '21 14:10 BlocksOnAChain
