Dragonfly arm and x86 resources request for testing and building

Open gaius-qi opened this issue 1 year ago • 9 comments

Please fill out the details below to file a request for access to the CNCF Community Infrastructure Lab. Please note that access is targeted to people working on specific open source projects; this is not designed just to get your feet wet. The most important answer is the URL of the project you'll be working with.

If you're looking to learn Kubernetes and related technologies, please try out Killercoda or Play With Kubernetes instead.

First and Last Name

Wenbo Qi

Email

[email protected]

Company/Organization

Ant Group

Job Title

Senior Software Engineer

Project Title (i.e., a summary of what do you want to do, not what is the name of the open source project you're working with)

  1. E2E Testing of Dragonfly on arm and x86.
  2. Building of Dragonfly on arm and x86.

Briefly describe the project (i.e., what are the details of what you're planning to do with these servers?)

The purpose of these instances is to set up E2E testing of Dragonfly on arm and x86. The general idea is that every night we would pull the main branch or any other release branch, run a full regression test on it, and then publish the results.

The E2E testing results are essential to the Dragonfly development process, since releases are based on them, and they can be accessed by the public.
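
As a rough illustration of the nightly selection step described above, the sketch below picks `main` plus any release branches from a list of branch names. The `release-*` naming pattern, the function name, and the sample branch list are assumptions made for this example; they are not taken from the Dragonfly repositories.

```python
import fnmatch

# Assumed naming convention for release branches; adjust to the real scheme.
RELEASE_GLOB = "release-*"


def nightly_branches(all_branches):
    """Return the branches a nightly regression run would cover."""
    selected = []
    for name in all_branches:
        # Always test main, plus anything matching the assumed release pattern.
        if name == "main" or fnmatch.fnmatchcase(name, RELEASE_GLOB):
            selected.append(name)
    return selected


# Hypothetical branch list, for illustration only.
branches = ["main", "release-2.1", "feature/p2p-tuning", "release-2.0"]
selected = nightly_branches(branches)
print(selected)
```

A nightly job could loop over the returned names, check out each branch, run the regression suite, and publish one result per branch.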

Is the code that you're going to run 100% open source? If so, what is the URL or URLs where it is located? What is your association with that project?

Yes, it's the CNCF-hosted project Dragonfly, https://github.com/dragonflyoss. I'm one of the core maintainers.

What kind of machines and how many do you expect to use (see: https://deploy.equinix.com/product/bare-metal/servers/)?

  • c3.large.arm64 -- 4 instances (need to build P2P)
  • c3.medium.x86 -- 7 instances (need to build P2P)

What operating system and networking are you planning to use?

  • Jenkins Server & Worker nodes
  • Deploy the public Dragonfly Manager service
  • Testing Nodes for pipeline
  • Building Nodes for pipeline
  • P2P Stress testing.

Any other relevant details we should know about?

Ubuntu 18.04/2 x 10 Gbps

gaius-qi avatar Jun 21 '24 08:06 gaius-qi

@gaius-qi thanks. We'll review this and provide an update.

idvoretskyi avatar Jun 21 '24 09:06 idvoretskyi

@idvoretskyi Thanks! I don't think the arm needs such a large spec, but the list only has c3.large.arm64.

gaius-qi avatar Jun 21 '24 09:06 gaius-qi

Understood, thanks!

idvoretskyi avatar Jun 21 '24 09:06 idvoretskyi

Could it be possible to do this sort of E2E testing within Containers or VMs? It looks like the need for multiple machines is to build a p2p network, but, you could still mock that up within a single machine (assuming the resource footprint isn't that large)

jeefy avatar Jun 21 '24 18:06 jeefy

@jeefy I think testing within containers is a good idea. But I need at least 2 machines to do the stress test and to measure the bandwidth usage between the test machines in a p2p network. With more than 2 machines, I can better simulate the bandwidth usage of a real network.

gaius-qi avatar Jun 21 '24 23:06 gaius-qi

Is it possible to automate the setup / teardown of these clusters? If you can provision everything through automation, and tear it all down successfully when you're done with the test run, it's possible to save on resources.

Our typical systems have 2x25G NICs so you should be able to exercise quite a bit of bandwidth. Best to start at least with same datacenter testing before you do bandwidth tests between DCs.

vielmetti avatar Jun 24 '24 17:06 vielmetti

@vielmetti I think that's a good idea. I will automate the setup/teardown of the p2p clusters.
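
A minimal sketch of what automated setup/teardown could look like, assuming a provisioning client with `create` and `delete` operations. `FakeProvisioner`, `ephemeral_cluster`, and `run_stress_test` are hypothetical names used only for illustration; they do not correspond to any real Equinix or Dragonfly API. The point of the pattern is that nodes are released even when the test run fails.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class FakeProvisioner:
    """Hypothetical in-memory stand-in for a real provisioning client."""
    active: set = field(default_factory=set)
    counter: int = 0

    def create(self, plan: str) -> str:
        # Pretend to provision one machine of the given plan.
        self.counter += 1
        node_id = f"{plan}-{self.counter}"
        self.active.add(node_id)
        return node_id

    def delete(self, node_id: str) -> None:
        # Pretend to release the machine.
        self.active.discard(node_id)


@contextmanager
def ephemeral_cluster(provisioner, plan: str, count: int):
    """Provision `count` nodes and always release them afterwards."""
    nodes = []
    try:
        for _ in range(count):
            nodes.append(provisioner.create(plan))
        yield nodes
    finally:
        # Runs on success and on failure, so no node is left behind.
        for node_id in nodes:
            provisioner.delete(node_id)


def run_stress_test(nodes):
    # Placeholder for the real P2P stress test.
    return {"nodes": len(nodes), "passed": True}


provisioner = FakeProvisioner()
with ephemeral_cluster(provisioner, "c3.medium.x86", 3) as nodes:
    result = run_stress_test(nodes)
```

Wrapping the test run in a context manager keeps the teardown logic in one place, so a crashed or aborted run does not keep machines allocated.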

gaius-qi avatar Jun 25 '24 00:06 gaius-qi

@vielmetti Hey! Do I need to provide any other information? 🙏🙏🙏

gaius-qi avatar Jul 03 '24 13:07 gaius-qi

@vielmetti @jeefy are we good with this?

idvoretskyi avatar Jul 03 '24 14:07 idvoretskyi

@gaius-qi we'll work on this internally and get back to you.

idvoretskyi avatar Jul 08 '24 10:07 idvoretskyi

@idvoretskyi Thanks 🙏🙏🙏

gaius-qi avatar Jul 08 '24 11:07 gaius-qi

Heya! We're +1 to this and we'll get you spun up on Equinix soon.

That said, we do have a couple questions still:

  1. Why do you need Jenkins when you're already using GitHub Actions in your existing E2E testing? https://github.com/dragonflyoss/Dragonfly2/actions/runs/9894081762/job/27330759913
  2. Could you potentially spin up and spin down the p2p e2e resources within a GitHub Action?

Thanks and look for an Equinix email invite soon. :)

jeefy avatar Jul 11 '24 16:07 jeefy

@jeefy Thank you very much for your reply!

  1. GitHub Actions only provides a free 1c2g machine to run CI. Due to resource limitations, it is not possible to start a multi-node Kubernetes cluster through minikube in Actions to simulate p2p transmission in Kubernetes. The e2e test can only start a single-node Kubernetes cluster to test the basic features of Dragonfly, and does not cover the use case of p2p transmission between multiple nodes. Even with only one Kubernetes node, multiple services such as manager, scheduler, peer, mysql and redis need to be deployed, which takes up a lot of resources. Each e2e test takes more than 1 hour, refer to https://github.com/dragonflyoss/Dragonfly2/actions/runs/9894081792. Building the Rust version of the client image is even slower, taking more than 4 hours, refer to https://github.com/dragonflyoss/client/actions/runs/9887062386/job/27308015065. And that is just the time to build the Rust client image for the amd architecture. I tried building the client image for the arm architecture and it was even slower, which led me to remove support for the arm version of the image.

  2. The free runners provided by GitHub Actions are very limited and cannot simulate Kubernetes with more than 1 node. Dragonfly services include manager, scheduler, seed peer, peer, mysql and redis, so six services need to be started. To test p2p transmission, the number of peers has to be increased further. Therefore, the free resources provided by GitHub Actions are not enough to run these tests, and when there are too many services running, CI often becomes unstable and fails abnormally, for example due to OOM or insufficient disk space.

gaius-qi avatar Jul 11 '24 17:07 gaius-qi

@gaius-qi we may offer the large GH runners as well, would that work?

https://docs.github.com/en/enterprise-cloud@latest/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners

Also, what @jeefy said - any reason why you need Jenkins for this? If possible, our suggestion would be to use the GitHub large runners or GitHub self-hosted runners with Actions as a more effective and pure cloud native way :)

idvoretskyi avatar Jul 12 '24 09:07 idvoretskyi

@idvoretskyi I think providing large GH runners is good! I need arm64 (Ubuntu) and x86_64/amd64 (Ubuntu) runners 🙏

But deploying the public Dragonfly Manager service and P2P stress testing still need VMs. They don't need to be large. For stress testing I need to simulate transfers between multiple nodes. Deploying the public Dragonfly Manager makes it convenient for users to try Dragonfly quickly without building a complete P2P cluster.

gaius-qi avatar Jul 12 '24 10:07 gaius-qi

Hi @idvoretskyi, just checking in on the progress of the Actions large runner. How's it going? 🙏

gaius-qi avatar Jul 26 '24 08:07 gaius-qi

It's on me, thank you for the reminder!

idvoretskyi avatar Jul 26 '24 11:07 idvoretskyi

Thanks 😃

gaius-qi avatar Jul 26 '24 15:07 gaius-qi

@gaius-qi invite sent!

idvoretskyi avatar Jul 30 '24 12:07 idvoretskyi

@idvoretskyi Thanks! I have accepted the invitation. Should I add resources by myself? 🙏🙏🙏

gaius-qi avatar Jul 30 '24 16:07 gaius-qi

@idvoretskyi Thank you very much for your help! I will close the issue.

gaius-qi avatar Aug 03 '24 03:08 gaius-qi