Benchmark probes / auto-scale tests per probe

jimaek opened this issue 2 years ago • 20 comments

A small Raspberry Pi should not be getting the same number of tests as an 8-core Intel server. Maybe run a local benchmark during startup or something? Because an 8-core Atom is not the same as an 8-core Xeon either.

@patrykcieszkowski We haven't come up with anything concrete yet. Artem planned to do some manual benchmarks and see if he can come up with usable data to then integrate into the code. Maybe you have an idea of how to do this?

jimaek avatar Mar 21 '22 09:03 jimaek

We can gather basic metadata with Node's os module.

const os = require('os')

// Number of logical CPU cores visible to the process
const cpuCount = os.cpus().length

patrykcieszkowski avatar Mar 21 '22 09:03 patrykcieszkowski

Yeah, Artem suggested that, but I'm against such a simple approach because an 8-core Atom and an 8-core Ryzen are very different beasts :)

jimaek avatar Mar 21 '22 09:03 jimaek

It also outputs the clock speed and available resources:

[
  {
    model: 'Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz',
    speed: 4268,
    times: {
      user: 150659900,
      nice: 277600,
      sys: 65754400,
      idle: 487990800,
      irq: 0
    }
  },
  {
    model: 'Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz',
    speed: 4281,
    times: {
      user: 149682200,
      nice: 208500,
      sys: 64688700,
      idle: 27844500,
      irq: 0
    }
  }
]

Not sure how useful that would be. We'd probably need to run a set of benchmarks anyway and store this data internally so we can compare against these results if we were to skip them.

patrykcieszkowski avatar Mar 21 '22 09:03 patrykcieszkowski

Yeah, it's probably quite a big project and probably shouldn't be part of the MVP. I imagine a mix of benchmarks, data like the number of cores, and real-time CPU usage stats being sent to the API, which would then consider that data when routing tests to probes. For example, if a 64-core Xeon reports 99% CPU usage, we should probably stop sending it tests until the load drops, regardless of high benchmark scores.

jimaek avatar Mar 21 '22 09:03 jimaek

That's not quite what I suggested :) My suggestion was to keep it simple and start with CPU usage. We can collect the average load for the last minute (or maybe a shorter period) and use this info in routing. I don't see yet how the CPU core count can help us because, as @jimaek said, 6 Xeon cores are not the same as 6 Raspberry Pi cores.
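
For a first pass, Node's built-in os.loadavg() could provide this; a minimal sketch (normalizing by core count is just one option, not a settled decision):

const os = require('os')

// os.loadavg() returns the 1, 5, and 15 minute load averages.
// Dividing by the core count makes the value roughly comparable
// across machines (0 = idle, ~1 = fully saturated).
const [oneMinuteLoad] = os.loadavg()
const normalizedLoad = oneMinuteLoad / os.cpus().length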

What's more, probes currently don't utilize multiple cores, so that's something we should work on as well.

We used this lib in the past to collect system metrics, btw.

zarianec avatar Mar 21 '22 13:03 zarianec

What's more, probes currently don't utilize multiple cores, so that's something we should work on as well.

You mean the traceroute commands run in the same thread as the probe? Can you open a task to support all available cores then?

jimaek avatar Mar 21 '22 13:03 jimaek

Yes. I wanted to use throng for clustering, but that means every single probe would create multiple separate WS clients.

zarianec avatar Mar 21 '22 14:03 zarianec

https://nodejs.org/docs/latest-v12.x/api/worker_threads.html
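
A minimal sketch of that approach (the job shape here is made up, not the actual probe protocol):

const { Worker, isMainThread, parentPort } = require('worker_threads')

if (isMainThread) {
  // The main thread keeps the single WS connection and
  // dispatches measurement jobs to worker threads.
  const worker = new Worker(__filename)
  worker.on('message', (result) => console.log('result:', result))
  worker.postMessage({ type: 'ping', target: 'example.com' })
} else {
  parentPort.on('message', (job) => {
    // Run the actual measurement here, then report back.
    parentPort.postMessage({ job, ok: true })
  })
}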

patrykcieszkowski avatar Mar 21 '22 14:03 patrykcieszkowski

My suggestion was to keep it simple and start with CPU usage. We can collect the average load for the last minute (or maybe a shorter period) and use this info in routing.

Using just CPU usage will create issues, I think. First, how will that module work inside a container? Will it report the full system CPU usage or only the container's? How will it work with Docker's CPU limitation flags?

It could also result in tests being sent to an idle Raspberry Pi instead of a Xeon monster just because the Xeon is at 60% load.

jimaek avatar Mar 21 '22 14:03 jimaek

to keep it simple

That's the key here. Collecting CPU usage is easy, and using it for routing is easy as well. We need to gather more information and knowledge about how our real probes behave before trying to solve an abstract problem. A benchmark-based solution will still need CPU usage info, because even if it's a Xeon with sky-high benchmark results, you can't send measurements to it when its CPU usage is at 99%.

So I suggest starting with CPU usage and adding more complex solutions on top of it.
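
As a sketch of what the routing side could look like (the probe object shape is hypothetical):

// Pick a probe weighted by spare CPU capacity.
// probes: [{ id, cpuUsage }] where cpuUsage is 0..1 (hypothetical shape)
function pickProbe (probes) {
  const weights = probes.map((p) => Math.max(0.01, 1 - p.cpuUsage))
  const total = weights.reduce((a, b) => a + b, 0)
  let r = Math.random() * total
  for (let i = 0; i < probes.length; i++) {
    r -= weights[i]
    if (r <= 0) return probes[i]
  }
  return probes[probes.length - 1] // floating-point fallback
}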

zarianec avatar Mar 21 '22 14:03 zarianec

We can, but first we'll need answers to the questions above, because it's not so simple either.

The simplest approach, I think, would actually just be the number of cores, which has its own issues.

jimaek avatar Mar 21 '22 14:03 jimaek

Yeah, Docker complicates things a lot, because running a container with --cpus=1 won't affect the physical core count in any way (at least Node.js will still see all of them).

zarianec avatar Mar 21 '22 15:03 zarianec

I have a stupidly simple idea. We could brute-force as many requests as possible on the probe's initial run, until its CPU usage reaches a certain level, and store those values in a JSON file. The next time the probe reconnects, it just forwards this value to us.

patrykcieszkowski avatar Mar 21 '22 15:03 patrykcieszkowski

Doing actual network tests as a benchmark would create problems, like the endpoint banning the IP for flooding, or the ISP, or even the router doing so. Unless we use localhost as the target, but even that could trigger local firewall rules. So the benchmark needs to be representative enough while remaining light enough not to abuse the system.

Yeah, Docker complicates things a lot, because running a container with --cpus=1 won't affect the physical core count in any way (at least Node.js will still see all of them).

Yep, that's one of my main problems with this approach: too unreliable, and too many scenarios.

jimaek avatar Mar 21 '22 15:03 jimaek

So there are ways to detect the cores we have available:

$ docker run --rm ubuntu sh -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
0-5
$ docker run --rm --cpuset-cpus=0,1 ubuntu sh -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
0-1

And we can see the total host CPU usage, which I guess makes sense to use. We could also run a small CPU benchmark on startup to figure out the power of the CPU, something like the sketch below. But I'm not sure how to combine these parameters for best results.
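
For example, a tiny synthetic benchmark could look like this (purely illustrative; the workload and score units are arbitrary):

const crypto = require('crypto')

// Time a fixed amount of hashing work; a higher score means a faster CPU.
function cpuBenchmark () {
  const iterations = 200000
  const start = process.hrtime.bigint()
  let buf = Buffer.alloc(64)
  for (let i = 0; i < iterations; i++) {
    buf = crypto.createHash('sha256').update(buf).digest()
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  return Math.round(iterations / elapsedMs) // hashes per millisecond
}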

Example problems: 2 probes on the Isle of Man, and 100 users ask to run a traceroute from the island at the same time.

1. Probe A is a 4-core Raspberry Pi, and Probe B is a 4-core Xeon.

Core detection is useless here. I guess CPU usage could let us scale up the load slowly, but how slowly? Both have 0% usage; do we send 50 tests to each and just see what happens? Then we could overload Probe A and lose tests. A benchmark would work best here, letting us know beforehand that Probe B should probably handle 90 of the tests.

2. Probe A has a 4-core Xeon, and Probe B has a 64-core Xeon.

In theory, the majority of tests should be handled by Probe B. But what if Probe B is also at 90% CPU usage while Probe A is at 20%? What algorithm do we use to correctly distribute the tests?

3. Probe A has 128 low-powered cores, and Probe B has 4 powerful Xeon cores.

How do we compare them if both have 0% CPU usage?

So we need something simple enough, but also smart enough, to handle at least those 3 scenarios. Later we can expand the logic to cover more.

And this doesn't even take into account the network part, which I will open as a separate issue.

jimaek avatar Mar 21 '22 18:03 jimaek

I don't think the CPU is the decisive factor here. We have no idea what kind of network connection the probe has, and overloading it may not only result in timeouts, but also in poor quality of the successful measurements.

I'd consider an entirely different approach. Have a queue on each probe and limited concurrency; this will improve reliability if a certain probe gets overloaded, at the cost of delaying the results. Then report the number of measurements waiting in the queue to the API, and use that number for routing (alternatively, the queues could be maintained directly at the API). The HW doesn't really matter; what matters is whether the measurements are piling up.

This of course opens the question of how to set the queue size, but for that I'd suggest being very conservative: set a low limit that a Raspberry Pi can handle on a mediocre network. A more powerful device will process the requests a little faster and therefore get more of them. We won't reach full utilization on high-end devices this way, but that's IMO not the point; the point is getting reliable measurements with reasonable throughput.
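
A sketch of the probe-side piece (the concurrency and queue limits are made up):

// Fixed-concurrency measurement queue; jobs are functions returning promises.
class MeasurementQueue {
  constructor (concurrency = 2, maxPending = 50) {
    this.concurrency = concurrency
    this.maxPending = maxPending
    this.pending = []
    this.running = 0
  }

  // Returns false when full, so the API can route elsewhere.
  enqueue (job) {
    if (this.pending.length >= this.maxPending) return false
    this.pending.push(job)
    this.drain()
    return true
  }

  drain () {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const job = this.pending.shift()
      this.running++
      job().finally(() => { this.running--; this.drain() })
    }
  }

  // Reported to the API for routing decisions.
  get depth () { return this.pending.length + this.running }
}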

MartinKolarik avatar Jun 30 '22 10:06 MartinKolarik

I think this task is a better fit for what you're describing: https://github.com/jsdelivr/globalping/issues/52. But we can't delay the results; the real-time part is very important to the user experience.

Also, the network doesn't really matter in terms of how many tests the device can process. A 64-core Xeon can process a lot more tests on a 20 Mbit connection than a Raspberry Pi on 1 Gbit, assuming the network quality is equal. If not, then it's up to the other task to detect that.

So it's still a hardware limitation, and we need to think about CPU benchmarks.

jimaek avatar Jun 30 '22 10:06 jimaek

A 64-core Xeon can process a lot more tests on a 20 Mbit connection than a Raspberry Pi on 1 Gbit, assuming the network quality is equal.

But even the Raspberry Pi can probably process the amount we need. I suppose it depends on what number of requests per second per location you think we should optimize for, but the tests are fairly lightweight and there are many locations, so I don't think this matters much. What matters more, IMO, is stability during temporary spikes.

Queueing won't add delay under normal conditions. But if there are suddenly more requests, it's better to wait 10s for a start and then get a result than to see a start, then a timeout, and have to retry manually.

MartinKolarik avatar Jun 30 '22 10:06 MartinKolarik

To be clear, the goal of Globalping is to be an API that many different tools and services will use to implement their own functionality. This means getting hundreds of test requests per second, where each could be asking for 100 probes to run the test.

Imagine a CDN sponsoring us to get a big limit so they can run their own comparisons of all the CDN providers out there: thousands of tests per minute per CDN endpoint, all just for 1 user. And on top of that, thousands of free users using the API at the same time.

If we have 2 probes in Africa, 1 small and 1 big, and there are 1000 tests they need to run every single minute, how will your system help them handle the load?

It feels like your solution assumes this will remain a relatively low-traffic API where most probes sit doing nothing all day. Otherwise, the queues would fill immediately, low-end probes would most probably break, and the user would get a bunch of failed tests.

That's why I think we need to do pre-benchmarks and assign weights to probes, to make sure spikes and heavy usage won't even reach the probes that would most probably break.
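
A hypothetical way to combine the two signals (the formula and fields are a guess, not a tested design):

// Combine a startup benchmark score with live CPU usage into a
// routing weight. Both the formula and the fields are hypothetical.
function probeWeight (probe) {
  const spareCapacity = Math.max(0, 1 - probe.cpuUsage) // 0..1
  return probe.benchmarkScore * spareCapacity
}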

jimaek avatar Jun 30 '22 10:06 jimaek

Both of you are correct. Due to the nature of the project, we can't possibly predict how performant a probe is going to be, but at the same time we have to make the most of the probes we have access to. A fixed test count for all probes would be wasteful.

A queuing system isn't the worst idea, but it should be a backup option for when all the probes within the requested GEO are busy.

patrykcieszkowski avatar Jun 30 '22 20:06 patrykcieszkowski