0x-mesh icon indicating copy to clipboard operation
0x-mesh copied to clipboard

Potential high CPU usage causing stuttering UI

Open albrow opened this issue 4 years ago • 14 comments

Context

Augur has reported this issue on Discord. They are seeing UI stutters which they believe may be due to Mesh taking up too much CPU resources.

  • Are you running Mesh in the browser or as a standalone server? Are running Mesh inside of Docker or directly running the binary?

In the browser. Specifically Chrome. @pgebheim Do you know if this affects more browsers?

  • What version of Mesh are you running? Be as specific as possible (e.g., 8.0.1 instead of latest or 8).

@pgebheim can you answer this?

Expected Behavior

After an initial spike while compiling the Wasm, Mesh should not be using too much CPU.

Current Behavior

Mesh is apparently taking up a lot of CPU time and causing the UI to stutter.

Failure Information (for bugs)

See below regarding logs.

Steps to Reproduce

@pgebheim can you answer this? You mentioned that you are using a script to create Kovan orders? Can you send that our way so we can reproduce exactly what you are seeing?

Failure Logs

Logs and profiler output from Chrome have been sent to myself and @jalextowle but are too big for GitHub. Please reach out to me if you need them.

albrow avatar May 05 '20 00:05 albrow

@pgebheim do you have any other WebAssembly running at the same time as Mesh? I noticed a big discrepancy in the profile data that you sent me compared to running the example in the Mesh repo.


Here's our example. This is a profile captured over a time period of 30 seconds after Mesh has already warmed up (it was running for about 60 seconds prior to taking the capture). I used the Task Manager feature of Chrome to keep an eye on the CPU usage. It typically stayed around 7-10% with occasional spikes to 25% when there is a new batch of incoming orders. (Update: I did see it spike to 60% after letting it run for a lot longer. It sunk back down to 5% after about 1 second, which I believe is the polling rate of the Task Manager). This is on mainnet with around 2k-4k orders.

Screen Shot 2020-05-04 at 7 51 56 PM

Two things to note:

  1. Our example does not have a UI, so I can't easily confirm whether this would cause stutters. However, based on past experience, I would not expect the CPU usage that we're seeing to cause any issues.
  2. Our example does not create any orders in the browser. It is possible that this creates more CPU demand. I would love to see whatever scripts you are using to create orders on Kovan so that we could confirm.

Here is the profile data that you sent me:

Screen Shot 2020-05-04 at 7 51 48 PM

While there are a lot of similarities, the top entry when sorted by "Self Time" is something called "wasm-unnamed". I have not seen this before when looking at Mesh profiles, and I think it may be unrelated to Mesh. Is that possible?

albrow avatar May 05 '20 03:05 albrow

I captured my own performance profile by following the instructions from https://github.com/0xProject/0x-mesh/issues/803 and building and running Augur locally. I waited a little while for the application to warm up and then captured 30 seconds. Interestingly, I don't see anything called "wasm-unnamed". I wonder if this is just a difference in Chrome versions or Chrome settings? Here's what it looks like for me:

Screen Shot 2020-05-05 at 1 48 49 PM

Note that the top entry is something called "Function Call" which unfortunately doesn't tell us much. Digging into the flame graph gives a better picture of what is going on:

Screen Shot 2020-05-05 at 1 50 51 PM

The red marks are pointing out function calls that took longer than usual. Looks like there is one every few seconds. Let's zoom into one of these calls:

Screen Shot 2020-05-05 at 1 53 22 PM

I highlighted in red the portion of CPU time that is coming from Mesh. It looks like most of the time comes from mesh.getStatsAsync(). This takes up about 25% of the total CPU time for this particular "Function Call".

Here's the snippet of code that I believe corresponds this particular "Function Call":

  private notifyNewBlockEvent = async (blockNumber: number, logs: ParsedLog[]): Promise<void> => {
    let lowestBlock = await (await this
      .db).syncStatus.getLowestSyncingBlockForAllDBs();

    if (lowestBlock === -1) {
      lowestBlock = blockNumber;
    }

    const blocksBehindCurrent = blockNumber - lowestBlock;
    const percentSynced = ((lowestBlock / blockNumber) * 100).toFixed(4);

    const timestamp = await this.augur.getTimestamp();

    let stats: ZeroXStats = {peers: 0, orders: 0};
    if(this.augur.zeroX) {
      stats = await this.augur.zeroX.getStats();
    }

    this.augur.events.emit(SubscriptionEventName.NewBlock, {
      eventName: SubscriptionEventName.NewBlock,
      highestAvailableBlockNumber: blockNumber,
      lastSyncedBlockNumber: lowestBlock,
      blocksBehindCurrent,
      percentSynced,
      timestamp: timestamp.toNumber(),
      logs,
      ...stats
    });
  };

Digging into the code further it looks like mesh.getStatsAsync() is relatively expensive in terms of CPU cost due to inefficiencies in our current database implementation in the browser. This will be hopefully addressed after our big database refactor.

However, as far as I can tell, this is only taking up about ~25% of the CPU time associated with what the Chrome profiler calls "Function Call". If you want to reduce the stress on the CPU, based on these results the first place to look should be the rest of the notifyNewBlockEvent function, outside of what is going on in mesh.getStatsAsync().

If you do need to reduce this ~25% CPU time that Mesh is using without waiting for our database upgrade, we have a few options for temporary workarounds. The most straightforward would be simply not calling mesh.getStatsAsync() for every new block. If that's not an option, we may be able to create a new function or add some new options so that you can get only the information you need in a more efficient way. Let me know what you want to do.

albrow avatar May 05 '20 21:05 albrow

Note that I am not trying to measure CPU time on startup or during the warmup period while Mesh is getting orders for the first time. The reason is that there are a lot of variables that come into play (how good your browser is at compiling WebAssembly, how many orders are in the network, what's in the cache, etc.). I also think https://github.com/0xProject/0x-mesh/issues/803 is a confounding factor and could potentially be affecting the results. We can take a closer look at CPU time during startup after https://github.com/0xProject/0x-mesh/issues/803 is addressed.

albrow avatar May 05 '20 21:05 albrow

Oh, we can remove the call to getStatsAsync() right now because we ended up using a courser grained method of determining the state of the orderbook to display in the UI. I'll remove this now and see what sorta results we get.

pgebheim avatar May 05 '20 21:05 pgebheim

@pgebheim I removed the call to mesh.getStatsAsync() from inside of notifyNewBlockEvent on my end. Here's what the zoomed in flame graph looks like:

Screen Shot 2020-05-05 at 3 18 08 PM

It's very similar to before except that the part of the CPU time that Mesh was responsible for is gone. I think you should take a look at what else is going on inside of notifyNewBlockEvent and whether it could be optimized.

albrow avatar May 05 '20 22:05 albrow

Just updating this issue for the sake of consistency. While it is true that under the conditions described above Mesh is not responsible for a significant portion of the CPU time (after removing the unnecessary call mesh.getStats()), we have also shifted our attention to different conditions for profiling. During the initial start and warm up time, Mesh is taking up a significant portion of CPU time, mostly due inefficiencies in the database and the ordersync protocol. We are working on determining which hot functions make the most sense to optimize and then will be implementing those optimizations one by one. To give you a rough idea of the current top culprits:

  • Inefficient database operations across the board
  • Converting between Go and JavaScript types
  • Regex matching for custom order filters
  • Keccak-256 hashing

@jalextowle please feel free to update as needed.

albrow avatar May 08 '20 21:05 albrow

@pgebheim The main focus right now has been on optimizing the current JSON schema validation. This is how order filters are used to validate orders, and it's become apparent that our current JSON schema validation logic is quite slow when it is executed in the browser. I'm currently working on attempting to use a JSON schema validator library called ajv (https://ajv.js.org/) in the WebAssembly version of Mesh instead of gojsonschema. I'll have a better idea of whether or not this will be a performance improvement later today, and I will update with progress then.

Another optimization that I have worked on is attempting to release the main thread using go-routines. This seems to have been a dead end, but we are communicating with the Go team in order to see if there is a way to accomplish this goal in Go that has been compiled to WebAssembly. I'm interested to hear how things are working when you are running the browser node in a Web Worker.

jalextowle avatar May 15 '20 20:05 jalextowle

@jalextowle -- Thanks for the update. We're working on getting an initial cut of the PR in. According to the PR author this makes the UI not freeze up at all -- though the syncing is still rather slow.

pgebheim avatar May 15 '20 22:05 pgebheim

Any news?

loca555 avatar Jun 01 '20 17:06 loca555

@loca555 We have recently released version 9.4.0 of the Mesh network. This release includes a performance optimization that reduces the CPU usage of one hot spot that we identified while digging into this issue.

We are now working on several more optimizations, and I'll post any progress that we make here.

jalextowle avatar Jun 02 '20 01:06 jalextowle

See https://github.com/0xProject/0x-mesh/pull/820 for another optimization.

albrow avatar Jun 08 '20 21:06 albrow

https://github.com/0xProject/0x-mesh/pull/840 includes an optimization for FindMiniHeaders.

albrow avatar Jun 17 '20 01:06 albrow

https://github.com/0xProject/0x-mesh/pull/846 includes an optimization for computeOptimalChunkSizes.

albrow avatar Jun 22 '20 19:06 albrow

After a few rounds of optimization, we have greatly reduced CPU usage in the browser. In Chrome, we are seeing that Mesh does not hold onto the main thread for longer than 100ms at a time in most circumstances. In addition, Augur is now using Mesh in a Web Worker which eliminates stuttering altogether. There is still room for further CPU optimization if needed, but I think at this point we have solved the immediate problem of UI stuttering.

@pgebheim is it okay if we close this?

albrow avatar Aug 11 '20 00:08 albrow