mosaic icon indicating copy to clipboard operation
mosaic copied to clipboard

feat: parallel queries

Open domoritz opened this issue 1 year ago • 1 comments

Updated version of #438

Rather than sending requests in series, send them in parallel but serialize the order before returning them from the query manager.

domoritz avatar Aug 02 '24 20:08 domoritz

Some quick experiments on localhost.

Before

node: Screenshot 2024-08-05 at 10 42 07

python:

Screenshot 2024-08-05 at 10 46 02

After

node: Screenshot 2024-08-05 at 10 40 35

python:

Screenshot 2024-08-05 at 10 45 31

domoritz avatar Aug 05 '24 14:08 domoritz

I didn't do an in-depth analysis but in some quick comparisons it looks like query results are resolved more quickly. The network manager also shows queries running in parallel.

Here is an example with 2G network simulation in Firefox

Before: https://github.com/user-attachments/assets/bad87032-e756-49b0-8171-5ad3df0c0aac

After: https://github.com/user-attachments/assets/9fc46638-c536-4e56-aaec-ef6ad0a374d9

domoritz avatar Aug 14 '24 01:08 domoritz

One last question: did you try testing this against all of our current backends? (WASM, Node.js, Python, Rust)

One of the reasons I first ensured serialized queries was due to Node.js DuckDB failures. I believe it was due to an interaction at the time with the Arrow extension.

jheer avatar Aug 22 '24 14:08 jheer

I tested the (at least casually) the flight 200k and 10m examples on

  • node sever socket and rest
  • python server socket and rest
  • wasm

There is a problem when I brush on one chart and move the cursor over another chart because mosaic then tries to run the query to build the index and query the index at the same time (and the latter fails since the index is not there yet).

I guess this problem (index not ready yet) can also happen if the index creation is slow and you start interacting without the index being there. I think the solution here is to always prioritize exec queries and don't send any queries until the exec is done. I'll implement that unless you have another suggestion. We may still want to prevent activation of chart while the mouse is down. At some point it might be nice to have an actual dependency model for queries but that's a lot more invasive.

domoritz avatar Aug 22 '24 14:08 domoritz

I guess this problem (index not ready yet) can also happen if the index creation is slow and you start interacting without the index being there. I think the solution here is to always prioritize exec queries and don't send any queries until the exec is done. I'll implement that unless you have another suggestion. We may still want to prevent activation of chart while the mouse is down. At some point it might be nice to have an actual dependency model for queries but that's a lot more invasive.

I think blocking on exec queries probably makes sense. There are other specs with derived data sets on load that also assume serialization -- though in some cases multiple exec queries may be issued as a single DB request, I don't think this is always the case.

I agree that a full dependency model is a bigger lift. That said, checking (and if necessary, blocking) on data cube table construction within the data cube indexer might not be so bad as a stop-gap.

Either way, we should make sure we have accompanying unit tests for these cases.

jheer avatar Aug 22 '24 14:08 jheer

Added logic to prevent multiple parallel exec requests and also added a unit test.

domoritz avatar Aug 31 '24 18:08 domoritz

Updated the code to address your comments.

However, the socket connector includes its own blocking, forcing sequential query execution. We need to revisit that.

I tested with the ws connector and it works fine. We are unnecessarily trying to parallelize queries that are then executed in series by the connector, yes. I could add a flag to not parallelize in the query manager but that introduces an extra code path so I'd be in favor of merging as is.

domoritz avatar Sep 29 '24 21:09 domoritz

I tested with the ws connector and it works fine. We are unnecessarily trying to parallelize queries that are then executed in series by the connector, yes. I could add a flag to not parallelize in the query manager but that introduces an extra code path so I'd be in favor of merging as is.

I agree that we don't need another flag. My point was just to note that we need to update the socket connector to actually get benefits of parallel queries there.

jheer avatar Sep 29 '24 22:09 jheer

Yes, that would be nice. I don't see an obvious way right now to do that, though. When we get a result on the socket, we just send binary so we can't easily change the order of responses. We would need to add some control signal that's sent before a result to send query results out of order. Should we do this in this pull request or a follow up (I can file an issue for it)?

domoritz avatar Sep 30 '24 02:09 domoritz

I filed https://github.com/uwdata/mosaic/issues/547 as a follow up.

domoritz avatar Oct 03 '24 02:10 domoritz