
WASM Parser

Open jameskerr opened this issue 1 year ago • 7 comments

Update

This PR has been changed to rely on the "/compile" endpoint of the default lake that runs locally in Zui. It also changes the code so that all the information about a query is gathered once and stored in the Redux store. Components then do not need to re-parse the AST to get information about the current query.
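For context, here's a minimal sketch (not the actual Zui code) of what "gather once, store in Redux" could look like: a thunk posts the query text to the local lake's /compile endpoint, and the resulting AST (or error) is cached in a slice that components read from instead of re-parsing. The endpoint URL, request body shape, and slice/action names are all assumptions for illustration.

```ts
// Hypothetical sketch only: the /compile request shape, port, and slice names
// are assumptions, not the actual Zui implementation.
import {createAsyncThunk, createSlice} from "@reduxjs/toolkit";

type QueryInfo = {ast: unknown; error: string | null};

// Post the query text to the local lake's /compile endpoint once per edit.
export const compileQuery = createAsyncThunk(
  "queryInfo/compile",
  async (zedSource: string) => {
    const resp = await fetch("http://localhost:9867/compile", {
      method: "POST",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify({query: zedSource}), // assumed request shape
    });
    if (!resp.ok) throw new Error(await resp.text());
    return (await resp.json()) as unknown; // the parsed AST
  }
);

// Cache the result so components read the store instead of re-parsing.
const queryInfoSlice = createSlice({
  name: "queryInfo",
  initialState: {ast: null, error: null} as QueryInfo,
  reducers: {},
  extraReducers: (builder) => {
    builder
      .addCase(compileQuery.fulfilled, (state, action) => {
        state.ast = action.payload;
        state.error = null;
      })
      .addCase(compileQuery.rejected, (state, action) => {
        state.ast = null;
        state.error = action.error.message ?? "parse error";
      });
  },
});

export default queryInfoSlice.reducer;
```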

The updates to the Zed Wasm package are still included in this PR, but zed-wasm is not used by Zui.

As a side effect, I've added a more robust mechanism for the zui-player tests to wait for the correct set of results after issuing a query. Many of our flaky integration tests were due to this problem.

Original

In an attempt to do away with having two Zed syntax parsers, we've exported the parse method from go Zed in zed-wasm.

To summarize the changes:

  1. Simplify Zed Wasm Repo
  • The source code is written as ES modules, which can be run directly with Node; no build step is needed.
  • The package.json type is set to module.
  • We build only a CommonJS version and a browser version of the package for Electron, Jest, and the rest of the npm ecosystem that still chokes on Node ESM.
  • We build a browser bundle that can be used on the web.
  • We expose the main.wasm file as an export for clients to fetch and initialize on their platform. For example, in the browser we fetch(main.wasm), but in Node we readFile(main.wasm) (see the sketch below).
  • We return the {zq, parse} functions from the createInterface function, which is called by both entry files.
  • We run Playwright tests for the browser build.
  • We run Node tests for the ESM source.
  • We run Jest tests for the CJS build.

I do not expose the ESM versions of the files because I can't figure out how to get Node to use the CJS version when both are exposed.
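To make the intended usage concrete, here's a rough sketch of how a client might initialize the package on each platform, under the assumption that createInterface accepts the raw wasm bytes and returns {zq, parse}; the exact export names and signature are guesses based on the description above, not a documented API.

```ts
// Hypothetical sketch: export names, the main.wasm export path, and the
// createInterface signature are assumptions based on the description above.
import {readFile} from "node:fs/promises";
import {createRequire} from "node:module";

const require = createRequire(import.meta.url);

// In Node, read the wasm binary the package exposes as an export.
export async function initInNode() {
  const wasmPath = require.resolve("@brimdata/zed-wasm/dist/main.wasm");
  const bytes = await readFile(wasmPath);
  const {createInterface} = await import("@brimdata/zed-wasm");
  const {zq, parse} = await createInterface(bytes);
  return {zq, parse};
}

// In the browser, the same binary is fetched over HTTP instead.
export async function initInBrowser(wasmURL: string) {
  const resp = await fetch(wasmURL);
  const bytes = new Uint8Array(await resp.arrayBuffer());
  const {createInterface} = await import("@brimdata/zed-wasm");
  return createInterface(bytes);
}
```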

  2. Usage in the App
  • In Zui, there is a new protocol handler that responds to requests with an "app-asset://" URL scheme. We can then look at the host and the path and respond with appropriate files. I use this to serve the main.wasm file from the node_modules directory (see the sketch after this list).

  • I mark the zed-wasm package as "external" so that it remains on the file system in a predictable place after the esbuild step bundles the code.

  • I mark zed-wasm as a production dependency so that it gets included when we build the app.

fetch("app-asset://node_modules/@brimdata/zed-wasm/dist/main.wasm")
  • We store the zedWasm instance in a global variable on the page.
  • In the Jest tests, we initialize a zedWasm global in the before-env-setup file.
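For reference, here's a minimal sketch of what such an "app-asset://" handler might look like in the main process, assuming an Electron version that provides protocol.handle (25+); the path resolution and 404 handling are illustrative rather than the actual Zui implementation.

```ts
// Hypothetical sketch, assuming Electron 25+ (protocol.handle / net.fetch).
// The scheme may also need to be registered as privileged for renderer
// fetch() calls to work; that detail is omitted here.
import {app, net, protocol} from "electron";
import {join} from "node:path";
import {pathToFileURL} from "node:url";

app.whenReady().then(() => {
  protocol.handle("app-asset", (request) => {
    const {host, pathname} = new URL(request.url);
    if (host === "node_modules") {
      // e.g. app-asset://node_modules/@brimdata/zed-wasm/dist/main.wasm
      const file = join(app.getAppPath(), "node_modules", pathname);
      return net.fetch(pathToFileURL(file).toString());
    }
    return new Response("Not found", {status: 404});
  });
});
```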

Concerns

For every test file in Jest, we fetch the main.wasm file and initialize it, which is pretty slow. It's also slow to initialize it in every browser window. And we have to keep a global variable around everywhere the app needs to run (tests, browser).

I have to wonder: could it be easier to expose the parser in the zq CLI tool? Then the main process could parse a query quickly and return the result over IPC. It would not have any startup cost and it would run quickly in tests.
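If we went that route, the IPC plumbing might look roughly like the sketch below; the channel name and the parseZed wrapper (which would call whatever parser the zq tooling ends up exposing) are made up for illustration.

```ts
// Hypothetical sketch of main-process parsing behind IPC. The "zed:parse"
// channel and parseZed helper are illustrative, not existing Zui code.
import {ipcMain} from "electron";
import {parseZed} from "./parse-zed"; // hypothetical wrapper around the parser

export function registerParseHandler() {
  ipcMain.handle("zed:parse", async (_event, source: string) => {
    try {
      return {ast: await parseZed(source), error: null};
    } catch (err) {
      return {ast: null, error: String(err)};
    }
  });
}

// A renderer would then call something like:
//   const {ast, error} = await ipcRenderer.invoke("zed:parse", source);
```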

This work is still good though, because we will create a playground at some point (this year I would hope) and this will certainly be used.

jameskerr avatar Jan 17 '24 00:01 jameskerr

@jameskerr I think we should back out the changes to models/Program then merge this.

mattnibs avatar Jan 23 '24 16:01 mattnibs

I've updated the display of parsing errors.

[screenshot: CleanShot 2024-02-26 at 11 01 05@2x]

jameskerr avatar Feb 26 '24 19:02 jameskerr

@jameskerr helpfully did a quick Zoom to walk me through some of the changes and answer my questions about what benefits a user might see from this PR vs. where it's laying foundations for other improvements we may add in the future.

To give some additional color around @jameskerr's last comment about the updated display of parsing errors, by comparison here's how the same error is currently presented in Zui Insiders 1.6.1-13. So what's in this PR is indeed a vast improvement.

[screenshot]

@jameskerr: While experiencing the improved error in this branch at commit 3be329a, I did spot a small glitch that's probably worth looking into: That "Fetching Total Rows..." informational message shows up and never goes away.

[screenshot]


Re-testing this branch again now at commit d331ab6, I can see that "small glitch" has been addressed. Now when I trigger the same syntax error, the "Fetching Total Rows..." informational message briefly appears, but then immediately disappears.

https://github.com/brimdata/zui/assets/5934157/d233fe76-7fcb-48cf-8d1b-9ca57f5ad10f

philrz avatar Feb 26 '24 19:02 philrz

Here's another example from testing this branch at commit 3be329a. Compare this screenshot to the one in the opening text of #2576. Once again, this is a vast improvement since it shows the user where the error occurred rather than giving a numeric reference to line & column and making the user go find that.

[screenshot]

However, in the "foundations for the future" category, this doesn't bring us quite all the way to what's envisioned in #2576. If the syntax error had been thousands of characters "rightward" in the big line of text, as shown in #2576, it would still be quite tedious for the user to scroll right to find the === ^ === in the error message. But what's in this PR is definitely still a solid incremental improvement before we make time to give #2576 the full treatment.


Update: In a group discussion there was consensus that, while this PR should indeed build the foundations needed to do what's envisioned in #2576, we're not going to try to bite off that effort in this PR since the changes are already so vast.

philrz avatar Feb 26 '24 19:02 philrz

@jameskerr re @philrz's last comment: instead of using the error message field returned by the /compile endpoint, I was thinking we could use the info field from the response error message, e.g.: "info":{"parse_error_offset":6}

It tells you where in the source the parse error occurred, and we could pass that to Monaco and have it render the error message natively.
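For illustration, here's a sketch of how that offset could be turned into a native Monaco marker; the parse_error_offset field is taken from the example above, while the marker owner name is arbitrary.

```ts
// Sketch: map a character offset from the /compile error response onto a
// Monaco marker so the editor underlines the error natively.
import * as monaco from "monaco-editor";

export function markParseError(
  model: monaco.editor.ITextModel,
  parseErrorOffset: number,
  message: string
) {
  // Convert the flat character offset into a line/column position.
  const start = model.getPositionAt(parseErrorOffset);
  const end = model.getPositionAt(parseErrorOffset + 1);
  monaco.editor.setModelMarkers(model, "zed-parser", [
    {
      severity: monaco.MarkerSeverity.Error,
      message,
      startLineNumber: start.lineNumber,
      startColumn: start.column,
      endLineNumber: end.lineNumber,
      endColumn: end.column,
    },
  ]);
}
```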

mattnibs avatar Feb 26 '24 22:02 mattnibs

When going over this PR on a Zoom earlier with @jameskerr I drilled down a bit on how this branch currently leverages:

...the "/compile" endpoint from the default lake which runs locally in Zui.

An ongoing concern I've had for some time is that as more users start to use remote Zed lakes, they may encounter difficult-to-debug problems when their local Zui and the remote Zed service have some kind of language/AST incompatibilities. I'm sure there are multiple ways we'll go about addressing this over time (e.g., I assume stricter API versioning could help here), but I've always got my eyes open for ways we can improve incrementally in the meantime.

As @jameskerr confirmed, up until this change Zui has always depended on the local Zed parser that shipped with the app, so relying on the /compile endpoint of the Zed that shipped with the app would not be making things worse. However, it seems like we have an opportunity now to use the /compile endpoint of a remote lake when one is in use, since this has the potential to catch some of the difficult-to-debug problems I cited above. In brainstorming about possible downsides to doing this, the increased latency of calling /compile over the network (which might be a WAN) was the one we could think of, so I set out to quantify this a bit.

I set up an EC2 instance in AWS us-east-2, started a Zed lake service on it, and connected to it from my home desktop in SF. I made a branch based on the one in this PR to which I added a patch @jameskerr provided to call out to /compile on whatever lake the user is currently connected to (diffs), as well as a timer to log how long the parsing step takes. After loading some sample data into both the local and remote lakes, I set out to test the experience of parsing some simple Zed (just count()) and some sophisticated Zed (the attached 164-line shaper sample.zed.gz).

The summary of the findings:

Lake location | count() | sample.zed
Local         | ~4 ms   | ~75 ms
Remote        | ~150 ms | ~300 ms

The video shows the user experience in terms of visible delay in the app:

https://github.com/brimdata/zui/assets/5934157/22b6d2cd-388a-4241-abf4-1f8d0eca8a05

My conclusion: I think the change is worth making. Yes, speed is important, but I'd argue that correctness in catching/reporting errors is even more important, especially as we start adding more features over time that actually might change the contents of a pool. In terms of how I imagine remote lakes will be used, I'm guessing what I'm doing here with accessing a lake on the other side of the country will be less common, and users that do it will likely become accustomed to other unavoidable delays. I suspect more remote lakes will be "local" in nature, e.g., like our community zync users that I think are all in the same geographic region. Finally, if the observed latency gets to be a problem I could picture fancy optimizations we could do, e.g., have Zui compare what's returned by the /version endpoint in both the local and remote lakes and if they match it could use the local /compile for parsing (though I'm not advocating doing any of that right now).
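To make that last idea concrete (again, not advocating we do it now), a sketch of the version-comparison fallback might look like the following; the /version response shape here is an assumption.

```ts
// Hypothetical sketch of the "/version comparison" optimization mentioned
// above. The response shape ({version: string}) is an assumption.
async function lakeVersion(baseURL: string): Promise<string> {
  const resp = await fetch(`${baseURL}/version`);
  const body = (await resp.json()) as {version: string};
  return body.version;
}

export async function pickCompileURL(localURL: string, remoteURL: string) {
  try {
    const [local, remote] = await Promise.all([
      lakeVersion(localURL),
      lakeVersion(remoteURL),
    ]);
    // Matching versions imply matching parsers, so the cheap local /compile
    // is safe; otherwise pay the network cost to match the remote lake.
    return local === remote ? `${localURL}/compile` : `${remoteURL}/compile`;
  } catch {
    // If a version check fails, be conservative and use the remote lake.
    return `${remoteURL}/compile`;
  }
}
```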

@mattnibs: Do you have any opinions to add here?

philrz avatar Feb 26 '24 23:02 philrz

I am fine with using the current lake for the /compile endpoint. I can change the order of some of these operations to improve the speed if I know we are going to lose between 150 and 300 milliseconds on parsing.

Also, would it be possible (or does this already happen?) to return the parsing error message and location information from the /query endpoint when you submit a bad query?

I could also parse the query asynchronously to improve speed. The only reason for parsing first is to prevent adding an invalid query entry to the history stack. However, I could be hopeful that it will parse and, if it does not, remove the entry from the history stack.
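In other words, something like the following (the history and compile helpers are placeholders, not the actual Zui code):

```ts
// Hypothetical sketch of the optimistic approach: push to history right away,
// parse asynchronously, and remove the entry again if parsing fails.
type HistoryStack = {push: (q: string) => void; remove: (q: string) => void};

export async function submitQuery(
  source: string,
  history: HistoryStack,
  compile: (src: string) => Promise<void> // rejects on a parse error
) {
  history.push(source); // be hopeful that it will parse
  try {
    await compile(source);
  } catch (err) {
    history.remove(source); // it didn't, so undo the optimistic push
    throw err;
  }
}
```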

jameskerr avatar Feb 26 '24 23:02 jameskerr

I did some thorough testing to quantify the benefits from this opening statement above:

I've added a more robust mechanism for the zui-player tests to wait for the correct set of results after issuing a query. Many of our flaky integration tests were due to this problem.

tl;dr

My results indicate that these benefits are indeed observable. This should allow us to bring some of our e2e tests back into the mix in CI. A detour during that work also led to a better understanding of separate issue #3021.

Details

The changes in #2954 included no longer running the tests in pool-loads.spec.ts in CI because the tests often failed there but consistently succeeded locally. One of the ways it was known to fail is described in #2843, and much of that issue text is speculation about the root cause being a timing issue related to not waiting long enough for results to appear after issuing a query, i.e., exactly the improvement @jameskerr speaks to above. Therefore I set out to see whether I could observe an improvement in the reliability of the pool-loads.spec.ts tests in CI.

Right out of the gate I became distracted, because pool-loads.spec.ts includes the "bad data displays an error message" test that has nothing to do with checking query results but, as shown in #2954, has failed in CI more than any other. I therefore took a detour to study that, which led to opening #3021. Now that that's understood, I've excluded that particular test from all study of CI improvements from this PR, i.e., I'm hoping just to observe improvements in the symptoms shown in #2843.

First I did a baseline set of runs in https://github.com/brimdata/zui/actions/runs/8087230620, which was based on the recent tip of main, removed the test from #3021, and ran pool-loads.spec.ts in a loop. On the Ubuntu runner this chalked up 7 failures out of 709 runs before the job got canceled (presumably for running too long), and all of them looked like the #2843 symptom.

Since #2843 speculates that waiting longer for results to appear might make the problem go away, I did another set of runs in https://github.com/brimdata/zui/actions/runs/8087202682 that's just like the prior set except that I added a 5-second sleep timer to each test before checking for the results output. Indeed this made it through 513 runs before the job got canceled and saw only one failure, which had a slightly different look than the #2843 symptom. Perhaps there was some intermittent Actions infrastructure problem that made the runner really, really unhealthy? Not sure.

> nx run zui-player:test -g pool-loads.spec.ts

Running 2 tests using 1 worker
TF
  1) pool-loads.spec.ts:17:7 › Pool Loads › load data into a pool ──────────────────────────────────
    Test timeout of 30000ms exceeded.
    Error: locator.waitFor: Application exited
    Call log:
      - waiting for locator('.zed-table__cell') to be visible
       at ../helpers/test-app.ts:120
      118 |   async getViewerResults(includeHeaders = true): Promise<string[]> {
      119 |     const fields = await this.mainWin.locator('.zed-table__cell');
    > 120 |     await fields.waitFor();
          |                  ^
      121 |     let results = await fields.evaluateAll<string[], HTMLElement>((nodes) =>
      122 |       nodes.map((n) => n.innerText.trim())
      123 |     );
        at TestApp.getViewerResults (/home/runner/work/zui/zui/packages/zui-player/helpers/test-app.ts:120:18)
        at /home/runner/work/zui/zui/packages/zui-player/tests/pool-loads.spec.ts:24:21
  2) pool-loads.spec.ts:28:7 › Pool Loads › load more data into the pool ───────────────────────────
    Error: expect(received).toEqual(expected) // deep equality
    - Expected  - 1
    + Received  + 1
      Array [
        "this",
    -   "2",
    +   "1",
      ]
      39 |     await app.sleep(5);
      40 |     const results = await app.getViewerResults();
    > 41 |     expect(results).toEqual(['this', '2']);
         |                     ^
      42 |   });
      43 | });
      44 |
        at /home/runner/work/zui/zui/packages/zui-player/tests/pool-loads.spec.ts:41:21
  Slow test file: pool-loads.spec.ts (39.2s)
  Consider splitting slow test files to speed up parallel execution
  2 failed
    pool-loads.spec.ts:17:7 › Pool Loads › load data into a pool ───────────────────────────────────
    pool-loads.spec.ts:28:7 › Pool Loads › load more data into the pool ────────────────────────────

Then for the big money, I did the set of runs in https://github.com/brimdata/zui/actions/runs/8087979145, which loops the test starting from the branch in this PR, with no sleep timers. That made it through 655 runs on Ubuntu in CI and saw zero failures.

Conclusion

Since these failure rates are already small and might be prone to repro flare-ups based on how Actions is behaving on any given day, they're always a little tricky to study, but the tests do seem to show the desired improvement. I'll mark #2843 as fixed by this PR. If #3021 is still not close to a fix by the time this merges, I may look at breaking off that remaining test into its own file so we can start running the other now-reliable pool-loads tests in CI. It also looks like I could then look at revisiting issues #2026 and #2504 since they both depend on being able to reliably check query results.

philrz avatar Feb 29 '24 19:02 philrz

In the tests I've performed on this branch thus far, I did discuss one finding offline with @jameskerr that I'll summarize here as a formal request before final PR approval. The following tradeoff was disclosed in the opening text:

A query is always pushed into history, even if there is an error in the query text.

Once I had a chance to test drive this, it concerned me. I find syntax errors in Zed are pretty easy to make, especially as I'm building up more sophisticated pipelines, creating user-defined functions/ops, etc. Since Zed will be a new language for our user base, I expect this will be even more true for them as they're learning. Therefore if the History panel gets polluted with lots of error-filled iterations, I'm fearful that it would discourage users from leveraging it.

@jameskerr indicated that he could commit a change here so that the History would not be populated with programs that failed to parse. He also spoke of the caveat that the bad-syntax programs would still be in the history walked by hitting the Back button, but we agreed this is probably ok: The Back button can be rationalized as "go back to the last thing I did", so that being something with a syntax error seems defensibly ok.


Re-testing this branch now at commit 58d7751, I can see this has been enhanced as described above. As shown in the attached video, now my programs that don't parse are left out of the History, but they still can be seen as I click the Back button.

https://github.com/brimdata/zui/assets/5934157/dc6d9b23-00b4-44c1-a3b8-7d218a6f63ce

philrz avatar Mar 19 '24 21:03 philrz

I'm currently testing this branch at commit d331ab6.

While performing smoke tests, I happened to notice the border around this Histogram Settings pop-up is gone. I've confirmed that this problem actually showed up with the changes in #2985 and we just didn't happen to catch it then. But since we're already changing a lot in this branch, it might be worth including the fix for that.

[screenshot]

The same is true when creating the Time Range Pin.

[screenshot]


Re-testing this branch now at commit 58d7751, I can see these borders have been fixed.

[screenshots]

philrz avatar Mar 19 '24 21:03 philrz

I'm currently testing this branch at commit d331ab6.

This is another issue that was actually introduced with the changes from #2985: the contents of the shaper editor are now persisted, whereas before they were cleared out. I was a fan of the "cleared out" behavior because my different data loads may not have much in common, so having to clear out the shaper editor constantly would get annoying. For re-usable shapers I'm holding out for #2868.

https://github.com/brimdata/zui/assets/5934157/47ac964f-d1a2-4bb6-815f-cb96913f19d2


Re-testing this branch at commit 58d7751, I can see this has been addressed. As shown in the attached video, the shaper editor now starts out clear when we Preview new data.

https://github.com/brimdata/zui/assets/5934157/1fbb0450-b103-41e5-8677-169f246b4ba9

philrz avatar Mar 19 '24 22:03 philrz