JavaScript Analysis Regression in CLI v2.11.4
Hello! I have been investigating an issue where our JavaScript analysis can take hours, or often fails to complete entirely. We are also seeing large database sizes and high memory usage. Unfortunately I don't have an isolated test case you can reference yet. I could use any guidance you have to help me identify what JavaScript file / code is causing this behavior so that I can help build that test case for you.
In my investigation I found that the resource consumption is specific to the GlobalAccessPaths module. When this module runs, the first iterations are quite quick, but by iteration 19 resource consumption becomes excessive on my laptop.
This is reproducible using the .../qlpacks/codeql/javascript-queries/0.4.4/ pack, but the older .../qlpacks/codeql/javascript-queries/0.4.3/ pack completes in a reasonable amount of time.
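For concreteness, the same comparison can also be driven with pinned pack references rather than local qlpack directories; a rough sketch (the database name here is a placeholder, and I'm assuming the pack@version specifier supported by recent CLI versions):

# Analyze the same database with each pinned version of the JavaScript query pack.
# "js-db" is a placeholder for the CodeQL database directory.
codeql database analyze js-db codeql/javascript-queries@0.4.3 \
  --download --format=sarif-latest --output=results-0.4.3.sarif
codeql database analyze js-db codeql/javascript-queries@0.4.4 \
  --download --format=sarif-latest --output=results-0.4.4.sarif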
Are you aware of any change that might cause resource consumption increases in CLI v2.11.4?
How can I debug this further to understand what code is causing this behavior? I would like to provide a simple reproduction if possible, but currently the behavior is still really opaque to me.
Thank you!
Thank you for your questions! 🙂
Before we jump into how to find the performance issue, may I ask why you're using version 2.11.4? We're now up to version 2.12.4, and the version you're using is from November of last year, 8 releases ago (cf. the list of releases).
It's very likely that this issue has been fixed in a subsequent release already.
Thank you for the reply @tausbn
Unfortunately this regression persists in the latest versions. We had just been referencing @v2, but while investigating why CodeQL was not performing well, I noticed that this was a regression. That sent me on a hunt to find which version regressed. v2.11.3 (and older; I tested two versions back) appears to work without issue, but v2.11.4 and newer all fail to complete and show excessive resource consumption in the GlobalAccessPaths module.
Alright, in that case let's get into some of the details of how to debug stuff.
Actually, before we start: is there any way for you to share the JavaScript code that is exhibiting this behaviour? That would greatly ease the debugging process.
Based on the fact that you've identified GlobalAccessPaths as a potential culprit, I assume you've been running queries using the VSCode CodeQL extension, and that's where you observed the fact that it got to iteration 19 and then seemingly stopped.
If you haven't already, I would suggest turning on debug logging in the VSCode extension.
This makes the CodeQL evaluator emit a log of "tuple counts" during evaluation -- basically listing the sizes of the intermediate relations being calculated, and how many tuples were created with each operation (joins, unions, etc.)
The behaviour you're seeing -- getting to a certain iteration and then seemingly hanging and consuming a lot of resources -- is consistent with a so-called "bad join order". (The CodeQL compiler tries its best to order joins so that this doesn't happen, but both library changes and compiler changes may cause it to reorder joins differently from one release to the next.)
What you can do, then, is run the analysis and cancel it after it has been working away on iteration 19 for a while (during which time you'll likely see the CodeQL Query Server output pane in VSCode fill up with messages about garbage collection taking place). When you cancel it, it will report the (approximate) tuple counts for the current iteration.
If it is indeed a bad join order, then it'll look something like the following:
Tuple counts for Essa::ScopeEntryDefinition#class#24e22a14#f/1@45e0d8dh after 10.5s:
2133368 ~1% {2} r1 = Essa::TEssaNodeDefinition#24e22a14#ffff_03#join_rhs AND NOT Essa::ImplicitSubModuleDefinition#class#24e22a14#f(Lhs.1 'this')
534478950 ~0% {2} r2 = JOIN r1 WITH Definitions::SsaSourceVariable::getScopeEntryDefinition#dispred#f0820431#ff ON FIRST 1 OUTPUT Lhs.1 'this', Rhs.1
581249 ~4% {1} r3 = JOIN r2 WITH Essa::EssaNodeDefinition::getDefiningNode#dispred#f0820431#ff ON FIRST 2 OUTPUT Lhs.0 'this'
return r3
that is, a large number in the middle with smaller numbers on each side. (The above example was taken from this old join-order fix of mine.)
From there, we may be able to figure out how to prevent this bad join from appearing. 🙂
is there any way for you to share the JavaScript code that is exhibiting this behaviour?
Unfortunately we do not have a public repo suffering from this issue. If possible I would like to narrow down what file(s) are causing this in order to help provide a reproduction example that I can share with you.
I assume you've been running queries using the VSCode CodeQL extension, and that's where you observed the fact that it got to iteration 19 and then seemingly stopped.
This is not an accurate assumption. I was testing different versions using CodeQL from the CLI. I was then using the --verbose flag, which produced log output as testing proceeded. It was that logging that pointed to the progressively slower iterations of GlobalAccessPaths.
I am currently working on Linux, and don't have VSCode installed. Is this debugging something I can even perform? Do you have documentation on how I can set up the VSCode extension? Thank you!
Unfortunately we do not have a public repo suffering from this issue. If possible I would like to narrow down what file(s) are causing this in order to help provide a reproduction example that I can share with you.
I'm not sure it'll be easy to create a reproduction example like this. Quite often these issues only pop up when there is a sizeable amount of code present. But we can try anyway. 🙂
This is not an accurate assumption. I was testing different versions using CodeQL from the CLI. I was then using the --verbose flag, which produced log output as testing proceeded. It was that logging that pointed to the progressively slower iterations of GlobalAccessPaths.
I am currently working on Linux, and don't have VSCode installed. Is this debugging something I can even perform? Do you have documentation on how I can set up the VSCode extension? Thank you!
Ah, if you're using the CLI, then it should be as simple as adding the --tuple-counting argument to the command. (And you may want to add --log-to-stderr or --logdir=<dir> to redirect the log output.)
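For example, something along these lines (the database name, query pack, and log directory below are placeholders):

# Placeholder database, pack, and output names; --tuple-counting makes the
# evaluator log per-step tuple counts, and --logdir redirects the logs.
codeql database analyze js-db codeql/javascript-queries \
  --tuple-counting --verbose --logdir=codeql-logs \
  --format=sarif-latest --output=results.sarif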
@tausbn I was able to capture the additional logging. The logs are quite large; let me know if you need the full log for additional context.
I believe this probably captures what you're looking for:
[2023-03-13 15:00:44] Pausing evaluation to evict 105.21MiB ARRAYS at sequence stamp o+2649489
[2023-03-13 15:00:44] Unpausing evaluation: 105.29MiB forgotten: 105.29MiB UNREACHABLE (92 items up to o+2649487)
[2023-03-13 15:00:44] Pausing evaluation to evict 105.21MiB ARRAYS at sequence stamp o+2649675
[2023-03-13 15:00:44] Unpausing evaluation: 106.43MiB forgotten: 106.43MiB UNREACHABLE (93 items up to o+2649673)
[2023-03-13 15:00:44] Starting to evaluate predicate GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f/1@i17#98a82zul (iteration 17)
[2023-03-13 15:00:44] Tuple counts for GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f/1@i17#98a82zul after 30ms:
138246 ~1% {1} r1 = Variables#51c25ee5::Variable::isLocal#0#dispred#f AND NOT GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f#prev(Lhs.0 'result')
304565 ~0% {2} r2 = JOIN r1 WITH Variables#51c25ee5::VarAccess::getVariable#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0 'result'
304565 ~0% {2} r3 = JOIN r2 WITH DataFlow#d6e964f9::DataFlow::Node::asExpr#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'result'
0 ~0% {1} r4 = JOIN r3 WITH GlobalAccessPaths#c651dae5::AccessPath::fromRhs#2#fff#prev_delta ON FIRST 1 OUTPUT Lhs.1 'result'
0 ~0% {2} r5 = JOIN r4 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getAnAccessInContainer#3#ffff ON FIRST 1 OUTPUT Lhs.0 'result', Rhs.1
0 ~0% {1} r6 = JOIN r5 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getAnAccessInContainer#3#ffff#2 ON FIRST 2 OUTPUT Lhs.0 'result'
0 ~0% {2} r7 = JOIN r6 WITH Variables#51c25ee5::VarRef::getVariable#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0 'result'
0 ~0% {1} r8 = JOIN r7 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getBaseVar#1#ff ON FIRST 1 OUTPUT Lhs.1 'result'
return r8
[2023-03-13 15:00:44] Empty delta for GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f_delta (order for disjuncts: delta=500000).
[2023-03-13 15:00:44] Empty delta for GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getASuccessorBBThatReadsVar#1#ff_delta (order for disjuncts: delta=<standard>).
[2023-03-13 15:00:44] Empty delta for GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getLaterBaseAccess#1#ff_delta (order for disjuncts: delta=<standard>).
[2023-03-13 15:00:44] Starting to evaluate predicate GlobalAccessPaths#c651dae5::AccessPath::fromRhs#2#fff/3@i17#98a82wul (iteration 17)
[2023-03-13 15:00:56] Pausing evaluation to evict 105.22MiB ARRAYS at sequence stamp o+2722335
The logs towards the bottom all follow a similar pattern to the above. However, I am unsure what to do with this information; let me know how I can help in debugging. Thank you!
The bit of the log you provided above doesn't look terribly suspicious. In the grand scheme of things, a six-digit number of tuples is no sweat for the CodeQL evaluator (as evidenced by the fact that it took only 30ms to do the joins for that iteration).
I would expect something considerably higher, say at the level of tens or hundreds of millions of tuples for the behaviour you're describing.
If you let the analysis run until it seems to grind to a halt at iteration 19, and wait, say, half a minute before terminating the evaluation, then I think the last tuple count in the log should show some of the behaviour I describe. (It will likely also have very "round" numbers for the tuple counts, as these are not logged exactly if cancelled prematurely.)
Thank you for the continued help @tausbn. Unfortunately the numbers never get that high. I did a variety of testing with different analysis sets to see if the issue is tied to a specific test, but they all seem to top out around there.
I was unsure how best to demonstrate this to help you debug, so ultimately I decided to record it. This real-time recording shows an almost 14-minute test run. The run gets to iteration 17, uses over 16GB of RAM, and the database size grows to over 12GB when running only the 0.5.4/Declarations/ cases (I hoped a smaller test set would help in debugging). The highest values seen in this test are:
[2023-03-15 11:08:54] Tuple counts for GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f/1@i17#98a82zul after 39ms:
138246 ~1% {1} r1 = Variables#51c25ee5::Variable::isLocal#0#dispred#f AND NOT GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getARelevantVariable#0#f#prev(Lhs.0 'result')
304565 ~0% {2} r2 = JOIN r1 WITH Variables#51c25ee5::VarAccess::getVariable#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0 'result'
304565 ~0% {2} r3 = JOIN r2 WITH DataFlow#d6e964f9::DataFlow::Node::asExpr#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'result'
0 ~0% {1} r4 = JOIN r3 WITH GlobalAccessPaths#c651dae5::AccessPath::fromRhs#2#fff#prev_delta ON FIRST 1 OUTPUT Lhs.1 'result'
0 ~0% {2} r5 = JOIN r4 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getAnAccessInContainer#3#ffff ON FIRST 1 OUTPUT Lhs.0 'result', Rhs.1
0 ~0% {1} r6 = JOIN r5 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getAnAccessInContainer#3#ffff#2 ON FIRST 2 OUTPUT Lhs.0 'result'
0 ~0% {2} r7 = JOIN r6 WITH Variables#51c25ee5::VarRef::getVariable#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0 'result'
0 ~0% {1} r8 = JOIN r7 WITH project#GlobalAccessPaths#c651dae5::AccessPath::GetLaterAccess::getBaseVar#1#ff ON FIRST 1 OUTPUT Lhs.1 'result'
return r8
Let me know what else I can provide, thank you!
https://user-images.githubusercontent.com/2165757/225429674-a7fb815c-42db-4765-b332-f37da7bb6a6f.mp4
Thank you for the video -- it was very helpful! (I think this is the first time I've tried debugging something this way. 😄)
I believe this has shed some light on why things are going wrong.
Around the 5:23 mark, we see the tuple counts for GlobalAccessPaths::AccessPath::fromRhs, with a result of size 70760664. This is almost certainly what's causing all of the garbage collection. (My guess is that if you had Ctrl-C'ed the evaluation at the end of the video, you would've seen a later iteration of that same predicate being evaluated, but with a much higher tuple count.)
Now, this is not exactly the "bad join order" scenario that I described, but it's somewhat related (and a bit more concerning). To understand why this is, let's go back to the bad join order example I had above:
Tuple counts for Essa::ScopeEntryDefinition#class#24e22a14#f/1@45e0d8dh after 10.5s:
2133368 ~1% {2} r1 = Essa::TEssaNodeDefinition#24e22a14#ffff_03#join_rhs AND NOT Essa::ImplicitSubModuleDefinition#class#24e22a14#f(Lhs.1 'this')
534478950 ~0% {2} r2 = JOIN r1 WITH Definitions::SsaSourceVariable::getScopeEntryDefinition#dispred#f0820431#ff ON FIRST 1 OUTPUT Lhs.1 'this', Rhs.1
581249 ~4% {1} r3 = JOIN r2 WITH Essa::EssaNodeDefinition::getDefiningNode#dispred#f0820431#ff ON FIRST 2 OUTPUT Lhs.0 'this'
return r3
In this case, though it looks troubling that there are 534478950 tuples in the intermediate join, in terms of performance it might not be catastrophic. This is because those ~500 million tuples don't have to be materialised as a table.
Materialisation generally only takes place at the end of a predicate, i.e. the return r3 above means we actually materialise a table of only 581249 tuples. In the case of fromRhs, however, it means materialising ~70 million tuples.
Now, this is a bit worrying, because it means an easy fix may not be available. With a bad join order, we can try to change the order of the individual joins to avoid unnecessary computation, but changing the join order can never change the number of tuples in the final result. Here it's the entire predicate that's big: the fromRhs relation is simply that large, and the only remedy will be to limit it somehow.
I think at this point I'll have to hand this over to my colleagues in the JavaScript team, as they are much more familiar with how the GlobalAccessPaths library works.
Thank you for the help @tausbn, let me know if there is anything further I can provide.
I think I know what issue you're encountering. It's probably the same as the one I tried to fix here: https://github.com/github/codeql/pull/12105.
I'll take a look at that again and try to get it merged.
Hopefully that will fix your issue.
I just merged https://github.com/github/codeql/pull/12592
That might fix the problem you are encountering.
It definitely fixes a similar problem in another repository.
It will take a few weeks before the fix lands in a release.
Thank you for the update @erik-krogh. I tried to see if there was a contributing guide so I could give it an early test, but I don't see any instructions. Let me know if I missed it, otherwise just let me know when there is a release version I can test. Thank you again for the help here!
Thank you for the update @erik-krogh. I tried to see if there was a contributing guide so I could give it an early test, but I don't see any instructions.
I think you can try this:
https://codeql.github.com/docs/codeql-for-visual-studio-code/setting-up-codeql-in-visual-studio-code/
and make sure to pull in the latest changes from the github/codeql repo.
Alternatively you can try to follow this: https://docs.github.com/en/code-security/codeql-cli/using-the-codeql-cli/getting-started-with-the-codeql-cli
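If you go the CLI route, one way to pick up the unreleased query changes is to run the JavaScript suite directly from a checkout of github/codeql; a rough sketch, assuming the suite path below and a placeholder database name:

# Clone the query sources and point the CLI at the checkout instead of a
# released query pack. Paths and the database name are placeholders.
git clone https://github.com/github/codeql.git
codeql database analyze js-db \
  codeql/javascript/ql/src/codeql-suites/javascript-code-scanning.qls \
  --search-path=codeql \
  --format=sarif-latest --output=results.sarif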
Hi, sorry I kind of forgot about this.
The latest release should include the above mentioned PR now.
You can find CLI releases here: https://github.com/github/codeql-cli-binaries/releases
Thank you for the update @erik-krogh
Unfortunately using codeql-action/analyze@v2 still showed the same behavior we witnessed originally. Let me know if there is any additional debugging I can provide.
Can you confirm which version of the QLPack you are running?
Previously you were on 0.5.4.
I forget where I downloaded the QLPacks from, but I can confirm that this issue is still reproducible with a freshly downloaded CLI 2.13.0 and via the latest GitHub Actions.
Unfortunately I think we are at the point where it's very hard for me to do more without more information from you.
Maybe I can do something if you give me the entire log file (that streamed in your video), but I doubt it.
The best would be to get a test case: some small amount of JavaScript that causes the regression in the analysis. That would really help.
Or if it's possible for me to somehow get temporary access to the code.
@erik-krogh sorry for the delay in responding. I did get authorization to give you access to a portion of our proprietary code base. I spent some time finding the minimum amount of code that demonstrates this issue, and was able to get it down to under 500KB of protobuf-generated TypeScript and JavaScript files.
Can you confirm you're able to still work on this issue? If so I will provide you access to a repo which contains the code in question. Thank you!
Can you confirm you're able to still work on this issue? If so I will provide you access to a repo which contains the code in question. Thank you!
Yes, I can confirm that I'm still able to work on it, or I'll maybe get someone else to take a look. I can't guarantee that we can fix it, but we will take another look at it.
@erik-krogh You should now have access to the repo https://github.com/gravitational/cloud-codeql-testing
That contains the minimal code needed for reproduction. In the README I specified the exact commands I used to validate the failure using CLI v2.13.0.
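Roughly speaking, the steps amount to creating a JavaScript database and analyzing it with the JavaScript queries; the database name and flags below are illustrative rather than the exact commands from the README:

# Illustrative reproduction sketch: build a JavaScript database from the
# repository root, then analyze it with tuple counting and verbose logging.
codeql database create js-db --language=javascript --source-root=.
codeql database analyze js-db codeql/javascript-queries \
  --tuple-counting --verbose --logdir=codeql-logs \
  --format=sarif-latest --output=results.sarif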
Let me know if you have any questions, or there is anything I can do to help further. Thank you!
I have a fix: https://github.com/github/codeql/pull/13349
I've confirmed that all our queries run just fine with that fix on the repo you gave me access to.
After it's been merged it will still be a while before it makes it into a release.
I am impressed! Thank you for your work! I am happy to test once it's included in a release.