codeql icon indicating copy to clipboard operation
codeql copied to clipboard

C++: Total number of baseline files limit

Open artem-smotrakov opened this issue 1 year ago • 5 comments
trafficstars

Hey friends, I have quite a large C++ database:

codeql database print-baseline -- ${CODEQL_DATABASE_DIR}
Counted a baseline of 27711380 lines of code for cpp.

Before running scans, I normally run some simple diagnostic queries to make sure the database looks fine. The queries look for things like:

  • Files
  • FunctionCalls
  • IfStmts

When I run these queries on this large database, I get this

codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=calls.sarif ${CODEQL_QUERIES}/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql
Running queries.
[1/1 comp 7.8s] Compiled [...]/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql.
Files.ql: [1/1 eval 36s] Results written to cpp-queries/diagnostics/FunctionCalls.bqrs.
Shutting down query evaluator.
Interpreting results.
Will not interpret file coverage baseline information, since the total number of baseline files is 153738, which is greater than the limit of 50000.

The exit code is 0 but calls.sarif is empty.

When I run queries from the standard C++ pack, I get the same message.

What does this limit mean? Is there any way to increase it? I didn't find anything either in the docs or in this repo unfortunately, may be missing something though. Thanks!

artem-smotrakov avatar Oct 11 '24 14:10 artem-smotrakov

👋 @artem-smotrakov, sorry for the late reply!

That limit is meant to avoid generating too large a SARIF file when populating the Tool Status Page for information about how many files were analyzed, hitting the SARIF file size limit. It is currently hard-coded and cannot be configured.

That said, I'm not entirely sure this limit (and the warning) should really cause a custom query like yours to return no results. Could you:

  • share your FunctionCalls.ql, so we can experiment with it a bit?
  • maybe try another output format like cvs to see if the issue is specifically related to the SARIF format?

redsun82 avatar Oct 14 '24 07:10 redsun82

Hi @redsun82 ! Thanks for your reply!

share your FunctionCalls.ql, so we can experiment with it a bit?

Yeah, sure, it's quite simple

import cpp

from FunctionCall call, Function func
where func = call.getTarget()
select call, "Call " + func + "(" + func.getParameterString() + ")"

maybe try another output format like cvs to see if the issue is specifically related to the SARIF format?

I get the same message for CSV if I use --format=csv --output=calls.csv. The calls.csv file is empty.

It is currently hard-coded and cannot be configured.

Would it be possible to make it configurable in one of the next releases? 🤔

artem-smotrakov avatar Oct 14 '24 08:10 artem-smotrakov

Hi @artem-smotrakov,

The base line information should not influence the result of the query. Could you run https://github.com/github/codeql/blob/main/cpp/ql/src/Diagnostics/ExtractionWarnings.ql to determine if other issue are influencing the results of the query?

rvermeulen avatar Oct 14 '24 19:10 rvermeulen

Hi @rvermeulen ! Attaching the results of the ExtractionWarnings.ql. I see errors in several files but the codebase has way more C++ files. Also, I got the same limit warning when I ran the query.

extractrion_warnings.sarif.txt

artem-smotrakov avatar Oct 16 '24 08:10 artem-smotrakov

Hi @artem-smotrakov,

Let me forward this to our C/C++ team. In the mean time, could you share which CodeQL CLI version you are using codeql version --format=json and the build-tracer.log that you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted (such as the unpack location).

rvermeulen avatar Oct 17 '24 22:10 rvermeulen

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

github-actions[bot] avatar Nov 01 '24 02:11 github-actions[bot]

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

I am working on it, please don't close it.

artem-smotrakov avatar Nov 04 '24 15:11 artem-smotrakov

share which CodeQL CLI version you are using codeql version --format=json


{
  "productName" : "CodeQL",
  "vendor" : "GitHub",
  "version" : "2.19.0",
  "sha" : "9f0ad8ab1f14c2711b9fc2666e8bcdd09ab39ce8",
  "branches" : [
    "codeql-cli-2.19.0"
  ],
  "copyright" : "Copyright (C) 2019-2024 GitHub, Inc.",
  [...]
  "configFileFound" : false,
  "features" : {
    "analysisSummaryV2Default" : true,
    "buildModeOption" : true,
    "bundleSupportsIncludeDiagnostics" : true,
    "bundleSupportsIncludeLogs" : true,
    "databaseInterpretResultsSupportsSarifRunProperty" : true,
    "featuresInVersionResult" : true,
    "indirectTracingSupportsStaticBinaries" : false,
    "informsAboutUnsupportedPathFilters" : true,
    "supportsPython312" : true,
    "mrvaPackCreate" : true,
    "threatModelOption" : true,
    "traceCommandUseBuildMode" : true,
    "v2ramSizing" : true,
    "mrvaPackCreateMultipleQueries" : true,
    "setsCodeqlRunnerEnvVar" : true,
    "sarifMergeRunsFromEqualCategory" : true,
    "forceOverwrite" : true,
    "generateSummarySymbolMap" : true
  }
}

and the build-tracer.log

I am figuring out if I can actually share it.

artem-smotrakov avatar Nov 04 '24 15:11 artem-smotrakov

build-tracer.log that you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted

Hi @rvermeulen The build-tracer.log is huge. I am looking at it and trying to redact possible sensitive info, but I am not super well familiar with what it has, and doesn't feel confident if I removed everything that has to be removed.

Is there anything specific that the C++ team might be looking for in this log?

What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?

artem-smotrakov avatar Nov 06 '24 11:11 artem-smotrakov

Hi @artem-smotrakov, I'm so sorry this has slipped through our cracks, and I just figured out only now you were waiting for a reply.

Did you maybe manage to solve this issue on your own in the meantime?

What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?

You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.

redsun82 avatar Feb 04 '25 13:02 redsun82

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

github-actions[bot] avatar Feb 20 '25 01:02 github-actions[bot]

Hi @redsun82

Did you maybe manage to solve this issue on your own in the meantime?

I still see the warning, but looks like the queries run okay with CodeQL 2.20.1. I am not sure what changed between 2.19.0 and 2.20.1, but it might have solved the issue.

This issue aside, I wrote a relatively simple diagnostic query that look for functions that meet certain criteria. The query works well with 2.20.1. It takes several minutes to complete on a relatively large database. However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).

You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.

Thanks! If I see this issue again, I'll try to send a support request.

artem-smotrakov avatar Feb 20 '25 14:02 artem-smotrakov

@artem-smotrakov

However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).

We were forced to change something in the stats file that is used to compute the join-orders. This change happened between 2.20.1 and 2.20.4. We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general. Would you be able to share your query so we can investigate?

jketema avatar Feb 21 '25 15:02 jketema

Hi @jketema

We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general.

What was the fix? Do you have an example?

Would you be able to share your query so we can investigate?

Here it is

import cpp

/**
 * UFUNCTION(..) annotation.
 */
class UnrealFunctionAnnotation extends MacroInvocation {
  UnrealFunctionAnnotation() { this.getMacroName() = "UFUNCTION" }

  /**
   * Returns arguments passed to this UFUNCTION annotation.
   */
  string getAnArgument() {
    result = this.getExpandedArgument(_) or result = this.getUnexpandedArgument(_)
  }

  /**
   * Holds if `functionDeclaration` is annotated with this UFUNCTION annotation.
   */
  predicate annotates(FunctionDeclarationEntry functionDeclaration) {
    this.getLocation().isBefore(functionDeclaration.getLocation())
  }

  /**
   * Holds if this UFUNCTION annotation marks the function as RPC.
   */
  predicate declaresRpc() {
    this.getAnArgument().toLowerCase().matches(["%server%", "%client%", "%netmulticast%"])
  }
}

/**
 * Unreal RPC function declaration.
 */
class UnrealRpcDeclaration extends FunctionDeclarationEntry {
  UnrealRpcDeclaration() {
    exists(UnrealFunctionAnnotation ufunction | ufunction.annotates(this) | ufunction.declaresRpc())
  }

  /**
   * Returns implementation of the declared RPC.
   *
   * Unreal requires "_Implementation" suffix to be added to the definition of the RPC.
   * For example, if PRC is defined in the SomeClassName class like
   *
   *     UFUNCTION(Server, Reliable)
   *     void ServerDoSomething();
   *
   * Then, the RPC impementation should be
   *
   *     void SomeClassName::ServerDoSomething_Implementation()
   *     {
   *         [...]
   *     }
   */
  Function getImplementation() {
    exists(Function func |
      func.getName() = this.getName() + "_Implementation" and
      func.getADeclarationEntry().isDefinition()
    |
      result = func
    )
  }

  /**
   * Holds if the declared RPC is part of the engine.
   */
  predicate isEngine() {
    getLocation().getFile().getRelativePath().matches(["%Engine/Source%", "%Engine/Plugins%"])
  }

  /**
   * Holds if the declared RPC looks like to have been added for debugging purpose only.
   */
  predicate isDebug() { getName().matches("%Debug%") }
}

from UnrealRpcDeclaration rpcDeclaration
where
  not rpcDeclaration.isEngine() and // exclude RPCs that are part of the engine
  not rpcDeclaration.isDebug() // filter out RPCs for debugging
select rpcDeclaration.getImplementation(), rpcDeclaration.getImplementation().getName()

It looks for PRCs in code that uses Unreal Engine.

To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.

artem-smotrakov avatar Feb 24 '25 09:02 artem-smotrakov

What was the fix? Do you have an example?

It very much depends:

  • https://github.com/github/codeql/pull/18478
  • https://github.com/github/codeql/pull/18561
  • https://github.com/github/codeql/pull/18578
  • https://github.com/github/codeql/pull/18588

To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.

This is not something I have access to, unfortunately. I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.

jketema avatar Feb 24 '25 09:02 jketema

Thanks for the examples.

Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated? I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked. I am wondering if I can just stay on 2.20.1 before it's addressed.

This is not something I have access to, unfortunately.

Yes, you'd need to join Epic's org to get access to it, instructions

https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.

artem-smotrakov avatar Feb 24 '25 10:02 artem-smotrakov

Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated?

All queries that were impacted have been updated at this point.

I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked.

Understandable. Especially since this is not really documented.

I am wondering if I can just stay on 2.20.1 before it's addressed.

Currently I'm not able to address this, as I do not have enough data.

Yes, you'd need to join Epic's org to get access to it, instructions

This is very high-overhead, which means I would need to schedule this, but it can only have very low priority because this is the only report we've received about this so far, and as this doesn't show up in our own testing.

https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.

This still seems to require access to Epic's org to get access to the required tooling?

I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.

To get the numbers you see on the PRs that I linked above:

  • codeql database cleanup --cache-cleanup=clear ${CODEQL_DATABASE_DIR}
  • codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=output.sarif --tuple-counting --evaluator-log output.log --rerun ${CODEQL_QUERY}
  • codeql generate log-summary --format=text output.log summary.log

The summary.log file should not contain any of your IP (but please double check), so should be shareable here.

jketema avatar Feb 24 '25 10:02 jketema

This still seems to require access to Epic's org to get access to the required tooling?

No. You only need to install Unreal Engine and compiler toolkit

https://www.unrealengine.com/en-US/download

No need access to the Epic's org.

I started the query using the instructions you provided. When it's finished, I'll check the log, and post it here if it's okay.

artem-smotrakov avatar Feb 24 '25 12:02 artem-smotrakov

Attaching summary.log

artem-smotrakov avatar Feb 25 '25 09:02 artem-smotrakov

Thanks. It's completely blowing up in your annotates predicate:

[2025-02-25 03:57:02] Evaluated non-recursive predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk in 57295162ms (size: 351015697).
Evaluated relational algebra for predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk with tuple counts:
                   1   ~0%    {0} r1 = CONSTANT()[]
              343222   ~0%    {2}    | JOIN WITH `Location::Location.getEndLine/0#dispred#83af84ae#bf` CARTESIAN PRODUCT OUTPUT Rhs.0, Rhs.1
        745838910322   ~0%    {4}    | JOIN WITH `Location::Location.getStartLine/0#d54f9e6c` CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
                              {4}    | REWRITE WITH TEST InOut.1 < InOut.3
        406186767271   ~5%    {2}    | SCAN OUTPUT In.2, In.0
        279209432197   ~0%    {3}    | JOIN WITH fun_decls_40#join_rhs ON FIRST 1 OUTPUT Lhs.1, Lhs.0, Rhs.1
        279209432197   ~3%    {4}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 1 OUTPUT Lhs.1, Rhs.1, Lhs.0, Lhs.2
            77263627   ~4%    {2}    | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 2 OUTPUT Lhs.2, Lhs.3
         11422410724   ~3%    {2}    | JOIN WITH macroinvocations_20#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1
         18574408764   ~0%    {2}    | JOIN WITH `Macro::MacroAccess.getOutermostMacroAccess/0#d58b05db_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1
           351015697   ~0%    {2}    | JOIN WITH rpc::UnrealFunctionAnnotation#6f570527 ON FIRST 1 OUTPUT Lhs.0, Lhs.1
                              return r1

[2025-02-25 03:57:14] Evaluated non-recursive predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts in 239ms (size: 4637).
Evaluated relational algebra for predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts with tuple counts:
        7976950  ~169140%    {1} r1 = JOIN `rpc::UnrealFunctionAnnotation.declaresRpc/0#dispred#aa76493c` WITH `rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e` ON FIRST 1 OUTPUT Rhs.1
                             return r1

What's not clear to me though is why the same thing doesn't happen with 2.20.1. Would you also be able to produce the same log file but with 2.20.1?

jketema avatar Feb 25 '25 09:02 jketema

Would you also be able to produce the same log file but with 2.20.1?

summary_2_20_1.log

artem-smotrakov avatar Feb 25 '25 10:02 artem-smotrakov

I believe the performance problem should be fixed by https://github.com/github/codeql/pull/18859

jketema avatar Feb 25 '25 11:02 jketema

I ran the query on my database, and the UnrealFunctionAnnotation.annotates() predicate took 56m. This is a significant improvement, thanks!

summary_2_20_4_patched.log

Do you know if there's any chance to reduce it further? The predicate takes about 3m to complete with 2.20.1. I am not sure it would be possible to bring it back to 3m but anything closer would be great!

artem-smotrakov avatar Feb 25 '25 13:02 artem-smotrakov

Do you know if there's any chance to reduce it further?

I don't immediately see an easy way of doing this by just modifying the library, because it's difficult to predict what effect any changes to the library have in other contexts. We believe that https://github.com/github/codeql/pull/18859 is still relatively safe in that respect.

Let me have a think though about whether something can be done on the side of your query.

jketema avatar Feb 25 '25 14:02 jketema

Thanks! I've updated the predicate to this

predicate annotates(FunctionDeclarationEntry functionDeclaration) {
    this.getLocation().getFile() = functionDeclaration.getLocation().getFile() and
    this.getLocation().getStartLine() = functionDeclaration.getLocation().getStartLine() - 1
  }

because I am only interested in UFUNCTION() macros that are located on the previous line from the function declaration. Apparently, this brought query execution time down to approx 21m. I guess it happened because the new condition is a bit stricter that isBefore(), but not 100% sure.

artem-smotrakov avatar Feb 25 '25 14:02 artem-smotrakov

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

github-actions[bot] avatar Mar 12 '25 02:03 github-actions[bot]

Hi @artem-smotrakov

Let me have a think though about whether something can be done on the side of your query.

With the limited time I had for this, I've unfortunately not been able to come up with anything that make things fast again for you. My apologies.

jketema avatar Mar 12 '25 08:03 jketema

No problem @jketema ! Thanks!

artem-smotrakov avatar Mar 12 '25 09:03 artem-smotrakov

This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.

github-actions[bot] avatar Mar 27 '25 02:03 github-actions[bot]