codeql
codeql copied to clipboard
C++: Total number of baseline files limit
Hey friends, I have quite a large C++ database:
codeql database print-baseline -- ${CODEQL_DATABASE_DIR}
Counted a baseline of 27711380 lines of code for cpp.
Before running scans, I normally run some simple diagnostic queries to make sure the database looks fine. The queries look for things like:
- Files
FunctionCallsIfStmts
When I run these queries on this large database, I get this
codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=calls.sarif ${CODEQL_QUERIES}/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql
Running queries.
[1/1 comp 7.8s] Compiled [...]/qlpacks/cpp-queries/diagnostics/FunctionCalls.ql.
Files.ql: [1/1 eval 36s] Results written to cpp-queries/diagnostics/FunctionCalls.bqrs.
Shutting down query evaluator.
Interpreting results.
Will not interpret file coverage baseline information, since the total number of baseline files is 153738, which is greater than the limit of 50000.
The exit code is 0 but calls.sarif is empty.
When I run queries from the standard C++ pack, I get the same message.
What does this limit mean? Is there any way to increase it? I didn't find anything either in the docs or in this repo unfortunately, may be missing something though. Thanks!
👋 @artem-smotrakov, sorry for the late reply!
That limit is meant to avoid generating too large a SARIF file when populating the Tool Status Page for information about how many files were analyzed, hitting the SARIF file size limit. It is currently hard-coded and cannot be configured.
That said, I'm not entirely sure this limit (and the warning) should really cause a custom query like yours to return no results. Could you:
- share your
FunctionCalls.ql, so we can experiment with it a bit? - maybe try another output format like
cvsto see if the issue is specifically related to the SARIF format?
Hi @redsun82 ! Thanks for your reply!
share your FunctionCalls.ql, so we can experiment with it a bit?
Yeah, sure, it's quite simple
import cpp
from FunctionCall call, Function func
where func = call.getTarget()
select call, "Call " + func + "(" + func.getParameterString() + ")"
maybe try another output format like cvs to see if the issue is specifically related to the SARIF format?
I get the same message for CSV if I use --format=csv --output=calls.csv. The calls.csv file is empty.
It is currently hard-coded and cannot be configured.
Would it be possible to make it configurable in one of the next releases? 🤔
Hi @artem-smotrakov,
The base line information should not influence the result of the query. Could you run https://github.com/github/codeql/blob/main/cpp/ql/src/Diagnostics/ExtractionWarnings.ql to determine if other issue are influencing the results of the query?
Hi @rvermeulen ! Attaching the results of the ExtractionWarnings.ql. I see errors in several files but the codebase has way more C++ files. Also, I got the same limit warning when I ran the query.
Hi @artem-smotrakov,
Let me forward this to our C/C++ team.
In the mean time, could you share which CodeQL CLI version you are using codeql version --format=json and the build-tracer.log that you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted (such as the unpack location).
This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.
This issue is stale because it has been open 14 days with no activity. Comment or remove the
Stalelabel in order to avoid having this issue closed in 7 days.
I am working on it, please don't close it.
share which CodeQL CLI version you are using
codeql version --format=json
{
"productName" : "CodeQL",
"vendor" : "GitHub",
"version" : "2.19.0",
"sha" : "9f0ad8ab1f14c2711b9fc2666e8bcdd09ab39ce8",
"branches" : [
"codeql-cli-2.19.0"
],
"copyright" : "Copyright (C) 2019-2024 GitHub, Inc.",
[...]
"configFileFound" : false,
"features" : {
"analysisSummaryV2Default" : true,
"buildModeOption" : true,
"bundleSupportsIncludeDiagnostics" : true,
"bundleSupportsIncludeLogs" : true,
"databaseInterpretResultsSupportsSarifRunProperty" : true,
"featuresInVersionResult" : true,
"indirectTracingSupportsStaticBinaries" : false,
"informsAboutUnsupportedPathFilters" : true,
"supportsPython312" : true,
"mrvaPackCreate" : true,
"threatModelOption" : true,
"traceCommandUseBuildMode" : true,
"v2ramSizing" : true,
"mrvaPackCreateMultipleQueries" : true,
"setsCodeqlRunnerEnvVar" : true,
"sarifMergeRunsFromEqualCategory" : true,
"forceOverwrite" : true,
"generateSummarySymbolMap" : true
}
}
and the build-tracer.log
I am figuring out if I can actually share it.
build-tracer.logthat you can find in the database directory under logs. Before sharing make sure possible sensitive information is redacted
Hi @rvermeulen The build-tracer.log is huge. I am looking at it and trying to redact possible sensitive info, but I am not super well familiar with what it has, and doesn't feel confident if I removed everything that has to be removed.
Is there anything specific that the C++ team might be looking for in this log?
What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?
Hi @artem-smotrakov, I'm so sorry this has slipped through our cracks, and I just figured out only now you were waiting for a reply.
Did you maybe manage to solve this issue on your own in the meantime?
What would be best way to share this log with the team? I would not like attaching it here a publicly visible comment. If that helps, my employer Riot Games is GitHub's customer, I guess there might be some private ticketing/etc services where I could probably post the file?
You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.
This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.
Hi @redsun82
Did you maybe manage to solve this issue on your own in the meantime?
I still see the warning, but looks like the queries run okay with CodeQL 2.20.1. I am not sure what changed between 2.19.0 and 2.20.1, but it might have solved the issue.
This issue aside, I wrote a relatively simple diagnostic query that look for functions that meet certain criteria. The query works well with 2.20.1. It takes several minutes to complete on a relatively large database. However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).
You (or someone in your company) can create a private support ticket here. More infos are available here. If you do, you can link this public issue in the request to help us sort this out more easily. More sensitive log artifacts can then be shared through that channel.
Thanks! If I see this issue again, I'll try to send a support request.
@artem-smotrakov
However, with the latest 2.20.4, it's been running for a few hours already (update: it took approx 11 hours to complete).
We were forced to change something in the stats file that is used to compute the join-orders. This change happened between 2.20.1 and 2.20.4. We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general. Would you be able to share your query so we can investigate?
Hi @jketema
We fixed this issues in our own queries that were caused by this, but those fixes might not have been sufficient in general.
What was the fix? Do you have an example?
Would you be able to share your query so we can investigate?
Here it is
import cpp
/**
* UFUNCTION(..) annotation.
*/
class UnrealFunctionAnnotation extends MacroInvocation {
UnrealFunctionAnnotation() { this.getMacroName() = "UFUNCTION" }
/**
* Returns arguments passed to this UFUNCTION annotation.
*/
string getAnArgument() {
result = this.getExpandedArgument(_) or result = this.getUnexpandedArgument(_)
}
/**
* Holds if `functionDeclaration` is annotated with this UFUNCTION annotation.
*/
predicate annotates(FunctionDeclarationEntry functionDeclaration) {
this.getLocation().isBefore(functionDeclaration.getLocation())
}
/**
* Holds if this UFUNCTION annotation marks the function as RPC.
*/
predicate declaresRpc() {
this.getAnArgument().toLowerCase().matches(["%server%", "%client%", "%netmulticast%"])
}
}
/**
* Unreal RPC function declaration.
*/
class UnrealRpcDeclaration extends FunctionDeclarationEntry {
UnrealRpcDeclaration() {
exists(UnrealFunctionAnnotation ufunction | ufunction.annotates(this) | ufunction.declaresRpc())
}
/**
* Returns implementation of the declared RPC.
*
* Unreal requires "_Implementation" suffix to be added to the definition of the RPC.
* For example, if PRC is defined in the SomeClassName class like
*
* UFUNCTION(Server, Reliable)
* void ServerDoSomething();
*
* Then, the RPC impementation should be
*
* void SomeClassName::ServerDoSomething_Implementation()
* {
* [...]
* }
*/
Function getImplementation() {
exists(Function func |
func.getName() = this.getName() + "_Implementation" and
func.getADeclarationEntry().isDefinition()
|
result = func
)
}
/**
* Holds if the declared RPC is part of the engine.
*/
predicate isEngine() {
getLocation().getFile().getRelativePath().matches(["%Engine/Source%", "%Engine/Plugins%"])
}
/**
* Holds if the declared RPC looks like to have been added for debugging purpose only.
*/
predicate isDebug() { getName().matches("%Debug%") }
}
from UnrealRpcDeclaration rpcDeclaration
where
not rpcDeclaration.isEngine() and // exclude RPCs that are part of the engine
not rpcDeclaration.isDebug() // filter out RPCs for debugging
select rpcDeclaration.getImplementation(), rpcDeclaration.getImplementation().getName()
It looks for PRCs in code that uses Unreal Engine.
To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.
What was the fix? Do you have an example?
It very much depends:
- https://github.com/github/codeql/pull/18478
- https://github.com/github/codeql/pull/18561
- https://github.com/github/codeql/pull/18578
- https://github.com/github/codeql/pull/18588
To test the query, you'd need a multiplayer game based on Unreal Engine, for example, Lyra. Here is RPC example.
This is not something I have access to, unfortunately. I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.
Thanks for the examples.
Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated? I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked. I am wondering if I can just stay on 2.20.1 before it's addressed.
This is not something I have access to, unfortunately.
Yes, you'd need to join Epic's org to get access to it, instructions
https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.
Is this something you're planning to address in the future CodeQL releases, or do all impacted queries have to be updated?
All queries that were impacted have been updated at this point.
I have to admit that my knowledge of CodeQL doesn't let me understand what's happening in the PR you linked.
Understandable. Especially since this is not really documented.
I am wondering if I can just stay on 2.20.1 before it's addressed.
Currently I'm not able to address this, as I do not have enough data.
Yes, you'd need to join Epic's org to get access to it, instructions
This is very high-overhead, which means I would need to schedule this, but it can only have very low priority because this is the only report we've received about this so far, and as this doesn't show up in our own testing.
https://github.com/tomlooman/ActionRoguelike -- that's another project that can be used for testing the query.
This still seems to require access to Epic's org to get access to the required tooling?
I can guide you on how to get the numbers you see on some of the PRs from above if that would make sense.
To get the numbers you see on the PRs that I linked above:
codeql database cleanup --cache-cleanup=clear ${CODEQL_DATABASE_DIR}codeql database analyze ${CODEQL_DATABASE_DIR} --format=sarif-latest --output=output.sarif --tuple-counting --evaluator-log output.log --rerun ${CODEQL_QUERY}codeql generate log-summary --format=text output.log summary.log
The summary.log file should not contain any of your IP (but please double check), so should be shareable here.
This still seems to require access to Epic's org to get access to the required tooling?
No. You only need to install Unreal Engine and compiler toolkit
https://www.unrealengine.com/en-US/download
No need access to the Epic's org.
I started the query using the instructions you provided. When it's finished, I'll check the log, and post it here if it's okay.
Attaching summary.log
Thanks. It's completely blowing up in your annotates predicate:
[2025-02-25 03:57:02] Evaluated non-recursive predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk in 57295162ms (size: 351015697).
Evaluated relational algebra for predicate rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e@37a93epk with tuple counts:
1 ~0% {0} r1 = CONSTANT()[]
343222 ~0% {2} | JOIN WITH `Location::Location.getEndLine/0#dispred#83af84ae#bf` CARTESIAN PRODUCT OUTPUT Rhs.0, Rhs.1
745838910322 ~0% {4} | JOIN WITH `Location::Location.getStartLine/0#d54f9e6c` CARTESIAN PRODUCT OUTPUT Lhs.0, Lhs.1, Rhs.0, Rhs.1
{4} | REWRITE WITH TEST InOut.1 < InOut.3
406186767271 ~5% {2} | SCAN OUTPUT In.2, In.0
279209432197 ~0% {3} | JOIN WITH fun_decls_40#join_rhs ON FIRST 1 OUTPUT Lhs.1, Lhs.0, Rhs.1
279209432197 ~3% {4} | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 1 OUTPUT Lhs.1, Rhs.1, Lhs.0, Lhs.2
77263627 ~4% {2} | JOIN WITH `Location::Location.getFile/0#dispred#d1f8b5d1` ON FIRST 2 OUTPUT Lhs.2, Lhs.3
11422410724 ~3% {2} | JOIN WITH macroinvocations_20#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.1
18574408764 ~0% {2} | JOIN WITH `Macro::MacroAccess.getOutermostMacroAccess/0#d58b05db_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1
351015697 ~0% {2} | JOIN WITH rpc::UnrealFunctionAnnotation#6f570527 ON FIRST 1 OUTPUT Lhs.0, Lhs.1
return r1
[2025-02-25 03:57:14] Evaluated non-recursive predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts in 239ms (size: 4637).
Evaluated relational algebra for predicate rpc::UnrealRpcDeclaration#5b0cb5dc@7bb217ts with tuple counts:
7976950 ~169140% {1} r1 = JOIN `rpc::UnrealFunctionAnnotation.declaresRpc/0#dispred#aa76493c` WITH `rpc::UnrealFunctionAnnotation.annotates/1#dispred#f5072c6e` ON FIRST 1 OUTPUT Rhs.1
return r1
What's not clear to me though is why the same thing doesn't happen with 2.20.1. Would you also be able to produce the same log file but with 2.20.1?
I believe the performance problem should be fixed by https://github.com/github/codeql/pull/18859
I ran the query on my database, and the UnrealFunctionAnnotation.annotates() predicate took 56m. This is a significant improvement, thanks!
Do you know if there's any chance to reduce it further? The predicate takes about 3m to complete with 2.20.1. I am not sure it would be possible to bring it back to 3m but anything closer would be great!
Do you know if there's any chance to reduce it further?
I don't immediately see an easy way of doing this by just modifying the library, because it's difficult to predict what effect any changes to the library have in other contexts. We believe that https://github.com/github/codeql/pull/18859 is still relatively safe in that respect.
Let me have a think though about whether something can be done on the side of your query.
Thanks! I've updated the predicate to this
predicate annotates(FunctionDeclarationEntry functionDeclaration) {
this.getLocation().getFile() = functionDeclaration.getLocation().getFile() and
this.getLocation().getStartLine() = functionDeclaration.getLocation().getStartLine() - 1
}
because I am only interested in UFUNCTION() macros that are located on the previous line from the function declaration. Apparently, this brought query execution time down to approx 21m. I guess it happened because the new condition is a bit stricter that isBefore(), but not 100% sure.
This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.
Hi @artem-smotrakov
Let me have a think though about whether something can be done on the side of your query.
With the limited time I had for this, I've unfortunately not been able to come up with anything that make things fast again for you. My apologies.
No problem @jketema ! Thanks!
This issue is stale because it has been open 14 days with no activity. Comment or remove the Stale label in order to avoid having this issue closed in 7 days.