Code Scanning support for external data
Background
For users who want to incorporate some external data into a database, odasa supported a workflow based on csv files. During database creation, a specific subdirectory was scanned for csv files, and their content was used to populate the `externalData` table that is present in the `dbScheme` for each language. This feature was rarely used by users on their own. Instead, it mostly gave us a flexible way to implement specific user requests in a "quick-and-dirty" fashion by rapidly extending databases with custom entries.
This feature has recently been ported to the `codeql-cli`. Incorporating `external_data.csv` into the example database (`test_database` below) to supplement the tables that are generated from compiling `test.c` requires the following steps:
- `codeql database init --language=cpp --source-root=. test_database`
- `codeql database trace-command test_database gcc test.c`
- `codeql database index-files --language=csv --include=external_data.csv test_database` (the external data is added here)
- `codeql database finalize test_database`
Note that this approach does not use any new command line arguments along the lines of `--external-data=external_data.csv`. Instead, csv is treated as a full language and comes with an extractor that adheres to the common interface. Note furthermore that `test_database` is effectively a multi-language database without us really supporting multi-language databases: it only works because the csv `dbScheme` is a subset of the cpp `dbScheme`.
It would be desirable to enable the use of external data in csv files for Code Scanning. Having the feature available in the `codeql` CLI should make the backend implementation straightforward, but it is not yet clear what the interface for this should look like.
The Challenge
We need to introduce this in a way that already anticipates multi-language databases. The aim should be to introduce `{main language} + csv` databases in such a way that they do not become a special (legacy) case once multi-language databases ship. Therefore, we don't want any notion of "external data" or "csv files" specifically in the Code Scanning configuration.
Instead, we want to treat this as creating a two-language database (where one language happens to be "csv" and the csv files happen to contain "external data") that is configured via the same mechanism that will later be used for generic multi-language databases. This means that we have to anticipate quite a few future design decisions.
Some initial discussions with @aeisenberg made it clear that this might touch on pretty complex design decisions on the Code Scanning side and will require significant input from @robertbrignull and @jhutchings1.
Potential Implementations
Fully automatic
We could always look for csv files in the entire repository when using `autobuild`. All csv files would be incorporated into the database fully automatically, in addition to the detected principal language.
This would make sense if the aim were to abandon the rule that "autobuild [...] only ever attempts to build one compiled language for a repository" once multi-language databases become available. If, eventually, files of all languages present in the repository are to be included in the generated database, then the proposed mechanism is quite natural. Otherwise, it would seem like an unjustified special treatment of csv files.
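For concreteness, a minimal sketch of what this would mean for a workflow, assuming the current codeql-action syntax (the csv behaviour itself is hypothetical and not implemented):

```yaml
# Sketch only: a standard configuration with nothing csv-specific in it.
# Under the "fully automatic" proposal, autobuild would additionally index
# every csv file it finds in the repository (hypothetical behaviour).
- uses: github/codeql-action/init@v1
  with:
    languages: cpp
- uses: github/codeql-action/autobuild@v1
- uses: github/codeql-action/analyze@v1
```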
Explicit two-language
We could enable the automatic indexing of csv files in the repository only if precisely two languages, one of them csv, have been explicitly selected in the configuration, along the lines of

```yaml
with:
  languages: cpp, csv
```
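In workflow terms, this selection would presumably happen in the init step, roughly as sketched below (note that `csv` is not an accepted value for the `languages` input today, so this is purely illustrative):

```yaml
# Sketch only: `csv` is not currently a supported value of the `languages` input.
- uses: github/codeql-action/init@v1
  with:
    languages: cpp, csv   # exactly two languages, one of them csv
```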
Again, is it planned that multi-language databases will eventually be constructed this way? If so, then this would seem like a sensible approach. Otherwise, not so much.
Manual indexing
Instead of relying on `autobuild`, indexing of csv files could be done with an explicit `index` mechanism.
It would be possible to just add a configuration option that directly corresponds to `codeql database index-files`. The use of external data is quite advanced, so not having it available with `autobuild` would seem reasonable. However, we would not want to introduce this kind of feature just for csv files. Therefore, this would probably only make sense if an explicit indexing command has been deemed useful in other contexts as well.
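As a purely hypothetical sketch, such an option could live in the Code Scanning configuration file and mirror the arguments of `codeql database index-files` directly (none of these keys exist today):

```yaml
# Hypothetical sketch: the `index-files` key does not exist in the Code Scanning
# configuration; it simply mirrors the arguments of `codeql database index-files`.
index-files:
  - language: csv
    include:
      - external_data.csv
```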
Wait for multi-language databases
Finally, we could just wait and see how multi-language databases develop. This would allow us to enable the use of external data with the benefit of hindsight from implementing multiple languages more generally.
My thinking is that we should add a new config option to explicitly opt into csv extraction, and then issue the extra `codeql` CLI commands where appropriate.
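For illustration only, that opt-in could be a new input on the init action (the `external-data` name and the behaviour are hypothetical); when set, the action would run `codeql database index-files --language=csv` on the matching files before finalizing the database:

```yaml
# Hypothetical sketch: the `external-data` input does not exist today.
# If set, the action would run
#   codeql database index-files --language=csv --include=<pattern> <database>
# before `codeql database finalize`.
- uses: github/codeql-action/init@v1
  with:
    languages: cpp
    external-data: external_data.csv
```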
Ping: @jhutchings1 We'll be starting work on this soon. Any concerns with this?
Thanks for the ping on Slack, @ginsbach! I am not aware of any code scanning customers who have expressed interest in this (@jhutchings will be able to confirm). I only know of CodeQL power users who are interested in using CSV data.
I'm sure we will see some use cases in this area in the future, but I don't think we need to do anything now, so there is no need for further work here for the time being.
Thanks for the ping, team. Sorry I missed this one earlier.
I have yet to hear of any customer requesting this particular user scenario. I don't think we need to prioritize any work here for the moment.
We will not work on any implementation for now. When customers request it, this issue should be a good place to start picking it up again.