Looking inside a jar file and running Scancode on the Java source files.
Hi @MagielBruntink Magiel,
As a follow-up to yesterday's dev call, I tried to find the source code within the fasten-project GitHub organization that runs Scancode on an unzipped jar file containing Java source files. Unfortunately, I couldn't find any source code related to this task.
Could you please point me to the code you mentioned?
I appreciate any help you can provide. M
Hi @michelescarlato
As can be seen in this example: https://github.com/fasten-project/fasten/blob/fd0e82ac5524d3b5b17c92a4a9234f7f910a5bd0/analyzer/vulnerability-packages-listener/src/test/resources/real-callable-index-message.json the POM Analyzer puts a link to the jar file containing the sources into its message. It is just a matter of HTTP-getting that jar file and unzipping it.
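A minimal sketch of that fetch-and-unzip step might look as follows. The class and method names are illustrative, not FASTEN code, and the download assumes a plain unauthenticated HTTP GET is enough:

```java
import java.io.*;
import java.net.URI;
import java.net.http.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

public class SourcesJarFetcher {

    // Hypothetical helper: HTTP-GET the sources jar from the sourcesUrl
    // produced by the POM Analyzer (assumes a plain GET suffices).
    static InputStream download(String sourcesUrl) throws Exception {
        var client = HttpClient.newHttpClient();
        var request = HttpRequest.newBuilder(URI.create(sourcesUrl)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofInputStream()).body();
    }

    // Unzip the jar stream into targetDir (a jar is just a zip archive),
    // returning the paths of the extracted regular files.
    static List<Path> unzip(InputStream jarStream, Path targetDir) throws IOException {
        var extracted = new ArrayList<Path>();
        try (var zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(targetDir)) continue; // guard against zip-slip
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```

The extracted `.java` files could then be handed to Scancode as a directory.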
@MagielBruntink , thanks for the input.
A follow-up question is:
- when the license detector consumes the `fasten.MetadataDBExtension.out` topic, where will it find the `sourcesUrl` that you indicated to me here?

As Sebastian @proksch mentioned, Kafka messages are encapsulated, but I still don't know how these messages are encapsulated. `sourcesUrl` belongs to the Kafka message produced to the `fasten.POMAnalyzer.out` topic. How can I retrieve it by consuming the `fasten.MetadataDBExtension.out` topic?
It will be wrapped somewhere (this is a messy bit, tbh):
{
  "input": {
    "input": {
      "input": {
        "groupId": "org.apache.logging.log4j",
        "artifactId": "log4j-core",
        "version": "2.14.1"
      },
      "plugin_version": "0.1.2",
      "consumed_at": 1642324728,
      "payload": {
        "date": 1615093953000,
        "repoUrl": "",
        "groupId": "org.apache.logging.log4j",
        "version": "2.14.1",
        "parentCoordinate": "org.apache.logging.log4j:log4j:2.14.1",
        "artifactRepository": "https://repo.maven.apache.org/maven2/",
        "forge": "mvn",
        "sourcesUrl": "https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-core/2.14.1/log4j-core-2.14.1-sources.jar",
        "artifactId": "log4j-core",
        "dependencyData": {
          "dependencyManagement": {
            "dependencies": []
          },
          ......
What I do is search for the `sourcesUrl` field recursively, with a method like the following:
static String findSourcesUrl(JSONObject json) {
    for (var key : json.keySet()) {
        if (key.equals("sourcesUrl")) {
            return json.getString("sourcesUrl");
        } else {
            var other = json.get(key);
            if (other instanceof JSONObject) {
                var nestedResult = findPayload((JSONObject) other);
                if (nestedResult != null) return nestedResult;
            }
        }
    }
    return null;
}
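If it helps, the same recursive lookup can be sketched without the org.json dependency, against plain nested Maps. This is an illustrative, self-contained variant, not the project's actual code:

```java
import java.util.Map;

public class SourcesUrlFinder {

    // Depth-first search for a "sourcesUrl" entry anywhere in a nested
    // message, modelled on the org.json-based method above but written
    // against plain Maps so the sketch has no external dependency.
    static String findSourcesUrl(Map<?, ?> json) {
        for (var e : json.entrySet()) {
            if ("sourcesUrl".equals(e.getKey()) && e.getValue() instanceof String s) {
                return s;
            }
            if (e.getValue() instanceof Map<?, ?> nested) {
                String result = findSourcesUrl(nested);
                if (result != null) return result;
            }
        }
        return null;
    }
}
```

Because the search is recursive, it finds `sourcesUrl` no matter how many `input` wrappers the pipeline adds around the payload.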
Hi @MagielBruntink Magiel,
I am integrating the portion of code that you suggested here and here (I need to extract the package name and the package version, in Debian).
Inside the code you mentioned, there is the `findPayload()` function. What precisely does this function do? Can I see it somewhere? Or is it just the same function, matching the `payload` key instead of `sourcesUrl`?
Hi Michele, find the method here: https://github.com/fasten-project/fasten/blob/76f9997fa2fe3a1ce657f0621a6e2a984afa23ce/analyzer/vulnerability-packages-listener/src/main/java/eu/fasten/analyzer/vulnerabilitypackageslistener/VulnerabilityPackagesListener.java#L147
I have just re-discovered this issue. xD So we did not just talk in a dev call about the problems SIG had with the Flink sync job, we even discussed and illustrated the ease of use of ..-sources.jar files in this very issue. We could have saved ourselves a ton of headache if we had just followed the recommendations here... well, everybody is smarter afterwards.
The Java license detector made heavy use of the messages produced by the RepoCloner. Unfortunately, modifications to the detector's code are required to adapt it to a new approach.
Also, please consider that an approach that avoids Flink, but with a very similar implementation to the Java license detector, has been carried out in Python (where the input Kafka topic is `fasten.MetadataDBPythonExtension.out`).
As you can imagine, the development and deployment of the three different license detectors (Java, Python, and C) are tied to the pipelines themselves, which differ between languages.
Since the Java license detector was mainly developed in July, and the Java pipeline in that period relied heavily on the RepoCloner, the detector looks iteratively into the Kafka records consumed from `RepoCloner.out`.
I only recently discovered, while performing an analysis with @MagielBruntink, that the `repoUrl` (which can contain a GitHub URL that the detector uses to retrieve license information in `spdx` format from the GitHub APIs) is produced by the POM Analyzer.
This means that the outbound license detection based on GitHub URLs can still be performed, even after removing the RepoCloner plugin.
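For context, the GitHub side of that lookup can be sketched roughly like this. The REST endpoint `GET https://api.github.com/repos/{owner}/{repo}/license` does return a `license.spdx_id` field, but the URL parsing and the regex-based field extraction below are simplifications for illustration, not the detector's actual code:

```java
import java.util.regex.*;

public class GitHubLicense {

    // Turn a repoUrl such as "https://github.com/apache/logging-log4j2"
    // into the GitHub license API endpoint, or return null if the URL
    // does not look like a GitHub repository URL.
    static String licenseApiUrl(String repoUrl) {
        var m = Pattern.compile("github\\.com/([^/]+)/([^/]+?)(?:\\.git)?/?$")
                .matcher(repoUrl);
        if (!m.find()) return null;
        return "https://api.github.com/repos/" + m.group(1) + "/" + m.group(2) + "/license";
    }

    // Pull the spdx_id field out of the API's JSON response body
    // (a regex stand-in for a proper JSON parser, for illustration only).
    static String extractSpdxId(String responseBody) {
        var m = Pattern.compile("\"spdx_id\"\\s*:\\s*\"([^\"]+)\"").matcher(responseBody);
        return m.find() ? m.group(1) : null;
    }
}
```

The actual HTTP call would simply GET the returned URL (ideally with an `Accept: application/vnd.github+json` header) and feed the body to the extraction step.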
On the other hand, following the discussion with Magiel, we understood that having a common place where unjarred Maven packages reside could benefit both plugins, Rapid and the License Detector.
As you suggested in the last dev call, this task could be performed directly by the POM Analyzer, avoiding the insertion of yet another plugin into the Java pipeline.
This could be excellent for both Rapid and the License Detector.