Looking inside a jar file and running Scancode on the Java source files.
Hi @MagielBruntink Magiel,
As a follow-up to yesterday's dev call, I tried to find the source code within the fasten-project GitHub organization that runs Scancode on an unzipped jar file containing Java source files. Unfortunately, I couldn't find any source code related to this task.
Could you please point me to the code you mentioned?
I appreciate any help you can provide. M
Hi @michelescarlato
As can be seen in this example: https://github.com/fasten-project/fasten/blob/fd0e82ac5524d3b5b17c92a4a9234f7f910a5bd0/analyzer/vulnerability-packages-listener/src/test/resources/real-callable-index-message.json the POM Analyzer puts a link to the jar file containing the sources into its message. It is just a matter of HTTP-getting that jar file and unzipping it.
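A minimal sketch of that fetch-and-unzip step might look as follows. The class and method names are illustrative, not FASTEN code, and the download assumes a plain unauthenticated HTTP GET is enough:

```java
import java.io.*;
import java.net.URI;
import java.net.http.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

public class SourcesJarFetcher {

    // Hypothetical helper: HTTP-GET the sources jar from the sourcesUrl
    // produced by the POM Analyzer (assumes a plain GET suffices).
    static InputStream download(String sourcesUrl) throws Exception {
        var client = HttpClient.newHttpClient();
        var request = HttpRequest.newBuilder(URI.create(sourcesUrl)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofInputStream()).body();
    }

    // Unzip the jar stream into targetDir (a jar is just a zip archive),
    // returning the paths of the extracted regular files.
    static List<Path> unzip(InputStream jarStream, Path targetDir) throws IOException {
        var extracted = new ArrayList<Path>();
        try (var zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(targetDir)) continue; // guard against zip-slip
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```

The extracted `.java` files could then be handed to Scancode as a directory.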
@MagielBruntink , thanks for the input.
A follow-up question is:
- when the license detector consumes the `fasten.MetadataDBExtension.out` topic, where will it find the `sourcesUrl` that you indicated to me here?

As Sebastian @proksch mentioned, Kafka messages are encapsulated, but I still don't know how these messages are encapsulated. `sourcesUrl` belongs to the Kafka message produced to the `fasten.POMAnalyzer.out` topic. How can I retrieve it by consuming the `fasten.MetadataDBExtension.out` topic?
It will be wrapped somewhere (this is a messy bit, tbh):
{
  "input": {
    "input": {
      "input": {
        "groupId": "org.apache.logging.log4j",
        "artifactId": "log4j-core",
        "version": "2.14.1"
      },
      "plugin_version": "0.1.2",
      "consumed_at": 1642324728,
      "payload": {
        "date": 1615093953000,
        "repoUrl": "",
        "groupId": "org.apache.logging.log4j",
        "version": "2.14.1",
        "parentCoordinate": "org.apache.logging.log4j:log4j:2.14.1",
        "artifactRepository": "https://repo.maven.apache.org/maven2/",
        "forge": "mvn",
        "sourcesUrl": "https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-core/2.14.1/log4j-core-2.14.1-sources.jar",
        "artifactId": "log4j-core",
        "dependencyData": {
          "dependencyManagement": {
            "dependencies": []
          },
          ......
What I do is search for the `sourcesUrl` field recursively, with a method like the following:
static String findSourcesUrl(JSONObject json) {
    for (var key : json.keySet()) {
        if (key.equals("sourcesUrl")) {
            return json.getString("sourcesUrl");
        } else {
            var other = json.get(key);
            if (other instanceof JSONObject) {
                var nestedResult = findPayload((JSONObject) other);
                if (nestedResult != null) return nestedResult;
            }
        }
    }
    return null;
}
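If it helps, the same recursive lookup can be sketched without the org.json dependency, against plain nested Maps. This is an illustrative, self-contained variant, not the project's actual code:

```java
import java.util.Map;

public class SourcesUrlFinder {

    // Depth-first search for a "sourcesUrl" entry anywhere in a nested
    // message, modelled on the org.json-based method above but written
    // against plain Maps so the sketch has no external dependency.
    static String findSourcesUrl(Map<?, ?> json) {
        for (var e : json.entrySet()) {
            if ("sourcesUrl".equals(e.getKey()) && e.getValue() instanceof String s) {
                return s;
            }
            if (e.getValue() instanceof Map<?, ?> nested) {
                String result = findSourcesUrl(nested);
                if (result != null) return result;
            }
        }
        return null;
    }
}
```

Because the search is recursive, it finds `sourcesUrl` no matter how many `input` wrappers the pipeline adds around the payload.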
Hi @MagielBruntink Magiel,
I am integrating the portion of code that you suggested here and here (I need to extract the package name and the package version, in Debian).
Inside the code you mentioned, there is the `findPayload()` function. What precisely does this function do? Can I see it somewhere? Or is it just the same function, matching the `payload` key instead of `sourcesUrl`?
Hi Michele, find the method here: https://github.com/fasten-project/fasten/blob/76f9997fa2fe3a1ce657f0621a6e2a984afa23ce/analyzer/vulnerability-packages-listener/src/main/java/eu/fasten/analyzer/vulnerabilitypackageslistener/VulnerabilityPackagesListener.java#L147
I have just re-discovered this issue. xD So we did not just talk in a dev call about the problems SIG had with the Flink sync job, we even discussed and illustrated the ease of use of ..-sources.jar files in this very issue. We could have saved ourselves a ton of headache if we had just followed the recommendations here... well, everybody is smarter afterwards.
The Java license detector made heavy use of the messages produced by the RepoCloner. Unfortunately, modifications to the detector's code are required to adapt it to a new approach.
Also, please consider that an approach that avoids Flink, but with a very similar implementation to the Java license detector, has been carried out in Python (where the input Kafka topic is `fasten.MetadataDBPythonExtension.out`).
As you can imagine, the development and deployment of the three different license detectors (Java, Python, and C) are tied to the pipelines themselves, which differ between languages.
Since the Java license detector was mainly developed in July, and the Java pipeline in that period relied heavily on the RepoCloner, the detector looks iteratively into the Kafka records consumed from `RepoCloner.out`.
I only recently discovered, while performing an analysis with @MagielBruntink, that the `repoUrl` (which can contain a GitHub URL that the detector uses to retrieve license information in `spdx` format from the GitHub APIs) is produced by the POM Analyzer.
This means that the outbound license detection based on GitHub URLs can still be performed, even after removing the RepoCloner plugin.
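For context, the GitHub side of that lookup can be sketched roughly like this. The REST endpoint `GET https://api.github.com/repos/{owner}/{repo}/license` does return a `license.spdx_id` field, but the URL parsing and the regex-based field extraction below are simplifications for illustration, not the detector's actual code:

```java
import java.util.regex.*;

public class GitHubLicense {

    // Turn a repoUrl such as "https://github.com/apache/logging-log4j2"
    // into the GitHub license API endpoint, or return null if the URL
    // does not look like a GitHub repository URL.
    static String licenseApiUrl(String repoUrl) {
        var m = Pattern.compile("github\\.com/([^/]+)/([^/]+?)(?:\\.git)?/?$")
                .matcher(repoUrl);
        if (!m.find()) return null;
        return "https://api.github.com/repos/" + m.group(1) + "/" + m.group(2) + "/license";
    }

    // Pull the spdx_id field out of the API's JSON response body
    // (a regex stand-in for a proper JSON parser, for illustration only).
    static String extractSpdxId(String responseBody) {
        var m = Pattern.compile("\"spdx_id\"\\s*:\\s*\"([^\"]+)\"").matcher(responseBody);
        return m.find() ? m.group(1) : null;
    }
}
```

The actual HTTP call would simply GET the returned URL (ideally with an `Accept: application/vnd.github+json` header) and feed the body to the extraction step.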
On the other hand, following the discussion with Magiel, we understood that having a common place where unjarred Maven packages reside could benefit both plugins, Rapid and the License Detector.
As you suggested in the last dev call, this task could be performed directly by the POM Analyzer, avoiding the insertion of yet another plugin into the Java pipeline.
This could be excellent for both Rapid and the License Detector.