Looking inside a jar file and running Scancode on the Java source files.

Open michelescarlato opened this issue 3 years ago • 7 comments

Hi @MagielBruntink Magiel,

As a follow-up to yesterday's dev call, I searched the fasten-project GitHub organization for source code that runs Scancode on an unzipped jar file containing Java source files. Unfortunately, I couldn't find any source code related to this task.

Could you please point me to the code you mentioned?

I appreciate any help you can provide. M

michelescarlato avatar Jan 14 '22 02:01 michelescarlato

Hi @michelescarlato

As can be seen in this example: https://github.com/fasten-project/fasten/blob/fd0e82ac5524d3b5b17c92a4a9234f7f910a5bd0/analyzer/vulnerability-packages-listener/src/test/resources/real-callable-index-message.json the POM Analyzer puts a link to the jar file containing the sources. It is just a matter of fetching that jar file with an HTTP GET and unzipping it.
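
A minimal sketch of that "HTTP GET and unzip" step, using only the JDK (Java 11+). The class and method names (`SourcesJarFetcher`, `fetchAndUnzip`, `unzip`) are hypothetical and not taken from the FASTEN code base; the only input assumed is the `sourcesUrl` value from the message.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of FASTEN.
class SourcesJarFetcher {

    /** Downloads the sources jar with an HTTP GET and unpacks it into targetDir. */
    static List<Path> fetchAndUnzip(String sourcesUrl, Path targetDir)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(sourcesUrl)).GET().build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        try (InputStream body = response.body()) {
            return unzip(body, targetDir);
        }
    }

    /** Unpacks a zip/jar stream into targetDir and returns the extracted files. */
    static List<Path> unzip(InputStream jarStream, Path targetDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries that would be written outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("Blocked suspicious entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```

Calling `SourcesJarFetcher.fetchAndUnzip(sourcesUrl, Files.createTempDirectory("sources"))` would leave the `.java` files in a directory that Scancode can then be pointed at.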

MagielBruntink avatar Jan 14 '22 10:01 MagielBruntink

@MagielBruntink , thanks for the input.

A follow-up question is:

Sebastian (@proksch) mentioned that the Kafka messages are encapsulated, but I still don't know how this encapsulation works.

sourcesUrl belongs to the Kafka message produced to the fasten.POMAnalyzer.out topic. How can I retrieve it when consuming the fasten.MetadataDBExtension.out topic?

michelescarlato avatar Jan 16 '22 00:01 michelescarlato

It will be wrapped somewhere (this is a messy bit, tbh):

{
   "input": {
      "input": {
         "input": {
            "groupId": "org.apache.logging.log4j",
            "artifactId": "log4j-core",
            "version": "2.14.1"
         },
         "plugin_version": "0.1.2",
         "consumed_at": 1642324728,
         "payload": {
            "date": 1615093953000,
            "repoUrl": "",
            "groupId": "org.apache.logging.log4j",
            "version": "2.14.1",
            "parentCoordinate": "org.apache.logging.log4j:log4j:2.14.1",
            "artifactRepository": "https://repo.maven.apache.org/maven2/",
            "forge": "mvn",
            "sourcesUrl": "https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-core/2.14.1/log4j-core-2.14.1-sources.jar",
            "artifactId": "log4j-core",
            "dependencyData": {
               "dependencyManagement": {
                  "dependencies": []
               },
              ......

What I do is search for the sourcesUrl field recursively, with a method like the following:

        static String findSourcesUrl(JSONObject json) {
            for (var key : json.keySet()) {
                if (key.equals("sourcesUrl")) {
                    return json.getString("sourcesUrl");
                } else {
                    var other = json.get(key);
                    if (other instanceof JSONObject) {
                        var nestedResult = findPayload((JSONObject) other);
                        if(nestedResult != null) return nestedResult;
                    }
                }
            }
            return null;
        }
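
A dependency-free sketch of the same recursive lookup, in case it helps to experiment outside the plugin. It assumes the message has already been parsed into nested `Map` values (the method above works on `JSONObject` instead), the recursion calls itself, and the field name is a parameter. `NestedFieldFinder` and `findField` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of FASTEN.
class NestedFieldFinder {

    /** Searches the map and all nested maps for a string field with the given name. */
    static Optional<String> findField(Map<?, ?> json, String fieldName) {
        // Check the current level first.
        Object direct = json.get(fieldName);
        if (direct instanceof String) {
            return Optional.of((String) direct);
        }
        // Otherwise descend into every nested object ("input", "payload", ...).
        for (Object value : json.values()) {
            if (value instanceof Map) {
                Optional<String> nested = findField((Map<?, ?>) value, fieldName);
                if (nested.isPresent()) {
                    return nested;
                }
            }
        }
        return Optional.empty();
    }
}
```

Because the search ignores how deeply the payload is wrapped, `findField(message, "sourcesUrl")` would behave the same no matter how many "input" layers surround it.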

MagielBruntink avatar Jan 16 '22 09:01 MagielBruntink

Hi @MagielBruntink Magiel,

I am integrating the portion of code that you are suggesting here and here (I need to extract the package name and the package version, in Debian).

Inside the code you mentioned, there is the findPayload() function. What exactly does this function do? Can I see it somewhere? Or is it just the same function with the JSON key payload instead of sourcesUrl?

michelescarlato avatar Feb 15 '22 19:02 michelescarlato

Hi Michele, find the method here: https://github.com/fasten-project/fasten/blob/76f9997fa2fe3a1ce657f0621a6e2a984afa23ce/analyzer/vulnerability-packages-listener/src/main/java/eu/fasten/analyzer/vulnerabilitypackageslistener/VulnerabilityPackagesListener.java#L147

MagielBruntink avatar Feb 16 '22 06:02 MagielBruntink

I have just re-discovered this issue. xD So we did not just talk in a dev call about the problems that SIG had with the Flink sync job, but we even discussed and illustrated the ease of use of ..-sources.jar files in this very issue. We could have saved ourselves a ton of headaches if we had just followed these recommendations here... well, everybody is smarter afterwards.

proksch avatar May 19 '22 22:05 proksch

The Java license detector made heavy use of the messages produced by the RepoCloner. Unfortunately, modifications to the detector's code are required to adapt it to a new approach.

Also, please consider that an approach that avoids the use of Flink but with a very similar implementation of the Java license detector has been carried out in Python (where the input Kafka topic is fasten.MetadataDBPythonExtension.out).

As you can imagine, the development and the deployment of the three different license detectors (Java, Python, and C) are tied to the pipeline itself, which differs between languages.

The Java license detector was mainly developed in July, when the Java pipeline relied heavily on the RepoCloner. That's the main reason why the Java license detector iterates over the Kafka records consumed from RepoCloner.out.

I only recently discovered, while performing an analysis with @MagielBruntink, that the repoUrl (which could contain a GitHub URL that the detector uses to retrieve license information in SPDX format from the GitHub APIs) is produced by the POM Analyzer. This means that the outbound license detection based on GitHub URLs can still be performed, even after removing the RepoCloner plugin.
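
As a rough illustration of the GitHub-based lookup, the sketch below turns a `repoUrl` into a GitHub REST API license URL. It assumes the `GET /repos/{owner}/{repo}/license` endpoint, whose JSON response carries the SPDX identifier in `license.spdx_id`; whether the detector uses exactly this endpoint is an assumption. `GitHubLicenseUrl` and `licenseApiUrl` are hypothetical names.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of the license detector.
class GitHubLicenseUrl {

    /** Maps a GitHub repository URL to its license API URL, or empty if it is not a GitHub URL. */
    static Optional<String> licenseApiUrl(String repoUrl) {
        // The POM Analyzer may emit an empty repoUrl, as in the log4j-core message above.
        if (repoUrl == null || repoUrl.isBlank()) {
            return Optional.empty();
        }
        URI uri;
        try {
            uri = URI.create(repoUrl.trim());
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
        String host = uri.getHost();
        if (host == null
                || !(host.equalsIgnoreCase("github.com") || host.equalsIgnoreCase("www.github.com"))) {
            return Optional.empty();
        }
        // Expect a path of the form /{owner}/{repo}[.git][/...].
        String[] parts = uri.getPath().replaceAll("^/+|/+$", "").split("/");
        if (parts.length < 2 || parts[0].isEmpty()) {
            return Optional.empty();
        }
        String repo = parts[1].endsWith(".git")
                ? parts[1].substring(0, parts[1].length() - 4)
                : parts[1];
        if (repo.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("https://api.github.com/repos/" + parts[0] + "/" + repo + "/license");
    }
}
```

An empty result would mean that outbound detection has to fall back to another source, for example scanning the unpacked sources jar.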

On the other hand, following the discussion with Magiel, we understood that having a common place where unjarred maven packages reside could benefit both plugins, Rapid and the License Detector.

As you suggested in the last dev call, this task could be performed directly by the POM Analyzer, preventing the insertion of another plugin in the Java pipeline.

This could be excellent for both Rapid and License Detector.

michelescarlato avatar May 20 '22 08:05 michelescarlato