dash-licenses

Agree on a common data format

Open jmini opened this issue 5 years ago • 4 comments

To work on different ways to collect the list of dependencies, there is a need to agree on a data format.

Right now a dependency is identified by its ClearlyDefined coordinate system, stored in the ContentId java class:

https://github.com/eclipse/dash-licenses/blob/13f6bcd1d42a1705f00772ca6f49814149957d9b/src/main/java/org/eclipse/dash/licenses/ContentId.java#L12-L18

The idea of this coordinate system is to map values of technology-specific repositories (Maven Repository, NPM, …) to something that is technology-independent (see software dependencies coordinates for mapping examples).
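
A minimal sketch of such a mapping, assuming the five-part `type/provider/namespace/name/revision` layout. `CoordinateMapper` and its methods are hypothetical names for illustration, not the actual `ContentId` API, and the `-` placeholder for unscoped npm packages is an assumption about how ClearlyDefined fills an empty namespace:

```java
final class CoordinateMapper {
    private CoordinateMapper() {}

    /** Maps a Maven groupId/artifactId/version to a five-part identifier (provider assumed to be Maven Central). */
    static String fromMaven(String groupId, String artifactId, String version) {
        return String.join("/", "maven", "mavencentral", groupId, artifactId, version);
    }

    /** Maps an npm package name (optionally scoped, e.g. "@babel/core") and version to a five-part identifier. */
    static String fromNpm(String packageName, String version) {
        String namespace = "-"; // assumed placeholder for packages without a scope
        String name = packageName;
        int slash = packageName.indexOf('/');
        if (packageName.startsWith("@") && slash > 0) {
            namespace = packageName.substring(0, slash);
            name = packageName.substring(slash + 1);
        }
        return String.join("/", "npm", "npmjs", namespace, name, version);
    }

    public static void main(String[] args) {
        // prints "maven/mavencentral/com.fasterxml.jackson.core/jackson-annotations/2.9.3"
        System.out.println(fromMaven("com.fasterxml.jackson.core", "jackson-annotations", "2.9.3"));
        // prints "npm/npmjs/@babel/core/7.9.0"
        System.out.println(fromNpm("@babel/core", "7.9.0"));
    }
}
```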

I see at least two other attributes that need to be captured:


1) Source repository (where the dependency comes from)

The provider value used by ClearlyDefined is not really clear. I have raised https://github.com/clearlydefined/clearlydefined/issues/125 to see if the ClearlyDefined project has plans in this area.

In all cases, knowing the repository where the dependency was downloaded from is interesting from an audit point of view and should be captured if available.

For example, the Eclipse Foundation hosts several Maven repositories under https://repo.eclipse.org/, and a dependency coming from another Eclipse project requires fewer checks from the IP point of view.

2) Usage

In the context of a given build, a dependency (and its transitive closure) has a specific usage:

Yarn defines:

  • normal dependencies
  • development dependencies
  • peer dependencies
  • optional dependencies
  • bundled dependencies

Maven scopes:

  • compile
  • provided
  • runtime
  • test
  • system
  • import

The Yarn dependency type and the Maven scope are not exactly the same. As with the ClearlyDefined coordinate system, a common vocabulary, with a mapping from each technology, could be defined.
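
One possible shape for such a vocabulary, as a sketch only. The `Usage` enum, its constants, and the choice of which Maven scope corresponds to which Yarn dependency type are all assumptions made for illustration; the real mapping would need to be agreed on. The Yarn keys are the `package.json` field names:

```java
import java.util.Locale;

enum Usage {
    NORMAL, DEVELOPMENT, PEER, OPTIONAL, BUNDLED, PROVIDED, UNKNOWN;

    /** Maps a Maven scope to the common vocabulary (assumed mapping). */
    static Usage fromMavenScope(String scope) {
        switch (scope.toLowerCase(Locale.ROOT)) {
            case "compile":
            case "runtime":
                return NORMAL;
            case "test":
                return DEVELOPMENT;
            case "provided":
            case "system":
                return PROVIDED;
            default:
                // e.g. "import", which only applies inside dependencyManagement
                return UNKNOWN;
        }
    }

    /** Maps a package.json dependency field name to the common vocabulary (assumed mapping). */
    static Usage fromYarnType(String field) {
        switch (field) {
            case "dependencies":
                return NORMAL;
            case "devDependencies":
                return DEVELOPMENT;
            case "peerDependencies":
                return PEER;
            case "optionalDependencies":
                return OPTIONAL;
            case "bundledDependencies":
            case "bundleDependencies":
                return BUNDLED;
            default:
                return UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromMavenScope("test"));          // prints "DEVELOPMENT"
        System.out.println(fromYarnType("devDependencies")); // prints "DEVELOPMENT"
    }
}
```

With a mapping like this, a Maven `test` dependency and a Yarn development dependency land on the same value, which is what a filter for "works with" content would need.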

This information is needed in order to evaluate the "works with" dependencies (see issue #13)


I think that JSON could be used:

{
    "coordinates": {
        "type": "maven",
        "provider": "mavencentral",
        "namespace": "com.fasterxml.jackson.core",
        "name": "jackson-annotations",
        "revision": "2.9.3"
    },
    "repositoryLocation": "https://jcenter.bintray.com/",
    "usage": "normal"
}

jmini, May 12 '20 12:05

ClearlyDefined is the common format. For our purposes, the five-part identifier is sufficient as an identifier for content. For example, "maven/mavencentral/commons-httpclient/commons-httpclient/3.0.1" unambiguously identifies content.
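
As a rough sketch, splitting such an identifier into its five parts could look like this. `FivePartId` is a hypothetical class for illustration, not the tool's `ContentId`:

```java
final class FivePartId {
    final String type;
    final String provider;
    final String namespace;
    final String name;
    final String revision;

    private FivePartId(String[] parts) {
        this.type = parts[0];
        this.provider = parts[1];
        this.namespace = parts[2];
        this.name = parts[3];
        this.revision = parts[4];
    }

    /** Splits "type/provider/namespace/name/revision" and rejects anything that is not exactly five non-empty parts. */
    static FivePartId parse(String id) {
        String[] parts = id.split("/", -1); // -1 keeps trailing empty parts so they can be rejected
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected five parts: " + id);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty part in: " + id);
            }
        }
        return new FivePartId(parts);
    }

    public static void main(String[] args) {
        FivePartId id = parse("maven/mavencentral/commons-httpclient/commons-httpclient/3.0.1");
        // prints "commons-httpclient:commons-httpclient:3.0.1"
        System.out.println(id.namespace + ":" + id.name + ":" + id.revision);
    }
}
```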

Given an identifier, we can query ClearlyDefined (or the Eclipse IPZilla) and get more information. ClearlyDefined provides the sourceURL, for example, in the results from their API. In at least a few cases, I've changed the sourceURL in the ClearlyDefined data to point to a commit in a Git repository rather than the default pointer to the source JAR in Maven Central. So these sourceURLs may change as better information becomes available.

This tool can easily be extended to include this information in the output if desired by an adopter (since we don't need that information, we'd have to make including it in the output optional).

At least theoretically, content with a similar identifier, e.g. "maven/eclipseNexus/commons-httpclient/commons-httpclient/3.0.1" (assuming that we make "eclipseNexus" meaningful), could actually be different: content from some random Nexus instance could be entirely different, or could include significant additional intellectual property. If we decide that there is value in doing so, we will point the ClearlyDefined harvester at our Nexus instances and have it add that distinctiveness to its own.

From the perspective of implementing the Eclipse IP Policy, we only care about the content that is identified as "workswith" and IPZilla can provide us with this information (we have some work to do to make this work, but we have the data). From our perspective, dependencies of a workswith can just be trimmed from the dependency list. It would be great if usage information can be used to trim the list before the content is delivered to the tool. My feeling is this will manifest as best practices.

Flagging workswith content in the output is something that we should do.

"For example, the Eclipse Foundation hosts several Maven repositories under https://repo.eclipse.org/, and a dependency coming from another Eclipse project requires fewer checks from the IP point of view."

Not true.

waynebeaton, May 12 '20 14:05

I found an interesting case today.

The ClearlyDefined simplification of the Maven coordinates might be dangerous. The Maven classifier matters and should not be dropped.

The library has the following dependencies:

  • jffi-1.2.16.jar
  • jffi-1.2.16-native.jar

In the POM:

    <dependency>
      <groupId>com.github.jnr</groupId>
      <artifactId>jffi</artifactId>
      <version>1.2.16</version>
    </dependency>
    <dependency>
      <groupId>com.github.jnr</groupId>
      <artifactId>jffi</artifactId>
      <version>1.2.16</version>
      <classifier>native</classifier>
    </dependency>

Both share the same POM, but if you need to check that the code in those JARs is compliant, it is dangerous to map them to one entry, because you need to verify both.
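
To illustrate the collision, here is a small sketch. `ClassifierCollision`, `toId`, and `groupById` are hypothetical helpers, and the `mavencentral` provider is assumed; the point is only that a five-part identifier has no slot for the classifier, so both JARs end up under one entry:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class ClassifierCollision {
    private ClassifierCollision() {}

    /** Builds the five-part identifier, which has no place for a classifier. */
    static String toId(String groupId, String artifactId, String version) {
        return "maven/mavencentral/" + groupId + "/" + artifactId + "/" + version;
    }

    /** Groups JAR file names by identifier; each row is {groupId, artifactId, version, classifier or null}. */
    static Map<String, Set<String>> groupById(String[][] artifacts) {
        Map<String, Set<String>> byId = new LinkedHashMap<>();
        for (String[] a : artifacts) {
            String file = a[1] + "-" + a[2] + (a[3] == null ? "" : "-" + a[3]) + ".jar";
            byId.computeIfAbsent(toId(a[0], a[1], a[2]), k -> new LinkedHashSet<>()).add(file);
        }
        return byId;
    }

    public static void main(String[] args) {
        String[][] artifacts = {
            {"com.github.jnr", "jffi", "1.2.16", null},
            {"com.github.jnr", "jffi", "1.2.16", "native"},
        };
        // prints "maven/mavencentral/com.github.jnr/jffi/1.2.16 -> [jffi-1.2.16.jar, jffi-1.2.16-native.jar]"
        groupById(artifacts).forEach((id, files) -> System.out.println(id + " -> " + files));
    }
}
```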

jmini, May 20 '20 15:05

Let me add:

Executing

mvn dependency:list

on Linux, I get:

org.openjfx:javafx-base:jar:14:compile
org.openjfx:javafx-base:jar:linux:14:compile

On Windows, I get:

org.openjfx:javafx-base:jar:14:compile
org.openjfx:javafx-base:jar:win:14:compile

That will make it hard to include "all".
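
One way to approach this, sketched under assumptions: run `mvn dependency:list` on each platform, save the output, and merge the results while keeping the classifier. `DependencyListMerger` is a hypothetical helper; it assumes lines have the form `groupId:artifactId:type:version:scope` or `groupId:artifactId:type:classifier:version:scope`, with any `[INFO]` prefix already stripped:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class DependencyListMerger {
    private DependencyListMerger() {}

    /** Returns "groupId:artifactId:version", with ":classifier" appended when the line has one. */
    static String keyOf(String line) {
        String[] p = line.trim().split(":");
        if (p.length == 5) {
            return p[0] + ":" + p[1] + ":" + p[3];
        }
        if (p.length == 6) {
            return p[0] + ":" + p[1] + ":" + p[4] + ":" + p[3];
        }
        throw new IllegalArgumentException("Unexpected dependency line: " + line);
    }

    /** Merges the outputs collected on several platforms into one sorted, de-duplicated set. */
    @SafeVarargs
    static Set<String> merge(List<String>... platformOutputs) {
        Set<String> all = new TreeSet<>();
        for (List<String> output : platformOutputs) {
            for (String line : output) {
                all.add(keyOf(line));
            }
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> linux = Arrays.asList(
            "org.openjfx:javafx-base:jar:14:compile",
            "org.openjfx:javafx-base:jar:linux:14:compile");
        List<String> windows = Arrays.asList(
            "org.openjfx:javafx-base:jar:14:compile",
            "org.openjfx:javafx-base:jar:win:14:compile");
        // Three distinct entries: the plain artifact plus the linux and win variants.
        merge(linux, windows).forEach(System.out::println);
    }
}
```

This only covers the platforms the build was actually run on, so it does not remove the underlying problem.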

boaks, Sep 18 '20 06:09

I've opened an issue on ClearlyDefined to ask about how we can represent classifiers. clearlydefined/service#786

waynebeaton, Dec 30 '20 03:12