scancode-toolkit

Version JSON output data format

Open pombredanne opened this issue 4 years ago • 16 comments

As part of #2601, this is a first essential step before we start modifying more things to improve package reporting. This is also needed to support:

  • #2350
  • #2381
  • #2389
  • #2278

pombredanne avatar Aug 20 '21 13:08 pombredanne

Here is a first take on the policy there:

  1. the version string is using this format scancode-toolkit-data-format-1.1 where the last two segments represent a semver-like version.
  2. the first segment is the major version of the data format; it is incremented when there are attributes that are removed, renamed, changed or moved.
  3. the second segment is the minor version of the data format; it is incremented when there are attributes that are only added.
  4. we store the version string in our JSON output and display that also in the help, using a data_format_version attribute
  5. this data format versioning is strictly for the JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability
  6. for now, the format version is incremented by hand and only one increment per ScanCode tagged release is needed
  7. We will document in the CHANGELOG the format changes in new format versions
  8. In a given released code version, ScanCode TK may support only two data format versions: the default, current version, and the next experimental version. We will update the CLI and functions to accept a new flag to select the next, experimental data format version (may be --next-data-format or --experimental-data-format or another flag name TBD)
  9. By using --from-json xxx --json yyy we should be able to convert data from the current, default data format to the next, experimental format.
  10. For any version we should provide a doc on the format #2008
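
A minimal sketch of how a consumer could split such a version string into its major and minor segments. The helper name and the regular expression are assumptions made for illustration only; they are not part of ScanCode.

```python
import re

# Hypothetical pattern: the prefix and the two numeric segments follow item 1
# of the policy above. Nothing here is an actual ScanCode API.
VERSION_RE = re.compile(r"^scancode-toolkit-data-format-(\d+)\.(\d+)$")


def parse_data_format_version(version_string):
    """Return (major, minor) as integers, or raise ValueError."""
    match = VERSION_RE.match(version_string)
    if not match:
        raise ValueError(f"Not a data format version string: {version_string!r}")
    major, minor = match.groups()
    return int(major), int(minor)


print(parse_data_format_version("scancode-toolkit-data-format-1.1"))
```

Returning integers rather than strings keeps comparisons numeric, so 1.10 sorts after 1.9.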

pombredanne avatar Aug 20 '21 14:08 pombredanne

@JonoYang @tdruez @AyanSinhaMahapatra @sschuberth @tsteenbe ping. Feedback welcomed. This is rather non-controversial.

pombredanne avatar Aug 20 '21 14:08 pombredanne

it is incremented when there are attributes that are removed, renamed, changed or moved.

You probably should clarify that "moved" does not refer to changing the order at the same level, as that's not something that would break deserialization.

  1. we store the version string in our JSON output and display that also in the help, using a data_format_version attribute

Maybe you should also make clear that the data_format_version attribute itself must never move, or that it always appears as the first attribute in the file, or something like that.

per ScanCode tagged release is needed

"if needed"

sschuberth avatar Aug 20 '21 14:08 sschuberth

@mjherzog @DennisClark your comments are welcomed too.

pombredanne avatar Aug 23 '21 08:08 pombredanne

This makes sense to me. It is a "nice" idea to have one versioning convention for a whole system, but we are all learning about the important differences in licensing between software and data. So this sounds like using the right tool for the job.

mjherzog avatar Aug 23 '21 15:08 mjherzog

@pombredanne

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

JonoYang avatar Aug 23 '21 16:08 JonoYang

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

@JonoYang that's a good point as these are tightly coupled. :| This needs a bit of extra thinking.

pombredanne avatar Aug 23 '21 16:08 pombredanne

@indirabhatt @maxhbr @soimkim ping too, FYI :)

pombredanne avatar Aug 23 '21 16:08 pombredanne

Here is the updated documentation based on the feedback above (@sschuberth :heart: ):

Output format version policy

We version the JSON output from ScanCode-Toolkit using this approach:

  1. The version string is using this format scancode-toolkit-output-format-1.1 where the last segments after the dash represent a semver-like version of 1.1.

  2. The first segment is the major version of the output format; it is incremented when attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.

  3. The second segment is the minor version of the output format; it is incremented when the changes are only additions of attributes to the JSON output.

  4. We store the version string in the JSON output object as the first attribute and display that also in the help, using the new output_format_version attribute.

  5. This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability (or there is some other rationale and convention for versioning, as for SPDX).

  6. For now, the output format version is incremented by hand, and only with a new ScanCode tagged release when output format changes require it.

  7. We document in the CHANGELOG the output format changes in any new format version.

  8. In a given released code version, ScanCode supports two output format versions: the default, current version, and the future version. The command line and core API functions will accept a new flag to select the future output format version (using --future-format option name).

  9. When using --from-json xxx.json --json yyy.json --future-format we will be able to convert data from the current, default JSON output format to the next, future JSON output format.

  10. For any format version we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See #2008 for details.
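
The major/minor rule in items 2 and 3 can be sketched as a comparison of attribute names between two JSON objects. This is only an illustration under stated assumptions: the function names are hypothetical, the comparison looks at attribute names only (so an attribute whose type or meaning "changed" is not detected), and resetting the minor segment on a major bump is the usual semver convention rather than something the policy spells out.

```python
def attribute_paths(data, prefix=""):
    """Collect the set of dotted attribute paths found in JSON-like data."""
    paths = set()
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(data, list):
        # All items of a list share the same "[]" path segment.
        for item in data:
            paths |= attribute_paths(item, prefix + "[]")
    return paths


def classify_change(old, new):
    """Return "major", "minor" or "none" for a change from old to new."""
    old_paths = attribute_paths(old)
    new_paths = attribute_paths(new)
    if old_paths - new_paths:
        # Something was removed, renamed or moved to another level.
        return "major"
    if new_paths - old_paths:
        # Attributes were only added.
        return "minor"
    # Same attribute names: reordering alone is not a change.
    return "none"


def bump(version, change):
    """Apply a classified change to a (major, minor) tuple."""
    major, minor = version
    if change == "major":
        return (major + 1, 0)
    if change == "minor":
        return (major, minor + 1)
    return version


old = {"path": "a.py", "licenses": [{"key": "mit"}]}
reordered = {"licenses": [{"key": "mit"}], "path": "a.py"}
added = {"path": "a.py", "licenses": [{"key": "mit"}], "size": 12}
renamed = {"path": "a.py", "license_detections": [{"key": "mit"}]}

for candidate in (reordered, added, renamed):
    change = classify_change(old, candidate)
    print(change, bump((1, 1), change))
```

Because paths are collected into sets, the order of attributes at any level has no effect on the result, which matches the "reordering is not a change" rule.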

pombredanne avatar Aug 30 '21 20:08 pombredanne

After extensive review, supporting multiple versions of the output data format at once is an immense task! Much simpler on paper than in practice... therefore I think we will instead only track which version of the data format is in a given SCTK version, and we can commit to limiting the number of major data format version changes to possibly no more than once a quarter.

The data format version and the documentation should be enough for users IMHO. The effort to have the current and future versions would be similar to maintaining two branches in the same codebase and making continuous forward ports and back ports to each branch. This is too much work for too little benefit.

pombredanne avatar Sep 06 '21 14:09 pombredanne

@pombredanne should this be closed as this is merged?

AyanSinhaMahapatra avatar Sep 22 '21 10:09 AyanSinhaMahapatra

@AyanSinhaMahapatra not yet... we still need to re/write the documentation AND we need to add it to the docs

pombredanne avatar Sep 22 '21 10:09 pombredanne

Here is an updated overall version policy. Because of the semver switch, and as discussed in the weekly community call, the next version will be 30.0.0. The initial data format version is at 1.0.0.

Versioning approach

ScanCode is composed of code and data (mostly license data used for license detection).

Historically we tried using calver to also convey that the data embedded in ScanCode was updated, but it proved to be less effective than we thought, so we are switching back to semver, which is more useful for users.

We are therefore now using this new versioning approach:

  • Code and data releases are versioned using semver as documented at https://semver.org/.

  • Significant changes in the license or copyright detection data are considered a major version change even if there are no code changes. The rationale is that in our case the data has the same impact as the code. Using outdated data is like using old code and means that several licenses may not be detected correctly.

  • We will signal separately with warning messages when ScanCode needs to be upgraded because its data and/or code are out of date.

In addition to the main code version, we also maintain a secondary output data format version, also using semver but with two segments. The versioning approach is adapted for data this way:

  • The first segment --the major version-- is incremented when data attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.

  • The second segment --the minor version-- of the output format is incremented when attributes are only added to the JSON output.

  • We store the output format version string in the JSON output object as the first attribute and display that also in the help.

  • This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability (or there may be some other rationale and convention for versioning, as for SPDX).

  • The output format version is incremented when a new ScanCode tagged release is published.

  • We document in the CHANGELOG the output format changes in any new format version.

  • For any format version changes, we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See #2008 for details.
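
From the consumer side, the policy above suggests a simple compatibility check: a tool written against one output format version can read any scan with the same major version and an equal or higher minor version, since minor bumps only add attributes. The sketch below assumes a top-level attribute named output_format_version (the name used earlier in this thread) holding a bare dotted version such as "1.1" or "1.0.0"; the actual location and spelling in a real scan may differ, and the helper names are hypothetical.

```python
import json

# Hypothetical: the (major, minor) format version this consumer was written for.
SUPPORTED = (1, 0)


def read_output_format_version(scan):
    """Return (major, minor) from the assumed top-level version attribute."""
    # Only the first two segments matter for this check; a third is ignored.
    major, minor = scan["output_format_version"].split(".")[:2]
    return int(major), int(minor)


def can_consume(scan, supported=SUPPORTED):
    """True when the scan has the same major and an equal or newer minor."""
    major, minor = read_output_format_version(scan)
    return major == supported[0] and minor >= supported[1]


scan = json.loads('{"output_format_version": "1.1", "files": []}')
print(can_consume(scan))
```

A scan with a lower minor version than the consumer expects is rejected too, because attributes the consumer relies on may not exist yet in that older format.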

pombredanne avatar Sep 22 '21 13:09 pombredanne

As part of this I am also adding a new outdated version notification: this will be listed in the scan headers when the version is either out of date on PyPI or 90 days old. Before that we were only displaying a warning in the CLI on stderr after doing a remote PyPI version check. Note that the remote PyPI check is now optional thanks to @yns88's patch.

pombredanne avatar Sep 23 '21 08:09 pombredanne

This is the CHANGELOG entry:

The scan results and the CLI now display an outdated version warning when the installed ScanCode version is older than 90 days. This is to warn users that they are relying on outdated, likely buggy, insecure and inaccurate scan results and encourage them to update to a newer version. This check is done entirely locally, based on date comparisons.
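
A purely local, date-based check like the one described can be sketched as follows. The function name, the release_date argument and the way the release date would be obtained are assumptions for illustration; this is not the code that was merged.

```python
from datetime import date, timedelta

# The 90-day threshold comes from the CHANGELOG entry above.
MAX_AGE = timedelta(days=90)


def is_outdated(release_date, today=None, max_age=MAX_AGE):
    """Return True when release_date is more than max_age before today."""
    today = today or date.today()
    return today - release_date > max_age


release = date(2021, 9, 23)  # hypothetical release date of the installed version
print(is_outdated(release, today=release + timedelta(days=91)))
```

No network access is involved: the only inputs are a release date known to the installed code and the current local date.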

pombredanne avatar Sep 23 '21 08:09 pombredanne

https://github.com/nexB/scancode-toolkit-reference-scans has been added to store scancode-toolkit reference scans and documentation on output version changes with diffs. Documentation: https://scancode-toolkit-reference-scans.readthedocs.io/en/latest/

This needs to be added to scancode-toolkit documentation.

AyanSinhaMahapatra avatar Oct 13 '21 16:10 AyanSinhaMahapatra

This is now versioned and merged. Closing

pombredanne avatar Aug 11 '22 12:08 pombredanne