
RFC: Improve tabular output formats

Open pombredanne opened this issue 3 years ago • 7 comments

The current CSV output is a mess, albeit a convenient mess. We need something, and quickly. I suggest these short-term and long-term actions:

For now in v31:

  • [x] #3047 The deprecation message should have a link to this PR.

In v32:

  • [ ] Create a new --csv-file option that would only list file level details in this way:

    • file info columns, license expression, copyright, holder, urls, emails, "for_package".
    • no line number columns, no package data
    • exactly one row per file
    • multiple values are joined with a newline in a single cell
    • long values that exceed office tools' cell-size limits are truncated
  • [ ] Create a new --csv-package option that would only list package details:

    • one row per package instance
  • [ ] Create a new --csv-dependency option that would only list dependency details:

    • one row per dependency instance
  • [ ] Create a new --csv-license option that would only list file level license scan information, used for debugging and hidden from the CLI help:

    • one row for each license match with diagnostic details for each match (rule, scores, etc)
  • [ ] Drop the hidden --csv option in v32

  • [ ] Add a new XLSX option that creates a proper spreadsheet with multiple tabs:

    • files
    • packages
    • dependencies
    • summary
    • and possibly: file "package_data" and file "license_detections" tabs, where these essentially mirror the new csv-* options
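The --csv-file behavior proposed above (exactly one row per file, multiple values joined in a single cell, long cells truncated) could be sketched roughly like this. This is a hypothetical illustration, not ScanCode code: the field names (`detected_license_expression`, `copyrights`, `for_packages`, etc.) are assumptions loosely based on the JSON output, and the truncation cap follows Excel's 32,767-character cell limit mentioned later in this thread.

```python
import csv

# Assumed cell-size cap (MS Excel's 32,767-character limit).
CELL_LIMIT = 32767


def flatten_file_entry(entry):
    """Flatten one file-level scan entry into a single CSV row.

    Multiple values are joined with a newline in one cell; overlong
    cells are truncated. Field names are assumptions for illustration.
    """
    def join(values):
        cell = "\n".join(str(v) for v in values)
        return cell[:CELL_LIMIT]

    return {
        "path": entry.get("path", ""),
        "license_expression": entry.get("detected_license_expression", ""),
        "copyrights": join(c["copyright"] for c in entry.get("copyrights", [])),
        "holders": join(h["holder"] for h in entry.get("holders", [])),
        "urls": join(u["url"] for u in entry.get("urls", [])),
        "emails": join(e["email"] for e in entry.get("emails", [])),
        "for_packages": join(entry.get("for_packages", [])),
    }


def write_csv_file(scan_data, stream):
    """Write the file-level rows of a scan to an open text stream."""
    fieldnames = ["path", "license_expression", "copyrights",
                  "holders", "urls", "emails", "for_packages"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for entry in scan_data.get("files", []):
        writer.writerow(flatten_file_entry(entry))  # exactly one row per file
```

For example, a file with two copyright statements would get both in one newline-joined cell, keeping the one-row-per-file invariant; no line-number columns and no package data appear at all.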

For reference, we have these related issues:

  • [ ] #2839 is the key issue and would drop CSV for XLSX
  • [ ] #1551 shows that serialization in a row is always an issue
  • [ ] #2249
  • [ ] #1822
  • [ ] #1398
  • [ ] #829
  • [ ] #830
  • [ ] #2936 where @yahym uses a pivot table
  • [ ] #1330 shows we have documentation issues
  • [ ] #1654 ... combining JSONs to create a CSV is not easy

pombredanne avatar Aug 05 '22 10:08 pombredanne

I was looking into https://www.python-excel.org/ for what we can use to implement this:

https://foss.heptapod.net/openpyxl/openpyxl/ has pandas in its requirements, so probably not something we can use.

The others are libraries that use only the standard library, with no other requirements:

https://github.com/jmcnamara/XlsxWriter seems the best choice: actively maintained, with lots of functionality. The author maintains tools in other languages for the same purpose too :P

https://github.com/PydPiper/pylightxl seems lightweight and can be used by vendoring a single file.

https://github.com/python-excel/xlwt was widely used, but it targets the old Excel formats and is not actively maintained anymore.

@pombredanne what do you think?

AyanSinhaMahapatra avatar Nov 14 '22 11:11 AyanSinhaMahapatra

@AyanSinhaMahapatra IMHO just reuse what is used in SCIO. See https://github.com/nexB/thirdparty-packages/blob/main/pypi/XlsxWriter-3.0.3.tar.gz.ABOUT and https://github.com/nexB/scancode.io/blob/c4523ea0ab5d19d5dd22a5b5af785bac05be583c/scanpipe/views.py#L339

pombredanne avatar Nov 14 '22 11:11 pombredanne

Not arguing that the current CSV file isn't unwieldy, but it's easier to process automatically than multiple tabs. It contains all the data, which is sometimes what you want in a hurry.

My other approach has been to process the json into a custom CSV, but sometimes that's a hassle.

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

rspier avatar Apr 21 '23 18:04 rspier

@rspier you wrote:

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

Of course, but then maybe we can design it so that it has everything AND not too much at the same time, so it is compact and efficient to review? For instance, returning the start and end lines of copyrights and license matches may not be needed there, and we could design something with the key data that fits on a single row per file?

pombredanne avatar Apr 22 '23 18:04 pombredanne

I think "compact and efficient to review" might be something that comes from the other views, while the "big one" is for those cases where you want all the data, or things that aren't in the other formats. It's significantly easier to hide/remove data than it is to add/merge it back in, which is one reason I'd lean towards having a CSV with too much info.

For example, I often use matched_rule__identifier so I can see what the rule was to determine if it's likely a false positive. (i.e. lGpL)

There are a lot of nested fields that I would drop before I dropped start and end line. For example:

  • license__homepage_url
  • license__text_url
  • license__reference_url
  • license__scancode_text_url
  • license__scancode_data_url
  • license__spdx_license_key
  • license__spdx_url

are highly repetitive of each other and of other fields (like license__key).

The same goes for license__name vs. license__short_name.

An alternative idea would be to leverage --custom-template FILE by providing pre-made "kitchen sink" templates, e.g. /path/to/scancode/lib/templates/oldstyle.csv

rspier avatar Apr 24 '23 22:04 rspier

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

armijnhemel avatar Nov 24 '23 16:11 armijnhemel

@armijnhemel re:

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

The problem is that this is too big in practice to be routinely included in a CSV... all the commercial and libre spreadsheets I know of will choke on an AGPL matched text :]

  • LibreOffice: likely 2**31 characters, but very slow with tens of thousands
  • Google Sheets: 50,000 chars
  • MS Excel: 32,767 chars

All tools start to choke at a few thousand characters.
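Those limits suggest a simple guard when a scan value must go into a spreadsheet cell anyway: truncate to the target tool's cap, leaving room for a visible marker so reviewers know the cut happened. This is a hypothetical helper (the limit table quotes the numbers above; the marker text is made up).

```python
# Per-tool cell-size limits quoted above, in characters.
CELL_LIMITS = {
    "excel": 32767,    # MS Excel hard limit
    "gsheet": 50000,   # Google Sheets
}

TRUNCATION_MARKER = "...[truncated]"  # hypothetical marker text


def fit_cell(value, tool="excel"):
    """Truncate a cell value to the target tool's limit, appending a
    marker so a cut value is never mistaken for a complete one."""
    limit = CELL_LIMITS[tool]
    if len(value) <= limit:
        return value
    return value[: limit - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER
```

A value under the cap passes through untouched; an AGPL-sized matched text comes back at exactly the limit with the marker at the end.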

The way out is IMHO using ScanCode.io or ScanCode Workbench, which both display the license detection and matched text loaded from a JSON scan, even if you do not have the original scanned code.

pombredanne avatar Nov 24 '23 16:11 pombredanne