RFC: Improve tabular output formats
The current CSV output is a mess, albeit a convenient mess. We need something, and quickly. I suggest these short-term and long-term actions.
For now, in v31:
- [x] #3047 The deprecation message should have a link to this PR.
In v32:
- [ ] Create a new `--csv-file` option that would only list file-level details in this way:
  - file info columns, license expression, copyright, holder, URLs, emails, "for_package"
  - no line number columns, no package data
  - exactly one row per file
  - multiple values are joined with a newline in a single cell
  - long values that exceed office tool limits are truncated
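The flattening described above could be sketched roughly like this — a minimal illustration only, not the actual implementation; the field names and the use of Excel's 32,767-character cell limit are assumptions:

```python
import csv

# Excel's documented per-cell character limit (Google Sheets allows 50,000).
MAX_CELL_LEN = 32767

def flatten_file_entry(entry):
    """Flatten one file-level scan mapping into a single CSV row:
    list values are joined with a newline in one cell, and overlong
    values are truncated so office tools can open the file."""
    row = {}
    for field, value in entry.items():
        if isinstance(value, (list, tuple)):
            value = "\n".join(str(v) for v in value)
        else:
            value = "" if value is None else str(value)
        row[field] = value[:MAX_CELL_LEN]
    return row

# Hypothetical file-level entries, one row per file in the output.
entries = [
    {"path": "src/foo.c", "license_expression": "mit",
     "copyrights": ["Copyright (c) 2020 Alice", "Copyright (c) 2021 Bob"]},
]

with open("files.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(
        out, fieldnames=["path", "license_expression", "copyrights"])
    writer.writeheader()
    for entry in entries:
        writer.writerow(flatten_file_entry(entry))
```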
- [ ] Create a new `--csv-package` option that would only list package details:
  - one row per package instance
- [ ] Create a new `--csv-dependency` option that would only list dependency details:
  - one row per dependency instance
- [ ] Create a new `--csv-license` option that would only list file-level license scan information, used for debugging and hidden from the CLI help:
  - one row for each license match, with diagnostic details for each match (rule, scores, etc.)
- [ ] Drop the hidden `--csv` option in v32
- [ ] Add a new XLSX option that creates a proper spreadsheet with multiple tabs:
  - files
  - packages
  - dependencies
  - summary
  - and possibly: file "package_data" and file "license_detections" tabs, where these essentially mirror the new csv-* options
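One possible shape for the multi-tab output: split the scan results into one table per planned tab, each of which would then be written to its own worksheet (e.g. with XlsxWriter, as suggested further down the thread). This is a sketch under simplified assumptions — the input field names and the splitting function are hypothetical:

```python
def split_scan_into_tabs(scan):
    """Split a ScanCode-like results mapping into one (headers, rows)
    table per planned XLSX tab; each table would become one worksheet
    in the output workbook."""
    tabs = {}
    for tab, entries in (
        ("files", scan.get("files", [])),
        ("packages", scan.get("packages", [])),
        ("dependencies", scan.get("dependencies", [])),
    ):
        # Union of keys across entries, so every row has every column.
        headers = sorted({key for entry in entries for key in entry})
        rows = [[entry.get(h, "") for h in headers] for entry in entries]
        tabs[tab] = (headers, rows)
    return tabs

scan = {
    "files": [{"path": "src/a.c", "license_expression": "mit"}],
    "packages": [{"purl": "pkg:npm/foo@1.0.0"}],
    "dependencies": [],
}
tabs = split_scan_into_tabs(scan)
# With XlsxWriter (not imported here), each tab would then be written as:
#   ws = workbook.add_worksheet(name)
#   ws.write_row(0, 0, headers)  # then one write_row per data row
```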
For reference, we have these related issues:
- [ ] #2839 is the key issue and would drop CSV for XLSX
- [ ] #1551 shows that serialization in a row is always an issue
- [ ] #2249
- [ ] #1822
- [ ] #1398
- [ ] #829
- [ ] #830
- [ ] #2936 where @yahym uses a pivot table
- [ ] #1330 shows we have documentation issues
- [ ] #1654 ... combining JSONs to create a CSV is not easy
I was looking into https://www.python-excel.org/ for what we can use to implement this:
https://foss.heptapod.net/openpyxl/openpyxl/ has pandas in its requirements, so probably not something we can use.
Others are libraries that only use the standard library, no other requirements:
- https://github.com/jmcnamara/XlsxWriter seems the best choice: actively maintained, with lots of functionality. The author maintains tools in other languages for the same purpose too :P
- https://github.com/PydPiper/pylightxl seems lightweight and can be used by vendoring a single file.
- https://github.com/python-excel/xlwt was widely used, but it targets old Excel formats and is no longer actively maintained.
@pombredanne what do you think?
@Ayan Sinha Mahapatra IMHO, just reuse what is used in SCIO. See https://github.com/nexB/thirdparty-packages/blob/main/pypi/XlsxWriter-3.0.3.tar.gz.ABOUT and https://github.com/nexB/scancode.io/blob/c4523ea0ab5d19d5dd22a5b5af785bac05be583c/scanpipe/views.py#L339
I'm not disputing that the current CSV file is unwieldy, but it's easier to process automatically than multiple tabs. It contains all the data, which is sometimes what you want in a hurry.
My other approach has been to process the json into a custom CSV, but sometimes that's a hassle.
Would you consider keeping the existing file around as --csv-everything-yolo or similar?
@rspier you wrote:
> Would you consider keeping the existing file around as --csv-everything-yolo or similar?
Of course, but then maybe we can design it so that it has everything AND not too much at the same time, so it is compact and efficient to review? For instance, returning the start and end lines of copyrights and license matches may not be needed there, and we could design something that has the key data and fits on a single row per file?
I think "compact and efficient to review" might be something that comes from the other views, while the "big one" is for those cases where you want all the data, or things that aren't in the other formats. It's significantly easier to hide/remove data than it is to add/merge it back in, which is one reason I'd lean towards having a CSV with too much info.
For example, I often use `matched_rule__identifier` so I can see what the rule was, to determine if it's likely a false positive (e.g. `lGpL`).
There are a lot of nested fields that I would drop before I dropped start and end line. For example:
- license__homepage_url
- license__text_url
- license__reference_url
- license__scancode_text_url
- license__scancode_data_url
- license__spdx_license_key
- license__spdx_url
These are highly repetitive of each other and of other fields (like `license__key`); the same goes for `license__name` vs. `license__short_name`.
An alternative idea would be to leverage --custom-template FILE by providing pre-made "kitchen sink" templates. i.e. /path/to/scancode/lib/templates/oldstyle.csv
The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.
@armijnhemel re:
> The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.
The problem is that this is too big in practice to be routinely included in a CSV... all commercial and libre spreadsheets I know of will choke on an AGPL matched text :]
- LibreOffice: likely 2**31 characters, but very slow with tens of thousands
- Google Sheets: 50,000 characters
- MS Excel: 32,767 characters

All tools start to choke beyond a few thousand characters.
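A toy check makes the point concrete, using the per-cell limits quoted above (the `CELL_LIMITS` mapping and the rough AGPL text length are my own assumptions, not part of any tool's API):

```python
# Per-cell character limits quoted above; LibreOffice's theoretical
# limit is much higher, but it slows down long before reaching it.
CELL_LIMITS = {"google_sheets": 50_000, "ms_excel": 32_767}

def fits_in_cell(text, tool="ms_excel"):
    """Return True if the text would fit in one cell of the given
    spreadsheet tool without truncation."""
    return len(text) <= CELL_LIMITS[tool]

# The full AGPL-3.0 license text is roughly this many characters,
# so a matched_text cell holding it would overflow an Excel cell.
agpl_like_text = "x" * 34_000
print(fits_in_cell(agpl_like_text, "ms_excel"))       # False
print(fits_in_cell(agpl_like_text, "google_sheets"))  # True
```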
The way out is, IMHO, using ScanCode.io or ScanCode Workbench, which both display the license detection and matched text loaded from a JSON scan, even if you do not have the original scanned code.