Consolidate Scan Storages
The scanner supports storing multiple types of data in storages for reuse in subsequent runs or by other tools:

- Scan results
  - Package based
  - Provenance based
- Provenance resolution results
  - For packages
  - For nested provenances
- File archives
- File lists
Currently the storage backends can be configured separately for each of those four data types. While this is very flexible, in practice it provides little value. For example, if scan results are stored in a Postgres database, there is little reason to store provenance results in a different place. Or if file archives are stored in S3, there is little reason to store file lists in a different place.
This flexibility makes the configuration complex: The default settings store all data in a local directory, which is usually not desired in a production setup, so storing the data remotely requires four storage backend configurations. This often confuses users and can also cause performance issues for users who do not know how the scanner works internally, for example, by forgetting to configure a provenance storage, which leads to unnecessary repetition of the provenance resolution.
To simplify the configuration, the proposal is to consolidate the configuration to just two types of data:
- Structured data
  - Scan results
  - Provenance resolution results
- Binary data
  - File archives
  - File lists (at least currently; it could be more efficient to treat file lists as structured data as well)
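For illustration, the proposed split could look roughly like the following sketch. All interface and method names here are hypothetical placeholders, not ORT's actual API, and the payloads are simplified to plain strings and byte arrays instead of ORT's model classes:

```kotlin
// Hypothetical sketch of the proposed two-interface split; names and
// signatures are illustrative, not ORT's actual API.

/** Storage for structured data: scan results and provenance resolution results. */
interface StructuredStorage {
    fun writeScanResult(provenance: String, result: String)
    fun readScanResult(provenance: String): String?
    fun writeProvenanceResolution(packageId: String, provenance: String)
    fun readProvenanceResolution(packageId: String): String?
}

/** Storage for binary data: file archives and file lists. */
interface BinaryStorage {
    fun write(key: String, data: ByteArray)
    fun read(key: String): ByteArray?
}

/** A trivial in-memory implementation, just to show the shape of the contract. */
class InMemoryStructuredStorage : StructuredStorage {
    private val scanResults = mutableMapOf<String, String>()
    private val resolutions = mutableMapOf<String, String>()

    override fun writeScanResult(provenance: String, result: String) {
        scanResults[provenance] = result
    }

    override fun readScanResult(provenance: String) = scanResults[provenance]

    override fun writeProvenanceResolution(packageId: String, provenance: String) {
        resolutions[packageId] = provenance
    }

    override fun readProvenanceResolution(packageId: String) = resolutions[packageId]
}
```

A concrete backend (PostgreSQL, S3, ...) would then implement one or both interfaces, so a single configuration entry can cover all four data types.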
The implementation proposal is:

- Rename `ScanStorage` and all related classes to `ScanResultStorage`.
  - When introduced, the name `ScanStorage` was chosen because `ScanResultStorage` was already taken, but this is not the case anymore.
- Make two interfaces providing the functions for storing structured data and binary data: `ScanStorage` and `BinaryStorage`.
- Make implementations of those interfaces that reuse the existing implementations.
  - For example, `PostgresScanStorage` uses `ProvenanceBasedPostgresStorage`, `PostgresPackageProvenanceStorage`, and `PostgresNestedProvenanceStorage`.
- Adapt the configuration to require only configuration for those two types of storages.
- Make the new interfaces plugins (see #6603).
  - This ensures that new implementations provide all required features. For example, a new `MariaDbScanStorage` should not only provide a way to store scan results, but also to store provenance resolution results.
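To illustrate the reuse of existing implementations, a consolidated class could delegate to them roughly as below. The `*Stub` classes are simplified in-memory stand-ins for the real PostgreSQL-backed classes mentioned above; the delegation pattern, not the exact signatures, is the point:

```kotlin
// Simplified stand-in for a provenance-based scan result storage.
class ProvenanceBasedStorageStub {
    private val results = mutableMapOf<String, String>()
    fun put(provenance: String, scanResult: String) { results[provenance] = scanResult }
    fun get(provenance: String) = results[provenance]
}

// Simplified stand-in for a package provenance storage.
class PackageProvenanceStorageStub {
    private val provenances = mutableMapOf<String, String>()
    fun put(packageId: String, provenance: String) { provenances[packageId] = provenance }
    fun get(packageId: String) = provenances[packageId]
}

/** The consolidated facade: one class, one configuration entry. */
class ConsolidatedScanStorage(
    private val scanResults: ProvenanceBasedStorageStub = ProvenanceBasedStorageStub(),
    private val packageProvenances: PackageProvenanceStorageStub = PackageProvenanceStorageStub()
) {
    fun storeScanResult(provenance: String, result: String) = scanResults.put(provenance, result)

    fun storeProvenanceResolution(packageId: String, provenance: String) =
        packageProvenances.put(packageId, provenance)

    /** Resolve a package to its provenance, then look up the scan result. */
    fun readScanResultForPackage(packageId: String): String? =
        packageProvenances.get(packageId)?.let { scanResults.get(it) }
}
```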
For reference, this also relates to the pending draft at https://github.com/oss-review-toolkit/ort/pull/5516 which tries to address the storage configuration complexity by centralizing and reusing it.
> To simplify the configuration, the proposal is to consolidate the configuration to just two types of data:
That distinction makes sense to me. And maybe the interfaces should be named accordingly: For binary data that's already the case, but maybe `ScanStorage` should be more generally `StructuredStorage`. Thinking about it, maybe `BlobStorage` would then be a better contrast to that than `BinaryStorage`.
In the context of this issue and #6603 I would also like to further simplify the scan storage implementation:
Currently, an arbitrary number of storages can be configured, and they can act as either reader, writer, or both. I would like to change that so that only a single implementation can be configured which is always reader and writer. Also, we support package and provenance based storages. I would like to remove the support for package based storages because this would significantly reduce the complexity of the scanner logic.
For the separate reader/writer configuration I am aware of only two use cases:
- The `ClearlyDefinedStorage` which can only read scan results. My proposal would be to separate that functionality from the scan result storage, for example by adding a new interface like `ScanResultProvider`, especially as `ClearlyDefined` only supports lookup by package, not by provenance.
- When migrating to a different storage implementation, the old one can still be configured as read-only to not have to re-scan everything. Instead, I would propose to add a migration command for this, but only if it is really required.
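A read-only interface along those lines might look like the following sketch. The interface name comes from the proposal above, but the method signature and the stub are assumptions for illustration only:

```kotlin
// Sketch of a read-only ScanResultProvider, separated from the read/write
// scan result storage. The string-based signature is a simplification of
// ORT's actual model classes.
interface ScanResultProvider {
    /** Look up scan results by package identifier, as ClearlyDefined supports. */
    fun getScanResults(packageId: String): List<String>
}

/** A ClearlyDefined-backed provider would implement only the read side. */
class ClearlyDefinedProviderStub(
    private val knownResults: Map<String, List<String>> = emptyMap()
) : ScanResultProvider {
    override fun getScanResults(packageId: String) = knownResults[packageId].orEmpty()
}
```

Because the provider has no write methods at all, the "read-only storage" concept disappears from the configuration entirely.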
The following implementations would be affected by removing support for package based storages:

- `ClearlyDefinedStorage`: See above.
- The old package based PostgreSQL and file storages: For those we have provenance based replacements.
- `Sw360Storage`: I have no idea if anyone is using this. Ideally, we can remove it.
@oss-review-toolkit/core-devs I would be very interested in your feedback.
> we support package and provenance based storages. I would like to remove the support for package based storages
I'm fine with removing support for package-based storages.
Just as a reminder: I believe we do still have plans to add file-based storages / query provenance-based storages on a file-level. This should be loosely kept in mind for relevant design choices.
> adding a new interface like `ScanResultProvider`, especially as `ClearlyDefined` only supports lookup by package, not by provenance.
Sounds reasonable to me. I foresee more such "read-only" scan result storages coming up in the context of https://github.com/aboutcode-org, and such results would probably be queried by purl (so, also not by provenance; that is, unless you encode the `repository_url` etc. as part of the qualifier).
> `Sw360Storage`: I have no idea if anyone is using this. Ideally, we can remove it.
Are you, @heliocastro?
> I would like to change that so that only a single implementation can be configured which is always reader and writer.
Just thinking out loud here: I still wish for an IPFS scan storage. Maybe such a storage would be slow, so maybe you'd want to first use a faster storage and only use IPFS as the fallback. Such a thing would not be possible with the new implementation.
Also I'm wondering: Does the current complexity mainly come from allowing readers and writers to be different, or from allowing multiple readers and writers? Would it make sense to maybe only remove the more complex of the aforementioned features, but keeping the other one?
> Just as a reminder: I believe we do still have plans to add file-based storages / query provenance-based storages on a file-level. This should be loosely kept in mind for relevant design choices.
Yes, that's another argument for simplifying the current implementation.
> Just thinking out loud here: I still wish for https://github.com/oss-review-toolkit/ort/issues/1328. Maybe such a storage would be slow, so maybe you'd want to first use a faster storage and only use IPFS as the fallback. Such a thing would not be possible with the new implementation.
So the faster storage would act as a cache for the IPFS storage? Maybe it's better to solve this in the IPFS implementation itself instead of on a higher level.
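Solving it in the implementation itself could mean wrapping the slow store in a caching decorator. A minimal sketch, assuming hypothetical interfaces (nothing here is existing ORT code):

```kotlin
// Minimal blob store abstraction, for illustration only.
interface BlobStore {
    fun read(key: String): ByteArray?
    fun write(key: String, data: ByteArray)
}

/** In-memory store standing in for either the fast cache or the slow backend. */
class InMemoryStore : BlobStore {
    private val blobs = mutableMapOf<String, ByteArray>()
    override fun read(key: String) = blobs[key]
    override fun write(key: String, data: ByteArray) { blobs[key] = data }
}

/**
 * Reads hit the cache first and fall back to the slow store (populating the
 * cache on a hit); writes go to both. ORT would only see one storage.
 */
class CachingBlobStore(
    private val cache: BlobStore,
    private val slow: BlobStore
) : BlobStore {
    override fun read(key: String): ByteArray? =
        cache.read(key) ?: slow.read(key)?.also { cache.write(key, it) }

    override fun write(key: String, data: ByteArray) {
        slow.write(key, data)
        cache.write(key, data)
    }
}
```

This keeps the fallback chain an internal detail of one configured storage instead of a feature of the scanner's storage configuration.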
> Also I'm wondering: Does the current complexity mainly come from allowing readers and writers to be different, or from allowing multiple readers and writers? Would it make sense to maybe only remove the more complex of the aforementioned features, but keeping the other one?
In the scanner code it's mainly the support for both package and provenance based storages. In the configuration it's mainly the support for multiple kinds of storages at the same time, separate readers and writers, and the fact that the configuration is scattered.
What I would like is to separate ORT's internal storage logic from external services, because they do not use the same concepts, and then find other ways to integrate the external services.
@mnonnenmacher @sschuberth We already overhauled the entire SW360 backend, and from the next release, 20.x, it will be very different, so the current storage model will only work for versions up to 18.x, and I honestly think that this one should go for good. We no longer want to store artifacts, but instead add links in the respective components to some defined storage, so what I think is actually reasonable:
- Deprecate / remove the current sw360 storage, but before do some inquiry on the community meeting
- Provide a way to when we use the sw360 provider component update, be able to use the current storage info to add as a reference link to
@heliocastro So if I understand you correctly, you are not using the `Sw360Storage`, correct?
> * Provide a way to when we use the sw360 provider component update, be able to use the current storage info to add as a reference link to
Sorry, I do not understand what you mean, could you elaborate?
Yep, not using `Sw360Storage` at all, correct.
For the second point, the `upload-result-to-sw360` utility today creates projects and components based on the information coming from the ORT results. So, I plan to extend this to add a link/info to the referred content, so I will need to have a way to retrieve information about the storage if it exists.
Basically, we are eliminating the `Sw360Storage`, but adding functionality to upload-result to add a reference to whatever storage is defined in the ORT config.
> Basically, we are eliminating the `Sw360Storage`, but adding functionality to upload-result to add a reference to whatever storage is defined in the ORT config.
That sounds good, it's in line with my goal to separate the integration of external services from ORT's internal storage implementation.
> So, I plan to extend this to add a link/info to the referred content, so I will need to have a way to retrieve information about the storage if it exists.
I think currently there is no easy way to get this. Could you give an example of what this reference should look like, for example, if results are stored in a PostgreSQL database?