Problems creating a package from a SourceForge download URL

Open DennisClark opened this issue 2 years ago • 13 comments

Perhaps this is user ("pilot") error, but when I create a Package in DejaCode from a SourceForge download URL, I get strange results. A recent Add Package using https://sourceforge.net/projects/scribus/files/scribus/1.6.0/scribus-1.6.0.tar.gz/download resulted in a Package with a filename of download rather than scribus-1.6.0.tar.gz. It also produced the rather verbose PURL value pkg:generic/download?download_url=https://sourceforge.net/projects/scribus/files/scribus/1.6.0/scribus-1.6.0.tar.gz/download

I scanned the package, using the same download URL, directly in SCIO v32.0.8, and it returned a PURL value of pkg:autotools/scribus-1.6.0 in the key_files_packages section

So it appears that the rather eccentric download conventions of SourceForge are messing things up a bit.

  • Can we improve DejaCode to interpret the results of such a scan differently?
  • Does such an improvement rather belong in SCIO?
  • Or should we prompt the DejaCode user with instructions on how to provide a different, better, less eccentric download URL when processing a SourceForge package?

DennisClark avatar Jan 02 '24 19:01 DennisClark

The problem stems from the fact that https://sourceforge.net/projects/scribus/files/scribus/1.6.0/scribus-1.6.0.tar.gz/download is not the actual direct download URL: it goes through several URL redirects that end up at a mirror.

The final destination looks something like the following, where the first segment changes from mirror to mirror: https://kumisystems.dl.sourceforge.net/project/scribus/scribus/1.6.0/scribus-1.6.0.tar.gz

The stable final URL would be https://master.dl.sourceforge.net/project/scribus/scribus/1.6.0/scribus-1.6.0.tar.gz

None of these are practically visible and accessible. Therefore we should IMHO do these:

  • [ ] Convert Sourceforge download URL to PURL. Update the code to properly translate a Sourceforge URL to a PURL, either here or in the Python packageurl library, or in both places.
  • [ ] Consider updating "legacy" Sourceforge URLs to a canonical URL. This should be the one that is visible when browsing, ignoring redirections: https://sourceforge.net/projects/scribus/files/scribus/1.6.0/scribus-1.6.0.tar.gz/download
  • [ ] Update MineCode Sourceforge miners to handle and store download URLs correctly
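
For reference, the relationship between the browsable download URL and the stable master URL described above can be sketched in a few lines of Python. This is only an illustration inferred from the single scribus example in this comment; the function name `master_url_from_download_url` is hypothetical, and the mapping has not been verified against every SourceForge project layout.

```python
from urllib.parse import urlsplit


def master_url_from_download_url(download_url):
    """Map a browsable SourceForge download URL to a master.dl URL.

    Assumed input shape (inferred from the scribus example only):
        https://sourceforge.net/projects/<project>/files/<path...>/download
    Returns None when the URL does not have that shape.
    """
    parts = urlsplit(download_url)
    if parts.netloc != "sourceforge.net":
        return None

    segments = [segment for segment in parts.path.split("/") if segment]
    has_expected_shape = (
        len(segments) >= 5
        and segments[0] == "projects"
        and segments[2] == "files"
        and segments[-1] == "download"
    )
    if not has_expected_shape:
        return None

    project = segments[1]
    # Everything between "files" and the trailing "download" segment.
    file_path = "/".join(segments[3:-1])
    return f"https://master.dl.sourceforge.net/project/{project}/{file_path}"
```

Called with the scribus download URL from this issue, the sketch returns the master URL quoted above.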

pombredanne avatar Jan 03 '24 14:01 pombredanne

thanks @pombredanne your proposed solution looks good to me!

DennisClark avatar Jan 03 '24 14:01 DennisClark

Note that we have support for the https://*.sourceforge.net/project/scribus/scribus/1.6.0/scribus-1.6.0.tar.gz URLs in the packageurl library, returning pkg:sourceforge/scribus/[email protected]

We simply have to add support for this URL syntax: https://sourceforge.net/projects/scribus/files/scribus/1.6.0/scribus-1.6.0.tar.gz/download
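
As a rough sketch of what supporting this URL syntax could look like, the snippet below parses a `/projects/<name>/files/.../download` URL and builds a PURL string by hand. The regular expressions and the function name `sourceforge_download_url_to_purl` are assumptions made for illustration; this is not the pattern used in packageurl-python, and taking the version from a version-like directory segment is just one possible heuristic.

```python
import re

# Hypothetical pattern for illustration only.
SOURCEFORGE_DOWNLOAD = re.compile(
    r"^https?://sourceforge\.net/projects/"
    r"(?P<name>[^/]+)/files/"
    r"(?P<path>.+)/download$"
)

# A directory segment such as "1.6.0", "3.1" or "v1.0.96".
VERSION_SEGMENT = re.compile(r"v?(?P<version>\d+(?:\.\d+)+)")


def sourceforge_download_url_to_purl(url):
    """Return a pkg:sourceforge PURL string, or None if the URL does not match."""
    match = SOURCEFORGE_DOWNLOAD.match(url)
    if not match:
        return None

    name = match.group("name").lower()
    # Directory segments between "files" and the filename itself.
    directories = match.group("path").split("/")[:-1]
    for segment in directories:
        found = VERSION_SEGMENT.fullmatch(segment)
        if found:
            return f"pkg:sourceforge/{name}@{found.group('version')}"

    # No version-like directory: fall back to a PURL without a version.
    return f"pkg:sourceforge/{name}"
```

With this heuristic, the scribus URL above yields `pkg:sourceforge/scribus@1.6.0`. URLs with no version directory produce a PURL without a version.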

tdruez avatar Jan 04 '24 20:01 tdruez

@DennisClark I've added support for that type of URL in the purl library, see https://github.com/package-url/packageurl-python/issues/139. Also, as @pombredanne suggested, we are now using the final redirect URL to extract the proper filename.

With those changes, we now generate a proper PURL and filename: [Image: Screenshot 2024-01-04 at 14 08 37]
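
The idea of taking the filename from the final redirect URL can be sketched with the standard library alone. This is an illustrative approximation, not the code that went into DejaCode. The helper names `resolve_final_url` and `filename_from_url` are hypothetical, and how SourceForge responds may depend on request headers such as the user agent.

```python
import posixpath
import urllib.request
from urllib.parse import urlsplit


def resolve_final_url(url, timeout=30):
    """Follow HTTP redirects and return the URL of the final response.

    urlopen() follows redirects by default; the response body is not read here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)
```

Applied directly to the original `.../scribus-1.6.0.tar.gz/download` URL, `filename_from_url` returns `download`, which is the symptom reported in this issue. Applied to a mirror URL such as the kumisystems one above, it returns `scribus-1.6.0.tar.gz`.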

tdruez avatar Jan 04 '24 21:01 tdruez

Hi @tdruez I'm getting mixed results on staging. My original scribus case went just fine, but I then tried another package from SourceForge, turbovnc-3.1.tar.gz, on staging with a download URL of

https://sourceforge.net/projects/turbovnc/files/3.1/turbovnc-3.1.tar.gz/download

and it all went fine, including a scan, except that it did not assign any PURL values. See attached.

[Image: turbovnc-3 1 tar gz test on staging]

DennisClark avatar Jan 04 '24 21:01 DennisClark

@DennisClark I've added support for the following URL formats:

  • https://sourceforge.net/projects/turbovnc/files/3.1/turbovnc-3.1.tar.gz/download
  • https://sourceforge.net/projects/ventoy/files/v1.0.96/Ventoy%201.0.96%20release%20source%20code.tar.gz/download
  • https://sourceforge.net/projects/geoserver/files/GeoServer/2.23.4/geoserver-2.23.4-war.zip/download

You can give it another try.

tdruez avatar Jan 05 '24 17:01 tdruez

@tdruez I tested the 3 you identified in your comment, plus the scribus package, and they all look rather good, with one small issue.

When I simply click on the download link for the ventoy package, it downloads a file named Ventoy 1.0.96 release source code.tar.gz, which I think is correct and is what they call it on the web site. In DejaCode, however, the filename is shown as Ventoy%201.0.96%20release%20source%20code.tar.gz, with all the escape characters for the spaces. If we simply don't allow spaces in the DejaCode filename field, I guess that's OK, but it does look kind of strange. See attached.

[Image: ventoy package in staging]

DennisClark avatar Jan 05 '24 17:01 DennisClark

@tdruez one other observation, which is not directly related to this issue, but something that is somewhat perplexing. DejaCode found the existing scans that I created yesterday for the 4 packages (good) and apparently they did not get re-scanned (fine I think) but it did not perform any of the auto-updates to fields on the package (not so good), such as the license-expression, even though 3 of the 4 scans have a declared license. See attached.

[Image: Screenshot 2024-01-05 at 09 25 54]

DennisClark avatar Jan 05 '24 17:01 DennisClark

In the example above, the geoserver does not have a detected license anyway, so that's not a big deal, but the other 3 all have declared licenses.

DennisClark avatar Jan 05 '24 17:01 DennisClark

@tdruez Sorry I did not catch this one yesterday, but the results from creating a package with

https://sourceforge.net/projects/spacesniffer/files/spacesniffer_1_3_0_2.zip/download

do not look so great. See attached.

[Image: spacesniffer in staging]

DennisClark avatar Jan 05 '24 17:01 DennisClark

It appears that there is an unknown number of (arbitrary) variations in the SourceForge download URLs, which suggests we do not have a satisfactory way to determine whether we have covered them all. I'm sure you would like to finish this one, but it may be an unmanageable task. I'm OK if we go with "good enough" once we have fixed the ones we have actually discovered.

DennisClark avatar Jan 05 '24 17:01 DennisClark

@DennisClark changes available for review:

  • Ventoy%201.0.96%20release%20source%20code.tar.gz is now properly unquoted
  • Added support for https://sourceforge.net/projects/spacesniffer/files/spacesniffer_1_3_0_2.zip/download
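
For illustration, the unquoting step can be reproduced with `urllib.parse.unquote` from the standard library. The helper below is a hypothetical sketch (the name `display_filename` is made up here) and is not the code that was changed in DejaCode. It drops a trailing `download` segment, then percent-decodes the last remaining path segment.

```python
from urllib.parse import unquote, urlsplit


def display_filename(url):
    """Return a human-readable filename from a SourceForge-style URL."""
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    # SourceForge browse URLs end with a literal "download" segment.
    if segments and segments[-1] == "download":
        segments.pop()
    if not segments:
        return ""
    # Decode percent-escapes such as %20 back into spaces.
    return unquote(segments[-1])


print(display_filename(
    "https://sourceforge.net/projects/ventoy/files/v1.0.96/"
    "Ventoy%201.0.96%20release%20source%20code.tar.gz/download"
))
# Ventoy 1.0.96 release source code.tar.gz
```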

> one other observation, which is not directly related to this issue, but something that is somewhat perplexing. DejaCode found the existing scans that I created yesterday for the 4 packages (good) and apparently they did not get re-scanned (fine I think) but it did not perform any of the auto-updates to fields on the package (not so good), such as the license-expression, even though 3 of the 4 scans have a declared license. See attached.

Entered as https://github.com/nexB/dejacode/issues/30

tdruez avatar Jan 09 '24 18:01 tdruez

@tdruez The spacesniffer package creation works great now. The Ventoy package creation issue is fixed, although it was very slow to complete the Add Package step, with the cursor spinning for more than 2 minutes; I tested it with a different Ventoy version and had the same slow response. So it all appears to be working fine, but you might want to check on the performance problem.

DennisClark avatar Jan 09 '24 18:01 DennisClark