Help detecting software sharing URLs

Open evamaxfield opened this issue 8 months ago • 5 comments

Hello!

First, thanks so much for making this and the accompanying software-mentions-client library. I really appreciate all of the work that has gone into making this available!

I am looking for some help regarding detecting and extracting software sharing URLs. For example, in the CZI Software Mentions Dataset paper, there are lots of pieces of software mentioned and correctly found but it doesn't seem to find this link/passage:

This model achieves a 10-fold cross-validation F1 score of 0.92. More details can be found at: https://github.com/chanzuckerberg/software-mention-extraction.

or:

All the code used for extraction, disambiguation and linking, as well as instructions on how to reproduce the results and some starter code is available at a GitHub repository https://github.com/chanzuckerberg/software-mentions under the MIT license with the permanent snapshot at [43].

I was hoping that I would be able to extract those links as that is the authors trying to share their software but I think I may be doing something wrong or have the deployment configured incorrectly.

Another example can be found in the Rise of Open Science paper. This time the service finds very little software (which I think is to be expected) but also doesn't find the authors sharing their code in the following footnote:

The codes can be accessed at https://github.com/caohanch/paper_data_method_sharing/.

I am using the grobid/software-mentions:0.8.1 docker image and I haven't changed any of the configuration details because I already saw in the README:

It is recommended to use the Docker image for running the service. The best Deep Learning models are included and are used by default by this image.

Please let me know if you have any thoughts/ideas/etc. Any help is greatly appreciated, just confused if I am doing something incorrectly.

evamaxfield avatar Mar 20 '25 22:03 evamaxfield

Thanks! Looks like these might be URL-only mentions, i.e. without a separate software name? @kermitt2 any insight on these? Perhaps we need to gather a bunch of these and retrain?

jameshowison avatar Mar 21 '25 15:03 jameshowison

That seems correct to me. "Our software/code is available at: xyz" without an explicit name provided. However I will also say that I have seen this same issue for sentences like:

"All code, models, and data used in this work is available from our Python package: insert-package-name-here" where the "insert-package-name-here" is also hyperlinked to GitHub / PyPI / etc. Which made me also wonder how hyperlink parsing works? Does the model only see the package name and not the link, does it see both?

evamaxfield avatar Mar 21 '25 16:03 evamaxfield

It should see an abstract "mention", then identify metadata for that mention (software-name, url, citation, version, etc).

Ah, wait, you mean actual links, rather than URLs in text. Great question, I'm not sure I know. That would be a Grobid (the underlying model of which softcite-mention is a specialization) question. This seems to imply that Grobid does not extract those: https://pierre.senellart.com/publications/mishra2024first.pdf

jameshowison avatar Mar 21 '25 16:03 jameshowison

Hello @evamaxfield and thank you for using our software mention recognizer!

As James explained, the model normally extracts a URL that comes with a software name mention (including implicit mentions, like "code" or "program"), but not URL-only mentions and, currently, not URLs that exist only as PDF annotations.

For example, what is expected from your example:

The codes can be accessed at https://github.com/caohanch/paper_data_method_sharing/.

is to recognize the software mention "codes" with the URL attribute "https://github.com/caohanch/paper_data_method_sharing".

Implicit mentions currently have fewer training examples, so recognition might not always be as accurate as it should be. There is at least one project currently running (SoFAIR) that is developing additional training data, so this should improve.

A URL-only mention is currently not supported, for example:

"We process the document with https://github.com/kermitt2/grobid."

Extracting the software mention and URL from this example is currently outside the scope of our tool.
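
Until URL-only mentions are supported, one possible workaround is a plain regular-expression pass over the extracted text, run alongside the service. The sketch below is not part of software-mentions: the function name `find_shared_code_urls` and the host list are assumptions chosen for illustration, and a filter this simple will miss or over-match many real sentences.

```python
import re
from urllib.parse import urlparse

# Hosts that commonly hold shared research code (illustrative, incomplete list).
CODE_HOSTS = ("github.com", "gitlab.com", "bitbucket.org", "zenodo.org", "pypi.org")

# Match http(s) URLs up to whitespace or a common closing delimiter.
URL_RE = re.compile(r"https?://[^\s\]\)>,;\"']+")


def find_shared_code_urls(text: str) -> list[str]:
    """Return URLs in `text` whose host is a known code-hosting site."""
    urls = []
    for match in URL_RE.finditer(text):
        # Drop a sentence-final period and any trailing slash.
        url = match.group(0).rstrip("./")
        host = urlparse(url).hostname or ""
        # Accept the host itself or any subdomain of it.
        if host in CODE_HOSTS or host.endswith(tuple("." + h for h in CODE_HOSTS)):
            urls.append(url)
    return urls


if __name__ == "__main__":
    sentence = "We process the document with https://github.com/kermitt2/grobid."
    print(find_shared_code_urls(sentence))
```

This only recovers the URL; it does not produce a software mention, so the results would need to be merged with the service output separately.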

Similarly, if the URL only appears as a PDF annotation on a mentioned name (without the URL appearing in the text), it will not be added as a URL attribute of the mention. This is your example:

All code, models, and data used in this work is available from our Python package: insert-package-name-here.

(where the "insert-package-name-here" is also hyperlinked/"clickable" to GitHub / PyPI / etc.)

However, it would be easy to add that to the system, because Grobid extracts all the URLs of a PDF with their coordinates and the corresponding annotated text. So we simply need to check whether the PDF tokens of the recognized software name are associated with a URL PDF annotation, using the existing GROBID List<PDFAnnotation>.
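
The check described above could look roughly like the following. This is a Python sketch for illustration only, not GROBID's Java code: the `Box` and `LinkAnnotation` types and the `url_for_mention` function are hypothetical stand-ins for GROBID's layout tokens and its PDF annotation objects, and the 50% overlap threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on a PDF page (hypothetical stand-in type)."""
    page: int
    x: float
    y: float
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height

    def intersection_area(self, other: "Box") -> float:
        """Area shared with `other`; zero if on different pages or disjoint."""
        if self.page != other.page:
            return 0.0
        dx = min(self.x + self.width, other.x + other.width) - max(self.x, other.x)
        dy = min(self.y + self.height, other.y + other.height) - max(self.y, other.y)
        return dx * dy if dx > 0 and dy > 0 else 0.0


@dataclass(frozen=True)
class LinkAnnotation:
    """A clickable region of the PDF and its target (hypothetical stand-in type)."""
    box: Box
    url: str


def url_for_mention(
    token_boxes: Sequence[Box],
    annotations: Sequence[LinkAnnotation],
    min_overlap: float = 0.5,  # assumed threshold, not taken from GROBID
) -> Optional[str]:
    """Return the URL of the annotation that best covers the mention's tokens."""
    total = sum(box.area() for box in token_boxes)
    if total == 0:
        return None
    best_url, best_ratio = None, 0.0
    for annotation in annotations:
        covered = sum(box.intersection_area(annotation.box) for box in token_boxes)
        ratio = covered / total
        if ratio >= min_overlap and ratio > best_ratio:
            best_url, best_ratio = annotation.url, ratio
    return best_url
```

Given the token boxes of a recognized software name and the link annotations from the same document, `url_for_mention` returns the target of the annotation covering most of the name, or `None` when no annotation covers at least half of it.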

kermitt2 avatar Mar 23 '25 13:03 kermitt2

I see! Thanks for the insight. Please let me know / update this issue whenever anything like this becomes available via the service!

evamaxfield avatar Mar 25 '25 21:03 evamaxfield