Locate broken links in content

Open ssddanbrown opened this issue 1 year ago • 1 comments

Describe the feature you'd like

Functionality to scan links in content and detect which links are no longer valid (lead to >=400 status). Primarily for anchors, potentially for image/media references also.

Describe the benefits this would bring to existing BookStack users

This will allow broken links in content to be detected and dealt with (updated/removed) so that readers won't be dealing with dead links. From the editor's point of view, system-level scanning helps locate such links without spending an exhaustive amount of time searching manually.

Can the goal of this request already be achieved via other means?

Yes:

  • Via manual search (time consuming)
  • Via API scripting (requires custom code, may not be able to handle internal links).
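For reference, the API-scripting route could start from something like the sketch below. It assumes the BookStack REST API's page listing (`/api/pages` with `count`/`offset`) and page detail (`/api/pages/{id}`, returning an `html` field) endpoints, plus token authentication. The names `fetch_json` and `iter_page_html` are made up for illustration. It only collects page HTML; checking the links, and internal links in particular, is left open.

```python
import json
import urllib.request


def fetch_json(url, token_id, token_secret):
    """GET a BookStack API URL using token auth and return the decoded JSON."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Token {token_id}:{token_secret}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_page_html(base_url, get, page_size=100):
    """Yield (page_id, html) for every page the API token can see.

    get: callable that takes a URL and returns decoded JSON
         (hypothetical hook, so the paging logic can be tested offline).
    """
    offset = 0
    while True:
        listing = get(f"{base_url}/api/pages?count={page_size}&offset={offset}")
        for item in listing["data"]:
            # The listing does not carry content, so fetch each page's detail
            detail = get(f"{base_url}/api/pages/{item['id']}")
            yield item["id"], detail["html"]
        offset += len(listing["data"])
        if not listing["data"] or offset >= listing["total"]:
            break
```

Usage would look like `iter_page_html("https://wiki.example.com", lambda url: fetch_json(url, TOKEN_ID, TOKEN_SECRET))`, where the base URL and token values are placeholders. Only pages visible to the token's user are returned.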

Have you searched for an existing open/closed issue?

  • [X] I have searched for existing issues and none cover my fundamental request

How long have you been using BookStack?

3 months to 1 year

Additional context

Note: This was opened on behalf of a user through the BookStack support services. (Ticket 116)

Dev Notes

  • Server-side request security considerations limit what can be done here. Review against existing options. Likely need to have strong permission requirements (or keep at sysadmin/command level).
  • Would likely expect handling of internal links, which may be complicated in permission-controlled scenarios.
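To illustrate the server-side request concern: a link checker running on the server will request whatever URL an editor puts in content, including internal addresses. A minimal sketch of one possible guard follows, written in Python for illustration only (BookStack itself is PHP). `is_safe_to_request` is a hypothetical helper, not an existing BookStack function.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_safe_to_request(url):
    """Return True only for http(s) URLs whose host resolves solely to public addresses."""
    try:
        parts = urlsplit(url)
        host, port = parts.hostname, parts.port
    except ValueError:
        return False  # malformed URL or port
    if parts.scheme not in ("http", "https") or not host:
        return False
    try:
        infos = socket.getaddrinfo(host, port or 80, proto=socket.IPPROTO_TCP)
        addresses = [ipaddress.ip_address(info[4][0]) for info in infos]
    except (OSError, ValueError):
        return False  # could not resolve or parse; treat as unsafe
    # Reject loopback, private, link-local and other non-global ranges
    return bool(addresses) and all(address.is_global for address in addresses)
```

This check alone is not sufficient. The host is resolved separately from the later request, so DNS rebinding and HTTP redirects to internal hosts would still need handling. That is part of why keeping the feature at sysadmin/command level may be the simpler option.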

ssddanbrown avatar Aug 14 '24 14:08 ssddanbrown

Piggy-backing on this issue, if I may – with a related feature request: Other wiki software (MediaWiki, DokuWiki) displays links to non-existent pages in red. Would this be possible to implement in BookStack?

(I'm not sure if this might already be possible, given BookStack's internal data structures and permission system, but I'd guess "not easily" – so any changes required to support this broken-link location feature could ideally be done with this feature in mind, too.)

doersino avatar Oct 01 '24 19:10 doersino

We could build a Python script to:

  • access the database, read the "pages" table and get the "html" column from it
  • parse that column's content for each page ID to find all existing URLs (https://stackoverflow.com/questions/44644501/extract-all-urls-in-a-string-with-python3)
  • curl each URL and check for the desired return values (there are many specific return codes to choose from) -> https://curl.se/libcurl/c/libcurl-errors.html
  • save the results in a new database table or in a file such as JSON
  • parse the output to do something with it, e.g.
    • send a mail to the administrator
    • render a special page within BookStack containing a table, where each row has the page ID and a list of all its broken URLs
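The steps above could be sketched roughly as follows, using only the Python standard library. This is an assumption-laden outline rather than a finished tool. Database access is left out, so `find_broken_links` takes any iterable of `(page_id, html)` rows. All function names are hypothetical. It sends a plain HTTP `HEAD` request through `urllib` instead of shelling out to curl, and "broken" means a status of 400 or above, or no response at all.

```python
import http.client
import json
import urllib.error
import urllib.request
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs found in href/src attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith(("http://", "https://")):
                self.urls.append(value)


def extract_urls(html):
    collector = LinkCollector()
    collector.feed(html)
    # De-duplicate while keeping first-seen order
    return list(dict.fromkeys(collector.urls))


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises 4xx/5xx responses as HTTPError
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # DNS failure, refused connection, timeout, invalid URL


def find_broken_links(pages, check=http_status):
    """pages: iterable of (page_id, html). Returns {page_id: [broken URLs]}."""
    report = {}
    for page_id, html in pages:
        broken = []
        for url in extract_urls(html):
            status = check(url)
            if status is None or status >= 400:
                broken.append(url)
        if broken:
            report[page_id] = broken
    return report


def save_report(report, path):
    """Write the report to a JSON file for later mailing or rendering."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```

The `check` parameter is there so the status lookup can be swapped out, for example for a cached lookup or a curl wrapper. Some servers answer `HEAD` with an error even though `GET` works, so a fallback to `GET` may be needed before flagging a link. Relative (internal) links are skipped by this sketch.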

vmario89 avatar Feb 10 '25 23:02 vmario89

Rendering broken URLs directly in BookStack is cool too and would be another approach. The difference: you will only find broken ones when you are on the page. Some process to find broken links globally would be good too. It could be run by a cron job weekly/monthly, for example.

vmario89 avatar Feb 11 '25 00:02 vmario89

I messed around a bit with SQL but had no success rendering a clean list of URLs yet. But I only spent 5 minutes on it.

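SELECT
	id,
	RES.urls
FROM (
	SELECT
		id,
		-- REGEXP_SUBSTR returns only the first match per page;
		-- [^"]+ stops at the closing quote instead of running to the last one
		REGEXP_SUBSTR(html, 'https?://[^"]+') AS urls
	FROM pages
) AS RES
WHERE
	RES.urls IS NOT NULL AND
	RES.urls != ''
;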

Maybe we could also do some kind of SQL-only approach.

vmario89 avatar Feb 11 '25 00:02 vmario89

Related to this, I've written up a somewhat simplistic script for locating internal old/broken cross-item (book/shelf/chapter/page) references. Usage details in the comments at the top of the script:

https://codeberg.org/bookstack/devops/src/branch/main/tinker-scripts/find-broken-internal-references.php

ssddanbrown avatar Jul 15 '25 12:07 ssddanbrown