BookStack
Locate broken links in content
Describe the feature you'd like
Functionality to scan links in content and detect which links are no longer valid (lead to >=400 status). Primarily for anchors, potentially for image/media references also.
Describe the benefits this would bring to existing BookStack users
This would allow broken links in content to be detected and dealt with (updated or removed) so that readers aren't left with dead links. From the editor's point of view, system-level scanning helps locate such links without spending an exhaustive amount of time searching manually.
Can the goal of this request already be achieved via other means?
Yes:
- Via manual search (time-consuming)
- Via API scripting (requires custom code; may not be able to handle internal links)
Have you searched for an existing open/closed issue?
- [X] I have searched for existing issues and none cover my fundamental request
How long have you been using BookStack?
3 months to 1 year
Additional context
Note: This was opened on behalf of a user through the BookStack support services. (Ticket 116)
Dev Notes
- Server-side request security considerations limit things. Review against existing options. Likely need strong permission requirements (or keep at sysadmin/command level).
- Would likely expect handling of internal links, which may be complicated in permission-controlled scenarios.
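The server-side request concern above is essentially SSRF: a scanner that fetches arbitrary URLs found in page content could be tricked into probing internal hosts. As a rough illustration only (a Python sketch, not BookStack code; the function name is made up), such a scanner could refuse to check URLs that resolve to private or loopback addresses:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_to_fetch(url: str) -> bool:
    """Basic SSRF guard: reject URLs whose host resolves to an
    internal/private address. A real deployment needs more than this
    (e.g. re-checking hosts after HTTP redirects)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        # Unresolvable host: nothing safe to fetch.
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```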
Piggy-backing on this issue, if I may, with a related feature request: other wiki software (MediaWiki, DokuWiki) displays links to non-existent pages in red. Would this be possible to implement in BookStack?
(I'm not sure if this might already be possible, given BookStack's internal data structures and permission system, but I'd guess "not easily", so any changes required to support this broken-link location feature could ideally be made with this feature in mind, too.)
We could build up some Python script to:
- access the database and read the "html" column from the "pages" table
- parse each page's content to find all contained URLs (https://stackoverflow.com/questions/44644501/extract-all-urls-in-a-string-with-python3)
- request each URL and check the return value (usually there are a lot of specific return codes for each need) -> https://curl.se/libcurl/c/libcurl-errors.html
- save the results in a new database table or in a file format like JSON
- parse the output to do something with it, e.g.:
  - send a mail to the administrator
  - render a special page within BookStack containing a table, where each row has an entry for the page ID and a list of all its broken URLs
Rendering broken URLs directly in BookStack is cool too and would be another approach. The difference: you'd only find broken ones when you're on that page. Some process to find broken stuff globally would be good too; it could be run by a cron job weekly/monthly, for example.
I messed around a bit with SQL but had no success rendering a clean list of URLs yet. But I only spent five minutes on it:
SELECT
    id,
    RES.urls
FROM (
    SELECT
        -- Note: REGEXP_SUBSTR returns only the first match per row.
        REGEXP_SUBSTR(html, '"https?://[^"]*"') AS urls,
        id
    FROM pages
) AS RES
WHERE
    RES.urls IS NOT NULL
    AND RES.urls != '';
Because maybe we could also do some kind of SQL-only approach.
Related to this, I've written up a somewhat simplistic script for locating internal old/broken cross-item (book/shelf/chapter/page) references. Usage details in the comments at the top of the script:
https://codeberg.org/bookstack/devops/src/branch/main/tinker-scripts/find-broken-internal-references.php