Lint: Check long description for potentially unlinked collection names (SE Manual conformance)

Open jwfxpr opened this issue 3 years ago • 6 comments

As per the SE Manual 9.6.2.5,

If the long description references... story collections that already have pages on Standard Ebooks then the first occurrence of these are linked as well.

Currently, se_epub_lint.py m-056 checks the author's name for linking:

https://github.com/standardebooks/tools/blob/6b73e1934b80a70986a7c7ac262dc93135861441/se/se_epub_lint.py#L676-L685

I propose a similar message for unlinked collection names, using the id="collection-n" attributes in content.opf for reference. Since many collection names refer to a publication list from an authority (e.g. "Modern Library", "The Guardian", etc.), the check would simply try to match the first word that is not "The" or "A" and look for enclosing tags, similar to the current check for the author's last name. This would be an se.MESSAGE_TYPE_WARNING, as this method is much more prone to mismatching.

I'm happy to have a go at this, using the next available 'm-' message code, if you think it's a useful enhancement.

jwfxpr avatar Feb 11 '21 11:02 jwfxpr
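
For illustration only, the heuristic proposed above could be sketched roughly as follows. This is not code from se_epub_lint.py: the function names are hypothetical, and the sketch assumes the long description has already been unescaped into plain markup and that the collection name has been read from content.opf.

```python
import re

# Leading articles that the proposal skips when picking the word to search for
ARTICLES = {"the", "a"}

def significant_word(collection_name):
    """Return the first word of the name that is not "The" or "A" (hypothetical helper)."""
    for word in collection_name.split():
        if word.lower() not in ARTICLES:
            return word
    return None

def first_occurrence_is_linked(long_description, collection_name):
    """
    Return True if the first occurrence of the significant word is inside an <a> element,
    False if it is outside one, and None if the word does not occur at all.
    """
    word = significant_word(collection_name)
    if word is None:
        return None
    match = re.search(rf"\b{re.escape(word)}\b", long_description)
    if match is None:
        return None
    before = long_description[:match.start()]
    # The word is inside a link if the most recent <a ...> has not been closed yet
    return before.rfind("<a ") > before.rfind("</a>")

linked = 'It appears on <a href="">The Guardian</a>\'s list of novels.'
unlinked = "It appears on the Guardian's list of novels."
print(first_occurrence_is_linked(linked, "The Guardian"))    # True
print(first_occurrence_is_linked(unlinked, "The Guardian"))  # False
```

Because only a single word is matched, an unrelated sentence that happens to start with "Modern" would also be flagged for a "Modern Library" collection, which is why a warning rather than an error seems appropriate.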

I think this might not be so straightforward, because you would have to craft a general-case regex to match things like <a href="">The Guardian</a> and the <a href="">Guardian</a> and <a href="">The Guardian's</a> and the <a href="">Guardian</a>'s and so on. This is not so hard with xpath, but we don't have xpath on the long description; we have to use regex, and that would be difficult or maybe impossible to do in the general case without raising too many false positives.

acabal avatar Feb 11 '21 15:02 acabal

If you want to try it, then give it a shot; but maybe a better approach would be somehow parsing the long description into an EasyXml object so that we can get xpath for both this problem and for m-056.

acabal avatar Feb 11 '21 15:02 acabal
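
As a rough sketch of the parsing idea suggested above: the escaped long description can be unescaped and parsed into an element tree, after which link text and non-link text can be inspected separately. The function names here are hypothetical, and the standard library's ElementTree is used only so the sketch runs without dependencies; the tool itself would presumably use its EasyXml class over lxml. The sketch also assumes the description is stored as entity-escaped markup, as it would appear when reading content.opf as raw text.

```python
import html
import xml.etree.ElementTree as ET

def parse_long_description(escaped):
    """Unescape the markup and wrap it in one root element so multi-paragraph fragments parse."""
    return ET.fromstring(f"<div>{html.unescape(escaped)}</div>")

def link_texts(root):
    """Return the text content of every <a> element, in document order."""
    return ["".join(a.itertext()) for a in root.iter("a")]

def text_outside_links(element):
    """Return all text under the element, skipping anything inside <a> elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != "a":
            parts.append(text_outside_links(child))
        # Text following a child belongs to the parent, so it is kept even for links
        parts.append(child.tail or "")
    return "".join(parts)

escaped = '&lt;p&gt;Ranked by &lt;a href=""&gt;The Guardian&lt;/a&gt;\'s critics.&lt;/p&gt;'
root = parse_long_description(escaped)
print(link_texts(root))          # ['The Guardian']
print(text_outside_links(root))  # Ranked by 's critics.
```

With the tree available, a check can ask whether a collection word occurs in any link text or only in the text outside links, without needing one regex that covers every way the link might wrap the name.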

If you want to try it, then give it a shot; but maybe a better approach would be somehow parsing the long description into an EasyXml object so that we can get xpath for both this problem and for m-056.

That seems like a really sensible approach, and will make it much easier to work with the long description for future enhancements. A bit of shallow googling indicates that the standard library module xml.sax.saxutils includes an unescape() function which might do the trick. I'll have a bit of a play with this in my spare time and see if I can come up with an elegant solution. I'm rusty with both Python and XML, so it's a good excuse to practice.

jwfxpr avatar Feb 11 '21 16:02 jwfxpr

Only use the EasyXml class and lxml for this; we don't want to add more giant dependencies like Sax.

acabal avatar Feb 11 '21 16:02 acabal

Gotcha. Shall do.

jwfxpr avatar Feb 12 '21 06:02 jwfxpr

Hi there, is there progress on this or should we close it as inactive?

acabal avatar Nov 09 '21 19:11 acabal

Closed for inactivity.

acabal avatar Jun 15 '23 15:06 acabal