Use time markup in colophon

Open robinwhittleton opened this issue 1 year ago • 8 comments

This defaults years to be within <time> tags, which will automatically work for everything published after 1000. It also adjusts prepare_release to add the ISO datetime to the colophon “released on” field.

I knocked this together in 20 mins, so if you think it’ll be too much work for not enough value to update the corpus, let’s close it, no harm done 🙂

robinwhittleton avatar Oct 20 '24 19:10 robinwhittleton

I had been thinking about proposing this a while ago. Nice!

apasel422 avatar Oct 21 '24 12:10 apasel422

OK, I think we can do this; but we will of course have to update the corpus. That might be tricky... we can assume any string of 4 digits is a year, but what about low or BC dates like in Epictetus or odd formulations (also Epictetus)?

Then, we will have to calculate the ISO timestamp of the actual release date; this will have to be scripted as a one-off update.

I think updating the corpus will be the actual hard part here.

Lastly we need to update se_epub_build.py line 209 in the toolset to insert <time> during build.

acabal avatar Oct 21 '24 20:10 acabal

I’ll look into the se_epub_build.py update, thanks for the hint.

I had a scan through the corpus and found this non-standard dating. We might want to pick a standard for BC/BCE and AD/CE as well.

Date ranges with incomplete years

  • https://github.com/standardebooks/anthony-hope_the-prisoner-of-zenda/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/anthony-trollope_orley-farm/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/arthur-machen_the-three-impostors/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/ben-jonson_the-alchemist/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/carey-rockwell_stand-by-for-mars/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/charlotte-bronte_jane-eyre/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/e-m-forster_where-angels-fear-to-tread/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/edgar-wallace_room-13/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/ford-madox-ford_the-fifth-queen/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/henry-adams_democracy/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/henry-fielding_the-history-of-tom-jones-a-foundling/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/jules-verne_in-search-of-the-castaways_j-b-lippincott-co/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/leo-tolstoy_war-and-peace_louise-maude_aylmer-maude/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/luigi-pirandello_six-characters-in-search-of-an-author_edward-storer/blob/master/src/epub/text/colophon.xhtml#L31
  • https://github.com/standardebooks/thomas-a-kempis_the-imitation-of-christ_william-benham/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/thornton-w-burgess_green-forest-stories/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/thornton-w-burgess_green-meadow-stories/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/william-dean-howells_a-hazard-of-new-fortunes/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/william-dean-howells_the-rise-of-silas-lapham/blob/master/src/epub/text/colophon.xhtml#L29

Published 0 -> 999 so needs a 4-digit datetime padded with zeros

  • https://github.com/standardebooks/anonymous_beowulf_john-lesslie-hall/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/boethius_the-consolation-of-philosophy_h-r-james/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/epictetus_discourses_george-long/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/marcus-aurelius_meditations_george-long/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/suetonius_the-lives-of-the-caesars_j-c-rolfe/blob/master/src/epub/text/colophon.xhtml#L15

BCE dates that can’t be represented with datetime

  • https://github.com/standardebooks/aeschylus_agamemnon_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/aeschylus_the-eumenides_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/aeschylus_the-libation-bearers_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/apollonius-of-rhodes_the-argonautica_arthur-s-way/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/aristotle_nicomachean-ethics_f-h-peters/blob/master/src/epub/text/colophon.xhtml#L17
  • https://github.com/standardebooks/cicero_tusculan-disputations_c-d-yonge/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/herodotus_histories_g-c-macaulay/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/homer_the-iliad_william-cullen-bryant/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/homer_the-odyssey_william-cullen-bryant/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/julius-caesar_commentaries-on-the-gallic-war_w-a-mcdevitte_w-s-bohn/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/laozi_tao-te-ching_james-legge/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/pindar_victory-odes_arthur-s-way/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/plato_dialogues_benjamin-jowett/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/sun-tzu_the-art-of-war_lionel-giles/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/virgil_the-aeneid_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/virgil_the-eclogues_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/virgil_the-georgics_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15

Vague dates that can’t be represented with datetime

  • https://github.com/standardebooks/abu-al-ala-al-maarri_the-luzumiyat_ameen-rihani/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/apuleius_the-golden-ass_william-adlington/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/diogenes-laertius_the-lives-and-opinions-of-eminent-philosophers_c-d-yonge/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/edgar-saltus_mr-incouls-misadventure/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/epictetus_short-works_george-long/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/h-g-wells_the-war-of-the-worlds/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/khalil-gibran_the-prophet/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/mark-rutherford_the-revolution-in-tanners-lane/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/procopius_the-secret-history_richard-atwater/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/thiruvalluvar_the-kural_v-v-s-aiyar/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/thomas-a-kempis_the-imitation-of-christ_william-benham/blob/master/src/epub/text/colophon.xhtml#L15
  • https://github.com/standardebooks/william-shakespeare_henry-vi-part-i/blob/master/src/epub/text/colophon.xhtml#L29
  • https://github.com/standardebooks/william-shakespeare_henry-vi-part-ii/blob/master/src/epub/text/colophon.xhtml#L29

robinwhittleton avatar Oct 23 '24 18:10 robinwhittleton

Let’s reiterate this once more: I really don’t mind if you decide that the value isn’t there and we don’t want to do this.

robinwhittleton avatar Oct 23 '24 18:10 robinwhittleton

I think it's consistent with the project goals so it's worthwhile. Updating the corpus will be work.

I also wonder if we can add a lint check for this that somehow ignores odd constructions or year ranges like in short fiction collections.

Note that I think it should be possible to create an ISO timestamp of a BCE date; the years would just be negative. See https://www.tondering.dk/claus/cal/iso8601.php However, whether or not that will validate is an open question; you can try running it through an HTML validator to see.

acabal avatar Oct 23 '24 18:10 acabal

So, I’d tried negative numbers before, but got errors in the validator and dismissed it as non-spec-compliant. But I’ve just realised that I didn’t know that years needed to be padded to four digits with zeros, and trying -0001 just throws a warning about time zones now, not an error.

I had considered lint checks, but I figured that, to start with at least, if we have the <time> elements in the colophon template they’re likely to stay in place.

I will probably need to update the examples in the manual’s “Colophon” section. And it might be worth adding something about how to complete the datetime attribute.

robinwhittleton avatar Oct 24 '24 06:10 robinwhittleton

Where are we at with this proposal?

acabal avatar May 09 '25 19:05 acabal

I had it in my head that this was now on your plate, had forgotten the last couple of posts 😅 Let me poke at it a little more.

robinwhittleton avatar May 10 '25 07:05 robinwhittleton

I’ve pushed up a basic lint check and regenerated the golden masters.

Unfortunately it looks like I was wrong about negative dates. The timezone warning in the validator is due to it misparsing the datetime, and MDN says:

For the purposes of HTML dates, years are always at least four digits long; years prior to the year 1000 are padded with leading zeroes (0), so the year 72 is written as 0072. Years prior to the year 1 C.E. are not supported, so HTML doesn't support years 1 B.C.E. (1 B.C.) or earlier.

I’ve documented this approach for the manual in https://github.com/standardebooks/manual/pull/205.
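
To make the MDN rule concrete, here is a minimal Python sketch. It is illustrative only and is not code from this PR or the toolset; the function name `year_to_time_markup` and its `bce` parameter are hypothetical. It leaves a four-digit year as plain `<time>` content, adds a zero-padded `datetime` attribute for years below 1000, and returns nothing for years HTML cannot represent.

```python
from typing import Optional


def year_to_time_markup(year: int, bce: bool = False) -> Optional[str]:
    """Return <time> markup for a year, or None if HTML cannot represent it.

    Hypothetical helper for illustration; not part of the SE toolset.
    """
    # HTML dates start at year 1 CE, so BCE years (and year 0) get no <time>.
    if bce or year < 1:
        return None

    # A year of four or more digits is already a valid date string as text.
    if year >= 1000:
        return f"<time>{year}</time>"

    # Earlier years need a zero-padded datetime attribute, e.g. 72 -> 0072.
    return f'<time datetime="{year:04d}">{year}</time>'
```

Under these assumptions a caller would leave the year as bare text whenever the function returns `None`, which is the case for the BCE titles listed earlier in the thread.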

robinwhittleton avatar Jun 29 '25 19:06 robinwhittleton

Untested pseudo-xpath:

/html/body//p/text()[re:test(., '\b[0-9]{3,4}\s$') and following-sibling::*[1][name() = 'abbr' and re:test(., '^BCE?$')]]

This should match paragraphs containing BCE years, so invert that or something for the lint check so that it doesn't trigger when BCE years are present.
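
For readers who want to experiment with the idea, the same test can be approximated with Python's standard library alone. This is a rough sketch under stated assumptions, not the lint check's actual implementation: the function names are hypothetical, and `xml.etree.ElementTree` is used in place of an EXSLT-capable XPath engine, so the regular expressions are applied in Python rather than via `re:test`.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

YEAR_AT_END = re.compile(r"\b[0-9]{3,4}\s$")
BCE_ABBR = re.compile(r"^BCE?$")


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so XHTML tags compare as plain names."""
    return tag.rsplit("}", 1)[-1]


def paragraphs_with_bce_years(xhtml: str) -> List[ET.Element]:
    """Return <p> elements where a year is directly followed by <abbr>BC(E)</abbr>."""
    root = ET.fromstring(xhtml)
    matches = []
    for element in root.iter():
        if local_name(element.tag) != "p":
            continue
        # Text before the first child is element.text; text after each
        # child is stored on that child's .tail.
        preceding = element.text or ""
        for child in element:
            if (
                local_name(child.tag) == "abbr"
                and BCE_ABBR.match(child.text or "")
                and YEAR_AT_END.search(preceding)
            ):
                matches.append(element)
                break
            preceding = child.tail or ""
    return matches
```

A lint check built on this would skip the bare-year warning for any paragraph the function returns.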

acabal avatar Jun 29 '25 22:06 acabal

Also, we should use xpath for the lint check and not a regex regardless, because currently the regex will emit a lint error if there is for example <abbr epub:type="z3998:given-name"> in the colophon, which is a very common case.

acabal avatar Jun 30 '25 03:06 acabal

I ended up just adding a negative lookahead to the current regex to make sure a BC / BCE date doesn’t follow the year. I was having trouble with XPath but I’m more than happy to revisit if you think this is fragile. Having said that, we’re not going to run into trouble with z3998 namespaced attributes as the regex checks for a space character followed by a digit sequence followed by another space character. All the bare years in the colophon template follow that pattern, and running a check against the corpus shows that only a handful of titles don’t match that, usually because they have a year range using an en-dash for the painting (which isn’t caught by our published (in|between) lint check).
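
The exact lint regex is not quoted in this thread, so the following Python sketch is only an assumed reconstruction of the behaviour described: whitespace, a bare year, whitespace, and a negative lookahead rejecting a following BC/BCE abbreviation. The names `BARE_YEAR` and `find_bare_years` are hypothetical.

```python
import re
from typing import List

# Assumed shape of the check: whitespace, a bare 3-4 digit year, whitespace,
# and no BC/BCE abbreviation immediately afterwards.
BARE_YEAR = re.compile(r"\s([0-9]{3,4})\s(?!<abbr[^>]*>BCE?</abbr>)")


def find_bare_years(line: str) -> List[str]:
    """Return years in `line` that are not wrapped in <time> and are not BC(E)."""
    return BARE_YEAR.findall(line)
```

Because the year must be surrounded by whitespace, digits inside an attribute value such as `z3998:given-name`, or inside an existing `<time>` element, do not match.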

robinwhittleton avatar Jun 30 '25 21:06 robinwhittleton

Can you try getting an xpath working? That will let us output a specific line that is the problem. Also, there is a draft PR that will add line numbers to lint output, and that only works with xpath nodes right now.

acabal avatar Jun 30 '25 22:06 acabal

XPath working: I ended up copying another test with following-sibling and amending it to fit.

robinwhittleton avatar Jul 03 '25 21:07 robinwhittleton

Great, thanks!

Now the big question is, how do we update the corpus? Can you take care of that?

acabal avatar Jul 04 '25 22:07 acabal

Thinking about this further, we have another big update in the next version of the toolset where we remove url: from the SE identifier. Instead of rebuilding the corpus twice we should do it just once. With that in mind can you put together a Bash script that would update the corpus to match this PR, and send it to me so I can apply it on my end?

I bet 90% of it could be handled with something like sed --regexp-extended --in-place 's|(\s[0-9]{4}\b)|<time>\1</time>|g' /path/to/colophon/files, then a regex for year ranges, and maybe a small set of hand-done exceptions.

acabal avatar Jul 05 '25 16:07 acabal

No problem, was just fully updating my corpus before starting, but let me see what I can put together. (Also will have to remember how to write bash scripts after a decade and a half of using fish 😁)

robinwhittleton avatar Jul 05 '25 16:07 robinwhittleton

I’m mildly hamstrung by trying to get this working on a Linux VM and not being used to GNU versions of the tooling, but I think this will do the trick. Could you test?

#!/bin/bash

# To be run directly from a corpus directory

# Deal with standard four digit dates
sed --regexp-extended --in-place 's|(\b[0-9]{4})(\s)|<time>\1</time>\2|g' **/src/epub/text/colophon.xhtml

# Deal with three digit dates
sed --regexp-extended --in-place 's|(\b[0-9]{3})(\s)|<time datetime="0\1">\1</time>\2|g' **/src/epub/text/colophon.xhtml

# Undo those changes to three digit BC(E) dates - not supported in HTML
sed --regexp-extended --in-place 's|<time datetime="0[0-9]{3}">([0-9]{3})</time>(\s.*>BCE?<)|\1\2|g' **/src/epub/text/colophon.xhtml

# Finally, amend the first published dates
for d in */ ; do
	# $d already ends in a slash; quote the path in case of unusual directory names
	COLOPHON="${d}src/epub/text/colophon.xhtml"
	DATE_STRING=$(grep -E '<b>([A-Za-z])+ [0-9]+, [0-9]+, [0-9]+:[0-9]+ <abbr class="eoc">[ap]\.m\.<\/abbr><\/b>' "$COLOPHON" |
		sed 's/<b>//' |
		sed 's/,//g' |
		sed 's/ <abbr class="eoc">/ /' |
		sed 's/<\/abbr><\/b><br\/>//'
	)
	# Skip colophons with no recognisable release date
	[ -z "$DATE_STRING" ] && continue
	ISO_DATE_STRING=$(date -Iminutes --date "$DATE_STRING")
	sed --regexp-extended --in-place "s|(([A-Za-z])+ [0-9]+, [0-9]+, [0-9]+:[0-9]+ <abbr class=\"eoc\">[ap]\.m\.<\/abbr>)|<time datetime=\"$ISO_DATE_STRING\">\1</time>|g" "$COLOPHON"
done
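
The date conversion in the final loop can also be sketched in Python, which may be easier to test on a single colophon line. This is a hedged illustration, not part of the script above: `release_date_to_iso` is a hypothetical name, and it assumes the colophon time is UTC, whereas `date -Iminutes` uses the local zone of the machine running the script.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Same shape as the pattern used by the shell script's grep.
RELEASE_DATE = re.compile(
    r'([A-Za-z]+ [0-9]+, [0-9]+, [0-9]+:[0-9]+) <abbr class="eoc">([ap])\.m\.</abbr>'
)


def release_date_to_iso(colophon_line: str) -> Optional[str]:
    """Convert a colophon release date to an ISO 8601 string, or None if absent."""
    match = RELEASE_DATE.search(colophon_line)
    if match is None:
        return None
    # strptime's %p expects AM/PM, so rebuild the marker from the captured letter.
    text = f"{match.group(1)} {match.group(2).upper()}M"
    parsed = datetime.strptime(text, "%B %d, %Y, %I:%M %p")
    # Assumption: the colophon time is UTC.
    return parsed.replace(tzinfo=timezone.utc).isoformat(timespec="minutes")
```

Note that `%B` and `%p` are locale-dependent, so this sketch assumes an English or C locale.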

There are some oddities remaining.

  • A bunch of “completed” date ranges for paintings with en dashes. I’ve updated the lint that checks for the same thing with “published” dates (https://github.com/standardebooks/tools/pull/846) and will get them fixed.
  • seneca_dialogues_aubrey-stewart has two-digit AD dates that’ll need manual fixup
  • The painting titles in owen-johnson_stover-at-yale / gustave-le-bon_the-crowd_t-fisher-unwin-ltd / charles-dickens_martin-chuzzlewit / cicely-hamilton_theodore-savage have dates that’ll need manual fixup
  • edward-whymper_scrambles-amongst-the-alps-in-the-years-1860-69 will need title fixup

I expect there’s more stuff lurking too, but this is as much as I’ve been able to find.

robinwhittleton avatar Jul 07 '25 20:07 robinwhittleton

OK, perfect, thanks. I think I've updated everything on my local copy. Don't push any more changes just now. I'm waiting on a merge conflict resolution for a different PR; then we can release a new version of the tools, manual, and website.

acabal avatar Jul 08 '25 19:07 acabal