Use time markup in colophon
This defaults years to be within <time> tags, which will automatically work for everything published after 1000. It also adjusts prepare_release to add the ISO datetime to the colophon “released on” field.
I knocked this together in 20 mins, so if you think it’ll be too much work for not enough value to update the corpus, let’s close it, no harm done 🙂
I had been thinking about proposing this a while ago. Nice!
OK, I think we can do this; but we will of course have to update the corpus. That might be tricky... we can assume any string of 4 digits is a year, but what about low or BC dates like in Epictetus or odd formulations (also Epictetus)?
Then, we will have to calculate the ISO timestamp of the actual release date; this will have to be scripted as a one-off update.
I think updating the corpus will be the actual hard part here.
Lastly we need to update se_epub_build.py line 209 in the toolset to insert <time> during build.
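For the release-date timestamp mentioned above, here is a minimal Python sketch of the conversion. It assumes the colophon date reads like "May 25, 2014, 12:00 a.m." once the markup is stripped, and that release times are UTC. Both are assumptions for illustration, and `release_date_to_iso` is a hypothetical helper, not something in the toolset.

```python
from datetime import datetime, timezone


def release_date_to_iso(text: str) -> str:
    """Convert a colophon-style date such as 'May 25, 2014, 12:00 a.m.' to ISO 8601."""
    # Normalise 'a.m.'/'p.m.' to 'AM'/'PM' so that strptime's %p directive can read them.
    normalised = text.replace("a.m.", "AM").replace("p.m.", "PM")
    parsed = datetime.strptime(normalised, "%B %d, %Y, %I:%M %p")
    # Assumption: the release time is expressed in UTC.
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

The month name and AM/PM parsing depend on an English locale, so a one-off corpus script would want to pin that down before running.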
I’ll look into the se_epub_build.py update, thanks for the hint.
I had a scan through the corpus and found this non-standard dating. We might want to pick a standard for BC/BCE and AD/CE as well.
Date ranges with incomplete years
- https://github.com/standardebooks/anthony-hope_the-prisoner-of-zenda/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/anthony-trollope_orley-farm/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/arthur-machen_the-three-impostors/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/ben-jonson_the-alchemist/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/carey-rockwell_stand-by-for-mars/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/charlotte-bronte_jane-eyre/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/e-m-forster_where-angels-fear-to-tread/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/edgar-wallace_room-13/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/ford-madox-ford_the-fifth-queen/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/henry-adams_democracy/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/henry-fielding_the-history-of-tom-jones-a-foundling/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/jules-verne_in-search-of-the-castaways_j-b-lippincott-co/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/leo-tolstoy_war-and-peace_louise-maude_aylmer-maude/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/luigi-pirandello_six-characters-in-search-of-an-author_edward-storer/blob/master/src/epub/text/colophon.xhtml#L31
- https://github.com/standardebooks/thomas-a-kempis_the-imitation-of-christ_william-benham/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/thornton-w-burgess_green-forest-stories/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/thornton-w-burgess_green-meadow-stories/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/william-dean-howells_a-hazard-of-new-fortunes/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/william-dean-howells_the-rise-of-silas-lapham/blob/master/src/epub/text/colophon.xhtml#L29
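To make the "incomplete years" case concrete, here is a rough Python sketch that expands an abbreviated range into two full years. It assumes the ranges look like "1860–69" (the shape seen in the edward-whymper title), joined with an en dash or a hyphen. `expand_year_range` is a hypothetical helper.

```python
import re
from typing import Tuple


def expand_year_range(text: str) -> Tuple[int, int]:
    """Expand an abbreviated range like '1860–69' into full start and end years."""
    match = re.fullmatch(r"(\d{4})[–-](\d{1,4})", text)
    if match is None:
        raise ValueError(f"not a year range: {text!r}")
    start, end = match.group(1), match.group(2)
    # Borrow the missing leading digits of the end year from the start year.
    full_end = start[: len(start) - len(end)] + end
    return int(start), int(full_end)
```

With both years written out in full, each one can be wrapped in its own <time> element.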
Published 0 -> 999 so needs a 4-digit datetime padded with zeros
- https://github.com/standardebooks/anonymous_beowulf_john-lesslie-hall/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/boethius_the-consolation-of-philosophy_h-r-james/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/epictetus_discourses_george-long/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/marcus-aurelius_meditations_george-long/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/suetonius_the-lives-of-the-caesars_j-c-rolfe/blob/master/src/epub/text/colophon.xhtml#L15
BCE dates that can’t be represented with datetime
- https://github.com/standardebooks/aeschylus_agamemnon_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/aeschylus_the-eumenides_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/aeschylus_the-libation-bearers_gilbert-murray/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/apollonius-of-rhodes_the-argonautica_arthur-s-way/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/aristotle_nicomachean-ethics_f-h-peters/blob/master/src/epub/text/colophon.xhtml#L17
- https://github.com/standardebooks/cicero_tusculan-disputations_c-d-yonge/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/herodotus_histories_g-c-macaulay/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/homer_the-iliad_william-cullen-bryant/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/homer_the-odyssey_william-cullen-bryant/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/julius-caesar_commentaries-on-the-gallic-war_w-a-mcdevitte_w-s-bohn/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/laozi_tao-te-ching_james-legge/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/pindar_victory-odes_arthur-s-way/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/plato_dialogues_benjamin-jowett/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/sun-tzu_the-art-of-war_lionel-giles/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/virgil_the-aeneid_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/virgil_the-eclogues_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/virgil_the-georgics_john-dryden/blob/master/src/epub/text/colophon.xhtml#L15
Vague dates that can’t be represented with datetime
- https://github.com/standardebooks/abu-al-ala-al-maarri_the-luzumiyat_ameen-rihani/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/apuleius_the-golden-ass_william-adlington/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/diogenes-laertius_the-lives-and-opinions-of-eminent-philosophers_c-d-yonge/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/edgar-saltus_mr-incouls-misadventure/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/epictetus_short-works_george-long/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/h-g-wells_the-war-of-the-worlds/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/khalil-gibran_the-prophet/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/mark-rutherford_the-revolution-in-tanners-lane/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/procopius_the-secret-history_richard-atwater/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/thiruvalluvar_the-kural_v-v-s-aiyar/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/thomas-a-kempis_the-imitation-of-christ_william-benham/blob/master/src/epub/text/colophon.xhtml#L15
- https://github.com/standardebooks/william-shakespeare_henry-vi-part-i/blob/master/src/epub/text/colophon.xhtml#L29
- https://github.com/standardebooks/william-shakespeare_henry-vi-part-ii/blob/master/src/epub/text/colophon.xhtml#L29
Let’s reiterate this once more: I really don’t mind if you decide that the value isn’t there and we don’t want to do this.
I think it's consistent with the project goals so it's worthwhile. Updating the corpus will be work.
I also wonder if we can add a lint check for this that somehow ignores odd constructions or year ranges like in short fiction collections.
Note that I think it should be possible to create an ISO timestamp of a BCE date; the years would just be negative. See https://www.tondering.dk/claus/cal/iso8601.php. However, whether or not that will validate is an open question; you can try running it through an HTML validator to see.
So, I’d tried negative numbers before, but got errors in the validator and dismissed it as non-spec-compliant. But I’ve just realised that I didn’t know that years needed to be padded to four digits with zeros, and trying -0001 just throws a warning about time zones now, not an error.
I had considered lint checks, but I figured that, to start with at least, if we have the <time> elements in the colophon template they’re likely to stay in place.
I will probably need to update the examples in the manual’s “Colophon” section. And it might be worth adding something about how to complete the datetime attribute.
Where are we at with this proposal?
I had it in my head that this was now on your plate, had forgotten the last couple of posts 😅 Let me poke at it a little more.
I’ve pushed up a basic lint check and regenerated the golden masters.
Unfortunately it looks like I was wrong about negative dates. The timezone warning in the validator is due to it misparsing the datetime, and MDN says:
For the purposes of HTML dates, years are always at least four digits long; years prior to the year 1000 are padded with leading zeroes (0), so the year 72 is written as 0072. Years prior to the year 1 C.E. are not supported, so HTML doesn't support years 1 B.C.E. (1 B.C.) or earlier.
I’ve documented this approach for the manual in https://github.com/standardebooks/manual/pull/205.
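The padding rule from the MDN passage can be sketched in Python roughly like this. `year_time_element` is a hypothetical helper, and treating years below 1 as "no markup" is my reading of the quote rather than anything the toolset does.

```python
from typing import Optional


def year_time_element(year: int) -> Optional[str]:
    """Return <time> markup for a CE year, or None if HTML cannot represent it."""
    if year < 1:
        # Per the MDN passage quoted above, years before 1 CE are not supported.
        return None
    if year >= 1000:
        # A four-digit year is already a valid date string, so no datetime attribute is needed.
        return f"<time>{year}</time>"
    # Years below 1000 need a datetime attribute zero-padded to four digits.
    return f'<time datetime="{year:04d}">{year}</time>'
```

For example, the year 72 from the MDN text would come out as `<time datetime="0072">72</time>`.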
Untested pseudo-xpath:
/html/body//p/text()[re:test(., '\b[0-9]{3,4}\s$') and following-sibling::*[1][name() = 'abbr' and re:test(., '^BCE?$')]]
This should match paragraphs containing BCE years, so invert that or something for the lint check so that it doesn't trigger when BCE years are present.
Also, we should use xpath for the lint check and not a regex regardless, because currently the regex will emit a lint error if there is for example <abbr epub:type="z3998:given-name"> in the colophon, which is a very common case.
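As a rough illustration of what the pseudo-xpath is after, here is the same check written as a plain tree walk with Python's standard library. It is a sketch only: it assumes a colophon fragment without an XML namespace (real XHTML files declare one, so the tag names would need qualifying), and `has_bce_year` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

YEAR_AT_END = re.compile(r"\b[0-9]{3,4}\s$")
BCE = re.compile(r"^BCE?$")


def has_bce_year(xhtml: str) -> bool:
    """Return True if a <p> has text ending in a year directly followed by <abbr>BC(E)</abbr>."""
    root = ET.fromstring(xhtml)
    # Assumption: tags carry no namespace prefix in this simplified input.
    for p in root.iter("p"):
        # The text before a child is p.text for the first child, else the previous child's tail.
        preceding = p.text
        for child in p:
            if (
                preceding
                and YEAR_AT_END.search(preceding)
                and child.tag == "abbr"
                and BCE.match(child.text or "")
            ):
                return True
            preceding = child.tail
    return False
```

Because the check looks at the element that follows the year rather than at any <abbr> in the paragraph, an unrelated <abbr epub:type="z3998:given-name"> elsewhere in the colophon does not trigger it.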
I ended up just adding a negative lookahead to the current regex to make sure a BC / BCE date doesn’t follow the year. I was having trouble with XPath, but I’m more than happy to revisit if you think this is fragile. Having said that, we’re not going to run into trouble with z3998 namespaced attributes, as the regex checks for a space character followed by a digit sequence followed by another space character. All the bare years in the colophon template follow that pattern, and running a check against the corpus shows that only a handful of titles don’t match it, usually because they have a year range using an en dash for the painting (which isn’t caught by our published (in|between) lint check).
Can you try getting an xpath working? That will let us output a specific line that is the problem. Also, there is a draft PR that will add line numbers to lint output, and that only works with xpath nodes right now.
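For anyone following along, the idea behind the lookahead is roughly the following. This is not the exact regex in the PR, just a sketch of the technique; `BARE_YEAR` and `find_bare_years` are hypothetical names.

```python
import re
from typing import List

# A bare year: whitespace, three or four digits, whitespace,
# and NOT followed by an <abbr> element holding BC or BCE.
BARE_YEAR = re.compile(r"\s([0-9]{3,4})\s(?!<abbr[^>]*>BCE?</abbr>)")


def find_bare_years(line: str) -> List[str]:
    """Return the years in a line that still need wrapping in <time>."""
    return BARE_YEAR.findall(line)
```

Years already inside <time>…</time> are skipped simply because they are not surrounded by whitespace.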
XPath working: I ended up copying another test with following-sibling and amending it to fit.
Great, thanks!
Now the big question is, how do we update the corpus? Can you take care of that?
Thinking about this further, we have another big update in the next version of the toolset where we remove url: from the SE identifier. Instead of rebuilding the corpus twice we should do it just once. With that in mind can you put together a Bash script that would update the corpus to match this PR, and send it to me so I can apply it on my end?
I bet 90% of it could be handled with something like sed --regexp-extended --in-place 's|(\s)([0-9]{4})\b|\1<time>\2</time>|g' /path/to/colophon/files, then a regex for year ranges, and maybe a small set of hand-done exceptions.
No problem, was just fully updating my corpus before starting, but let me see what I can put together. (Also will have to remember how to write bash scripts after a decade and a half of using fish 😁)
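A Python equivalent of that substitution might look like the sketch below. It assumes bare years always have whitespace on both sides, and it deliberately does nothing about BCE years or year ranges; `wrap_years` is a hypothetical helper.

```python
import re


def wrap_years(text: str) -> str:
    """Wrap whitespace-delimited four-digit years in <time> tags."""
    # Lookarounds leave the surrounding whitespace outside the <time> element.
    return re.sub(r"(?<=\s)([0-9]{4})(?=\s)", r"<time>\1</time>", text)
```

Requiring whitespace on both sides keeps the substitution away from already-wrapped years and from the year in the "released on" date, which is followed by a comma.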
I’m mildly hamstrung by trying to get this working on a Linux VM and not being used to GNU versions of the tooling, but I think this will do the trick. Could you test?
#!/bin/bash
# To be run directly from a corpus directory
# Deal with standard four digit dates
sed --regexp-extended --in-place 's|(\b[0-9]{4})(\s)|<time>\1</time>\2|g' **/src/epub/text/colophon.xhtml
# Deal with three digit dates
sed --regexp-extended --in-place 's|(\b[0-9]{3})(\s)|<time datetime="0\1">\1</time>\2|g' **/src/epub/text/colophon.xhtml
# Undo those changes to three digit BC(E) dates - not supported in HTML
sed --regexp-extended --in-place 's|<time datetime="0[0-9]{3}">([0-9]{3})</time>(\s.*>BCE?<)|\1\2|g' **/src/epub/text/colophon.xhtml
# Finally, amend the first published dates
for d in */ ; do
    COLOPHON="${d}src/epub/text/colophon.xhtml"
    # Pull the human-readable release date out of the colophon and strip its markup
    DATE_STRING=$(grep -E '<b>([A-Za-z])+ [0-9]+, [0-9]+, [0-9]+:[0-9]+ <abbr class="eoc">[ap]\.m\.<\/abbr><\/b>' "$COLOPHON" |
        sed 's/<b>//' |
        sed 's/,//g' |
        sed 's/ <abbr class="eoc">/ /' |
        sed 's/<\/abbr><\/b><br\/>//'
    )
    # Skip books whose colophon has no recognisable release date
    [ -z "$DATE_STRING" ] && continue
    ISO_DATE_STRING=$(date -Iminutes --date "$DATE_STRING")
    sed --regexp-extended --in-place "s|(([A-Za-z])+ [0-9]+, [0-9]+, [0-9]+:[0-9]+ <abbr class=\"eoc\">[ap]\.m\.<\/abbr>)|<time datetime=\"$ISO_DATE_STRING\">\1</time>|g" "$COLOPHON"
done
There are some oddities remaining.
- A bunch of “completed” date ranges for paintings with en dashes. I’ve updated the lint that checks for the same thing with “published” dates (https://github.com/standardebooks/tools/pull/846) and will get them fixed.
- seneca_dialogues_aubrey-stewart has two-digit AD dates that’ll need manual fixup
- The painting titles in owen-johnson_stover-at-yale / gustave-le-bon_the-crowd_t-fisher-unwin-ltd / charles-dickens_martin-chuzzlewit / cicely-hamilton_theodore-savage have dates that’ll need manual fixup
- edward-whymper_scrambles-amongst-the-alps-in-the-years-1860-69 will need title fixup
I expect there’s more stuff lurking too, but this is as much as I’ve been able to find.
OK perfect, thanks. I think I've updated everything on my local copy. Don't push any more changes just now. I'm waiting on a merge conflict resolution for a different PR, then we can release a new version of the tools, manual, and website.