pretext Research backporting pagination into merge file

It is customary with a Braille book to maintain two page numberings. One is "the Braille page number", which corresponds to the nth face of a piece of paper just like page numbering in a regular book. But in the non-Braille print copy, one page may correspond to a (possibly improper) fraction of a Braille page. So there are "horizontal rules" in Braille on a Braille page that correspond to where the non-Braille edition has a page break. And alongside these rules there is a number to indicate the page number from the non-Braille edition. A non-Braille student can say "I'm reading page 67" and the Braille-reading student can go to the same place in their copy, and vice versa.

So in the future, we could make .tex -> .pdf. And if there were a way to backport where the pagebreaks were into XML, we could incorporate that into the merge process that is used for WW. And then we could generate Braille with pagebreaks in the right places. As an optional switch, we could even mark such places on HTML (not my taste, but I could imagine uses).

So...this issue is to investigate ways to backport where the page breaks happen from tex-pdf to xml.

Mar 09 '19 03:03 Alex-Jordan

It is slightly tricky that page breaks can occur in the middle of a PDF paragraph, which may be in the middle of a line in the braille or HTML versions. (I think it could be useful to have the option of showing page information in the HTML.)

If you compile the PDF with the flag --synctex=1 then a new file with extension .synctex.gz will be created (or .synctex if it is not gzipped).

That file has the information about the mapping from the LaTeX to the PDF file. Apparently you are not supposed to look at that file, but instead use command line utilities:

http://manpages.ubuntu.com/manpages/cosmic/en/man5/synctex.5.html

So, it should be possible to find the line in the source file that becomes the end of a PDF page.
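
A minimal sketch of that lookup, assuming the `synctex` command line tool is on the PATH and that `synctex edit -o page:x:y:file` reports `Input:` and `Line:` records for a point on a PDF page (coordinates in big points measured from the top left of the page). The function names and the sample paths are hypothetical, and the parser only reads the two fields it needs.

```python
import subprocess


def parse_synctex_edit(text):
    """Collect (input_file, line) pairs from the text printed by `synctex edit`."""
    records = []
    current_input = None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # banner lines such as "SyncTeX result begin"
        if key == "Input":
            current_input = value
        elif key == "Line" and current_input is not None:
            records.append((current_input, int(value)))
            current_input = None
    return records


def source_lines_at(pdf, page, x, y):
    """Ask synctex which source line(s) produced the point (x, y) on a page.

    Assumption: probing a point near the bottom of the text block gives
    the last source line typeset on that page.
    """
    result = subprocess.run(
        ["synctex", "edit", "-o", f"{page}:{x}:{y}:{pdf}"],
        capture_output=True, text=True, check=True,
    )
    return parse_synctex_edit(result.stdout)
```

For example, `source_lines_at("book.pdf", 3, 300, 700)` would probe a point low on page 3 of a hypothetical `book.pdf`; choosing a good probe point for a given page layout is left open here.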

To make the braille, you can create a temporary source file with appropriate markup after that line.

It helps if the source is formatted the way I think it should be (meaning, multiple lines per paragraph instead of one long line) because then the break will be more accurate.

For the HTML, from the line number in the LaTeX file you can find the permid of the object containing that line. And (if it is a paragraph) you can get a good idea of what fraction of the paragraph is on the previous page.

Then you can supply a bunch of triples to the HTML:

(id, page number, percent)

which is sufficient for Javascript to mark the page breaks.
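
A rough sketch of how those triples might be computed, assuming two inputs that would have to come from elsewhere: a table of source line ranges for each paragraph (keyed by permid) and the last source line typeset on each PDF page. The function name and the data layout are hypothetical. The percent is measured in source lines, which is why paragraphs spread over several lines give a better estimate than one long line.

```python
import json


def page_break_triples(paragraphs, breaks):
    """Build (permid, page number, percent) triples.

    paragraphs: list of (permid, first_line, last_line), lines inclusive.
    breaks: list of (page_number, line), where `line` is the last source
            line that landed on that page.
    """
    triples = []
    for page, line in breaks:
        for permid, first, last in paragraphs:
            if first <= line <= last:
                total = last - first + 1
                before_break = line - first + 1
                percent = round(100 * before_break / total)
                triples.append((permid, page, percent))
                break  # a source line belongs to at most one paragraph
    return triples


def triples_as_json(triples):
    """Serialize the triples so a script in the HTML can read them."""
    return json.dumps(
        [{"id": i, "page": p, "percent": pct} for i, p, pct in triples]
    )
```

A break that falls on a line outside every known paragraph is simply skipped in this sketch; a real version would need a policy for breaks between objects.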


Mar 09 '19 12:03 davidfarmer

I think this is larger than just Braille. I've heard people complain about the online version lacking page numbers. As in, the professor says "Now, on page 432, we have..." Of course, this assumes a canonical numbering someplace. That would seem to be a publisher decision. A printed edition makes good sense as the canonical numbering - though a Braille rendition is an expensive proposition I think, so you might want it to have a longer life? (About $0.05 per page for paper, about a 3x inflation factor.)

EPUB is reflowable, so the same problem exists there. I forget what the typical way is to indicate "real" page numbers on a Kindle book?

synctex gets used heavily in CoCalc, so William would have experience with that (and I think that code should be open). Braille pipeline begins with PTX HTML, so whatever mechanism is used could likely be used for both online HTML and Braille.

Mar 11 '19 22:03 rbeezer

There is something of a production circle this is heading toward.

You can't put page numbering data in HTML without first having built the canonical PDF.

You can't make the canonical PDF without first having taken preview images of interactive things. And presently, that requires you to do a full build of HTML.

OK, so it's fine to say "you build HTML, get preview images, build PDF, extract pagination data, then build HTML again." But it makes me wonder if there should be a way to get those standalone interactive HTML pages without having to build the full HTML book. Maybe that has always been the plan and interactives are still simmering.

Mar 11 '19 22:03 Alex-Jordan

Right.

At the end,

> without having to build the full HTML book

What is the cost? A few minutes waiting around?

Mar 11 '19 22:03 rbeezer

> What is the cost? A few minutes waiting around?

That's the cost if you understand everything, all of the time. The real cost (imho) is from getting confused. Forgetting that you did not re-run the merge after that last edit or something like that. And being unaware that you have posted PDF and HTML that are not in fact synchronized because of some oversight.

Mar 11 '19 23:03 Alex-Jordan

We need a script/button that does it all, so nobody (not even diligent experts) will make a mistake.
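
One possible shape for such a script, as a sketch only. It assumes the build order described earlier in the thread (HTML, preview images, PDF, pagination data, HTML again), with each step supplied by the caller as a plain Python callable. The step names, `build_all`, and `is_synchronized` are hypothetical, and no real PreTeXt commands are shown. Every output is stamped with a digest of the source, so a later check can tell whether the posted PDF and HTML came from the same source.

```python
import hashlib
from pathlib import Path

# Order taken from the discussion: the second HTML build uses page data.
STEPS = ["html", "previews", "pdf", "pagination", "html-final"]


def source_digest(paths):
    """Hash the contents of all source files, in a stable order."""
    h = hashlib.sha256()
    for p in sorted(str(p) for p in paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()


def build_all(sources, builders):
    """Run every step in order and stamp each output with the source digest.

    builders maps a step name to a zero-argument callable that raises on
    failure, which stops the run before any later step can go stale.
    """
    digest = source_digest(sources)
    stamps = {}
    for name in STEPS:
        builders[name]()
        stamps[name] = digest
    return stamps


def is_synchronized(stamps, sources):
    """True only if every output was built from the current source."""
    current = source_digest(sources)
    return all(stamps.get(name) == current for name in STEPS)
```

Calling `is_synchronized` just before posting would catch the case of an edit made after the last merge.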


Mar 11 '19 23:03 davidfarmer

> We need a script/button that does it all

Right!

Mar 11 '19 23:03 rbeezer

@dbrianwalton has mentioned bugging the log file of a LaTeX run with page numbers. Then the challenge becomes inserting those canonical page numbers back into XML source in the right place.

Idea: place the most granular xml:id into the output with the page numbers. Now they can reliably go somewhat close to the right location in the source.
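
A sketch of the reading side of that idea, assuming the LaTeX run writes one marker line per object into the log, of the form `PTXPAGE id=<xml:id> page=<n>`. That marker format is invented here for illustration and is not what PreTeXt or #2034 emits. On the LaTeX side the marker would need a deferred `\write` so that the page number is the one at shipout, and ids would need to be short enough to avoid log line wrapping.

```python
import re

# Hypothetical marker format; see the assumptions stated above.
MARKER = re.compile(r"^PTXPAGE id=(\S+) page=(\d+)$")


def pages_from_log(log_text):
    """Map each xml:id found in the log to the page it was shipped on."""
    pages = {}
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            pages[match.group(1)] = int(match.group(2))
    return pages


def first_id_on_each_page(pages):
    """For each page, the first xml:id reported on it, in log order.

    A page break falls somewhere at or before this object, which is the
    "somewhat close" location in the source.
    """
    first = {}
    for xml_id, page in pages.items():
        first.setdefault(page, xml_id)
    return first
```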

Jul 18 '21 17:07 rbeezer

Braille needs canonical page numbers. But EPUB would benefit from this also. A step in the right direction is #2034 which is in-progress at this writing.

Aug 03 '23 17:08 rbeezer