
Making it machine-translatable will make hts-specs available to more people on the planet.

Open kojix2 opened this issue 3 years ago • 13 comments

Hello.

 I would like to raise an issue from a slightly different perspective here. To be frank, I'm not very good at English. This text is written by DeepL, but on days when DeepL is off, Google Translate does it for me. Without machine translation, my life would not be possible.

 The same goes for reading papers. My intelligence is not capable of reading English papers quickly. I always look at the web page and then use Google Translate.

 And hts-spec .... Oops, hts-spec is not machine-translatable; the PDF has annoying line breaks and weird paragraphs that are not easily machine-translatable.

 This is why reading hts-spec is so difficult. Not only is the content difficult, but it is also difficult to use machine translation. Most people involved in bioinformatics are very smart, so this may not be a problem. Some people can even speak several languages easily. However, most people on the planet are not that smart. I am one of those not-so-smart people.

 I am convinced that providing the hts-spec in a form that can be read by machine translation will help more people. For example, it could be an HTML web page with no hard line breaks. The hts-specs documentation seems to be generated from TeX, but I don't know whether that is easy to do.
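 As a stopgap until such a version exists, text copied out of the PDF can be unwrapped before it is pasted into a translator. The following is only a minimal Python sketch, assuming that paragraphs are separated by blank lines; the function name unwrap_paragraphs is made up for illustration, and tables or formulae will not survive this treatment.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Split on blank lines (possibly containing stray spaces).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace, including newlines, to one space.
    unwrapped = [" ".join(paragraph.split()) for paragraph in paragraphs]
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "The PDF has\nannoying line breaks.\n\nSecond\nparagraph."
    print(unwrap_paragraphs(sample))
```

 This deliberately does not try to rejoin words hyphenated across lines, because a real hyphen and a line-break hyphen cannot be told apart reliably.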

 I know this comment may be too candid and somewhat impolite. However, it reflects what is true for me. Thank you for reading.

Translated with www.DeepL.com/Translator (free version)

kojix2 avatar Aug 16 '21 06:08 kojix2

I know you closed this issue, but inclusivity is still an issue we value. There are tools like latex2html which may do a better job of making something machine translatable. There is also the TeX source, although you'll have to suffer a bit of markup and it may break translation. Or if you've found a better solution yourself, it may be good to note it here so others can find it and use the tips (or perhaps we can add it somewhere else).

You also didn't say which specifications are problematic. Is it all the TeX ones (ie PDF docs), or others?

jkbonfield avatar Aug 16 '21 07:08 jkbonfield

I'm going to reopen this for now; this definitely sounds like a topic which should at least be discussed.

tskir avatar Aug 17 '21 09:08 tskir

Having HTML as the primary output may be problematic, at least initially, due to some cosmetic issues. However, there is perhaps something to be said for having alternative formats available, even if we just list them as a more accessible version with the master version explicitly being PDF.

I tried latex2html -split +0 -info "" -no_navigation on SAMv1 and it produced something, but left quite a lot of markup in there that looked poor. htlatex needed some hand-holding and ignoring of errors, but what it produced was then much superior. There is potentially room for improvement to get it working better (albeit with missing bits due to ignoring the errors).
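One rough way to compare converter outputs is to count the LaTeX commands that leak through into the visible text of the generated HTML. This is a hedged Python sketch and not something that was run on the specs: the function name leftover_latex_commands is hypothetical, and the backslash-plus-letters pattern is only a heuristic that will also flag legitimate backslash sequences in the spec text.

```python
import re
from html.parser import HTMLParser

# Heuristic: a backslash followed by letters looks like an unconverted command.
LATEX_COMMAND = re.compile(r"\\[A-Za-z]+")


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def leftover_latex_commands(html: str) -> list:
    """Return LaTeX-looking commands found in the text of the HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return LATEX_COMMAND.findall("".join(parser.chunks))


if __name__ == "__main__":
    page = "<p>Field \\textbf{QNAME} and <b>FLAG</b></p>"
    print(leftover_latex_commands(page))
```

A lower count does not prove the output is correct; it only gives a quick signal for which converter needs less manual clean-up.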

VCF fared better with htlatex. An example:

PDF: image

HTML: image

jkbonfield avatar Aug 17 '21 10:08 jkbonfield

Looking at the above again, I see the Sigma combinatorial is incorrectly formatted by htlatex. There may be options to get such things improved, even if it's just getting it to insert formulae as images, but it's obviously not something we can rely on without having to proofread at the moment.

I think this is probably going to be more of a slow back-burner project than something we embrace quickly, unless anyone has spare time to work on it.

jkbonfield avatar Sep 16 '21 10:09 jkbonfield

I wonder if Markdown may be a better choice for this material: at this point, Markdown is relatively ubiquitous and can be easily translated into a variety of mediums (including latex) with pandoc. You also have the benefit of Github's work making markdown accessible within their web platform where your users are. If you like the formatting of the current specs, you can probably pretty easily get it working with your own custom template (here's a link where they show you mostly how to do so).

claymcleod avatar Oct 24 '22 21:10 claymcleod

The maintainers of these documents are familiar with Markdown.

jmarshall avatar Oct 24 '22 23:10 jmarshall

> The maintainers of these documents are familiar with Markdown.

Yeah, sorry, the point here was not to say "here's this new technology, markdown" (😄). I actually have recently switched to using Markdown for more and more of the technical design documents that I used to use LaTeX for, and it's been a pleasant experience thus far. It was not so long ago that I wouldn't have considered Markdown an appropriate medium for the SAM specification, but perhaps now the ecosystem could support the full spectrum of required features (cross-compilation into custom LaTeX documents, figure generation, linters, etc.).

claymcleod avatar Oct 25 '22 00:10 claymcleod

Some specs here are already in Markdown - see https://github.com/samtools/hts-specs/blob/master/htsget.md and https://github.com/samtools/hts-specs/blob/master/refget.md.

In my opinion they're not as well formatted as LaTeX (due to limitations of md), but that's not really the point of this topic, as it was about accessibility. Being able to target multiple output formats to provide easier to process versions for screen-readers and language translation engines is obviously helpful. That doesn't mean markdown should be the primary document though - it could just as easily be an output from e.g. docbook, asciidoc, or even just using pandoc for latex to md.

Mainly, however, it's an issue of finding time to evaluate the alternatives and to validate that the conversions don't introduce glitches (as demonstrated by htlatex above). (FWIW the CRAM spec started life as a Word doc. I extracted the XML from docx and used xslt to transform that to latex - scary! It mostly worked, but still needed quite a bit of editing to fix issues. Eww!)

jkbonfield avatar Oct 25 '22 08:10 jkbonfield

> Being able to target multiple output formats to provide easier to process versions for screen-readers and language translation engines is obviously helpful. That doesn't mean markdown should be the primary document though - it could just as easily be an output from e.g. docbook, asciidoc, or even just using pandoc for latex to md.

Agree. I like the idea of using a latex to XYZ converter, but I do not know of any that would work fully with the content of the spec.

Just to see what was possible, I spent about two hours tonight messing around with pandoc to see if I could get something reasonable out of it. As expected, one major limitation is the sections with complex formatting: mainly getting the tables to look correct. Beyond just converting it to HTML, I also tried using pandoc to turn the latex document into GitHub-flavored markdown specifically (pandoc -f latex -t gfm SAMv1.tex > SAMv1.md) and render it in GitHub; that didn't work great either.

Given this experience, I can think of two directions that would both (a) keep all the features needed and (b) improve the situation for accessibility:

  • Use pandoc -f latex -t html SAMv1.tex and have that automatically build and deploy alongside the PDF to GitHub Pages (possibly linking to it from the PDF too). It's not perfect, but it would make a large chunk of the content more accessible. The main problem with this approach is that the figures do not render correctly. It could be further improved by converting the figures to something not built within latex, though that in and of itself is a task.
  • An even bigger task, but maybe the best overall solution: use HTML directly. This would likely add quite a bit more overhead in terms of the backing code base needed to generate these sites, and I don't know how much of an appetite there is for that.
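
The first direction could be scripted roughly as follows. This is a sketch under assumptions rather than a tested build: the helper names and the "site" output directory are hypothetical, --standalone and -o are additions to the command I actually ran, and the deployment step is left out entirely.

```python
import subprocess
from pathlib import Path


def pandoc_html_command(tex_file, out_dir):
    """Build the pandoc invocation that converts one LaTeX file to HTML."""
    tex_path = Path(tex_file)
    html_path = Path(out_dir) / (tex_path.stem + ".html")
    return [
        "pandoc", "-f", "latex", "-t", "html",
        "--standalone",          # assumption: emit a complete HTML page
        str(tex_path),
        "-o", str(html_path),
    ]


def build_html(tex_files, out_dir="site"):
    """Convert each spec, stopping on the first pandoc failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for tex_file in tex_files:
        # check=True raises CalledProcessError if pandoc exits non-zero.
        subprocess.run(pandoc_html_command(tex_file, out_dir), check=True)


if __name__ == "__main__":
    print(" ".join(pandoc_html_command("SAMv1.tex", "site")))
```

A CI job could call build_html on the spec sources and publish the output directory next to the PDFs; the figure rendering problem mentioned above would still need to be handled separately.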

claymcleod avatar Oct 26 '22 03:10 claymcleod

I too had a play with pandoc for GitHub markdown and it was tragic, even with the latest release. The TikZ bit was particularly special! The best I had previously for HTML was htlatex, but that wasn't perfect on formulae either. There may be options to improve it though, such as generating images for the formulae rather than attempting MathML conversion.

Fundamentally it's just a matter of free time. No one who works on these specs does it as a full time job, and we're down on maintainers already without taking on projects that only progress the presentation rather than content. If we could find a conversion tool that was pretty much flawless then maybe it'd be something we could automate.

jkbonfield avatar Oct 26 '22 08:10 jkbonfield

I also discovered https://math.nist.gov/~BMiller/LaTeXML/ which sounds ideal, but it doesn't work straight out of the box on our files. I haven't had time to dig around and figure out what shenanigans we're doing that break it. Anyway, given its pedigree and online demos, it may well work with appropriate command line options and management.

jkbonfield avatar Oct 26 '22 08:10 jkbonfield