epub-press
RTL languages: preserve directionality
Current Behavior
Articles in RTL languages (Hebrew, Arabic, Farsi) appear left-aligned in the ePub.
Expected Behavior
Get the directionality from the DOM of each article and align it correctly. Support a different directionality per article, or maybe even per paragraph/div.
Steps to reproduce
Pick an article in an RTL language and make an epub out of it (I assume the same result happens with mobi, but I didn't test).
Example URL: http://www.bbc.com/persian/world-45713700
System Information
Using the Firefox add-on at my end.
Hey there, What an excellent add-on! I found it yesterday on the Mozilla site.
I see that content I create from an RTL article comes out aligned to the left rather than the right, which I guess means that the directionality of the text is not extracted from the DOM and preserved when creating the epub. I haven't dived into the code yet, but I bet it should be an easy one to fix if the code already looks for other local and global settings, such as indentation and the like.
In case I can't find the place and submit a PR (JS is not my forte, I'm an infrastructure guy), I'm also leaving this item here so that anyone else can pick it up.
Thanks!
Hey @seefood ! Thanks so much for this well documented issue.
I'll sniff around to see if there's any easy way to support this. I agree that supporting RTL languages would be super valuable 👍 .
I see no new pushes, should I look into it too? I'd appreciate a general start point to look if you could spare a moment.
Hey @seefood - indeed haven't had a chance. I don't get much time to work on this anymore :(.
Looking around a bit, it looks like that link gets RTL via the dir="rtl" on one of the divs. So I would expect that to get pulled into EpubPress and persisted. But perhaps epub doesn't support the dir attribute?
In that case, maybe you would need to query for divs with dir in EpubPress and then apply some specific inline style that works for Epub?
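A minimal sketch of that idea, assuming a simplified element tree rather than EpubPress's actual DOM handling (the `ElementNode` type and the `applyDirStyles` function are hypothetical names used only for illustration). It walks the tree and, for each element that carries a `dir` attribute, appends a matching inline `direction` declaration as a fallback. Whether a given reader honors the attribute, the inline style, or both is not verified here.

```typescript
// Hypothetical, simplified element tree (not EpubPress's real data model).
interface ElementNode {
  tag: string;
  attrs: { [name: string]: string };
  children: ElementNode[];
}

// For every element with dir="rtl" or dir="ltr", append a matching inline
// `direction` declaration. The dir attribute itself is left in place.
function applyDirStyles(node: ElementNode): void {
  const dir = (node.attrs["dir"] || "").toLowerCase();
  if (dir === "rtl" || dir === "ltr") {
    const style = node.attrs["style"];
    const existing = style ? style.trim() : "";
    // Make sure any existing declarations end with a semicolon before appending.
    const needsSemicolon =
      existing !== "" && existing.charAt(existing.length - 1) !== ";";
    const prefix = existing === "" ? "" : existing + (needsSemicolon ? ";" : "") + " ";
    node.attrs["style"] = prefix + "direction: " + dir + ";";
  }
  node.children.forEach((child) => applyDirStyles(child));
}
```

Elements without a `dir` attribute are left untouched, so this only covers the elements that declare a direction themselves.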
Can't say... I'm basically just googling at this point. Never worked with RTL before 😬 .
Do you need an RTL epub file to study the expected outcome? I have here quite a few that are public domain if it helps.
I inspected the book created from the above link - looks like the div where all the direction is applied is just getting stripped out (because it's generically applied to the whole page instead of the paragraphs).
It probably wouldn't be too hard to search for a div with a dir property and propagate that to its children...
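The propagation step could be sketched roughly like this. It is an illustration under assumptions, not the code from EpubPress: `DirNode` and `propagateDir` are made-up names, and the tree is a plain object structure instead of a real DOM. Each descendant without its own `dir` receives the nearest ancestor's value, so the direction survives if the wrapper div is stripped later.

```typescript
// Hypothetical minimal node shape, for illustration only.
interface DirNode {
  tag: string;
  dir?: "rtl" | "ltr";
  children: DirNode[];
}

// Copy the nearest ancestor's dir onto every descendant that has none of its
// own. An explicit dir on a descendant wins and is passed further down.
function propagateDir(node: DirNode, inherited?: "rtl" | "ltr"): void {
  const effective = node.dir || inherited;
  if (effective) {
    node.dir = effective;
  }
  node.children.forEach((child) => propagateDir(child, effective));
}
```

Calling `propagateDir(root)` on the extracted article root before any wrapper is removed would leave every paragraph carrying its own direction.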
Well, I bet there's a standard HTML parser library that returns a DOM object of the rendered page, so that every paragraph and div can be probed for its inherited properties, such as font, directionality, or alignment.
I tried exporting a Google Doc that is marked RTL, but the output was still LTR, so I will open a ticket with them as well. Don't feel bad, you're not alone :-)
I'll try to make a good RTL epub and post my findings here.
OK, I took a poem in Hebrew with diacritics and tried to export an epub from LibreOffice, and it screwed up too. The text was right-aligned but the directionality was still LTR, so for instance full stops and commas show up at the beginning of lines rather than the end.
The only conversion that worked nicely was saving the page from the browser (which also dumps all the CSS and JS in a subdirectory next to the HTML), importing it into Calibre, and converting to EPUB.
source: https://benyehuda.org/bialik/bia003.html
Result: RTL-Demo-Epub.zip
Here you can see that directionality is specified both inline in the HTML and in the CSS. It's a bit messy, as it seems this was exported from Word to HTML with lots of Microsoft markup. Not ideal, but it should be easy to study.
@seefood I've opened a PR which I think should address the issue...
https://github.com/haroldtreen/epub-press/pull/11
Should be solved! 🎉
Shout out to @seefood who was super helpful providing sample ebooks, websites and other information that made this a lot easier to solve.
I haven't had a lot of time to work on EpubPress due to full time job stuff - so when people can make my life easy with tips and help it makes a big difference.
Thanks! ❤️
OK, I tested a few articles selected at random from my day-to-day reading tabs. In general it worked about 70% and is just about satisfactory for general use, but here are some specific reports:
- All following articles: Header is LTR and also left-aligned rather than center or right.
- subtitle is gone. If it's an option I could switch on and off, I'd like to keep it (though nothing RTL-specific about it)
- https://www.the7eye.org.il/307320 Title is replaced by the first line of the subtitle, the rest of the subtitle is removed. H4 inserts and H3 subtitles are again aligned left.
- https://www.ha-makom.co.il/post/haim-green-factory I get a left-aligned title and the body was dropped; instead, the body of a "read also" box at the bottom was picked. Same for other articles on that site.
- content of https://www.israelhayom.co.il/article/601929 was not found, got an error message instead "We looked for content in https://www.israelhayom.co.il/article/601929 but couldn't find anything :(."
- WordPress post, e.g. https://ira.abramov.org/blog/2018/05/06/bettys-chocolate-cake/ - Title is still aligned left, and any LI items (numbered or not) are LTR and left-aligned.
- Another WordPress post, https://ira.abramov.org/blog/2016/10/28/zcash-is-the-most-interesting-story-this-wek/, revealed that blockquote tags are also skipped, and there's no good reason, as they are part of the DIV that holds the post body text.
Thanks :)
Thanks for testing @seefood !
Makes sense for the titles, block quotes and list items. Those are added separate from the articles. I can peek into what it would take to fix those.
As for sites where some content is omitted / selected incorrectly - that's more in the domain of the article selection algorithm. Wouldn't have been affected by the RTL change. Being able to guess the article for a page is pretty tricky and reliant on site creators structuring their pages a certain way. So generally I haven't prioritized those as high.
> Those are added separate from the articles
Actually they are child objects of article bodies and paragraphs, so they should inherit their directionality from their daddy.
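To illustrate the inheritance being described, here is a small sketch of resolving the direction of a list item or blockquote from its ancestors. The `TreeNode` shape and `effectiveDir` are hypothetical names, and this mimics only the basic inheritance of the HTML `dir` attribute (it ignores `dir="auto"`).

```typescript
// Hypothetical node with a link to its parent, for illustration only.
interface TreeNode {
  tag: string;
  dir?: "rtl" | "ltr";
  parent?: TreeNode;
}

// Walk up the parent chain until an explicit dir is found.
// Falls back to "ltr", the HTML default when no element sets a direction.
function effectiveDir(node: TreeNode): "rtl" | "ltr" {
  let current: TreeNode | undefined = node;
  while (current) {
    if (current.dir) {
      return current.dir;
    }
    current = current.parent;
  }
  return "ltr";
}
```

If titles, list items, and block quotes are assembled outside the article body, a lookup like this could be run before they are detached so the resolved value can be written onto them explicitly.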
As for omitted content: have you considered giving users an option to set hints? For example, on sites matching *.FarsiNews.site, select articles by the DIV named "blah" instead of "post" or "article".
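Such hints might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `SiteHint` shape, the `selectorForHost` function, and the choice of a CSS selector as the hint value are not part of EpubPress.

```typescript
// Hypothetical per-site hint: a hostname pattern mapped to a CSS selector.
interface SiteHint {
  hostPattern: string; // exact hostname, or "*.example.com" for any subdomain
  selector: string;
}

// Case-insensitive match of a hostname against an exact name or "*." pattern.
function hostMatches(hostname: string, pattern: string): boolean {
  const host = hostname.toLowerCase();
  const pat = pattern.toLowerCase();
  if (pat.slice(0, 2) === "*.") {
    const suffix = pat.slice(1); // e.g. ".example.com"
    return host.length > suffix.length && host.slice(-suffix.length) === suffix;
  }
  return host === pat;
}

// Return the selector of the first matching hint, or the fallback if none match.
function selectorForHost(hostname: string, hints: SiteHint[], fallback: string): string {
  for (const hint of hints) {
    if (hostMatches(hostname, hint.hostPattern)) {
      return hint.selector;
    }
  }
  return fallback;
}

// Example hint taken from the suggestion above; "div.blah" is a placeholder.
const exampleHints: SiteHint[] = [
  { hostPattern: "*.FarsiNews.site", selector: "div.blah" },
];
```

With these hints, `selectorForHost("www.farsinews.site", exampleHints, "article")` would pick the site-specific selector, while any other host falls back to the default.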