pubcrawl
encoding. of course it is encoding...
It would be very nice if the text parsing defaulted to UTF-8, because some of the output I get doesn't look right. Two examples from the 1001 Nights:
Generous Dealing of Yahya Son of KhĂ\u0081Lid with A Man Who Forged A Letter in His Name.
should be
Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.
The Kingâ\u0080\u0099s Daughter and the Ape
should be
The King’s Daughter and the Ape.
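For anyone wondering where the `â\u0080\u0099` comes from, here is a minimal base-R sketch. It assumes the garbling is UTF-8 bytes being read as Latin1 somewhere along the way, which is consistent with the output above but not confirmed for pubcrawl itself. The sample string is made up for illustration, and how `iconv()` treats the name "latin1" can vary by platform.

```r
# Sample string (made up for illustration): U+2019 is the right single quote
good <- "King\u2019s"

# The UTF-8 encoding of U+2019 is the three bytes e2 80 99
utf8_bytes <- as.integer(charToRaw(good))

# Treat those bytes as Latin1 and re-encode them as UTF-8:
# every single byte now becomes a character of its own
garbled <- iconv(good, from = "latin1", to = "UTF-8")

# The code points of the garbled string match the original bytes one for one
garbled_points <- utf8ToInt(garbled)

print(sprintf("%02x", utf8_bytes))
print(sprintf("U+%04X", garbled_points))
print(c(nchar(good), nchar(garbled)))
```

The three extra code points (U+00E2, U+0080, U+0099) are what R prints as `â\u0080\u0099`.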
I extracted a few parts, and the HTML files inside are encoded correctly; that is, each of them declares its charset:
<meta charset="utf-8" />
So I guess it could read that tag, or default to UTF-8. In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:
if (!inherits(doc, "html_document")) doc <- xml2::read_html(doc)
read_html might need the encoding argument (defaults to "")
If I read the HTML file in directly with rvest::html_text(xml2::read_html("file.html")), it already defaults to UTF-8. So perhaps there is an implicit recoding when xslt::xml_xslt is applied to the data?
Nope, that's not it (xml2::read_html(doc) would also always default to UTF-8).
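A minimal sketch of passing the encoding explicitly, assuming xml2 is installed. The temporary file and its contents are made up for illustration and are not taken from the EPUB; this only shows the `encoding` argument of `read_html()`, not how pubcrawl calls it.

```r
library(xml2)

# Hypothetical sample page with a charset declaration and one non-ASCII character
html <- paste0(
  "<html><head><meta charset=\"utf-8\" /></head>",
  "<body><p>The King\u2019s Daughter and the Ape</p></body></html>"
)

# Write the UTF-8 bytes to disk unchanged
path <- tempfile(fileext = ".html")
writeLines(html, path, useBytes = TRUE)

# Name the encoding explicitly instead of relying on the default ("")
doc <- read_html(path, encoding = "UTF-8")
txt <- xml_text(xml_find_first(doc, "//p"))
print(txt)
```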
So, the default was already UTF-8, but I added a pass-through encoding parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. For example:
x <- epub_to_text("~/Downloads/b97b.epub", "Latin1")
z <- x$content[1] # just to make it easier to debug in my session
substr(z, 1, 1000) # I added the hard line breaks
[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated and annotated by Richard F. Burton; illustrated by Albert Letchford\n Contents\n Top\n\tEditorâ\u0080\u0099s Note to this Web
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translatorâ\u0080\u0099s Foreword.\n\tThe Book of The Thousand Nights and a
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykhâ\u0080\u0099s Story.\n\tThe Second Shaykhâ\u0080\u0099s Story.\n\tThe
Third Shaykhâ\u0080\u0099s Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and
his Falcon.\n\tThe Tale of the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled
Prince.\n\tThe Porter and the Three Ladies of Baghdad.\n\tThe First Kalandarâ\u0080\u0099s Tale.\n\tThe Second Kalandarâ\u0080\u0099s
Tale.\n\tThe Tale of the Envier and the Envied.\n\tThe Third Kalandarâ\u0080\u0099s Tale.\n\tThe Eldest Ladyâ\u0080\u0099s
Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of Nur Al-Din and his S"
In theory, it should have dealt with ^^ properly, since I (honest!) passed the encoding all the way through, and I even do a final iconv() to that encoding on the column.
But, if you do (this text is Latin1 btw):
substr(iconv(z, "", to="Latin1"), 1, 1000)
[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated
and annotated by Richard F. Burton; illustrated by Albert Letchford\n Contents\n Top\n\tEditor’s Note to this Web
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translator’s Foreword.\n\tThe Book of The Thousand Nights and a
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykh’s Story.\n\tThe Second Shaykh’s Story.\n\tThe Third Shaykh’s
Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and his Falcon.\n\tThe Tale of
the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled Prince.\n\tThe Porter and the
Three Ladies of Baghdad.\n\tThe First Kalandar’s Tale.\n\tThe Second Kalandar’s Tale.\n\tThe Tale of the Envier and the
Envied.\n\tThe Third Kalandar’s Tale.\n\tThe Eldest Lady’s Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of
Nur Al-Din and his Son.\n\tThe Hunchback"
it works.
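If you want to apply that workaround to a whole column, something along these lines might do. It is only a sketch: `fix_mojibake()` is a hypothetical helper, not part of pubcrawl, and the sample data frame is made up. It sets `mark = FALSE` and labels the result as UTF-8 itself, so the outcome does not depend on how `iconv()` marks its output, and it leaves alone any string that does not round-trip to valid UTF-8.

```r
# Hypothetical helper (not part of pubcrawl): undo "UTF-8 read as Latin1" garbling
fix_mojibake <- function(x) {
  # Map each character back to a single Latin1 byte; NA where that is impossible
  bytes <- iconv(x, from = "UTF-8", to = "latin1", mark = FALSE)

  # Only accept results that exist and form valid UTF-8 byte sequences
  ok <- !is.na(bytes)
  ok[ok] <- validUTF8(bytes[ok])

  # Label the recovered bytes as UTF-8 and put them back; everything else is untouched
  candidate <- bytes[ok]
  Encoding(candidate) <- "UTF-8"
  x[ok] <- candidate
  x
}

# Made-up sample data: one garbled string, one correct string, one plain ASCII string
good <- "The King\u2019s Daughter and the Ape"
bad <- iconv(good, from = "latin1", to = "UTF-8")
df <- data.frame(content = c(bad, good, "plain ascii"), stringsAsFactors = FALSE)

df$content <- fix_mojibake(df$content)
print(df$content)
```

Strings that are already correct either fail the conversion to Latin1 or fail the UTF-8 validity check, so they pass through unchanged.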
I'll keep this open since I'd like to provide robust support in the long run, but at least the iconv() should work ex post facto for the edge cases.
(just saw your extended comments)
Aye, I even pass the encoding along to it and make sure it's a raw vector when processing, and it's still a no-go.
Something (IMO) "weird" is happening, either as a result of read_html() or in tibble-land, but iconv() will work ex post facto.