An option to remove chapter title
Is your feature request related to a problem? Please describe. When downloading from royalroad, many novels repeat the chapter title in the chapter body, so each downloaded chapter contains the title twice: once as the chapter heading and once in the body text.
Describe the solution you'd like An option to remove the chapter title, similar to the existing option for removing author notes.
Describe alternatives you've considered I tried adding a manual parser, but I couldn't make it work properly, so I am requesting this option.
Additional context This option (Remove chapter title) might work for all hosts, as it would nullify only the chapter title.
Can you share an example of a chapter this actually happens in?
Here you go. I have also linked the novel used below.
Also, novels downloaded from royalroad and webnovel now contain "p class =" and "div data ejs" markup, respectively, filled with random text. Some time ago the downloads contained only the novel text. If possible, please correct them to show the novel text only.
For future reference, data-ejs attributes were removed from webnovel in PR #1363. These changes aren't currently in the live build, and some junk data does still persist. They should, however, be included in the build linked here: https://github.com/dteviot/WebToEpub/issues/1368#issuecomment-2212294773
I'll check to see if something similar can be done for RR, but scrubbing classes isn't as cut & dry as removing entire attributes. Either way, I'll give both of these a shot; I have a few ideas for both of these issues...
@Kiradien @Xeolod I'm going to suggest that the "double title removal" might be better done as a post-processing step using EpubEditor. The logic might be something like:
- Find the H1 header, then the text in it.
- Search for any other text nodes with the same text.
- If any found, delete their enclosing element.
As dteviot said above, that is probably the best way. I played around with a config to do the same, and it could be a bit funky, especially due to author notes. I've pushed a PR for the cleanup code, however.
Tested it on webnovel; almost all the junk data is removed. One div data-ejs attribute still exists, but I removed it using regex.
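For anyone else hitting the leftover attribute, here is a minimal sketch of that kind of regex cleanup. The exact regex used above was not posted, so the pattern and the helper name `stripDataEjs` are assumptions for illustration only. It also assumes the attribute value is quoted.

```javascript
// Hypothetical helper (not part of WebToEpub): strips data-ejs attributes
// from an HTML string. Assumes the attribute value is wrapped in single or
// double quotes and does not contain the same quote character.
function stripDataEjs(html) {
    return html.replace(/\s+data-ejs\s*=\s*("[^"]*"|'[^']*')/g, "");
}

// Made-up sample input, not taken from a real webnovel chapter.
const sample = '<div data-ejs="x9f3k">Chapter text</div>';
console.log(stripDataEjs(sample));
```

Note that a plain regex cannot tell markup from body text, so a chapter that literally contains the string data-ejs="..." in its prose would be altered too. Removing the attribute through the DOM is safer when that is available.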
Test versions for Firefox and Chrome with Kiradien's Royal Road cleanup have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing.
@xeolod
Try this script to remove duplicated title text.
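// Take the text node inside the first <h1>; this is the chapter title.
let titleNode = dom.querySelector("h1")?.firstChild;
let titleText = titleNode?.data;
// Accept only text nodes, other than the title itself, whose text matches the title.
let filter = (node) => {
    return (node !== titleNode) && (node.data == titleText)
        ? NodeFilter.FILTER_ACCEPT
        : NodeFilter.FILTER_SKIP;
};
let walker = dom.createTreeWalker(
    dom.body,
    NodeFilter.SHOW_TEXT,
    filter
);
// firstChild() moves to the first accepted text node under dom.body, if any.
let node = walker.firstChild()?.parentNode;
if (node != null) {
    // Log and remove the element enclosing the duplicate, and report a change.
    console.log(node.outerHTML);
    node.remove();
    return true;
}
return false;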
Tested with:
- https://www.royalroad.com/fiction/59948/desolate-fate, chapters 1, 2, 3
For my notes: 24 minutes work
Thanks, it's working.
@xeolod Updated version (0.0.0.167) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in a few hours to 21 days.