WebToEpub An option to remove chapter title

Is your feature request related to a problem? Please describe. When downloading from royalroad, many novels repeat the chapter title in the body text, so each downloaded chapter ends up with two titles: one from the chapter title and one from the body text.

Describe the solution you'd like An option to remove the chapter title, just like the remove author notes option.

Describe alternatives you've considered I tried adding a manual parser, but I couldn't get it to work properly, so I am requesting this option.

Additional context This option (Remove chapter title) might work for all hosts, as it would nullify the chapter title only.

Jul 09 '24 08:07 xeolod

Can you share an example of a chapter this actually happens in?

Jul 09 '24 19:07 Kiradien

Here you go. I had also linked the novel used below.

Also, novels downloaded from royalroad and webnovel now contain p class= and div data-ejs attributes, respectively, with random text. Some time ago the output contained only the novel text. If possible, please fix this so that only the novel text is shown.

royalroad

webnovel

Jul 10 '24 15:07 xeolod

For future reference, data-ejs attributes were removed from webnovel in PR #1363. These changes aren't in the live build yet, and some junk data still persists. They should, however, be included in the build linked here: https://github.com/dteviot/WebToEpub/issues/1368#issuecomment-2212294773

I'll check to see if something similar can be done for RR, but scrubbing classes isn't as cut and dried as removing entire attributes. Either way, I'll give both of these a shot; I have a few ideas for both of these issues...

Jul 10 '24 17:07 Kiradien

@Kiradien @Xeolod I'm going to suggest that the "double title removal" might be better done as a post-processing step using EpubEditor. Logic might be something like:

Find the H1 header, then the text in it.
Search for any other text nodes with the same text.
If any found, delete their enclosing element.

Jul 10 '24 19:07 dteviot

As dteviot said above, that is probably the best way. I played around with a config to do the same, and it could be a bit funky, especially because of author notes. I have, however, pushed a PR for the cleanup code.

Jul 10 '24 20:07 Kiradien

Tested it on webnovel; almost all the junk data is removed. One div data-ejs attribute still remains, but I removed it using a regex.
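
For anyone wanting to do the same, here is a minimal sketch of that kind of regex cleanup. It is not the regex actually used above: the function name stripDataEjs and the sample markup are hypothetical, and it assumes the attribute value is quoted.

```javascript
// Hypothetical helper (not part of WebToEpub): strips data-ejs attributes
// from a chapter's XHTML string. Assumes the attribute value is quoted.
function stripDataEjs(xhtml) {
    // Leading whitespace, the attribute name, then a double- or single-quoted value.
    const dataEjsAttr = /\s+data-ejs\s*=\s*(?:"[^"]*"|'[^']*')/g;
    return xhtml.replace(dataEjsAttr, "");
}

// Illustrative sample markup only; real chapter markup will differ.
const sample = '<div data-ejs="abc123" class="content"><p>Novel text.</p></div>';
console.log(stripDataEjs(sample));
```

A regex like this works on the raw text, so it would also strip a matching string that happened to appear inside the chapter's prose; removing the attribute through the DOM avoids that.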

Jul 11 '24 05:07 xeolod

Test versions for Firefox and Chrome with Kiradien's Royal Road cleanup have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing.

Jul 13 '24 02:07 dteviot

@xeolod

Try this script to remove duplicated title text.

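// Assumes the script runs as a function body, with `dom` (the chapter
// document) supplied by the environment that runs it.
// Take the first text node of the H1 as the chapter title.
let titleNode = dom.querySelector("h1")?.firstChild;
let titleText = titleNode?.data;

// Accept only text nodes, other than the title itself, whose text equals the title.
let filter = (node) => {
    return (node !== titleNode) && (node.data === titleText)
        ? NodeFilter.FILTER_ACCEPT
        : NodeFilter.FILTER_SKIP;
};

let walker = dom.createTreeWalker(
    dom.body,
    NodeFilter.SHOW_TEXT,
    filter
);

// Remove the element enclosing the first duplicate found, if any.
let node = walker.firstChild()?.parentNode;
if (node != null) {
    console.log(node.outerHTML);
    node.remove();
    return true;
}
return false;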

Tested with:

https://www.royalroad.com/fiction/59948/desolate-fate, chapters 1, 2, 3

For my notes: 24 minutes work

Jul 19 '24 08:07 dteviot

Thanks, it's working.

Aug 05 '24 05:08 xeolod

@xeolod Updated version (0.0.0.167) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in a few hours to 21 days.

Aug 23 '24 08:08 dteviot