WebToEpub cannot capture the Story Seedling Webpage

Open readingthings8-cell opened this issue 4 months ago • 1 comments

Hi, I recently discovered that WebToEpub cannot capture pages from the Story Seedling website; I think the site has some sort of protection against it. Copying and pasting does not work either, and WebToEpub cannot capture the contents of the page: it just shows up blank.

Any suggestions on how to make the contents visible? I've also tried using this Story Seedling parser in epubeditor with the "run above script to modify epub" function, but no new EPUB was generated, and I'm officially out of ideas from here:

"use strict";

//dead url/ parser
parserFactory.register("storyseedling.com", () => new StorySeedlingParser());

class StorySeedlingParser extends Parser {

constructor() {
    super();
}

async getChapterUrls(dom) {
    return [...dom.querySelectorAll("main .grid.w-full a")]
        .map(link => this.linkToChapter(link));
}

linkToChapter(link) {
    let title = link.querySelector(".truncate").textContent;
    return ({
        sourceUrl:  link.href,
        title: title,
    });
}

findContent(dom) {
    return (
        dom.querySelector("div.prose .mb-4") || dom.querySelector("#chapter-content")
    );
}

populateUIImpl() {
    document.getElementById("removeAuthorNotesRow").hidden = false; 
}

preprocessRawDom(webPageDom) {
    let notes = webPageDom.querySelector("div.prose .mb-4:nth-of-type(2)");
    if ((notes != null) && !this.userPreferences.removeAuthorNotes.value) {
        this.tagAuthorNotes([notes]);
        this.findContent(webPageDom).appendChild(notes);
    }
}

extractTitleImpl(dom) {
    return dom.querySelector("h1");
}

extractAuthor(dom) {
    let authorLabel = dom.querySelector("div.leading-7 a");
    return authorLabel?.textContent ?? super.extractAuthor(dom);
}

findChapterTitle(dom) {
    return dom.querySelector(".truncate");
}

findCoverImageUrl(dom) {
    return util.getFirstImgSrc(dom, "div[x-data='']");
}

getInformationEpubItemChildNodes(dom) {
    return [...dom.querySelectorAll("div.order-2.mb-4")];
}

}

readingthings8-cell avatar Dec 22 '25 13:12 readingthings8-cell

I can confirm with a different dataset that something strange is happening. The path mapping looks correct (when explored with an online tool) in the h5 file's Measurements/DATETIME/Experiment/Path_Mappings folder, and the Measurements/DATETIME/Image/PathName_CHANNEL and URLName_CHANNEL tables all show mapped paths, not unmapped paths. But when I try to execute the batch file on my local machine, it says it cannot find the unmapped directory. We tried this in 4 permutations (\\ vs \ separators, and trailing slash present vs absent), with the same issue every time.

bethac07 avatar Mar 31 '26 16:03 bethac07

Stack trace is

pipeline_exception
Error detected during run of module NamesAndTypes
Traceback (most recent call last):
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/pipeline/_pipeline.py", line 1016, in run_with_yield
    self.run_module(module, workspace)
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/pipeline/_pipeline.py", line 1349, in run_module
    module.run(workspace)
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/modules/namesandtypes.py", line 1940, in run
    self.add_image_provider(
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/modules/namesandtypes.py", line 2003, in add_image_provider
    self.add_simple_image(
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/modules/namesandtypes.py", line 2062, in add_simple_image
    self.add_provider_measurements(provider, m, "Image")
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/modules/namesandtypes.py", line 2076, in add_provider_measurements
    img = provider.provide_image(m)
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/url/_monochrome_image.py", line 33, in provide_image
    image = URLImage.provide_image(self, image_set)
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/_file_image.py", line 328, in provide_image
    self.__set_image()
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/_file_image.py", line 259, in __set_image
    self.__set_image_volume()
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/_file_image.py", line 332, in __set_image_volume
    pathname = url2pathname(self.get_url())
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/url/_url_image.py", line 33, in get_url
    if self.cache_file():
  File "/Users/bcimini/Documents/GitHub/CellProfiler/core/cellprofiler_core/image/abstract_image/file/_file_image.py", line 159, in cache_file
    raise IOError(
OSError: Test for access to directory failed. Directory: ///C:/REST_OF_UNMAPPED_PATH

bethac07 avatar Mar 31 '26 16:03 bethac07

Nowhere near solved yet, but a bit of tracing (all the below in 4.2.x):

The name, it turns out, is stored in the ImagePlaneDetails, which are a Java object. Horrifying that that's what we're putting in the batch file IMO.

For a file on Mac, the URL starts with file:; for a file on Windows, it starts with file:///C: etc. NamesAndTypes pulls this and sends it to alter_url_post_create_batch in the measurements class, which, among other things, hardcodes the number of characters before the path starts to 5 and tries a dumb find-and-replace. That replace fails for several reasons: it uses startswith, and the hardcoded prefix length is wrong; it also adjusts the path separators of the filename but not of the path mapping, and then checks "if path mapping there", which never matches because only one of the two has been adjusted. So I think alter_url_post_create_batch is what we need to change if we're hotfixing CP4. I'm wondering, though, whether this means it has been broken all the way back to at least CP4.0.0, if not before? I swear I used to make batch files on my Windows machine when I started in the group, but maybe that's a Mandela effect...
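To make the failure mode concrete, here is a rough sketch of what a more defensive remap could look like. This is not the actual alter_url_post_create_batch code; the function names (`normalize`, `file_url_to_path`, `remap_file_url`) and the example paths are all made up for illustration. The idea is to parse the URL rather than count prefix characters, and to normalize separators and trailing slashes on both the path and the mapping before comparing.

```python
from typing import Iterable, Tuple
from urllib.parse import unquote, urlparse


def normalize(path: str) -> str:
    """Use forward slashes and drop trailing separators so prefixes compare cleanly."""
    return path.replace("\\", "/").rstrip("/")


def file_url_to_path(url: str) -> str:
    """Pull the path out of a file URL without assuming a fixed prefix length."""
    path = unquote(urlparse(url).path)
    # "file:///C:/dir/x.tif" parses to "/C:/dir/x.tif"; drop the slash before the drive.
    if len(path) >= 3 and path[0] == "/" and path[2] == ":":
        path = path[1:]
    return path


def remap_file_url(url: str, mappings: Iterable[Tuple[str, str]]) -> str:
    """Hypothetical helper: return the remapped path, or the normalized original."""
    path = normalize(file_url_to_path(url))
    for local_prefix, remote_prefix in mappings:
        # Normalize the mapping the same way as the path, so both sides agree.
        local = normalize(local_prefix)
        # Case-insensitive compare (Windows paths), matching only on a separator boundary.
        if path.lower() == local.lower() or path.lower().startswith(local.lower() + "/"):
            return normalize(remote_prefix) + path[len(local):]
    return path
```

With this shape, the four permutations (\\ vs \ separators, trailing slash or not) should all collapse to the same comparison, since both sides go through `normalize` before the prefix check.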

For CP5, since the H5 file already has lovely path-adjusted names in it, I think we should consider whether the input modules should just read those, rather than trying to pass along the Java objects, and treat them more like a load_data, but I'm not sure if that's birds or national parks.

bethac07 avatar Apr 02 '26 00:04 bethac07