Please add https://wizards.com sites for MTG stories
I'm starting an archival project to convert Magic: the Gathering web fiction to EPUB (here and on my HDD), because it is slowly disappearing from the website through slapdash updates. WebToEpub is the best tool for the job, and I have been using it successfully with the default parser and the following settings:
New MTGStory:
URL structure: https://magic.wizards.com/en/news/magic-story/hero-iroas-2014-03-05
Include: #article-body
Title: #article-body > div > article > header > h1
Exclude: #article-body > div > aside, #article-body > div > article > div.css-AerwF
Old MTGStory:
URL structure: https://magic.wizards.com/en/articles/archive/magic-story/zendikars-last-stand-2016-02-17
Include: #main-content > article
#content-detail-page-of-an-article (to exclude author)
Title: #main-content > h1
Exclude: #content > aside
Really old Magic Uncharted Realms:
URL structure: http://www.wizards.com/Magic/Magazine/Article.aspx?x=mtg/daily/ur/263
Unselect all except for "skip to content", as that is your article
Include: #content > div.center-content
Title: #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_headerPanel > div.description > h4
Exclude: #topNav, #leftColumn, #footerWrap, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_footerPanel, #ctl00_ctl00_ContentPlaceHolder1_MagicTopNavigation_topNavigation, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_socialbar, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_HeadingLinks
mtglore.com:
Include: #content > div
Title: #content > div > div
The story is spread across 4 different URL/article website structures, half of which are only on the Internet Archive. Different chapters can exist under different structures, and the TOCs (if they exist) are not comprehensive.
My workflow currently involves collecting the Archive.org links for as many chapters as possible that share one website structure, because mixing and matching Includes and Excludes doesn't seem to work very well (or maybe I should just use more commas?). I test, then edit the chapter list and paste in the links to what I actually want to download, e.g.
<a href="https://web.archive.org/web/20230208082604/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-1-2013-08-21">01 Planeswalker's Guide to Theros, Part 1: The Plane of Theros</a>
<a href="https://web.archive.org/web/20230330201809/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-2-2013-08-28">02 Planeswalker's Guide to Theros, Part 2: The Poleis</a>
<!--<a href="https://web.archive.org/web/20140302084755/http://www.wizards.com/Magic/Magazine/Article.aspx?x=mtg/daily/ur/263">03 The Lost Confession</a>-->
<a href="https://web.archive.org/web/20230620224550/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-3-2013-09-04">04 Planeswalker's Guide to Theros, Part 3: Nonhuman Creatures</a>
<a href="https://web.archive.org/web/20230128080100/https://magic.wizards.com/en/news/feature/prince-anax-part-1-2013-09-18">05 Prince Anax, Part 1</a>
<a href="https://magic.wizards.com/en/news/feature/prince-anax-part-2-2013-09-23">06 Prince Anax, Part 2</a>
<a href="https://web.archive.org/web/20231208032539/https://magic.wizards.com/en/news/feature/nymphs-theros-2013-10-02">07 Nymphs of Theros</a>
<a href="https://web.archive.org/web/20230923144610/https://magic.wizards.com/en/news/feature/consequences-attraction-2013-10-09">08 The Consequences of Attraction</a>
<a href="https://web.archive.org/web/20230929173716/https://magic.wizards.com/en/news/feature/tragedy-2013-10-23">09 Tragedy</a>
<a href="https://magic.wizards.com/en/news/making-magic/unanswered-questions-theros-2013-11-04">10 Unanswered Questions: Theros</a>
<a href="https://magic.wizards.com/en/news/feature/i-iroan-2013-11-04">11 I Iroan</a>
<a href="https://web.archive.org/web/20230203095421/https://magic.wizards.com/en/news/feature/sea-gods-labyrinth-part-1-2013-11-13">12 The Sea God's Labyrinth, Part 1</a>
<a href="https://web.archive.org/web/20230208082828/https://magic.wizards.com/en/news/feature/sea-gods-labyrinth-part-2-2013-11-20">13 The Sea God's Labyrinth, Part 2</a>
<a href="https://magic.wizards.com/en/news/feature/building-toward-dream-part-1-2013-11-27">14 Building Toward a Dream, Part 1</a>
<a href="https://web.archive.org/web/20230127023202/https://magic.wizards.com/en/news/feature/building-toward-dream-part-2-2013-12-04">15 Building Toward a Dream, Part 2</a>
<a href="https://web.archive.org/web/20230202181952/https://magic.wizards.com/en/news/feature/asphodel-2013-12-11">16 Asphodel</a>
<a href="https://web.archive.org/web/20230924005808/https://magic.wizards.com/en/news/feature/planeswalkers-guide-born-gods-2014-01-08">17 Planeswalker's Guide to Born of the Gods</a>
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>
<a href="https://magic.wizards.com/en/news/magic-story/dance-flitterstep-2014-02-19">21 Dance of the Flitterstep</a>
<a href="https://magic.wizards.com/en/news/magic-story/walls-akros-2014-02-26">22 The Walls of Akros</a>
<a href="https://magic.wizards.com/en/news/magic-story/hero-iroas-2014-03-05">23 The Hero of Iroas</a>
<a href="https://magic.wizards.com/en/news/magic-story/oracle-ephara-2014-03-19">24 The Oracle of Ephara</a>
<a href="https://web.archive.org/web/20211205115729/https://magic.wizards.com/en/articles/archive/uncharted-realms/seasons-setessa-2014-03-26">25 Seasons in Setessa</a>
<a href="https://web.archive.org/web/20230922141754/https://magic.wizards.com/en/news/feature/planeswalkers-guide-journey-nyx-2014-04-02">26 Planeswalker's Guide to Journey into Nyx</a>
<!--<a href="https://magic.wizards.com/en/news/magic-story/ajani-mentor-heroes-2014-12-17">27 Ajani, Mentor of Heroes</a>-->
<a href="https://web.archive.org/web/20211023194705/https://magic.wizards.com/en/articles/archive/uncharted-realms/labyrinth-labors-2014-04-16">28 The Labyrinth of Labors</a>
<a href="https://web.archive.org/web/20231204094100/https://magic.wizards.com/en/news/feature/desperate-stand-2014-04-16-0">29 Desperate Stand</a>
<!--<a href="https://magic.wizards.com/en/news/magic-story/dreams-city-2014-04-23">30 Dreams of the City</a>-->
<a href="https://web.archive.org/web/20230226010657/https://magic.wizards.com/en/news/magic-story/thank-gods-2014-04-30">31 Thank the Gods</a>
<a href="https://web.archive.org/web/20230926130458/https://magic.wizards.com/en/news/making-magic/journeys-end-2014-05-26">32 Journey's End</a>
<a href="https://web.archive.org/web/20230130113433/https://magic.wizards.com/en/news/magic-story/kruphixs-insight-2014-06-11">33 Kruphix's Insight</a>
<a href="https://web.archive.org/web/20221204012503/https://magic.wizards.com/en/news/feature/ajanis-vengeance-2014-07-23">34 Ajani's Vengeance</a>
<a href="https://web.archive.org/web/20230205144542/https://magic.wizards.com/en/news/magic-story/drop-drop-2015-05-20">35 Drop for Drop</a>
<a href="https://web.archive.org/web/20230131203628/https://magic.wizards.com/en/news/magic-story/its-time-talk-commander-2016-edition-2016-10-26">36 It's Time to Talk Commander (2016 Edition)!</a>
I am starting work on the parser, but was wondering whether it could target different sites, ignore the TOC, and only request a manual list. My workflow would be improved if there were just a box for URLs and the titles could be extracted from those pages, rather than having to write an HTML chapter list with <a href="">Title here</a> entries. Is there a way to force this with a new parser?
@Darthagnon
I'm not quite sure what you're asking for. Is it something that walks https://magic.wizards.com/en/news/archive, treating each article as a chapter to collect?
Apologies, my explanation was rather confusing.
"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because
- the Wizards website is broken/useless/missing chapters, so there is no auto-parser that could work (EDIT: without too much work). An auto-parser would need to process:
  - https://magic.wizards.com/en/news/archive (2024)
  - https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (unreliable infinite scroller)
  - https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s)
  - https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
- a lot of chapters are not story-related, so less useful for EPUB.
Questions
- Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
- Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
- Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a>. Could it be changed to just take a list of URLs? E.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>
we could have
https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14
... with the titles read from each page according to the filter template and placed in editable fields in the chapter list.
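Until something like this exists in the extension, the conversion from bare URLs to chapter-list lines can be scripted. The following is only a rough sketch: `slugToTitle` and `urlsToChapterList` are hypothetical helpers, not part of WebToEpub, and the titles are guessed from the URL slug (which drops words like "of the"), so they are placeholders to edit by hand rather than the real article titles.

```javascript
// Hypothetical helpers, not part of WebToEpub.

// Guess a placeholder title from the last path segment of a URL,
// e.g. ".../cowardice-hero-2014-01-22" -> "Cowardice Hero".
function slugToTitle(url) {
    const slug = new URL(url).pathname.split("/").filter(Boolean).pop() || url;
    return slug
        .replace(/-\d{4}-\d{2}-\d{2}(-\d+)?$/, "") // drop the trailing date
        .split("-")
        .map(word => word.charAt(0).toUpperCase() + word.slice(1))
        .join(" ");
}

// Turn a block of text with one URL per line into numbered <a> lines
// in the shape used by the hand-written chapter list above.
function urlsToChapterList(text) {
    return text
        .split(/\r?\n/)
        .map(line => line.trim())
        .filter(line => line !== "")
        .map((url, i) => {
            const number = String(i + 1).padStart(2, "0");
            return `<a href="${url}">${number} ${slugToTitle(url)}</a>`;
        })
        .join("\n");
}
```

Pasting the output into "Edit chapter URLs" should then give the same shape as the hand-written list, with only the titles left to correct.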
Many thanks for any advice or help!
I have started implementation here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js
@Darthagnon
Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js, although it's not obvious how it works. An example is:
    findCoverImageUrl(dom) {
        return util.getFirstImgSrc(dom, ".thumbook, .sertothumb");
    }
This looks for an image using two CSS selectors, ".thumbook" and ".sertothumb", and picks the first it finds. As the two sites have a different layout, only one will succeed.
An alternative way to handle multiple sites: the "dom" parameter holds the URL of the page in dom.baseURI. You could extract the hostname from the URL and then switch the logic based on that.
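As a rough sketch of that hostname switch (an illustration only, not code from WebToEpub): the selector strings below are copied from the settings at the top of this issue, while `LAYOUTS`, `originalUrl` and `pickLayout` are hypothetical names. It also assumes that a Wayback Machine snapshot uses the layout implied by the original URL's path, which may not hold for every capture.

```javascript
// Selector sets copied from the settings listed earlier in this thread.
// The object shape (content/title/remove) is hypothetical, not a WebToEpub API.
const LAYOUTS = {
    newStory: {
        content: "#article-body",
        title: "#article-body > div > article > header > h1",
        remove: "#article-body > div > aside, #article-body > div > article > div.css-AerwF"
    },
    oldStory: {
        content: "#main-content > article",
        title: "#main-content > h1",
        remove: "#content > aside"
    },
    unchartedRealms: {
        content: "#content > div.center-content",
        title: "#ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_headerPanel > div.description > h4",
        remove: "#topNav, #leftColumn, #footerWrap" // shortened; full list is in the issue
    },
    mtglore: {
        content: "#content > div",
        title: "#content > div > div",
        remove: null // no excludes were listed for this site
    }
};

// Strip a Wayback Machine prefix so archived and live pages are treated alike.
function originalUrl(url) {
    const match = /^https?:\/\/web\.archive\.org\/web\/[^/]+\/(.+)$/.exec(url);
    return match ? match[1] : url;
}

// Choose a selector set from the page URL (e.g. dom.baseURI).
// Returns null for sites this sketch does not know about.
function pickLayout(baseURI) {
    const url = new URL(originalUrl(baseURI));
    const host = url.hostname.replace(/^www\./, "");
    if (host === "wizards.com") return LAYOUTS.unchartedRealms;
    if (host === "mtglore.com") return LAYOUTS.mtglore;
    if (host === "magic.wizards.com") {
        return url.pathname.startsWith("/en/articles/archive/")
            ? LAYOUTS.oldStory
            : LAYOUTS.newStory;
    }
    return null;
}
```

Inside a parser method, something like `const layout = pickLayout(dom.baseURI);` followed by `dom.querySelector(layout.content)` would then pick the content element for whichever site the chapter came from.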
That said, WebToEpub is supposed to check the URL for each page, and then select the appropriate parser even if the Table of Contents is a mixture of sites. So, you might not need a combined parser. Just write one for each site.
Fixed in #1500. @Darthagnon Updated version (1.0.0.0) has been submitted to the Firefox and Chrome stores. The Firefox/Chrome version is available now.