WebToEpub
Could ChatGPT be used to write site specific parsers?
ChatGPT (and Google's Bard, and similar) can apparently write code for tasks. So, could it be persuaded to write parsers for specific web sites?
@dteviot my attempt with Bard failed.
@gamebeaker Any notes? What did you try? What was the result? It's hard to learn much from just "failed".
Here is a link to what I tried: https://g.co/bard/share/fb1b5954cece
Hmm....
Thanks.
AIs may have better luck with a more generic parser, where less actual code needs to be written. No idea if it will actually be helpful, but I've used one a few times on and off; with it, the AI would theoretically just pass the parameters for instantiation instead of writing code.
To be clear, the code needs cleanup and doesn't currently handle paging. It is possible to implement paging as additional parameters; I lost that changeset a while back and haven't used this in a while. GenericParser.txt
That said, if this generic parser were actually perfect, I'd have submitted it ages ago.
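To give an idea of the shape of it: the model (or a human) only fills in a small bundle of CSS selectors, and the shared code does the rest. A rough sketch below; the field and method names are illustrative and only loosely modelled on WebToEpub's parser interface, not the actual contents of GenericParser.txt:

// Sketch of a selector-driven generic parser. Only the `config` object
// needs to be supplied for a new site; no per-site code is written.
class GenericSelectorParser {
    constructor(config) {
        // config: { chapterLinks, title, author, content } -- all CSS selectors
        this.config = config;
    }
    getChapterUrls(dom) {
        return [...dom.querySelectorAll(this.config.chapterLinks)]
            .map(a => ({ sourceUrl: a.href, title: a.textContent.trim() }));
    }
    extractTitle(dom) {
        return dom.querySelector(this.config.title)?.textContent.trim() ?? null;
    }
    extractAuthor(dom) {
        return dom.querySelector(this.config.author)?.textContent.trim() ?? null;
    }
    findContent(dom) {
        return dom.querySelector(this.config.content);
    }
}

// Instantiation is just parameters -- hypothetical selectors shown here.
const parser = new GenericSelectorParser({
    chapterLinks: ".toc a",
    title: "h1.title",
    author: ".author a",
    content: "article.chapter",
});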
Trying to be more specific didn't work.
Request:
Provide JavaScript that, given the URL of a story's Table of Contents, can extract the author, for stories on the site https://kakuyomu.jp, a typical Table of Contents page is https://kakuyomu.jp/works/1177354054894027232
Response:
As a language model, I'm not able to assist you with that.
Request:
Page https://kakuyomu.jp/works/1177354054894027232 is the table of contents for a web novel. How can I find the Author?
Response:
Unfortunately, I couldn't find the author information on that page.
(Note: the author appears in 3 places on the page.)
You have to remember that language models process text, not URLs. The support for external requests is rather hacky: generally it works by downloading the page and including its text for the model to read (in a special format). I tried manually including the HTML in ChatGPT 3.5 and Bard, and it is too long to include in the prompt. In Bard I also tried including a placeholder like <full html of https://kakuyomu.jp/works/1177354054894027232>; that does get it to download the page, but I think the downloader preprocesses the page to plain text, because the CSS selectors it output (.widget-user, .user-name) bore no relationship to the HTML of the page.
So what is necessary is a model that supports input sizes of around 360 KB (the page's HTML size). CodeLlama in theory supports a 100K-token context window, which might be enough (?), but it isn't available as a convenient web service so I haven't tried it. I also found some posts along the lines of "100k tokens is a meme". OTOH Claude 2 is in open beta, supports 200k tokens, and you can just attach files directly, so I went with that. Even with the huge limit, the HTML file was ~55% too large; I think part of it is that the free token limit is lower than the paid one, but also HTML files are just huge. I cut it down by removing the JSON slug at the end and trimming out some CSS files, JS files, and SVG paths.
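For what it's worth, that sort of trimming can be mechanised. A rough sketch of a helper (my own illustration, not part of WebToEpub) that strips the parts a model doesn't need before the markup goes into a prompt:

// Sketch: shrink an HTML document before pasting it into an LLM prompt.
// Removes scripts, styles, SVGs, stylesheet links and comments, which carry
// nothing the model needs for picking CSS selectors.
function shrinkHtmlForPrompt(html) {
    const doc = new DOMParser().parseFromString(html, "text/html");
    doc.querySelectorAll("script, style, svg, link[rel='stylesheet'], noscript")
        .forEach(el => el.remove());
    // Collect and remove comment nodes.
    const walker = doc.createTreeWalker(doc, NodeFilter.SHOW_COMMENT);
    const comments = [];
    while (walker.nextNode()) comments.push(walker.currentNode);
    comments.forEach(c => c.remove());
    return doc.documentElement.outerHTML;
}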
Now the question: did it work?
Request:
I have provided the HTML code of a sample story's Table of Contents page on the site https://kakuyomu.jp/. Please provide JavaScript code that can extract the author from similarly-structured pages on the site https://kakuyomu.jp/. I would recommend finding the correct element using a CSS selector or XPath query, but you may use whatever logic is necessary. Write it as the implementation of an extractAuthor(dom) JavaScript function, where dom is the result of (await HttpClient.wrapFetch(url)).responseXML. The author's name for this sample page is 羽田宇佐.
First I tried it without the author's name; it gave:
const titleElement = dom.querySelector('h1.Heading_heading__lQ85n');
const authorElement = titleElement.querySelector('.WorkTitle_workLabelAuthor__Kxy5E');
The title selector is correct, but the author selector matches the list of authors in related works (not the author of this work), and when scoped to the title element it matches nothing.
Then I added the author's name. After trying several times and getting capacity-limited, I eventually got through and it suggested this code:
const authorLink = dom.querySelector('a[href="/users/hanedausa"]');
This works for the page, but obviously not in general.
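For comparison, a hand-written fix would generalise the link rather than hard-coding the user slug. Something like the sketch below, though it rests on the (unverified) assumption that the work's own author link is the first /users/ link in the document:

// Hypothetical generalisation of the model's output: match any author profile
// link instead of one specific user slug. Assumes the first /users/ link on
// the page belongs to the work's author, which would need checking.
function extractAuthor(dom) {
    const authorLink = dom.querySelector('a[href^="/users/"]');
    return authorLink ? authorLink.textContent.trim() : null;
}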
In conclusion, it does at least seem to be reading the page, but it has relatively little understanding of the DOM. None of the selectors involved parent-child relationships or anything like that. The LLMs are clearly struggling with raw HTML, in both capacity limits and understanding.
Short answer: no, it doesn't work, the model got confused.
I would say that, to work properly, you need a multi-step structured process that doesn't involve feeding LLMs raw HTML. Render a picture of the page, run an image model over it to identify elements such as the title, author, etc., translate the coordinates back into DOM elements, and then use a second model (or just heuristics) to guess CSS selectors given the concrete DOM path.
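The coordinates-to-DOM step is the mechanical part. A sketch of roughly what I have in mind, assuming the page is rendered in a live tab (elementFromPoint needs layout) and the image model returns viewport-pixel bounding boxes:

// Sketch: map a bounding box from an image model back to a DOM element and a
// naive CSS selector built from the concrete DOM path.
function selectorForBox(box) {
    const el = document.elementFromPoint(
        box.x + box.width / 2,
        box.y + box.height / 2
    );
    if (!el) return null;
    const parts = [];
    for (let node = el; node && node !== document.body; node = node.parentElement) {
        if (node.id) { parts.unshift("#" + CSS.escape(node.id)); break; }
        const cls = [...node.classList].map(c => "." + CSS.escape(c)).join("");
        parts.unshift(node.tagName.toLowerCase() + cls);
    }
    return parts.join(" > ");
}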
@Mathnerd314
Thanks for that. (And that's a lot more effort than I was willing to put into it.) To be honest, I was not expecting this to work.
But, given the hype about LLMs, I thought I'd give it a go.
That said, when I tried some other stuff, I was somewhat surprised by the response to "How can I convert a Web Novel to an epub?" I got three methods back. The 3rd was to use WebToEpub.
Well, the annoying thing is that it "almost" works. It generates code, it generates CSS selectors in that code, and the CSS selectors even match some things in the HTML. They are just the wrong things. It suggests that if you tripled the size of the model and fine-tuned it on some examples it might actually work. So really it is a hardware problem. I would say if you don't want to try the image recognition route, then just shelve it and try it again in like 3-4 years when every new computer comes with a dedicated AI chip and the models have gotten better.
But also, there are DOM-aware approaches:
- https://www.youtube.com/watch?v=QIfmJHyQIlY
- https://arxiv.org/pdf/2201.10608.pdf
- https://www.mdpi.com/2504-4990/3/1/6
They aren't off-the-shelf (yet), though, so they are probably overkill compared to the heuristics you have now and the simple expedient of manually specifying the CSS selectors, at least until someone releases a library that you can just start using.