comics
Support fetching related text with more formatting
E.g. Darths & Droids has a large formatted text associated with each comic. Since these texts are often half the fun, comics should support fetching larger pieces of text with formatting, and keep a sane amount of that formatting, e.g. headers and bullet lists.
I believe @xim has been looking a bit at this, ref. xim/comics@fdea7223f33b8bb510fdf17976cb52eb63b5b926.
I don't remember which approach we ended up preferring. I made a tiny, general converter on my local computer. The idea was:
- Get the formatted HTML
- Use a dict that transforms elements, something like
{'p': lambda data: ' '.join(data.split()) + '\n\n', ...}
- Allow the individual crawler to override any element type in this dict
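
To make the idea above concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not the converter mentioned in this issue, just one way the dict-of-handlers approach could look. All names (`DEFAULT_HANDLERS`, `TextConverter`, `convert`, the `overrides` argument) are hypothetical, and the handlers for `h1`, `li`, `ul` and `br` are assumptions chosen for illustration; only the `p` handler comes from the description above.

```python
from html.parser import HTMLParser


def collapse(data):
    """Collapse runs of whitespace into single spaces."""
    return ' '.join(data.split())


# Hypothetical defaults: each handler receives the text collected inside
# an element and returns the plain-text rendering of that element.
DEFAULT_HANDLERS = {
    'p': lambda data: collapse(data) + '\n\n',
    'h1': lambda data: collapse(data).upper() + '\n\n',
    'li': lambda data: '- ' + collapse(data) + '\n',
    'ul': lambda data: data + '\n',
    'br': lambda data: '\n',
}

# Elements that never get a closing tag; they are rendered immediately.
VOID_TAGS = {'br', 'hr', 'img'}


class TextConverter(HTMLParser):
    def __init__(self, overrides=None):
        super().__init__()
        # A crawler can replace the handler for any element type.
        self.handlers = {**DEFAULT_HANDLERS, **(overrides or {})}
        # Stack of (tag, text pieces); the bottom entry is the document.
        self.stack = [(None, [])]

    def _render(self, tag, text):
        # Elements without a handler pass their text through unchanged.
        handler = self.handlers.get(tag)
        return handler(text) if handler else text

    def _close_top(self):
        # Render the innermost open element into its parent.
        tag, pieces = self.stack.pop()
        self.stack[-1][1].append(self._render(tag, ''.join(pieces)))
        return tag

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.stack[-1][1].append(self._render(tag, ''))
        else:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise close everything up to
        # and including the matching open element.
        if any(open_tag == tag for open_tag, _ in self.stack[1:]):
            while self._close_top() != tag:
                pass

    def handle_data(self, data):
        self.stack[-1][1].append(data)

    def text(self):
        # Fold any elements left unclosed, then return the result.
        while len(self.stack) > 1:
            self._close_top()
        return ''.join(self.stack[0][1]).strip()


def convert(html, overrides=None):
    parser = TextConverter(overrides)
    parser.feed(html)
    parser.close()
    return parser.text()
```

A crawler that wants to drop headers entirely could then call `convert(html, overrides={'h1': lambda data: ''})`. Handlers run from the innermost element outwards, so a `ul` handler sees the already rendered `li` items.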
I only tested this with rom.ac and QC, but it should enable good results on any comic. Further suggestions? =)