parser
Extract meaningful content from the chaos of a web page
Hi. As per https://www.fivefilters.org/2022/kindle-epub-issues/, after some recent changes Amazon's Send to Kindle service no longer takes the title from the EPUB metadata. As a result, webpages sent to a...
ERROR Command failed with exit code 128: git ls-remote git+ssh://git@github.com/postlight/difflib.js.git HEAD ssh: connect to host github.com port 22: Connection timed out fatal: Could not read from remote repository. Please make...
## Expected Behavior Image should be shown ## Current Behavior Image is not shown and instead its alt text is shown ## Steps to Reproduce When processing this url: https://denikreferendum.cz/clanek/34961-nova-vlna-teroru-a-nasili-v-palestine-a-izraeli-a-cesky-postoj...
I have installed both Postlight Parser and cheerio. In a parallel repository, both work fine. I am using React and Vite to create a Chrome extension that needs to use...
Fixes issue with `--version` flag mentioned here: https://github.com/postlight/parser/pull/610#issuecomment-1772072583 This is needed for ArchiveBox (and many other UNIX tools) to autodetect the version.
Is there an interface for determining whether the content of a web page is readable, similar to `readability`?
Currently, on the following pages, the parser seems to get lost. I don't see any markup problems. Maybe the newspapers detect and block the scraper? https://www.derstandard.at/story/2000145508819/franzoesischer-verfassungsrat-stimmt-umstrittener-pensionsreform-zu there an info is added...
This may be an error on my part (`user_agent` versus `user-agent`, i.e. `_` versus `-`), but the header that gets sent is surprising. ## Expected Behavior Want to override user agent ## Current Behavior postlight-parse http://localhost:8000/test_postlight.html --header.user_agent=my_user_agent...
## Expected Behavior Postlight Parser should preserve all the actual content of the page. ## Current Behavior Postlight Parser will get rid of any bulleted / numbered lists which consist...
This PR includes comments from a Reddit thread