pismo
Extracts machine-readable metadata and content from Web pages
The `doc.images` call returns `nil` every time, even when the HTML page contains valid images with absolute URLs. The `reader_doc.images` array is also empty every time.
coder.io not accessible
Year, hour and minutes were missing from datetime detection for this format: "Jul. 25, 2012 10:46 a.m". Because of this, Chronic was inferring the year, hour and minutes wrongly.
Is it possible to add support for different languages? Maybe some kind of API or setting for it?
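One way to handle the format from the report above is to normalise the abbreviation dots before parsing. The following is a minimal sketch using only Ruby's standard library; `parse_dotted_datetime` is a hypothetical helper, not part of Pismo or Chronic, and it assumes the exact "Mon. DD, YYYY HH:MM a.m" layout quoted in the issue.

```ruby
require 'time'

# Hypothetical helper (not Pismo's API): parses strings such as
# "Jul. 25, 2012 10:46 a.m" by normalising them for Time.strptime.
def parse_dotted_datetime(text)
  # "a.m", "a.m.", "p.m", "P.M." -> "AM" / "PM"
  cleaned = text.gsub(/\b([ap])\.m\.?/i) { "#{$1.upcase}M" }
  # "Jul." -> "Jul" (three-letter month abbreviation followed by a dot)
  cleaned = cleaned.sub(/\b([A-Z][a-z]{2})\./, '\1')
  # Assumed layout: abbreviated month, day, four-digit year, 12-hour clock
  Time.strptime(cleaned, '%b %d, %Y %I:%M %p')
end

stamp = parse_dotted_datetime('Jul. 25, 2012 10:46 a.m')
puts stamp.strftime('%Y-%m-%d %H:%M')
```

Parsing against an explicit layout avoids guessing the year or time, at the cost of only covering this one format.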
burl="http://www.momfluential.net"
=> "http://www.momfluential.net"
ruby-1.9.2-p0 > pismo = Pismo[burl]
ArgumentError: invalid byte sequence in UTF-8
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `gsub!'
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:48:in `clean_html'
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:36:in `load'
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo/document.rb:16:in `initialize'
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `new'
  from /Users/jtoy/.rvm/gems/ruby-1.9.2-p0/gems/pismo-0.7.2/lib/pismo.rb:29:in `[]'
  from...
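The `ArgumentError` in the trace above is what Ruby raises when a regex operation such as `gsub!` runs on a string that is tagged UTF-8 but contains invalid bytes. A possible workaround, sketched below, is to drop the invalid bytes before cleaning the HTML. `scrub_invalid_utf8` is a hypothetical helper, not Pismo's code, and `String#scrub` needs Ruby 2.1 or later, so it would not run on the ruby-1.9.2 shown in the report.

```ruby
# Hypothetical helper (not part of Pismo): returns a copy of the input
# tagged as UTF-8 with any invalid byte sequences removed.
def scrub_invalid_utf8(html)
  html.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# A byte string with one stray 0xFF byte, standing in for a badly encoded page.
raw = "caf\xC3\xA9 \xFF menu".b

clean = scrub_invalid_utf8(raw)
clean.gsub!(/\s+/, ' ') # no longer raises "invalid byte sequence in UTF-8"
puts clean
```

Removing bytes loses data; passing a replacement such as `'?'` to `scrub` keeps a visible marker instead.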
Any fix planned for allowing redirects? Thanks! "redirection forbidden: http://www.bettiepageclothing.com -> https://www.bettiepageclothing.com/"
When I tried to get images from this website, I got an exception: http://verkoren.wordpress.com/2013/04/12/you-cant-skate-you-old/
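The "redirection forbidden" message is the one open-uri raises when it refuses a redirect between schemes. A minimal sketch of a workaround is to follow redirects manually with `Net::HTTP` and hand the final response to the extractor. `fetch_following_redirects`, its redirect limit, and the injectable `fetcher` argument are all assumptions made for illustration and are not Pismo's API.

```ruby
require 'net/http'
require 'uri'

MAX_REDIRECTS = 5 # arbitrary limit chosen for this sketch

# Hypothetical helper: fetches a URL and follows redirects, including
# http -> https. The fetcher is injectable so it can be stubbed in tests.
def fetch_following_redirects(url, limit = MAX_REDIRECTS,
                              fetcher = Net::HTTP.method(:get_response))
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = fetcher.call(URI.parse(url))
  case response
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    target = URI.join(url, response['location']).to_s
    fetch_following_redirects(target, limit - 1, fetcher)
  else
    response
  end
end
```

A caller could then pass the fetched body to Pismo as a string, assuming the document class accepts raw HTML as well as URLs.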
Noticed this when I was using the Pismo powered ‘entry text extraction’ on Feedbin. ``` irb >> Pismo['http://hsivonen.iki.fi/accept-charset/'].lede => "Accept-Charset Is No More. Now that Firefox 10 has been released,...
Maybe you want to add case-sensitive matchers for looking up the favicon: ``` ['link[@rel="Shortcut Icon"]', lambda { |el| el.attr('href') }], ``` https://github.com/fluxsaas/pismo/blob/master/lib/pismo/internal_attributes.rb#L36 Also, it might be nice to add...
Ran into an issue with Pismo's default reader returning the wrong section of an HTML document for its `body`/`html_body` fields. It does work, however, with the cluster reader. This might...