Nokogiri XML Namespaces and gzip decoding

Open nathanstitt opened this issue 11 years ago • 3 comments

This is needed for fixing a few issues that DocumentCloud has encountered while using calais for our entity extraction.

The first is that Calais sometimes returns gzipped content. When that occurs, an exception is thrown because the content can't be decoded. This may have been an intermittent issue with the API, but we figured it can't hurt to attempt to handle it. A further enhancement would be to ask for gzip encoding on the request, which would make responses more efficient to transfer.
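To make the gzip handling concrete, here is a minimal sketch of the idea, not the code from this branch. The helper name `maybe_gunzip` is hypothetical, and it assumes the decoded payload is UTF-8 text. It checks for the two-byte gzip magic number and falls back to the original body if decoding fails.

```ruby
require "zlib"

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: returns the decompressed body when it looks like
# gzip, and the body unchanged otherwise.
def maybe_gunzip(body)
  bytes = body.to_s.b
  return body unless bytes.start_with?(GZIP_MAGIC)

  # Assumption: the decompressed payload is UTF-8 encoded text.
  Zlib.gunzip(bytes).force_encoding(Encoding::UTF_8)
rescue Zlib::Error
  # Looked like gzip but could not be decoded; hand back the original.
  body
end
```

A caller would pass the raw response body through `maybe_gunzip` before parsing it. Sniffing the magic bytes, rather than trusting a `Content-Encoding` header, is a deliberate choice for this sketch because the issue describes the gzipped responses as intermittent.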

The second is more pressing. Newer versions of Nokogiri handle namespace prefixes differently. I believe issues #10 and #11 are attempting to fix the same bug. #11 indicates that the bug started with Nokogiri 1.5.6, but I haven't tracked down when the change occurred.

DocumentCloud has been running with this branch in production for several months now without issue (https://github.com/documentcloud/documentcloud/blob/master/Gemfile#L5). We'd really like to get it merged and a new gem cut so we can remove the "git" references from our Gemfile.

Thanks for the excellent job you've done with the gem thus far. If I can help with any further testing or merging, please let me know.

nathanstitt avatar Jun 24 '14 21:06 nathanstitt

@nathanstitt, I'm looking for someone to properly take ownership of this project since I don't have the cycles to do it myself. Any thoughts on DocumentCloud or yourself taking this on? I could see you guys actually running with it.

abhay avatar Jun 25 '14 06:06 abhay

@abhay I totally understand, stuff can get crazy and sometimes there just aren't enough days in the week.

We'd be very interested in taking over the project. I think it would fit very well with DocumentCloud's mission since we depend on it quite a bit for our entity support.

I'm not 100% sure how that would go down, but I'm assuming you could just transfer the repo to DocumentCloud's GitHub account and transfer the Ruby gem to us. Feel free to email me directly, or swing by the #documentcloud IRC channel if you'd like to discuss in real time.

nathanstitt avatar Jun 25 '14 14:06 nathanstitt

Hi Abhay,

Have you given any further thought to allowing DocumentCloud to take over support of the gem? We're still attempting to clean up the Gemfile. Please let us know if we can help further.

Thanks very much.

nathanstitt avatar Jul 16 '14 22:07 nathanstitt