
Cannot parse NYTimes pages

Open GoogleCodeExporter opened this issue 9 years ago • 2 comments

Boilerpipe cannot parse NYTimes pages. I get no output when I try it on
NYTimes pages.

Original issue reported on code.google.com by [email protected] on 23 Sep 2012 at 5:38

GoogleCodeExporter avatar Mar 24 '15 10:03 GoogleCodeExporter

We're seeing the same problem. It's not all NYT pages, but some.

E.g. these don't work: 
http://theater.nytimes.com/2013/03/01/theater/reviews/sondheim-and-lapines-passion-at-classic-stage-company.html?ref=arts

http://theater.nytimes.com/2013/03/01/theater/reviews/the-revisionist-at-the-rattlestick-theater.html?ref=arts&_r=0

but this one works:
http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp

Original comment by [email protected] on 2 Mar 2013 at 12:54


Any change on this issue? I am seeing the same thing with parsing NYT pages for 
my application. I think this might be related to the fact that NYT tries to set 
a cookie when a client makes a request. Would love to know any workarounds 
people have for this.

Original comment by [email protected] on 10 Jun 2014 at 6:41
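
One way to test the cookie hypothesis from the comment above is to fetch the page yourself with a cookie store enabled and hand the resulting HTML to boilerpipe as a string, rather than letting it open the URL. The sketch below is only an illustration under that assumption: the class name `CookieAwareFetch` and its methods are hypothetical, it uses only `java.net` from the JDK, and it has not been confirmed to fix the NYT pages listed in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of boilerpipe.
public class CookieAwareFetch {

    // Install a JVM-wide cookie store so that cookies set by one response
    // are sent back on follow-up requests, including redirects.
    public static void installCookieManager() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    // Download the page body as a string. UTF-8 is assumed here; a real
    // client would read the charset from the Content-Type header.
    public static String fetchHtml(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java CookieAwareFetch <url>");
            return;
        }
        installCookieManager();
        String html = fetchHtml(args[0]);
        System.out.println(html.length() + " characters fetched");
        // With boilerpipe on the classpath, the string can then be passed to
        // ArticleExtractor.INSTANCE.getText(html) instead of a URL.
    }
}
```

If the fetched HTML is non-empty but boilerpipe still returns no text, the cookie is probably not the cause and the page markup itself would be the next thing to check.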
