HTMLReader
not detecting complete set of meta tags
Hi,
I am trying to parse meta tags using this code:
NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];
I ran the code through this page: http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html
and it only picked up 31 meta tags when there are clearly 50+.
Hello! I'm not seeing the same results as you. I downloaded the HTML with the command
curl -OL "http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html"
When I opened the downloaded file in a text editor, I counted ten instances of <meta. When I loaded it into HTMLReader, the array returned by [document nodesMatchingSelector:@"meta"] had a count of ten.
May I ask how you're counting 31 and 50+ meta tags? Are you loading the page in a browser?
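For anyone repeating the text-editor count, here is a minimal sketch of the same check from the command line. The function name count_meta and the filename page.html are made up for illustration, and this counts raw occurrences of the string <meta, which is not the same thing as what a parser sees.

```shell
# Count occurrences of the literal string "<meta" in a saved HTML file.
# grep -o prints each match on its own line, so wc -l counts matches
# rather than matching lines (one line can hold several tags).
# -i also catches upper-case <META.
count_meta() {
  grep -o -i '<meta' "$1" | wc -l | tr -d ' '
}

# Example (page.html is a hypothetical filename):
#   count_meta page.html
```

A raw string count also picks up <meta inside comments or scripts, so treat it as a rough cross-check only.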
Hi Nolan
Thanks for getting back so quickly. Yup, when I look at the source code in Chrome it's showing at least 50 tags. I'm using the exact code from your GitHub readme and added:
NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];
NSLog(@"metaNodes %@", metaNodes);
The metaNodes array consistently comes back on my side to be 31 objects.
Wow, that's very strange that you are getting 10. Would the meta tag count be different based on how you load the page? Maybe some tags are generated by JavaScript after it is loaded? That wouldn't make sense either, because I went through metaNodes line by line and it matches the page source correctly until it cuts off at:
twitter:app:url:googleplay
Thanks!
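One way to see what sits right after the cut-off point is to list the tags from a saved copy of the page in document order. This is only a sketch: list_meta, meta_after and page.html are hypothetical names, and the pattern is a rough text match, not a real HTML parse.

```shell
# Print every <meta ...> tag in document order, numbered.
# Caveats: grep works line by line, so a tag split across lines is missed,
# and an attribute value containing ">" will truncate the match.
list_meta() {
  grep -o -i '<meta[^>]*>' "$1" | nl
}

# Print the tag containing a marker string plus the tag that follows it.
#   $1 = marker text (matched literally), $2 = HTML file
meta_after() {
  grep -o -i '<meta[^>]*>' "$2" | grep -F -A 1 -- "$1"
}

# Example (page.html is a hypothetical filename):
#   meta_after 'twitter:app:url:googleplay' page.html
```

If the tag after the marker looks unusual in the saved file, that would point at the markup. If the saved file itself ends near there, that would point at what was downloaded.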
— Reply to this email directly or view it on GitHub https://github.com/nolanw/HTMLReader/issues/35#issuecomment-131995003.
Oh, I was not very careful. Turns out that curl command gets redirected to the login page because it isn't accepting cookies. I changed it to:
curl -OL -c cookies.txt "http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html"
That got me a text file with 92 instances of the string <meta. I dragged it into a playground with HTMLReader and ran
import HTMLReader
let path = NSBundle.mainBundle().pathForResource("updog.html", ofType: nil)!
let data = NSData(contentsOfFile: path)!
let home = HTMLDocument(data: data, contentTypeHeader: nil)
home.nodesMatchingSelector("meta").count
and got a count of 92 matching nodes.
Just for fun, I went to the page you linked in Safari, opened the Web Inspector, typed
document.querySelectorAll('meta').length
into the console, and got 93.
Is any of this helpful?
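Since the first download silently ended up on the login page, it may help to have curl report where a request actually lands. A minimal sketch, assuming a reasonably recent curl; fetch_report is a made-up helper name, and the -w variables used (http_code, num_redirects, url_effective) are standard curl write-out variables.

```shell
# Fetch a URL while following redirects and keeping cookies, then print
# the final status code, the number of redirects, and the final URL.
#   $1 = URL, $2 = output file
fetch_report() {
  curl -sS -L -c cookies.txt -o "$2" \
    -w 'status=%{http_code} redirects=%{num_redirects} final=%{url_effective}\n' \
    "$1"
}

# Example (page.html is a hypothetical filename):
#   fetch_report "http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html" page.html
```

If the final URL is not the article URL, the saved file is probably not the article either, whatever the tag count says.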
Hi Nolan
Yup I'm getting the 93 when I'm running the javascript query as well.
I'm running this code in a brand new iOS single-view project after importing HTMLReader.h via CocoaPods and I'm still getting 31. Very strange.
- (void)viewDidLoad {
    [super viewDidLoad];
    NSURL *url = [NSURL URLWithString:@"http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html?_r=0"];
    NSURLSession *session = [NSURLSession sharedSession];
    [[session dataTaskWithURL:url completionHandler:
      ^(NSData *data, NSURLResponse *response, NSError *error) {
        NSString *contentType = nil;
        if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
            NSDictionary *headers = [(NSHTTPURLResponse *)response allHeaderFields];
            contentType = headers[@"Content-Type"];
        }
        HTMLDocument *document = [HTMLDocument documentWithData:data
                                              contentTypeHeader:contentType];
        NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];
        NSLog(@"metaNodes %@ count:%lu", metaNodes, (unsigned long)[metaNodes count]);
    }] resume];
}
I'm afraid I'm nearly out of ideas! If you take the data from your snippet and log it out as a string, do you see the markup you expect to see?
@jameslin101 did you ever solve this?
No, I was not able to solve it. It's probably a one-off issue with that particular article, but very strange.