not detecting complete set of meta tags

Open jameslin101 opened this issue 10 years ago • 7 comments

Hi,

I am trying to parse meta tags using this code:

NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];

I ran the code through this page: http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html

and it only picked up 31 meta tags when there are clearly 50+.

jameslin101 avatar Aug 17 '15 19:08 jameslin101

Hello! I'm not seeing the same results as you. I downloaded the HTML with the command

curl -OL "http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html"

When I opened the downloaded file in a text editor, I counted ten instances of <meta. When I loaded it into HTMLReader, the array returned by [document nodesMatchingSelector:@"meta"] had a count of ten.

May I ask how you're counting 31 and 50+ meta tags? Are you loading the page in a browser?

nolanw avatar Aug 17 '15 23:08 nolanw

Hi Nolan

Thanks for getting back so quickly. Yup, when I look at the source code in Chrome it's showing 50+ meta tags. I'm using the exact code from your GitHub README and added:

NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];

NSLog(@"metaNodes %@", metaNodes);

The metaNodes array consistently comes back with 31 objects on my side.

Wow, it's very strange that you are getting 10. Would the meta tag count be different based on how you load the page? Maybe some of the tags are generated by JavaScript after it loads? That wouldn't make sense either, because I went through metaNodes line by line and it matches the page source correctly until it cuts off at:

twitter:app:url:googleplay

Thanks!

— Reply to this email directly or view it on GitHub https://github.com/nolanw/HTMLReader/issues/35#issuecomment-131995003.

jameslin101 avatar Aug 17 '15 23:08 jameslin101

Oh, I was not very careful. Turns out that curl command gets redirected to the login page because it isn't accepting cookies. I changed it to:

curl -OL -c cookies.txt "http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html"

And got a text file with 92 instances of the string <meta, and with HTMLReader I dragged it into the playground and did

import HTMLReader

let path = NSBundle.mainBundle().pathForResource("updog.html", ofType: nil)!
let data = NSData(contentsOfFile: path)!
let home = HTMLDocument(data: data, contentTypeHeader: nil)
home.nodesMatchingSelector("meta").count

and got a count of 92 matching nodes.
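For anyone who wants to repeat the text-editor count from the command line, here is a minimal sketch. It assumes a grep that supports `-o` (GNU and BSD grep both do); the `count_meta` function and the small sample file are made up for illustration and stand in for the downloaded page.

```shell
#!/bin/sh
# Count raw occurrences of the string "<meta" in a saved HTML file.
# grep -o prints every match on its own line, so wc -l counts matches
# rather than matching lines. -i also catches upper-case "<META".
count_meta() {
    grep -o -i '<meta' "$1" | wc -l | tr -d ' '
}

# Hypothetical sample file standing in for the downloaded page.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html><head>
<meta charset="utf-8"><meta name="a" content="1">
<META name="b" content="2">
</head><body></body></html>
EOF

count_meta "$sample"   # prints 3
```

This is a plain string count, so it would also pick up `<meta` inside comments or scripts, or longer tag names that merely start with `<meta`. A small difference between this number and what a parser reports would not be surprising on its own.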

Just for fun, I went to the page you linked in Safari, opened the Web Inspector, typed

document.querySelectorAll('meta').length

into the console, and got 93.

Is any of this helpful?

nolanw avatar Aug 19 '15 00:08 nolanw

Hi Nolan

Yup, I'm getting 93 when I run the JavaScript query as well.

I'm running this code in a brand new iOS Single View project after importing HTMLReader.h via CocoaPods and still getting 31. Very strange.

- (void)viewDidLoad {
    [super viewDidLoad];
    NSURL *url = [NSURL URLWithString:@"http://www.nytimes.com/2015/08/16/technology/inside-amazon-wrestling-big-ideas-in-a-bruising-workplace.html?_r=0"];
    NSURLSession *session = [NSURLSession sharedSession];
    [[session dataTaskWithURL:url completionHandler:
      ^(NSData *data, NSURLResponse *response, NSError *error) {
        NSString *contentType = nil;
        if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
            NSDictionary *headers = [(NSHTTPURLResponse *)response allHeaderFields];
            contentType = headers[@"Content-Type"];
        }
        HTMLDocument *document = [HTMLDocument documentWithData:data
                                              contentTypeHeader:contentType];
        NSArray *metaNodes = [document nodesMatchingSelector:@"meta"];
        NSLog(@"metaNodes %@ count:%lu", metaNodes, (unsigned long)[metaNodes count]);
    }] resume];
}

jameslin101 avatar Aug 19 '15 06:08 jameslin101

I'm afraid I'm nearly out of ideas! If you take the data from your snippet and log it out as a string, do you see the markup you expect to see?

nolanw avatar Aug 20 '15 22:08 nolanw

@jameslin101 did you ever solve this?

nolanw avatar Sep 20 '15 19:09 nolanw

No I was not able to solve it. Probably a one-off issue with that particular article, but very strange.

jameslin101 avatar Sep 20 '15 20:09 jameslin101