Missing content string when parsing HTML

Open feigi opened this issue 15 years ago • 3 comments

Hi!

I want to use hpple in my little iPhone project and stumbled upon a problem. When parsing a table in a website, I noticed that text within a <td> tag was not parsed. Following is the example I tested with:


Un couple épatant (2002)

aka "Trilogy: Two" - International (English title) , UK

aka "An Amazing Couple" - International (English title)

aka "Two" - UK

The text (2002) should be content of the <td> tag but querying the respective element's content resulted in an empty string. I debugged and found the cause, although I am not sure whether it broke anything else, which is why I would kindly ask you to look into it.

The problem lies in XPathQuery's DictionaryForNode method:

    if ([[resultForNode objectForKey:@"nodeName"] isEqual:@"text"] && parentResult) {
        [parentResult setObject:[currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                         forKey:@"nodeContent"];
        return nil;
    }

If I got it right, the parser checks the currentNode's name first and if this is set to "text" (I assume libxml does that) it sets the parentNode's content to that string. The problem here is, it doesn't check whether there is already a string. Instead it replaces a potentially present string, even with an empty string. Here is my solution (again not really tested yet):

    if ([[resultForNode objectForKey:@"nodeName"] isEqual:@"text"] && parentResult) {
        currentNodeContent = [currentNodeContent stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
        NSString *parentNodeContent = [parentResult objectForKey:@"nodeContent"];
        if (parentNodeContent) {
            currentNodeContent = [parentNodeContent stringByAppendingString:currentNodeContent];
        }
        [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
        return nil;
    }

Please note that I don't remove newline characters. I'm not sure what the best approach is here, but a string without any separators doesn't seem like a good idea to me.
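The overwrite described above can be reproduced without hpple or libxml. The following is only a sketch: `CollectTextNodes` is a hypothetical helper, not part of the library, and the input assumes the cell held a text node followed by a whitespace-only text node, which is a guess because the original markup did not survive in this post. It trims newlines as the original code does.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of hpple): feeds a sequence of text-node
// strings into a parent dictionary, either overwriting or appending,
// mimicking the two variants of the "text" branch shown above.
static NSString *CollectTextNodes(NSArray<NSString *> *textNodes, BOOL append) {
    NSMutableDictionary *parentResult = [NSMutableDictionary dictionary];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (NSString *raw in textNodes) {
        NSString *content = [raw stringByTrimmingCharactersInSet:whitespace];
        NSString *existing = parentResult[@"nodeContent"];
        if (append && existing) {
            // Keep what earlier text nodes contributed.
            content = [existing stringByAppendingString:content];
        }
        // Without `append`, each text node replaces whatever was stored before.
        parentResult[@"nodeContent"] = content;
    }
    return parentResult[@"nodeContent"];
}
```

Under that assumption, `CollectTextNodes(@[@" (2002)", @"\n  "], NO)` ends up with an empty string because the trailing whitespace node wins, while passing `YES` keeps `(2002)`.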

I hope this really is a bug. Feedback is appreciated.

Cheers, Chris

feigi avatar Apr 19 '10 18:04 feigi

Sorry for the bad formatting. I did what I can but this interpreter sucks

feigi avatar Apr 19 '10 19:04 feigi

Hi all - will someone ever look into this?

I noticed today that the bug seems to be still present parsing certain websites - I also tried your fix, but it doesn't seem to help.

Any comments? Thx, CJ

darktab avatar Jul 21 '12 21:07 darktab

I don't know if it is the same issue, but I did get a lot of nil results when parsing a website table. I tried copying the XPath from Firebug, and it often doesn't work. When what I need is the href, the alt, or the img, ... it isn't possible with XPath (at least with this library). The biggest issue was that I got columns of different lengths when some cells were empty.

My solution was to parse with a shorter XPath query that guarantees the same column length, and to create a method that uses blocks to "manually" parse what I obtained, returning @"" instead of nil:

typedef NSString *(^element)(TFHppleElement *);

    - (NSMutableArray *)arrayOfStringWithXPath:(NSString *)xpath andBlock:(element)el {
        NSArray *nodes;
        NSMutableArray *retour;
        nodes = [self.categoryParser searchWithXPathQuery:xpath];
        retour = [NSMutableArray arrayWithCapacity:[nodes count]];
        // Apply the block to every matched node; blocks never return nil.
        for (TFHppleElement *i in nodes) {
            [retour addObject:el(i)];
        }
        return retour;
    }

Some blocks; every time I hit an exception, I added a block:

    // If the XPath yields href="...", extract the URL between the quotes
    element elementTexteParseChaine = ^(TFHppleElement *el) {
        NSString *chaine = [el content];
        if (chaine == nil) {
            chaine = @"";
        } else {
            NSScanner *scanner = [NSScanner scannerWithString:[el content]];
            [scanner setCharactersToBeSkipped:nil];
            [scanner scanUpToString:@"\"" intoString:NULL];
            if ([scanner scanString:@"\"" intoString:NULL]) {
                NSString *tempchaine = @"";
                if ([scanner scanUpToString:@"\"" intoString:&tempchaine])
                    chaine = tempchaine;
            }
        }
        return chaine;
    };

    // If the XPath yields a cell containing an A, return the URL inside its href
    element elementAHref = ^(TFHppleElement *el) {
        NSString *chaine = [[el firstChildWithTagName:@"a"] objectForKey:@"href"];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // Return the content of the XPath result as an NSString
    element elementTexte = ^(TFHppleElement *el) {
        NSString *chaine = [el content];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // If the XPath yields an IMG, return the text inside its ALT field
    element elementAlt = ^(TFHppleElement *el) {
        NSString *chaine = [el objectForKey:@"alt"];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // If the XPath yields a @name, return the content of name
    element elementName = ^(TFHppleElement *el) {
        NSString *chaine = [[el firstChild] content];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // If the XPath yields a cell containing an A with an IMG inside, return the content of alt
    element elementAImgAlt = ^(TFHppleElement *el) {
        NSString *chaine = [[[el firstChildWithTagName:@"a"] firstChildWithTagName:@"img"] objectForKey:@"alt"];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // Return the text of the alt field as an NSString when the XPath matches a cell whose first child is an img
    element elementImageAlt = ^(TFHppleElement *el) {
        NSString *chaine = [[el firstChildWithTagName:@"img"] objectForKey:@"alt"];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

    // Return the URL as an NSString when the XPath targets the @href attribute
    element elementHref = ^(TFHppleElement *el) {
        NSString *chaine = [[el firstChild] content];
        if (chaine == nil) { chaine = @""; }
        return chaine;
    };

targuy avatar Nov 28 '13 14:11 targuy