boilerpipe
Bad xml format in html output from Web API
• What steps will reproduce the problem?
Request the html or htmlFragment output for any page
• What is the expected output? What do you see instead?
The output has an XML declaration, but instead of a valid HTML/XML structure
there are extra tags before <html> that break the XML:
<?xml version="1.0" encoding="utf-8" ?>
<meta …/>
<base … />
<html>
<body>
...
</body>
</html>
Also, inside <html> the style comes directly after the opening <html> tag rather than inside a <head>.
The correct output would be:
<?xml version="1.0" encoding="utf-8" ?>
<html>
<head>
<meta …/>
<base … />
<style>...</style>
</head>
<body>
...
</body>
</html>
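Until this is fixed server-side, a client could rearrange the response into the expected structure shown above. The following is only a minimal sketch of such a workaround, not boilerpipe code: the function name `repair_head` is hypothetical, and it assumes that everything before `<html>` is the prepended meta/base tags and that everything between `<html>` and `<body>` is style markup. The attribute values in the sample are made up for illustration.

```python
import re


def repair_head(doc: str) -> str:
    """Move stray tags before <html>, and markup before <body>, into a <head>.

    Hypothetical client-side workaround; returns the input unchanged if the
    document does not look like the malformed output described in this issue.
    """
    # Split off the XML declaration, if present.
    m = re.match(r'\s*(<\?xml[^>]*\?>)?\s*', doc)
    decl = m.group(1) or ''
    rest = doc[m.end():]

    # Assumption: whatever precedes <html> is the prepended <meta>/<base>.
    html_open = re.search(r'<html\b[^>]*>', rest)
    if html_open is None:
        return doc
    stray = rest[:html_open.start()].strip()
    after_html = rest[html_open.end():]

    # Assumption: whatever sits between <html> and <body> is <style> markup.
    body_open = re.search(r'<body\b', after_html)
    if body_open is None:
        return doc
    pre_body = after_html[:body_open.start()].strip()
    if '<head' in pre_body:
        return doc  # a <head> already exists; leave the document alone

    head_parts = [part for part in (stray, pre_body) if part]
    head = '<head>\n' + '\n'.join(head_parts) + '\n</head>'
    pieces = [decl, html_open.group(0), head, after_html[body_open.start():]]
    return '\n'.join(piece for piece in pieces if piece)


# Sample input shaped like the malformed output (attribute values invented).
broken = (
    '<?xml version="1.0" encoding="utf-8" ?>\n'
    '<meta charset="utf-8"/>\n'
    '<base href="http://example.com/" />\n'
    '<html>\n'
    '<style>p { color: red; }</style>\n'
    '<body>\n<p>text</p>\n</body>\n'
    '</html>'
)
fixed = repair_head(broken)
```

A regex-based rearrangement like this is fragile by design; it only targets the specific layout reported here and leaves anything else untouched.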
• What version of the product are you using? On what operating system?
The Web API http://boilerpipe-web.appspot.com/extract
And thanks for this great *GREAT* tool!!!
--
François
Original issue reported on code.google.com by [email protected]
on 3 Dec 2011 at 4:13
Hi François,
thanks for pointing this out.
The addition of meta and base was a deliberate decision (it was just easier to
append it in front of the highlighted HTML). Nevertheless, it is worth fixing.
Cheers,
Christian
Original comment by ckkohl79
on 22 Jan 2012 at 10:57
- Changed state: Accepted
- Added labels: Type-Enhancement, Priority-Low
- Removed labels: Type-Defect, Priority-Medium