Goutte
Crawler doesn't contain html
Hey,
I fixed my previous code, but now I have a problem with the crawler: it doesn't contain any HTML.
When I dump $client->getResponse(), though, the HTML is there.
use Goutte\Client;
use Symfony\Component\DomCrawler\Field\InputFormField;

$client = new Client();
$client->setHeader('user-agent', "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36");
$crawler = $client->request('GET', 'https://wyobiz.wy.gov/Business/FilingSearch.aspx');
$form = $crawler->selectButton('Search')->form();
// Add the __ASYNCPOST field, which is not part of the form markup.
$domDocument = new \DOMDocument;
$input = $domDocument->createElement('input');
$input->setAttribute('name', '__ASYNCPOST');
$input->setAttribute('value', 'true');
$formInput = new InputFormField($input);
$form->set($formInput);
$crawler = $client->submit($form, array(
    'ctl00$MainContent$txtFilingName' => 'Google',
));
$response = $client->getResponse();
var_dump($crawler);
crawler dump:
object(Symfony\Component\DomCrawler\Crawler)#196 (7) {
["uri":protected]=>
string(48) "https://wyobiz.wy.gov/Business/FilingSearch.aspx"
["defaultNamespacePrefix":"Symfony\Component\DomCrawler\Crawler":private]=>
string(7) "default"
["namespaces":"Symfony\Component\DomCrawler\Crawler":private]=>
array(0) {
}
["baseHref":"Symfony\Component\DomCrawler\Crawler":private]=>
string(48) "https://wyobiz.wy.gov/Business/FilingSearch.aspx"
["document":"Symfony\Component\DomCrawler\Crawler":private]=>
NULL
["nodes":"Symfony\Component\DomCrawler\Crawler":private]=>
array(0) {
}
["isHtml":"Symfony\Component\DomCrawler\Crawler":private]=>
bool(true)
}
response dump:
object(Symfony\Component\BrowserKit\Response)#247 (3) {
["content":protected]=>
string(47340) "1|#||4|25433|updatePanel|MainContent_UpdatePanel1|HTML_HERE......
["status":protected]=>
int(200)
["headers":protected]=>
array(6) {
["Cache-Control"]=>
array(1) {
[0]=>
string(7) "private"
}
["Content-Type"]=>
array(1) {
[0]=>
string(25) "text/plain; charset=utf-8"
}
["X-AspNet-Version"]=>
array(1) {
[0]=>
string(9) "4.0.30319"
}
["X-Powered-By"]=>
array(1) {
[0]=>
string(7) "ASP.NET"
}
["Date"]=>
array(1) {
[0]=>
string(29) "Wed, 22 Mar 2017 10:55:17 GMT"
}
["Content-Length"]=>
array(1) {
[0]=>
string(5) "47340"
}
}
}
What's wrong?
Well, the content in the Response is not valid HTML, so the HTML parsing fails.
@stof It's returning just <section></section>, and that's what I need. Can I somehow wrap it as <!doctype html><html><head></head><body><section></section></body></html>?
Can you fix your comment to use a markdown code block around the code? I think the rendering stripped some content after "just"
(I don't understand your comment otherwise).
@stof fixed...
You said it's not valid HTML.
$response->getContent()
is returning <section>some info here</section>
To be valid it should be
<!doctype html>
<html>
<head></head>
<body>
<section>some info here</section>
</body>
</html>
Am I right?
No. In the dump above, the content is 1|#||4|25433|updatePanel|MainContent_UpdatePanel1|HTML_HERE, meaning there is extra data at the beginning of the response, which makes it invalid HTML.
The response you receive is a text/plain one, not a text/html one.
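One possible way forward is to split the body into its pipe-delimited segments and hand only the panel markup to a crawler. The sketch below assumes the body follows a repeated `length|type|id|content|` layout, which matches the dump above but is not confirmed anywhere in this thread. The function name `parseDeltaResponse` and the sample body are hypothetical. Lengths are treated as UTF-8 character counts, which is also an assumption and needs the mbstring extension.

```php
<?php
/**
 * Split a pipe-delimited partial-postback body into segments.
 * Assumed layout (not confirmed by the thread): length|type|id|content| repeated.
 */
function parseDeltaResponse(string $body): array
{
    $segments = [];
    $pos = 0;
    $total = mb_strlen($body, 'UTF-8');

    while ($pos < $total) {
        // Read the three pipe-terminated header fields: length, type, id.
        $header = [];
        for ($i = 0; $i < 3; $i++) {
            $end = mb_strpos($body, '|', $pos, 'UTF-8');
            if ($end === false) {
                throw new UnexpectedValueException('Truncated segment header');
            }
            $header[] = mb_substr($body, $pos, $end - $pos, 'UTF-8');
            $pos = $end + 1;
        }
        [$length, $type, $id] = $header;
        if (!ctype_digit($length)) {
            throw new UnexpectedValueException('Segment length is not numeric');
        }

        // The content is exactly $length characters, followed by one pipe.
        $content = mb_substr($body, $pos, (int) $length, 'UTF-8');
        if (mb_strlen($content, 'UTF-8') !== (int) $length) {
            throw new UnexpectedValueException('Truncated segment content');
        }
        $pos += (int) $length + 1;

        $segments[] = ['type' => $type, 'id' => $id, 'content' => $content];
    }

    return $segments;
}

// Hypothetical sample body shaped like the dump in this thread.
$html = '<section>some info here</section>';
$body = '1|#||4|' . strlen($html) . '|updatePanel|MainContent_UpdatePanel1|' . $html . '|';

// Keep only the updatePanel segments; their content is the HTML fragment.
$panels = array_values(array_filter(
    parseDeltaResponse($body),
    fn (array $segment): bool => $segment['type'] === 'updatePanel'
));

echo $panels[0]['content'], PHP_EOL;
```

With the real response, `$body` would come from `$client->getResponse()->getContent()`, and the extracted fragment could then be loaded with `(new Crawler())->addHtmlContent($panels[0]['content'])`, so that the crawler parses it as HTML regardless of the text/plain content type.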