ichrome
Need help with python -m ichrome.web
If I launch a browser as a service:
python -m ichrome.web
Then
import requests
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'es-ES,es;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}
params = (
    ('url', "https://oficinajudicialvirtual.pjud.cl/home/index.php"),
)
response = requests.get('http://127.0.0.1:8080/chrome/preview', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'html.parser')
recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]
With this, I get the reCAPTCHA URL.
But if I do it like this:
from bs4 import BeautifulSoup
from torequests import tPool
from inspect import getsource
req = tPool()
async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(20)
    return await tab.html
json = {
    'tab_callback': getsource(tab_callback),
    "timeout": 20,
    "incognito_args": {
        "url": "https://oficinajudicialvirtual.pjud.cl/home/index.php",
        "proxyServer": "37.19.220.129:8443"
    }
}
response = req.post('http://127.0.0.1:8080/chrome/do', json=json)
soup = BeautifulSoup(response.text, 'html.parser')
recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]
I'm not getting the fully loaded soup. I guess it could be some security measure of the origin website I'm scraping. Any help?
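When the iframe is missing, `soup.select(...)[0]` raises an IndexError, which hides whether the page simply did not contain the reCAPTCHA yet. A minimal sketch of a guarded lookup follows. It swaps BeautifulSoup for the standard library's `html.parser` so it has no third-party dependency, and the names `IframeSrcFinder` and `find_recaptcha_src` are hypothetical helpers, not part of ichrome.

```python
from html.parser import HTMLParser


class IframeSrcFinder(HTMLParser):
    """Collect the src of every <iframe> whose title attribute matches."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attributes = dict(attrs)
        if attributes.get("title") == self.title and attributes.get("src"):
            self.sources.append(attributes["src"])


def find_recaptcha_src(html, title="reCAPTCHA"):
    """Return the first matching iframe src, or None if no such iframe exists."""
    finder = IframeSrcFinder(title)
    finder.feed(html)
    return finder.sources[0] if finder.sources else None
```

A `None` result then means the returned HTML had no reCAPTCHA iframe, which is easier to act on than an exception.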
- Try "proxyServer": "http://37.19.220.129:8443".
- Use
await tab.screenshot(save_path='image_path')
and look at the image to see what happened.
- Use
python -m ichrome.web --disable-headless
and watch what happens while you send the request.
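The first suggestion above adds an explicit scheme to the proxy address. A small sketch of doing that in one place before building the payload is shown here. `with_scheme` is a hypothetical helper, and whether ichrome actually requires the scheme is not confirmed in this thread; it is only something to try.

```python
def with_scheme(proxy, default_scheme="http"):
    """Prefix the proxy with default_scheme:// unless it already names a scheme."""
    proxy = proxy.strip()
    if "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"
```

The result can be passed as the "proxyServer" value in "incognito_args".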
Thanks @ClericPy. It opens the browser on the page, and once the browser stops loading, the reCAPTCHA loads. But it looks like the response returned to me does not contain the reCAPTCHA URL. Maybe it is an async/await issue.
I tried this:
python -m ichrome.web --disable-headless
from bs4 import BeautifulSoup
from torequests import tPool
from inspect import getsource
req = tPool()
async def tab_callback(task, tab, data, timeout):
    await tab.wait_loading(5000)
    await tab.screenshot(save_path='./screenshot.png')
    return await tab.html
json = {
    'tab_callback': getsource(tab_callback),
    "timeout": 5000,
    "incognito_args": {
        "url": "https://oficinajudicialvirtual.pjud.cl/home/index.php",
        "proxyServer": "http://37.19.220.129:8443"
    }
}
response = req.post('http://127.0.0.1:8080/chrome/do', json=json)
soup = BeautifulSoup(response.text, 'html.parser')
recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]
What did you see in screenshot.png?
I can't get the HTML to reappear.
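Use python -m ichrome.web --disable-headless with this callback:
async def tab_callback(task, tab, data, timeout):
    import asyncio  # local import: only this function's source is sent to the server
    await asyncio.sleep(10000)  # keep the tab open for inspection
    return await tab.html
to check the HTML in real Chrome.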
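Since the reCAPTCHA reportedly appears only after the browser stops loading, one option is to poll the HTML until the iframe shows up rather than returning right after `wait_loading`. This is a sketch under stated assumptions: it uses only `tab.wait_loading` and `tab.html` as they appear in this thread, the marker string and the 15-second limit are guesses, and the imports sit inside the function because `getsource` sends only the function's own source.

```python
async def tab_callback(task, tab, data, timeout):
    # Local imports: only this function's source reaches the server.
    import asyncio
    import time

    marker = 'title="reCAPTCHA"'  # assumed substring of the serialized iframe tag
    await tab.wait_loading(20)
    deadline = time.monotonic() + 15  # assumed polling budget in seconds
    html = await tab.html
    # Re-read the HTML until the marker appears or the budget runs out.
    while marker not in html and time.monotonic() < deadline:
        await asyncio.sleep(0.5)
        html = await tab.html
    return html
```

The callback returns the last HTML it saw either way, so the caller still has to check whether the iframe is present.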
@ClericPy, could you one day implement an API request like this, where the proxy is passed as a parameter in the payload of the API call?
It would be better this way because it removes the need for async/await.
import requests
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'es-ES,es;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}
params = (
    ('url', "https://oficinajudicialvirtual.pjud.cl/home/index.php"),
)
data = {
    "proxyServer": "http://37.19.220.129:8443"
}
response = requests.get('http://127.0.0.1:8080/chrome/preview', headers=headers, params=params, data=data)
soup = BeautifulSoup(response.text, 'html.parser')
recaptcha_url = soup.select('iframe[title="reCAPTCHA"]')[0]["src"]
The headers are not used by ichrome yet. I need to think about the API for some time.