fast-instagram-scraper
Problems with Torpy?
Any Ideas on this behaviour? There seem to be problems with torpy, have you experienced this before and any idea how to solve it?
[...]
Initiating tor session 233
Circuit built.
Start iteration 0: 2022-10-06 14:40:11.079573
Tor end node blocked. Last response: <Response [404]>
0it [01:16, ?it/s]
Initiating tor session 234
Circuit built.
Start iteration 0: 2022-10-06 14:41:28.718957
Tor end node blocked. Last response: <Response [404]>
0it [00:07, ?it/s]
Initiating tor session 235
Circuit built.
Start iteration 0: 2022-10-06 14:41:37.347591
ERROR:torpy.cell_socket:_ssl.c:1112: The handshake operation timed out
ERROR:root:[ignored]
Traceback (most recent call last):
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 63, in connect
self._socket.connect((self._router.ip, self._router.or_port))
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1343, in connect
self._real_connect(addr, False)
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1334, in _real_connect
self.do_handshake()
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1310, in do_handshake
self._sslobj.do_handshake()
socket.timeout: _ssl.c:1112: The handshake operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\utils.py", line 79, in newfn
return func(*args, **kwargs)
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 183, in newfn
return func(*args, **kwargs)
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 426, in get_descriptor
with self._get_dir_client() as dir_client:
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 375, in _get_dir_client
self._dir_guard, self._dir_circuit = self._create_dir_circuit(purpose='Internal dir client')
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 365, in _create_dir_circuit
guard = TorGuard(router, purpose=purpose)
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\guard.py", line 66, in __init__
self.__tor_socket.connect()
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 69, in connect
raise TorSocketConnectError(e)
torpy.cell_socket.TorSocketConnectError: _ssl.c:1112: The handshake operation timed out
WARNING:torpy.utils:Retry with another router...
0it [00:31, ?it/s]
'graphql'
Initiating tor session 236
Circuit built.
Start iteration 0: 2022-10-06 14:42:09.078514
Tor end node blocked. Last response: <Response [404]>
0it [00:06, ?it/s]
Initiating tor session 237
Circuit built.
Start iteration 0: 2022-10-06 14:42:16.572684
WARNING:torpy.circuit:#80000242 circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Tor end node blocked. Last response: <Response [404]>
0it [00:52, ?it/s]
Initiating tor session 238
That's the expected behavior when mining too fast. The message "Tor end node blocked. Last response: <Response [404]>" indicates that the respective node got blocked, which is likely to happen after a while. Make sure to work with a higher --wait_between_requests.
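To illustrate what waiting between requests means in practice, here is a minimal sketch. It is not the scraper's actual implementation: the helper name throttled, the jitter parameter and the injectable sleep function are assumptions made up for this example.

```python
import random
import time


def throttled(items, wait_between_requests, jitter=0.5, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items.

    Hypothetical helper: `jitter` adds a random extra delay (in seconds)
    so that requests do not arrive at perfectly regular intervals.
    """
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the very first item, only between items.
            sleep(wait_between_requests + random.uniform(0, jitter))
        yield item


if __name__ == "__main__":
    for page in throttled(["page 1", "page 2", "page 3"], wait_between_requests=1):
        print("would request", page)
```

A higher wait value lowers the request rate per Tor exit node, which should delay the point at which the node gets blocked, but it cannot prevent blocking altogether.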
Thanks for your quick reply. I understand that and tried different numbers. But if a circuit is built, there seems to be a problem with torpy? Or would you suggest also increasing Tor-Timeouts?
Initiating tor session 4
0it [00:00, ?it/s]Circuit built.
Start iteration 0: 2022-10-07 11:06:04.996308
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
WARNING:torpy.circuit:#8000000b circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Exception in thread RecvLoop_103.251:
Traceback (most recent call last):
File "C:\Users\...\anaconda3\envs\scrape\lib\threading.py", line 980, in _bootstrap_inner
self.run()
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 233, in run
callback(key.fileobj, mask)
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 220, in _do_recv
for cell in self._tor_socket.recv_cell_async():
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 104, in recv_cell_async
more_data = self._socket.recv(TorCellSocket.RECV_BUFF_SIZE)
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1227, in recv
return self.read(buflen)
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1102, in read
return self._sslobj.read(len)
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
Torsession terminated after 600 seconds tor_timeout.
I have seen this error before and until now only on Windows. This is indeed rather a problem related to torpy/SSL than fast-instagram-scraper.
If you have already made sure that the latest torpy version is installed and that you are using a virtual env, I would recommend switching to Ubuntu or, if you're on Windows, using WSL, as the SSL error might be cumbersome to fix. There might be some conflicting SSL libraries or other hard-to-identify problems.
Let us know if it worked for you!
Just checked again. Instagram changed its API recently, so the logic needs slight refactoring first! Hence, it cannot work at the moment.
Thanks for checking it out - I didn't get to run it on WSL either, so I guess the API is the problem...
Any updates on this @do-me? Great work btw!
Thanks for asking @fmac2000 (also @thohug), there are indeed.
tl;dr: Mining is getting harder, TOR end points and even residential IPs get blocked fast (without login), and pagination now needs POST requests instead of GET requests.
Let me try to sum up the current status of the active Instagram APIs. Basically there are two APIs running at the moment: one is the legacy API that I originally designed fast-instagram-scraper for, and then there is the new one.
Legacy API
Example: https://instagram.com/graphql/query/?query_hash=ac38b90f0f3981c42092016a37c59bf7&variables={"id":"1020237355","first":50,"after":"2301822988561378864"}
On every page you receive a cursor for pagination. In this example it's 2301822988561378864, which I retrieved from the previous page and inserted into the following GET request. That's what the very first version of fast-instagram-scraper did.
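The cursor-based GET pagination can be sketched roughly as follows. The endpoint, query hash and variable names are taken from the example URL above. The function name build_legacy_url is made up for this illustration, and the sketch only builds the URL without sending anything.

```python
import json
from urllib.parse import urlencode

LEGACY_ENDPOINT = "https://instagram.com/graphql/query/"
# Query hash copied from the example URL in this thread.
QUERY_HASH = "ac38b90f0f3981c42092016a37c59bf7"


def build_legacy_url(location_id, after=None, first=50):
    """Build the GET URL for one page of a location (hypothetical helper).

    `after` is the cursor received with the previous page;
    leave it as None to request the first page.
    """
    variables = {"id": str(location_id), "first": first}
    if after is not None:
        variables["after"] = str(after)
    query = urlencode({
        "query_hash": QUERY_HASH,
        # Compact JSON, matching the shape of the example URL.
        "variables": json.dumps(variables, separators=(",", ":")),
    })
    return f"{LEGACY_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(build_legacy_url("1020237355", after="2301822988561378864"))
```

Unlike the example URL, the variables JSON comes out percent-encoded here, which is the same request in a stricter form. Where the next cursor sits in the response JSON is not covered, since that structure is not shown in this thread.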
The legacy API is completely unchanged. You can still query stuff if you're lucky but TOR end nodes are 99% blocked. Even residential IPs get blocked after only a few requests. So the only option here is to use commercial rotating residential IPs. If you google it you will find tons of more or less shady/working/not working services offering such. If anyone needs a good recommendation write an email as I eventually managed to find a good one.
New API
Example: https://instagram.com/explore/locations/1020237355/?__a=1&__d=dis&max_id=<cursor>
The good thing is that the new API offers plenty of new interesting nodes in the response JSON; great for research. Also (strangely) it does not block TOR end nodes. But here comes the catch: You can fire a GET request to get the first page but if you want to paginate you cannot do it with a GET request as you must include the respective headers with a bunch of tokens (e.g. XCSRF etc.). You get these tokens only by accessing the page in a browser that can execute JS to generate them (as far as I understood).
So theoretically, if you copy the tokens and wrap them in a POST request in Python, you're good to go. However, I am not sure at what point they eventually get blocked, but it is probably fast.
You could also go with a commercial service, as some offer to execute those requests in a real browser (and hence obtain the tokens needed for the POST headers) and afterwards do normal requests (which cost way less).
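A very rough sketch of the "copy tokens, then POST" idea, using only the standard library and sending nothing. Only the first-page URL pattern comes from the example above. Everything about the POST request is an assumption: the target URL, the header name X-CSRFToken, and passing the cursor as a max_id form field are placeholders that have to be replaced with whatever your browser's network tab actually shows.

```python
from urllib.parse import urlencode
from urllib.request import Request


def first_page_url(location_id):
    """First-page GET URL of the new API, following the example above."""
    return f"https://instagram.com/explore/locations/{location_id}/?__a=1&__d=dis"


def build_pagination_request(url, cursor, browser_headers):
    """Prepare (but do not send) a POST request for a follow-up page.

    Hypothetical helper. ASSUMPTIONS: the cursor is sent as a form field
    called `max_id`, and `browser_headers` holds the token headers copied
    from a real browser session.
    """
    body = urlencode({"max_id": cursor}).encode("ascii")
    return Request(url, data=body, headers=dict(browser_headers), method="POST")


if __name__ == "__main__":
    # Placeholder header name and value, copy the real ones from your browser.
    headers = {"X-CSRFToken": "<token copied from browser>"}
    req = build_pagination_request(first_page_url("1020237355"), "<cursor>", headers)
    print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually send it. Keeping the building and the sending apart makes it easy to inspect what would go out before risking a block.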
Advice
Depending on your needs there are different ways to go:
- Quick and simple but costly: commercial rotating residential IPs + legacy API's GET request pagination
- Free but only first page per location: fast-instagram-scraper + new API (good for "broad" mining)
- Cumbersome and free: copy tokens from your browser + POST requests to new API in Python (a modified version of fast-instagram-scraper would do)
- Optimized commercial version: 1st request with JS execution, following without until the tokens expire.
Future of fast-instagram-scraper
Doesn't look too bright. Still, in the coming days I will update the script to work at least for the first page of every location with the new API. If someone has already done so, PRs are welcome.
Hope that clarifies the current situation. Let me know if you find out anything else!
I'm reopening the issue for everyone to see.
Update 11/2023: as torpy is currently unmaintained and needs refactoring due to TOR changes from V2 to V3, fast-instagram-scraper won't work.