
Network scan of some subnets time out on Ubuntu 22.04.2

Open martyfay1 opened this issue 9 months ago • 22 comments

Describe the bug When making a network from a subnet and then calling network.scan() on certain subnets, the scan times out and fails because of a single IP present in the subnet. When that IP is removed from the subnet (by using MinerNetwork.from_list() and supplying all of the IPs except that specific one), it works fine. The strange thing is that on macOS 13.6.6 it works fine even with that IP included.

I've traced the bug to the _get_miner_web() call, and more specifically the _web_ping() call with the URL f"http://{ip}/". Examining the logs from the httpx.AsyncClient.get() call shows the following:

[DEBUG][04/26/24 14:06:01](httpcore.connection) - connect_tcp.started host=<IP> port=80 local_address=None timeout=5.0 socket_options=[(1, 13, b'\x01\x00\x00\x00\xe8\x03\x00\x00')]
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.connection) - connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x104cf6480>
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.http11) - send_request_headers.started request=<Request [b'GET']>
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.http11) - send_request_headers.complete
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.http11) - send_request_body.started request=<Request [b'GET']>
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.http11) - send_request_body.complete
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:01](httpcore.http11) - receive_response_headers.started request=<Request [b'GET']>
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:06](httpcore.http11) - receive_response_headers.failed exception=ReadTimeout(TimeoutError())
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:06](httpcore.http11) - response_closed.started
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/httpcore/_trace.py:85 in atrace
[DEBUG][04/26/24 14:06:06](httpcore.http11) - response_closed.complete

It hangs on the response_closed.started line for around 100 seconds before failing. On macOS, the same GET request closes without timing out the scan.

The only difference I see is that on macOS, the same logs show socket_options=[(65535, 128, b'\x01\x00\x00\x00\xe8\x03\x00\x00')]
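For reference, decoding that option value as a C struct linger (my guess at the layout: SOL_SOCKET is 1 and SO_LINGER is 13 on Linux, while macOS numbers the same pair 65535 and 128, so both logs would be setting the same SO_LINGER option, just with platform-specific constants):

```python
import struct

# The option value seen in both the Linux and macOS logs. If this is a
# struct linger {int l_onoff; int l_linger;} (my assumption), then the
# option is enabled (1) with a linger value of 1000.
raw = b"\x01\x00\x00\x00\xe8\x03\x00\x00"

onoff, linger = struct.unpack("<ii", raw)  # little-endian, two 32-bit ints
print(onoff, linger)  # -> 1 1000
```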

Expected behavior The HTTP GET request should fail without disrupting the scan.

Desktop (please complete the following information):

  • OS: macOS 13.6.6 (working)
  • OS: Ubuntu 22.04.2 LTS (not working)

martyfay1 avatar Apr 26 '24 18:04 martyfay1

This is a very detailed report, thank you. This will likely take some digging. @jameshilliard, any initial thoughts?

b-rowan avatar Apr 26 '24 18:04 b-rowan

@martyfay1 would you mind sharing what that IP is?

b-rowan avatar Apr 26 '24 18:04 b-rowan

I've made some changes on the dev_scan_fail branch, do you want to see if that fixes anything?

UpstreamData avatar Apr 26 '24 19:04 UpstreamData

I've tried with that branch and it still fails. I'll get back to you on what that IP is.

martyfay1 avatar Apr 26 '24 19:04 martyfay1

This has happened on 2 subnets. In both cases, the IPs correspond to a miner in a bad state; nothing on it is responsive. The miners are S19J Pros running Bitmain firmware dated Mon Dec 26 17:19:30 CST 2022.

martyfay1 avatar Apr 26 '24 19:04 martyfay1

Ok interesting. That makes a lot more sense, I had thought maybe somehow the IP address itself was killing the request, but if it's an unresponsive miner that's a lot different.

It's likely just not handling a read timeout properly.

b-rowan avatar Apr 26 '24 19:04 b-rowan

Do you have any suggestions for how I can keep this from delaying the whole scan when run on Linux? I don't want one potentially bad miner to harm scan performance.

martyfay1 avatar Apr 29 '24 13:04 martyfay1

I don't have any ideas yet; we're testing some things. I'm sure it can be fixed here in the library, but I first have to figure out what is actually freezing for that long and why.

b-rowan avatar Apr 29 '24 14:04 b-rowan

Seems like it's hanging somewhere in this function in httpcore:

    async def _response_closed(self) -> None:
        async with self._state_lock:
            if (
                self._h11_state.our_state is h11.DONE
                and self._h11_state.their_state is h11.DONE
            ):
                self._state = HTTPConnectionState.IDLE
                self._h11_state.start_next_cycle()
                if self._keepalive_expiry is not None:
                    now = time.monotonic()
                    self._expire_at = now + self._keepalive_expiry
            else:
                await self.aclose()

Any chance you can run a Python debugger and see where it gets stuck?

Github Permalink - https://github.com/encode/httpcore/blob/39b756f07d40fab4eccb7c0dd3d7d369f1d2b4e8/httpcore/_async/http11.py#L245-L257

b-rowan avatar Apr 29 '24 14:04 b-rowan

I agree that it is getting stuck in that function. I am having a hard time figuring out how to debug it since it is running on a server. I ran it with python -m trace --trace scan.py and I think it is getting stuck on the last line there, the await self.aclose(). Sorry if that's not much help. If you have any links on debugging Python from the command line on a server, let me know.

martyfay1 avatar Apr 29 '24 16:04 martyfay1

Maybe something like this?

https://stackoverflow.com/questions/543196/how-do-i-attach-a-remote-debugger-to-a-python-process
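Or, since it's running on a headless server, the stdlib faulthandler module might be the easiest option: it can dump every thread's traceback while the program is hung, without attaching a debugger. Something like this at the top of the script (the 30 second deadline is just an example value):

```python
import faulthandler
import signal

# Dump all thread tracebacks to stderr when the process receives SIGUSR1;
# while the scan is hung, run `kill -USR1 <pid>` from another shell.
faulthandler.register(signal.SIGUSR1)

# Or dump automatically if the program is still running after 30 seconds,
# which should catch a ~100 second hang without any manual signal.
faulthandler.dump_traceback_later(30, exit=False)
```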

b-rowan avatar Apr 29 '24 16:04 b-rowan

Hey, so I spent a little more time working with pdb. It hangs after the await sleep(0) call at site-packages/anyio/_backends/_asyncio.py(1192), because Executing <Handle _SelectorSocketTransport._call_connection_lost(None) created at /usr/lib/python3.12/asyncio/selector_events.py:875> took 94.301 seconds. That handle runs _call_connection_lost(), and I am suspicious of its call to self._sock.close(), since that's where I've seen the code stopped before when I cancelled it while it was hanging.

martyfay1 avatar Apr 29 '24 19:04 martyfay1

Can you try this at the start of the program? Curious if killing the connections with SO_LINGER will help.

pyasic.settings.update("socket_linger_time", 0)
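For context, that setting should (I believe) boil down to the classic SO_LINGER trick: with linger enabled and a zero timeout, closing the socket sends an immediate RST instead of waiting for a graceful FIN handshake, which can free up a wedged connection right away. In raw socket terms:

```python
import socket
import struct

# SO_LINGER takes a struct linger {int l_onoff; int l_linger;}.
# (1, 0) means linger enabled with a zero timeout, so close() sends an
# immediate RST rather than lingering on unacknowledged data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
```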

b-rowan avatar Apr 29 '24 21:04 b-rowan

I set the linger to 0; same problem still.

martyfay1 avatar Apr 30 '24 11:04 martyfay1

I'm stumped on this one. You might just want to play around with the _web_ping method and see if you can get it to stop happening.

b-rowan avatar Apr 30 '24 15:04 b-rowan

Added an asyncio.wait_for() in the web ping method. Can you see if the latest version of the dev_scan_fail branch fixes the issue?
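Roughly this shape (web_ping here is a simplified stand-in for the real _web_ping, with a sleep in place of the actual HTTP request):

```python
import asyncio

# Stand-in for pyasic's _web_ping: simulate a request that hangs the way
# the unresponsive miner does (~100 s in the real logs).
async def web_ping(ip: str) -> bool:
    await asyncio.sleep(120)
    return True

# Cap the total time spent on any single host, so one bad miner cannot
# stall the whole subnet scan -- as long as the hang happens at an await
# point the event loop can actually cancel.
async def guarded_ping(ip: str, timeout: float = 10.0) -> bool:
    try:
        return await asyncio.wait_for(web_ping(ip), timeout=timeout)
    except asyncio.TimeoutError:
        return False

print(asyncio.run(guarded_ping("10.0.0.5", timeout=0.1)))  # -> False
```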

b-rowan avatar May 13 '24 14:05 b-rowan

The scan still takes ~100 seconds longer than it should. I've tried many things, like setting timeouts on the GET request, and what seems to happen is that the request knows it needs to time out after a certain time, say 5 seconds, and starts closing then, but the actual closing of the socket takes ~100 seconds. When making the request from the command line, i.e. curl -m 10 -v <ip>, the timeout works great and outputs:

curl -m 10 -v <ip>
*   Trying <ip>...
* Connected to <ip> (<ip>) port 80 (#0)
> GET / HTTP/1.1
> Host: <ip>
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Operation timed out after 10001 milliseconds with 0 bytes received
* Closing connection 0
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received

If I don't include the -m 10 argument, it also takes ~100 seconds to time out before the connection is reset by the peer. Not sure if this helps at all, but I couldn't figure it out from here.

curl -m 10 -v <ip>
*   Trying <ip>...
* Connected to <ip> (<ip>) port 80 (#0)
> GET / HTTP/1.1
> Host: <ip>
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

martyfay1 avatar May 13 '24 16:05 martyfay1

The last thing I can think of here, before I take this to httpx, is that it's possibly an issue with an older version of Python. Have you tried the mac with the same Python version you have on the Ubuntu device?

b-rowan avatar May 15 '24 22:05 b-rowan

I just double-checked: version 3.12.3 on the mac. I also upgraded from 3.12.2 to 3.12.3 on Ubuntu and am still experiencing the same issue.

martyfay1 avatar May 16 '24 17:05 martyfay1

Started a discussion in httpx, we shall see.

https://github.com/encode/httpx/discussions/3202

b-rowan avatar May 16 '24 18:05 b-rowan

I had a similar problem. I solved it by editing the config at https://github.com/encode/httpx/blob/master/httpx/_config.py#L370: DEFAULT_TIMEOUT_CONFIG, DEFAULT_LIMITS, DEFAULT_MAX_REDIRECTS.

ph4n70m1984 avatar May 19 '24 19:05 ph4n70m1984

The bug is observed before version 0.40.5.

I am using the nmap scanner, bypassing the built-in scanner.

ph4n70m1984 avatar May 24 '24 19:05 ph4n70m1984