
https connections aren't closed

Open kpcyrd opened this issue 5 years ago • 14 comments

I'm trying to upload a large(-ish) dataset and exhausted my file descriptors. After some quick debugging I noticed that the https connections don't seem to be closed correctly:

[...]
ia      20254 user  469u  IPv4 99476489      0t0  TCP X.X.X.X:XXX->207.241.239.10:443 (CLOSE_WAIT)
ia      20254 user  470u  IPv4 99476491      0t0  TCP X.X.X.X:XXX->207.241.239.10:443 (CLOSE_WAIT)
ia      20254 user  471u  IPv4 99476522      0t0  TCP X.X.X.X:XXX->207.241.239.10:443 (CLOSE_WAIT)
ia      20254 user  472u  IPv4 99476528      0t0  TCP X.X.X.X:XXX->207.241.239.10:443 (CLOSE_WAIT)
ia      20254 user  473u  IPv4 99476553      0t0  TCP X.X.X.X:XXX->207.241.239.10:443 (ESTABLISHED)

I'm using 1.8.1-1+deb10u1 from Debian stable (which has the close patch for files backported). I don't have the time right now to test whether the bug is still present in the current git master, but after a quick code search for close I'm assuming it is. I'm reporting this issue downstream next; this is just a heads up in case it's still present. If you remember which commit this was fixed in, I'd appreciate the commit id so I can backport it.

kpcyrd avatar Apr 14 '20 13:04 kpcyrd

first reported in https://bugs.debian.org/956707

anarcat avatar Apr 14 '20 14:04 anarcat

You don't say whether the XXX source ports are the same or different ones.

Is the ESTABLISHED connection still transferring data, and is your concern only about the CLOSE_WAIT ones?

saper avatar Apr 16 '20 14:04 saper

@kpcyrd can you try the newest version? The fix from #277 should be included there...

saper avatar Apr 16 '20 14:04 saper

The source port is different for every connection. The ESTABLISHED connection is the currently active one, but it seems a new connection is opened for every file transfer, and its file descriptor leaks after the file is uploaded:

CLOSE_WAIT - Indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. This means the socket is waiting for the application to execute close(). A socket can stay in CLOSE_WAIT state indefinitely until the application closes it. A faulty scenario would be a file descriptor leak: the server not executing close() on sockets, leading to a pile-up of CLOSE_WAIT sockets.

The more files are uploaded, the longer the list of CLOSE_WAIT connections gets.
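A minimal sketch for watching this from inside the uploading process, assuming the third-party psutil package is installed (the helper below is not part of internetarchive):

import psutil

def count_close_wait():
    # Count this process's TCP sockets stuck in CLOSE_WAIT.
    proc = psutil.Process()
    return sum(1 for c in proc.connections(kind='tcp')
               if c.status == psutil.CONN_CLOSE_WAIT)

# Call this between uploads; a steadily growing number means
# file descriptors are leaking.
print(count_close_wait())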

The version I'm running already contains the commit in #277, as it has been backported. It doesn't solve anything, since it only closes files, not TCP connections.

kpcyrd avatar Apr 16 '20 20:04 kpcyrd

I agree that #277 wouldn't fix network problems, because it only affects files.

I'm also baffled as to why this bug would occur in the first place: from what I can tell reading the source code, api.py just delegates HTTP requests to the requests module and uses requests.sessions.Session, which should reuse existing HTTP connections. Maybe we're misconfiguring it somewhere?
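For reference, a minimal sketch of the reuse behaviour a shared requests.Session is supposed to provide, using plain requests rather than internetarchive (the URL is only an example):

import requests

# One-off module-level calls build a throwaway Session each time,
# so nothing is reused between calls.
requests.get('https://archive.org/metadata/nasa')

# A shared Session keeps a connection pool and reuses sockets;
# leaving the with-block closes the pool explicitly.
with requests.Session() as s:
    for _ in range(3):
        s.get('https://archive.org/metadata/nasa')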

anarcat avatar Apr 20 '20 18:04 anarcat

@anarcat connection reuse is disabled (in a hacky way) here: https://github.com/jjjake/internetarchive/blob/96e02c62d5aa386d77b8ea88f5c185f7f6304fae/internetarchive/session.py#L124

I'm not sure if that's sufficient to cause an fd leak, though.

kpcyrd avatar Apr 20 '20 23:04 kpcyrd

That certainly looks bad. 8fa4cd6db9301b552dbe8d98875d1b9437fae0d7 could have used a little more reasoning; that definitely defies logic in my book...

anarcat avatar Apr 20 '20 23:04 anarcat

I ran into a similar issue recently using search_items imported from internetarchive.api (i.e. from internetarchive import search_items). Here's a simplified example that can be used to replicate the issue:

In [3]: from internetarchive import search_items, get_session

In [2]: try:
   ...:     x = [search_items('identifier:{}'.format(i)).num_found for i in range(0,200)]
   ...:     print(x[0])
   ...: except Exception as e:
   ...:     print(e)
   ...:
(MaxRetryError("HTTPSConnectionPool(host='archive.org', port=443): Max retries exceeded with url: /services/search/v1/scrape?q=identifier%3A119&REQUIRE_AUTH=true&count=10000&total_only=true (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10e921160>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))"), 'https://archive.org/services/search/v1/scrape?q=identifier%3A119&REQUIRE_AUTH=true&count=10000&total_only=true')

I was able to get around this issue by reusing the same ArchiveSession object:

In [1]: from internetarchive import search_items, get_session

In [2]: s = get_session()

In [3]: try:
   ...:     x = [search_items('identifier:{}'.format(i), archive_session=s).num_found for i in range(0,200)]
   ...:     print(x[0])
   ...: except Exception as e:
   ...:     print(e)
   ...:
0

Sorry for the late reply, @kpcyrd, but if you're able to, I'd recommend checking whether doing something similar fixes this issue for you. For example:

from internetarchive import upload, get_session
s = get_session()
r = upload(..., archive_session=s)

I think the fix here is to use a single/global session object in internetarchive.api (as opposed to creating one inside each function), unless one is provided in a function call. Does that sound right? Any thoughts on this?

This would require changing the parameters for all functions in internetarchive.api. If we had a single global session object initialized at import time, you wouldn't be able to configure it with any ArchiveSession parameters when a function from internetarchive.api is called; to use any of the ArchiveSession parameters, you'd need to provide your own ArchiveSession object via the archive_session parameter when calling the function.
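A rough sketch of what that default-session pattern could look like, assuming get_session() and ArchiveSession.search_items() keep their current behaviour; _default_session and _get_default_session are hypothetical names, not existing internetarchive code:

from internetarchive import get_session

_default_session = None

def _get_default_session():
    # Create the shared ArchiveSession lazily, on first use, so that
    # importing the module stays cheap and still configurable.
    global _default_session
    if _default_session is None:
        _default_session = get_session()
    return _default_session

def search_items(query, archive_session=None, **kwargs):
    # Fall back to the shared session unless the caller supplies one.
    session = archive_session or _get_default_session()
    return session.search_items(query, **kwargs)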

This has the potential to break a lot of existing code, so I'd make this a v2.0.0 change.

Please let me know if anyone has any thoughts, questions, or concerns.

jjjake avatar May 21 '20 20:05 jjjake

Could this be solved with a context manager, like you did using with here? https://github.com/jjjake/internetarchive/pull/277/commits/2b90de3b8293d3b4d69eebc0cc7bd37424de179e

Hold my beer... maybe we could ensure connections are closed in a __del__ function as they go out of scope, or even write a decorator for this purpose.
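One possible shape for the context-manager idea, sketched under the assumption that ArchiveSession inherits close() from requests.Session (not tested against the codebase; the identifier and file name in the usage note are placeholders):

from contextlib import contextmanager
from internetarchive import get_session

@contextmanager
def archive_session(**get_session_kwargs):
    # Hand out a session and make sure its connection pool is
    # closed when the block exits, even if an upload fails.
    s = get_session(**get_session_kwargs)
    try:
        yield s
    finally:
        s.close()

# Usage sketch:
# with archive_session() as s:
#     s.get_item('some-identifier').upload('example.txt')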

siznax avatar May 22 '20 00:05 siznax

@siznax It didn't seem to help in my testing. I think using a single session object is the way to go (and how requests.Session objects are intended to be used, I believe).

Let me know if you experience otherwise! Thanks for the thought/feedback!

jjjake avatar May 22 '20 17:05 jjjake

@siznax It didn't seem to help in my testing. I think using a single session object is the way to go (and how requests.Session objects are intended to be used, I believe).

Can we fix the ia executable to do that? :)

kpcyrd avatar May 22 '20 19:05 kpcyrd

@kpcyrd Interesting, for some reason I thought you were using internetarchive.api.upload (e.g. from internetarchive import upload). ia upload already uses only a single session object, so it looks like there are at least a couple of issues here.

Sorry for the trouble, hopefully we can get this resolved soon.

jjjake avatar May 22 '20 19:05 jjjake

@jjjake Does it really use a single session object though? _upload_files is passed the session object, but it is never used as far as I can see.

EDIT: Sorry, I think you are right. I missed that the item already has the session when it is retrieved. The archive_session argument of _upload_files is confusing though.

Dobatymo avatar May 26 '20 02:05 Dobatymo

I can't recall at the moment why archive_session exists as a parameter. The ia_upload.py script is a little confusing, and could probably be cleaned up...

At any rate, it should be using a single session object. The session object is used to initialize the item (i.e. session.get_item(args['<identifier>'])), and whenever the Item.upload() method is used, it uses the session object attached to the item. There could be a bug there, but that's how it should be working.
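In other words, the intended flow is roughly the following (a simplified sketch of what ia_upload.py does; the identifier and file name are placeholders):

from internetarchive import get_session

s = get_session()
item = s.get_item('my-test-item')   # the item keeps a reference to the session
r = item.upload('example.txt')      # upload() reuses the session attached to the item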

jjjake avatar May 27 '20 18:05 jjjake