PyDrive
Downloaded files left with a CLOSE_WAIT status, causing [Errno 24] Too many open files
First of all, I'd like to thank you for a great and fun tool! Secondly, I'd like to say that this stuff about socket status and file descriptors is definitely out of my league. However, after plenty of googling I think there may be an issue with pydrive.
I have a folder with lots of files (~2000) that I'm downloading with a simple script. The code is basically
file_list = drive.ListFile({'q': "'%s' in parents and trashed=false" % parent}).GetList()
for f in file_list:
    f.GetContentFile(f['title'])
The files download perfectly fine, but the sockets do not close completely. The status from lsof looks like
python 1344 user 1022u IPv4 103326 0t0 TCP local_ip:53976->ams....net:https (ESTABLISHED)
The status eventually changes from ESTABLISHED to CLOSE_WAIT (after ~1024 downloads, the first 800 have changed to CLOSE_WAIT). Here's the problem: since each socket occupies a file descriptor, and since there seems to be a limit of 1024 open descriptors (regardless of e.g. ulimit settings), I cannot download more files and get the error
[Errno 24] Too many open files
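To watch the descriptors pile up from inside the script rather than through lsof, something like the following might help. It is only a rough sketch: the helper names fd_limits and count_open_fds are made up for illustration, and it assumes a Linux or macOS system where /proc/self/fd or /dev/fd exists. The count includes the descriptor briefly used to read the directory, so treat it as approximate.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds():
    """Approximate number of descriptors open in this process (Linux/macOS)."""
    # Hypothetical helper: relies on the per-process descriptor directory.
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no per-process descriptor directory found")
```

Calling count_open_fds() every hundred downloads or so and comparing it with the soft limit from fd_limits() should show whether the number keeps growing toward the limit.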
My perhaps naive question is: shouldn't pydrive close the connection and free that connection?
Alternatively: Can I manually get pydrive to close the socket?
Thanks!
Here's an MWE
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import datetime
import time
import os
import inspect
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
upload_dir = "errno24-testing"
download_dir = "/tmp/" + upload_dir
num_files = 1200
def generate_files(upload_dir):
    print inspect.stack()[0][3]
    os.makedirs(upload_dir)
    content = "hello %d\n"
    for i in range(num_files):
        # the with block closes the file, so no explicit close() is needed
        with open("./" + upload_dir + "/file%d.txt" % i, "w") as f:
            f.write(content % i)
def create_gdrive_folder(gdrive_upload_dir):
    print inspect.stack()[0][3]
    # use the function argument rather than the global upload_dir
    gdrive_folder = drive.CreateFile({'title': gdrive_upload_dir,
                                      'mimeType': 'application/vnd.google-apps.folder'})
    gdrive_folder.Upload()
    return gdrive_folder['id']
# not a problem with CLOSE_WAIT uploading
def upload_files(gdrive_folder_id, system_upload_dir):
    print inspect.stack()[0][3]
    cnt = 0
    for filename in os.listdir(system_upload_dir):
        print 'uploading %d %s' % (cnt, filename)
        gdrive_file = drive.CreateFile({'parents': [{'id': gdrive_folder_id}],
                                        'title': filename})
        gdrive_file.SetContentFile(system_upload_dir + "/" + filename)
        gdrive_file.Upload()
        cnt += 1
def download_files(gdrive_folder_id, system_download_dir):
    print inspect.stack()[0][3]
    # use the function argument rather than the global download_dir
    os.makedirs(system_download_dir)
    files = drive.ListFile({'q': "'%s' in parents and trashed=false" % gdrive_folder_id}).GetList()
    cnt = 0
    for f in files:
        print 'downloading %d %s' % (cnt, f['title'])
        f.GetContentFile(system_download_dir + "/" + f['title'])
        cnt += 1
        if cnt > 1000:
            # raw_input just waits for Enter; Python 2's input() would eval the text
            raw_input("check lsof -p for ESTABLISHED and CLOSE_WAIT")
# main
generate_files(upload_dir)
gid = create_gdrive_folder(upload_dir)
upload_files(gid, upload_dir)
download_files(gid, download_dir)
I hit this problem as well, and was able to hack around it by closing all of the underlying httplib2.Http object's connections after every call to GoogleDriveFile.GetContentFile():
file.GetContentFile(path)
for c in file.http.connections.values():
    c.close()
I'm not exactly sure what pattern might have been intended for the GoogleDriveFile objects, but I tend to keep thousands of them around indefinitely. Once you call GetContentFile(), each file object holds on to a socket, which won't work if you use them like this.
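For anyone who wants to reuse the workaround in a download loop, it could be wrapped up roughly like this. This is just a sketch of the same hack, not anything PyDrive provides: download_and_close is a made-up name, and it assumes (as in the snippet above) that the file object exposes an http attribute whose connections dict holds the open connections.

```python
def download_and_close(drive_file, path):
    """Download one file, then close every connection cached on its Http object."""
    try:
        drive_file.GetContentFile(path)
    finally:
        # Assumption: after a download, drive_file.http is an httplib2.Http
        # whose .connections dict maps keys to open connection objects.
        http = getattr(drive_file, "http", None)
        connections = getattr(http, "connections", None) or {}
        for conn in list(connections.values()):
            conn.close()
```

In the loop from the original report, this would replace the bare f.GetContentFile(...) call, so that each file's socket is released before the next download starts, even if a download raises.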
By the way, in case anyone else is searching for this, another symptom I got was the message "nodename nor servname provided, or not known" regarding URLs with the form https://doc-xx-xx-docs.googleusercontent.com. I don't know if this is some broken internal fallback hostname that shows up when you run out of sockets or what, but the fix above seems to have also resolved this issue for me.
I hit this problem today for the first time. Anything new from 2018?
@thekillgfx I think this issue should be fixed (as well as thread safety and other issues) in the maintained fork - PyDrive2.