imap2maildir

Failure for ~200k messages

Open viric opened this issue 13 years ago • 13 comments

Hello,

After four or five interrupted runs, I have managed to download only about 47k of my 200k messages:

    $ ls gmailbackup/new/ | wc -l
    43815

I run this command to fetch all the messages:

    $ python imap2maildir -u xxxxxx -r "[Gmail]/Tots els missatges" -s ALL --create -v -d gmailbackup

On every run I am asked for the password, and then it prints:

    Opening sqlite3 database 'gmailbackup/.imap2maildir.sqlite'
    Synchronizing 199663 messages from imap.gmail.com:[Gmail]/Tots els missatges to /home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/gmailbackup...
    TURBO MODE ENGAGED!
    Exception! Clearing locks and safing database.
    Traceback (most recent call last):
      File "imap2maildir", line 495, in <module>
        main()
      File "imap2maildir", line 476, in main
        search=options.search)
      File "imap2maildir", line 396, in copy_messages_by_folder
        for i in folder.Summaries(search=search):
      File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 357, in Summaries
        summ = self.__parent.get_summary_by_uid(u)
      File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 256, in get_summary_by_uid
        '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
      File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 753, in uid
        typ, dat = self._simple_command(name, command, *args)
      File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 1060, in _simple_command
        return self._command_complete(name, self._command(name, *args))
      File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 890, in _command_complete
        raise self.abort('command: %s => %s' % (name, val))
    imaplib.abort: command: UID => socket error: unterminated line

I cannot download any further. It takes quite a long time before the error appears. Could Gmail be disconnecting because of an inactivity timeout?

viric, Oct 13 '10 21:10

I notice that in checkmessage() the turbo mode runs an SQL SELECT query for every candidate message to check whether it is already stored. That is a lot of work; I think it would be far better to load the list into memory in an appropriate searchable structure and do the check there.
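A minimal sketch of that in-memory check, assuming Python 3 and a hypothetical table `seen_messages` with a `uid` column (the real imap2maildir schema is not shown in this thread). All stored UIDs are read with one query into a set, and each server UID is then tested against the set instead of the database:

```python
import sqlite3


def load_seen_uids(conn):
    """Read every stored UID in one query and return them as a set."""
    # "seen_messages" and "uid" are hypothetical names used for illustration.
    return {row[0] for row in conn.execute("SELECT uid FROM seen_messages")}


# Demo database standing in for the tool's sqlite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen_messages (uid INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO seen_messages (uid) VALUES (?)",
    [(uid,) for uid in (101, 102, 105)],
)

seen = load_seen_uids(conn)

# UIDs reported by the server; membership tests hit the set, not sqlite.
server_uids = [101, 102, 103, 104, 105]
missing = [uid for uid in server_uids if uid not in seen]
print(missing)
```

A set of a few hundred thousand integers should fit comfortably in memory on most machines, which is the trade-off being discussed here.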

viric, Oct 15 '10 22:10

I've run into a couple cases where a specific message is "corrupted" on gmail's end, and trying to fetch it via IMAP fails. In simpleimap.py, putting a try/except around the get_summary_by_uid should find the IMAP UID that is choking it:

try:
    summ = self.__parent.get_summary_by_uid(u)
except Exception:
    # Report which UID failed, then re-raise so the traceback is preserved.
    print("uid %s" % u)
    raise

Once you have that, it should be possible to delete the offending message.

It should be doing a better job of handling errors such as these. And yes, it is doing a SQL query for each UID... I don't remember why I did it that way, but I think memory consumption was a concern. On second thought, it shouldn't take THAT much memory, and it would likely improve performance a lot. :-) Good catch.

rtucker, Oct 18 '10 16:10

Gmail simply closes the socket because of the long inactivity during the first stage of TURBO MODE.

Once the list of UIDs is held in memory and checked there, instead of with one SQL query per UID, I think turbo mode will work great.

I'm trying without turbo mode, but Gmail disconnects me before I reach even 15% of my mail.

viric, Oct 18 '10 16:10

Well.

On my Gmail mailbox of ~145,000 messages:

    Last night's run: about 3.75 hours
    With a cache:     7 minutes, 22 seconds

Pull in the latest HEAD and let me know how that works for you.

rtucker, Oct 18 '10 21:10

I just tried it. With turbo mode, using the old maildir directory that already had some messages in it, I got:

Exception!  Clearing locks and safing database.
Traceback (most recent call last):
  File "./imap2maildir", line 536, in <module>
    main()
  File "./imap2maildir", line 517, in main
    seencache=seencache)
  File "./imap2maildir", line 435, in copy_messages_by_folder
    for i in folder.Summaries(search=search):
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 357, in Summaries
    summ = self.__parent.get_summary_by_uid(u)
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 256, in get_summary_by_uid
    '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 753, in uid
    typ, dat = self._simple_command(name, command, *args)
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 1060, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 890, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: UID => socket error: unterminated line

I am not very good at Python, so sorry if I don't go into more detail about the code. :) I will try again with a newly created maildir.

viric, Oct 18 '10 21:10

Well, at least it should be faster to test :-)

I just pushed a patch that will spit out the UID it choked on. Once you have that UID, you can try firing up Python and seeing if you can figure out what's wrong with the message:

import simpleimap
server = simpleimap.Server(hostname='imap.gmail.com', username='user@example.com', password='blah').Get()
server.select('[Gmail]/All Mail')
server.uid('FETCH', 376544, '(RFC822)')

... would spit out message uid 376544. Try the neighboring messages (presumably 376543 and 376545) as well. You can also try:

    server.uid('FETCH', 376544, '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')

to see what that does, since that's what it is trying to do when it crashes.

imap2maildir could easily ignore this exception and have it continue on, but I think understanding why it is happening will be a very good thing.
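As a rough illustration of what "ignore and continue" might look like, here is a sketch in Python 3. The helper `fetch_all` and the `fetch_one` callable are hypothetical and not part of imap2maildir or simpleimap; the demo uses a fake fetch function instead of a live server. Note that after an `imaplib.IMAP4.abort` the connection usually has to be reopened before further commands will work, which this sketch does not attempt.

```python
import imaplib


def fetch_all(uids, fetch_one):
    """Fetch each UID, recording failures instead of stopping at the first."""
    fetched, failed = {}, []
    for uid in uids:
        try:
            fetched[uid] = fetch_one(uid)
        except imaplib.IMAP4.error as exc:
            # IMAP4.abort is a subclass of IMAP4.error, so it is caught too.
            failed.append((uid, str(exc)))
    return fetched, failed


def fake_fetch(uid):
    """Stand-in for a real IMAP fetch; one UID is made to fail."""
    if uid == 165982:
        raise imaplib.IMAP4.abort("socket error: unterminated line")
    return "summary for %d" % uid


fetched, failed = fetch_all([165981, 165982, 165983], fake_fetch)
print(sorted(fetched))
print(failed)
```

Keeping the list of failed UIDs makes it possible to inspect those messages by hand afterwards, which fits the goal of understanding why the failure happens.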

Thanks! -rt

rtucker, Oct 18 '10 22:10

Here you have it:

>>> server.uid('FETCH', 165982, '(RFC822)')
('OK', [('43816 (UID 165982 RFC822 {5523}', 'Delivered-To: [email protected]\r\nReceived: by 10.142.169.1 with SMTP id r1cs178792wfe;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReceived: by 10.115.23.19 with SMTP id a19mr4311058waj.133.1222613393492;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReturn-Path: \r\nReceived: from n16a.bullet.sp1.yahoo.com (n16a.bullet.sp1.yahoo.com [69.147.64.121])\r\n        by mx.google.com with SMTP id t1si2136057poh.13.2008.09.28.07.49.52;\r\n        Sun, 28 Sep 2008 07:49:52 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) client-ip=69.147.64.121;\r\nDomainKey-Status: good\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) smtp.mail=sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com; domainkeys=pass [email protected]\r\nComment: DomainKeys? 
See http://antispam.yahoo.com/domainkeys\r\nDomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=lima; d=yahoogroups.com;\r\n\tb=LSlgVDUGFtooqe064kt32c5atqJ2pBA+7kklkoqGGl95lG8xCcl8wjfXI6G5C61jPvg4vE0TWl1f2ZdNkYh5Xeade6B9I0le2BqDz8bMtZLINLIKi8XRYyp1pFTQEyGw;\r\nReceived: from [69.147.65.171] by n16.bullet.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nReceived: from [66.218.67.109] by t13.bullet.mail.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nX-Yahoo-Newman-Id: 9862331-m5848\r\nX-Sender: [email protected]\r\nX-Apparently-To: [email protected]\r\nX-Received: (qmail 68424 invoked from network); 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (66.218.67.96)\r\n  by m45.grp.scd.yahoo.com with QMQP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (HELO mail.libertysurf.net) (213.36.80.105)\r\n  by mta17.grp.scd.yahoo.com with SMTP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from aliceadsl.fr (192.168.10.57) by mail.libertysurf.net (8.0.015)\r\n        id 482DC6AA00F031DC for [email protected]; Sun, 28 Sep 2008 16:49:42 +0200\r\nMessage-Id: \r\nX-Sensitivity: 3\r\nTo: "=?iso-8859-1?Q?tradukado?=" \r\nX-XaM3-API-Version: 3.2 R18 (B34 pl1)\r\nX-type: 0\r\nX-SenderIP: 91.171.195.43\r\nX-Originating-IP: 213.36.80.105\r\nX-eGroups-Msg-Info: 1:12:0:0:0\r\nFrom: "[email protected]?=" \r\nX-Yahoo-Profile: jorgos_esperanto\r\nSender: [email protected]\r\nMIME-Version: 1.0\r\nMailing-List: list [email protected]; contact [email protected]\r\nDelivered-To: mailing list [email protected]\r\nList-Id: \r\nPrecedence: bulk\r\nList-Unsubscribe: \r\nDate: Sun, 28 Sep 2008 16:49:42 +0200\r\nSubject: =?iso-8859-1?Q?Re:[tradukado]_verboj_por_tabulaj_sportoj_(surftabulo,\r\n\t_negxtabulo,_rultabulo,_ktp)?=\r\nReply-To: [email protected]\r\nX-Yahoo-Newman-Property: groups-email-tradt-m\r\nContent-Type: text/plain; charset=ISO-8859-1\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\nOni jam delonge neplu biciklumas au gitarludas sed biciklas=0D\r\nkaj 
gitaras (kvankam ne mem estas biciklo au gitaro) kaj=0D\r\npraktikas bicikladon kaj gitaradon, ^cu ne ? ; nu kial ne ? =0D\r\n=0D\r\n^Ciu elektu mem kaj la popolo decidos tion, kion akcepti...=0D\r\n=0D\r\nJs.=0D\r\n=0D\r\ntradukado, 28 Sep 2008 : verboj por tabulaj sportoj=0D\r\n(surftabulo, negxtabulo, rultabulo, ktp)=0D\r\n=0D\r\nSaluton,=0D\r\nkiel vi verbe esprimus la diversajn X-tabulan sportojn, ekz=0D\r\nuzon de=0D\r\nsurftabulo, negxtabulo, rultabulo, ktp?=0D\r\n1. simple verbigu la substantivon, kompreneble!=0D\r\nsurftabuli, negxtabuli, rultabuli, ...  Do "Li X-tabulas."=0D\r\n2. ne ne, tia verba formo de "tabul-" sensencas aux sugestas=0D\r\nke la=0D\r\nsubjekto ESTAS tia tabulo, do necesas aldoni -um al la=0D\r\nsubstantivo:=0D\r\nsurftabulumi, negxtabulumi, rultabulumi, ... Do "Li X-tabulumas"=0D\r\n3. ne eblas verbigi tiel, oni bezonas uzi ian verbon kun la=0D\r\nsubstantivo: rajdi surftabulon, gliti sur negxtabulo, veturi=0D\r\nper rultabulo, ... Do "Li iras per X-tabulo" aux "Li iras=0D\r\nX-tabule" ktp=0D\r\n4. io alia...?=0D\r\nKiel oni nomu la agadojn substantive?=0D\r\n1. surftabulado, negxtabulado, rultabulado, ...=0D\r\n2. surftabulumado, negxtabulumado, rultabulumado, ...=0D\r\n3. surftabulrajdado, negxtabulglitado, rultabulveturado, ...=0D\r\n4. io alia...?=0D\r\ndankon,    russ=0D\r\n\r\n\r\n\r\n---------------------- ALICE N=B01 de la RELATION CLIENT 2008*-------------=\r\n-------\r\nD=E9couvrez vite l\'offre exclusive ALICE BOX! En cliquant ici http://abonne=\r\nment.aliceadsl.fr Offre soumise =E0 conditions.*Source : TNS SOFRES / BEARI=\r\nNG POINT. Secteur Fournisseur d.Acc=E8s Internet\r\n\r\n\r\n\r\n------------------------------------\r\n\r\nYahoo! Groups Links\r\n\r\n To visit your group on the web, go to:\r\n    http://groups.yahoo.com/group/tradukado/\r\n\r\n Your email settings:\r\n    Individual Email | Traditional\r\n\r\n To change settings online go to:\r\n    http://groups.yahoo.com/group/tradukado/join\r\n    (Yahoo! 
ID required)\r\n\r\n To change settings via email:\r\n    mailto:[email protected]=20\r\n    mailto:[email protected]\r\n\r\n To unsubscribe from this group, send an email to:\r\n    [email protected]\r\n\r\n Your use of Yahoo! Groups is subject to:\r\n    http://docs.yahoo.com/info/terms/\r\n\r\n'), ' FLAGS (\\Seen))'])

The main problem seems to be that the Subject: line has a \r\n\t in the middle.

The relevant information is in RFC 2822, section 2.2.3. In short:

""" The process of moving from this folded multiple-line representation of a header field to its single line representation is called "unfolding". Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP. Each header field should be treated in its unfolded form for further syntactic and semantic evaluation. """

(I took this reference from http://bugs.python.org/issue504152)
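The unfolding rule quoted above can be sketched in a few lines of Python 3. The function name `unfold` is an illustrative choice, and the sample value is the folded Subject from the message dumped earlier in this thread. Only the CRLF is removed; the whitespace character that follows it stays, as the RFC text describes.

```python
import re

# A CRLF counts as a fold only when a space or tab immediately follows it.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(header_value):
    """Remove every CRLF that is immediately followed by WSP."""
    return _FOLD.sub("", header_value)


folded = (
    "=?iso-8859-1?Q?Re:[tradukado]_verboj_por_tabulaj_sportoj_(surftabulo,"
    "\r\n\t_negxtabulo,_rultabulo,_ktp)?="
)
print(repr(unfold(folded)))
```

A CRLF that is not followed by a space or tab is left alone, since it ends the header field rather than folding it.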

viric, Oct 19 '10 13:10

Sorry, I see now that this is a problem in imaplib, still present in Python 2.7 and Python 3. I'll have to work around it somehow.

viric, Oct 23 '10 14:10

I had the chance to investigate the issue further. My mailbox has messages from one particular person whose mail violates RFC 2822 whenever the Subject is long: instead of folding the subject with CRLF + WSP, his messages fold it with LF + WSP only. That breaks parsing of the ENVELOPE response, because imaplib reads it with readline(), and readline() treats both \n and \r\n as end of line. I wrote a patch for imaplib so I can keep downloading. When a line ends in \n (not \r\n), I concatenate the next line to it and remove the \n\t sequence.
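The workaround described above might be sketched as follows in Python 3. This is not the actual patch and does not touch imaplib internals; `read_logical_line` is a hypothetical helper, demonstrated on an in-memory byte stream with a made-up server response. It assumes that every genuine protocol line ends in CRLF, so a bare LF can only be a broken fold.

```python
import io


def read_logical_line(stream):
    """Read one CRLF-terminated line, joining pieces split at a bare LF."""
    line = stream.readline()
    # A piece ending in LF without a preceding CR is treated as a broken fold.
    while line.endswith(b"\n") and not line.endswith(b"\r\n"):
        continuation = stream.readline()
        if not continuation:
            break  # end of stream, nothing left to join
        # Drop the bare LF and the whitespace that starts the continuation.
        line = line[:-1] + continuation.lstrip(b" \t")
    return line


# Made-up response with an LF + TAB fold inside the subject.
raw = b'* 1 FETCH (ENVELOPE ("part one,\n\tpart two"))\r\nnext line\r\n'
stream = io.BytesIO(raw)
first = read_logical_line(stream)
second = read_logical_line(stream)
print(first)
print(second)
```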

viric, Oct 23 '10 18:10

Cool! I, unfortunately, haven't had a chance to look at this yet but that's probably where I was headed.

I am not opposed to working around bugs in imaplib.py using simpleimap.py... see the SimpleImapSSL class for an example of this. The process of getting a bug fixed in the Python library is very slow, and then it has to actually make it onto people's systems via Debian/Ubuntu/RHEL/CentOS/. And yes, there are more than a few such bugs.

rtucker, Oct 23 '10 19:10

Once I succeed in downloading all my Gmail mail, I'll try to write something worth submitting for that bug.

viric, Nov 18 '10 22:11

Ouch. My quick hack worked for the case I had, but I have hit a new one that is harder to work around. It also fails in the Python library, not in your code:

    Date: Sat, 12 Aug 2006 21:07:54 +0400
    Subject: [EK-MASI] =?koi8-r?B?IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv?= =?koi8-r?B?dmEgU2xvdmFraW8j?=
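Decoding that Subject with the standard `email.header` module (Python 3) shows what is unusual about it: the base64 payload itself contains CR and LF bytes. Whether that is what trips imaplib is an assumption, not something confirmed in this thread. The helper `decode_subject` is an illustrative name.

```python
from email.header import decode_header

subject = (
    "[EK-MASI] "
    "=?koi8-r?B?IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv?= "
    "=?koi8-r?B?dmEgU2xvdmFraW8j?="
)


def decode_subject(raw):
    """Decode RFC 2047 encoded words into one text string."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            # Unencoded fragments come back with charset None.
            data = data.decode(charset or "ascii", errors="replace")
        parts.append(data)
    return "".join(parts)


decoded = decode_subject(subject)
print(repr(decoded))
```

If a server or client expands such a subject without escaping the embedded line breaks, any parser that works line by line would see the header split in unexpected places.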

viric, Nov 19 '10 21:11

Niiiice!

See my comment on Issue #10 -- having the "raw" response from the IMAP server helps with testing the weird ones.

rtucker, Nov 23 '10 17:11