mailsplit icon indicating copy to clipboard operation
mailsplit copied to clipboard

FlowedDecoder blocks with large bodies

Open TjlHope opened this issue 1 year ago • 3 comments

We were bitten late last week by our email processing microservice getting completely locked up, memory ballooning, and eventually dying (I'm guessing an OOM).

Once we'd tracked the issue down to the email at fault, I was eventually able to work out that the problem was that the email was a text/plain; format=flowed reply from Mozilla Thunderbird to an HTML email containing ~10M of pictures.

What appears to have happened is that Thunderbird has converted the HTML to plain text which is then quoted (>) in the reply, and the images have been included as massive blobs of base64 encoded text on the end of the reply body.

FlowedDecoder is storing all the chunks, splitting on newline, doing its reflow processing, then joining again - and in our case there were >130k lines of base64 in the body.

We've got a temporary workaround tested where we just make the FlowedDecoder passthrough:

const FlowedDecoder = require("mailsplit/lib/flowed-decoder.js");
FlowedDecoder.prototype._transform = function _transform(chunk, enc, cb) { return cb(null, chunk); }
FlowedDecoder.prototype._flush = function _flush(cb) { return cb(); }

I'm hoping to start working on a proper fix this week where FlowedDecoder does proper stream decoding. I'm intending to have most of the logic in FlowedDecoder itself because libmime seems to only work with strings. Would this be acceptable, or would you really want libmime to be made stream aware instead?

Oh and thank you for a great set of email processing libs!

TjlHope avatar May 07 '24 10:05 TjlHope

Would increasing available memory help preventing OOM? Because there are other kinds of issues as well with very large text emails, eg a cron monitoring tool sending out an email with 20MB of plaintext content.

andris9 avatar May 07 '24 10:05 andris9

Thanks for the suggestion. It probably wouldn't hurt, but given it was only a ~12M email and we already truncate anything >25M I'm not sure how much it would help.

The process seemed to get completely blocked for 3-5 minutes which is more the problem for us. I guess preventing the OOM would at least mean it eventually recovers, and doesn't keep retrying the dodgy email for the initial processing, but as we also currently parse server side for view reqs from an end-user browser those reqs would always time out (and block the whole proc for the duration).

Thinking about it - defiitely take my OOM comment with a pinch of salt - I was much more focussed on the blocking than the memory, so need to investigate what eventually killed the proc more rather than making assumptions.

My initial thought for the workaround of making the FlowedDecoder passthrough was because that changed the processing time on my PC from 2+ minutes to <10s (IIRC), with no outrageous memory spikes. (Which is obviously with a 10M plain text body - I'll definitely test bigger text bodies as well though - thanks.) I wasn't able to repro the OOM on my PC with my noddy test case, need to double check that as well.

My OOM assumption was thinking that node/v8 wasn't able to trigger a GC with the main JS thread blocked, the join is incrementally creating and copying into bigger and bigger strings, and there was enough other stuff going on in prod that it hit the max heap? I've not looked into JS GC in a couple of years though, and never into string alloc, so might be barking up the wrong tree there. I'm more familiar with e.g.java StringBuilder but that doubles its backing char array every time, so would have only used up to ~30M, not the more than 1G that would be needed to exhaust our nodejs heap. Still need to investigate this some more as it's not really adding up for me - maybe we've got a healthcheck that's killing the blocked process that I'm not aware of, and it's not an OOM at all.

I'll try and get a reproducable test case that doesn't use customer emails as soon as I'm back working on this.

TjlHope avatar May 07 '24 12:05 TjlHope

Apologies for the delay, I'm currently having to progress this in bits of spare time (now that the workaround is in it's lower priority).

Unit test demonstrating the event loop blocking added https://github.com/TjlHope/mailsplit/commit/f0ed60c8bd376ffc0db7a93b6bf76f96cd60fbeb

I'll raise a PR if/when I'm able to actually get round to implementing a fix.

TjlHope avatar May 28 '24 19:05 TjlHope