parse-server-push-adapter

Large number of targeted Installations makes the servers fail (499 errors)

SebC99 opened this issue 7 years ago • 23 comments

Hello, We want to be able to send a push request to all our users in an area (more than 100k), but it overwhelms the servers with far too many connections (nginx's 1024 worker_connections are not enough) and all the standard requests to the servers end up with 499 errors.

Our Parse servers are on Elastic Beanstalk, and we use a simple query new Parse.Query("Parse.Installation").exists("deviceToken") in the Parse.Push.send method.

SebC99 avatar Mar 14 '19 10:03 SebC99

anyone here?

SebC99 avatar Apr 30 '19 18:04 SebC99

Hi @SebC99

Not an issue I have run into personally.

Given the large number, can you use a queue and send them in smaller batches?
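A minimal sketch of that idea (chunk and sendInBatches are hypothetical helper names, not part of parse-server or this adapter): split the targets into fixed-size batches and await each send before starting the next, so only one batch's worth of connections is ever open at a time.

```javascript
// Hypothetical batching helpers (illustrative names only). Capping the
// batch size caps how many simultaneous connections a push run opens.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

async function sendInBatches(tokens, sendBatch, batchSize = 1000) {
  for (const batch of chunk(tokens, batchSize)) {
    // Awaiting here serializes the batches: the next batch's connections
    // are only opened once the previous batch has completed.
    await sendBatch(batch);
  }
}
```

A real queue (SQS, Redis, etc.) adds retries and persistence on top of this, but the connection-capping effect is the same.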

acinader avatar Apr 30 '19 18:04 acinader

Why not, but I honestly don't know how to use this kind of queue ;) And I've tried the batchSize parameter in the query and push methods, but with no better results. What batch size would you recommend anyway? Even 10,000 pushes take a lot of time (more than an hour)

SebC99 avatar Apr 30 '19 18:04 SebC99

Have you tried PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1? I recently hit the max open files (TCP connections) limit on a completely separate issue.

Do you know where your connections are coming from / going to?

dplewis avatar Apr 30 '19 18:04 dplewis

I don't know what the batch size should be. For large pushes, ideally, you could parallelize.

@flovilmart looked into this in the past and I wrote https://github.com/parse-community/parse-server-sqs-mq-adapter, but I never used it.

acinader avatar Apr 30 '19 18:04 acinader

@dplewis what do you mean? I guess the push adapter just opens too many connections, so there is no room for any other requests. The adapter's batching feature for identical payloads doesn't seem to work... But the push and queue code is very hard to understand ;)

SebC99 avatar Apr 30 '19 18:04 SebC99

I understand that the problem is not about sending the pushes. It seems that the pushes are successfully sent, right @SebC99? Can you check in the push status whether they were all sent?

What I have sometimes seen is: the pushes are successfully sent, but as the clients receive them, they hit the Parse API back and make the server crash. Since you are seeing the worker_connections error in nginx, that might be the problem.

I see two possible solutions:

  • Like @acinader suggested, send the pushes in batches. You don't necessarily need to use a queue. You can use an approach as simple as first pushing to everybody on iOS, then Android. Or splitting by installation date, for example.
  • Scale your servers horizontally in order to handle more client requests at the peak.
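As a sketch of the first suggestion (the where-clause objects here follow the REST push query format; perPlatformQueries is an illustrative name, and the platform split is just one possible criterion):

```javascript
// Illustrative only: derive one REST-style "where" object per platform from
// a base constraint, so each push run targets a smaller installation set.
function perPlatformQueries(baseWhere) {
  return ["ios", "android"].map((deviceType) => ({ ...baseWhere, deviceType }));
}
```

For example, perPlatformQueries({ deviceToken: { $exists: true } }) yields two pushes, each covering roughly half the installations, instead of one push against all 100k+.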

davimacedo avatar Apr 30 '19 18:04 davimacedo

@davimacedo not at all!! Only a very small number are sent, like 5000

SebC99 avatar Apr 30 '19 18:04 SebC99

What is the status you see in your push status? Sending forever? How are you running your parse server process? Is it a docker container? A service? Have you noticed this process crashing when sending the pushes?

davimacedo avatar Apr 30 '19 18:04 davimacedo

Have you tried batchSize < 5000?

davimacedo avatar Apr 30 '19 18:04 davimacedo

Here's what's in the DB for the last try:

{ 
    "_id" : "YsXd23ED", 
    "pushTime" : "2019-03-09T12:24:30.138Z", 
    "query" : "{\"deviceToken\":{\"$exists\":true}}", 
    "payload" : "{
        \"alert-fr\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"alert\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"category\":\"update\",
        \"channel\":\"remote_notifications\",
        \"campaign\":\"marketing\"
    }", 
    "source" : "rest", 
    "status" : "running", 
    "numSent" : NumberInt(1496), 
    "pushHash" : "c4bf3a4c2e953169ead4d9c034576006", 
    "_wperm" : [

    ], 
    "_rperm" : [

    ], 
    "_acl" : {

    }, 
    "_created_at" : ISODate("2019-03-09T12:24:30.140+0000"), 
    "_updated_at" : ISODate("2019-03-09T12:28:00.645+0000"), 
    "count" : NumberInt(3595), 
    "failedPerType" : {
        "android" : NumberInt(327), 
        "ios" : NumberInt(40)
    }, 
    "numFailed" : NumberInt(367), 
    "sentPerType" : {
        "android" : NumberInt(537), 
        "ios" : NumberInt(959)
    }
}

SebC99 avatar Apr 30 '19 19:04 SebC99

@SebC99 This is what I was talking about https://github.com/parse-community/parse-server/pull/4173

With direct access there isn't that overhead; without it, internal calls go through the HTTP interface, which opens another connection. I think that's where your issue is coming from.

dplewis avatar Apr 30 '19 19:04 dplewis

Thanks, I hadn't noticed that one. I'll give it a try (direct access has failed me before for cloud functions, so I haven't tried it for push yet)

SebC99 avatar Apr 30 '19 19:04 SebC99

Ignore that last comment; it looks like that has been updated. I don't know much about the push and queue code. I can try to run it locally and see what's causing the issue. I think it's similar to what @davimacedo mentioned: something might be hitting the Parse API.

dplewis avatar Apr 30 '19 19:04 dplewis

I'll try to investigate too. If I remember correctly, a lot of beforeFind or beforeSave calls were appearing in the log, and I think it was about the _User class, but I'm not sure.

SebC99 avatar Apr 30 '19 19:04 SebC99

After some tests, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS seems to decrease the server load. But:

  • with content-available:1 and no payload, the push is sent quite fast (it is still marked as running hours later, but 300,000 pushes were sent in 5 minutes)
  • with a payload, it is very very slow and seems to stop after 3,000 pushes sent (in 5 minutes)

BTW, I understand the numSent and numFailed values, but what is the count value?

SebC99 avatar May 01 '19 13:05 SebC99

And with VERBOSE, I can clearly see: MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 connect listeners added. Use emitter.setMaxListeners() to increase limit

I get the exact same warning with batchSize set to 10 as with batchSize set to 5000. Even for a push with only 50 device tokens!

SebC99 avatar May 01 '19 15:05 SebC99

I also noticed this weird error from node-apn: https://github.com/node-apn/node-apn/issues/653#issue-439213015

SebC99 avatar May 01 '19 15:05 SebC99

If it helps, I keep testing things:

  • it works much better with Android devices than iOS devices
  • removing the database maxTimeMS on Parse Server helps
  • removing invalid device tokens takes a very long time, and when maxTimeMS is set to 5000ms it can fail; the promise chain seems to stop there and the push just hangs (which explains the forever "running" status)
  • there's clearly an issue with iOS pushes, where a far too long promise chain overwhelms Node.js
  • without a data payload it's much faster in any case

SebC99 avatar May 01 '19 16:05 SebC99

If you look here, the promises are serialized.

Maybe do something similar to https://github.com/parse-community/parse-server/pull/5420 to prevent a bottleneck.

Enqueue by PushStatusId or pushStatus.objectId
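In the spirit of that suggestion, a sketch (runWithLimit is a hypothetical helper, not parse-server or adapter code) of replacing a fully serialized promise chain with a bounded worker pool:

```javascript
// Run tasks (functions returning promises) with at most `limit` in flight,
// instead of strictly one after another. One slow batch then no longer
// stalls every batch queued behind it.
async function runWithLimit(tasks, limit) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

The limit keeps the number of simultaneous connections bounded, which is the same property that matters for the nginx worker_connections errors above.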

dplewis avatar May 01 '19 17:05 dplewis

@SebC99 You said that it is much better without a payload, and that's interesting. I am wondering if the problem is in the payload itself. Can you please try without sending "alert-fr":{"title":"XXXX","body":"XXXX"} in the payload? The problem may be related to the locale feature.

BTW, count is the total number of pushes that should be sent, numSent is how many succeeded and numFailed is how many failed. Ideally count should equal numSent + numFailed. In your case, the status stays "running" forever. That tends to happen when some of your batches fail to send due to a server crash (and will never be sent again). Because of this, numSent + numFailed is always < count and the status never changes. The 3 most common reasons I've seen for this:

  1. The reason I mentioned before: something hitting the server back, crashing the server process and therefore stopping the remaining batches from being sent
  2. The query submitted to MongoDB for each batch times out: Parse Server uses skip/limit to build the batches, and that sometimes doesn't perform well
  3. When building the batches, the process running Parse Server hits its maximum RAM limit and crashes.
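To illustrate point 2 (a sketch of the paging pattern, not the server's actual code): with skip/limit paging, each later batch forces the database to walk past all earlier rows, so batch latency grows with the skip value.

```javascript
// Illustrative: the skip/limit windows a paged batch job would issue for
// `total` installations. The last windows carry the largest skips, which is
// why the late batches of a 100k-target push tend to be the slowest.
function batchWindows(total, batchSize) {
  const windows = [];
  for (let skip = 0; skip < total; skip += batchSize) {
    windows.push({ skip, limit: Math.min(batchSize, total - skip) });
  }
  return windows;
}
```

For 100,000 installations at batchSize 5000, the final window is { skip: 95000, limit: 5000 }; a keyset-style cursor (paging on the last seen objectId instead of skipping) is the usual way to avoid the growing skip cost.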

Would you be able to observe if some of these are likely to be happening?

davimacedo avatar May 01 '19 21:05 davimacedo

@davimacedo I tried with a simple "alert" payload (not localized) and I see the exact same thing.

  1. It's not this one, as I ran my test on a standalone server without any other incoming requests
  2. The timeout is clearly an issue, as removing the maxTimeMS improves the result
  3. I reach the MaxListenersExceededWarning limit but not a max RAM limit, and in every case there is no crash on the server side, just infinite hangs.

Again, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS removes the saturation of the server, but the speed/hang issues are still there. I think the serialized promises plus the long request timeouts (to delete device tokens?) explain a lot, but there's still the payload impact, which I can't explain...

SebC99 avatar May 01 '19 21:05 SebC99

@SebC99 Thank you for providing detailed feedback. We have a general idea and suggestions on where the issue may be coming from.

Would you like to take a look at the serialized promises I pointed out https://github.com/parse-community/parse-server-push-adapter/issues/123#issuecomment-488357508 and submit a fix?

dplewis avatar May 01 '19 22:05 dplewis