Builder SDK - servers hang up until we restarted

Open XianfuZhengAfterpay opened this issue 3 years ago • 4 comments

What happened

During the incident, all page requests timed out except for pages cached by Cloudflare. See the screenshot below. image

Our investigations and findings

  • A build was in progress that calls the Builder API to get page content, but it never finished. We killed it from the Heroku command line. The incident began about 10 minutes after the start of the build.

image

  • Dyno memory usage looked normal

image

  • There were no traffic spikes. image

  • Calls to the Builder API looked normal. image
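A build step that calls a remote API and never ends suggests an awaited call with no upper time bound. The snippet below is a minimal sketch of one way to put a deadline on such a call. `withTimeout` is a hypothetical helper, not part of the Builder SDK, and it assumes the content fetch is exposed as a promise.

```typescript
// Hypothetical helper (not a Builder SDK API): rejects if `work` has not
// settled within `ms` milliseconds, so a caller never waits indefinitely.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(label + " did not settle within " + ms + " ms"));
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the deadline
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

Note that this only stops the caller from waiting. It does not cancel the underlying request, so any socket held by the hung call stays open until the HTTP client itself gives up.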

XianfuZhengAfterpay avatar Jun 04 '21 01:06 XianfuZhengAfterpay

Actually the outgoing calls to Builder stopped during the incident until we restarted the dynos:

image

For comparison, the traffic going out to our internal API increased during the incident (probably due to people refreshing the pages)

image

gydongAP avatar Jun 04 '21 22:06 gydongAP

Thanks, Guangyu. I'm wondering why we still have readings in the response time graph.

One more thing I noticed is that the number of requests to the Builder API gradually decreased.

XianfuZhengAfterpay avatar Jun 04 '21 23:06 XianfuZhengAfterpay

thanks @gydongAP and @XianfuZhengAfterpay - this is helpful

The next thing we need to do is isolate what happened here. Short of having full access to your code and understanding how different states are handled, I am wondering if there is a way we can isolate a potential problem in the SDK.

Without seeing your code directly and knowing the nuances of how your stack handles certain situations, I'm having trouble thinking of how to reproduce these findings in isolation. Would it be possible for you to take a stab at this?
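One possible way to narrow this down without sharing the codebase is to instrument outgoing calls and record which ones start but never settle. The sketch below assumes each outgoing call is available as a promise. `InFlightTracker` and its method names are hypothetical and exist only for illustration.

```typescript
// Hypothetical instrumentation: remembers calls that have started but not
// yet settled, so long-pending ones can be listed in logs or a health check.
class InFlightTracker {
  private pending: { [id: number]: { label: string; startedAt: number } } = {};
  private nextId = 0;

  // Registers `work` under `label` and returns the same promise unchanged.
  track<T>(label: string, work: Promise<T>): Promise<T> {
    const id = this.nextId++;
    this.pending[id] = { label: label, startedAt: Date.now() };
    const done = () => {
      delete this.pending[id];
    };
    work.then(done, done); // forget the call once it resolves or rejects
    return work;
  }

  // Labels of calls that have been pending for at least `olderThanMs`.
  stuck(olderThanMs: number, now: number = Date.now()): string[] {
    const labels: string[] = [];
    for (const key of Object.keys(this.pending)) {
      const entry = this.pending[Number(key)];
      if (now - entry.startedAt >= olderThanMs) labels.push(entry.label);
    }
    return labels;
  }
}
```

Logging `stuck(...)` periodically during a build would show whether the pending calls are the content requests or something else in the stack.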

steve8708 avatar Jun 07 '21 17:06 steve8708

+ Shahar

@Shahar Sharon, since you are more familiar with the codebase, do you mind providing some insights here?

XianfuZhengAfterpay avatar Jun 07 '21 23:06 XianfuZhengAfterpay