Gracefully stopping server works unless server has been running a few days

Open bodily11 opened this issue 1 year ago • 13 comments

I'm running the latest version of Ord on AWS Linux on a beefy ec2 instance. With the latest changes, I can start the server, then gracefully stop the server, then start the server again quickly. This is fantastic as it used to have issues with never gracefully stopping.

However, if I run the server for a few days and then try to stop it gracefully, it appears unresponsive to SIGTERM.

Do I need to wait hours for the server to gracefully stop after it has been running for a few days? Or are there additional changes needed to enable graceful stopping of the server after a long process? Anyone else having issues here?

bodily11 avatar Jul 18 '23 14:07 bodily11

Haven't been able to reproduce this yet.

veryordinally avatar Jul 20 '23 09:07 veryordinally

What does the log (RUST_LOG=info) say after you CTRL-C?

raphjaph avatar Jul 20 '23 22:07 raphjaph

Is the server caught up on indexing? If it isn't, a shutdown shortly after startup is quick because there is only a tiny amount of data to flush. If it has been indexing for a while, it could have up to 5000 blocks to flush before it fully shuts down. In my experience, that can take a few hours, even when backed by an NVMe drive.

To give you an idea of time, I changed my batch size from 5k to 500 and it takes just under an hour to run the commit step (sats indexing is on).

victorkirov avatar Jul 21 '23 19:07 victorkirov

It might be a better experience to stop the web server only after the indexer has shut down. That way the server would at least stay usable while the indexer is committing the new blocks.

victorkirov avatar Jul 21 '23 20:07 victorkirov
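
The ordering victorkirov suggests above can be sketched roughly as follows. This is a toy Python model with two threads, not ord's actual (Rust) shutdown code; the names `indexer`, `web_server`, and `run_and_shutdown` are made up for illustration, and the sleeps stand in for real work.

```python
import threading
import time

events = []  # records the order in which things happen, for illustration

stop_requested = threading.Event()  # set when SIGTERM / CTRL-C arrives
indexer_done = threading.Event()    # set once the final commit has finished


def indexer():
    # Pretend to index until a shutdown is requested.
    while not stop_requested.is_set():
        time.sleep(0.01)
    # Stand-in for the slow final flush of buffered blocks to disk.
    time.sleep(0.05)
    events.append("indexer: commit finished")
    indexer_done.set()


def web_server():
    # Keep "serving" until the indexer has finished committing.
    while not indexer_done.is_set():
        time.sleep(0.01)
    events.append("web server: stopped")


def run_and_shutdown():
    # Reset state so the function can be called more than once.
    events.clear()
    stop_requested.clear()
    indexer_done.clear()

    threads = [threading.Thread(target=indexer),
               threading.Thread(target=web_server)]
    for t in threads:
        t.start()

    events.append("shutdown requested")
    stop_requested.set()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    for line in run_and_shutdown():
        print(line)
```

The point of the sketch is only the dependency: the web server waits on the indexer's "done" event rather than on the shutdown request itself, so it keeps answering requests during a long commit.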

Going to get my rust logs going today to check for you Raph.

@victorkirov yes server is caught up on indexing. I did dozens of SIGTERM and SIGINT and then waited 3 days and it was still running (appears to be ignoring).

Will report back on rust logs shortly.

bodily11 avatar Jul 21 '23 20:07 bodily11

Ok, that happened to me from one of the commits in the branch. Are you running off main or off my old branch? There was one commit that would send ord into zombie mode with an uninterruptible sleep. I had to restart one of my k8s nodes to kill the process 😅

victorkirov avatar Jul 21 '23 20:07 victorkirov

I was able to reproduce this issue and find a potential solution.

When starting the server, I was originally running ord server --http-port 8080 &. The ampersand runs the process in the background. I'm running on an AWS Linux instance that allows these processes to continue running even after exiting the terminal. With this command I was able to kill the process using SIGTERM if I issued the signal immediately, but after a few hours/days it would become unresponsive to all kill signals except SIGKILL.

The solution is to use nohup: nohup ord server --http-port 8080 &. When I use nohup, even if I leave the server running for hours/days, it still responds to SIGTERM.

I tried to dig into the difference between using nohup and just using the ampersand, but couldn't identify why this fix would work. But I can verify that using nohup solves the issue.

bodily11 avatar Jul 25 '23 13:07 bodily11
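
For anyone who wants to check whether a long-running process still responds to SIGTERM, here is a minimal Python sketch (not from the thread, POSIX only). `start_detached` only approximates `nohup cmd &`: it gives the child its own session, so it has no controlling terminal, and sends its output to a log file. `stop_gracefully` sends SIGTERM and reports whether the process exited within a timeout. Both function names and the log path are illustrative assumptions, and nothing here explains why nohup made a difference for ord.

```python
import os
import signal
import subprocess
import sys
import tempfile


def start_detached(cmd, log_path):
    """Start cmd in its own session with output appended to log_path."""
    with open(log_path, "ab") as log:
        # The child inherits its own copy of the log file descriptor,
        # so closing ours when the `with` block ends is safe.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # calls setsid() in the child
        )


def stop_gracefully(proc, timeout):
    """Send SIGTERM; return True if the process exits within timeout seconds."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Stand-in child process; for ord the command from the thread would be
    # ["ord", "server", "--http-port", "8080"].
    with tempfile.TemporaryDirectory() as tmp:
        child = start_detached(
            [sys.executable, "-c", "import time; time.sleep(60)"],
            os.path.join(tmp, "server.log"),
        )
        print("exited after SIGTERM:", stop_gracefully(child, timeout=10))
```

If `stop_gracefully` returns False, the process is either still flushing or stuck; the timeout alone cannot tell those apart, so it is worth checking the logs before reaching for SIGKILL.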

I would recommend just using systemd for running ord on a server. We have an example unit file in the deploy directory. With that you can easily manage it with systemctl and see logs with journalctl.

raphjaph avatar Jul 25 '23 21:07 raphjaph

Wow! That is interesting and very strange. I agree with Raphjaph though: it's better to use some sort of service management, or even a docker container with recovery set up, than to run the command with nohup on a server.

victorkirov avatar Jul 26 '23 10:07 victorkirov

I have seen ord hang repeatedly. Eventually I press ctrl-c and it says "shutting down gracefully", and hours later it still says "shutting down gracefully". When I eventually press ctrl-c again, subsequent commands give me this message:

Index file "/Users/patrick/Library/Application Support/ord/index.redb" needs recovery. This can take a long time, especially for the --index-sats index.

Note that when I had debug logging enabled, I saw it sometimes take up to 45 minutes when it said it was flushing entries from memory to the database. This is on a Mac mini M2 with a 10-core CPU and 32 GB of RAM, so I am shocked at how long this was taking...

The other odd behavior I saw was regular JSON-RPC calls that resulted in "block height out of range". For example, at one point the latest block was 800796 and ord was trying to fetch 800797.

patrick99e99 avatar Aug 04 '23 00:08 patrick99e99

Yeah, the flushing of data takes ages. I've decreased mine to only index 500 instead of 5000 blocks and it takes just under an hour with sats indexing enabled. Since ReDB stores things in BTrees, I'm guessing it's rebalancing the trees before saving to disk.

The block height out of range thing looks like a legit error, but it shouldn't cause any issues.

victorkirov avatar Aug 04 '23 05:08 victorkirov

I would really like to implement a flushing strategy that looks at memory usage instead of number of blocks to decide when to write to disk. If anyone wants to take a stab at that, that would be great!

raphjaph avatar Aug 22 '23 12:08 raphjaph
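
The memory-based flushing idea raphjaph describes above could look roughly like this. It is a sketch in Python under stated assumptions, not ord's implementation: the class name `MemoryBoundedBuffer`, the `commit` callback, and using `len(data)` as the memory estimate are all hypothetical stand-ins for whatever the real indexer would track.

```python
class MemoryBoundedBuffer:
    """Buffer blocks in memory and commit when a byte budget is reached,
    instead of committing after a fixed number of blocks."""

    def __init__(self, max_bytes, commit):
        self.max_bytes = max_bytes  # memory budget for uncommitted data
        self.commit = commit        # callback that writes a batch to disk
        self.pending = []           # list of (height, data) tuples
        self.pending_bytes = 0      # rough size of everything in `pending`

    def add_block(self, height, data):
        self.pending.append((height, data))
        self.pending_bytes += len(data)
        # Flush as soon as the budget is reached, however many blocks that is.
        if self.pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        # Also meant to be called once on shutdown to write what is left.
        if not self.pending:
            return
        self.commit(self.pending)
        self.pending = []
        self.pending_bytes = 0


if __name__ == "__main__":
    committed = []
    buf = MemoryBoundedBuffer(
        max_bytes=100,
        commit=lambda batch: committed.append([h for h, _ in batch]),
    )
    for height in range(1, 8):
        buf.add_block(height, b"x" * 40)
    buf.flush()  # final flush, as on a graceful shutdown
    print(committed)
```

With a byte budget, the amount of work left at shutdown is bounded by memory rather than by block count, which is the property that matters for how long a graceful stop can take.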

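Yes, flushing the data takes a long time. I've reduced mine to index only 500 blocks instead of 5000, and with sats indexing enabled it takes just under an hour. Since ReDB stores things in BTrees, I'm guessing it rebalances the trees before saving to disk.

The block height out of range thing looks like a legit error, though. But it shouldn't cause any issues.

On Windows, how do I change the index batch size from 5000 to 500? Can it be set to 100? Which command should I use?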

xinzhongyouhai avatar Apr 20 '24 16:04 xinzhongyouhai