postgresql-simple icon indicating copy to clipboard operation
postgresql-simple copied to clipboard

Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)

Open deem0n opened this issue 10 years ago • 11 comments

Hello,

This issue was already discussed: https://github.com/lpsmith/postgresql-simple/issues/117 The conclusion was that it is due to some websocket related code which used pg socket and sent unexpected by PostgreSQL server ping packets.

I have quite simple Haskell program which just listen to messages and calls plpgsql procedure on each message. No websocket or any other additional socket related code.

processAuditMessage :: Connection -> Notification -> IO ()
processAuditMessage con n = do
  let json = notificationData n
  let sql = "SELECT process_message(?)"
  res <- catchSQLErr (query con sql (Only json) :: IO [Only Bool]) $ \e -> do
    System.IO.hPutStr stderr "SQL Error: "
    UTF8.hPutStrLn stderr $ sqlErrorMsg e
    UTF8.hPutStrLn stderr json
    return [Only False]
  return ()

runLoggerLoop :: Connection -> String -> IO ()
runLoggerLoop con ch = listen >> loop
  where
    listen = execute con "LISTEN ?" [Identifier (fromString ch)]
    loop = forever $
      getNotification con >>= processAuditMessage con

After about 30 hours of work it died with Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)

Any idea what may be a reason for the error and how to prevent it ?

deem0n avatar Nov 05 '15 08:11 deem0n

Well, typically it means that the connection vanished, without receiving an error message from the database backend. postgresql-simple will generate a IOError in this case, instead of a SQLError. How are you connecting to your database? Via unix sockets, loopback, or over a network? If it's over a network, do you know much about it? Is it a local network, or are you going across a wider-area network?

Any time you go over anything more than loopback, and especially a wide-area network, you probably need to write your logger to be tolerant to network hiccups leading to a loss of connection (and thus potentially also lost notifications that were sent while your logger was away, if that's important).

postgresql-simple can also produce a Database.PostgreSQL.Simple.Notification.getNotification: resource vanished (failed to fetch file descriptor) IOError, so your exception would seem to be coming from the SELECT process_message(?) query. If it's failing consistently there, that's somewhat curious. I suppose it's entirely possible that there is a bug in postgresql-simple that's swallowing a postgres protocol error message that is received shortly before the server hangs up.

I'd also check your server's logs for any messages, if you can. And, if this is happening consistently, then running a packet capture on the client and/or server can be very enlightening. (And wireshark has a postgresql protocol dissector, so I highly recommend using wireshark for analyzing the capture after the fact.)

lpsmith avatar Nov 05 '15 15:11 lpsmith

Hello, connection is via loopback interface on localhost (127.0.0.1) without SSL. I'm a haskell newbie, so it is quite hard for me to add verbose logging on IOError exceptions. How it would look like ?

Currently my code does some error checks (as shown in my previous message), but there was no SQL Error: part in the error log, so I assume that text Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor) was caused by some exception which I did not catch.

Also, haskell application did exit immediately after this error, which is something I want to avoid.

deem0n avatar Nov 09 '15 10:11 deem0n

Here is my stack list-dependencies

aeson 0.8.0.2
array 0.5.1.0
attoparsec 0.12.1.6
base 4.8.1.0
binary 0.7.5.0
blaze-builder 0.4.0.1
blaze-textual 0.2.1.0
byteable 0.1.1
bytestring 0.10.6.0
case-insensitive 1.2.0.5
containers 0.5.6.2
cryptohash 0.11.6
deepseq 1.4.1.1
dlist 0.7.1.2
ghc-prim 0.4.0.0
hashable 1.2.3.3
integer-gmp 1.0.0.0
mtl 2.2.1
network-info 0.2.0.8
old-locale 1.0.0.7
postgresql-libpq 0.9.1.1
postgresql-simple 0.4.10.0
primitive 0.6.1.0
random 1.1
scientific 0.3.4.2
syb 0.5.1
template-haskell 2.10.0.0
text 1.2.1.3
time 1.5.0.1
transformers 0.4.2.0
unordered-containers 0.2.5.1
uuid 1.3.11
uuid-types 1.0.2
vector 0.10.12.3

PostgreSQL server is 9.3.9

deem0n avatar Nov 09 '15 11:11 deem0n

Well, what I'm saying is (if necessary) turn on logging on the postgresql server side of things, and check the error logs there too.

This particular exception is an IOError, which you can certainly catch yourself on the client side of things. In particular, IIRC, this is raised when your SELECT process_message(?) query eventually calls postgresql-simple's exec, which in turn calls PQsocket and this function returns an invalid file descriptor. Perhaps we need to check for a PQerrorMessage at this point.

lpsmith avatar Nov 10 '15 19:11 lpsmith

Also, while it's certainly a worthy goal to write a daemon that never crashes, in practice you probably still want to have a supervisor process that will detect when a daemon crashes and restarts it automatically. There are a number of options, but both upstart and systemd can natively handle simple single-process supervision like this.

lpsmith avatar Nov 10 '15 19:11 lpsmith

Thanks, very helpful suggestions.

May be this discussion would be interesting and put some light on the problem.

deem0n avatar Nov 11 '15 13:11 deem0n

Well, I didn't read the whole thread carefully, but I don't think we have any evidence that the current issue is related to the psycopg one that you linked, and perhaps a bit of evidence that it's not.

But I do find this to be a slightly curious event over loopback connections. So:

  1. Has this issue recurred?
  2. In the instances where this recurred, has the exception been exactly the same? In particular, has this "failed to fetch file descriptor" exception ever originated from getNotification instead?
  3. Have you checked the postgresql server logs for any related error messages?
  4. Have you checked any system logs for related error messages? (e.g. if the postgres process that handles your connection was killed by the system, by say the OOM killer or due to a ulimit violation, then it could cause this exception in Haskell-land, and it may leave some evidence in the system logs.)
  5. Have you run a packet capture on the connection? A command along the following lines might be a reasonable place to get started, but I would highly recommend referring to the manpage:
tcpdump -i lo -n -s 65535 -C 10 -w pgdebug_%Y%m%dT%H%M%S%z.pcap 'tcp port 5432'

This will capture all loopback packets that have a source or destination port of 5432, which should capture every loopback connection to postgres. If there's a lot of other connections to postgres via loopback, you may want to refine your filter expression somehow.

lpsmith avatar Nov 11 '15 15:11 lpsmith

Also, in many distributions, SSL will by default be used in a situation like this; so if you do a packet capture, you may want to include sslmode=disable in your postgresql connection string, use the PGSSLMODE environment variable to do the same, or disable it in the postgresql server config (e.g. via pg_hba.conf). And of course, you should verify that you can actually read the packet captures before you wait for the problem to recur.

lpsmith avatar Nov 11 '15 15:11 lpsmith

Hello

  1. I did not use SSL, it is disabled by pg_hba.conf for localhost
  2. I changed things slightly on the server to reduce amount of payload in postgres messages. haskell daemon worked for 4 days without any problem since then.
  3. I plan to do stress-testing sending a lot of short messages, I will post results&logs in case of reproducible errors

deem0n avatar Nov 12 '15 14:11 deem0n

News out there: it crashed again with the same error. Also, I restarted Postgres due to config file changes and haskell daemon crashed with these lines:

FATAL:  terminating connection due to administrator command
luxmsbi-postgres-sync-logger-exe: Database.PostgreSQL.Simple.Notification.getNotification: resource vanished (failed to fetch file descriptor)

I set up tcpdump traffic logging and will publish results after next crash.

deem0n avatar Nov 17 '15 07:11 deem0n

Regarding the postgresql restart, that's not surprising. Were you expecting something different?

The first message I suspect is a notice being printed by libpq. New enough versions of postgresql-libpq have the ability to turn off these messages, or programmatically capture them. The second message is an IOError exception being intentionally thrown by postgresql-simple, although it's likely possible to turn it into something else in this particular case. There was some discussion of these issues in #155

I have a little utility program that throws that particular exception on a regular basis for slightly different reasons: the program replicates and merges some event tables in near real time, and is connected via a VPN over a the public internet to a site fairly far away in real terms. This connection is somewhat flaky, so whenever the connection disappears, and the operating system finally decides the connection is gone and informs that program, then libpq does it's thing and informs my program by not returning a file descriptor, which causes this error to be raised.

lpsmith avatar Nov 17 '15 17:11 lpsmith