quickfixj icon indicating copy to clipboard operation
quickfixj copied to clipboard

Stale file handle and Stream Closed issues after session end time

Open ajinkyatankhiwale1510 opened this issue 3 years ago • 12 comments

Describe the bug We have two nodes(different physical machines) working in active-active mode, where only one node is supposed to handle all incoming traffic. In case of failover, F5 would transfer all the traffic to node 2 and then node 2 will refresh its state and continue processing. Both nodes share the same store settings and files. Its observed that during daily end time at 00:00:00 UTC. Node 1 first throws stale file handle error and after some time throws stream closed error indefinitely. End user then cannot establish FIX session, until both node 1 and node 2 are restarted. It's our requirement to have file based store.

20220824-00:00:00.582: Error reading/writing in MessageStore java.io.IOException: Stale file handle at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at java.io.DataOutputStream.flush(DataOutputStream.java:123) at quickfix.FileStore.set(FileStore.java:373) at quickfix.SessionState.set(SessionState.java:301) at quickfix.Session.persist(Session.java:2563) at quickfix.Session.sendRaw(Session.java:2622) at quickfix.Session.generateLogout(Session.java:1479) at quickfix.Session.generateLogout(Session.java:1448) at quickfix.Session.reset(Session.java:884) at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022) at quickfix.Session.next(Session.java:1921) at quickfix.mina.SessionConnector$SessionTimerTask.run(SessionConnector.java:347) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750)

and then

20220824-00:06:50.046: Error during message processing quickfix.RuntimeError: quickfix.RuntimeError: java.io.IOException: Stream Closed at quickfix.Session.next(Session.java:1181) at quickfix.Session.next(Session.java:1228) at quickfix.mina.ThreadPerSessionEventHandlingStrategy$MessageDispatchingThread.doRun(ThreadPerSessionEventHandlingStrategy.java:222) at quickfix.mina.ThreadPerSessionEventHandlingStrategy$ThreadAdapter.run(ThreadPerSessionEventHandlingStrategy.java:146) at java.lang.Thread.run(Thread.java:750) Caused by: quickfix.RuntimeError: java.io.IOException: Stream Closed at quickfix.SessionState.reset(SessionState.java:387) at quickfix.Session.resetState(Session.java:2665) at quickfix.Session.reset(Session.java:887) at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022) at quickfix.Session.nextLogon(Session.java:2119) at quickfix.Session.next(Session.java:1050) ... 4 more Caused by: java.io.IOException: Stream Closed at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at java.io.DataOutputStream.flush(DataOutputStream.java:123) at java.io.FilterOutputStream.close(FilterOutputStream.java:158) at quickfix.FileStore.close(FileStore.java:218) at quickfix.FileStore.close(FileStore.java:209) at quickfix.FileStore.closeAndDeleteFiles(FileStore.java:223) at quickfix.FileStore.initialize(FileStore.java:101) at quickfix.FileStore.reset(FileStore.java:405) at quickfix.SessionState.reset(SessionState.java:385) ... 9 more Suppressed: java.io.IOException: Stream Closed at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at java.io.FilterOutputStream.close(FilterOutputStream.java:158) at java.io.FilterOutputStream.close(FilterOutputStream.java:159) ... 15 more Cause: java.io.IOException: Stream Closed quickfix.RuntimeError: java.io.IOException: Stream Closed at quickfix.SessionState.reset(SessionState.java:387) at quickfix.Session.resetState(Session.java:2665) at quickfix.Session.reset(Session.java:887) at quickfix.Session.resetIfSessionNotCurrent(Session.java:3022) at quickfix.Session.nextLogon(Session.java:2119) at quickfix.Session.next(Session.java:1050) at quickfix.Session.next(Session.java:1228) at quickfix.mina.ThreadPerSessionEventHandlingStrategy$MessageDispatchingThread.doRun(ThreadPerSessionEventHandlingStrategy.java:222) at quickfix.mina.ThreadPerSessionEventHandlingStrategy$ThreadAdapter.run(ThreadPerSessionEventHandlingStrategy.java:146) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.IOException: Stream Closed at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at java.io.DataOutputStream.flush(DataOutputStream.java:123) at java.io.FilterOutputStream.close(FilterOutputStream.java:158) at quickfix.FileStore.close(FileStore.java:218) at quickfix.FileStore.close(FileStore.java:209) at quickfix.FileStore.closeAndDeleteFiles(FileStore.java:223) at quickfix.FileStore.initialize(FileStore.java:101) at quickfix.FileStore.reset(FileStore.java:405) at quickfix.SessionState.reset(SessionState.java:385) ... 9 more Suppressed: java.io.IOException: Stream Closed at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at java.io.FilterOutputStream.close(FilterOutputStream.java:158) at java.io.FilterOutputStream.close(FilterOutputStream.java:159) ... 15 more

To Reproduce Node 1 and Node 2 has Start Time and End Time as 00:00:00-UTC - 00:00:00-UTC sharing same FileStorePath (configured for failover scenario) and an active FIX connection on node 1, during session end time, it's sometimes seen that file resetting does not work as expected.

Expected behavior In case of failover setup, if members are connected to FIX environment. Session reset should happen seamlessly without any error.

system information:

  • OS: Red Hat Enterprise Linux 8.6 (Ootpa)
  • Java version : openjdk 8
  • QFJ Version 2.3.0

Additional context It works well with single node, if both nodes are up and running then the issue is seen.

ajinkyatankhiwale1510 avatar Aug 29 '22 10:08 ajinkyatankhiwale1510

Seems to be related to #399

Any idea how this could be reproduced without two nodes since I don't have such a setup? Could you please share your session settings?

chrjohn avatar Aug 29 '22 11:08 chrjohn

Node 1(machine 1) fix-acceptor-settings.cfg [DEFAULT] ConnectionType=acceptor DataDictionary=./FIX44.xml SenderCompID=XXXXX ThreadModel=ThreadPerSession BeginString=FIX.4.4 FileIncludeMilliseconds=Y FileIncludeTimeStampForMessages=Y FileLogPath=/local/fix-engine-acceptor_node/FIX_LOG <- Logs generated on node 1 /local/ directory FileStorePath=/home/FIX_Seq_Store <- Store generated in /home/ directory, is a shared directory between node 1 and node 2 StartTime=00:00:00 EndTime=00:00:00 SocketUseSSL=Y SocketKeyStore=/home/FIXClient.ks SocketKeyStorePassword=clientpass RefreshOnLogon=Y

[SESSION] TargetCompID=ABCDE SocketAcceptPort=12002

Node 2(machine 2 failover) fix-acceptor-settings.cfg [DEFAULT] ConnectionType=acceptor DataDictionary=./FIX44.xml SenderCompID=XXXXX ThreadModel=ThreadPerSession BeginString=FIX.4.4 FileIncludeMilliseconds=Y FileIncludeTimeStampForMessages=Y FileLogPath=/local/fix-engine-acceptor_node/FIX_LOG <- Logs generated on node 2 /local/ directory FileStorePath=/home/FIX_Seq_Store <- Store generated in /home/ directory, is a shared directory between node 1 and node 2 StartTime=00:00:00 EndTime=00:00:00 SocketUseSSL=Y SocketKeyStore=/home/FIXClient.ks SocketKeyStorePassword=clientpass RefreshOnLogon=Y

[SESSION] TargetCompID=ABCDE SocketAcceptPort=12002

Without two nodes its difficult to reproduce, on single node, starting two instances of acceptor and giving different SocketAcceptPort would be the identical scenario in this case. e.g. node 2's SocketAcceptPort can be 22002

ajinkyatankhiwale1510 avatar Aug 29 '22 11:08 ajinkyatankhiwale1510

Are you able to reproduce this in a test environment? If yes, could you test a pull request branch in your test environment? The only solution I could come up with is to ignore the IOExceptions on closing the FileStore's files. This will help with the Stream Closed exceptions on closing the store but not with the Stale file handle exceptions (but actually I don't think we can do anything about them anyway).

chrjohn avatar Aug 29 '22 15:08 chrjohn

Yes, we were able to replicate the scenario in test environment, we used two separate machines hosted our acceptor on both machines and had FileStore shared between machines. However, I'm not sure if I could test any pull request in our environment. could you please elaborate your solution of ignoring the IOException and how it will help us to establish FIX session again?

ajinkyatankhiwale1510 avatar Aug 30 '22 08:08 ajinkyatankhiwale1510

Currently the IOException causes the message store to never close the old store and open new files for storing. This error then happens on every access to the store, as you reported. On ignoring the Exception on closing, QFJ should be able to open new files for the message store.

chrjohn avatar Aug 30 '22 10:08 chrjohn

Thanks for your explanation, could you please let me know how can we fix this issue? Generally, in case of IOException there should be a finally block which must close this files or this is something which is currently missing? For time being, do you suggest any settings which would not reset the session, so such IOException do not occur? let's say using NonStopSession

ajinkyatankhiwale1510 avatar Aug 30 '22 15:08 ajinkyatankhiwale1510

Generally, in case of IOException there should be a finally block which must close this files or this is something which is currently missing

Currently the FileStore tries to close the files but there is an IOException thrown on closing. IMHO this exception can be ignored when closing the store since we can assume that the file has been moved/deleted or the like. When we catch that exception the store will subsequently create new files for the store and normal operation should be possible again.

Using NonStopSession is a good idea BTW, that should prevent the problem with the automated reset. But you need to reset the sequence numbers at some point so eventually might run into the same problem, don't you? Unless you can make sure that there is no failover taking place when trying to reset.

chrjohn avatar Aug 31 '22 15:08 chrjohn

Sequence reset can be done by initiator at SessionLogon, and that is perfectly fine for us. I could try NonStopSession and see if that helps our case.

Otherwise, we will have to somehow handle the exception to ignore IOException. Do you have any idea in that case what can be done?

ajinkyatankhiwale1510 avatar Sep 02 '22 08:09 ajinkyatankhiwale1510

Otherwise, we will have to somehow handle the exception to ignore IOException. Do you have any idea in that case what can be done?

I am not sure, but you could try to extend the FileStore and catch the thrown IOException in the reset() method and then re-initialize the store via the code that is in FileStore.initialize().

chrjohn avatar Sep 05 '22 12:09 chrjohn

do you see any issues with NonStopSession if the sequence number are not reset?

ajinkyatankhiwale1510 avatar Sep 28 '22 14:09 ajinkyatankhiwale1510

At some point (depending on how much messages you will send/receive) the sequence number will probably overflow (AFAIR we have a bug there). Moreover, the larger your store gets, the slower it will become. Potentially the counterparty could ask for a resend of all of your messages which could be billions of messages.

chrjohn avatar Oct 07 '22 08:10 chrjohn

I do agree with your approach of having our own implementation of FileStore, on that lines I have managed to implement MessageStore and in reset method of my implementation I have handled exceptions instead of throwing, do you think this would be a fine enough? public class FixFileStore implements MessageStore, Closeable ........... public void reset() { try { LOGGER.info("Resetting store files"); initialize(true); } catch (IOException e) { LOGGER.error("Error closing the store files, re-initialising store files again"); try { close(); String mode = "rw" + (this.syncWrites ? "d" : ""); this.messageFileWriter = new RandomAccessFile(this.msgFileName, mode); this.messageFileReader = new RandomAccessFile(this.msgFileName, "r"); this.senderSequenceNumberFile = new RandomAccessFile(this.senderSeqNumFileName, mode); this.targetSequenceNumberFile = new RandomAccessFile(this.targetSeqNumFileName, mode); initializeCache(); } catch (IOException e1) { LOGGER.error("Error closing the store files in exception"); } } }

ajinkyatankhiwale1510 avatar Nov 22 '22 11:11 ajinkyatankhiwale1510