flowvisor icon indicating copy to clipboard operation
flowvisor copied to clipboard

FlowVisor crash: java.lang.ArrayIndexOutOfBoundsException

Open jbsbbn opened this issue 11 years ago • 11 comments

The production FlowVisor at GENI regional network provider Southern Crossroads (SoX) crashed recently. In syslog:

Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR slicer_cfece52e-c0ec-45e6-9781-54b0c385eab4_dpid=0f:fa:34:40:b5:03:14:00 : STARVING: handling event took 4130ms: org.flowvisor.events.FVIOEvent@42ef4994 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: WARN slicer_1c4c697d-44e4-493f-9f09-c1c96b3d0487_dpid=0f:fa:34:40:b5:03:14:00 : Verifying Slice is not over its flow rule limit
Jul  7 01:05:47  1>Jul last message repeated 16 times
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : MAIN THREAD DIED!!! 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ---------------------------- 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ----- exception in thread Thread[main,5,main]::java.lang.ArrayIndexOutOfBoundsException: 57 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFType.valueOf(OFType.java:185) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFMessage.readFrom(OFMessage.java:130) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.factory.BasicFactory.parseMessages(BasicFactory.java:49) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:54) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.io.FVMessageAsyncStream.read(FVMessageAsyncStream.java:86) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:44) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleIOEvent(FVClassifier.java:460) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleEvent(FVClassifier.java:379) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.events.FVEventLoop.doEventLoop(FVEventLoop.java:168) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.run(FlowVisor.java:196) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.main(FlowVisor.java:229) 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : ---------------------------- 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: ERROR none : restarting after main thread died 
Jul  7 01:05:47 foam 1>Jul  7 01:05:47 flowvisor: WARN classifier-dpid=0f:fa:34:40:b5:03:14:00 : tearing down
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing poll loop
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI JSONRPC SSL WebServer on port 8081
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO OFSwitchAcceptor : Listenning on port 6633
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI XMLRPC SSL WebServer on port 8080
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: FATAL none : failed to spawn APIServer 
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : Shutting down config database.
Jul  7 01:05:49 foam 1>Jul  7 01:05:49 flowvisor: INFO none : Shutting down config database.

They restarted it, and it came back up fine.

What other info can we gather that might help?

jbsbbn avatar Jul 08 '13 17:07 jbsbbn

Would it be possible to find the actual message that generated this crash?

The crash occurred in openflowj, which is the library responsible for marshalling/unmarshalling openflow messages. It crashed while trying to parse a message with type value 57 which is strange because openflow (at least standard of) does not have a message with that type value.

All that said, openflowj should not crash when such a message occurs. I'll escalate this to openflowj's maintainer.

On Jul 8, 2013, at 10:37 AM, Josh Smift [email protected] wrote:

The production FlowVisor at GENI regional network provider Southern Crossroads (SoX) crashed recently. In syslog:

Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR slicer_cfece52e-c0ec-45e6-9781-54b0c385eab4_dpid=0f:fa:34:40:b5:03:14:00 : STARVING: handling event took 4130ms: org.flowvisor.events.FVIOEvent@42ef4994 Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: WARN slicer_1c4c697d-44e4-493f-9f09-c1c96b3d0487_dpid=0f:fa:34:40:b5:03:14:00 : Verifying Slice is not over its flow rule limit Jul 7 01:05:47 1>Jul last message repeated 16 times Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : MAIN THREAD DIED!!! Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ---------------------------- Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ----- exception in thread Thread[main,5,main]::java.lang.ArrayIndexOutOfBoundsException: 57 Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFType.valueOf(OFType.java:185) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.OFMessage.readFrom(OFMessage.java:130) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.protocol.factory.BasicFactory.parseMessages(BasicFactory.java:49) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:54) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.io.FVMessageAsyncStream.read(FVMessageAsyncStream.java:86) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.openflow.io.OFMessageAsyncStream.read(OFMessageAsyncStream.java:44) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleIOEvent(FVClassifier.java:460) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.classifier.FVClassifier.handleEvent(FVClassifier.java:379) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.events.FVEventLoop.doEventLoop(FVEventLoop.java:168) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.run(FlowVisor.java:196) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : org.flowvisor.FlowVisor.main(FlowVisor.java:229) Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : ---------------------------- Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: ERROR none : restarting after main thread died Jul 7 01:05:47 foam 1>Jul 7 01:05:47 flowvisor: WARN classifier-dpid=0f:fa:34:40:b5:03:14:00 : tearing down Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing poll loop Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI JSONRPC SSL WebServer on port 8081 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO OFSwitchAcceptor : Listenning on port 6633 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : initializing FlowVisor UserAPI XMLRPC SSL WebServer on port 8080 Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: FATAL none : failed to spawn APIServer Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : Shutting down config database. Jul 7 01:05:49 foam 1>Jul 7 01:05:49 flowvisor: INFO none : Shutting down config database.

They restarted it, and it came back up fine.

What other info can we gather that might help?

— Reply to this email directly or view it on GitHub.

alshabib avatar Jul 08 '13 17:07 alshabib

AA> Would it be possible to find the actual message that generated this crash?

Mm, maybe. To clarify, by "actual message", you mean like what API call was made immediately before the crash?

Would that be in syslog? If not, where would we find it?

jbsbbn avatar Jul 08 '13 17:07 jbsbbn

No I was hoping for the actual openflow message. That's probably a long shot but you never know ;)

Do you know what OF devices are connected to that FlowVisor?

On Jul 8, 2013, at 10:55 AM, Josh Smift [email protected] wrote:

AA> Would it be possible to find the actual message that generated this crash?

Mm, maybe. To clarify, by "actual message", you mean like what API call was made immediately before the crash?

Would that be in syslog? If not, where would we find it? — Reply to this email directly or view it on GitHub.

alshabib avatar Jul 08 '13 17:07 alshabib

AA> No I was hoping for the actual openflow message. That's probably a long shot but you never know ;)

Heh. I don't think anyone would have that, unless SoX happened to be running a packet capture at that time. I'll ask just in case.

AA> Do you know what OF devices are connected to that FlowVisor?

Just one NEC PF5820 at the moment.

jbsbbn avatar Jul 08 '13 18:07 jbsbbn

Type 57 may be the vendor message NEC uses for big stats replies. I'm not 100% sure, as I haven't been recording their message types, but they use a lot of custom message types as extensions to support their flowtable being larger than what a lot of OF messages can handle by default (various datatypes are not large enough, etc.).

nbastin avatar Jul 08 '13 18:07 nbastin

Sure but the type should be OFPT_VENDOR and not 57 or something crazy.

On Jul 8, 2013, at 11:06 AM, Nick Bastin [email protected] wrote:

Type 57 may be the vendor message NEC uses for big stats replies. I'm not 100% sure, as I haven't been recording their message types, but they use a lot of custom message types as extensions to support their flowtable being larger than what a lot of OF messages can handle by default (various datatypes are not large enough, etc.).

— Reply to this email directly or view it on GitHub.

alshabib avatar Jul 08 '13 20:07 alshabib

Sure, but even if the switch is doing something totally wrong with the type (which I don't know enough about this to say whether it is or not), FV shouldn't crash as a result.

jbsbbn avatar Jul 09 '13 15:07 jbsbbn

Agreed, I am filing a report on OpenFlowJ. Would you happen to have a packet capture?

alshabib avatar Jul 11 '13 22:07 alshabib

Sorry, I thought I'd mentioned this earlier: No, no one happened to be running a packet capture at the time of the crash.

Can you reproduce the bug? Can you get a package capture that way?

Can you do anything to ameliorate this on the FV side, so that FV doesn't die even when it encounters a bug in openflowj?

jbsbbn avatar Jul 12 '13 01:07 jbsbbn

Sure, we shouldn't crash when ofj crashes although if a packet isn't parsed correctly for some reason it could cause incorrect fv behaviour for some pathological cases. I don't know what the correct behaviour is here. Need to think about that some more.

Ali Al-Shabibi (sent from handheld)

On 11 juil. 2013, at 18:07, Josh Smift [email protected] wrote:

Sorry, I thought I'd mentioned this earlier: No, no one happened to be running a packet capture at the time of the crash.

Can you reproduce the bug? Can you get a package capture that way?

Can you do anything to ameliorate this on the FV side, so that FV doesn't die even when it encounters a bug in openflowj?

— Reply to this email directly or view it on GitHub.

alshabib avatar Jul 12 '13 01:07 alshabib

Drop the packet. You don't know how to slice it, and you're promising to slice the control plane. There's nothing you can do except drop it and log it.

nbastin avatar Jul 12 '13 01:07 nbastin