Monday, 28 January 2008
Server locked! What did we do to deserve this?!
Dear CardMeeting users,
Sorry for today's outage. It seems Java locked up on us.
I've done as much investigating as I could to try to determine what happened, but it seems like Java just had one of those 1-in-10000 chance things where it locks up completely. I read some bug reports and couldn't find any exact match that would tell me whether the lockup was a Java, a Linux, or a hardware problem. Hmmm, very disturbing. I was able to do a thread dump and look at the trace, and every thread was BLOCKED except for one thread, which reported as being IN_VM:
Thread 22936: (state = IN_VM)
 - java.lang.AbstractStringBuilder.<init>(int) @bci=6, line=45 (Compiled frame)
 - java.lang.StringBuffer.<init>(int) @bci=2, line=91 (Compiled frame)
 - com.woldrich.dcards.protocol.ProtocolEngine.readNonBlankLine(java.io.BufferedReader) @bci=8 (Compiled frame)
 - com.woldrich.dcards.protocol.ProtocolEngine.access$100(com.woldrich.dcards.protocol.ProtocolEngine, java.io.BufferedReader) @bci=2 (Compiled frame)
 - com.woldrich.dcards.protocol.ProtocolEngine$1.run() @bci=40 (Compiled frame)
 - edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call() @bci=4 (Interpreted frame)
 - edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run() @bci=41 (Interpreted frame)
 - edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(java.lang.Runnable) @bci=59 (Interpreted frame)
 - edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=28 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)
Strange place to be stuck, huh? Perhaps something related to strings is buggy in Java?? I hope not, that's pretty fundamental stuff. It's probably more likely to be something freaky-deaky with garbage collection or memory. The thread that is stuck IN_VM is in the middle of constructing a StringBuffer, which has to allocate memory for its character array, so memory management is a likely culprit. Then again, this could just as easily be a sign of a failing part on a DIMM.
JDK 1.6.0_04 was recently released, and I'm going to start testing with that. Since I've never seen this kind of VM lockup before in all the time I've hosted CardMeeting, I doubt we'll see this problem again unless hardware is the culprit. But I want to take any updates I can get for Java, and perhaps 1.6.0_04 will help with server stability.
Sorry again for the outage. I'll try to set up some management or heartbeat monitors on the servers in the future so that I get more timely notice when servers go down or lock up.
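For the curious, here is a rough sketch of what such a heartbeat check could look like, run from a separate process or machine. It is only an illustration: the class name HeartbeatProbe, the "PING" probe line, and the idea that the server answers it with some line of text are all assumptions, not how CardMeeting's protocol actually works. The point is that the probe waits for a reply rather than just a successful connect, since the operating system can keep accepting connections even while every Java thread is stuck.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HeartbeatProbe {

    /**
     * Returns true only if the server accepts a connection AND answers the
     * probe line with a line of its own. The timeout applies to the connect
     * and to each blocking read.
     */
    public static boolean isAlive(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);

            // Hypothetical probe line; a real server would need to recognize it.
            OutputStream out = socket.getOutputStream();
            out.write("PING\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));

            // readLine() returns null if the server closes without replying.
            return in.readLine() != null;
        } catch (IOException e) {
            // Refused connection, connect timeout, or read timeout
            // (SocketTimeoutException is an IOException): treat as down.
            return false;
        }
    }
}
```

A cron job or a small loop could call HeartbeatProbe.isAlive(host, port, 5000) every minute or so and send an alert after a couple of consecutive failures, so a single slow response doesn't raise a false alarm.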
Thanks,
Dave Woldrich
