In this article, I'll discuss a tooless technique used in tuning java application for high performance.
Take my wife - please!
I like talking about stuff that's unnew, unsexy and unappreciated [insert joke about wife here]. Often, this is where I uncover the most interesting stuff that's been overlooked (e.g. your wife). That old Unix command kill is just that - a simple command that has a very interesting application in the java performance tuning space. In case you don't know, the inappropriately named Unix kill command sends a signal to a Unix process. For example, kill -9 pid, terminates the process referred to by pid. However, kill -3 java_pid is interpreted by JVMs as an instruction to perform a thread dump. JVMs respond by printing a snapshot of each thread's execution point all the while leaving the JVM process running. It turns out that this is all you really need to tune java application.
Asking the right question
This is not what the tool vendors want you to hear. But the truth of the matter is that many of these performance tuning tools miss the mark when it comes to giving developers what they really need. Tools answers this question: "exactly how much time is spent in each method of an invocation stack?" While this has some value, a better question to ask is: "what's blocking my application from running fast?" If we don't ask the right question, we won't get the right answer.
This is not what the tool vendors want you to hear. But the truth of the matter is that many of these performance tuning tools miss the mark when it comes to giving developers what they really need. Tools answers this question: "exactly how much time is spent in each method of an invocation stack?" While this has some value, a better question to ask is: "what's blocking my application from running fast?" If we don't ask the right question, we won't get the right answer.
Bound to under perform
In performance tuning, we don't add to an application to make it run faster. In fact it is quite the opposite; we remove things that get in the way. Applications inherently want to run as fast as hardware allows but are bogged down by barriers. These need to be uncovered and removed. There are countless reasons why an application may under perform but at the root is always a portion of the application that is over-consuming a resource. This resource can be the CPU, I/O resources (network, file, database), heap or even a shared lock. Regardless of the root cause, the Unix kill command will always point out the problem. So how can it do this? Let's take a look.
Using the kill -3 command
Say we have a multi-threaded java application with throughput requirements of 200 transactions-per-second. However, the application can only handle 100 transactions-per-second. Load tests reveal that the CPU usage is only at 50% capacity. Yet, the application does not scale beyond this frontier. Since the CPU is only at 50% capacity, there is a non-CPU resource which is being over consumed and acting as a bottleneck. Via the kill -3 command, we issue successive thread dumps in 30 second intervals to find out what this bottleneck is. Here's a thread dump snippet example (performed with BEA's jrockit JVM).
------------------------------------------------------------
Thread: '1'" id=57 idx=0x100 tid=7937 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '2'" id=58 idx=0x110 tid=7401 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '3'" id=59 idx=0x120 tid=7405 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
------------------------------------------------------------
Thread dump #2
------------------------------------------------------------
Thread: '1'" id=57 idx=0x100 tid=7937 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '2'" id=58 idx=0x110 tid=7401 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '3'" id=59 idx=0x120 tid=7405 prio=5 alive, in native, alive, daemon
java/io/OutputStreamWriter.write(OutputStreamWriter.java:204)[optimized] at
java/io/Writer.write(Writer.java:126)[optimized] at
org/jdom/output/XMLOutputter.printElement(XMLOutputter.java:918)[optimized] at
com/corp/util/DocumentBuilder.extractDocument(DocumentBuilder.java:195)[inlined]
------------------------------------------------------------
Thread: '1'" id=57 idx=0x100 tid=7937 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '2'" id=58 idx=0x110 tid=7401 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '3'" id=59 idx=0x120 tid=7405 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
------------------------------------------------------------
Thread dump #2
------------------------------------------------------------
Thread: '1'" id=57 idx=0x100 tid=7937 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '2'" id=58 idx=0x110 tid=7401 prio=5 alive, in native, blocked, daemon
-- Blocked trying to get lock: java/util/Collections$SynchronizedMap@0x2a8[fat lock] jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at
jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1610)[optimized] at
jrockit/vm/Locks.lockFat(Locks.java:1710)[optimized] at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1274)[optimized] at
com/corp/base/SessionStore.appendToTransaction(SessionStore.java:351)[inlined]
Thread: '3'" id=59 idx=0x120 tid=7405 prio=5 alive, in native, alive, daemon
java/io/OutputStreamWriter.write(OutputStreamWriter.java:204)[optimized] at
java/io/Writer.write(Writer.java:126)[optimized] at
org/jdom/output/XMLOutputter.printElement(XMLOutputter.java:918)[optimized] at
com/corp/util/DocumentBuilder.extractDocument(DocumentBuilder.java:195)[inlined]
------------------------------------------------------------
As highlighted in red, a pattern emerges; there are many threads stuck waiting for a shared lock. For the sake of brevity, the guilty thread is not shown in this snippet but this particular bottleneck is caused by a method holding on to a lock for too long a period. Being a critical path in the application, it needs to be modified if we are to break through the throughput barrier.
Thread dumps are like pictures and do not show how long a thread is stuck in a particular place. But using successive thread dumps adds the dimension of time to the picture. If successive thread dumps present the same thread blocking in the same location, as was the case above, it points to a performance bottleneck.
Thread dumps are like pictures and do not show how long a thread is stuck in a particular place. But using successive thread dumps adds the dimension of time to the picture. If successive thread dumps present the same thread blocking in the same location, as was the case above, it points to a performance bottleneck.
Removing this barrier will allow the application to scale up to the next level. At that point, new bottlenecks will emerge. This process can continue until the costs no longer justify the benefits.
Conclusion
Performance tuning must be viewed holistically. Threads don't execute in a vacuum but rather in an ecosystem of threads each competing for shared resources in infinite timing combinations. The kill command, in all its simplicity, works because it only cares to answer one question: what sticks out like a sore thumb? This is why it such an effective performance tuning tool. In the end, no other questions matter.
You may wonder if this technique really works in the real world. In fact, I have used it scale up applications considerably. With no fancy GUI, no costly license, the kill command is the killer app of java performance tuning.
You may wonder if this technique really works in the real world. In fact, I have used it scale up applications considerably. With no fancy GUI, no costly license, the kill command is the killer app of java performance tuning.
2 comments:
Why not just take some free java profiler and use sampling mode? There are lots of them.
However, there is one case where this technique is useful, namely when the application is started without -Xdebug, but I haven't seen people who have a good reason to do that for a long time.
One of the most useful tools I am really missing is a monitor statistic, which class and object have how many blocked threads for how long. This will make that kind of debugging quite easy. You can also see which synchronized blocks are actually painful and which not.
Even with sampling profilers I don't know any who can aggregate it by lock instead of by block.
Post a Comment