Saturday, August 4, 2018

Stack Trace got eaten up in the log

During the production troubleshooting today, we found out in the log that NPE was logged occasionally, but without detailed stack trace. We've made sure that we always log the stack trace in our code. So, what was going on?

After searching back in our log history (we have a log aggregation service that keeps log history), we found there were NPEs with the full stack trace, then after some time, the stack trace was gone.

In this end, this turns out to be caused by a Hotspot JIT compiler optimization, when a particular exception keeps being thrown up to a threshold, the same exception stack trace will not be filled anymore due to performance impact of throwing exceptions.

The solution is to turn off this optimization by providing this JVM option:

 -XX:-OmitStackTraceInFastThrow

References:

1. Stackoverflow post on "NullPointerException in Java with no StackTrace"
2. A more detailed analysis with a demo app




Friday, August 3, 2018

scheduled task silently died with ScheduledThreadPoolExecutor

We had a very annoying bug lately where a scheduled job silently stopped executing without our knowledge. After debugging and carefully reading Javadoc again, it turns out that the task throws an Exception thus suppressing the subsequent executions of the task.

Javadoc for ScheduledExecutorService.scheduleWithFixedDelay (scheduleAtFixedRate has the same issue):

ScheduledFuture<?> scheduleWithFixedDelay(Runnable command,
                                          long initialDelay,
                                          long delay,
                                          TimeUnit unit)
Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next. If any execution of the task encounters an exception, subsequent executions are suppressed. Otherwise, the task will only terminate via cancellation or termination of the executor.

So, the quick fix is to add a try-catch block around the task execution logic to catch Exception. I have created a simple demo app to show this issue and how the fix works with some analysis why it happens:

https://github.com/guozheng/scheduledjob

The lesson learned is that we need to read doc more carefully and read source code to understand the library behavior closely.