Performance bugs can be difficult to diagnose. If the bug is intermittent, it gets even tougher.
Fortunately, Java provides tools like VisualVM for monitoring processes, but things get more complicated when you are dealing with performance issues in a distributed system. Our application consists of many "job processes" which distribute the work of building large reports. During our stress tests, we'd occasionally see a report run with very poor performance. We needed a straightforward way to capture performance information for all of the processes at once.
perf4j provides a simple StopWatch mechanism. Wiring it into our project is straightforward—we added the dependency to our pom.xml...
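A minimal sketch of the dependency; the version shown is one representative release, so check for the latest:

```xml
<dependency>
    <groupId>org.perf4j</groupId>
    <artifactId>perf4j</artifactId>
    <version>0.9.16</version>
</dependency>
```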
...then wired it up to log to a file in our logback configuration. The additivity flag on the perf4j logger ensures that these log messages will go only to perf.log and not the console or our regular application log.
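The configuration looks roughly like this; the appender name, file path, and pattern are illustrative. `org.perf4j.TimingLogger` is the logger name perf4j's stopwatches use by default:

```xml
<appender name="PERF_FILE" class="ch.qos.logback.core.FileAppender">
    <file>perf.log</file>
    <encoder>
        <pattern>%msg%n</pattern>
    </encoder>
</appender>

<!-- additivity="false" keeps timing messages out of the root logger's appenders -->
<logger name="org.perf4j.TimingLogger" level="INFO" additivity="false">
    <appender-ref ref="PERF_FILE"/>
</logger>
```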
Now at various points in the code, we can start and stop the watch to time events.
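A sketch of the usage, assuming the slf4j-backed stopwatch variant; the class and method names here are hypothetical, but the `SQL.report` tag matches the event we filter on later:

```java
import org.perf4j.StopWatch;
import org.perf4j.slf4j.Slf4JStopWatch;

public class ReportSection {

    public void buildSection() {
        // Slf4JStopWatch starts timing on construction and
        // logs through slf4j when stop() is called
        StopWatch stopWatch = new Slf4JStopWatch();

        runReportQuery(); // hypothetical unit of work being timed

        // the tag identifies this event in the perf log
        stopWatch.stop("SQL.report");
    }

    private void runReportQuery() { /* ... */ }
}
```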
The result is log statements formatted as follows.
The fields here are:
- tag - the event
- start - timestamp when the stopwatch started
- time - time in milliseconds for the event
Now we have performance data in log files, a good first step. The perf4j jar comes with a parser to calculate the avg/min/max time for each tag, although this solution doesn't scale with multiple processes spread across servers.
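The parser can be run straight from the command line against a log file, since it is the jar's default main class; the jar version here is illustrative:

```shell
java -jar perf4j-0.9.16.jar perf.log
```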
Enter logback's DBAppender, which allows us to direct the timing logs to a central database. Table-creation scripts for every popular RDBMS are provided. For this application, we're using Oracle so the setup looks something like this (note the use of system properties for the connection info):
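A sketch of the appender configuration; the appender name and the `perf.db.*` system property names are assumptions for illustration:

```xml
<appender name="PERF_DB" class="ch.qos.logback.classic.db.DBAppender">
    <connectionSource class="ch.qos.logback.core.db.DriverManagerConnectionSource">
        <driverClass>oracle.jdbc.OracleDriver</driverClass>
        <!-- connection info supplied via -Dperf.db.url=..., etc. -->
        <url>${perf.db.url}</url>
        <user>${perf.db.user}</user>
        <password>${perf.db.password}</password>
    </connectionSource>
</appender>
```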
...and switch our perf4j logger to use this appender instead of a file.
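The logger declaration stays the same apart from the appender reference; `PERF_DB` here is whatever name you gave the DBAppender:

```xml
<logger name="org.perf4j.TimingLogger" level="INFO" additivity="false">
    <appender-ref ref="PERF_DB"/>
</logger>
```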
DBAppender logs to the following tables.
logging_event contains one row per log event. The key fields are:
- formatted_message - in our case, the logback message described above
- thread_name - name of the thread that logged the message
- event_id - sequence number for the event
logging_event_property has a many:1 relationship with logging_event: it contains one row for each context property (MDC) you've set on the logger. Our application has a single parseable MDC property, "jobInfo", containing the type and id of the report being run, as well as the id of the process running it.
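A sketch of how such a property gets set, assuming slf4j's MDC; the key/value format shown is a hypothetical parseable encoding, not necessarily our exact one:

```java
import org.slf4j.MDC;

public class JobRunner {

    public void runJob(String reportType, long reportId, long processId) {
        // one parseable MDC property; every log event on this thread
        // (including perf4j timing events) will carry it into
        // logging_event_property
        MDC.put("jobInfo", "type=" + reportType + ",id=" + reportId + ",proc=" + processId);
        try {
            buildReport(); // hypothetical timed work
        } finally {
            MDC.remove("jobInfo"); // avoid leaking context onto reused threads
        }
    }

    private void buildReport() { /* ... */ }
}
```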
Now that we have every process logging to a central place, we can use SQL to filter, group, and sort in any number of ways. In the example below, the inner query parses the log and MDC strings. The outer query sorts to keep the same reports together and filters "SQL.report" events with a time > 1 second.
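A sketch of the shape of such a query, assuming Oracle 11g regular-expression functions and a jobInfo value encoded like `type=claims,id=42,proc=7`; the parsing expressions would need to match your exact message and MDC formats:

```sql
SELECT *
  FROM (SELECT e.event_id,
               -- pull the tag and elapsed time out of the perf4j message
               REGEXP_SUBSTR(e.formatted_message, 'tag\[([^]]*)\]', 1, 1, NULL, 1) AS tag,
               TO_NUMBER(
                 REGEXP_SUBSTR(e.formatted_message, 'time\[(\d+)\]', 1, 1, NULL, 1)) AS time_ms,
               -- pull the report id out of the jobInfo MDC property
               REGEXP_SUBSTR(p.mapped_value, 'id=([^,]*)', 1, 1, NULL, 1) AS report_id
          FROM logging_event e
          JOIN logging_event_property p
            ON p.event_id = e.event_id
           AND p.mapped_key = 'jobInfo')
 WHERE tag = 'SQL.report'
   AND time_ms > 1000
 ORDER BY report_id, time_ms DESC;
```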
With the help of these tools we were able to quickly generate reports to isolate the performance issue.
It's worth noting that logging to a database itself hurts performance. For our scenario, this was acceptable because we were looking for relative performance timings inside a few known hotspots.
The logback documentation also suggests using pooled connections to reduce that overhead, but in general database logging is probably not something you'd want to leave on in production indefinitely. Fortunately, turning it off is straightforward.
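Disabling it is just a matter of the logger level (or removing the appender-ref):

```xml
<logger name="org.perf4j.TimingLogger" level="OFF"/>
```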
Ed is a lead engineer for the revenue cycle team at Advisory Board. Since joining in March 2011, he has been working on products to help hospitals identify risk-based payment opportunities.