Collecting java.util.logging to log4j2

Everybody wants to write a log. And in Java everybody wants to write their own logging framework, or at least use one of the many different ones. Then someone comes up with a logging-framework framework such as SLF4J.

OK, but what was I about to say. As so many times before, I had a piece of Java software writing a log file using Log4J2. I was also using some libraries/someone else's code that uses java.util.logging to write their log. I wanted to capture those logs and include them in my Log4J2 log file for debugging, error resolution or whatever.

This particular case came up when trying to log errors from the InfluxDB Java driver. The driver uses java.util.logging to keep external dependencies minimal, or something. I used Log4J2 in my app.

So the usual question: how do you merge logs from java.util.logging code that you do not control with your own code using Log4J2, to produce a single unified log file?

Most Googling would tell me all about SLF4J etc. I did not want yet another framework on top of the existing frameworks, with yet more (transitive) dependencies and all sorts of weird stuff. Because I am old and naughty and don’t like too many abstractions just because.

So here is the code to do this with zero external dependencies.

First a log Handler object for java.util.logging to write to Log4J2:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/**
* @author Daddy Bigbelly.
*/
public class JekkuHandler extends Handler {
  //notice that this is the Log4J2 logger here, inside a java.util.logging Handler object
  private static final Logger log = LogManager.getLogger();

  @Override
  public void publish(LogRecord record) {
    //map java.util.logging levels to Log4J2 levels:
    //SEVERE -> error, WARNING and INFO -> info, everything below -> debug
    Level level = record.getLevel();
    if (level.intValue() == Level.SEVERE.intValue()) {
      log.error(record.getMessage(), record.getThrown());
    } else if (level.intValue() >= Level.INFO.intValue()) {
      log.info(record.getMessage(), record.getThrown());
    } else {
      log.debug(record.getMessage(), record.getThrown());
    }
  }

  @Override
  public void flush() {}

  @Override
  public void close() throws SecurityException {}
}

Next setting it up and using it, with the InfluxDB Java driver as an example:

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;
import org.influxdb.impl.BatchProcessor;

import java.util.concurrent.TimeUnit;
import java.util.logging.Handler;
import java.util.logging.Logger;

/**
* @author Daddy Bigbelly.
*/

public class LogCaptureExample {
  public static void main(String[] args) throws Exception {
    //oh no the root password is there
    InfluxDB db = InfluxDBFactory.connect("http://myinfluxdbhost:8086", "root", "root");
    String dbName = "aTimeSeries";
    db.createDatabase(dbName);
    db.enableBatch(2000, 1, TimeUnit.SECONDS);

    //if you look at the InfluxDB driver code for BatchProcessor,
    //which is where we want to capture the log from, you see it uses its own class name to set up its logger.
    //so we use that same class name here to hijack the writes for that logger (the one we want to capture)
    Logger logger = Logger.getLogger(BatchProcessor.class.getName());
    Handler handler = new JekkuHandler();
    logger.addHandler(handler);

    //this runs forever, but the batch mode can throw an error if the network drops,
    //so disconnect the network in the middle of execution to test this
    while (true) {
      Point point1 = Point.measurement("cpu")
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("idle", 90L)
        .addField("user", 9L)
        .addField("system", 1L)
        .build();
      db.write(dbName, "autogen", point1);
    }
  }
}

You could probably quite easily configure this globally, so it would capture all logging written with java.util.logging. I did not need that, so it is not part of the example above.
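
For reference, here is a minimal sketch of what such a global capture might look like, by attaching the same handler to the java.util.logging root logger (this is my assumption of how to do it, not part of the original example):

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GlobalJulCapture {
  public static void installGlobalCapture() {
    //the root logger has the empty string as its name
    Logger root = Logger.getLogger("");
    //optionally remove the default console handler to avoid duplicate output
    for (Handler existing : root.getHandlers()) {
      root.removeHandler(existing);
    }
    //from here on, everything written with java.util.logging goes to Log4J2 via JekkuHandler
    root.addHandler(new JekkuHandler());
    //lower the root level if you also want to capture levels below INFO
    root.setLevel(Level.ALL);
  }
}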

In a similar way, you should be able to capture java.util.logging output into any other logging framework just by changing where the custom Handler writes the logs.

Well there you go. Was that as exciting for you as it was for me?

Performance testing with InfluxDB + Grafana + Telegraf, Part 3

This time running the test with the fixed SNMP based router monitoring and with the client-server monitoring as well. Notice in the following graphs how I learned to also put two scales on a single chart, combine line and bar charts, and add units to all the charts. OOoooo, prettyy….

First up, the number of sessions, the update frequency delays, and the connection (SSL handshake) errors from the test start:

errors_start

Here the “errors” graph shows the overall number of connection errors so far in the test (the orange line) and the number of errors for each second (the green bars). The number of sessions is growing at a relatively steady pace, with a slight drop in rate over time. At around 16:46:45 we start getting errors at a steady rate, which is at around 1300 sessions, and at the same time the tester starts seeing bigger delays.

The router CPU load goes up pretty high quite fast:

nw_cpu_raw_start

At the problem start point of 16:46:45 the router CPU load starts to go over 80% quite often. This is likely a good part of the cause, although since the load is still not at 100% I would have expected other factors to be involved as well (although looking at the router internal graph below, maybe it is at 100%, and the forced 5-second averaging just hides it).

Notice another interesting part on the left. The test starts from about 20% CPU load. Why is this? I speculate it is due to the other traffic over the same network. You know, “other users” streaming YouTube and playing online games, etc., while I run my tests. So this is actually the “background noise” of the system. It is also probably why in this test run I only got to around 3800 clients when I have previously gotten close to 4200 with less background noise.

Looking at the router internal CPU graph:

router_start

It is actually quite high, but in peaks. With the 5-second granularity being the finest available, much of this gets averaged to much lower values in the Grafana chart above. So maybe the CPU load on the router really is the reason for some of those connection errors, and the 5-second averaging the SNMP agent on the router gives us is just partly hiding it.
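
To make concrete how the averaging hides these peaks, here is a tiny made-up example (the numbers are hypothetical, not from the actual test data):

public class AveragingExample {
  public static void main(String[] args) {
    //made-up per-second CPU load samples: one second pegged at 100%, the rest nearly idle
    double[] perSecond = {100.0, 5.0, 5.0, 5.0, 5.0};
    double sum = 0;
    for (double sample : perSecond) {
      sum += sample;
    }
    //the 5-second average is only 24%, even though the CPU was saturated for a full second
    System.out.println("5s average: " + (sum / perSecond.length) + "%");
  }
}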

A couple of other, slightly less relevant, but still quite interesting charts I collected during the run:

bandwidth_overload

This is the “br0”, or interface number 7, from the router (which I speculate is the wireless part). That is, the amount of traffic in/out on this interface. Notice how the “out” part (the orange line) actually has a much bigger scale on the left, while the “in” part (green) has a much smaller scale on the right. At first I wondered how this can be. I would expect much more downloading (in) to happen over this interface than uploading (out).

With my tinfoil hat on I immediately thought: maybe the NSA or the Chinese are downloading all my secrets and streaming video of me picking my nose 24/7 from my compewters? But then I realized this is from the router viewpoint, so “out” probably means the router is sending data to the WLAN clients and “in” means it is receiving data from them. So this would be the opposite of my assumption: “in” is actually uploads and “out” is downloads. Which seems more correct, so I can remove the hat for now.

Anyway, slightly interestingly, the upload part starts growing at a faster rate than the download part when I run the test. This is most likely due to the difference in scale, but it shows how the upload traffic increases proportionally more. In absolute terms I believe the increase is the same, as it is the application server and client sending data to each other over the wireless, so it is up/download over the same channel for the same data.

Just for the kicks, the JVM memory use I collected:

jvm_memory

Nothing special here, but notice again how having the actual correct units on the y-axis is nice.

Now to take a look at the end part of the test. The errors, update frequency delays and router CPU load:

network_cpu_end

The router CPU in the above is actually hitting 100% in a couple of spots, which is quite heavy since these CPU load measures are 5-second averages as explained before. This is why Grafana is upping the graph top boundary to 125%. I should probably have set the graph upper bound in Grafana to 100%, but whatever. At the end, when the test stops creating new connections and old ones die out, the router load also drops back to the level of the background noise.

Which we can also see in the router internal chart:

router_cpu_end

From this we can make an interesting comparison of the direct measures (the ones queried directly from specific SNMP OIDs) vs. the derived ones (calculated from the several raw measures):

nw_direct_end

The scales are different due to auto-scaling: the raw chart is topped at 125%, while the direct one never goes near 100% and is thus topped at 100%. Anyway, this comparison again shows how the longer averaging period hides much of the important information. This is especially visible in the last part of the period, although it is pretty obvious over the whole period. And in case I forgot to mention it before, the “CPU busy” value (blue line in the “Router CPU raw” chart) is the sum of all the other metrics in that chart (user + system + nice + sirq). In any case, the “direct” chart is averaged over a longer term (likely over a minute, as I speculated in my previous posts) and thus more information loss is evident.

The router RAM usage is pretty constant and stayed like this during all my tests. Thus I haven’t put much of it in these posts, but there it is for once in the above chart as well..

Then we take a look at the server CPU load, just to see if it was still low enough not to cause any of these issues:

server_cup_end

And no problem, it never goes over 50%. Interesting spikes though.

We can also look at the whole picture at once for the whole test run. The test runs for about 10 minutes and gets up to about 3800 concurrent sessions:

whole_sessions

and the network errors, router CPU load and update frequency delays in client and server:

whole_cpu_raw

Cute. Are we missing anything? Yes, we do not measure the load on the client side where the tester is running. This could also have a very significant impact. Here is an example of the top running processes at the client machine during the test:

tasks

What are these? Well, Firefox is running Grafana, which seems to really suck on the CPU when set up with complex graphs and continuous refreshes at 5-second intervals. IntelliJ is the IDE I am using to run the tests. The “java” process is likely related to the same, being part of the performance tests. Why do they show up as two separate processes, both sucking on the CPU at relatively high intensity? No idea (arrr..), but maybe “java” is the actual tester and IntelliJ is where it is run from, and it needs to track the tester process, capture prints, etc. Finally, “Python” is the SNMP poller that I run to capture the router metrics.

So what do I think I should have learned from this:

  • I should architect the system to run monitoring from one node, logging database on one node, graphing UI from another node, …
  • I should track all elements of the system, not just the client and the server
  • Long averaging periods for measurements lose important information needed for detailed analysis of the performance tests vs. functionality (e.g., the reasons for network errors as sessions start piling up).
  • However, I should maybe not spend excessive time building shiny monitoring systems when the potential cause has become clear. Although perhaps it is fine if I already have one ready, as I now do..
  • I should architect my test system to become a distributed system, easily deployed on several nodes. This should allow avoiding process resource limits (e.g., threads per process), kernel limits (e.g., file descriptors), and provide more realistic loads over several distributed nodes
  • I should design in monitoring features that are useful in both production and testing

What next? Well, there are a few missing parts. One is that I should monitor the tester clients, as noted above. Another, not-so-obvious, point is that I should synchronize the clocks of all systems as closely as possible. Otherwise I can never link properties such as network errors to metrics from other nodes. Something like NTP should be used, but it is a bit complex to deploy effectively. In fact, distributed computing in general is a bit complex, so I should probably look into building a nice shiny distributed performance test framework with NTP synchronization sometime in the future..

Another interesting aspect here would be to see how much the longer averaging period makes it harder to find actual causes, in a scenario where this could be realistically controlled with other factors.

Performance testing with InfluxDB + Grafana + Telegraf, Part 2

Previously I ran my performance test in a local test environment. However, the actual server should run in the Amazon EC2 cloud on one of the free tier micro instances (yes, I am cheap, but it is more of a hobby project bwaabwaa.. 🙂 ). So I installed the server on my EC2 micro instance and ran the same tests against it. The test client(s) on my laptop, the server in EC2..

So what did I expect to get? Maybe a bit slower, but mainly the same performance. Well, let’s see..

Here is the start of the session:

init_2freq

A couple of things to note. I added a chart showing how many different IP addresses the sessions are coming from (the middle line here). This is just one, due to all sessions coming from my laptop. In the production version this should be more interesting. Also, the rate at which the tester is able to create new sessions and keep them running is much bumpier and slower than when running the same tests locally. Finally, there is the fact that the tester is observing much bigger update frequency variation.

To examine if this delay variation is due to server load or network connections, I tried turning off the tester frequency line and show only the server line:

init_1freq

And sure enough, on the server end there is very little variation. So the variation is due to network load/speed. Not a real problem here, since most likely there would not be hundreds of clients connecting over the same IP line/NAT router. However, the rate at which my simple tester and single-IP approach scales up is much slower and gets slower over time:

1300sessions

So I ran it up to about 1300 clients, as shown above. For more extensive tests I should really provision some short-term EC2 micro instances and use those to load test my server instance. But let’s look at the resource use for now:

1300sessions_mem

So the EC2 micro instance has a maximum of 1GB memory, out of which the operating system takes its own chunk. I started the server JVM without remembering to set the JVM heap limit specifically, so in this case it defaults to about 256MB. However, from my previous tests this was enough for pretty much the 4000 clients I tested with, so I will just go with that for now.
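
If I wanted to set the heap limit explicitly, it would be something like the following, where the JAR name is just a placeholder:

java -Xmx256m -jar my-server.jar

That caps the maximum heap at 256MB; without the flag the JVM picks a default based on the machine’s physical memory.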

And how about the system resources? Let’s see:

byte_count

CPU load never goes above 30%, so that is fine. The system has memory left, so I can allocate more for the JVM if I need to. Actually it has more available than shown above, as I learned looking at this that there is a bunch of cached and buffered memory that is also available, although not listed as free. At least according to “top”.. 🙂

But more interestingly, the “no title” chart at the bottom is now again the network usage chart that did not work on my local OSX instance. In the EC2 environment Telegraf seems to be able to capture this data. This is very useful, as with EC2 they also charge you for your network traffic. So I need to be able to monitor it and figure out how much network traffic I will be generating. This chart shows I initially used much more inbound traffic (uploading my server JAR files etc.). However, as the number of active clients rises to 1300+, the amount of transferred “data out” also rises quite fast and passes “data in” towards the end. This tells me that if the app ever got really popular I would probably be hitting the free tier limits here. But not a real problem at this point, fortunately or unfortunately (I guess more unfortunately 🙂 ).

That is it for the EC2 experiment for now. However, one final experiment. This time I tried in my local network with the server on one host and the tester on another, meaning the traffic actually has to hit the router. Would that be interesting? I figured not, but there was something..

It looked like this:

mini

What is interesting here? The fact that the session rate starts to slow down already, and the frequency variation for the tester gets big fast after about 2000 active sessions. So I guess the network issue is not so much just for EC2, but for anything going beyond localhost. Comforting, in the sense that the connection to the much more distant EC2 instance does not seem to make as big a difference as I initially thought from my EC2 tests.

Finally, an interesting part to investigate in this would be to run this local network test while also using SNMP to query the router for various metrics. As I have previously also written a Python script to do that, I might try that at one point. Just because I am too curious.. But not quite now. Cheers.

Performance testing with InfluxDB + Grafana + Telegraf, Part 1

Previously I played around with ElasticSearch + Kibana, which was nice for visualizing some activities on my Minecraft server. I then proceeded to try the same for some server performance metrics data such as CPU and memory use (collected using some Python scripts I created with the psutil library). ES + Kibana are absolutely great tools for analyzing textual data such as logs. Not so good and effective for numerical time-series type data I would say. Maybe this is why it is so often combined with stuff like Logstash etc.

Anyway, after looking around, I then found InfluxDB and Grafana as more suitable alternatives for my performance data. And what is even nicer is, there is now a nice and shiny tool from the InfluxDB folks called Telegraf that collects all those system metrics for me. And since shiny pictures are nice, I thought I would post some of those here.

On this occasion, I am testing a simple backend server for an Android app. Clients register to the server, which then sends updates for subscribed sensors at 1-second intervals to each client. The client is the Android app in this case.. My tests run from a single laptop and test how much load the server can handle, by continuously registering new clients to receive more updates.

Off we go with the initial tests. I first implemented my test client using plain Java SSL sockets. The app uses a custom binary protocol, and it is simply most cost-effective to keep the connection open and push the data through. Which is why the plain socket approach. I used two types of probes to collect the data: the Telegraf plugins for basic system resources such as CPU, memory and network traffic, and a few custom probes I inserted into the application and into the tester to collect additional data.
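
The actual tester code is not shown here, but a minimal sketch of the plain Java SSL socket approach might look something like this (the host, port and payload are made up for illustration):

import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.io.OutputStream;

public class SslClientSketch {
  public static void main(String[] args) throws Exception {
    //hypothetical host and port; relies on the default JVM trust store for the server certificate
    SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
    try (SSLSocket socket = (SSLSocket) factory.createSocket("myserverhost", 5566)) {
      socket.startHandshake();
      OutputStream out = socket.getOutputStream();
      //whatever the custom binary protocol requires would be written here
      out.write(new byte[]{0x01, 0x02, 0x03});
      out.flush();
    }
  }
}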

Anyway, pics you say:

socket_sessions

This one shows the custom metrics I collected from the server/tester. The memory chart contains four different values. The “mem_used” metric measures the amount of reserved RAM in the JVM. The “mem_total” metric measures the total amount of RAM the JVM has allocated out of all the RAM it has been given. In this case, I gave the JVM a max of 1GB RAM to allocate. The chart shows the JVM never allocates more than 250MB of this, so this is plenty. This is also reflected in the “mem_free_max” metric, which shows how much more, at most, the server could still allocate beyond what the JVM has already allocated.

The session count metric is a measure of how many clients are connected to the server at a time (active sessions). The “frequency” metric here is a measure of how much time, on average, it takes at each point for the server to send the requested data updates to the clients. This chart shows the deviation from the expected 1000 milliseconds is at most 4 milliseconds, so well within range.

These “custom” metrics are all collected using my own measurement agents, which I deployed both at the server and the tester end. On the server, I can also use them during production deployment to keep monitoring the performance, so they are useful beyond performance testing as well.
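
The agent code itself is not in this post, but as a rough sketch, JVM memory metrics like the ones described above could be derived from java.lang.Runtime along these lines (the mapping of metric names to Runtime calls is my assumption):

public class JvmMemoryProbe {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    long total = rt.totalMemory();   //RAM the JVM has currently allocated from the OS
    long free = rt.freeMemory();     //unused part of that allocation
    long max = rt.maxMemory();       //the most the JVM may allocate (e.g. set with -Xmx1g)
    long used = total - free;        //roughly "mem_used"
    long freeMax = max - used;       //roughly "mem_free_max": how much more could still be used
    System.out.println("mem_total=" + total + " mem_used=" + used + " mem_free_max=" + freeMax);
  }
}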

In this initial try with the plain sockets approach, I got up to around 2000 client connections. At this point, I ran into the operating system thread limit. I tried to get around that with another approach but more on that later.

For metrics on the system resources, I used the Telegraf tool, which had the option to directly collect operating system (and various other) metrics and push them into InfluxDB. I forgot to turn Telegraf on at first when doing this example so it is a bit short. Anyway, visualization:

socket_cpu

Here, the top row shows CPU use percentage for each of the four cores on the system. The middle row shows the average of these for total system CPU load. The bottom row is for network traffic as bytes in/out. Unfortunately this did not seem to work on OSX so it is just a flatline for now.
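
My exact Telegraf configuration is not shown in this post, but a minimal config for collecting these CPU, memory and network metrics and pushing them into InfluxDB would look roughly like the following (section names as in recent Telegraf releases; keys have changed between versions, and the URL and database name here are made up):

[[outputs.influxdb]]
  urls = ["http://myinfluxdbhost:8086"]
  database = "telegraf"

# the basic system metric inputs: CPU, memory and network traffic
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.mem]]
[[inputs.net]]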

In any case, 2000 clients seems plenty to test for my app, which has never seen more than a fraction of that in installs. But since I am always too curious, I wanted to try to implement this properly. So I went with Netty (as I needed both Java NIO and SSL support, which is horrible to implement by yourself, thank you very much Java engineers..).
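
The actual server code is not in the post, but a bare-bones Netty server bootstrap with SSL might look roughly like this (Netty 4 API; the certificate/key files, port and protocol handler are placeholders, not the real app):

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;

import java.io.File;

public class NettySslServerSketch {
  public static void main(String[] args) throws Exception {
    //placeholder certificate and key files
    final SslContext sslCtx = SslContextBuilder
        .forServer(new File("server.crt"), new File("server.key"))
        .build();

    EventLoopGroup boss = new NioEventLoopGroup(1);
    EventLoopGroup workers = new NioEventLoopGroup();
    try {
      ServerBootstrap bootstrap = new ServerBootstrap()
          .group(boss, workers)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              //SSL first in the pipeline, then whatever handles the custom binary protocol
              ch.pipeline().addLast(sslCtx.newHandler(ch.alloc()));
              //ch.pipeline().addLast(new MyProtocolHandler()); //application specific
            }
          });
      //placeholder port; blocks here until the server channel is closed
      bootstrap.bind(5566).sync().channel().closeFuture().sync();
    } finally {
      boss.shutdownGracefully();
      workers.shutdownGracefully();
    }
  }
}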

So, using the same visualizations, here are some graphs.

start_sessions

This one shows the start of the test execution. Notice the flatline part on the left when the test is started? The flatline is the measurement agents showing the server JVM statistics before the test clients are started up. Interestingly, the JVM starts by allocating about 256MB of memory out of the possible 1GB. Shortly after more load is put on the JVM, the allocated amount comes down much closer to the actual usage.

As for the frequency measure, it is a bit off at the start, which I would guess is quite normal as the first measurement calibrates itself. What is more noticeable here is that I added a second frequency measure. Now there are two overlapping lines, one for the delay observed at the server end and one for the tester end. This gives me a way to see whether the delay is due to too much load on the server or too much load on the network (and it avoids the need for time synchronization such as NTP or PTP when I distribute this). It also has the benefit that I can keep observing the server delay using the same agent in production.

After a longer run, this is how far I end up at:

end4k_sessions

At around 4000 client connections I hit another limit. This time I have passed the thread count limit, but I hit the limit on the number of open file descriptors available on the operating system. File descriptors are also used for sockets, and the OS has a limit on how many can be open at once. Apparently this can be modified using various tricks involving ulimit and various kernel parameter tuning on OSX. However, I do not wish to mess up my personal laptop, so I give up on this and take the 4000 as enough for now. Otherwise I would take a virtual machine or several, install Ubuntu and mess those up with the file limits, as a VM is easy to redo. Then make several and bombard my server with all of them. But for now this is fine.
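
For reference, on a Linux box (such as those Ubuntu VMs) the per-process file descriptor limit could be raised for the current shell with something like this, assuming the hard limit allows it (the number is just an example):

ulimit -n 65536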

For the Telegraf metrics we get this:

end4k_cpu

So the CPU load goes up steadily towards the end and suddenly drops at the end. Which is kind of weird, as I would expect it to only drop after the test finishes, not 30 seconds before it. Either there is some issue with the different ways the agents take their timestamps, or there are some delays in processing some metrics. I should probably investigate further. However, this does not really affect my performance test, so I am fine with it for now.

Anything else? Well, there are a few interesting bits to show. Let’s see.

Here is a frequency spike that hides all other variation due to being so large:

freq_spike

We can drill down on the flatline to see the variation there much like we did in Kibana (click and drag to select on the chart):

freq_spike_focused

Of course, it is a valid question whether the variation at this very fine granularity makes any difference, or if the large spike is the only thing of real interest here. But it illustrates how these fancy tools can be nice.

Another thing of interest. During some test runs, I managed to get some weird large spikes in the frequency (as well as session registration):

logspike_session

These spikes are the same for both the tester and the server frequency, and the number of new sessions that get through at this time is also flatlining. So something must be hanging the system for close to 30 seconds here. I would suspect some network lag, but I believe I was running both the tester and the server on localhost, so it should not be that (although InfluxDB is on another host). Also, the fact that the server spike is identical (an overlapping line) with the tester spike speaks against this. So a look at the CPU load for this period:

logspike_cpu

There does seem to be some correlation between the highest CPU spikes and the lag spikes in update frequencies. So I tried to think what I had changed or what could be causing this. I figured it was probably the fact that I had added some pretty verbose logging and enabled log rotation + compression. So maybe the logs were filling up fast and getting rotated + compressed at those points. I reduced logging to sane levels and the spikes never came back. Maybe it was that, or maybe it was the sun spots. But it’s gone, so fine with me..

So that is a basic example of how I used this toolset for performance testing. Now, in practice I would not run the server on localhost, nor would the client apps be running on localhost (or on the same host as the server). So I should actually test it in the real server environment. About that next…