Performance testing with InfluxDB + Grafana + Telegraf, Part 2

Previously I ran my performance test in a local test environment. However, the actual server should run in the Amazon EC2 cloud on one of the free tier micro instances (yes, I am cheap but it is more of a hobby project bwaabwaa.. 🙂 ). So I installed the server on my EC2 micro instance and ran the same tests against. The test client(s) on my laptop, the server in EC2..

So what did I expect to get? Maybe a bit slower but mainly the same performance. Well, lets see..

Here is the start of the session:

init_2freq

Couple of things to note. I added a chart showing how many different IP addresses the sessions are coming from (middle line here). This is just one due to hitting them all from my laptop. In the production version this should be more interesting. Also, the rate at which the tester is able to create new sessions and keep them running is much bumbier and slower than when running the same tests locally. Finally, there is the fact that the tester is observing a much bigger update frequency variation.

To examine if this delay variation is due to server load or network connections, I tried turning off the tester frequency line and show only the server line:

init_1freq

And sure enough, on the server end there is very little variation. So the variation is due to network load/speed. Not a real problem here since most likely there would not be hundreds of clients connection over the same IP line/NAT router. However, the rate at which my simple tester and single IP approach scales up is much slower and gets slower over time:

1300sessions

So I run it for about 1300 clients as shown above. For a more extensive tests I should really provision some short-term EC2 micro instances and use those to load test my server instance. But lets look at the resource use for now:

1300sessions_mem

So the EC2 micro instance has a maximum of 1GB memory, out of which the operating system takes its own chunk. I started the server JVM in this case without remembering to set the JVM heap limit specifically, so in this case it is set to about 256MB. However, from my previous tests this was enough for pretty much the 4000 clients I tested with so I will just go with that for now.

And how about the system resources? Lets see:

byte_count

CPU load never goes above 30% so that is fine. System has memory left so I can allocate more for the JVM if I need so fine. Actually it has more than shown above as I learned looking at this that there is a bunch of cached and buffered memory that is also available although not listed as free. At least for “top”.. 🙂

But more interestingly, the “no title” chart at the bottom is now again the network usage chart that on my OSX local instance did not work. In the EC2 environment Telegraf seems to be able to capture this data. This is very useful as in EC2 usage they also charge you for your network traffic. So I need to be able to monitor it and to figure out how much network traffic I will be generating. This chart shows I have initially used much more inbound traffic (uploading my server JAR files etc). However, as the number of active clients rises to 1300+, the amount of transferred “data out” also rises quite fast and passes “data in” towards the end. This tells me if the app ever got really popular I would probably be hitting the free tier limits here. But not a real problem at this point, fortunately or unfortunately (I guess more unfortunately 🙂 ).

That is it for the EC2 experiment at this time. However, one final experiment. This time I tried in my local network with the server on one host and the tester on another. Meaning the traffic actually has to hit the router. Would that be interesting? I figured not but there was something..

It looked like this:

mini

What is interesting here? The fact that the session rate starts to slow down already, and the frequency variation for the tester gets big fast after about 2000 active sessions. So I guess the network issue is not so much just for EC2 but for anything passing beyond localhost. Conforting that, in the sense that the connection to the much further EC2 instance does not seem to make such as big difference as I initially thought from my EC2 tests.

Finally, an interesting part to investigate in this would be to run this local network test while also using SNMP to query the router for various metrics. As I have previously also written a Python script to do that, I might try that at one point. Just because I am too curious.. But not quite now. Cheers.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s