Tag Archives: visualization

Explaining Machine Learning Classifiers with LIME

Machine learning algorithms can produce impressive results in classification, prediction, anomaly detection, and many other hard problems. Understanding what those results are based on is often complicated, since many algorithms are black boxes with little visibility into their inner workings. Explainable AI refers to techniques for providing human-understandable explanations of ML algorithm outputs.

Explainable AI is interesting for many reasons, including being able to reason about the algorithms used, the data we have to train them, and to understand better how to test the system using such algorithms.

LIME, or Local Interpretable Model-Agnostic Explanations, is one technique that seems to have gotten attention lately in this area. The idea of LIME is that you give it a single datapoint and the ML model to use, and it tries to build an understandable explanation for the model's output on that specific datapoint. Something like: "because this person was found to be sneezing and coughing (datapoint features), there is a high probability they have the flu (ML output)".
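
To make that concrete, producing a LIME explanation for tabular data looks roughly like the following. This is a minimal sketch; the variable names (clf, X_train, X_test, categorical_indices) are my own placeholders for a trained classifier and its data, not from any specific notebook:

    import lime.lime_tabular

    # Build an explainer from the training data; LIME uses this to sample
    # perturbed datapoints around the one being explained.
    explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X_train.values,
        feature_names=list(X_train.columns),
        class_names=["not survived", "survived"],
        categorical_features=categorical_indices,  # column indices of categorical features
        mode="classification",
    )

    # Explain a single datapoint: LIME perturbs it, queries the classifier via
    # predict_proba, and fits a simple local model whose weights form the explanation.
    exp = explainer.explain_instance(X_test.values[0], clf.predict_proba, num_features=10)
    print(exp.as_list())       # (feature description, weight) pairs
    exp.show_in_notebook()     # the bar-chart visualization used in the figures below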

There are plenty of introductory articles around for LIME, but I felt I needed something more concrete. So I tried it out on a few classifiers, datasets, and datapoints to see how it works in practice.

For the impatient, I can summarize: LIME seems interesting and going in the right direction, but I still found the details confusing to interpret. It didn’t really make me very confident in the explanations. There still seems to be some way to go before the explanations are easy to understand and inspire high confidence.

Experiment Setups

Overview

There are three sections to my experiments in the following. First, I try explaining the output of three different ML algorithms specifically designed for tabular data. Second, I try explaining the output of a generic neural network architecture. Third, I try a regression problem, as opposed to the first two, which examine classification problems. Each of the three sections uses LIME to explain a few datapoints, each from a different dataset for variety.

Inverted Values

As a little experiment, I took a single feature that LIME ranked as contributing highly to the explanation of a datapoint, for each ML algorithm in my experiments, and inverted its value. I then re-ran the ML algorithm and LIME on the same datapoint, with that single value changed, and compared the explanations.

The inverted feature was in each case a binary categorical feature, making the inversion process obvious (e.g., change gender from male to female or the other way around). The point was simply to see whether changing the value of a feature that LIME weights highly results in large changes in the ML algorithm outputs and in the associated LIME weights themselves.
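
As a rough sketch (with hypothetical variable names, and assuming a 0/1-encoded gender column), the inversion boils down to something like this:

    # Flip the binary feature on a single datapoint and re-run model + LIME
    row = X_test.iloc[0].copy()
    row["Sex"] = 1 - row["Sex"]     # invert male <-> female

    print("original prediction:", clf.predict_proba(X_test.iloc[[0]]))
    print("inverted prediction:", clf.predict_proba(row.to_frame().T))

    # Explain the modified datapoint with the same explainer as before
    exp_inv = explainer.explain_instance(row.values, clf.predict_proba, num_features=10)
    print(exp_inv.as_list())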

Datasets and Features

The datasets used in different sections:

  • Titanic: What features contribute to a specific person classified as survivor or not?
  • Heart disease UCI: What features contribute to a specific person being classified at risk of heart disease?
  • Ames housing dataset: What features contribute positively to predicted house price, and what negatively?

Algorithms applied:

  • Titanic: classifiers from LGBM, CatBoost, XGBoost
  • Heart disease UCI: Keras multi-layer perceptron NN architecture
  • Ames housing dataset: regressor from XGBoost

Tree Boosting Classifiers

Some of the most popular classifiers I see with tabular data are gradient boosted decision tree based ones: LGBM, CatBoost, and XGBoost. Many others exist that I also use at times, such as Naive Bayes, Random Forest, and Logistic Regression. However, LGBM, CatBoost, and XGBoost are the ones I often try first these days for tabular data, so I use LIME to explain a few datapoints for each of them in this section. I expect evaluating other ML algorithms would follow much the same process.

For this section, I use the Titanic dataset. The goal with this dataset is to predict who would survive the shipwreck and who would not. Its features:

  1. survival: 0 = No, 1 = Yes
  2. pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  3. sex: Sex
  4. Age: Age in years
  5. sibsp: number of siblings / spouses aboard the Titanic
  6. parch: number of parents / children aboard the Titanic
  7. ticket: Ticket number
  8. fare: Passenger fare
  9. cabin: Cabin number
  10. embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

The actual notebook code is available on my Github as well as in a Kaggle notebook.
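
For orientation, fitting one of the boosters looks roughly like the following. This is a simplified sketch with placeholder names (X for the encoded feature DataFrame, y for the survival labels); the actual preprocessing and training code is in the linked notebook:

    from lightgbm import LGBMClassifier
    from sklearn.model_selection import train_test_split

    # Hold out part of the data so feature importances and LIME can be
    # evaluated on datapoints the model has not seen during training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    clf = LGBMClassifier(n_estimators=200)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))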

Each of the three boosting models (LGBM, CatBoost, XGBoost) provides access to its internal statistics as a form of feature weights; for details, check the relevant articles and documentation. These model feature weights give a more holistic view of the model's workings over all datapoints, as opposed to the single datapoint that LIME tries to explain. So in the following, I will show these feature weights for comparison where available.

However, there is also some good criticism of using these types of classifier internal statistics for feature importances, noting it can be meaningful to compare with other techniques such as permutation importance and drop-column importance. As such, I also calculate permutation importance for each of the three boosters here, as well as later for the Keras NN classifier.
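
Both kinds of weights are cheap to compute. A rough sketch, using the same placeholder names as above:

    import pandas as pd
    from sklearn.inspection import permutation_importance

    # Model-internal importances, exposed by the sklearn wrappers of LGBM/CatBoost/XGBoost
    internal = pd.Series(clf.feature_importances_, index=X_train.columns)
    print(internal.sort_values(ascending=False))

    # Permutation importance: shuffle one column at a time on held-out data
    # and measure how much the score drops.
    perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
    perm_scores = pd.Series(perm.importances_mean, index=X_test.columns)
    print(perm_scores.sort_values(ascending=False))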

LGBM

Feature Weights from Classifier / Permutations

The following figure illustrates the weights given by the model itself, when I trained it on the Titanic dataset, via the classifier feature_importances_ attribute.

LGBM Feature Weights

And the following figure illustrates the weights given by SKLearn’s permutation importance function for the same classifier.

LGBM Permutation Weights

Comparing the two above, the model statistics based weights, and the permutation based weights, there is quite a difference in what they rank higher. Something interesting to keep in mind for LIME next.

Datapoint 1

The following figure illustrates the LIME explanations (figures are from LIME itself) for the first item in the test set for Titanic data:

LIME LGBM 1

The figure shows two versions of the same datapoint. The one on the left is the original data from the dataset. The one on the right has the sex attribute changed to the opposite gender. This is the inversion of the highly ranked LIME feature I mentioned before.

Now, compare these LIME visualizations/explanations for the two datapoint variants to the global feature importances above (from model internal statistics and permutation score). The top features presented by LIME closely match the top features given by the global permutation importance; in fact, it is almost an exact match.

Beyond that, the left side of the figure illustrates one of my main confusions about LIME in general. The prediction of the classifier for this datapoint is:

  • Not survived: 71% probability
  • Survived: 29% probability

I would then expect the LIME feature weights to show the highest contributions for the not survived classification. But it shows much higher weights for survived. By far, "Sex=male" seems to be the heaviest weight LIME gives any variable here, and it is shown as pointing towards survived. Similarly, the overall LIME feature weights in the left-hand figure are:

  • Not survived: 0.17+0.09+0.03+0.00=0.29
  • Survived: 0.31+0.15+0.07+0.03+0.02+0.01=0.59

Funny how the not survived weights sum up to the exact prediction value for survived. I might think I am looking at it the wrong way, but further explanations I tried with other datapoints seem to indicate otherwise. Starting with the right part of the above figure.
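
To double-check how I read the bars, the raw weights can also be pulled out of the explanation object and summed per direction. A rough sketch, assuming exp is the LIME explanation for this datapoint and that positive weights push towards the second class ("survived"):

    # as_list() gives (feature description, weight) pairs for the explained label
    weights = exp.as_list()
    towards_survived = sum(w for _, w in weights if w > 0)
    towards_not_survived = -sum(w for _, w in weights if w < 0)
    print("sum towards survived    :", round(towards_survived, 2))
    print("sum towards not survived:", round(towards_not_survived, 2))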

The right side of the above figure, with the gender inverted, also shows the sex attribute as the highest contributor. But now, the title has risen much higher. So perhaps it is telling that a female master has a higher chance of survival? I don’t know, but certainly the predictions of the classifier changed to:

  • Not survived: 43%
  • Survived: 57%

Similarly, the passenger class (Pclass) value has jumped from weighting towards survival to weighting towards non-survival. The sums of LIME feature weights in the inverted case do not seem too different overall, but the prediction has changed by quite a bit. It seems complicated.

Datapoint 2

LIME explanation for the second datapoint in the test set:

LIME LGBM 2

For this one, the ML prediction for the left side datapoint variant indicates even more strongly that the predicted survival chance is low, but the LIME feature weights point even more strongly in the opposite direction (survived).

The right side figure here illustrates a bit how silly my changes are (inverting only gender). The combination of female with mr should never happen in real data. But regardless of the sanity of some of the value combinations, I would expect the explanation to reflect the prediction equally well. After all, LIME is designed to explain a given prediction with given features, however crazy those features might be. On the right hand side figure the feature weights at least seem to match the prediction a bit better than on the left side, but then why is it not always matching in the same way?

An interesting point is also how the gender seems to weigh heavily towards survival in both cases here. Perhaps it is due to the combinatorics of the other feature values, but given how the LIME weights vs predictions seem to vary across datapoints, I wouldn’t be so sure.

Catboost

Feature Weights from Classifier / Permutations

Model feature weights based on model internals:

Catboost Feature Weights

Based on permutations:

Catboost Permutation Weights

Interestingly, parch shows negative contribution.

Datapoint 1

First datapoint using Catboost:

LIME Catboost 1

In this case, both the LIME weights for the left (original datapoint) and right (inverted gender) side seem to be more in line with the predictions. Which sort of shows that I cannot only blame myself for interpreting the figures wrong, since they sometimes seem to match the intuition, and other times not.

As opposed to the LGBM case above, in this case (for Catboost) the top LIME features seem to follow almost exactly the feature weights from the model internal statistics. For LGBM it was the other way around: they were not following the internal weights but rather the permutation weights. As confusing as everything else about these weights, yes.

Datapoint 2

The second datapoint using Catboost:

LIME Catboost 2

In this case, LIME is giving very high weights to variables on the survived side, while the actual classifier is almost fully predicting non-survival. Uh oh..

XGBoost

Feature Weights from Classifier / Permutations

Model feature weights based on model internal statistics:

XGB Feature Weights

Based on permutations:

XGB Permutation Weights

Datapoint 1

First datapoint explained for XGBoost:

LIME XGB 1

In this case, the left one seems to indicate not-survived quite heavily in the weights, but the actual predictions are quite even between survived and not survived. On the right side, the LIME feature weights seem to be more in line with, and match, the prediction.

As for the LIME weights vs the global weights from model internals and permutations, in this case they seem to be mixed. Some LIME top features are shared with the top feature weights from the model internals, some with the permutation weights. Compared to the previous sections, the LIME weights vs the model and permutation weights seem to be all over the place. This might be an attribute of the algorithms in the case of the internal feature weights, but I would expect LIME to be more consistent with regards to the permutation weights, as that algorithm never changes.

Datapoint 2

Second datapoint:

LIME XGB 2

Here, the left one seems to indicate much more survival in the weights, while the actual prediction is non-survival. On the right side, the weights and predictions seem more in line again.

Explaining a Keras NN Classifier

This section uses a different dataset, on Cleveland heart disease risk. The inverted variable in this case is not gender but the cp variable, since it seemed to be the highest scoring categorical variable for LIME on the datapoints I looked at. It also has 4 values, not 2, but in any case I expect changing a high scoring variable to show some impact.

Features:

  1. age: age in years
  2. sex: (1 = male; 0 = female)
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure in mm Hg on admission to the hospital
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

Feature Weights from Permutations

Keras, being a generic neural network framework as opposed to the specific algorithms of the boosters above, does not provide feature weights based on model internal statistics. But permutation based feature weighting is always an option:

Keras Permutation Weights
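
Since there is no feature_importances_ attribute to read here, this kind of permutation score has to be computed against the model's predict function. A rough sketch of one way to do it (not necessarily the exact notebook code), assuming a binary classifier with a single sigmoid output and numpy validation arrays X_val, y_val:

    import numpy as np

    def permutation_importance_keras(model, X_val, y_val, n_repeats=10, seed=0):
        """Mean drop in accuracy when one feature column at a time is shuffled."""
        rng = np.random.default_rng(seed)
        baseline = np.mean((model.predict(X_val).ravel() > 0.5) == y_val)
        drops = []
        for col in range(X_val.shape[1]):
            col_drops = []
            for _ in range(n_repeats):
                X_perm = X_val.copy()
                # break the link between this feature and the label
                X_perm[:, col] = rng.permutation(X_perm[:, col])
                acc = np.mean((model.predict(X_perm).ravel() > 0.5) == y_val)
                col_drops.append(baseline - acc)
            drops.append(np.mean(col_drops))
        return np.array(drops)   # one score per feature, larger = more important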

Training Curves

Training curves are always nice, so here you go:

Keras Training

Datapoint 1

First datapoint explained by LIME for Keras:

LIME Keras 1

This one is predicting almost fully no risk for both datapoint variants. Yet the weights seem to point almost fully towards the heart disease risk side.

The LIME weights compared to the global permutation weights share the same top 1-2 features, with some changes after.

Datapoint 2

Second datapoint explained by LIME for Keras:

LIME Keras 2

In this case, the predictions and weights are more mixed on both sides. The right side seems to have the weights much more on the no risk side than the left side, yet the change between the two is that the prediction has shifted more towards the heart disease risk side.

In this case, the features are quite different from the first datapoint, and also from the global weights given by permutation importance. Since LIME aims to explain single datapoints and not the global model, I don’t see an issue with this. However, I do see an issue in not being able to map the LIME weights to the predictions in any reasonable way. Not consistently at least.

Explaining an XGBoost Regressor

Features in the Ames Housing Dataset used in this section:

  • SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • Utilities: Type of utilities available
  • OverallQual: Overall material and finish quality
  • GrLivArea: Above grade (ground) living area square feet
  • ExterQual: Exterior material quality
  • Functional: Home functionality rating
  • KitchenQual: Kitchen quality
  • FireplaceQu: Fireplace quality
  • GarageCars: Size of garage in car capacity
  • YearRemodAdd: Remodel date
  • GarageArea: Size of garage in square feet

Datapoint 1

LIME XGBReg 1

As discussed here, LIME results seem more intuitive to reason about for classification than for regression. For regression, it should show some relative value of how the feature values contribute to the predicted regression value; in this case, how the specific feature values are predicted to impact the house price.

But as mentioned, the meaning of this is a bit unclear. For example, what does it mean for something to be positively weighted? Or negatively? With regard to what? This would require more investigation, but I will stick to classification for the more detailed look in this post.
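
For reference, invoking LIME in regression mode is roughly like the classification case, only with mode="regression" and the regressor's predict function. A sketch with placeholder names (xgb_reg for the trained regressor, X_train/X_test as DataFrames):

    import lime.lime_tabular

    explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train.values,
        feature_names=list(X_train.columns),
        mode="regression",
    )
    # For regression, LIME fits the local model against the predicted value itself,
    # so the weights are local positive/negative contributions to the predicted price.
    exp = explainer.explain_instance(X_test.values[0], xgb_reg.predict, num_features=10)
    print(exp.as_list())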

Data Distribution

Just out of interest, here is a description of the data distribution for the features shown above.

XGBReg Distribution

One could perhaps analyze how the feature value distributions relate to the LIME weights for those variables, and use that to analyze the LIME results further in relation to the predicted price. Maybe someday someone will.. :)

Conclusions

Compared to the global feature weights given by the model internal statistics and by the permutations, the LIME results often share some of the top features. And comparing explanations for different datapoints using the same algorithm, there are some changes in which features LIME ranks highest per datapoint. Overall, this all makes sense considering what LIME is supposed to be: explaining individual datapoints, where the globally important features likely often (and on average should) rank high, but where single points can vary.

LIME in general seems like a good way to visualize feature importances for a datapoint. I like how the features are presented as weighting in one direction vs the other. The idea of trying values close to a point to come up with an explanation also seems to make sense. However, many of the results I saw in the above experiments do not quite seem to make sense. The weights presented often seem to be opposed to the actual predictions.

This book chapter hosts some good discussion on the limitations of LIME, and maybe it explains some of this. The chapter ends up saying to use great care in applying LIME, and notes how the LIME parameters impact the results and explanations given. Which seems in line with what I saw above.

Also, many of the articles I linked in the beginning simply gloss over the interpretation of the results and whether they make sense, or make seemingly strange assumptions. Such as this one, which gives me the impression that the explanation weights would change depending on which class the classifier predicts with higher probability. To me, this does not seem to be what the visualizations show.

More useful would maybe be to understand the limitations and not expect too much, even if I feel like I don’t necessarily get all the details. Either it is poorly explained, or I did get the details and it is just very limited. This is perhaps coming from the background of LIME itself, where the academics must sell their results as the greatest and best in every case, and put the limitations aside. This is how you get your papers accepted and cited, leading to more grants and better tenure positions..

I would not really use LIME myself. Mostly because I cannot see myself trusting the results very much, no matter the sales arguments. But overall, it seems like interesting work, and perhaps something simpler (to use) will be available someday, where I can have more trust in the results. Or maybe the problem is just complicated. But as I said, these all seem like useful steps in the direction of improving the approaches and making them more usable. Along these lines, it is also nice to see these techniques being integrated into ML platform offerings and services.

There are other interesting methods with similar goals as well. SHAP is one that seems very popular, and Eli5 is another. Some even say LIME is a subset of SHAP, which should be more complete compared to the sampling approach taken by LIME. Perhaps it would be worth the effort to make a comparison some day..

That’s all for this time. Cheers.

Harharhar said the Santa when capturing browser test metrics with Webdriver and LittleMobProxy

In the modern age of big data and all that, it is trendy to capture as much data as we can. So this is an attempt at capturing data on a set of web browsing tests I run with Selenium WebDriver. This is in Java, using Selenium WebDriver and LittleMobProxy.

What I do here is configure an instance of LittleMobProxy to capture the traffic to/from a webserver in our tests. The captured data is written to a HAR (HTTP archive) file, the HAR file is parsed, and the contents are used for whatever you like. In this case I dump them into InfluxDB and show some graphs of the generated sessions using Grafana.

This can be useful to see how much bandwidth your website uses, how many requests end up being generated, what elements are slowest to load, how your server caching configurations affect everything, and so on. I have used it to provide data for overall network performance analysis by simulating a set of browsers and capturing the data on their sessions.

First up, start the LittleMobProxy instance, create a WebDriver instance, and configure the WebDriver instance to use the LittleMobProxy instance:

    // start the proxy
    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    // get the Selenium proxy object
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);

    // configure it as a desired capability
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability(CapabilityType.PROXY, seleniumProxy);

    driver = new ChromeDriver(capabilities);
    proxy.newHar("my_website_data");

Then we run some scripts on our website. The following is just a simple example, as I do not wish to post bots here for browsing common websites. I have prototyped some for browsing various videos on YouTube and for browsing different news portals. Generally this might be against the terms of service of public websites, so either use your own service that you are testing (with a test service instance), or download a Wikipedia dump or something similar to run your tests against. Example code:

  public List<WebElement> listArticles() {
    List<WebElement> elements = driver.findElements(By.className("news"));
    List<WebElement> invisibles = new ArrayList<>();
    for (WebElement element : elements) {
      if (!element.isDisplayed()) {
        System.out.println("Not displayed:"+element);
        invisibles.add(element);
      }
    }
    elements.removeAll(invisibles);
    List<WebElement> articles = new ArrayList<>();
    List<String> descs = new ArrayList<>();
    for (WebElement element : elements) {
      List<WebElement> links = element.findElements(By.tagName("a"));
      for (WebElement link : links) {
        //we only take long links as this website has some "features" in the content portal causing pointless short links.
        //This also removes the "share" buttons for facebook, twitter, etc. which we do not want to hit
        //A better alternative might be to avoid links leading out of the domain 
        //(if you can figure them out..)
        //this is likely true for the ads as well..
        if (link.getText().length() > 20) {
          descs.add(link.getText());
          articles.add(link);
        }
      }
    }
    return articles;
  }

  public void openRandomArticle() throws Exception {
    List<WebElement> articles = listArticles();
    //sometimes our randomized user might hit a seemingly dead end on the article tree, 
    //in which case we just go to the news portal main page ("base" here)
    if (articles.size() == 0) {
      driver.get(base);
      Thread.sleep(2000);
      articles = listArticles();
    }
    //this is a random choice of the previously filtered article list
    WebElement link = TestUtils.oneOf(articles);
    Actions actions = new Actions(driver);
    actions.moveToElement(link);
    actions.click();
    actions.perform();

    Har har = proxy.getHar();
    har.writeTo(new File("my_website.har"));
    //if we just want to print it, we can do this..
    WebTestUtils.printHar(har);
    //or to drop stats in a database, do something like this
    WebTestUtils.influxHar(har);
  }

The code to access the data in the HAR file:

  public static void printHar(Har har) {
    HarLog log = har.getLog();
    List<HarPage> pages = log.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = log.getEntries();
    for (HarEntry entry : entries) {
      String pageref = entry.getPageref();
      long time = entry.getTime();

      HarRequest request = entry.getRequest();
      long requestBodySize = request.getBodySize();
      long requestHeadersSize = request.getHeadersSize();
      String url = request.getUrl();

      HarResponse response = entry.getResponse();
      long responseBodySize = response.getBodySize();
      long responseHeadersSize = response.getHeadersSize();
      int status = response.getStatus();

      System.out.println("entry: pageref=" + pageref + ", time=" + time + ", reqBS=" + requestBodySize + ", reqHS=" + requestHeadersSize +
              ", resBS=" + responseBodySize + ", resHS=" + responseHeadersSize + ", status=" + status + ", url="+url);
    }
  }

So we can use this in many ways. Above I have just printed out some of the basic stats. Some examples of the information available can be found on the internet, e.g. https://confluence.atlassian.com/display/KB/Generating+HAR+files+and+Analysing+Web+Requests. In the following I show some simple data from browsing a local news site, visualized in Grafana using InfluxDB as a backend.

Here is some example code to write some of the HAR stats to InfluxDB:

  // assumes a class-level logger ("log") and an index field tracking how many
  // HAR entries have already been written, so repeated calls only add new ones
  private static int index = 0;

  public static void influxHar(Har har) {
    HarLog harLog = har.getLog();
    List<HarPage> pages = harLog.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = harLog.getEntries();
    long now = System.currentTimeMillis();
    int counter = 0;
    for (int i = index ; i < entries.size() ; i++) {
      HarEntry entry = entries.get(i);
      String pageref = entry.getPageref();
      long loadTime = entry.getTime();

      HarRequest request = entry.getRequest();
      if (request == null) {
        log.debug("Null request, skipping HAR entry");
        continue;
      }
      HarResponse response = entry.getResponse();
      if (response == null) {
        log.debug("Null response, skipping HAR entry");
        continue;
      }

      Map<String, Long> data = new HashMap<>();
      data.put("loadtime", loadTime);
      data.put("req_head", request.getHeadersSize());
      data.put("req_body", request.getBodySize());
      data.put("resp_head", response.getHeadersSize());
      data.put("resp_body", response.getBodySize());
      InFlux.store("browser_stat", now, data);
      counter++;
    }
    index += counter;
    System.out.println("count:"+counter);
  }

And the code to write the data into InfluxDB..

  private static InfluxDB db;

  static {
    if (Config.INFLUX_ENABLED) {
      db = InfluxDBFactory.connect(Config.INFLUX_URL, Config.INFLUX_USER, Config.INFLUX_PW);
      db.enableBatch(2000, 1, TimeUnit.SECONDS);
      db.createDatabase(Config.INFLUX_DB);
    }
  }

  public static void store(String name, long time, Map<String, Long> data) {
    if (!Config.INFLUX_ENABLED) return;
    Point.Builder builder = Point.measurement(name)
            .time(time, TimeUnit.MILLISECONDS)
            .tag("tom", name);
    for (String key : data.keySet()) {
      builder.field(key, data.get(key));
    }
    Point point = builder.build();
    log.debug("storing:"+point);
    //you should have enabled batch mode (as shown above) on the driver or this will bottleneck
    db.write(Config.INFLUX_DB, "default", point);
  }

And here are some example data visualized with Grafana for some metrics I collected this way:

Untitled

In the lower line chart, the number of elements loaded on click is shown. This refers to how many HTTP requests are generated per WebDriver click on the website. The upper chart shows the minimum, maximum and average load times for different requests/responses. That is, how much time it took for the server to send back the HTTP responses for the clicks.

This shows how the first few page loads generated a high number of HTTP requests/responses. After this, the amount goes down and stays quite steady at a lower level. I assume this is due to the browser having cached much of the static content and not needing to request it all the time. Occasionally our simulated random browser user enters a slightly less explored path on the webpage, causing a small short term spike.

This nicely shows how modern websites end up generating surprisingly large numbers of requests. Also this shows how some requests are pretty slow to respond, so this might be a useful point to investigate for optimizing overall response time.

That’s it. Not too complicated, but I find it rather interesting. It also does not require too many modifications to the existing WebDriver tests: just take the proxy component into use, parse the HAR file, and write to the database.

Performance testing with InfluxDB + Grafana + Telegraf, Part 3

This time I am running the test with the fixed SNMP based router monitoring and with the client-server monitoring as well. Notice in the following graphs how I learned to also put two scales on a single chart, combine line and bar charts, and add units to all the charts. OOoooo, pretty….

First up, the number of sessions, the update frequency delays, and the connection (SSL handshake) errors from the test start:

errors_start

Here the "errors" graph shows the overall number of connection errors so far in the test (the orange line) and the number of errors for each second (the green bars). The number of sessions is growing at a relatively steady pace, with a slight drop in rate over time. At around 16:46:45 we start getting errors at a steady rate, which is at around 1300 sessions, and at the same time the tester starts seeing bigger delays.

The router CPU load goes up pretty high quite fast:

nw_cpu_raw_start

At the problem start point of 16:46:45 the router CPU load starts to go over 80% quite often. This is likely a good part of the cause, although I would have expected more factors to be involved since this is still not at 100% (although looking at the router internal graph below, maybe it is at 100%, and the forced 5 second averaging just hides it).

Notice another interesting part on the left: the test starts from about 20% CPU load. Why is this? I speculate it is due to the other traffic over the same network. You know, "other users" streaming YouTube and playing online games, etc., while I run my tests. So this is actually the "background noise" of the system. It is also probably why in this test run I only got to around 3800 clients, when I have previously gotten close to 4200 with less background noise.

Looking at the router internal CPU graph:

router_start

It is actually quite high, but in peaks. With the finest granularity being 5 seconds, much of this is averaged much lower in the Grafana chart above. So maybe the CPU load on the router actually is the reason for some of those connection errors; the minimum 5 second averaging the SNMP agent on the router gives us is just partly hiding it.

A couple of other, slightly less relevant, but still quite interesting charts I collected during the run:

bandwidth_overload

This is "br0", or interface number 7, from the router (which I speculate is the wireless part). That is, the amount of traffic in/out on this interface. Notice how the "out" part (the orange line) actually has a much bigger scale on the left, while the "in" part (green) has a much smaller scale on the right. At first I wondered how this can be. I would expect much more downloading (in) to happen over this interface than uploading (out).

With my tinfoil hat on I immediately thought, maybe the NSA or the Chinese are downloading all my secrets and streaming video of me picking my nose 24/7 from my compewters? But then I realized this is from the router viewpoint, so "out" probably means the router is sending data to the WLAN clients and "in" means it is receiving data from them. So this would be the opposite of my assumption, and "in" is actually uploads and "out" downloads. Which seems more correct, and I can remove the hat for now.

Anyway, slightly interestingly, the upload part starts growing at a faster rate than the download part when I run the test. Most likely due to the difference in scale, but it shows how the traffic for upload increases proportionally more. Although in absolute terms I believe the increase is the same, as it is the application server and client sending data to each other over the wireless, so it is up/download over the same channel for the same data.

Just for kicks, the JVM memory use I collected:

jvm_memory

Nothing special here, but notice again how having the actual correct units on the y-axis is nice.

Now to take a look at the end part of the test. The errors, update frequency delays and router CPU load:

network_cpu_end

The router CPU in the above is actually hitting 100% in a couple of spots, which is quite heavy since these CPU load measures are 5 second averages, as explained before. This is why Grafana is upping the graph top boundary to 125%. I should probably have set the graph upper bound in Grafana at 100%, but whatever. At the end, when the test stops creating new connections and old ones die out, the router load also drops back to the level of the background noise.

Which we can also see in the router internal chart:

router_cpu_end

From this we can make an interesting comparison of the direct measures (the ones directly queried from specific SNMP OIDs) vs the derived ones (derived from several raw measures):

nw_direct_end

The scales are different due to auto-scaling pushing the raw chart's top to 125%, while the direct one never goes near 100% and is thus topped at 100%. Anyway, this comparison again shows how the longer averaging period hides much of the important information. This is especially visible in the last part of the period, although the whole period is pretty obvious. And in case I forgot to mention before, the "CPU busy" value (the blue line in the "Router CPU raw" chart) is the sum of all the other metrics in that chart (user + system + nice + sirq). In any case, the "direct" chart is averaged over a longer term (likely over a minute, as I speculated in my previous posts) and thus more information loss is evident.
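
As a trivial illustration of that derivation, the "CPU busy" line is just computed from the per-category loads polled over SNMP (the parameter names here are mine, not the actual OID labels):

    # Rough sketch: "busy" is simply everything that is not idle
    def cpu_busy(user, system, nice, sirq):
        return user + system + nice + sirq

    print(cpu_busy(user=35.0, system=40.0, nice=2.0, sirq=3.0))  # -> 80.0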

The router RAM usage is pretty constant and was so during all my tests. Thus I haven’t put much of it in these posts, but there it is for once in the above chart as well..

Then we take a look at the server CPU load, just to see if it was still low enough not to cause any of these issues:

server_cup_end

And no problem, it never goes over 50%. Interesting spikes though.

We can also look at the whole picture at once for the whole test run. The test runs for about 10 minutes and gets up to about 3800 concurrent sessions:

whole_sessions

and the network errors, router CPU load and update frequency delays in client and server:

whole_cpu_raw

Cute. Are we missing anything? Yes, we do not measure the load on the client side where the tester is running. This could also have a very significant impact. Here is an example of the top running processes at the client machine during the test:

tasks

What are these? Well, Firefox is running Grafana, which seems to really suck on the CPU when set up with complex graphs and continuous refreshes at 5 second intervals. IntelliJ is the IDE I am using to run the tests. The "java" process is likely related to the same, being a part of the performance tests. Why are they showing up as two processes, both consuming CPU at relatively high intensity? No idea (arrr..), but maybe "java" is the actual tester and IntelliJ is where it is run from, and it needs to track the tester process, capture prints, etc. Finally, "Python" is the SNMP poller that I run to capture the router metrics.

So what do I think I should have learned from this:

  • I should architect the system to run monitoring from one node, logging database on one node, graphing UI from another node, …
  • I should track all elements of the system, not just the client and the server
  • Long averaging periods for measurements lose important information for detailed analysis of the performance tests vs functionality (e.g., reasons for network errors as sessions start piling).
  • However, I should maybe not spend excessive time building shiny monitoring systems when the potential cause has already become clear. Although perhaps it is fine if I have one ready, as I now do..
  • I should architect my test system to become a distributed system, easily deployed on several nodes. This should allow avoiding process resource limits (e.g., threads per process), kernel limits (e.g., file descriptors), and provide more realistic loads over several distributed nodes
  • I should design in monitoring features that are useful in both production and testing

What next? Well, there are a few missing parts. One is that I should monitor the tester clients, as noted above. Another, not-so-obvious point is that I should synchronize the clocks of all systems as closely as possible. Otherwise I can never link properties such as network errors to other metrics from other nodes. Something like NTP should be used, but it is a bit complex to deploy effectively. In fact, distributed computing in general is a bit complex, so I should probably look into building a nice shiny distributed performance test framework with NTP synchronization sometime in the future..

Another interesting aspect would be to see how much the longer averaging period makes it harder to find actual causes, in a scenario where this could be realistically controlled along with the other factors.

Performance testing with InfluxDB + Grafana + Telegraf, Part 1

Previously I played around with ElasticSearch + Kibana, which was nice for visualizing some activities on my Minecraft server. I then proceeded to try the same for some server performance metrics data, such as CPU and memory use (collected using some Python scripts I created with the psutil library). ES + Kibana are absolutely great tools for analyzing textual data such as logs. Not so good and effective for numerical time-series type data, I would say. Maybe this is why they are so often combined with stuff like Logstash etc.
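
For reference, those psutil scripts were along these lines (a simplified sketch, not the exact code; the real scripts pushed the samples into the storage backend instead of printing them):

    import time
    import psutil

    # Poll basic CPU / memory stats once per second
    while True:
        sample = {
            "cpu_percent": psutil.cpu_percent(interval=None),
            "mem_percent": psutil.virtual_memory().percent,
        }
        print(sample)
        time.sleep(1)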

Anyway, after looking around, I found InfluxDB and Grafana to be more suitable alternatives for my performance data. And what is even nicer is that there is now a nice and shiny tool from the InfluxDB folks called Telegraf that collects all those system metrics for me. And since shiny pictures are nice, I thought I would post some of them here.

On this occasion, I am testing a simple backend server for an Android app. Clients register to the server, which then sends updates for subscribed sensors at 1 second intervals to each client. The client is the Android app in this case. My tests run from a single laptop and test how much load the server can handle, by continuously registering new clients to receive more updates.

Off we go with the initial tests. I first implemented my test client using plain Java SSL sockets. The app uses a custom binary protocol and it is just cost-effective to keep the connection open and push the data through, which is why the plain socket approach. I used two types of probes to collect the data: the Telegraf plugins for basic system resources, such as CPU, memory and network traffic, and a few custom probes I inserted into the application and into the tester to collect additional data.

Anyway, pics you say:

socket_sessions

This one shows the custom metrics I collected from the server/tester. The memory chart contains four different values. The "mem_used" metric measures the amount of reserved RAM in the JVM. The "mem_total" metric measures the total amount of RAM the JVM has allocated out of all the possible RAM it has been given. In this case, I gave the JVM a maximum of 1GB of RAM to allocate. This chart shows the JVM never allocates more than 250MB of this, so that is plenty. This is reflected in the "mem_free_max" metric, which shows how much more, at most, the server could still allocate beyond what the JVM has already allocated.

The session count metric is a measure of how many clients are connected to the server at a time (active sessions). The "frequency" metric here is a measure of how much time on average it takes at each point for the server to send the requested data updates to the clients. This chart shows the deviation from the expected 1000 milliseconds is at most 4 milliseconds, so well within range.

These "custom" metrics are all collected using my own measurement agents, deployed both at the server and at the tester end. With the server, I can also use them during production deployment to keep monitoring the performance, so they are useful beyond performance testing as well.

In this initial try with the plain sockets approach, I got up to around 2000 client connections. At this point, I ran into the operating system thread limit. I tried to get around that with another approach but more on that later.

For metrics on the system resources, I used the Telegraf tool, which has the option to directly collect operating system (and various other) metrics and push them into InfluxDB. I forgot to turn Telegraf on at first when doing this example, so it is a bit short. Anyway, visualization:

socket_cpu

Here, the top row shows CPU use percentage for each of the four cores on the system. The middle row shows the average of these for total system CPU load. The bottom row is for network traffic as bytes in/out. Unfortunately this did not seem to work on OSX so it is just a flatline for now.

In any case, 2000 clients seems plenty to test for my app, which has never seen more than a fraction of that in installs. But since I am always too curious, I wanted to try to implement this properly. So I went with Netty (as I needed both Java NIO and SSL support, which is horrible to implement by yourself, thank you very much Java engineers..).

So, using the same visualizations, here are some graphs.

start_sessions

This one shows the start of the test execution. Notice the flatline part on the left when the test is started? The flatline is the measurement agents showing the server JVM statistics before the test clients are started up. Interestingly, the JVM starts by allocating about 256MB of memory out of the possible 1GB. Shortly after more load is put on the JVM, it goes down much closer to the actual usage limit.

As for the frequency measure, it is a bit off at the start, which I would guess is quite normal while the first measure calibrates itself. What is more noticeable here is that I added a second frequency measure. Now there are two overlapping lines, one for the delay observed at the server end, and one for the tester end. This gives me a way to see if the delay is due to too much load on the server or too much load on the network (and avoids the need for time synchronization such as NTP or PTP when I distribute this). It also has the benefit that I can keep observing the server delay using the same agent in production.

After a longer run, this is where I end up:

end4k_sessions

At around 4000 client connections I hit another limit. This time I have passed the thread count limit, but I hit the limit on the number of available open file descriptors on the operating system. File descriptors are also used for sockets, and the OS has a limit on how many can be open at once. Apparently this can be modified using various tricks involving ulimit and kernel parameter tuning on OSX. However, I do not wish to mess up my personal laptop, so I give up on this and take the 4000 as enough for now. Otherwise I would take a virtual machine or several, install Ubuntu, and mess with the file limits there, as a VM is easy to redo. Then make several and bombard my server with all of them. But for now this is fine.

For the Telegraf metrics we get this:

end4k_cpu

So the CPU load goes up steadily towards the end and then suddenly drops. Which is kind of weird, as I would expect it to drop after the test finishes, not 30 seconds before it. Either there is some issue with the different ways the agents take time, or there are some delays in processing some metrics. I should probably investigate further. However, this does not really affect my performance test, so I am fine with it for now.

Anything else? Well, there are a few interesting bits to show. Let’s see.

Here is a frequency spike that hides all other variation due to being so large:

freq_spike

We can drill down on the flatline to see the variation there much like we did in Kibana (click and drag to select on the chart):

freq_spike_focused

Of course, it is a valid question whether the variation at this very fine granularity makes a difference, or if the large spike is the only thing of real interest here. But it illustrates how these fancy tools can be nice.

Another thing of interest. During some test runs, I managed to get some weird large spikes in the frequency (as well as session registration):

logspike_session

These spikes are the same for both the tester and the server frequency, and the number of new sessions getting through at this time is also flatlining. So something must hang the system for close to 30 seconds here. I would suspect some network lag, but I believe I was running both the tester and the server on localhost, so it should not be that (although InfluxDB is on another host). Also, the fact that the server spike is identical to (overlapping with) the tester spike speaks against this. So, a look at the CPU load for this period:

logspike_cpu

There does seem to be some correlation between the highest CPU spikes and the lag spikes in update frequencies. So I tried to think what I had changed or what could be causing this. I figured it was probably the fact that I had added some pretty verbose logging and enabled log rotation + compression. So maybe the logs were filling up fast and getting rotated + compressed at those points. I reduced logging to sane levels and the spikes never came back. Maybe it was that, or maybe it was the sunspots. But it’s gone, so fine with me..

So that is a basic example of how I used this toolset for performance testing. Now, in practice I would not run the server on localhost, nor would the client apps be running on localhost (or the same host the server is on). So I should actually test it in the real server environment. More about that next…