Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find. So I tried to create one for a presentation I was doing on testing and data analytics. I made a review of works in the area, and just chose one for illustrate. This one tries to predict a target category to assign for an issue report. I used ARM mBed OS as a test target since it has issues available on Github and there were some people who work with it attending the presentation.

This demo “service” I created works by first training a predictive model based on a set of previous issue reports. I downloaded the reports from the issue repository. The amount of data available there was so small, I just downloaded the issues manually using the Github API that let me download the data for 100 issues at once. Automating the download should be pretty easy if needed. The amount of data is small, and there are a large number of categories to predict, so not the best for results, but serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so not quite that trendy. I don’t think it is all that necessary for this purpose or this size of data. But could work better of course, if you do try, post the code so we can play as well.

The Github issues API allows me to download the issues in batches. For example, to download page 12 of closed issues, with 100 issues per page, the URL to request is https://api.github.com/repos/ARMmbed/mbed-os/issues?state=closed&page=12&per_page=100. The API seems to cut it down to 100 even if using bigger values than 100. Or I just didn’t quite use it right, whichever. The API docs describe the parameters quite clearly, I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

	stop_words = set(stopwords.words('english'))
	stop_words = stop_words.union(set(punctuation))
	stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well..

To preprocess the issue descriptions before applying machine learning algorightms:

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	body = re.sub('[^A-Za-z0-9 /\\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub('[/\\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

Above code is intended to remove all but standard alphanumeric characters from the text, remove stop words, and tokenize the remaining text into separate words. It also splits URL’s into parts as separate words. The lemmatization changes known words into their baseforms (e.g., “car” and “cars” become “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words so I use that.

I stored the downloaded issues as JSON files (as Github API gives) in the data directory. To read all these filenames for processing:

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
open_files = glob.glob("data/*-closed-*")

To process those files, I need to pick only the ones with an assigned “component” value. This is what is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, the piece of data is not useful for training.

def process_files(files):
	'''
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing
	'''
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			print(filename)
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						all_text_tokens.append((text_tokens))
						#component_labels are prediction targets
						component_labels.append(name)
						total += 1

print("total: ", total)
						

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on are sometimes meaningful. Typical example is words “new” and “york” in a document, versus “new york”. This would be an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works
print(trigram_mod[bigram_mod[all_text_tokens[4]]])

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word-pairs and triplets that commonly are found to occur together are used (replaced) in the document.

Use the Python data processing library Pandas to turn it into suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels

print(df.head(2))

First to have a look at the data:

#how many issues are there in our data for all the target labels, assigned component counts
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
print(value_counts)
#remove all targets for which we have less than 10 training samples.
#K-fold validation with 5 splits requires min 5 to have 1 in each split. This makes it 2, which is still tiny but at least it sorta runs
indices = df["component"].isin(value_counts[value_counts > 9].index)
#this is the weird syntax i never remember, them python tricks. i think it slices the dataframe to remove the items not in "indices" list
df = df.loc[indices, :]

The above code actually already does a bit more. It also filters the dataset to remove the rows with component values that only have less than 10 items. So this is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

And after filtering, the last four rows will have been removed. So in the end, the dataset will not have any rows with labelsl “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation and this requires a minimum of 5 items of each. Filtering with minimum of 10 instances for a label, it is at least possible to have 2 of the least common “component” in each fold.

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. And rebuilding the model every now and then, depending on how much effort it is, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer id’s to labels after label encoding looks:

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(le_name_mapping)

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
print(le_id_mapping)

So the first is to print “label: id” pairs, and the second to print “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF presentation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.

Traditional machine learning approaches also require a more fixed size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not really stored. That’s how you can actually process some reasonable sized set of documents in memory.

So, in any case, to convert the documents to TF-IDF presentation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#transfor all documents into TFIDF vectors.
#TF-IDF vectors are all same length, same word at same index, value is its TFIDF for that word in that document
features_transformed = vectorizer.fit_transform(features)

Above code fits the vectorizer to the corpus and then transforms all the documents to their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.

It is possible also to print out all the words the TF-IDF found in the corpus:

#the TFIDF feature names is a long list of all unique words found
print(vectorizer.get_feature_names())
feature_names = np.array(vectorizer.get_feature_names())
print(len(feature_names))

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component
clf.fit(features_transformed, encoded_label)
probabilities = clf.predict_proba(features_transformed[0])
print(probabilities)

In the above I am using RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with less that 10 instances in the beginning, to have at least 2 for each class. Which is till super-tiny but thats what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import requests

def predict_component(issue):
	'''
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	'''
	#first load text for the given issue from github
	url = "https://api.github.com/repos/ARMmbed/mbed-os/issues/" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	print(url_json)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as parameter an issue number for the ARM mBed Github repo. Downloads the issue data, preprocesses it similar to the training data (clean, tokenize, lemmatize, TF-IDF). This is then used as a set of features to predict the component, based on the model trained earlier.

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the set
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

def show_most_informative_features(vectorizer, clf, n=20):
	feature_names = vectorizer.get_feature_names()
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In above code, “top_tfidf_feats” prints the top words with highest TF-IDF score for a document. So in a sense, it prints the words that TF-IDF has determined to be most uniquely representing that document.

The “show_most_informative_features” prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing target labels. This only works for certain classifiers, which have such simple co-efficients (feature weights). Such as multinomial naive-bayes (MultinomialNB below).

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(features_train_transformed, labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])
print(probabilities)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")
print(best_rf)

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)
			clf.fit(features_train_transformed, labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)

			print(accuracy)

			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)
	

In the code above, I use loops to run through the parameters. There is also something called GridSearch in the Python libraries, as well as RandomSearch (for cases where trying all combos is expensive). But I prefer the ability to control the loops, print out whatever I like and all that.

The above code also shows two ways I tried to train/evaluate the RandomForest parameters. First is with k-fold, latter with single test-train split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course the final idea is always to try and see what works.. This worked quite fine for me. Or so it seems, machine learning seems to be always about iterating stuffs and learning and updating as you go. More data could change this all, or maybe finding some mistake, or having more domain or analytics knowledge, finding mismatching results, or anything really.

What the unbalanced refers to is the number of instances of different components in this dataset, some “components” have many bug repots, while others much less. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type so this is also one reason I used it.

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameters changes. I would guess that also speaks for the small dataset causing potential issues. Also re-running the same code would give different classifications for new (unseen) instances. Which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, and get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).

In any case, I hope the example is useful for giving an idea of one way how machine learning could be applied in software testing related aspects. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s