Trying to learn ECDSA and GoLang

Recently I have been looking at the Naivecoin tutorial and trying to implement it in Go, to get an idea of how blockchains really work and to learn some more Go as well. The tutorial code is in JavaScript, and translating it to Go has been mostly straightforward. However, porting the third part, with transactions, was giving me some issues: I had trouble figuring out how to port the signature part. This is the tutorial code in JavaScript:

    const key = ec.keyFromPrivate(privateKey, 'hex');
    const signature: string = toHexString(key.sign(dataToSign).toDER());

This is nice and simple, just suitable for a high-level framework and tutorial. However, to implement it myself in Go, …

The JS code above takes the private key as a hex formatted string, and parses that into a Javascript PrivateKey object. This key object is then used to sign the “dataToSign”, the signature is formatted as something called “DER”, and the result is formatted as a hex string. What does all that mean?

The tutorial refers to Elliptic Curve Digital Signature Algorithm (ECDSA). The data to sign in this case is the SHA256 hash of the transaction ID. So how to do this in Go? Go has an ecdsa package with private keys, public keys, and functions to sign and verify data. Sounds good. But the documentation is quite sparse, so how do I know how to properly use it?
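To make the target concrete, here is a minimal sketch of what the JS snippet could look like in Go: hex private key in, hex-encoded DER signature out. The function names are my own, the example key is arbitrary, and P-256 plus hashing with SHA-256 inside the function are assumptions for illustration; a real port would use whatever curve and data preparation the tutorial configures.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"math/big"
)

// privateKeyFromHex rebuilds a private key from its hex-encoded scalar.
// P-256 is an assumption made for this sketch.
func privateKeyFromHex(hexKey string) (*ecdsa.PrivateKey, error) {
	d, ok := new(big.Int).SetString(hexKey, 16)
	if !ok || d.Sign() <= 0 {
		return nil, errors.New("invalid hex private key")
	}
	curve := elliptic.P256()
	priv := new(ecdsa.PrivateKey)
	priv.Curve = curve
	priv.D = d
	// The public point is the private scalar times the curve's base point.
	priv.X, priv.Y = curve.ScalarBaseMult(d.Bytes())
	return priv, nil
}

// signToHex hashes data with SHA-256, signs the digest, and returns the
// DER (ASN.1) encoded signature as a hex string.
func signToHex(priv *ecdsa.PrivateKey, data []byte) (string, error) {
	digest := sha256.Sum256(data)
	der, err := ecdsa.SignASN1(rand.Reader, priv, digest[:])
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(der), nil
}

// verifyHex checks a hex-encoded DER signature against data.
func verifyHex(pub *ecdsa.PublicKey, data []byte, sigHex string) bool {
	der, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	digest := sha256.Sum256(data)
	return ecdsa.VerifyASN1(pub, digest[:], der)
}

func main() {
	// Arbitrary example scalar, not a key from the tutorial.
	priv, err := privateKeyFromHex("1f2e3d4c5b6a79880123456789abcdef0fedcba9876543211122334455667788")
	if err != nil {
		panic(err)
	}
	sig, err := signToHex(priv, []byte("data to sign"))
	if err != nil {
		panic(err)
	}
	fmt.Println("signature:", sig)
	fmt.Println("verifies:", verifyHex(&priv.PublicKey, []byte("data to sign"), sig))
}
```

The signature text changes on every run, since signing uses fresh randomness, but verification should succeed each time.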

To try it, I decided to first write a program in Java using ECDSA signatures, and use it to compare to the results of the Go version I would write. This way I would have another point of reference to compare my results to, and to understand if I did something wrong. I seemed to find more information about the Java implementation, and since I am more familiar with Java in general..

So first to generate the keys to use for signatures in Java:

	public static String generateKey() throws Exception {
		KeyPairGenerator keyGen = KeyPairGenerator.getInstance("EC");
		SecureRandom random = SecureRandom.getInstance("SHA1PRNG");

		keyGen.initialize(256, random); //256 bit key size

		KeyPair pair = keyGen.generateKeyPair();
		ECPrivateKey priv = (ECPrivateKey) pair.getPrivate();
		PublicKey pub = pair.getPublic();

		//actually also need public key, but lets get to that later...
		return Base64.getEncoder().encodeToString(priv.getEncoded());
	}

Above code starts with getting an “EC” key-pair generator, EC referring to Elliptic Curve. Then get a secure random number generator instance, in this case one based on SHA1 hash algorithm. Apparently this is fine, even if SHA1 is not recommended for everything these days. Not quite sure about the key size of 256 given, but maybe have to look at that later.. First to get this working.

The “priv.getEncoded()” part turns the private key into a standard encoding format as a byte array. I Base64 encode it to get a character representation that can be copied to the Go version.

Next, to sign the data (or message, or whatever we want to sign..):

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv); //sign with the private key

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte); //feed in the data to sign

		byte[] realSig = ecdsa.sign();

		System.out.println("Signature: " + new BigInteger(1, realSig).toString(16));

		return realSig;
	}

Above starts with getting a Java instance of the ECDSA signature algorithm, with type “SHA1withECDSA”. I spent a good moment wondering what all this means, to be able to copy the functionality into the Go version. So long story short, first the data is hashed with SHA1 and then this hash is signed with ECDSA. Finally, the code above prints the signature bytes as a hexadecimal string (byte array->BigInteger->base 16 string). I can then simply copy-paste this hex-string to Go to see if I can get signature verification to work in Go vs Java. Brilliant.

First I tried to see that I can get the signature verification to work in Java:

	private static boolean verifySignature(PublicKey pubKey,String msg, byte[] signature) throws Exception {
		byte[] message = msg.getBytes("UTF-8");
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");
		ecdsa.initVerify(pubKey); //verify with the public key
		ecdsa.update(message); //feed in the data that was signed
		return ecdsa.verify(signature);
	}

The code above takes the public key associated with the private key that was used to sign the data (called “msg” here). It creates the same type of ECDSA signature instance as the signature creation previously. This is used to verify the signature is valid for the given message (data). So signed with the private key, verified with the public key. And yes, it returns true for the signed message string, and false otherwise, so it works. So now knowing I got this to work, I can try the same in Go, using the signature, public key, and private key that was used in Java. But again, the question. How do I move these over?

Java seems to provide functions such as key.getEncoded(). This gives a byte array. We can then Base64 encode it to get a string (I believe Bitcoin etc. use Base58, but it is the same idea). So something like this:

		byte[] pubEncoded = pub.getEncoded();
		String encodedPublicKey = Base64.getEncoder().encodeToString(pubEncoded);
		String encodedPrivateKey = Base64.getEncoder().encodeToString(priv.getEncoded());

Maybe I could then take the output I just printed, and decode that into the key in Go? But what is the encoding? Well, the JDK docs say getEncoded() “Returns the key in its primary encoding format”. And what might that be? Well some internet searching and debugger runs later I come up with this (which works to re-create the keys in Java):

	public static PrivateKey base64ToPrivateKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		PKCS8EncodedKeySpec spec = new PKCS8EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PrivateKey privateKey = factory.generatePrivate(spec);
		return privateKey;
	}

	public static PublicKey base64ToPublicKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		X509EncodedKeySpec spec = new X509EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PublicKey publicKey = factory.generatePublic(spec);
		return publicKey;
	}

So the JDK encodes the private key in PKCS8 format, and the public key in some kind of X509 format. X509 seems to be related to certificates, and PKCS refers to “Public Key Cryptography Standards”, of which there are several. Both of these seem a bit complicated, as I was just looking to transfer the keys over. Since people can post those online for various crypto tools as short strings, it cannot be that difficult, can it?

I tried to look for ways to take PKCS8 and X509 data into Go and transform those into private and public keys. Did not get me too far with that. Instead, I figured there must be only a small part of the keys that is needed to reproduce them.

So I found that the private key has a single large number that is the important bit, and the public key can be calculated from the private key. And the public key in itself consists of two parameters, the x and y coordinates of a point (I assume on the elliptic curve). I browsed all over the internet trying to figure this all out, but did not keep records of all the sites I visited, so my references are kind of lost. However, here is one description that just so states the integer and point part. Anyway, please let me know of any good references for a non-mathematician like me to understand it if you have any.

To get the private key value into suitable format to pass around in Java:

	private static String getPrivateKeyAsHex(PrivateKey privateKey) {
		ECPrivateKey ecPrivateKey = (ECPrivateKey) privateKey;
		byte[] privateKeyBytes = ecPrivateKey.getS().toByteArray();
		String hex = bytesToHex(privateKeyBytes);
		return hex;
	}

The “hex” string in the above code is the big integer value that forms the basis of the private key. This can now be passed, backed up, or whatever we desire. Of course, it should be kept private so no posting it on the internet.

For the public key:

	private static String getPublicKeyAsHex(PublicKey publicKey) {
		ECPublicKey ecPublicKey = (ECPublicKey) publicKey;
		ECPoint ecPoint = ecPublicKey.getW();

		byte[] affineXBytes = ecPoint.getAffineX().toByteArray();
		byte[] affineYBytes = ecPoint.getAffineY().toByteArray();

		String hexX = bytesToHex(affineXBytes);
		String hexY = bytesToHex(affineYBytes);

		return hexX+":"+hexY;
	}

The above code takes the X and Y coordinates that make up the public key, combines them, and thus forms a single string that can be passed to get the X and Y for public key. A more sensible option would likely just create a single byte array with the length of the first part as first byte or two. Something like [byte count for X][bytes of X][bytes of Y]. But the string concatenation works for my simple example to try to understand it.
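The length-prefixed layout mentioned above, [byte count for X][bytes of X][bytes of Y], could look roughly like this in Go. This is just a sketch of that idea under my own naming (encodePoint, decodePoint, and mustHex are hypothetical helpers), not a standard encoding; the X and Y values are the ones used in the Go test further down.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

// encodePoint packs X and Y as [len(X)][X bytes][Y bytes].
// A single length byte is enough for coordinates up to 255 bytes.
func encodePoint(x, y *big.Int) ([]byte, error) {
	xb, yb := x.Bytes(), y.Bytes()
	if len(xb) > 255 {
		return nil, errors.New("x coordinate too long for a one-byte length prefix")
	}
	out := make([]byte, 0, 1+len(xb)+len(yb))
	out = append(out, byte(len(xb)))
	out = append(out, xb...)
	out = append(out, yb...)
	return out, nil
}

// decodePoint reverses encodePoint.
func decodePoint(data []byte) (*big.Int, *big.Int, error) {
	if len(data) < 1 || len(data) < 1+int(data[0]) {
		return nil, nil, errors.New("truncated point encoding")
	}
	n := int(data[0])
	x := new(big.Int).SetBytes(data[1 : 1+n])
	y := new(big.Int).SetBytes(data[1+n:])
	return x, y, nil
}

// mustHex parses a hex string into a big integer and panics on bad input.
func mustHex(s string) *big.Int {
	n, ok := new(big.Int).SetString(s, 16)
	if !ok {
		panic("bad hex: " + s)
	}
	return n
}

func main() {
	x := mustHex("4bc55d002653ffdbb53666a2424d0a223117c626b19acef89eefe9b3a6cfd0eb")
	y := mustHex("d8308953748596536b37e4b10ab0d247f6ee50336a1c5f9dc13e3c1bb0435727")
	enc, err := encodePoint(x, y)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: %x\n", len(enc), enc)
	dx, dy, err := decodePoint(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", dx.Cmp(x) == 0 && dy.Cmp(y) == 0)
}
```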

And then there is one more thing that needs to be encoded and passed between the implementations, which is the signature. Far above, I wrote the “signMsg()” method to build the signature. I also printed the signature bytes out as a hex-string. But what format is the signature in, and how do you translate it to another platform and verify it? It turns out Java gives the signatures in ASN.1 format. There is a good description of the format here. It’s not too complicated but how would I import that into Go again? I did not find any mention of this in the ECDSA package for Go. By searching with ASN.1 I did finally find an ASN.1 package for Go. But is there a way to do that without these (poorly documented) encodings?

Well, it turns out that ECDSA signatures can also be described by using just two large integers, which I refer to here as R and S. To get these in Java:

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv); //sign with the private key

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte); //feed in the data to sign

		byte[] realSig = ecdsa.sign();

		System.out.println("R: "+extractR(realSig));
		System.out.println("S: "+extractS(realSig));

		return realSig;
	}

	public static BigInteger extractR(byte[] signature) throws Exception {
		//skip the sequence tag and its length (one or two bytes)
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startR + 2, startR + 2 + lengthR));
	}

	public static BigInteger extractS(byte[] signature) throws Exception {
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		//S follows directly after the bytes of R
		int startS = startR + 2 + lengthR;
		int lengthS = signature[startS + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startS + 2, startS + 2 + lengthS));
	}

Above code takes the byte array of the signature, and parses the R and S from it as matching the ASN.1 specification I linked above. So with that, another alternative is again to just turn the R and S into hex-strings or Base58 encoded strings, combine them as a single byte-array and hex-string or Base58 that, or whatever. But just those two values need to be passed to capture the signature.
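As a rough sketch of the "combine them as a single byte-array" option, R and S can each be padded to a fixed width and concatenated, so no length prefix is needed. The helper names (packRS, unpackRS, mustDec) are my own, and the 32-byte width assumes a 256-bit curve; the R and S values are the ones printed by my Java version.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

// packRS writes r and s as two big-endian, zero-padded fields of size bytes each.
func packRS(r, s *big.Int, size int) ([]byte, error) {
	if r.Sign() < 0 || s.Sign() < 0 {
		return nil, errors.New("r and s must be non-negative")
	}
	if len(r.Bytes()) > size || len(s.Bytes()) > size {
		return nil, errors.New("r or s does not fit in the requested width")
	}
	out := make([]byte, 2*size)
	r.FillBytes(out[:size])
	s.FillBytes(out[size:])
	return out, nil
}

// unpackRS splits a packed signature back into r and s.
func unpackRS(sig []byte) (*big.Int, *big.Int, error) {
	if len(sig) == 0 || len(sig)%2 != 0 {
		return nil, nil, errors.New("signature must have an even, non-zero length")
	}
	half := len(sig) / 2
	return new(big.Int).SetBytes(sig[:half]), new(big.Int).SetBytes(sig[half:]), nil
}

// mustDec parses a decimal string into a big integer and panics on bad input.
func mustDec(s string) *big.Int {
	n, ok := new(big.Int).SetString(s, 10)
	if !ok {
		panic("bad decimal: " + s)
	}
	return n
}

func main() {
	r := mustDec("89498588918986623250776516710529930937349633484023489594523498325650057801271")
	s := mustDec("67852785826834317523806560409094108489491289922250506276160316152060290646810")
	packed, err := packRS(r, s, 32) // 32 bytes per value for a 256-bit curve
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: %x\n", len(packed), packed)
	r2, s2, err := unpackRS(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", r.Cmp(r2) == 0 && s.Cmp(s2) == 0)
}
```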

Now, finally to parse all this data in Go and to verify the signature. First to get the private key from the hex-string:

	func hexToPrivateKey(hexStr string)  *ecdsa.PrivateKey {
		bytes, err := hex.DecodeString(hexStr)
		if err != nil {
			panic(err) //not a valid hex-string
		}

		k := new(big.Int)
		k.SetBytes(bytes)

		priv := new(ecdsa.PrivateKey)
		curve := elliptic.P256()
		priv.PublicKey.Curve = curve
		priv.D = k
		priv.PublicKey.X, priv.PublicKey.Y = curve.ScalarBaseMult(k.Bytes())
		//this print can be used to verify if we got the same parameters as in Java version
		fmt.Printf("X: %d, Y: %d", priv.PublicKey.X, priv.PublicKey.Y)

		return priv
	}

The above code takes the hex-string, parses it into a byte array, creates a Go big integer from that, and sets the result as the value into the private key. The other part that is needed is the elliptic curve definition. In practice, one of a predefined set of curves is usually used, and the same curve is used for a specific purpose. So it can be defined as a constant, whichever is selected for the blockchain. For example, Bitcoin uses the secp256k1 curve. In this case it is always defined as the P256 curve, both in the Java and Go versions. So I just set the curve and the big integer to create the private key. The public key (X and Y parameters) is calculated here from the private key, by multiplying the curve’s base point by the private key’s big integer.

To build the public key straight from the X and Y values passed in as hex-strings:

	func hexToPublicKey(xHex string, yHex string) *ecdsa.PublicKey {
		xBytes, _ := hex.DecodeString(xHex)
		x := new(big.Int)
		x.SetBytes(xBytes)

		yBytes, _ := hex.DecodeString(yHex)
		y := new(big.Int)
		y.SetBytes(yBytes)

		pub := new(ecdsa.PublicKey)
		pub.X = x
		pub.Y = y

		pub.Curve = elliptic.P256()

		return pub
	}

Again, Base58 or similar would likely be a more efficient representation. So the above code allows passing around just the public key and not the private key, which is how it should be done: the parameters X and Y are passed, and the curve is defined as a constant choice.

To create and verify the signature from the passed values:

	type ecdsaSignature struct {
		R, S *big.Int
	}

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		if _, err := asn1.Unmarshal(sig, &esig); err != nil {
			return false //not a valid ASN.1 encoded signature
		}
		//we can use these prints to compare to what we had in Java...
		fmt.Printf("R: %d , S: %d", esig.R, esig.S)
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

The above version reads the actual ASN.1 encoded signature that is produced by the Java default signature encoding. To get the functionality matching the Java “SHA1withECDSA” algorithm, I first have to hash the input data with SHA1 as done here. Since the Java version is a bit of a black box with just that string definition, I spent a good moment wondering about that. I would guess the same approach would apply for other choices such as “SHA256withECDSA” by just replacing the hash function with another. Alternatively, I can also just pass in directly the R and S values of the signature:
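To check my guess about swapping the hash function, here is a small sketch where the hash is a parameter. It signs and verifies entirely in Go with a freshly generated key (so it says nothing about Java interoperability by itself), and signWith, verifyWith, and demo are names I made up. It uses ecdsa.SignASN1 and ecdsa.VerifyASN1 from the standard library, which handle the ASN.1 encoding directly.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
	"hash"
)

// signWith hashes msg with the given hash and returns an ASN.1 encoded signature.
func signWith(newHash func() hash.Hash, priv *ecdsa.PrivateKey, msg string) ([]byte, error) {
	h := newHash()
	h.Write([]byte(msg))
	return ecdsa.SignASN1(rand.Reader, priv, h.Sum(nil))
}

// verifyWith hashes msg with the given hash and verifies an ASN.1 encoded signature.
func verifyWith(newHash func() hash.Hash, pub *ecdsa.PublicKey, msg string, derSig []byte) bool {
	h := newHash()
	h.Write([]byte(msg))
	return ecdsa.VerifyASN1(pub, h.Sum(nil), derSig)
}

// demo signs with SHA-256 and reports whether verification succeeds with
// the matching hash and with a mismatched one (SHA-1).
func demo() (bool, bool, error) {
	priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return false, false, err
	}
	const msg = "This is string to sign"
	sig, err := signWith(sha256.New, priv, msg)
	if err != nil {
		return false, false, err
	}
	matching := verifyWith(sha256.New, &priv.PublicKey, msg, sig)
	mismatched := verifyWith(sha1.New, &priv.PublicKey, msg, sig)
	return matching, mismatched, nil
}

func main() {
	matching, mismatched, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("SHA-256 signature verified with SHA-256:", matching)
	fmt.Println("SHA-256 signature verified with SHA-1:", mismatched)
}
```

The point is that both sides must agree on the hash: a signature made over a SHA-256 digest should not verify against a SHA-1 digest of the same message.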

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		//R and S start out as nil pointers, so allocate the big integers while parsing
		esig.R, _ = new(big.Int).SetString("89498588918986623250776516710529930937349633484023489594523498325650057801271", 0)
		esig.S, _ = new(big.Int).SetString("67852785826834317523806560409094108489491289922250506276160316152060290646810", 0)
		fmt.Printf("R: %d , S: %d", esig.R, esig.S)
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

So in the above, the R and S are actually set from numbers passed in. Which normally would be encoded more efficiently, and given as parameters. However, this works to demonstrate. The two long strings are the integers for the R and S I printed out in the Java version.

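Strangely, printing the R and S using the ASN.1 and the direct passing of the numbers gives a different value for R and S. Which is a bit odd. But they both verify the signature fine. I read somewhere that some transformations can be done on the signature numbers while keeping it valid. Maybe this is done as part of the encoding or something? Another likely explanation is that ECDSA signing mixes in a fresh random value every time, so signing the same message twice gives two different but equally valid (R, S) pairs, and my hex signature and decimal numbers may simply come from different Java runs. I am not sure. But it works. Much trust such crypto skills I have.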
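The differing R and S values can be poked at with a small experiment. The sketch below (all names are mine, and it uses a freshly generated Go key rather than my Java key) signs the same SHA1 digest twice and checks three things: whether the two signatures differ, whether both verify, and whether replacing S with N - S (N being the order of the curve) still verifies, which is the kind of "transformation" I had read about.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha1"
	"fmt"
	"math/big"
)

// experiment signs one digest twice and reports:
//   differ:       the two (R, S) pairs are not identical
//   bothValid:    both signatures verify
//   flippedValid: (R, N-S) also verifies for the first signature
func experiment() (differ, bothValid, flippedValid bool, err error) {
	priv, genErr := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if genErr != nil {
		return false, false, false, genErr
	}
	pub := &priv.PublicKey
	digest := sha1.Sum([]byte("This is string to sign"))

	r1, s1, err1 := ecdsa.Sign(rand.Reader, priv, digest[:])
	if err1 != nil {
		return false, false, false, err1
	}
	r2, s2, err2 := ecdsa.Sign(rand.Reader, priv, digest[:])
	if err2 != nil {
		return false, false, false, err2
	}

	differ = r1.Cmp(r2) != 0 || s1.Cmp(s2) != 0
	bothValid = ecdsa.Verify(pub, digest[:], r1, s1) && ecdsa.Verify(pub, digest[:], r2, s2)

	// Mirror S around the curve order; plain ECDSA verification accepts this too.
	flipped := new(big.Int).Sub(pub.Params().N, s1)
	flippedValid = ecdsa.Verify(pub, digest[:], r1, flipped)
	return differ, bothValid, flippedValid, nil
}

func main() {
	differ, bothValid, flippedValid, err := experiment()
	if err != nil {
		panic(err)
	}
	fmt.Println("two signatures differ:", differ)
	fmt.Println("both verify:", bothValid)
	fmt.Println("(R, N-S) verifies:", flippedValid)
}
```

So a given message and key do not have one single "correct" signature, which would fit with seeing different numbers in different places.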

func TestSigning(t *testing.T) {
	xHexStr := "4bc55d002653ffdbb53666a2424d0a223117c626b19acef89eefe9b3a6cfd0eb"
	yHexStr := "d8308953748596536b37e4b10ab0d247f6ee50336a1c5f9dc13e3c1bb0435727"
	ePubKey := hexToPublicKey(xHexStr, yHexStr)

	sig := "3045022071f06054f450f808aa53294d34f76afd288a23749628cc58add828e8b8f2b742022100f82dcb51cc63b29f4f8b0b838c6546be228ba11a7c23dc102c6d9dcba11a8ff2"
	sigHex, _ := hex.DecodeString(sig)
	ok := verifyMySig(ePubKey, "This is string to sign", sigHex)
	if !ok {
		t.Error("signature verification failed")
	}
}

And finally, it works! Great 🙂


Playing with Pairwise Testing and PICT

A while back, I was doing some lectures on advanced software testing technologies. One topic was combinatorial testing. Looking at the materials, there are good and free tools out there to generate tests to cover various combinations. Still, I don’t see many people use them, and the materials out there don’t seem too great.

Combinatorial testing here refers to having 2-way, 3-way, up to N-way (sometimes they seem to call it t-way…) combinations of data values in different test cases. 2-way is also called pairwise testing. This simply refers to all pairs of data values appearing in different test cases. For example, if one test uses values “A” and “B”, and another uses a combination of “A” and “C”, you would have covered the pairs A+B and A+C but not B+C. With large numbers of potential values, the set of potential combinations can grow pretty huge, so finding a minimal set to cover all combinations can be very useful.

The benefits

There is a nice graph over at NIST, including a PDF with a broader description. Basically these show that 2-way and 3-way combinations already show very high gains in finding defects over considering coverage of single variables alone. Of course, things get a bit more complicated when you need to find all relevant variables in the program control flow, how to define what you can combine, all the constraints, etc. Maybe later. Now I just wanted to try the combinatorial test generation.

Do Not. Try. Bad Yoda Joke. Do Try.

So I gave combinatorial test generation a go. Using a nice and freely available PICT tool from Microsoft Research. It even compiles on different platforms, not just Windows. Or so they say on their Github.

Unexpectedly, compiling and getting PICT to run on my OSX was quite simple. Just “make” and “make test” as suggested on the main Github page. Probably I had most dependencies already from before, but anyway, it was surprisingly easy.

I made “mymodels” and “myoutputs” directories under the directory where I cloned the Git repository and compiled the code, just to keep some order to my stuff. This is why the following example commands work.

I started with the first example on PICT documentation page. The model looks like this:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

Running the tool and getting some output is actually simpler than I expected:

./pict mymodels/example1.pict >myoutputs/example1.txt

PICT prints the list of generated test value combinations to the standard output, which generally just translates to printing a bunch of lines on the console/screen. To save the generated values, I just redirect the output to myoutputs/example1.txt, as shown above. In this case, the output looks like this:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	100	quick	FAT32	1024	on
Logical	10000	slow	NTFS	512	off
Primary	500	quick	FAT	65536	off
Span	10000	slow	FAT	16384	on
Logical	40000	quick	FAT32	16384	off
Span	1000	quick	NTFS	512	on
Span	10	slow	FAT32	32768	off
Stripe	5000	slow	NTFS	32768	on
RAID-5	500	slow	FAT	32768	on
Mirror	1000	quick	FAT	32768	off
Single	10	quick	NTFS	4096	on
RAID-5	100	slow	FAT32	4096	off
Mirror	100	slow	NTFS	65536	on
RAID-5	40000	quick	NTFS	2048	on
Stripe	5000	quick	FAT	4096	off
Primary	40000	slow	FAT	8192	on
Mirror	10	quick	FAT32	8192	off
Span	500	slow	FAT	1024	off
Single	1000	slow	FAT32	2048	off
Stripe	500	quick	NTFS	16384	on
Logical	10	quick	FAT	2048	on
Stripe	10000	quick	FAT32	512	off
Mirror	500	quick	FAT32	2048	on
Primary	10	slow	FAT32	16384	on
Single	10	quick	FAT	512	off
Single	10000	quick	FAT32	65536	off
Primary	40000	quick	NTFS	32768	on
Single	100	quick	FAT	8192	on
Span	5000	slow	FAT32	2048	on
Single	5000	quick	NTFS	16384	off
Logical	500	quick	NTFS	8192	off
RAID-5	5000	quick	NTFS	1024	on
Primary	1000	slow	FAT	1024	on
RAID-5	10000	slow	NTFS	8192	on
Logical	100	quick	NTFS	32768	off
Primary	10000	slow	FAT	32768	on
Stripe	40000	quick	FAT32	65536	on
Span	40000	quick	FAT	4096	on
Stripe	1000	quick	FAT	8192	off
Logical	1000	slow	FAT	4096	off
Primary	100	quick	FAT	2048	off
Single	40000	quick	FAT	1024	off
RAID-5	1000	quick	FAT	16384	on
Single	500	quick	FAT32	512	off
Stripe	10	quick	NTFS	2048	off
Primary	100	quick	NTFS	512	off
Logical	10000	slow	NTFS	1024	off
Mirror	5000	quick	FAT	512	on
Logical	5000	slow	NTFS	65536	off
Mirror	10000	slow	FAT	2048	off
RAID-5	10	slow	FAT32	65536	off
Span	100	quick	FAT	65536	on
Single	5000	quick	FAT	32768	on
Span	1000	quick	NTFS	65536	off
Primary	500	slow	FAT32	4096	off
Mirror	40000	slow	FAT32	4096	off
Mirror	10	slow	FAT32	1024	off
Logical	10000	quick	FAT	4096	off
Span	5000	slow	FAT	8192	off
RAID-5	40000	quick	FAT32	512	on
Primary	5000	quick	NTFS	1024	off
Mirror	100	slow	FAT32	16384	off

The first line is the header, and values/columns are separated by tabulator characters (tabs).

The output above is 62 generated combinations/test cases as evidenced by:

wc -l myoutputs/example1.txt 
      63 myoutputs/example1.txt

(wc -l counts lines, and the first line is the header, so I subtract 1)
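To put the 62 tests in perspective, a bit of arithmetic on the model helps. The sketch below (function names are mine) takes the number of values per parameter in the example model (7, 7, 2, 3, 8, 2) and computes the size of the full cartesian product, the number of distinct value pairs that pairwise coverage has to hit, and a simple lower bound on the number of tests: the two largest parameters have 8 x 7 value pairs between them, and each test can cover only one of those.

```go
package main

import "fmt"

// cartesianSize is the number of tests needed to cover every full combination.
func cartesianSize(sizes []int) int {
	total := 1
	for _, n := range sizes {
		total *= n
	}
	return total
}

// pairCount is the number of distinct value pairs across all parameter pairs.
func pairCount(sizes []int) int {
	sum := 0
	for i := 0; i < len(sizes); i++ {
		for j := i + 1; j < len(sizes); j++ {
			sum += sizes[i] * sizes[j]
		}
	}
	return sum
}

// pairwiseLowerBound is the largest product of any two parameter sizes.
// No pairwise test set can be smaller than this.
func pairwiseLowerBound(sizes []int) int {
	best := 0
	for i := 0; i < len(sizes); i++ {
		for j := i + 1; j < len(sizes); j++ {
			if p := sizes[i] * sizes[j]; p > best {
				best = p
			}
		}
	}
	return best
}

func main() {
	// Type, Size, Format method, File system, Cluster size, Compression
	sizes := []int{7, 7, 2, 3, 8, 2}
	fmt.Println("all combinations:", cartesianSize(sizes))      // prints 4704
	fmt.Println("value pairs to cover:", pairCount(sizes))      // prints 331
	fmt.Println("pairwise lower bound:", pairwiseLowerBound(sizes)) // prints 56
}
```

So PICT covers all 331 value pairs with 62 tests instead of 4704, and no pairwise set could get below 56 for this model anyway.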

To produce all 3-way combinations with PICT, the syntax is:

./pict mymodels/example1.pict >myoutputs/example1.txt /o:3

which generates 392 combinations/test cases:

wc -l myoutputs/example1.txt 
      393 myoutputs/example1.txt

I find the PICT command-line syntax a bit odd, as parameters have to be the last elements on the line, and they are identified by these strange symbols like “/o:”. But it works, so great.


Of course, not all combinations are always valid. So PICT has extensive support to define constraints on the generator model, to limit what kind of combinations PICT generates. The PICT documentation page has lots of good examples. This part actually seems nicely documented. But let’s try a few just to see what happens. The basic example from the page:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 4096;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example2.pict >myoutputs/example2.txt

wc -l myoutputs/example2.txt 
      63 myoutputs/example2.txt

So the same number of tests. The contents:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	500	slow	NTFS	1024	on
Primary	500	quick	FAT32	512	off
Single	10	slow	FAT	1024	off
Single	5000	quick	FAT32	32768	on
Span	40000	quick	NTFS	16384	off
Mirror	40000	slow	NTFS	512	on
RAID-5	100	quick	FAT	8192	on
Logical	500	slow	FAT	2048	off
Span	10000	slow	FAT32	1024	on
Logical	1000	slow	FAT32	16384	on
Span	1000	quick	FAT	512	off
Primary	10	quick	NTFS	1024	on
Mirror	1000	quick	NTFS	4096	off
RAID-5	40000	slow	NTFS	1024	off
Single	40000	slow	NTFS	8192	off
Stripe	10	slow	FAT32	4096	on
Stripe	40000	quick	NTFS	2048	on
Primary	100	slow	NTFS	32768	off
Stripe	500	quick	FAT	16384	off
RAID-5	1000	quick	FAT32	2048	off
Mirror	10	quick	FAT	65536	off
Logical	40000	quick	NTFS	4096	on
RAID-5	5000	slow	NTFS	512	off
Stripe	5000	slow	FAT32	65536	on
Span	10	quick	FAT32	2048	off
Logical	10000	quick	NTFS	65536	off
Primary	1000	slow	FAT	65536	off
Mirror	500	quick	FAT	32768	on
Single	100	quick	FAT32	512	on
Mirror	5000	slow	FAT32	2048	on
Mirror	100	quick	NTFS	2048	on
Logical	5000	quick	FAT32	8192	off
Logical	100	slow	FAT32	1024	on
Primary	100	quick	FAT32	16384	off
Primary	10000	quick	FAT32	2048	on
RAID-5	10	slow	FAT	32768	off
Mirror	10	quick	FAT	16384	on
Single	500	slow	FAT	4096	on
Span	500	slow	FAT32	8192	on
Stripe	10000	quick	FAT32	32768	off
Logical	1000	slow	NTFS	32768	on
Single	10000	slow	NTFS	16384	off
Span	100	slow	FAT32	4096	on
Stripe	1000	slow	NTFS	8192	on
Span	5000	quick	NTFS	32768	on
Primary	5000	slow	FAT32	4096	off
RAID-5	100	slow	FAT	65536	off
RAID-5	10000	slow	FAT32	4096	on
Single	1000	quick	FAT	1024	on
Mirror	10	quick	FAT	1024	on
Logical	5000	slow	FAT32	1024	off
Single	500	slow	FAT32	65536	off
Logical	10	quick	NTFS	512	on
Single	1000	slow	FAT	2048	off
Mirror	10000	quick	NTFS	8192	on
Primary	10	quick	FAT32	8192	on
Primary	40000	slow	NTFS	32768	off
Stripe	100	slow	FAT	512	off
Mirror	10000	slow	FAT32	512	on
RAID-5	5000	quick	NTFS	16384	off
Span	40000	quick	NTFS	65536	on
RAID-5	500	quick	FAT	4096	on

In the “Size” column vs the “File system” column, the “FAT” file system type now always has a size of 4096 or less. So it works as expected. I have to admit, I found the value 4096 very confusing here, since there is no option of 4096 in the input model for “Size” but there is for “Cluster size”. I was looking at the wrong column initially, wondering why the constraint was not working. It works, but it is a slightly confusing example.
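Rather than eyeballing the columns, a constraint like this can be checked mechanically. Here is a minimal sketch that reads the tab-separated PICT output and reports rows where a file system exceeds a size limit. The function name and the limits map are my own invention, the first two data rows are taken from the output above, and the last row is a made-up violation to show what a failure would look like.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// findViolations returns the rows whose "Size" exceeds the limit given
// for their "File system". File systems without a limit are ignored.
func findViolations(tsv string, maxSize map[string]int) ([]string, error) {
	lines := strings.Split(strings.TrimSpace(tsv), "\n")
	header := strings.Split(lines[0], "\t")
	fsCol, sizeCol := -1, -1
	for i, name := range header {
		switch strings.TrimSpace(name) {
		case "File system":
			fsCol = i
		case "Size":
			sizeCol = i
		}
	}
	if fsCol < 0 || sizeCol < 0 {
		return nil, fmt.Errorf("header is missing the Size or File system column")
	}

	var bad []string
	for _, line := range lines[1:] {
		fields := strings.Split(line, "\t")
		if len(fields) <= fsCol || len(fields) <= sizeCol {
			return nil, fmt.Errorf("short row: %q", line)
		}
		size, err := strconv.Atoi(strings.TrimSpace(fields[sizeCol]))
		if err != nil {
			return nil, err
		}
		limit, limited := maxSize[strings.TrimSpace(fields[fsCol])]
		if limited && size > limit {
			bad = append(bad, line)
		}
	}
	return bad, nil
}

func main() {
	output := "Type\tSize\tFormat method\tFile system\tCluster size\tCompression\n" +
		"Single\t10\tslow\tFAT\t1024\toff\n" +
		"Span\t40000\tquick\tNTFS\t16384\toff\n" +
		"Span\t5000\tslow\tFAT\t512\ton\n" // made-up row that breaks the FAT limit

	bad, err := findViolations(output, map[string]int{"FAT": 4096})
	if err != nil {
		panic(err)
	}
	fmt.Println("violations:", len(bad))
	for _, row := range bad {
		fmt.Println(row)
	}
}
```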

Similarly, 3-way combinations produce the same number of tests (as it did without any constraints) even with these constraints:

./pict mymodels/example2.pict >myoutputs/example2.txt /o:3

wc -l myoutputs/example2.txt 
     393 myoutputs/example2.txt

To experiment a bit more, I set a limit on FAT size to be 100 or less:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 100;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example3.pict >myoutputs/example3.txt

wc -l myoutputs/example3.txt 
      62 myoutputs/example3.txt

./pict mymodels/example3.pict >myoutputs/example3.txt /o:3
wc -l myoutputs/example3.txt 
     397 myoutputs/example3.txt

What happened here?

Running the 2-way generator produces 61 tests. So the number of combinations generated was finally reduced by one with the additional constraint.

Running the 3-way generator produces 396 tests. So the number of tests/combinations generated was increased by 4, compared to the 3-way generator without this constraint. Which is odd, as I would expect the number of tests to go down when there are fewer options. In fact, you could get a smaller number of tests just by taking the 392 tests from the previous generator run with fewer constraints. Then take every line with “FAT” for “File system”, and if the “Size” for those is bigger than 100, replace it with either 100 or 10. This would be a max of 392 as it was before.

My guess is this is because building the smallest set of inputs to cover all requested combinations is a very hard problem. I believe in computer science this would be called an NP-hard problem (or so I gather from the academic literature for combinatorial testing, even if they seem to call the test set a “covering array” and other academic tricks). So no efficient algorithm is known that would always produce the optimal result. The generator then has to accommodate all the possible constraints in its code, and ends up using some heuristics that here result in a slightly bigger set. It is still likely a very nicely optimized set. Glad it’s not me having to write those algorithms :). I just use them and complain :).

PICT has a bunch of other ways to define conditional constraints with the use of IF, THEN, ELSE, NOT, OR, AND statements. The docs cover that nicely. So let’s not go there.

The Naming Trick

Something I found interesting is a way to build models by naming different items separately, and constraining them separately:

# Machine 1
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

# Machine 2
OS_2:   Win7, Win8, Win10
SKU_2:  Home, Pro
LANG_2: English, Spanish, Chinese, Hindi

IF [LANG_1] = [LANG_2]
THEN [OS_1] <> [OS_2] AND [SKU_1] <> [SKU_2];

Here we have two items (“machines”) with the same three properties (“OS”, “SKU”, “LANG”). However, by numbering the properties, the generator sees them as different. From this, the generator can now build combinations of different two-machine configurations, using just the basic syntax and no need to tweak the generator itself. The only difference between the two is that “Machine 2” can have one additional language (“Hindi”).

The constraint at the end also nicely ensures that if the generated configurations have the same language, the OS and SKU should be different.

Scaling these “machine” combinations to a large number of machines would require a different type of an approach. Since it is doubtful anyone would like to write a model with 100 machines, each separately labeled. No idea what modelling approach would be the best for that, but right now I don’t have a big requirement for it, so not going there. Maybe a different approach of having the generator produce a more abstract set of combinations, and map those to large number of “machines” somehow.

Repetition and Value References

There is quite a bit of repetition in the above model with both machines repeating all the same parameter values. PICT has a way to address this by referencing values defined for other parameters:

# Machine 1
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

# Machine 2
OS_2:   <OS_1>
SKU_2:  <SKU_1>
LANG_2: <LANG_1>, Hindi

So in this case, “machine 2” is repeating the values from “machine 1”, and changing them in “machine 1” also changes them in “machine 2”. Sometimes that is good, other times maybe not. Because changing one thing would change many, and you might not remember that every time. On the other hand, you would not want to be manually updating all items with the same info every time. But a nice feature to have if you need it.

Data Types

With regards to variable types, PICT supports numbers and strings. So this is given as an example model:

Size:  1, 2, 3, 4, 5
Value: a, b, c, d

IF [Size] > 3 THEN [Value] > "b";

I guess the two types are there because you can then define different types of constraints on them. For example, [Size] > 3 makes sense. The [Value] > "b" part a bit less so. So let’s try that:

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      17 myoutputs/example4.txt

The output looks like this:

Size	Value
3	a
2	c
1	c
2	b
2	a
1	d
1	a
3	b
4	d
2	d
3	d
1	b
5	c
3	c
4	c
5	d

And here, if “Size” equals 4 or 5 (so is >3), “Value” is always “c” or “d”. The PICT docs state “String comparison is lexicographical and case-insensitive by default”. So [> “b”] just refers to letters coming after “b”, which equals “c” and “d” in the choices in this model. It seems a bit odd to define such comparisons against text in a model, but I guess it can help make a model more readable if you can represent values as numbers or strings, and define constraints on them in a similar way.
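To convince myself what “lexicographical and case-insensitive” means for this model, here is a tiny sketch that mimics the comparison. This is only my approximation of the documented behavior (lowercase both sides, then compare as strings), not PICT’s actual code, and greaterThan is a made-up helper.

```go
package main

import (
	"fmt"
	"strings"
)

// greaterThan keeps the values that sort after bound when both sides are
// compared as lowercase strings.
func greaterThan(values []string, bound string) []string {
	var out []string
	b := strings.ToLower(bound)
	for _, v := range values {
		if strings.ToLower(v) > b {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	fmt.Println(greaterThan([]string{"a", "b", "c", "d"}, "b")) // prints [c d]
	fmt.Println(greaterThan([]string{"A", "B", "C", "D"}, "b")) // prints [C D]
}
```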

To verify, I try a slightly modified model:

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      13 myoutputs/example4.txt

So, the number of tests is reduced from 16 to 12. Results in the following output:

Size	Value
5	c
2	c
1	d
4	d
1	b
4	c
3	d
3	c
2	d
1	c
1	a
5	d

Which confirms that lines (tests) with Size > 1 now have only letters “c” or “d” in them. This naturally also limits the number of available combinations, hence the reduced test set.

Extra Features

There are some nice features that are nicely explained in the PICT docs:

  • Submodels: Refers to defining levels of combinations per test. For example, 2-way combinations of OS with all others, and 3-way combination of File System Type with all others, at the same time.
  • Aliasing: You can give the same parameter several names and all are treated the same. Not sure why you want to do that but anyway.
  • Weighting: Since the full set of combinations will have more of some values anyway, this can be used to set preference for specific ones.

Negative Testing / Erroneous Values

A few more interesting ones are “negative testing” and “seeding”. So first negative testing. Negative testing refers to having a set of exclusive values. So those values should never appear together. This is because each of them is expected to produce an error. So you want to make sure the error they produce is visible and not “masked” (hidden) by some other erroneous value.

The example model from PICT docs, with a small modification to name the invalid values differently:

# Trivial model for SumSquareRoots

A: ~-1, 0, 1, 2
B: ~-2, 0, 1, 2

Running it, we get:

./pict mymodels/example5.pict >myoutputs/example5.txt

wc -l myoutputs/example5.txt 
      16 myoutputs/example5.txt
0	2
0	1
1	2
2	1
1	0
2	0
1	1
2	2
0	0
0	~-2
1	~-2
~-1	0
~-1	1
2	~-2
~-1	2

The negative value is prefixed with “~”, and the results show combinations of the two negative values with all possible values of the other variable. So if A is -1, it is combined with 0, 1, 2 for B. If B is -2, it is combined with 0, 1, 2 for A. But -1 and -2 are never paired, to avoid one “faulty” variable masking the other one. I find having the “~” added everywhere a bit distracting. But I guess you could parse around it, not a real issue.
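The rule that at most one invalid value appears per test can be sketched in a few lines of Python. This is only an illustration of the idea for this two-parameter model, with a made-up function name, and it says nothing about how PICT does it internally.

```python
from itertools import product

# "invalid" corresponds to the values prefixed with "~" in the PICT model.
A = {"valid": ["0", "1", "2"], "invalid": ["-1"]}
B = {"valid": ["0", "1", "2"], "invalid": ["-2"]}


def negative_test_pairs(a, b):
    """All pairs that contain at most one invalid value."""
    tests = list(product(a["valid"], b["valid"]))                # positive tests
    tests += [(x, y) for x in a["invalid"] for y in b["valid"]]  # only A is invalid
    tests += [(x, y) for x in a["valid"] for y in b["invalid"]]  # only B is invalid
    return tests


tests = negative_test_pairs(A, B)
print(len(tests))
```

This gives 9 positive tests plus 3 for each invalid value, so 15 in total, the same number of tests as in the PICT output above.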

Of course, there is nothing to stop us from setting the set of possible values to include -1 and -2, and get combinations of several “negative” values. Let’s try:

A: -1, 0, 1, 2
B: -2, 0, 1, 2
./pict mymodels/example6.pict >myoutputs/example6.txt
wc -l myoutputs/example6.txt 
      17 myoutputs/example6.txt
1	-2
2	0
1	0
-1	0
0	-2
2	1
-1	-2
0	0
1	2
-1	2
0	2
2	-2
1	1
-1	1
0	1
2	2

So there we go. This produced one test more than the previous one. And that would be the one where both the negatives are present. Line with “-1” and “-2” together.

Overall, the “~” notation seems like just a way to avoid having a set of values appear together. Convenient, and a good way to optimize more when you have large models, big input spaces, slow tests, difficult problem reports, etc.

Seeding / Forcing Tests In

Seeding. When I hear seeding in test generation, I think about the seed value for a random number generator, because often those are used to help generate tests. Well, with PICT it actually means you can predefine a set of combinations that need to be a part of the final test set.

So let’s try with the first example model from above:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

The seed files should be in the same format as the output produced by PICT. Let’s say I want to try all types with all file systems, using the smallest size. So I try with this:

Type	Size	Format method	File system	Cluster size	Compression
Primary	10		FAT32		on
Logical	10		FAT32		on
Single	10		FAT32		on
Span	10		FAT32		on
Stripe	10		FAT32		on
Mirror	10		FAT32		on
RAID-5	10		FAT32		on
Primary	10		FAT		on
Logical	10		FAT		on
Single	10		FAT		on
Span	10		FAT		on
Stripe	10		FAT		on
Mirror	10		FAT		on
RAID-5	10		FAT		on
Primary	10		NTFS		on
Logical	10		NTFS		on
Single	10		NTFS		on
Span	10		NTFS		on
Stripe	10		NTFS		on
Mirror	10		NTFS		on
RAID-5	10		NTFS		on
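Typing a seed file like this by hand gets tedious, so it could also be generated. A small Python sketch, assuming the tab-separated layout shown above with empty fields for the values PICT should fill in (the function name is my own):

```python
from itertools import product

HEADER = ["Type", "Size", "Format method", "File system", "Cluster size", "Compression"]
TYPES = ["Primary", "Logical", "Single", "Span", "Stripe", "Mirror", "RAID-5"]
FILE_SYSTEMS = ["FAT32", "FAT", "NTFS"]


def seed_rows(size="10", compression="on"):
    """Build tab-separated seed lines; empty fields are left open for PICT."""
    lines = ["\t".join(HEADER)]
    for file_system, disk_type in product(FILE_SYSTEMS, TYPES):
        lines.append("\t".join([disk_type, size, "", file_system, "", compression]))
    return lines


if __name__ == "__main__":
    print("\n".join(seed_rows()))
```

Redirecting the printed output to a file such as mymodels/example7.seed would give the same 21 seed rows plus the header.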

To run it:

./pict mymodels/example7.pict /e:mymodels/example7.seed >myoutputs/example7.txt
 wc -l myoutputs/example7.txt 
      73 myoutputs/example7.txt

So in the beginning of this post, the initial model generated 62 combinations. With this seed file, some forced repetition is there and the size goes up to 72. Still not that much bigger, but I guess shows something about how nice it is to have a combinatorial test tool to optimize this type of test set for you.

The actual output:

Type	Size	Format method	File system	Cluster size	Compression
Primary	10	quick	FAT32	2048	on
Logical	10	slow	FAT32	16384	on
Single	10	slow	FAT32	65536	on
Span	10	quick	FAT32	1024	on
Stripe	10	quick	FAT32	8192	on
Mirror	10	quick	FAT32	512	on
RAID-5	10	slow	FAT32	32768	on
Primary	10	slow	FAT	4096	on
Logical	10	quick	FAT	1024	on
Single	10	quick	FAT	32768	on
Span	10	slow	FAT	512	on
Stripe	10	slow	FAT	16384	on
Mirror	10	slow	FAT	8192	on
RAID-5	10	slow	FAT	2048	on
Primary	10	quick	NTFS	65536	on
Logical	10	quick	NTFS	4096	on
Single	10	slow	NTFS	16384	on
Span	10	quick	NTFS	32768	on
Stripe	10	slow	NTFS	1024	on
Mirror	10	slow	NTFS	2048	on
RAID-5	10	quick	NTFS	512	on
Span	40000	slow	FAT	65536	off
Single	5000	quick	NTFS	8192	off
Mirror	1000	quick	FAT32	4096	off
Stripe	100	slow	FAT	32768	off
Primary	500	slow	FAT	512	off
Primary	40000	quick	NTFS	8192	on
Logical	10000	quick	NTFS	32768	off
RAID-5	40000	slow	FAT32	1024	off
Span	100	quick	NTFS	8192	on
Mirror	10000	slow	FAT32	16384	off
Logical	5000	slow	FAT	512	on
Primary	1000	slow	FAT	1024	on
Mirror	5000	quick	FAT32	1024	on
Logical	1000	quick	NTFS	32768	on
Single	40000	slow	FAT32	512	on
Stripe	40000	quick	FAT	16384	on
Logical	100	quick	FAT32	2048	off
Single	100	quick	FAT32	1024	off
Primary	5000	quick	NTFS	32768	off
Single	40000	slow	NTFS	2048	on
Logical	500	quick	FAT32	8192	on
Single	500	slow	NTFS	4096	on
Span	500	quick	FAT32	16384	on
Primary	100	quick	FAT32	512	off
Stripe	1000	slow	FAT32	2048	on
RAID-5	10000	quick	FAT	8192	on
Stripe	10000	slow	NTFS	512	off
Stripe	5000	quick	FAT	65536	on
Mirror	40000	slow	NTFS	32768	on
Primary	10000	quick	NTFS	1024	on
RAID-5	100	quick	FAT	16384	off
Mirror	500	quick	NTFS	1024	on
Single	1000	slow	FAT32	512	on
Span	100	slow	FAT32	4096	off
Span	5000	slow	NTFS	2048	on
RAID-5	40000	slow	FAT	4096	off
Span	1000	slow	FAT32	16384	on
Mirror	100	quick	FAT	65536	on
Single	10000	slow	FAT	4096	off
RAID-5	1000	slow	NTFS	65536	off
Span	10000	slow	NTFS	65536	on
Span	1000	slow	FAT32	8192	off
RAID-5	500	quick	NTFS	32768	off
Stripe	500	slow	FAT	2048	off
RAID-5	5000	slow	NTFS	16384	on
Stripe	5000	slow	FAT32	4096	off
Logical	10	slow	FAT	65536	off
RAID-5	10000	quick	NTFS	2048	on
Primary	1000	slow	FAT	16384	off
Logical	40000	quick	FAT32	8192	on
Primary	500	quick	FAT	65536	on

This output starts with the seeds given, and PICT has done its best to fill in the blanks with such values as to still minimize the test numbers while meeting the combinatorial coverage requirements.
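If you want to convince yourself that a generated test set really covers all pairs, a simple checker is easy to write. The sketch below is my own helper, not part of PICT, and it assumes a model without constraints. It lists every value pair that no test covers.

```python
from itertools import combinations, product


def uncovered_pairs(parameters, tests):
    """Return the value pairs that no test covers.

    parameters: dict of parameter name -> list of values
    tests: list of dicts mapping parameter name -> chosen value
    """
    missing = set()
    for p1, p2 in combinations(parameters, 2):
        seen = {(t[p1], t[p2]) for t in tests}
        for v1, v2 in product(parameters[p1], parameters[p2]):
            if (v1, v2) not in seen:
                missing.add((p1, v1, p2, v2))
    return missing


# A reduced three-parameter example; four tests are enough for all pairs.
params = {
    "Format method": ["quick", "slow"],
    "Compression": ["on", "off"],
    "File system": ["FAT", "NTFS"],
}
tests = [
    {"Format method": "quick", "Compression": "on", "File system": "FAT"},
    {"Format method": "quick", "Compression": "off", "File system": "NTFS"},
    {"Format method": "slow", "Compression": "on", "File system": "NTFS"},
    {"Format method": "slow", "Compression": "off", "File system": "FAT"},
]
print(uncovered_pairs(params, tests))
```

The full combination set for these three parameters would be 8 tests, while the 4 above already cover every pair. Dropping any one of them leaves some pairs uncovered.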

Personal Thoughts

Searching for PICT and pairwise testing or combinatorial testing brings up a bunch of results and reasonably good articles on the topic. Maybe even more of such practice oriented ones than model-based testing. Maybe because it is simpler to apply, and thus easier to pick up and go in practice?

For example, this has a few good points. One is to use an iterative process to build the input models. So as with everything else, not to expect to get it all perfectly right from the first try. Another is to consider invariants for test oracles. So things that should always hold, such as two nodes in a distributed system never being in a conflicting state when an operation involving both is done. Of course, this would also apply to any other type of testing. The article seems to consider this also from a hierarchical viewpoint, checking the strictest or most critical ones first.

Another good point in that article is to use readable names for the values. I guess sometimes people could use the PICT output as such, to define test configurations and the like for manual testing. I would maybe consider using them more as input for automated test execution, to define parameter values to cover. In such cases, it would be enough to give each value a short name such as “A”, “A1”, or “1”. But looking at the model and the output, it would be difficult to tell which value maps to which symbol. Readable names are just as parseable for the computer but much more readable for the human expert.

Combining with Sequential Models

So this is all nice and shiny, but the examples are actually quite simple test scenarios. There are no complex dependencies between them, no complex state that defines what parameters and values are available, and so on. It mostly seems to vary around what combinations of software or system configurations should be used in testing.

I have worked plenty with model-based testing myself (see OSMO), and actually have talked to some people who have done combinations of combinatorial input generation and model-based testing. I can see how this could be interesting, to identify a set of injection points for parameters and values in a MBT model, and use a combinatorial test data generator to build data sets for those injection points. Likely doing some more of this in practice would reveal good insights on what works and what could be done to make the match even better. Maybe someday.

In any case, I am sure combinatorial test datasets would work great combined with other types of sequences as well. I think this could make a very interesting and practical research topic. Again, maybe someday..

Bye Now

In general, this area seems to have great tools for the basic test generation, but is missing some in-depth experiences and guides for how to apply them to more complex software, together with sequential test cases and test generators.

A simpler, yet interesting topic to do would be to integrate the PICT type generator directly with the test environment. Run the combinatorial generator from this during the test runs, and have it randomize the combinations in a bit different ways during different runs. While still maintaining the overall combinatorial coverage.
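As a rough sketch of that idea, a test run could call PICT with its /r[:N] option, which randomizes generation with N as the seed, and parse the tab-separated output. The helper names and the ./pict path are my own assumptions, and I have not tried this as part of a real test environment.

```python
import csv
import io
import subprocess


def pict_command(model_path, seed=None, pict_binary="./pict"):
    """Build the PICT command line; /r[:N] asks PICT to randomize generation."""
    command = [pict_binary, model_path]
    command.append("/r" if seed is None else f"/r:{seed}")
    return command


def generate_tests(model_path, seed=None):
    """Run PICT and parse its tab-separated output into a list of dicts."""
    result = subprocess.run(
        pict_command(model_path, seed),
        capture_output=True, text=True, check=True,
    )
    return list(csv.DictReader(io.StringIO(result.stdout), delimiter="\t"))
```

Passing a fixed seed should keep a run repeatable, while leaving it out gives a different but still pairwise-covering set on each run.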

Finnish Topic Modelling

Previously I wrote about a few experiments I ran with topic-modelling. I briefly glossed over having some results for a set of Finnish text as an example of a smaller dataset. This is a bit deeper look into that..

I use two datasets, the Finnish Wikipedia dump, and the city of Oulu board minutes. Same ones I used before. Previously I covered topic modelling more generally, so I won’t go into too much detail here. To summarize, topic modelling algorithms (of which LDA or Latent Dirichlet Allocation is used here) find sets of words with different distributions over sets of documents. These are then called the “topics” discussed in those documents.

This post looks at how to use topic models for a different language (besides English) and what could one maybe do with the results.

Lemmatize (turn words into baseforms before use) or not? I choose to lemmatize for topic modelling. This seems to be the general consensus when looking up info on topic modelling, and in my experience it just gives better results as the same word appears only once. I covered POS tagging previously, and I believe it would be useful to apply here as well, but I don’t. Mostly because it is not needed to test these concepts, and I find the results are good enough without adding POS tagging to the mix (which has its issues as I discussed before). Simplicity is nice.

I used the Python Gensim package for building the topic models. As input, I used the Finnish Wikipedia text and the city of Oulu board minutes texts. I used my existing text extractor and lemmatizer for these (to get the raw text out of the HTML pages and PDF docs, and to baseform them, as discussed in my previous posts). I dumped the lemmatized raw text into files using slight modifications of my previous Java code and then read the docs from those files as input to Gensim in a Python script.
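For reference, the Gensim side of this can be quite short. The sketch below is not my original script. It assumes one lemmatized document per line with whitespace-separated words, and the function names are made up for the example. Gensim is imported inside the training function so the tokenizing part runs even without it installed.

```python
def tokenize_lines(lines):
    """Split each non-empty line (one lemmatized document) into a list of words."""
    return [line.split() for line in lines if line.strip()]


def train_lda(docs, num_topics=50, passes=1):
    """Train an LDA model with Gensim on tokenized documents."""
    from gensim import corpora, models  # requires the gensim package

    dictionary = corpora.Dictionary(docs)               # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per document
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=num_topics, passes=passes)
    return lda, dictionary
```

With the documents loaded through tokenize_lines(open(path, encoding="utf-8")), calling train_lda(docs, num_topics=20, passes=20) and then lda.show_topic(0, topn=10) gives the top words of a topic as (word, probability) pairs. Those probabilities are a different kind of number than the counts shown in the topic lists in this post.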

I started with the Finnish Wikipedia dump, using Gensim to provide 50 topics, with 1 pass over the corpus. First 10 topics that I got:

  • topic0=focus[19565] var[8893] liivi[7391] luku[6072] html[5451] murre[3868] verkkoversio[3657] alku[3313] joten[2734] http[2685]
  • topic1=viro[63337] substantiivi[20786] gen[19396] part[14778] taivutus[13692] tyyppi[6592] täysi[5804] taivutustyyppi[5356] liite[4270] rakenne[3227]
  • topic2=isku[27195] pieni[10315] tms[7445] aine[5807] väri[5716] raha[4629] suuri[4383] helppo[4324] saattaa[4044] heprea[3129]
  • topic3=suomi[89106] suku[84950] substantiivi[70654] pudottaa[59703] kasvi[46085] käännös[37875] luokka[35566] sana[33868] kieli[32850] taivutusmuoto[32067]
  • topic4=ohjaus[129425] white[9304] off[8670] black[6825] red[5066] sotilas[4893] fraasi[4835] yellow[3943] perinteinen[3744] flycatcher[3735]
  • topic5=lati[48738] eesti[25987] www[17839] http[17073] keele[15733] eki[12421] lähde[11306] dict[11104] sõnaraamat[10648] tallinn[8504]
  • topic6=suomi[534914] käännös[292690] substantiivi[273243] aihe[256126] muualla[254788] sana[194213] liittyvä[193298] etymologi[164158] viite[104417] kieli[102489]
  • topic7=italia[66367] substantiivi[52038] japani[27988] inarinsaame[9464] kohta[7433] yhteys[7071] vaatekappale[5553] rinnakkaismuoto[5469] taas[4986] voimakas[3912]
  • topic8=sana[548232] liittyvä[493888] substantiivi[298421] ruotsi[164717] synonyymi[118244] alas[75430] etymologi[64170] liikuttaa[38058] johdos[25603] yhdyssana[24943]
  • topic9=juuri[3794] des[3209] jumala[1799] tadžikki[1686] tuntea[1639] tekijä[1526] tulo[1523] mitta[1337] jatkuva[1329] levy[1197]
  • topic10=törmätä[22942] user[2374] sur[1664] self[1643] hallita[1447] voittaa[1243] piste[1178] data[1118] harjoittaa[939] jstak[886]

The format of the topic list I used here is “topicX=word1[count] word2[count]”, where X is the number of the topic, word1 is the first word in the topic, word2 the second, and so on. The [count] is how many times the word was associated with the topic in different documents. Consider it the strength, weight, or whatever of the word in the topic.
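If you want to process lists in this format further, a small parser is enough. A sketch in Python, with a function name of my own choosing:

```python
import re

# One entry looks like word[count]; the word can be any run of non-space characters.
TOPIC_ENTRY = re.compile(r"(\S+)\[(\d+)\]")


def parse_topic_line(line):
    """Split 'topicX=word1[count] word2[count] ...' into (name, [(word, count), ...])."""
    name, _, body = line.partition("=")  # split at the first '=' only
    words = [(word, int(count)) for word, count in TOPIC_ENTRY.findall(body)]
    return name.strip(" \t•"), words


name, words = parse_topic_line("topic5=lati[48738] eesti[25987] www[17839]")
print(name, words)
```

Splitting only at the first “=” matters because some of the words themselves contain “=”, as in the website link garbage that shows up in some topics.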

So just a few notes on the above topic list:

  • topic0 = mostly website related terms, interleaved with a few odd ones. Examples of odd ones; “liivi” = vest, “luku” = number/chapter (POS tagging would help differentiate), “murre” = dialect.
  • topic1 = mostly Finnish language related terms. “viro” = Estonia = slightly odd to have here. Estonian is the closest related language to Finnish but still..
  • topic3 = another Finnish language related topic. Odd one here is “kasvi” = plant. Generally this seems to be more related to words and their forms, whereas topic1 is maybe more about structure and relations.
  • topic5 = Estonia related

Overall, I think this would improve given more passes over the corpus to train the model. This would give the algorithm more time and data to refine the model. I only ran it with one pass here since the training for more topics and with more passes started taking days and I did not have the resources to go there.

My guess is also that with more data and broader concepts (Wikipedia covering pretty much every topic there is..) you would also need more topics than the 50 I used here. However, I had to limit the size due to time and resource constraints. Gensim probably also has more advanced tuning options (e.g., parallel runs) that would benefit the speed. So I tried a few more sizes and passes with the smaller Oulu city board dataset, as it was faster to run.

Some topics for the city of Oulu board minutes, run for 20 topics and 20 passes over the training data:

  • topic0=oulu[2096] kaupunki[1383] kaupunginhallitus[1261] 2013[854] päivämäärä[575] vuosi[446] päätösesitys[423] jäsen[405] hallitus[391] tieto[387]
  • topic1=kunta[52] palvelu[46] asiakaspalvelu[41] yhteinen[38] viranomainen[25] laki[24] valtio[22] myös[20] asiakaspalvelupiste[19] kaupallinen[17]
  • topic2=oulu[126] palvelu[113] kaupunki[113] koulu[89] tukea[87] edistää[71] vuosi[71] osa[64] nuori[63] toiminta[61]
  • topic3=tontti[490] kaupunki[460] oulu[339] asemakaava[249] rakennus[241] kaupunginhallitus[234] päivämäärä[212] yhdyskuntalautakunta[206] muutos[191] alue[179]
  • topic5=kaupunginhallitus[1210] päätös[1074] jäsen[861] oulu[811] kaupunki[681] pöytäkirja[653] klo[429] päivämäärä[409] oikaisuvaatimus[404] matti[316]
  • topic6=000[71] 2012[28] oulu[22] muu[20] tilikausi[16] vuosi[16] yhde[15] kunta[14] 2011[13] 00000[13]
  • topic8=alue[228] asemakaava[96] rakentaa[73] tulla[58] oleva[56] rakennus[55] merkittävä[53] kortteli[53] oulunsalo[50] nykyinen[48]
  •[15107] ktwebbin[15105] 2016[7773] eet[7570] pk_asil_tweb.htm?[7551] ktwebscr[7550] dbisa.dll[7550] url=http[7540] doctype[7540] =3&docid[7540]
  • topic11=yhtiö[31] osake[18] osakas[11] energia[10] hallitus[10] 18.11.2013[8] liite[7] lomautus[6] sähkö[6] osakassopimus[5]
  • topic12=13.05.2013[13] perlacon[8] kuntatalousfoorumi[8] =1418[6] meeting_date=21.3.2013[6] =2070[6] meeting_date=28.5.2013[6] =11358[5] meeting_date=3.10.2016[5] -31.8.2015[4]
  • topic13=001[19] oulu[11] 002[5] kaupunki[4] sivu[3] ÔŅĹÔŅĹÔŅĹ[3] palvelu[3] the[3] asua[2] and[2]

Some notes on the topics above:

  • The word “oulu” repeats in most of the topics. This is quite natural as all the documents are from the board of the city of Oulu. Depending on the use case for the topics, it might be useful to add this word to the list of words to be removed in the pre-cleaning phase for the documents before running the topic modelling algorithm. Or it might be useful information, along with the weight of the word inside the topic. Depends.
  • topic0 = generally about the board structure. For example, “kaupunki”=city, “kaupunginhallitus”=city board, “päivämäärä”=date, “päätösesitys”=proposal for decision.
  • topic1 = Mostly city service related words. For example, “kunta” = municipality, “palvelu” = service, “asiakaspalvelu” = customer service, “myös” = also, so something to add to the cleaners again.
  • topic2 = School related. For example, “koulu” = school, “tukea” = support, … Sharing again common words such as “kaupunki” = city, which may also be considered for removal or not depending on the case.
  • topic3 = City area planning related. For example, “tontti” = plot of land, “asemakaava” = zoning plan, …
  • In general these are quite good and focused topics, so I think quite a good result. Some exceptions to consider:
  • topic10 = mostly garbage related to HTML formatting and website link structures. Still a real topic of course, so nicely identified. I think it is something to consider adding to the cleaning list for pre-processing.
  • topic12 = Seems related to some city finance related consultation (perlacon seems to be such a company) and an associated event (the forum), with a bunch of meeting dates.
  • topic13 = unclear garbage
  • So in general, I guess reasonably good results but in real applications, several iterations of fine-tuning the words, the topic modelling algorithm parameters, etc. based on the results would be very useful.
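The cleaning step mentioned in these notes, removing both general stopwords and dataset-specific ones such as “oulu”, could look roughly like this. The word lists are tiny illustrative examples, not the lists I actually used, and the function name is made up.

```python
# Tiny illustrative lists; a real run would use a full Finnish stopword list.
GENERAL_STOPWORDS = {"myös", "ja", "on"}
# Words that show up in nearly every document of this particular dataset.
DOMAIN_STOPWORDS = {"oulu", "kaupunki"}


def clean_tokens(tokens, extra_stopwords=DOMAIN_STOPWORDS):
    """Drop general and domain-specific stopwords from a lemmatized document."""
    blocked = GENERAL_STOPWORDS | set(extra_stopwords)
    return [t for t in tokens if t.lower() not in blocked]


print(clean_tokens(["oulu", "kaupunki", "koulu", "myös", "tontti"]))
```

Keeping the domain list separate makes it easy to try the topic model both with and without words like “oulu”, since whether they are noise or information depends on the use case.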

So that was the city minutes topics for a smaller set of topics and more passes. What does it look like for 100 topics, and how does the number of passes over the corpus affect the larger size? More passes should give the algorithm more time to refine the topics, but smaller datasets might not have so many good topics..

For 100 topics, 1 pass, 10 first topics:

  • topic0=oulu[55] kaupunki[22] 000[20] sivu[14] palvelu[14] alue[13] vuosi[13] muu[11] uusi[11] tavoite[9]
  • topic1=kaupunki[18] oulu[17] jäsen[15] 000[10] kaupunginhallitus[7] kaupunginjohtaja[6] klo[6] muu[5] vuosi[5] takaus[4]
  • topic2=hallitus[158] oulu[151] 25.03.2013[135] kaupunginhallitus[112] jäsen[105] varsinainen[82] tilintarkastaja[79] kaupunki[75] valita[70] yhtiökokousedustaja[50]
  • topic3=kuntalisä[19] oulu[16] palkkatuki[15] kaupunki[14] tervahovi[13] henkilö[12] tukea[12] yritys[10] kaupunginhallitus[10] työtön[9]
  • topic4=koulu[37] oulu[7] sahantie[5] 000[5] äänestyspaikka[4] maikkulan[4] kaupunki[4] kirjasto[4] monitoimitalo[3] kello[3]
  • topic5=oulu[338] kaupunki[204] euro[154] kaupunginhallitus[143] 2013[105] vuosi[96] milj[82] palvelu[77] kunta[71] uusi[64]
  • topic6=000[8] oulu[7] kaupunki[4] vuosi[3] 2012[3] muu[3] kunta[2] muutos[2] 2013[2] sivu[1]
  • topic7=000[5] 26.03.2013[4] oulu[3] 2012[3] kunta[2] vuosi[2] kirjastojärjestelmä[2] muu[1] kaupunki[1] muutos[1]
  • topic8=oulu[471] kaupunki[268] kaupunginhallitus[227] 2013[137] päivämäärä[97] päätös[93] vuosi[71] tieto[67] 000[66] päätösesitys[64]
  • topic9=oulu[5] lomautus[3] 000[3] kaupunki[2] säästötoimenpidevapaa[1] vuosi[1] kunta[1] kaupunginhallitus[1] sivu[1] henkilöstö[1]
  • topic10=oulu[123] kaupunki[82] alue[63] sivu[43] rakennus[42] asemakaava[39] vuosi[38] tontti[38] 2013[35] osa[35]

Without going too much into translating every word, I would say these results are too spread out, so from this, for this dataset, it seems a smaller set of topics would do better. This also seems to be visible in the word counts/strengths in the [square brackets]. The topics with small weights also seem pretty poor topics, while the ones with bigger weights look better (just my opinion of course :)). Maybe something to consider when trying to explore the number of topics etc.
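One rough way to use the weights when exploring is to sum the counts per topic and sort by the total. The sketch below does this for three of the topics listed above (topic4, topic6 and topic8), using their counts as printed. It is just a heuristic for deciding which topics to look at first, not a proper topic quality measure.

```python
# Word counts copied from three of the topics in the 100 topics, 1 pass run.
topic_counts = {
    "topic4": [37, 7, 5, 5, 4, 4, 4, 4, 3, 3],
    "topic6": [8, 7, 4, 3, 3, 3, 2, 2, 2, 1],
    "topic8": [471, 268, 227, 137, 97, 93, 71, 67, 66, 64],
}


def rank_by_weight(counts_by_topic):
    """Sort topics by the sum of their top-word counts, largest first."""
    totals = {name: sum(counts) for name, counts in counts_by_topic.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for name, total in rank_by_weight(topic_counts):
    print(name, total)
```

Here topic8 ends up far ahead with a total of 1561, while topic6 only reaches 35.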

And the same run, this time with 20 passes over the corpus (100 topics and 10 first ones shown):

  • topic0=oulu[138] kaupunki[128] palvelu[123] toiminta[92] kehittää[73] myös[72] tavoite[62] osa[55] vuosi[50] toteuttaa[44]
  • topic1=-seurantatieto[0] 2008-2010[0] =30065[0] =170189[0] =257121[0] =38760[0] =13408[0] oulu[0] 000[0] kaupunki[0]
  • topic2=harmaa[2] tilaajavastuulaki[1][1] torjunta[1] -palvelu[1] talous[0] harmaantalous[0] -30.4.2014[0] hankintayksikkö[0] kilpailu[0]
  • topic3=juhlavuosi[14] 15.45[11] perussopimus[9] reilu[7] kauppa[6] juhlatoimikunta[6] työpaja[6] 24.2.2014[6] 18.48[5] tapahtumatuki[4]
  • topic4=kokous[762] kaupunginhallitus[591] päätös[537] pöytäkirja[536] työjärjestys[362] hyväksyä[362] tarkastaja[360] esityslista[239] valin[188] päätösvaltaisuus[185]
  • topic5=koulu[130] sivistys-[35] suuralue[28] perusopetus[25] tilakeskus[24] kulttuurilautakunta[22] järjestää[22] korvensuora[18] päiväkota[17] päiväkoti[17]
  • topic6=piste[24] hanke[16] toimittaja[12] hankesuunnitelma[12] tila[12] toteuttaa[11] hiukkavaara[10] hyvinvointikeskus[10] tilakeskus[10] monitoimitalo[9]
  • topic7=tiedekeskus[3] museo-[2] prosenttipohjainen[2] taidehankinta[1] uudisrakennushanke[1] hankintamääräraha[1] prosenttitaide[1] hankintaprosessi[0] toteutusajankohta[0] ulosvuokrattava[0]
  • topic8=euro[323] milj[191] vuosi[150] oulu[107] talousarvio[100] tilinpäätös[94] kaupunginhallitus[83] kaupunki[79] 2012[73] 2013[68]
  • topic9=päätös[653] oikaisuvaatimus[335] oulu[295] kaupunki[218] päivä[215] voi[211] kaupunginhallitus[208] posti[187] pöytäkirja[161] viimeinen[154]

Even the smaller topics here seem much better now with the increase in passes over the corpus. So perhaps the real difference just comes from having enough passes over the data, giving the algorithms more time and data to refine the models. At least I would not try without multiple passes based on comparing the results here of 1 vs 20 passes.

For example, topic2 here has small numbers but still all items seem related to grey market economy. Similarly, topic7 has small numbers but the words are mostly related to arts and culture.

So to summarize, it seems lemmatizing your words, exploring your parameters, and ensuring you have a decent amount of data and a decent number of passes for the algorithm are all good points. And properly cleaning your data, and iterating over the process many times to get these right (well, as “right” as you can).

To answer my “research questions” from the beginning: topic modelling for different languages and use cases for topic modelling.

First, lemmatize all your data (I prefer it over stemming but it can be more resource intensive). Clean all your data from the typical stopwords for your language, but also for your dataset and domain. Run the models and analysis several times, and keep refining your list of removed words to clean also based on your use case, your dataset and your domain. Also likely need to consider domain specific lemmatization rules as I already discussed with POS tagging.

Secondly, what use cases did I find looking at topic modelling use cases online? Actually, it seems really hard to find concrete actual reports of uses for topic models. Quora has usually been promising but not so much this time. So I looked at reports in the published research papers instead, trying to see if any companies were involved as well.

Some potential use cases from research papers:

Bug localization, as in finding locations of bugs in source code is investigated here. Source code (comments, source code identifiers, etc) is modelled as topics, which are mapped to a query created from a bug report.

Matching duplicates of documents in here. Topic distributions over bug reports are used to suggest duplicate bug reports. Not exact duplicates but describing the same bug. If the topic distributions are close, flag them as potentially discussing the same “topic” (bug).
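To illustrate what “close” topic distributions could mean, here is a small Python sketch using the Hellinger distance. The distance measure, the threshold, and the example numbers are my own picks for illustration, and I am not claiming this is what the paper used.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two distributions (0 = identical, 1 = no overlap)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def likely_duplicates(reports, threshold=0.2):
    """Return pairs of report ids whose topic distributions are closer than the threshold."""
    ids = sorted(reports)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if hellinger(reports[a], reports[b]) < threshold]


# Made-up topic distributions over three topics for three bug reports.
reports = {
    "bug-1": [0.70, 0.20, 0.10],
    "bug-2": [0.65, 0.25, 0.10],
    "bug-3": [0.05, 0.05, 0.90],
}
print(likely_duplicates(reports))
```

With these numbers only bug-1 and bug-2 are flagged as potential duplicates. In practice the threshold would need tuning against known duplicate reports.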

Ericsson has used topic models to map incoming bug reports to specific components. To make resolving bugs easier and faster by automatically assigning them to (correct) teams for resolution. Large historical datasets of bug reports and their assignments to components are used to learn the topic models. Topic distributions of incoming bug reports are used to give probability rankings for the bug report describing a specific component, in comparison to topic distributions of previous bug reports for that component. Topic distributions are also used as explanatory data to present to the expert looking at the classification results. Later, different approaches are reported at Ericsson as well. So just to remind that topic models are not the answer to everything, even if useful components and worth a try in places.

In cyber security, this uses topic models to describe users’ activity as distributions over the different topics. Learn topic models from user activity logs, describe each user’s typical activity as a topic distribution. If a log entry (e.g., session?) diverges too much from this topic distribution for the user, flag it as an anomaly to investigate. I would expect simpler things could work for this as well, but as input for anomaly detection, an interesting thought.

Tweet analysis is popular in NLP. This is an example of high-level tweet topic classification: Politics, sports, science, … Useful input for recommendations etc., I am sure. A more targeted domain specific example is of using topics in Typhoon related tweet analysis and classification: Worried, damage, food, rescue operations, flood, … useful input for situation awareness, I would expect. As far as I understood, topic models were generated, labeled, and then users (or tweets) assigned to the (high-level) topics by topic distributions. Tweets are very small documents, so that is something to consider, as discussed in those papers.

Use of topic models in biomedicine for text analysis. To find patterns (topic distributions) in papers discussing specific genes, for example. Could work more broadly as one tool to explore research in an area, to find clusters of concepts in broad sets of research papers on a specific “topic” (here research on a specific gene). Of course, there likely exist a number of other techniques to investigate for that as well, but topic models could have potential.

Generally labelling and categorizing large number of historical/archival documents to assist users in search. Build topic models, have experts review them, and give the topics labels. Then label your documents based on their topic distributions.

Bit further outside the box, split songs into segments based on their acoustic properties, and use topic modelling to identify different categories/types of music in large song databases. Then explore the popularity of such categories/types over time based on topic distributions over time. So the segments are your words, and the songs are your documents.

Finding duplicates of images in large data sets. Use image features as words, and images as documents. Build topic models from all the images, and find similar types of images by their topic distributions. Features could be edges, or even abstract ones such as those learned by something like a convolutional neural net. Assists in image search I guess..

Most of these uses seem to be various types of search assistance, with a few odd ones thinking outside the box. With a decent understanding, and some exploration, I think topic models can be useful in many places. The academics would say “dude XYZ would work just as well”. Sure, but if it does the job for me, and is simple and easy to apply..

Word2Vec with some Finnish NLP

To get a better view of the popular Word2Vec algorithm and its applications in different contexts, I ran experiments on Finnish language and Word2vec. Let’s see.

I used two datasets. The first one is the traditional Wikipedia dump. I got the Wikipedia dump for the Finnish version from October 20th, because I ran the first experiments around that time. The second dataset was the Board minutes for the City of Oulu for the past few years.

After running my cleaning code on the Wikipedia dump it reported 600783 sentences and 6778245 words for the cleaned dump. Cleaning here refers to removing all the extra formatting, HTML tagging, etc. Sentences were tokenized using Voikko. For the Board minutes the similar metrics were 4582 documents, 358711 sentences, and 986523 words. Most interesting, yes?

For running Word2vec I used the Deeplearning4J implementation. You can find the example code I used on Github.

Again I have this question of whether to use lemmatization or not. Do I run the algorithm on baseformed words or just unprocessed words in different forms?

Some prefer to run it after lemmatization, while generally the articles on word2vec say nothing on the topic but rather seem to run it on raw text. This description of a similar algorithm actually shows an example of mapping “frog” to “frogs”, further indicating use of raw text. I guess if you have really lots of data and a language that does not have a huge number of forms for different words, that makes more sense. Or if you find relations between forms of words more interesting.

For me, Finnish has so many forms of words (morphologies or whatever they should be called?) and generally I don’t expect to run with hundreds of billions of words of data, so I tried both ways (with and without lemmatization) to see. With my limited data and the properties of the Finnish language I would just go with lemmatization really, but it is always interesting to try and see.

Some results for my experiments:

Wikipedia without lemmatization, looking for the closest words to “auto”, which is Finnish for “car”. Top 10 results along with similarity score:

  • auto vs kuorma = 0.6297630071640015
  • auto vs akselin = 0.5929439067840576
  • auto vs auton = 0.5811734199523926
  • auto vs bussi = 0.5807990431785583
  • auto vs rekka = 0.578578531742096
  • auto vs linja = 0.5748337507247925
  • auto vs työ = 0.562477171421051
  • auto vs autonkuljettaja = 0.5613142848014832
  • auto vs rekkajono = 0.5595266222953796
  • auto vs moottorin = 0.5471497774124146

Words from above translated:

  • kuorma = load
  • akselin = axle’s
  • auton = car’s
  • bussi = bus
  • rekka = truck
  • linja = line
  • työ = work
  • autonkuljettaja = car driver
  • rekkajono = truck queue
  • moottorin = engine’s

A similarity score of 1 would mean a perfect match, and 0 a perfect mismatch. Word2vec builds a model representing the position of words in “vector-space”. This is inferred from “word-embeddings”. This sounds fancy, and as usual, it is difficult to find a simple explanation of what is done. I view it as taking typically 100-300 numbers to represent each word’s relation in the “word-space”. These get adjusted by the algorithm as it goes through all the sentences and records each word’s relation to other words in those sentences. Probably all wrong in that explanation but until someone gives a better one..
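Assuming the score is the cosine similarity between two word vectors, which is the usual measure with word2vec, it can be computed like this. The three-dimensional vectors are made up for illustration, real models use the hundreds of dimensions mentioned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "embeddings"; kirja is Finnish for book.
vectors = {
    "auto": [0.9, 0.1, 0.3],
    "rekka": [0.8, 0.2, 0.4],
    "kirja": [0.1, 0.9, 0.2],
}
for word in ("rekka", "kirja"):
    print("auto vs", word, "=", cosine_similarity(vectors["auto"], vectors[word]))
```

Strictly speaking cosine similarity can also go negative, down to -1 for vectors pointing in opposite directions, so 0 is better read as “unrelated” than as the worst possible score.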

To preprocess the documents for word2vec, I split the documents to sentences to give the words a more meaningful context (a sentence vs just any surrounding words). There are other similar techniques, such as Glove, that may work better with more global “context” than a sentence. But anyway this time I was playing with Word2vec, which I think is also interesting for many things. It also has lots of implementations and popularity.

Looking at the results above, there is the word “auton”, translating to “car’s”. The Finnish language has a large number of forms that different words can take. So, sometimes, it may be good to lemmatize to see what the meaning of the word better maps to vs matching forms of words. So I lemmatize with Voikko, the Finnish language lemmatizer, again. Re-run of above, top-10:

  • auto vs ajoneuvo = 0.7123048901557922
  • auto vs juna = 0.6993820667266846
  • auto vs rekka = 0.6949941515922546
  • auto vs ajaa = 0.6905277967453003
  • auto vs matkustaja = 0.6886627674102783
  • auto vs tarkoitettu = 0.66249680519104
  • auto vs rakennettu = 0.6570218801498413
  • auto vs kuljetus = 0.6499230861663818
  • auto vs rakennus = 0.6315782070159912
  • auto vs alus = 0.6273047924041748

Meanings of the words in English:

  • ajoneuvo = vehicle
  • juna = train
  • rekka = truck
  • ajaa = drive
  • matkustaja = passenger
  • tarkoitettu = meant
  • rakennettu = built
  • kuljetus = transport
  • rakennus = building
  • alus = ship

So generally these mappings make some sense. Not sure about those building words. Some deeper exploration would probably help..

Some people also came up with the idea of POS tagging before running word2vec. Called it Sense2Vec and whatever. Just so you could better differentiate how different meanings of a word map differently. So I tried POS tagging the text first, with the tagger I implemented before. Results:

  • auto_N vs juna_N = 0.7195479869842529
  • auto_N vs ajoneuvo_N = 0.6762610077857971
  • auto_N vs alus_N = 0.6689988970756531
  • auto_N vs kone_N = 0.6615594029426575
  • auto_N vs kuorma_N = 0.6477057933807373
  • auto_N vs tie_N = 0.6470917463302612
  • auto_N vs seinä_N = 0.6453390717506409
  • auto_N vs kuljettaja_N = 0.6449363827705383
  • auto_N vs matka_N = 0.6337422728538513
  • auto_N vs pää_N = 0.6313328146934509

Meanings of the words in English:

  • juna = train
  • ajoneuvo = vehicle
  • alus = ship
  • kone = machine
  • kuorma = load
  • tie = road
  • seinä = wall
  • kuljettaja = driver
  • matka = trip
  • pää = head

soo… The weirdest ones here are the wall and head parts. Perhaps again a deeper exploration would tell more. The rest seem to make some sense just by looking.

And to do the same for the City of Oulu Board minutes. Now looking for a specific word for the domain. The word being “serviisi”, which is the city office responsible for food production for different facilities and schools. This time lemmatization was applied for all results. Results:

  • serviisi vs tietotekniikka = 0.7979459762573242
  • serviisi vs työterveys = 0.7201094031333923
  • serviisi vs pelastusliikelaitos = 0.6803742051124573
  • serviisi vs kehittämisvisio = 0.678106427192688
  • serviisi vs liikel = 0.6737961769104004
  • serviisi vs jätehuolto = 0.6682301163673401
  • serviisi vs serviisin = 0.6641604900360107
  • serviisi vs konttori = 0.6479293704032898
  • serviisi vs efekto = 0.6455909013748169
  • serviisi vs atksla = 0.6436249017715454

Because “serviisi” is a very domain specific word/name here, the general purpose Finnish lemmatization does not work for it. This is why “serviisin” is there again. To fix this, I added this and some other basic forms of the word to the list of custom spellings recognized by my lemmatizer tool. That is, the tool uses Voikko, but if the word is not found there it tries a lookup in a custom list. And if still not found, it writes the word to a list of all unrecognized words, sorted by highest frequency first (to allow augmenting the custom list more effectively).
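The lookup order described above could be sketched roughly like this. The `primary` function is a stand-in I made up for a real lemmatizer such as Voikko (no actual Voikko API is used here), and the type and method names are hypothetical as well.

```go
package main

import (
	"fmt"
	"sort"
)

// fallbackLemmatizer tries a primary lemmatizer first, then a custom word list,
// and counts every word that neither of them recognizes.
type fallbackLemmatizer struct {
	primary func(word string) (string, bool) // stand-in for e.g. a Voikko wrapper (assumption)
	custom  map[string]string                // custom spellings: word form -> baseform
	unknown map[string]int                   // unrecognized word -> number of occurrences
}

// lemma returns the baseform of word, or the word itself if nothing recognizes it.
func (l *fallbackLemmatizer) lemma(word string) string {
	if base, ok := l.primary(word); ok {
		return base
	}
	if base, ok := l.custom[word]; ok {
		return base
	}
	l.unknown[word]++
	return word
}

// unknownByFrequency lists the unrecognized words, most frequent first.
func (l *fallbackLemmatizer) unknownByFrequency() []string {
	words := make([]string, 0, len(l.unknown))
	for w := range l.unknown {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if l.unknown[words[i]] != l.unknown[words[j]] {
			return l.unknown[words[i]] > l.unknown[words[j]]
		}
		return words[i] < words[j] // alphabetical order for equal counts
	})
	return words
}

func main() {
	// Toy primary lemmatizer that only knows a single general-purpose word.
	primary := func(w string) (string, bool) {
		if w == "auton" {
			return "auto", true
		}
		return "", false
	}
	l := &fallbackLemmatizer{
		primary: primary,
		custom:  map[string]string{"serviisin": "serviisi"},
		unknown: map[string]int{},
	}
	for _, w := range []string{"auton", "serviisin", "atksla", "liikel", "atksla"} {
		fmt.Println(w, "->", l.lemma(w))
	}
	fmt.Println("unrecognized:", l.unknownByFrequency())
}
```

The frequency-sorted list is the useful part in practice: the words at the top are the ones most worth adding to the custom list.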

Results after change:

  • serviisi vs tietotekniikka = 0.8719592094421387
  • serviisi vs työterveys = 0.7782909870147705
  • serviisi vs johtokunta = 0.695137619972229
  • serviisi vs liikelaitos = 0.6921887397766113
  • serviisi vs 19.6.213 = 0.6853622794151306
  • serviisi vs tilakeskus = 0.673351526260376
  • serviisi vs jätehuolto = 0.6718368530273438
  • serviisi vs pelastusliikelaitos = 0.6589146852493286
  • serviisi vs oulu-koilismaan = 0.6495324969291687
  • serviisi vs bid=2300 = 0.6414187550544739

Or another run:

  • serviisi vs tietotekniikka = 0.864517867565155
  • serviisi vs työterveys = 0.7482070326805115
  • serviisi vs pelastusliikelaitos = 0.7050554156303406
  • serviisi vs liikelaitos = 0.6591876149177551
  • serviisi vs oulu-koillismaa = 0.6580390334129333
  • serviisi vs bid=2300 = 0.6545186638832092
  • serviisi vs bid=2379 = 0.6458192467689514
  • serviisi vs johtokunta = 0.6431671380996704
  • serviisi vs rakennusomaisuus = 0.6401894092559814
  • serviisi vs tilakeskus = 0.6375274062156677

So what are all these?

  • tietotekniikka = city office for ICT
  • työterveys = occupational health services
  • liikelaitos = company
  • johtokunta = board (of directors)
  • konttori = office
  • tilakeskus = space center
  • pelastusliikelaitos = emergency office
  • energia = energy
  • oulu-koilismaan = name of area surrounding the city
  • bid=2300 is an identifier for one of the Serviisi board meeting minutes main pages.
  • 19.6.213 seems to be a typoed date and could at least be found in one of the documents listing decisions by different city boards.

So almost all of these words that “serviisi” is found to be closest to are other city offices/companies responsible for different aspects of the city. Such as ICT, energy, office space, emergency response, or occupational health. Makes sense.

OK, so much for the experimental runs. I should summarize something about this.

The Wikipedia model seems to give slightly better results, in that the words it suggests are mostly valid words. For the city board minutes I should probably filter more based on presence of special characters and numbers. Maybe this is the case for larger datasets vs smaller ones, where the “garbage” more easily drowns in the larger sea of data. Don’t know.

The word2vec algorithm also has a set of parameters to tune, which probably would be worth more investigation to get more optimized results for these different types of datasets. I simply used the same settings for both the city minutes and Wikipedia. Yet due to size differences, likely it would be interesting to play at least with the size of the vector space. For example, bigger datasets might benefit more from having a bigger vector space, which should enable them to express richer relations between different words. For smaller sets, a smaller space might be better. Similarly, number of processing iterations, minimum word frequencies etc should be tried a bit more. For me the goal here was to get a general idea on how this works and how to use it with Finnish datasets. For this, these experiments are enough.

If you read up on any articles on Word2Vec you will likely also see the hype on the ability to do equations such as “king – man + woman” = “queen”. These are from training on large English corpora. It simply says that the relation of the word “queen” to the word “woman” in sentences is typically the same as the relation of the word “king” to “man”. But then this is often the only or one of very few examples ever given. Looking at the city minutes example here, since “serviisi” seems to map closest to all the other offices/companies of the city, what do we get if we run the arithmetic on “serviisi-liikelaitos” (so liikelaitos would be the common concept of the office/company)? I got things like “city traffic”, “reduce”, “children home”, “citizen specific”, “greenhouse gas”. Not really useful. So this seems most useful as a potential tool for exploration, but I cannot really say which part gives useful results when. But of course, it is nice to report on the interesting abstractions it finds, not on boring fails.

I think lemmatization in these cases I showed here makes sense. I have no interest in just knowing that a singular form of a word is related to a plural form of the same word. But I guess in some use cases that could be valid. Of course, for proper lemmatization you might also wish to first do POS tagging to be able to choose the correct baseforms from all the options presented. In this case I just took the first baseform from the list Voikko gives for each word.

Tokenization could also be of more interest. The Finnish language has a lot of compound words, some of which are visible in the above examples. For example, “kuorma-auto”, and “linja-auto” for the Wikipedia example. Or the different “liikelaitos” combinations for the city of Oulu version. N-grams (combinations of words) would also be useful to investigate further. For example, “energia” in the city example could easily be related to the city power company called “Oulun Energia”. Many similar examples likely can be found all over any language and domain vocabulary.

Further custom spelling would also be useful. For example, “oulu-koilismaan” above could be spelled as “oulu-koillismaan”. And it could further be baseformed with other forms of itself as “oulu-koillismaa”. Collecting these from the unrecognized words should make this relatively easy, and filtering out the low-frequency occurrences of the words.

So perhaps the most interesting question: what is this good for?

Not synonym search. Somehow over time I got the idea word2vec could give you some kind of synonyms and stuffs. Clearly it is not for that but rather to identify words over similar concepts and the like.

So generally I can see it could be useful for exploring related concepts in documents. Or generally exploring datasets and building concept maps, search definitions, etc. More as an input to the human expert work rather than fully automated, as the results vary quite a bit.

Some interesting applications I found while looking at this:

  • Word2vec in Google type search, as well as search in general.
  • Exploring associations between medical terms. Perhaps helpful to identify new links you did not think of before? Likely would match other similar domains as well.
  • Mapping words in different languages together.
  • Spotify mapping similar songs together via treating songs as words and playlists as sentences.
  • Someone tried it on sentiment analysis. Not really sure how useful that was as I just skimmed the article but in general I can see how it could be useful to find different types of words related to sentiments. As before, not necessarily as automated input but rather as input to an expert to build more detailed models.
  • Using the similarity score weights as means to find different topics. Maybe you could combine this with topic modelling and then look for diversity of topics?
  • Product recommendations by using products as words and sequences of purchases as sentences. Not sure how big is the meaning of purchase order but interesting idea.
  • Bet recommendations by modelling bets made by users as bet targets being words and sequences of bets sentences, finding similarities with other bets to recommend.

So that was mostly that. Similar tools exist for many platforms, whatever gives you the kicks. For example, Voikko has a Python module on GitHub, and Gensim is a nice tool for many NLP tasks, including Word2Vec, on Python.

There are also lots of pretrained word vector models to use, especially for the English language. For example, Facebook’s FastText, Stanford’s GloVe datasets, and the Google News corpus. Anyway, some simple internet searches should turn up many such to use, which I think is useful for general purpose results. For more detailed domain specific ones training your own is good, as I did here for the city minutes..

Many tools can also take in word vector models built with some other tool. For example, deeplearning4j mentions import of Glove models and Gensim lists support for FastText, VarEmbed and WordRank. So once you have some good idea of what such models can do and how to use them, building combinations of these is probably not too hard.

Giving Go a Go by forwarding some TCP

Problem? Needed to forward some TCP connections to two different locations (one stream to two destinations). Had trying out Golang on my todolist for a while. So decided to give it a Go. Previously, I have implemented a similar TCP forwarding tool in Java. Installing the full JVM to run some simple TCP forwarding seemed a bit silly. So figured I could just try having a Go at it as well.

The code I wrote can be found on Github.
To summarize, this is what it does:

  1. Open a socket to receive the initial connections to forward.
  2. When a connection is received (call it source connection) that needs to be forwarded
    • open a socket to forwarding destination
    • start a go-routine that reads from the source socket and writes to the destination socket
    • start a go-routine that reads from the destination socket and writes to the source socket
    • both of these go-routines share the same functionality:
    1. read at max N bytes into buffer
    2. write the data from buffer to destination socket
    3. if mirroring for that direction is enabled, write it also to mirror socket
    4. if logging to file is enabled, write the data to file as well

Of course, there are a number of similar Go projects out there, such as 1, 2, 3, 4, 5, etc. Not quite what I was looking for, and most importantly not invented here :). It’s good to try some Go anyway.

After looking at all that, maybe the right way would be to Go with the (package? function? object? oh dear, I am lost already) TeeReader. But I used regular old buffering anyway. Naughty, I am sure, but please Go tell me why (comments etc.).
I used Jetbrains Gogland, which is a nice IDE for Go. They didn’t even pay me to advertise it, my bad.

So what did it end up looking like? What did I think about it? Did I learn anything from all this? What should I remember the next time but will surely have forgotten so I could look up here? What could you all correct me about?

The configuration “object” of mine:

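//Configuration for the forwarder. Since it is capitalized, should be accessible outside package.
type Configuration struct {
	srcPort      int    //source where incoming connections to forward are listened to
	logFile      string //file to write the log to, empty if not logging to file
	logToConsole bool   //whether to write the log to console
}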

(WordPress claims to support Go syntax highlighting but for me it just breaks it completely so I set it to text for the snippets here)

Go does not seem to have classes or objects but uses a different more C-style structs to store data. Code is then put into a set of packages, with paths on disk defining which one you are actually referring to when importing. Surely this seems odd considering all the years of telling how great object-oriented stuffs is. But I can see how keeping things simple and setting clear conventions makes it much nicer and maybe even helps avoid people writing too many abstraction layers where not needed. And forced naming of capital start letters for visibility. Why not. Just takes some getting used to all this. Moving on.

For parsing command line arguments, Go comes with a reasonably nice looking “flag” package. But it is quite limited in not making it possible to create long and short versions of the parameter names. Also, customizing the help prints is a bit of a hassle. Maybe that is why there seem to be oh so many command line parsing libraries for Go? Like 1, 2, 3, etc.

In the end, I did not want anything hugely complicated, the external libs did not get me excited and all. So I just used the FlagSet from Go’s standard libs:

	flagSet := flag.NewFlagSet("goforward", flag.ExitOnError)

	//this defines an int flag "sp" with default value 0 (which is treated as "undefined")
	srcPortPtr := flagSet.Int("sp", 0, "Source port for incoming connections. Required.")
	if len(os.Args) == 1 {
		fmt.Println("Usage: " + os.Args[0] + " [options]")
		fmt.Println(" Options:")
		flagSet.PrintDefaults() //this nicely prints out the help descriptions for all the args
		os.Exit(1)
	}
	//the arguments have to be parsed before the flag values can be read
	flagSet.Parse(os.Args[1:])
	Config.srcPort = *srcPortPtr //getting the flag data is this simple, which is nice

Go also comes with a pretty nice logging package. Surprisingly it is called “log”.

My amazingly complex setup for logging to file/console at the same time:

	if Config.logFile != "" {
		f, err := os.OpenFile(Config.logFile, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
		if err != nil {
			//the Fatalf function exits the program after printing the error
			log.Fatalf("Failed to open log file for writing: %v", err)
		}
		if Config.logToConsole {
			//write to both the console and the file
			log.SetOutput(io.MultiWriter(os.Stdout, f))
		} else {
			log.SetOutput(f)
		}
	} else {
		if Config.logToConsole {
			log.SetOutput(os.Stdout)
		}
	}

I like the concurrency mechanism in Go. It is quite nice. But, again, requires some getting used to. Just call “go functionname” to start a thread to run that function separately. We can also call “defer statement” to have “statement” executed after the current function exits.

For example:

	listener, err := net.Listen("tcp", "localhost:"+strconv.Itoa(Config.srcPort))
	defer listener.Close()

Of course, this is also a bit confusing at the beginning. If I do:

func StartServer() {
	listener, err := net.Listen("tcp", "localhost:"+strconv.Itoa(Config.srcPort))
	defer listener.Close()
}

The StartServer function will exit immediately, so the deferred call will run and the listener will be closed. From the language viewpoint, this works as intended, of course, it just got me at first. Because it is not what I expected of my program :).

Or this:

func main() {
	go forwarder.StartServer()
}

What will happen when program execution starts from main()? It will start the goroutine (call StartServer in a thread). Or maybe not, if it is not fast enough. Because the program will exit right after the “go forwarder.StartServer()” call, and most likely StartServer() never actually runs. You need to block the main thread, as goroutines seem to be more like daemon threads in Java, and will not keep the program running if main exits.

Or I can do this:

	for {
		mainConn, err := listener.Accept()
		defer mainConn.Close()
		//start a new thread for this connection and wait for the next one
		go forward(mainConn)
	}

which would likely lead to resource leaking as new connections would keep getting created but never closed. Since the for loop does not exit and thus defer is not called..

So then the question, how do you do thread pooling in Go? Seems like this. Actually quite a nice and simple way to get it done. Just another part that needs a different way of thinking. You set up some Go-routines (as in threads), have them wait on channels, pull jobs from the channels when available, then run them in the Go-routine(s), and wait for more on the channel. Possibly return values through a channel as well.
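A minimal sketch of such a pool, with jobs being plain integers to keep it short. The function name `runPool` and its parameters are my own for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool starts `workers` goroutines that pull jobs from a channel, run
// `work` on each, and send the results back through another channel.
// It assumes workers >= 1. The order of the results is not guaranteed.
func runPool(workers int, jobs []int, work func(int) int) []int {
	jobCh := make(chan int)
	resultCh := make(chan int, len(jobs)) // buffered so workers never block on sending
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobCh { // waits for jobs until the channel is closed
				resultCh <- work(job)
			}
		}()
	}
	for _, job := range jobs {
		jobCh <- job
	}
	close(jobCh) // no more jobs: lets the workers exit their loops
	wg.Wait()
	close(resultCh)

	results := make([]int, 0, len(jobs))
	for r := range resultCh {
		results = append(results, r)
	}
	return results
}

func main() {
	squares := runPool(3, []int{1, 2, 3, 4, 5}, func(n int) int { return n * n })
	fmt.Println(squares) // the five squares, in whatever order the workers finished
}
```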

Channels are a nice concept. But they do make for some weird looking code when starting to Go. As do many other things actually. I guess it is the Go approach to try to be “simple” and terse. Maybe it grows on you.

Some of my weirdest moments:

Allocate a byte array of size 1024

	buf := make([]byte, 1024)

For some reason the brackets are to the left. I once read somewhere that Golang reads from left to right. Maybe that is why? But would it be so bad to say “a byte array” instead of “array of bytes”? At least that would not break the minds of programmers who used most of the mainstream languages out there.

Why “make”? Is it for some historical reason from C or something? Apparently there is also a built-in function called “new”, and sometime somewhere someone has thought about combining these. Anyway, seems like some unnecessary mental overhead for me.
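As far as I understand it, the difference is that “new” just gives you a pointer to zeroed memory for any type, while “make” is only for slices, maps and channels, which need some internal setup before they can be used. A small sketch to compare the two (the function `buffers` is made up for this example):

```go
package main

import "fmt"

// buffers allocates 1024 zeroed bytes in two different ways.
func buffers() ([]byte, *[1024]byte) {
	viaMake := make([]byte, 1024) // a slice (length 1024) backed by a new array
	viaNew := new([1024]byte)     // a pointer to a zeroed fixed-size array
	return viaMake, viaNew
}

func main() {
	s, p := buffers()
	fmt.Println(len(s), cap(s), len(p))

	grow := make([]byte, 0, 1024) // length 0, but room for 1024 bytes before reallocating
	grow = append(grow, 'x')
	fmt.Println(len(grow), cap(grow))

	var m map[string]int // nil map: reading is fine, writing to it would panic
	fmt.Println(m == nil)
	m = make(map[string]int) // after make it is ready for writes
	m["port"] = 8080
	fmt.Println(m["port"])
}
```

Note that what make gives here is strictly speaking a slice, not an array, which is also why it can grow with append.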

The assignment operators can be “:=” if you are declaring the variable while initializing. Otherwise it is “=”. Is this to help tell declaration from re-assignment? Or is there some other logic to it? Maybe then it makes sense. Otherwise seems like just some more special characters mixed up.

Declare a function with return value (example)

	func split(sum int) (x, y int) {

So here split() takes an integer sum value as parameter and returns two integer values named x and y. Again, what was wrong with the return value on the left? Same complaints as I had with the array declaration. No idea.
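For completeness, a full runnable version of such a function. The body here follows the named-results example in the Go tour as I remember it, so take it as an illustration of the syntax rather than anything meaningful:

```go
package main

import "fmt"

// split returns two named results. Named results are ordinary variables
// inside the function, and a bare "return" returns their current values.
func split(sum int) (x, y int) {
	x = sum * 4 / 9
	y = sum - x
	return
}

func main() {
	a, b := split(17)
	fmt.Println(a, b) // prints "7 10"
}
```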

To create a string by concatenating a string and a number:

	"localhost" + ":" + strconv.Itoa(8080)

So you can do “localhost”+”:” for two strings. But not for numbers. What was wrong with “localhost:”+8080? Or even “localhost:”+str(8080)? It’s a small thing but seems like something that I would do often.
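There are a few other standard library options for the same thing that I could have used. A sketch comparing them (the three function names are made up for the example, the calls inside them are from the standard “strconv”, “fmt” and “net” packages):

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// viaItoa converts the number explicitly and concatenates.
func viaItoa(host string, port int) string {
	return host + ":" + strconv.Itoa(port)
}

// viaSprintf formats both parts in one call.
func viaSprintf(host string, port int) string {
	return fmt.Sprintf("%s:%d", host, port)
}

// viaJoinHostPort uses the helper meant for network addresses
// (it also brackets IPv6 hosts).
func viaJoinHostPort(host string, port int) string {
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(viaItoa("localhost", 8080))
	fmt.Println(viaSprintf("localhost", 8080))
	fmt.Println(viaJoinHostPort("localhost", 8080))
}
```

All three give “localhost:8080” here. For addresses, JoinHostPort is probably the safest choice.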

Documentation. I know it is fashionable to diss Java and all. But I like the approach of clearly stating in Javadocs what the parameters and return values are. Sometimes it gives way too much repetition and is just silly. But for the official libs and docs etc at least it is nice. Excerpt from the Go “io” package, the doc for WriteString:


func WriteString

func WriteString(w Writer, s string) (n int, err error)

WriteString writes the contents of the string s to w, which accepts a slice of bytes. If w implements a WriteString method, it is invoked directly. Otherwise, w.Write is called exactly once.


OK, so what is “n”, what values might “err” take and under what circumstances, etc.? I had plenty of such experiences in building my little app.

Even if there are no classes etc., there is something called an “interface”. Haven’t quite figured it out, but wanted to hack the logging a bit and had to try to figure it out.

func debuglog(msg string, v ...interface{}) {
	if loggingEnabled {
		log.Printf(msg, v...)
	}
}

I guess that is some way to generally refer to whatever type is given. The “…” notation (oddly on the right…) just defines that there can be any number of arguments. And you need it both in parameter and in use. I should probably read up more on what the interface is and does, so I shall not complain too much about it.
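A small sketch of what I have figured out so far: the empty interface accepts a value of any type, and a type switch can be used to find out what was actually passed in. The function `describe` is made up for this example.

```go
package main

import "fmt"

// describe accepts any number of values of any type and reports what it got.
func describe(values ...interface{}) []string {
	out := make([]string, 0, len(values))
	for _, v := range values {
		switch x := v.(type) {
		case int:
			out = append(out, fmt.Sprintf("int %d", x))
		case string:
			out = append(out, fmt.Sprintf("string %q", x))
		default:
			out = append(out, fmt.Sprintf("%T", x)) // just the type name
		}
	}
	return out
}

func main() {
	fmt.Println(describe(8080, "localhost", 1.5))

	args := []interface{}{1, "a"}
	fmt.Println(len(describe(args...))) // spread: the slice elements become 2 arguments
	fmt.Println(len(describe(args)))    // no dots: the whole slice is 1 argument
}
```

The last two lines show why the dots are needed “in use” as well: without them the whole slice is passed as a single value.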

Anyway, I could go on about the odd-ish syntax where you put lots of “_:=<-" characters around. But overall after giving Go a bit of a Go in with the TCP forwarder, I do think it is actually a quite nice language. Just takes a bit of getting used to. The concurrency related stuffs with the go-routines and channels, defers et al. are very nice.

There we Go.

Collecting java.util.logging to log4j2

Everybody wants to write a log. And in Java everybody wants to write their own logging framework or at least use one of the many different ones. Then someone comes up with a logging framework framework such as SLF4J.

OK but what was I about to say. As so many times, I had a piece of Java software writing a log file using Log4J2. I was using some libs/someone else’s code that uses java.util.logging to write their log. I wanted to capture those logs and include them in my Log4J2 log file for debugging, error resolution or whatever.

This case was when trying to log errors from the InfluxDB Java driver. The driver uses java.util.logging for minimal external dependencies or something. I used Log4J2 in my app.

So the usual question of how do you merge java.util.logging code, that you do not control, with your own code using Log4J2 to produce a single unified log file?

Most Googling would tell me all about SLF4J etc. I did not want yet-another framework on top of existing frameworks, and yet some more (transitive) dependencies and all sorts of weird stuff. Because I am old and naughty and don’t like too many abstractions just because.

So the code to do this with zero external dependencies.

First a log Handler object for java.util.logging to write to Log4J2:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/**
 * @author Daddy Bigbelly.
 */
public class JekkuHandler extends Handler {
  //notice that this is the Log4J2 logger here, inside a java.util.logging Handler object
  private static final Logger log = LogManager.getLogger();

  @Override
  public void publish(LogRecord record) {
    Level level = record.getLevel();
    if (level.intValue() == Level.SEVERE.intValue()) {
      log.error(record.getMessage(), record.getThrown());
    } else if (level.intValue() >= Level.INFO.intValue()) {
      log.info(record.getMessage(), record.getThrown());
    } else {
      log.debug(record.getMessage(), record.getThrown());
    }
  }

  @Override
  public void flush() {}

  @Override
  public void close() throws SecurityException {}
}

Next setting it up and using it, with the InfluxDB Java driver as an example:

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.BatchPoints;
import org.influxdb.dto.Point;
import org.influxdb.dto.Query;
import org.influxdb.impl.BatchProcessor;

import java.util.concurrent.TimeUnit;
import java.util.logging.ConsoleHandler;
import java.util.logging.FileHandler;
import java.util.logging.Formatter;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

/**
 * @author Daddy Bigbelly.
 */
public class LogCaptureExample {
  public static void main(String[] args) throws Exception {
    //oh no the root password is there
    InfluxDB db = InfluxDBFactory.connect("http://myinfluxdbhost:8086", "root", "root");
    String dbName = "aTimeSeries";
    db.enableBatch(2000, 1, TimeUnit.SECONDS);

    //if you look at the influxdb driver code for batchprocessor,
    //where we wanted to capture the log from, you see it using the classname to set up the logger.
    //so we get the classname here and use it to hijack the writes for that logger (the one we want to capture)
    Logger logger = Logger.getLogger("org.influxdb.impl.BatchProcessor");
    Handler handler = new JekkuHandler();
    //attach our handler so the records of that logger end up in Log4J2
    logger.addHandler(handler);

    //this runs forever, but the batch mode can throw an error if the network drops.
    //so disconnect network to test this in middle of execution
    while (true) {
      Point point1 = Point.measurement("cpu")
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("idle", 90L)
        .addField("user", 9L)
        .addField("system", 1L)
        .build();
      db.write(dbName, "autogen", point1);
    }
  }
}

You could probably quite easily attach the same handler to the global (root) java.util.logging logger, which would capture all logging written with java.util.logging this way. I did not need it so it’s not here.

In a similar way, you should be able to capture java.util.logging to any other log framework just by changing where the custom Handler writes the logs to.

Well there you go. Was that as exciting for you as it was for me?

Building a (Finnish) Part of Speech Tagger

I wanted to try a part of speech tagger (POS) to see if it could help me with some of the natural language processing (NLP) problems I had. This was in Finnish, although other languages would be nice to have supported for the future. So off I went, (naively) hoping that there would be some nicely documented, black-box, open-source, free, packages available. Preferably, I was looking for one in Java as I wanted to try using it as part of some other Java code. But other (programming) languages might work as well if possible to use as a service or something. Summary: There are a bunch of cool libs out there, just need to learn POS tagging and some more NLP terms to train them first…

I remembered all the stuffs on Parsey McParseface, Syntaxnet and all those hyped Google things. It even advertises achieving 95% accuracy on Finnish POS tagging. How cool would that be. And it’s all about deep learning, Tensorflow, Google Engineers and all the other greatest and coolest stuff out there, right? OK, so all I need to do is go to their Github site, run some 10 steps of installing various random sounding packages, mess up my OS configs with various Python versions, settings, and all the other stuff that makes Python so great (OK let’s not get upset, it’s a great programming language for stuffs :)). Then I just need to check out the Syntaxnet git repo, run a build script for an hour or so, set up all sorts of weird stuff, and forget about a clean/clear API. OK, I pass, after messing with it too long.

So. After trying that mess, I Googled, Googled, Duckducked, and some more for some alternatives better suited for me. OpenNLP seemed nice as it is an Apache project, which have generally worked fine for me. There are a number of different models for it at SourceForge. Some of them are even POS tagger models. Many nice languages there. But no Finnish. Now, there is an option to train your own model. Which seems to require some oddly formatted, pre-tagged text sets to train. I guess that just means POS tagging is generally seen as a supervised learning problem. Which is fine, it’s just that if you are not deep in the NLP/POS tagging community, these syntaxes do look a bit odd. And I just wanted a working POS tagger, not a problem of trying to figure out what all these weird syntaxes are, or a problem of going to set up a project on Mechanical Turk or whatever to get some tagged sentences in various languages.

What else? There is a nice looking POS tagger from Stanford NLP group. It also comes with out-of-the-box models for a few languages. Again, no Finnish there either but a few European ones. Promising. After downloading it, I managed to get it to POS tag some English sentences and even do lemmatization for me (finding the dictionary base form of the word, if I interpret that term correctly). Cool, certainly useful for any future parsing and other NLP tasks for English. They also provide some instructions for training it for new languages.

This training again requires the same pre-annotated set of training data with POS tagging. Seeing some pattern here.. See, even I can figure it out sometime. So there is actually a post on the internets, where someone describes building a Swedish POS tagger using the Stanford tagger. And another one instructing people (in comments) to download the tagger code and read it to understand how to configure it. OK, not going to do that. I just wanted a POS tagger, not an excursion into some large code base to figure out some random looking parameters that require a degree in NLP to understand them. But hey, Sweden is right next to Finland, maybe I can try the configuration used for it to train my own Finnish POS tagger? What a leap of logic I have there..

I downloaded the Swedish .props file for the Stanford tagger, and now just needed the data. Which, BTW, I needed also for all the others, so I might as well have gone with the OpenNLP as well and tried that, but who would remember that anymore at this point.. The Swedish tagger post mentioned using some form of Swedish TreeBank data. So is there a similar form of Finnish TreeBank? I remember hearing that term. Sure there is. So downloaded that. Unpack the 600MB zip to get a 3.8GB text file for training. The ftb3.1.conllx file. Too large to open in most text editors. More/less to the rescue.

But hey, this is sort of like big data, which this should be all about, right? Maybe the Swedish .props file just works with it, after all, both are Treebanks (whatever that means)? The Swedish Treebank site mentions having a specific version for the Stanford parser built by some Swedish treebank visitor at Googleplex. Not so for Finnish.

Just try it. Of course the Swedish .props file won’t work with the Finnish TreeBank data. So I build a Python script to parse it and format it more like the Swedish version. Words one per line, sentences separated with linefeeds. The tags seem to differ across various files around but I have no idea about how to map them over so I just leave them and hope the Stanford people have it covered. (Looking at it later, I believe they all treat it as a supervised learning problem with whatever target tags you give.)

Tried the transformed file with the Stanford POS tagger. My Python script tells me the file has about 4.4 million sentences, with about 76 million words or something like that. I give the tagger JVM 32GB memory and see if it can handle it. No. Out of memory error. Oh dear. It’s all I had. After a few minor modifications in the .props file, I make the training data set smaller until finally at 1M sentences the tagger finishes training.

Meaning the program runs through and prints nothing (no errors but nothing else either). There is a model file generated I can use for tagging. But I have no idea if this is any good or not, or how badly I just trained it. Most of the training parameters have a one-line description in the Javadoc, which isn’t hugely helpful (for me). Somehow I am not too confident I managed to do it too well. Later as I did various splits on the FinnTreeBank data for my customized Java tagger and the OpenNLP tagger, I also tried this one with the 1.4M sentence test set. Got about 82% accuracy, which seems pretty poor considering everything else I talk about in the following. So I am guessing my configuration must have been really off since otherwise people have reported very good results with it. Oh well, maybe someone can throw me a better config file?

This is what running the Stanford tagger on the 1M sentence set looked like on my resource graphs:


So it mostly runs on a single core and uses about 20GB of RAM for the 1M sentence file. But obviously I did not get it to give me good results, so what other options do I have?

During my Googling and stuff I also ran into a post describing writing a custom POS tagger in 200 lines of Python. Sounds great, even I should be able to get 200 lines of Python, right? I translated that to Java to try it out on my data. Maybe I will call my port “LittlePOS”. Make of that what you will :). At least now I can finally figure out what the input to it should be and how to provide it, since I wrote (or translated) the code, eh?

Just to quickly recap what (I think) this does.

  • Normalize all words = lowercase words, change year numbers to “!YEAR” and other numbers to “!DIGIT”.
  • Collect statistics for each word, how often different POS tags appear for each word. A threshold of 97% is used to mark a word as “unambiguous”, meaning it can always be tagged with a specific tag if it has that tag 97% or more times in the training data. The word also needs to occur some minimum number of times (here it was 20).
  • Build a set of features for each POS tag. These are used for the “machine learning” part to learn to identify the POS tag for a word. In this case the features used were:
    • Suffix of word being tagged. So its last 3 letters in this case.
    • Prefix of word being tagged. Its first letter in this case.
    • Previous tag. The tag assigned to previous word in sentence.
    • 2nd previous tag. The tag assigned to the previous word to the previous word :).
    • Combination of the previous and previous-previous tags. So previous tag-pair.
    • The word being tagged itself.
    • Previous tag and current-word pair.
    • Previous word in sentence.
    • Suffix of previous word, its 3 last letters.
    • Previous-previous word. So back two spots in the sentence where we are tagging.
    • Next word in sentence.
    • Suffix of next word. Its 3 last letters.
    • Next-next word in sentence. So the next word after the next word.
    To account for the start and end of a sentence, the sentence word array is always initialized with START1, START2 and END1, END2 “synthetic words”. So these features also work even if there is no real previous or next word in the sentence. Also, a word can be anything, including punctuation marks.
  • Each of the features is given a weight. This is used to calculate prediction of what POS tag a word should get based on its features in the sentence.
  • If, in training, a word is given (predicted) a wrong tag based on its features, the weights of those features for the wrong tag are reduced by 1 each, and the weights of those features for the correct tag are increased by 1 each.
  • If the tag was correctly predicted, the weights stay the same.
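
To make the recap above a bit more concrete, here is a minimal Python sketch of those steps as I understand them. It is not the original 200-line script or the Java port: the function names (normalize, get_features, predict, update, train, tag), the feature-name strings and the exact spelling of the synthetic start/end words are made up for illustration. The “unambiguous word” lookup and the sentence shuffling between iterations are left out to keep it short.

```python
from collections import defaultdict

# Synthetic padding words; the exact strings are an assumption.
START = ["-START1-", "-START2-"]
END = ["-END1-", "-END2-"]


def normalize(word):
    """Lowercase the word; map 4-digit numbers to !YEAR, other numbers to !DIGIT."""
    if word.isdigit() and len(word) == 4:
        return "!YEAR"
    if word[:1].isdigit():  # anything else that starts with a digit
        return "!DIGIT"
    return word.lower()


def get_features(i, context, prev, prev2):
    """The 13 features from the list above for the word at padded index i."""
    word = context[i]
    return [
        "suffix " + word[-3:],
        "prefix " + word[:1],
        "tag-1 " + prev,
        "tag-2 " + prev2,
        "tag-1,tag-2 " + prev + " " + prev2,
        "word " + word,
        "tag-1,word " + prev + " " + word,
        "word-1 " + context[i - 1],
        "suffix-1 " + context[i - 1][-3:],
        "word-2 " + context[i - 2],
        "word+1 " + context[i + 1],
        "suffix+1 " + context[i + 1][-3:],
        "word+2 " + context[i + 2],
    ]


def predict(weights, features, tags):
    """Sum the weights of the active features per tag and pick the best tag."""
    scores = {t: 0 for t in tags}
    for f in features:
        for t, w in weights[f].items():
            scores[t] += w
    # Ties are broken by tag name so that runs are repeatable.
    return max(tags, key=lambda t: (scores[t], t))


def update(weights, features, truth, guess):
    """On a wrong guess: +1 for the correct tag, -1 for the wrong one."""
    if guess == truth:
        return  # correct prediction, weights stay the same
    for f in features:
        weights[f][truth] += 1
        weights[f][guess] -= 1


def train(sentences, tags, iterations=10):
    """sentences: list of (words, gold_tags) pairs. Returns the weight table."""
    weights = defaultdict(lambda: defaultdict(int))
    for _ in range(iterations):
        for words, gold in sentences:
            context = START + [normalize(w) for w in words] + END
            prev, prev2 = START
            for i, truth in enumerate(gold):
                feats = get_features(i + 2, context, prev, prev2)
                guess = predict(weights, feats, tags)
                update(weights, feats, truth, guess)
                prev2, prev = prev, guess
    return weights


def tag(weights, words, tags):
    """Tag one sentence with the trained weights."""
    context = START + [normalize(w) for w in words] + END
    prev, prev2 = START
    result = []
    for i in range(len(words)):
        guess = predict(weights, get_features(i + 2, context, prev, prev2), tags)
        result.append(guess)
        prev2, prev = prev, guess
    return result
```

Training would then be something like train(sentences, tags) followed by tag(weights, words, tags) for each test sentence. Note that the previous-tag features use the predicted tag, not the gold tag, so an early mistake in a sentence also changes the features of the following words.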

Getting this basic idea also helps me understand the other parsers and their parameters a bit better. I think this feature set is what is defined by the “arch” parameter in the Stanford tagger props file, and maybe that is the part I would need to fix? I believe choosing these features must be one of the parts of POS tagging with the most diverse sets of possibilities as well. Back to the Stanford tagger: it also seemed a bit slow at 50ms average tagging time per sentence, compared to the other ones I discuss in the following. Not sure what I did wrong there. But back to my Python to Java porting.

I updated my Python parser for the FinnTreeBank to produce just a file with the word and POS tag extracted and fed that to LittlePOS. This still ran out of memory on the 4.4M sentences with a 32GB JVM heap. But not in the training phase, only when I finally tried to save the model as a Protocol Buffers binary file. The model in memory seems to get pretty big, so I guess the protobuf generator also ran out of resources when trying to build a 600MB file with all the memory allocated for the tagger training data.

In the resources graph this is what it looks like for the full 4.4M sentences:


The part on the right where the “system load” is higher and the “CPU” part looks to bounce wildly is where the protobuf is being generated. The part on the left before that is where the actual POS tagger training takes place. So the protobuf generation actually was running pretty long; my guess is the JVM memory was low and way too much garbage collection etc. was happening. Maybe it would have finished after a few more hours, but I called it a no-go and stopped it.

Training on 3M sentences finishes fine. I use the remaining 1.4M for testing the accuracy. Meaning I use the trained tagger to predict tags for those 1.4M sentences and count how many words it tagged right in all of those. This gives me about 96.1% accuracy for the trained tagger. Awesome, now I have a working tagger??

The resulting model for the 3M sentence training set, when saved as a protobuf binary, is about 600MB. Seems rather large. Probably why it was failing to write it with the full 4.4M sentences. A smaller model might be more usable in a smaller cloud VM or something (I am poor, and cloud is expensive for bigger resources..). So I tried to train it on sets of 100k to 1M sentences in 100k increments, and on 2M sentences. Results for LittlePOS are shown in the table below:

Sentences  Words correct  Accuracy  PB Size  Time/1
100k       21988662       88.7%     90MB     4.5ms
200k       22490881       90.7%     153MB    4.1ms
300k       22608641       91.2%     195MB    3.9ms
400k       22779163       91.9%     233MB    3.8ms
500k       22911452       92.4%     268MB    3.7ms
600k       23033403       92.9%     304MB    3.5ms
700k       23095784       93.1%     337MB    3.7ms
800k       23149286       93.4%     366MB    3.5ms
900k       23169125       93.4%     390MB    3.2ms
1M         23167721       93.4%     378MB    3.3ms
2M         23520297       94.8%     651MB    3.0ms
3M         23843609       96.2%     890MB    2.0ms
1M_2       23105112       93.2%     467MB    ms
3M_0a      20859104       84.1%     651MB    1.7ms
3M_0b      22493702       90.7%     651MB    1.7ms


  • Sentences is the number of sentences in the dataset.
  • Words correct is the number of words correctly predicted. The total number of words is always 24798043, as all tests were run against the last 1.4M sentences (the ones left over after taking the 3M training set).
  • Accuracy is the percentage of all predictions that it got right.
  • PB Size is the size of the model as a Protocol Buffers binary after saving to disk.
  • Time/1 is the time the tagger took on average to tag a sentence.
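
As a small sanity check on how the Accuracy column relates to the Words correct column, here is a sketch of the calculation. The function names are made up for illustration; the inputs are assumed to be lists of tag lists, one per sentence, in the same order for the predictions and the reference tags.

```python
def count_correct(predicted, gold):
    """Return (correct, total) counted over every word in every sentence."""
    correct = 0
    total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        for p, g in zip(pred_tags, gold_tags):
            total += 1
            if p == g:
                correct += 1
    return correct, total


def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to one decimal like in the table."""
    return round(100.0 * correct / total, 1)
```

For example, accuracy_percent(23843609, 24798043) gives 96.2 for the 3M row, and accuracy_percent(21988662, 24798043) gives 88.7 for the 100k row.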

The line with 1M_2 shows an updated case, where I changed the training algorithm to run for 50 iterations instead of the 10 it had been set to in the Python script. Why 50? Because the Stanford tagger and OpenNLP seem to use a default of 100 iterations, and I wanted to see what difference it makes to increase the iteration count. Why not 100? Because I started by training the 3M model for 100 iterations and, looking at it, I calculated it would take a few days to run. The others were much faster, so plenty of room for optimization there. I just ran it for 1M sentences and 50 iterations then, as that gives an indication of improvement just as well.

So, the improvement seems pretty much zero. In fact, the accuracy seems to have gone slightly down. Oh well. I am sure I did something wrong again. It is also possible to track the number of correctly predicted tags over the added iterations during training. The figure below illustrates this:


This figure shows how much of the training set the tagger got right during the training iterations. So maybe the improvement in later iterations does not look that big at this scale, but it is still improving. Unfortunately, in this case, this did not seem to have a positive impact on the test set. There are also a few other points of interest in the table.

Back to the results table. The line with 3M_0a shows a case where all the features were ignored. That is, only the “unambiguous” words were tagged there. This already gives the result of 84.1%. The most frequent tag in the remaining untagged ones is “noun”. So tagging all the remaining 15.9% as nouns gives the score in 3M_0b. In other words, if you take all the words that seem to clearly only have one tag given for them, give them that tag, and tag all the remaining ones as nouns, you get about 90.7% accuracy. I guess that would be the reference to compare against. This score is without any fancy machine learning stuff. Looking at this, the low score I got for training the Stanford POS tagger was really bad, and I really need that “for dummies” guide to properly configure it.
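
The baseline described above can be sketched in a few lines of Python. This assumes the training data is available as a flat list of (word, tag) pairs with words already normalized; the function names and the tag names in the example are placeholders, not the actual FinnTreeBank tags. The thresholds are the ones from the recap earlier (at least 20 occurrences, 97% or more with the same tag).

```python
from collections import Counter, defaultdict


def build_tag_dict(tagged_words, min_count=20, threshold=0.97):
    """Map each "unambiguous" word to its dominant tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    tag_dict = {}
    for word, tag_counts in counts.items():
        best_tag, best_freq = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        # Keep the word only if it is frequent enough and almost always
        # appears with the same tag.
        if total >= min_count and best_freq / total >= threshold:
            tag_dict[word] = best_tag
    return tag_dict


def baseline_tag(words, tag_dict, fallback):
    """Look up unambiguous words; give every other word the fallback tag."""
    return [tag_dict.get(word, fallback) for word in words]
```

With fallback set to the noun tag this corresponds to the 3M_0b case. The 3M_0a case is the same thing with the fallback words simply counted as wrong.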

But wait, now that I have some tagged input data and Python scripts to transform it into different formats, I could maybe just modify these scripts to give me OpenNLP compliant input data? Brilliant, let’s try that. At least OpenNLP has default parameters and seems more suited for dummies like me. So on to transform my FinnTreeBank data to OpenNLP input format and run my experiments. Python script. Results below.
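
For reference, the OpenNLP POS tagger trainer reads one sentence per line, with each token written as word_TAG and the tokens separated by spaces. A conversion along these lines might look like the sketch below. The input structure (a list of (word, tag) pairs per sentence), the function names and the tags in the example are assumptions for illustration, not the actual script.

```python
def to_opennlp_line(tagged_sentence):
    """Turn [(word, tag), ...] into a single 'word_TAG word_TAG ...' line."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged_sentence)


def to_opennlp_text(sentences):
    """One sentence per line, ready to be written to a training file."""
    return "".join(to_opennlp_line(s) + "\n" for s in sentences)
```

For example, to_opennlp_line([("Koira", "N"), ("juoksee", "V"), (".", "PUNCT")]) produces the line "Koira_N juoksee_V ._PUNCT". Tokens that themselves contain spaces would break this format and would need separate handling.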

Sentences  Words correct  Accuracy  PB Size  Time/1
100k       22247182       89.7%     4.5MB    7.5ms
200k       22680369       91.5%     7.8MB    7.6ms
300k       22861728       92.2%     10.4MB   7.7ms
400k       22994242       92.7%     12.8MB   7.8ms
500k       23114140       93.2%     14.8MB   7.8ms
600k       23199457       93.6%     17.1MB   7.9ms
700k       23235264       93.7%     19.2MB   7.9ms
800k       23298257       94.0%     21.1MB   7.9ms
900k       23324804       94.1%     22.8MB   7.9ms
1M         23398837       94.4%     24.5MB   8.0ms
2M         23764711       95.8%     39.9MB   8.0ms
3M         24337552       98.1%     55.9MB   8.1ms
(4M)       24528432       98.9%     69MB     9.6ms
4M_2       6959169        98.5%     69MB     9.7ms
(4.4M)     24567908       99.1%     73.5MB   9.6ms

There are some special cases here:

  • (4M): This mixed training and test data by training on the first 4M of the 4.4M sentences and then taking the last 1.4M of the 4.4M for testing. I believe in machine learning you are not supposed to test with the training data, or the results will seem too good and not indicate any real world performance. Had to do it anyway, didn’t I :)
  • (4.4M): This one used the full 4.4M sentences to train and then tested on the 1.4M subset of the same set. So it’s a broken test again, mixing training data and test data.
  • 4M_2: For the evaluation, this one used the sentences remaining after taking out the 4M training sentences. Since the total is about 4.4M, which is actually more like 4.36M, the test set here was only about 360k sentences, as opposed to the others where it was 1.4M, or 1.36M to be more accurate. But it is not mixing training and test data any more. Which is probably why it is slightly lower. But still an improvement, so might as well train on the whole set at the end. The number of test tags here is 7066894, as opposed to the 24798043 in the 1.4M sentence test set.

And the resource use for training at 4M file size:


So my 32GB of RAM is plenty, and as usual it is a single core implementation..

Next I should maybe look at putting this up as some service to call over the network. Some of these taggers actually already have support for it but anyway..

A few more points I collected on the way:

For the bigger datasets it is obviously easy to run out of memory. Looking at the code for the custom tagger trainer and the full 4.4M sentence training data, I figure I could scale this pretty high in terms of sentences processed by just storing the sentences in a document database and not in memory all at once. ElasticSearch would probably do just fine, as I’ve been using it for other stuff as well. Then read the sentences from the database into memory as needed. The main reason the algorithm seems to need to keep the sentences in memory is to shuffle them randomly around for new training iterations. I could just shuffle the index numbers for sentences stored in the DB and read some smaller batches for training into memory. But I guess I am fine with my tagger for now. Similarly, the algorithm uses just a single core in training for now, but could be parallelized to process each sentence separately quite easily, making it “trivially parallel”. Would need to test the impact on accuracy though. Memory use could probably go lower using various optimizations, such as hashing the keys. Probably plenty of optimizations are possible for both CPU and memory, but maybe I will just use OpenNLP and let someone else worry about it :).
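
The index-shuffling idea could look roughly like the Python sketch below. The names are made up for illustration, and fetch_sentences is a hypothetical callback standing in for whatever lookup the database offers (it is not a real ElasticSearch API call); train_sentence stands in for the per-sentence training step.

```python
import random


def shuffled_batches(num_sentences, batch_size, rng):
    """Yield lists of sentence index numbers in random order, batch_size at a time."""
    indices = list(range(num_sentences))
    rng.shuffle(indices)
    for start in range(0, num_sentences, batch_size):
        yield indices[start:start + batch_size]


def run_iteration(fetch_sentences, num_sentences, batch_size, train_sentence, rng):
    """One training iteration that only holds one batch of sentences in memory."""
    for batch in shuffled_batches(num_sentences, batch_size, rng):
        # fetch_sentences is a stand-in for a database read by index number.
        for sentence in fetch_sentences(batch):
            train_sentence(sentence)
```

Only the list of index numbers has to stay in memory for the whole run, which for a few million sentences is small compared to the sentences themselves. Passing in a random.Random instance with a fixed seed keeps the shuffling repeatable between runs.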

From the results of the different runs, there seems to be some consistency in LittlePOS running faster on bigger datasets, and OpenNLP slightly slower. The Stanford tagger seems to be quite a bit slower at 50ms, but that could again be due to configuration or some other issues. OpenNLP seems to get better accuracy than my LittlePOS, and the model files are smaller. So the tradeoff in this case would be model size vs tagging speed. The tagging speed being faster with bigger datasets seems a bit odd, but maybe more of the words become “unambiguous” and thus can be handled with a simple lookup on a map?

Finally, in the hopes of trying the stuff out on a completely different dataset, I tried to download the Finnish datasets for Universal Dependencies and test against those. I got this idea as the SyntaxNet stats showed using these as the test and training sets. Figured maybe it would show how well the results hold across sets taken from different sources. Unfortunately, Universal Dependencies had different tag sets from the FinnTreeBank I used for training, and I ran out of motivation trying to map them together. Oh well, I just needed a POS tagger, and I believe I now know enough on the topic and have a good enough starting point to look at the next steps..

But enough about that. Next, I think I will look at some more items in my NLP pipeline. Get back to that later…