Porting Naivecoin tutorial to Golang (first parts)

Recently I got interested in blockchains, and how they work in practice. Found a nice tutorial called Naivecoin: a tutorial for building a cryptocurrency. There is also a nice book on the topic called Mastering Bitcoin: Programming the Open Blockchain. Had to get that too, didn’t I.

So to get a better understanding of the topic, I tried implementing the tutorial in Go. Using a different programming environment requires me to look up the actual implementation and try to understand it, as opposed to just copy-pasting all the code. I tried the parts up to part 3, and put my code up on Github. Hopefully I will update it with the remaining parts of the Naivecoin tutorial, and other related topics, sometime later.

The Naivecoin first part focuses on building the basic blockchain. To start, a block structure is needed to describe the blockchain:

type Block struct {
	Index        int           //the block index in the chain
	Hash         string        //hash for this block
	PreviousHash string        //hash for previous block
	Timestamp    time.Time     //time when this block was created
	Data         string        //the data in this block. could be anything. not really needed since real data is transaction but for fun..
	Transactions []Transaction //the transactions in this block
	Difficulty   int           //block difficulty when created
	Nonce        int           //nonce used to find the hash for this block
}

A blockchain is a long list (or a chain) of blocks. The Block structure above has various fields to describe its position in the chain, its contents, and metadata used to provide assurance over the chain validity.

  • Index: The number of the block on the blockchain. So 1st block is 1, 10th block is 10.
  • Hash: Cryptographic hash of all the fields in the block together. Verifies the block has not been modified since creation (e.g., change transaction address).
  • PreviousHash: Should match the Hash value of the previous block in the blockchain. Useful to verify chain integrity I am sure.
  • Timestamp: Time when this block was created. Used for hash variation, and to keep the creation of blocks within a defined time limit (to make some attacks harder).
  • Data: The blockchain is supposed to store data in an “immutable ledger”. I guess the typical blockchain application is still cryptocurrencies, where the data is the Transactions part. I put this extra data field in just to play with other data if I feel like it.
  • Transactions: List of transactions included in this block. Moving coins from one person to the other.
  • Difficulty: The mining difficulty when this block was created. Useful to check the hash validity etc.
  • Nonce: A value to vary in trying to get a hash that matches the set difficulty. In case of bigger difficulty, a larger range of nonce values may be required. But for this tutorial implementation I guess this is fine.

The blockchain is composed of these blocks, each referencing the previous one, in a chain. Thus the blockchain, woohoo. So, first how to calculate the hash for a block:

//calculate hash string for the given block
func hash(block *Block) string {
	indexStr := strconv.Itoa(block.Index)
	timeStr := strconv.FormatUint(uint64(block.Timestamp.Unix()), 16) //base 16 output
	nonceStr := strconv.Itoa(block.Nonce)
	diffStr := strconv.Itoa(block.Difficulty)
	txBytes, _ := json.Marshal(block.Transactions)
	txStr := string(txBytes)
	//this joins all the block elements to one long string with all elements appended after another, to produce the hash
	blockStr := strings.Join([]string{indexStr, block.PreviousHash, timeStr, diffStr, block.Data, txStr, nonceStr}, " ")
	bytes := []byte(blockStr)
	hash := sha256.Sum256(bytes)
	s := hex.EncodeToString(hash[:]) //encode the Hash as a hex-string. the [:] is slicing to match datatypes in args
	return s
}

I think Bitcoin uses something like a Merkle tree to combine elements in a block. But in the above, all the elements in a block are simply turned into strings, concatenated into a single long string, and this string is hashed. So again, we could maybe do better but it works well for a tutorial such as Naivecoin.
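
To give a rough idea of the Merkle tree approach mentioned above, here is a minimal sketch. It is not what Naivecoin or my port does. The function name merkleRoot, the use of a single SHA-256 per node, and duplicating the last node on odd levels are all assumptions made for illustration; Bitcoin's real scheme works on binary transaction data and differs in the details.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// merkleRoot is a hypothetical helper: it hashes every item, then repeatedly
// hashes neighbouring pairs until a single root hash remains.
// Assumption: when a level has an odd number of nodes, the last one is duplicated.
func merkleRoot(items []string) string {
	if len(items) == 0 {
		return ""
	}
	level := make([][]byte, 0, len(items))
	for _, item := range items {
		h := sha256.Sum256([]byte(item))
		level = append(level, h[:])
	}
	for len(level) > 1 {
		if len(level)%2 == 1 {
			level = append(level, level[len(level)-1])
		}
		next := make([][]byte, 0, len(level)/2)
		for i := 0; i < len(level); i += 2 {
			// concatenate the two child hashes and hash the result
			pair := append(append([]byte{}, level[i]...), level[i+1]...)
			h := sha256.Sum256(pair)
			next = append(next, h[:])
		}
		level = next
	}
	return hex.EncodeToString(level[0])
}

func main() {
	root := merkleRoot([]string{"tx1", "tx2", "tx3"})
	fmt.Println("merkle root:", root)
}
```

The root changes if any single item changes, which is the same property the plain concatenate-and-hash approach gives, just with a tree structure that allows proving membership of one item without all the others.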

Now, to create the blocks. As visible in the structure above, each block has to reference the previous ones to create the chain. And to provide assurance over the overall chain, hashes of the block and of the previous block are provided. But the chain has to start somewhere. The first block in the chain is called the genesis block. It has no previous hash, as it has no previous block. So we create it specifically first:

//create genesis block, the first one on the chain to bootstrap the chain
func createGenesisBlock(addToChain bool) Block {
	genesisTime, _ := time.Parse("Jan 2 15:04 2006", "Mar 15 19:00 2018")
	block := Block{1, "", "0", genesisTime, "Teemu oli täällä", nil,1, 1}
	hash := hash(&block)
	block.Hash = hash
	if addToChain {
		globalChain = append(globalChain, block)
	}
	return block
}

More generally, to create non-genesis blocks:

//create a block from the given parameters, and find a nonce to produce a hash matching the difficulty
//finally, append new block to current chain
func createBlock(txs []Transaction, blockData string, difficulty int) Block {
	chainLength := len(globalChain)
	previous := globalChain[chainLength-1]
	index := previous.Index + 1
	timestamp := time.Now().UTC()
	nonce := 0
	newBlock := Block{index, "", previous.Hash, timestamp, blockData, txs, difficulty, nonce}
	for {
		hash := hash(&newBlock)
		newBlock.Hash = hash
		if verifyHashVsDifficulty(hash, difficulty) {
			addBlock(newBlock)
			return newBlock
		}
		nonce++
		newBlock.Nonce = nonce
	}
}

The above code takes the block transactions, data and the current difficulty of the chain as parameters. The rest of the information required to create a block is taken from the previous block (block index, previous block hash) or system clock (current timestamp). Finally, it loops over different nonce values until it finds a hash matching the current difficulty level. Like I mentioned before, it might be a bit simplified, but this is certainly good enough for a tutorial such as the Naivecoin.

So, in this case, to verify the block difficulty:

func verifyHashVsDifficulty(hash string, difficulty int) bool {
	prefix := strings.Repeat("0", difficulty)
	return strings.HasPrefix(hash, prefix)
}

This is quite simple, it just checks that the given hash-string starts with a given number of zeroes. In this case the measure of difficulty is the number of zeroes the hash should start with. In a more general case, I believe the difficulty can be a number under which the hash should fall (2nd top answer at the time of writing this). That gives you much more granularity to define the difficulty. But, again, no need to go overly complex on things in a Naivecoin tutorial.

Similarly, adding a block to the blockchain is quite simple:

//add a new block to the existing chain
func addBlock(block Block) {
	chainLength := len(globalChain)
	previousBlock := globalChain[chainLength-1]
	block.PreviousHash = previousBlock.Hash
	globalChain = append(globalChain, block)
	for _, tx := range block.Transactions {
		addTransaction(tx)
	}
	//todo: check block hash matches difficulty
}

So as I commented above, I should also check that the new block matches the required difficulty. Would not be too hard but I will leave that update for the next version.
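
A minimal sketch of what such a check could look like is below. The simpleBlock type is a cut-down stand-in for the Block struct (no timestamp or transactions), and validateNewBlock, hashSimple and mine are hypothetical names. A real node would also compare the block's difficulty against what the chain currently requires, rather than trusting the value stored in the block.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// simpleBlock is a simplified stand-in for Block (assumption: no timestamp or transactions).
type simpleBlock struct {
	Index        int
	Hash         string
	PreviousHash string
	Data         string
	Difficulty   int
	Nonce        int
}

// hashSimple hashes every field except Hash itself.
func hashSimple(b simpleBlock) string {
	s := strings.Join([]string{strconv.Itoa(b.Index), b.PreviousHash,
		strconv.Itoa(b.Difficulty), b.Data, strconv.Itoa(b.Nonce)}, " ")
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// validateNewBlock is a hypothetical check to run before appending a block.
func validateNewBlock(previous, block simpleBlock) error {
	if block.Index != previous.Index+1 {
		return errors.New("index does not follow previous block")
	}
	if block.PreviousHash != previous.Hash {
		return errors.New("previous hash mismatch")
	}
	if hashSimple(block) != block.Hash {
		return errors.New("stored hash does not match block contents")
	}
	if !strings.HasPrefix(block.Hash, strings.Repeat("0", block.Difficulty)) {
		return errors.New("hash does not satisfy difficulty")
	}
	return nil
}

// mine tries nonce values until the hash has the required number of leading zeroes.
func mine(previous simpleBlock, data string, difficulty int) simpleBlock {
	b := simpleBlock{Index: previous.Index + 1, PreviousHash: previous.Hash,
		Data: data, Difficulty: difficulty}
	for {
		b.Hash = hashSimple(b)
		if strings.HasPrefix(b.Hash, strings.Repeat("0", difficulty)) {
			return b
		}
		b.Nonce++
	}
}

func main() {
	genesis := simpleBlock{Index: 1, PreviousHash: "0", Data: "genesis", Difficulty: 1}
	genesis.Hash = hashSimple(genesis)
	next := mine(genesis, "hello", 1)
	fmt.Println("valid block error:", validateNewBlock(genesis, next))
	next.Data = "tampered"
	fmt.Println("tampered block error:", validateNewBlock(genesis, next))
}
```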

Adding the block also requires adding all the transactions within that block to the list of active transactions:

func addTransaction(tx Transaction) {
	oldTx := findUnspentTransaction(tx.Sender, tx.Id)
	if oldTx >= 0 {
		print("transaction already exists, not adding: ", tx.Id)
		return
	}
	allTransactions = append(allTransactions, tx)
	for _, txIn := range tx.TxIns {
		deleteUnspentTransaction(tx.Sender, txIn.TxId)
	}
	for idx, txOut := range tx.TxOuts {
		utx := UnspentTxOut{tx.Id, idx, txOut.Address, txOut.Amount}
		unspentTxOuts = append(unspentTxOuts, utx)
	}
}

So, as the Naivecoin tutorial so nicely explains, each transaction has two types of “sub-transactions” as I call them here. Not an official term or anything. But these are:

  • TxIn: Transaction inputs.
  • TxOut: Transaction outputs.

I found the explanations related to how these work very confusing to start with. Eventually I got it, but it is not the most intuitive thing. Especially with the terminology of Transaction (Tx), TxIn, TxOut, … Especially since the TxIn and TxOut are mostly the same thing, but expressed in a different way. Well, that is my understanding anyway. Please do correct me.

A TxOut (would it be correct to call it a transaction output?) is what creates “coins” to spend. Sending coins to someone creates a TxOut. Since a single transaction can contain multiple TxOut instances, it just means you can send coins to multiple people in a single transaction. Or multiple wallets that is. The most common seems to be to send coins to someone else and back to yourself. Why is this?

To create TxOut’s you need a matching amount of TxIn’s. Or the amount of coins referenced in each should match. This check should actually be a part of the addTransaction function above: a transaction is valid only if the input and output amounts add up to the same sums. But you can only add existing TxOut instances to a transaction as TxIn’s. So if you got a TxOut giving you 100 coins, you can only add that TxOut to your next transaction as a TxIn if you want to send someone coins. So the TxIn has to be 100 coins in this case. What if you want to send only 50 coins? Then you put the single TxIn with 100 coins, but create 2 TxOut’s for that transaction. One to send 50 coins to the receiver, and another to send the remaining 50 coins to yourself. This way the balances match again. Confusing? Yes. Check pretty pictures some way down this post.
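
The sum check described above could be sketched roughly like this. The lower-case types are simplified stand-ins for the real structs, and amountsBalance is a hypothetical name; it is not part of my port yet.

```go
package main

import "fmt"

// Simplified stand-ins for the TxIn, TxOut and UnspentTxOut structs.
type txIn struct {
	TxId  string
	TxIdx int
}

type txOut struct {
	Address string
	Amount  int
}

type utxo struct {
	TxId    string
	TxIdx   int
	Address string
	Amount  int
}

// amountsBalance is a hypothetical check: every input must reference a known
// unspent output, and the referenced amounts must sum to the new output amounts.
func amountsBalance(ins []txIn, outs []txOut, unspent []utxo) bool {
	inSum := 0
	for _, in := range ins {
		found := false
		for _, u := range unspent {
			if u.TxId == in.TxId && u.TxIdx == in.TxIdx {
				inSum += u.Amount
				found = true
				break
			}
		}
		if !found {
			return false // input refers to nothing spendable
		}
	}
	outSum := 0
	for _, out := range outs {
		outSum += out.Amount
	}
	return inSum == outSum
}

func main() {
	unspent := []utxo{{"tx-1", 0, "my-address", 100}}
	ins := []txIn{{"tx-1", 0}}

	// Send 50 to the receiver and 50 back to myself as change.
	withChange := []txOut{{"receiver-address", 50}, {"my-address", 50}}
	fmt.Println(amountsBalance(ins, withChange, unspent)) // true

	// Forgetting the change output leaves the sums unbalanced.
	noChange := []txOut{{"receiver-address", 50}}
	fmt.Println(amountsBalance(ins, noChange, unspent)) // false
}
```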

Of course, if you send coins to an address that is not used, you will “burn” those coins. No-one can ever access them, because no-one knows the private key to match the public key used to sign that TxOut. You could maybe make a lot of money if you could find a key to some of the bigger coin-burn addresses. But I digress.

How do the coins initially come into existence if you can only send coins by referencing an existing TxOut? Where does the first TxOut come from? It has to all be bootstrapped somehow, right? This is what is called a Coinbase transaction. No, not according to the cryptocurrency selling company. They took the name from the term, not the other way around. A coinbase transaction might look something like this:

var COINBASE_AMOUNT = 1000

func createCoinbaseTx(address string) Transaction {
	var cbTx Transaction

	var txIn TxIn
	txIn.TxIdx = len(globalChain)
	cbTx.TxIns = append(cbTx.TxIns, txIn)

	var txOut TxOut
	txOut.Amount = COINBASE_AMOUNT
	txOut.Address = address
	cbTx.TxOuts = append(cbTx.TxOuts, txOut)

	cbTx.Id = calculateTxId(cbTx)
	cbTx.Signature = "coinbase"

	return cbTx
}

The above code creates the special transaction called the coinbase transaction. In proof-of-work type solutions, such as in the Naivecoin tutorial, this is added to each mined block. So when a miner finds a hash for a new block (thus “creating” it), they get the reward for the block. This reward is paid as the coinbase transaction. Each block can have one, and the miner can create it in the block. I would expect the miner then puts their own address in the transaction as the receiver. For the TxOut address. The TxIn in this case comes from nowhere, from thin air, or from a made up input. This is how the money is made here.

To my understanding, the way coinbase transactions are verified is simply by each node in the network accepting only a single coinbase transaction in a block, with specific parameters. These should match the number of coins to be rewarded according to the coin specification, and the defined coinbase signature. I believe in Bitcoin you can set many of the other fields as you like. For example, people can hide messages in the TxIn address or some other fields of the coinbase transaction. Because they do not need to match anything real. Once the coinbase amount is issued to a miner, they can then send coins using the coinbase TxOut as part of their new transaction.
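
A rough sketch of such a node-side check, based only on the fields that createCoinbaseTx sets above, could look like this. The function validateCoinbaseTx and its chainLength parameter are hypothetical, and real coins check more than this.

```go
package main

import (
	"errors"
	"fmt"
)

type TxIn struct {
	TxId  string
	TxIdx int
}

type TxOut struct {
	Address string
	Amount  int
}

type Transaction struct {
	Id        string
	Sender    string
	Signature string
	TxIns     []TxIn
	TxOuts    []TxOut
}

const coinbaseAmount = 1000 // same value as COINBASE_AMOUNT above

// validateCoinbaseTx is a hypothetical check. Assumption: chainLength is the
// chain length at the time the block was mined, as used by createCoinbaseTx.
func validateCoinbaseTx(tx Transaction, chainLength int) error {
	if len(tx.TxIns) != 1 {
		return errors.New("coinbase must have exactly one TxIn")
	}
	if tx.TxIns[0].TxIdx != chainLength {
		return errors.New("coinbase TxIn index does not match chain length")
	}
	if len(tx.TxOuts) != 1 {
		return errors.New("coinbase must have exactly one TxOut")
	}
	if tx.TxOuts[0].Amount != coinbaseAmount {
		return errors.New("coinbase amount does not match the specification")
	}
	if tx.Signature != "coinbase" {
		return errors.New("unexpected coinbase signature")
	}
	return nil
}

func main() {
	tx := Transaction{
		Signature: "coinbase",
		TxIns:     []TxIn{{TxIdx: 3}},
		TxOuts:    []TxOut{{Address: "miner-address", Amount: coinbaseAmount}},
	}
	fmt.Println("valid coinbase error:", validateCoinbaseTx(tx, 3))

	tx.TxOuts[0].Amount = 5000 // a greedy miner
	fmt.Println("greedy coinbase error:", validateCoinbaseTx(tx, 3))
}
```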

The related data structures:

type TxOut struct {
	Address string	//receiving public key
	Amount int		//amount of coin units to send/receive
}

type TxIn struct {
	TxId      string	//id of the transaction inside which this TxIn should be found (in the list of TxOut)
	TxIdx     int		//index of TxOut this refers to inside the transaction
}

type UnspentTxOut struct {
	TxId    string	//transaction id
	TxIdx   int		//index of txout in transaction
	Address string  //public key of owner
	Amount  int		//amount coin units that was sent/received
}

//transaction is from a single person/entity
//inputs from that entity use funds, outputs show how much and where
//all the time there should be some list kept and updated based on this
type Transaction struct {
	Id string
	Sender string //the address/public key of the sender
	Signature string	//signature including all txin and txout. in this case we sign Transaction.Id since that already has all the TxIn and TxOut
	TxIns []TxIn
	TxOuts []TxOut
}

Each block contains a set of transactions, which contain a set of TxIn and TxOut.

  • The TxOut defines the address (encoding of the recipient’s public key) of the recipient, and the amount of coins to send.
  • The TxIn defines a reference to a previous TxOut. So an identifier to find the transaction which contains it, and the index of the TxOut within that transaction.
  • The transaction itself contains the address (public key) of the sender. Since each TxIn and TxOut is contained in the transaction, and references that transaction, the owner can be checked against the address of the sender in the transaction. So the recipient in the “spent” TxOut should match the sender of this new transaction.
  • A prespecified way of calculating the transaction id is defined in the coin specification. This is then used to create and encode a hash value for the transaction id. I guess most real coins would use some form of Merkle-trees as linked high above. In this case the naivecoin simply adds the information on the sender and contained TxIn and TxOut instances into a long string and hashes that.
  • The signature is used to sign the transaction id (already containing all the transaction data in the hash) with the sender private key, using ECDSA signatures, as described in my previous post. The authenticity of the transaction and all its contents can then be verified by using the public key of the sender as embedded in the sender address to verify the signature. Brilliant stuff.
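
The transaction id calculation described in the list above could be sketched as follows. The exact field order and the "|" separator are my assumptions for illustration; the calculateTxId in my port may differ in the details.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

type TxIn struct {
	TxId  string
	TxIdx int
}

type TxOut struct {
	Address string
	Amount  int
}

type Transaction struct {
	Id        string
	Sender    string
	Signature string
	TxIns     []TxIn
	TxOuts    []TxOut
}

// txIdSketch joins the sender and all TxIn/TxOut fields into one long string
// and hashes it. Id and Signature are left out, since they depend on the result.
// Assumption: fields are separated with "|" to avoid ambiguous concatenations.
func txIdSketch(tx Transaction) string {
	parts := []string{tx.Sender}
	for _, in := range tx.TxIns {
		parts = append(parts, in.TxId, strconv.Itoa(in.TxIdx))
	}
	for _, out := range tx.TxOuts {
		parts = append(parts, out.Address, strconv.Itoa(out.Amount))
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	tx := Transaction{
		Sender: "bob-address",
		TxIns:  []TxIn{{"tx-a", 0}},
		TxOuts: []TxOut{{"alice-address", 550}, {"bob-address", 50}},
	}
	fmt.Println("tx id:", txIdSketch(tx))
}
```

Since the id covers every input and output, signing just the id is enough to protect the whole transaction against modification.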

In pictures this looks something like this:

[Figure: tx2]

In the above example (in the figure above), Bob is sending 550 coins to Alice. To do this, Bob needs TxIn’s that sum up to a minimum of 550. He has a set that sums up to 600. So these are used as the TxIn’s in this transaction. As 600 is more than 550, there are two TxOut’s created. One assigns the 550 coins to Alice, and the other 50 coins back to Bob. The blockchain tracks coin amounts using the TxOut instances stored within transactions, so it can only “spend” a full TxOut. This is why the “change” (50 coins in this example) generates a completely new TxOut.

The wording in the figure above for “deleting” inputs simply refers to the TxIn’s being marked as used, and not being able to use them again. So the previous TxOut they refer to (from previous transactions that gave coins to Bob) are marked as used. As the wallet application and blockchain nodes track which TxOut are unused (not referenced by any accepted TxIn), they can count the balance of a user as a sum of their unspent TxOut.

The wording for “creating” new TxOut in the figure above refers to adding new TxOut to the list of unspent TxOut. In this case it adds one for Bob (50 coins) and one for Alice (550 coins).
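
The bookkeeping for the Bob and Alice example can be sketched like this. The helper names balanceOf and removeSpent are hypothetical, and the split of Bob's 600 coins into 400 and 200 is made up, since only the total is given above.

```go
package main

import "fmt"

type UnspentTxOut struct {
	TxId    string
	TxIdx   int
	Address string
	Amount  int
}

// balanceOf sums the unspent outputs assigned to the given address.
func balanceOf(address string, unspent []UnspentTxOut) int {
	total := 0
	for _, u := range unspent {
		if u.Address == address {
			total += u.Amount
		}
	}
	return total
}

// removeSpent returns the list without the output referenced by a TxIn.
func removeSpent(unspent []UnspentTxOut, txId string, txIdx int) []UnspentTxOut {
	var kept []UnspentTxOut
	for _, u := range unspent {
		if u.TxId == txId && u.TxIdx == txIdx {
			continue // this output is now spent
		}
		kept = append(kept, u)
	}
	return kept
}

func main() {
	// Assumption: Bob's 600 coins come from two earlier outputs.
	unspent := []UnspentTxOut{
		{"tx-a", 0, "bob", 400},
		{"tx-b", 1, "bob", 200},
	}
	fmt.Println("bob before:", balanceOf("bob", unspent)) // 600

	// "Delete" the inputs used by the new transaction...
	unspent = removeSpent(unspent, "tx-a", 0)
	unspent = removeSpent(unspent, "tx-b", 1)
	// ...and "create" the new outputs: 550 to Alice, 50 change to Bob.
	unspent = append(unspent,
		UnspentTxOut{"tx-c", 0, "alice", 550},
		UnspentTxOut{"tx-c", 1, "bob", 50})

	fmt.Println("bob after:", balanceOf("bob", unspent))     // 50
	fmt.Println("alice after:", balanceOf("alice", unspent)) // 550
}
```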

The code higher above shows the structure of TxOut, TxIn, and UnspentTxOut. The TxId field ties each TxIn and UnspentTxOut to the specific transaction where the associated TxOut was created. So a TxIn also refers to a TxOut, but serves as a way to mark that TxOut as “spent”. The reference consists of the transaction id (each transaction in the blockchain having a unique id) and the TxIdx, which is the index of the TxOut within that transaction. Since a transaction can have multiple TxOut, the index identifies exactly which TxOut in the transaction’s list is meant.

Each TxOut is assigned to a specific address, and the address has the recipient public key. Each transaction is signed using the sender’s private key, and this signature can be verified using the signer’s public key. Thus each TxIn can be verified to be spendable by the person who signed the transaction containing the TxIn. Following the TxIn reference to the TxOut it is referring to, and looking up the address it was assigned to gives the public key. By checking that this public key matches the private key used to sign the new transaction wanting to spend that TxIn, it can be verified that the person creating the transaction is in possession of the private key associated to that public key.

Proper transaction verification should check that all TxIn match the transaction signature, that there is a sufficient number of TxIn, and that the TxIn total sum matches the total sum of TxOut. Probably much more too, but that is a good start.

Much confusing all this is. Hope you, dear reader, if you exist, find more information to understand my ramblings and to correct me.

After the transaction in the figure above, the end result looks something like this:

[Figure: tx3]

With all this in place, we can write the code to send coins:

func sendCoins(privKey *ecdsa.PrivateKey, to string, count int) Transaction {
	from := encodePublicKey(privKey.PublicKey)
	//TODO: error handling
	txIns, total := findTxInsFor(from, count)
	txOuts := splitTxIns(from, to, count, total)
	tx := createTx(privKey, txIns, txOuts)
	return tx
}

func findTxInsFor(address string, amount int) ([]TxIn, int) {
	balance := 0
	var unspents []TxIn
	for _, val := range unspentTxOuts {
		if val.Address == address {
			balance += val.Amount
			txIn := TxIn{val.TxId, val.TxIdx}
			unspents = append(unspents, txIn)
		}
		if balance >= amount {
			return unspents, balance
		}
	}
	return nil, -1
}

func splitTxIns(from string, to string, toSend int, total int) []TxOut {
	diff := total - toSend
	txOut := TxOut{to, toSend}
	var txOuts []TxOut
	txOuts = append(txOuts, txOut)
	if diff == 0 {
		return txOuts
	}
	txOut2 := TxOut{from, diff}
	txOuts = append(txOuts, txOut2)
	return txOuts
}

func createTx(privKey *ecdsa.PrivateKey, txIns []TxIn, txOuts []TxOut) Transaction {
	pubKey := encodePublicKey(privKey.PublicKey)
	tx := Transaction{"",  pubKey,"", txIns, txOuts}

	signTxIns(tx, privKey)

	tx.Id = calculateTxId(tx)
	tx.Signature = signData(privKey, []byte(tx.Id))
	return tx
}

The full code for my Go port of the naivecoin tutorial (first parts) is on my Github.

There are various other concepts I would still like to look into. Ways to effectively search the whole blockchain and all transactions in it (to validate transactions, build block explorers, etc.) would be interesting. I understand some type of embedded databases may be used. And endlessly argued about. Ah, just when you thought the programming language flame-wars of the past were no more, you find it all again. Much stability in argument, such peace it brings. In my Rock’n chair I kek at all things past and present. Sorry, just trying to learn to talk like the internet and crypto people do. On a lighter note, other topics of interest to look into include robust peer-to-peer connections, ring signatures, proof of stake, etc.

That about sums it up for my experiment for now. Do let me know what are all the things I got wrong.

Trying to learn ECDSA and GoLang

Recently I have been looking at the Naivecoin tutorial, and trying to implement it in Go to get an idea of how blockchains really work, and to learn some more Go as well. The tutorial code is in Javascript, and translating it to Go has been mostly straightforward. However, porting the third part with transactions was giving me some issues. I had trouble figuring out how to port the signature part. This is the tutorial code in Javascript:

    const key = ec.keyFromPrivate(privateKey, 'hex');
    const signature: string = toHexString(key.sign(dataToSign).toDER());

This is nice and simple, just suitable for a high-level framework and tutorial. However, to implement it myself in Go, …

The JS code above takes the private key as a hex formatted string, and parses that into a Javascript PrivateKey object. This key object is then used to sign the “dataToSign”, the signature is formatted as something called “DER”, and the result is formatted as a hex string. What does all that mean?

The tutorial refers to Elliptic Curve Digital Signature Algorithm (ECDSA). The data to sign in this case is the SHA256 hash of the transaction ID. So how to do this in Go? Go has an ecdsa package with private keys, public keys, and functions to sign and verify data. Sounds good. But the documentation is quite sparse, so how do I know how to properly use it?

To try it, I decided to first write a program in Java using ECDSA signatures, and use it to compare to the results of the Go version I would write. This way I would have another point of reference to compare my results to, and to understand if I did something wrong. I seemed to find more information about the Java implementation, and since I am more familiar with Java in general..

So first to generate the keys to use for signatures in Java:

	public static String generateKey() throws Exception {
		KeyPairGenerator keyGen = KeyPairGenerator.getInstance("EC");
		SecureRandom random = SecureRandom.getInstance("SHA1PRNG");

		keyGen.initialize(256, random); //256 bit key size

		KeyPair pair = keyGen.generateKeyPair();
		ECPrivateKey priv = (ECPrivateKey) pair.getPrivate();
		PublicKey pub = pair.getPublic();

		//actually also need public key, but lets get to that later...
		return Base64.getEncoder().encodeToString(priv.getEncoded());
	}

Above code starts with getting an “EC” key-pair generator, EC referring to Elliptic Curve. Then get a secure random number generator instance, in this case one based on SHA1 hash algorithm. Apparently this is fine, even if SHA1 is not recommended for everything these days. Not quite sure about the key size of 256 given, but maybe have to look at that later. First to get this working.

The “priv.getEncoded()” part turns the private key into a standard encoding format as a byte array. Base64 encode it for character representation, to copy to the Go version.

Next, to sign the data (or message, or whatever we want to sign..):

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv);

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte);

		byte[] realSig = ecdsa.sign();

		System.out.println("Signature: " + new BigInteger(1, realSig).toString(16));

		return realSig;
	}

Above starts with getting a Java instance of the ECDSA signature algorithm, with type “SHA1withECDSA”. I spent a good moment wondering what all this means, to be able to copy the functionality into the Go version. So long story short, first the data is hashed with SHA1 and then this hash is signed with ECDSA. Finally, the code above prints the signature bytes as a hexadecimal string (byte array->BigInteger->base 16 string). I can then simply copy-paste this hex-string to Go to see if I can get signature verification to work in Go vs Java. Brilliant.

First I tried to see that I can get the signature verification to work in Java:

	private static boolean verifySignature(PublicKey pubKey,String msg, byte[] signature) throws Exception {
		byte[] message = msg.getBytes("UTF-8");
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");
		ecdsa.initVerify(pubKey);
		ecdsa.update(message);
		return ecdsa.verify(signature);
	}

The code above takes the public key associated with the private key that was used to sign the data (called “msg” here). It creates the same type of ECDSA signature instance as the signature creation previously. This is used to verify the signature is valid for the given message (data). So signed with the private key, verified with the public key. And yes, it returns true for the signed message string, and false otherwise, so it works. So now knowing I got this to work, I can try the same in Go, using the signature, public key, and private key that was used in Java. But again, the question. How do I move these over?

Java seems to provide functions such as key.getEncoded(). This gives a byte array. We can then Base64 encode it to get a string (I believe Bitcoin etc. use Base58 but the same idea). So something like this:

		//https://stackoverflow.com/questions/5355466/converting-secret-key-into-a-string-and-vice-versa
		byte[] pubEncoded = pub.getEncoded();
		String encodedPublicKey = Base64.getEncoder().encodeToString(pubEncoded);
		String encodedPrivateKey = Base64.getEncoder().encodeToString(priv.getEncoded());
		System.out.println(encodedPrivateKey);
		System.out.println(encodedPublicKey);

Maybe I could then take the output I just printed, and decode that into the key in Go? But what is the encoding? Well, the JDK docs say getEncoded() “Returns the key in its primary encoding format”. And what might that be? Well some internet searching and debugger runs later I come up with this (which works to re-create the keys in Java):

	public static PrivateKey base64ToPrivateKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		PKCS8EncodedKeySpec spec = new PKCS8EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PrivateKey privateKey = factory.generatePrivate(spec);
		return privateKey;
	}

	public static PublicKey base64ToPublicKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		X509EncodedKeySpec spec = new X509EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PublicKey publicKey = factory.generatePublic(spec);
		return publicKey;
	}

So the JDK encodes the private key in PKCS8 format, and the public key in some kind of X509 format. X509 seems to be related to certificates, and PKCS refers to “Public Key Cryptography Standards”, of which there are several. Both of these seem a bit complicated, as I was just looking to transfer the keys over. Since people can post those online for various crypto tools as short strings, it cannot be that difficult, can it?

I tried to look for ways to take PKCS8 and X509 data into Go and transform those into private and public keys. Did not get me too far with that. Instead, I figured there must be only a small part of the keys that is needed to reproduce them.

So I found that the private key has a single large number that is the important bit, and the public key can be calculated from the private key. And the public key in itself consists of two parameters, the x and y coordinates of a point (I assume on the elliptic curve). I browsed all over the internet trying to figure this all out, but did not keep records of all the sites I visited, so my references are kind of lost. However, here is one description that just so states the integer and point part. Anyway, please let me know of any good references for a non-mathematician like me to understand it if you have any.

To get the private key value into suitable format to pass around in Java:

	//https://stackoverflow.com/questions/40552688/generating-a-ecdsa-private-key-in-bouncy-castle-returns-a-public-key
	private static String getPrivateKeyAsHex(PrivateKey privateKey) {
		ECPrivateKey ecPrivateKey = (ECPrivateKey) privateKey;
		byte[] privateKeyBytes = ecPrivateKey.getS().toByteArray();
		String hex = bytesToHex(privateKeyBytes);
		return hex;
	}

The “hex” string in the above code is the big integer value that forms the basis of the private key. This can now be passed, backed up, or whatever we desire. Of course, it should be kept private so no posting it on the internet.

For the public key:

	private static String getPublicKeyAsHex(PublicKey publicKey) {
		ECPublicKey ecPublicKey = (ECPublicKey) publicKey;
		ECPoint ecPoint = ecPublicKey.getW();

		byte[] affineXBytes = ecPoint.getAffineX().toByteArray();
		byte[] affineYBytes = ecPoint.getAffineY().toByteArray();

		String hexX = bytesToHex(affineXBytes);
		String hexY = bytesToHex(affineYBytes);

		return hexX+":"+hexY;
	}

The above code takes the X and Y coordinates that make up the public key, combines them, and thus forms a single string that can be passed to get the X and Y for public key. A more sensible option would likely just create a single byte array with the length of the first part as first byte or two. Something like [byte count for X][bytes of X][bytes of Y]. But the string concatenation works for my simple example to try to understand it.

And then there is one more thing that needs to be encoded and passed between the implementations, which is the signature. Far above, I wrote the “signMsg()” method to build the signature. I also printed the signature bytes out as a hex-string. But what format is the signature in, and how do you translate it to another platform and verify it? It turns out Java gives the signatures in ASN.1 format. There is a good description of the format here. It’s not too complicated but how would I import that into Go again? I did not find any mention of this in the ECDSA package for Go. By searching with ASN.1 I did finally find an ASN.1 package for Go. But is there a way to do that without these (poorly documented) encodings?

Well, it turns out that ECDSA signatures can also be described by using just two large integers, which I refer to here as R and S. To get these in Java:

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv);

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte);

		byte[] realSig = ecdsa.sign();

		System.out.println("R: "+extractR(realSig));
		System.out.println("S: "+extractS(realSig));

		return realSig;
	}

	//https://stackoverflow.com/questions/48783809/ecdsa-sign-with-bouncycastle-and-verify-with-crypto
	public static BigInteger extractR(byte[] signature) throws Exception {
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startR + 2, startR + 2 + lengthR));
	}

	public static BigInteger extractS(byte[] signature) throws Exception {
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		int startS = startR + 2 + lengthR;
		int lengthS = signature[startS + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startS + 2, startS + 2 + lengthS));
	}

Above code takes the byte array of the signature, and parses the R and S from it as matching the ASN.1 specification I linked above. So with that, another alternative is again to just turn the R and S into hex-strings or Base58 encoded strings, combine them as a single byte-array and hex-string or Base58 that, or whatever. But just those two values need to be passed to capture the signature.

Now, finally to parse all this data in Go and to verify the signature. First to get the private key from the hex-string:

	func hexToPrivateKey(hexStr string)  *ecdsa.PrivateKey {
		bytes, err := hex.DecodeString(hexStr)
		if err != nil {
			//the input was not a valid hex-string, nothing sensible to build from it
			panic(err)
		}

		k := new(big.Int)
		k.SetBytes(bytes)

		priv := new(ecdsa.PrivateKey)
		curve := elliptic.P256()
		priv.PublicKey.Curve = curve
		priv.D = k
		//the public key point is the curve base point multiplied by the private key integer
		priv.PublicKey.X, priv.PublicKey.Y = curve.ScalarBaseMult(k.Bytes())
		//this print can be used to verify if we got the same parameters as in Java version
		fmt.Printf("X: %d, Y: %d\n", priv.PublicKey.X, priv.PublicKey.Y)

		return priv
	}

The above code takes the hex-string, parses it into a byte array, creates a Go big integer from that, and sets the result as the value into the private key. The other part that is needed is the elliptic curve definition. In practice, one of a predefined set of curves is usually used, and the same curve is used for a specific purpose, so whichever is selected for the blockchain can be defined as a constant. Bitcoin, for example, uses the Secp256k1 curve. In this case I always use the P256 curve, both in the Java and Go versions. So I just set the curve and the big integer to create the private key. The public key (X and Y parameters) is calculated here from the private key, by multiplying the curve's base point with the private key’s big integer (the ScalarBaseMult call).

To build the public key straight from the X and Y values passed in as hex-strings:

	func hexToPublicKey(xHex string, yHex string) *ecdsa.PublicKey {
		xBytes, _ := hex.DecodeString(xHex)
		x := new(big.Int)
		x.SetBytes(xBytes)

		yBytes, _ := hex.DecodeString(yHex)
		y := new(big.Int)
		y.SetBytes(yBytes)

		pub := new(ecdsa.PublicKey)
		pub.X = x
		pub.Y = y

		pub.Curve = elliptic.P256()

		return pub
	}

Again, Base58 or similar would likely be a more efficient representation. The above code makes it possible to pass around just the public key and not the private key, which is how it should be done: the parameters X and Y are passed, and the curve is a constant choice.

To create and verify the signature from the passed values:

	type ecdsaSignature struct {
		R, S *big.Int
	}

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		//https://github.com/gtank/cryptopasta/blob/master/sign.go
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		if _, err := asn1.Unmarshal(sig, &esig); err != nil {
			//not a valid ASN.1 encoded signature
			return false
		}
		//we can use these prints to compare to what we had in Java...
		fmt.Printf("R: %d , S: %d\n", esig.R, esig.S)
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

The above version reads the actual ASN.1 encoded signature that is produced by the Java default signature encoding. To get the functionality matching the Java “SHA1withECDSA” algorithm, I first have to hash the input data with SHA1 as done here. Since the Java version is a bit of a black box with just that string definition, I spent a good moment wondering about that. I would guess the same approach would apply for other choices such as “SHA256withECDSA” by just replacing the hash function with another. Alternatively, I can also just pass in directly the R and S values of the signature:

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		//https://github.com/gtank/cryptopasta/blob/master/sign.go
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		//R and S are nil pointers in a fresh struct, so allocate them before calling SetString
		esig.R, _ = new(big.Int).SetString("89498588918986623250776516710529930937349633484023489594523498325650057801271", 0)
		esig.S, _ = new(big.Int).SetString("67852785826834317523806560409094108489491289922250506276160316152060290646810", 0)
		fmt.Printf("R: %d , S: %d\n", esig.R, esig.S)
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

So in the above, R and S are set directly from the numbers. Normally these would be encoded more efficiently and given as parameters, but this works to demonstrate. The two long strings are the integers for R and S that I printed out in the Java version.

Strangely, printing the R and S using the ASN.1 and the direct passing of the numbers gives a different value for R and S. Which is a bit odd. But they both verify the signature fine. I read somewhere that some transformations can be done on the signature numbers while keeping it valid. Maybe this is done as part of the encoding or something? I have no idea. But it works. Much trust such crypto skills I have.

func TestSigning(t *testing.T) {
	xHexStr := "4bc55d002653ffdbb53666a2424d0a223117c626b19acef89eefe9b3a6cfd0eb"
	yHexStr := "d8308953748596536b37e4b10ab0d247f6ee50336a1c5f9dc13e3c1bb0435727"
	ePubKey := hexToPublicKey(xHexStr, yHexStr)

	sig := "3045022071f06054f450f808aa53294d34f76afd288a23749628cc58add828e8b8f2b742022100f82dcb51cc63b29f4f8b0b838c6546be228ba11a7c23dc102c6d9dcba11a8ff2"
	sigHex, _ := hex.DecodeString(sig)
	ok := verifyMySig(ePubKey, "This is string to sign", sigHex)
	if !ok {
		t.Error("signature did not verify")
	}
}

And finally, it works! Great 🙂

Playing with Pairwise Testing and PICT

A while back, I was doing some lectures on advanced software testing technologies. One topic was combinatorial testing. Looking at the materials, there are good and free tools out there to generate tests to cover various combinations. Still, I don’t see many people use them, and the materials out there don’t seem too great.

Combinatorial testing here refers to having 2-way, 3-way, up to N-way (sometimes they seem to call it t-way…) combinations of data values in different test cases. 2-way is also called pairwise testing. This simply refers to all pairs of data values appearing in different test cases. For example, if one test uses values “A” and “B”, and another uses a combination of “A” and “C”, you would have covered the pairs A+B and A+C but not B+C. With large numbers of potential values, the set of potential combinations can grow pretty huge, so finding a minimal set to cover all combinations can be very useful.
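
To make "all pairs covered" concrete, here is a minimal Go sketch that collects the value pairs appearing in a set of tests. The three parameters and their values are hypothetical, and coveredPairs is a name made up for this example.

```go
package main

import "fmt"

// coveredPairs returns the set of value pairs that appear together in at
// least one test. Each test holds one value per parameter, and a pair is
// identified by the two parameter positions plus the two values.
func coveredPairs(tests [][]string) map[string]bool {
	pairs := make(map[string]bool)
	for _, t := range tests {
		for i := 0; i < len(t); i++ {
			for j := i + 1; j < len(t); j++ {
				key := fmt.Sprintf("p%d=%s,p%d=%s", i, t[i], j, t[j])
				pairs[key] = true
			}
		}
	}
	return pairs
}

func main() {
	// Three hypothetical parameters with two values each.
	// Testing everything would take 2*2*2 = 8 tests.
	tests := [][]string{
		{"on", "fast", "red"},
		{"on", "slow", "blue"},
		{"off", "fast", "blue"},
		{"off", "slow", "red"},
	}
	// 3 parameter pairs * 4 value pairs each = 12 possible pairs.
	fmt.Println(len(coveredPairs(tests)), "of 12 possible pairs covered")
}
```

In this small case four tests cover every pair, half of what the exhaustive set would need.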

The benefits

There is a nice graph over at NIST, including a PDF with a broader description. Basically these show that 2-way and 3-way combinations already show very high gains in finding defects over considering coverage of single variables alone. Of course, things get a bit more complicated when you need to find all relevant variables in the program control flow, how to define what you can combine, all the constraints, etc. Maybe later. Now I just wanted to try the combinatorial test generation.

Do Not. Try. Bad Yoda Joke. Do Try.

So I gave combinatorial test generation a go. Using a nice and freely available PICT tool from Microsoft Research. It even compiles on different platforms, not just Windows. Or so they say on their Github.

Unexpectedly, compiling and getting PICT to run on my OSX was quite simple. Just “make” and “make test” as suggested on the main Github page. Probably I had most dependencies already from before, but anyway, it was surprisingly easy.

I made “mymodels” and “myoutputs” directories under the directory where I cloned the git repository and compiled the code, just to keep some order to my stuff. This is why the following example commands work.

I started with the first example on PICT documentation page. The model looks like this:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

Running the tool and getting some output is actually simpler than I expected:

./pict mymodels/example1.pict >myoutputs/example1.txt

PICT prints the list of generated test value combinations to the standard output, which generally just means printing a bunch of lines on the console/screen. To save the generated values, I just redirect the output to myoutputs/example1.txt, as shown above. In this case, the output looks like this:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	100	quick	FAT32	1024	on
Logical	10000	slow	NTFS	512	off
Primary	500	quick	FAT	65536	off
Span	10000	slow	FAT	16384	on
Logical	40000	quick	FAT32	16384	off
Span	1000	quick	NTFS	512	on
Span	10	slow	FAT32	32768	off
Stripe	5000	slow	NTFS	32768	on
RAID-5	500	slow	FAT	32768	on
Mirror	1000	quick	FAT	32768	off
Single	10	quick	NTFS	4096	on
RAID-5	100	slow	FAT32	4096	off
Mirror	100	slow	NTFS	65536	on
RAID-5	40000	quick	NTFS	2048	on
Stripe	5000	quick	FAT	4096	off
Primary	40000	slow	FAT	8192	on
Mirror	10	quick	FAT32	8192	off
Span	500	slow	FAT	1024	off
Single	1000	slow	FAT32	2048	off
Stripe	500	quick	NTFS	16384	on
Logical	10	quick	FAT	2048	on
Stripe	10000	quick	FAT32	512	off
Mirror	500	quick	FAT32	2048	on
Primary	10	slow	FAT32	16384	on
Single	10	quick	FAT	512	off
Single	10000	quick	FAT32	65536	off
Primary	40000	quick	NTFS	32768	on
Single	100	quick	FAT	8192	on
Span	5000	slow	FAT32	2048	on
Single	5000	quick	NTFS	16384	off
Logical	500	quick	NTFS	8192	off
RAID-5	5000	quick	NTFS	1024	on
Primary	1000	slow	FAT	1024	on
RAID-5	10000	slow	NTFS	8192	on
Logical	100	quick	NTFS	32768	off
Primary	10000	slow	FAT	32768	on
Stripe	40000	quick	FAT32	65536	on
Span	40000	quick	FAT	4096	on
Stripe	1000	quick	FAT	8192	off
Logical	1000	slow	FAT	4096	off
Primary	100	quick	FAT	2048	off
Single	40000	quick	FAT	1024	off
RAID-5	1000	quick	FAT	16384	on
Single	500	quick	FAT32	512	off
Stripe	10	quick	NTFS	2048	off
Primary	100	quick	NTFS	512	off
Logical	10000	slow	NTFS	1024	off
Mirror	5000	quick	FAT	512	on
Logical	5000	slow	NTFS	65536	off
Mirror	10000	slow	FAT	2048	off
RAID-5	10	slow	FAT32	65536	off
Span	100	quick	FAT	65536	on
Single	5000	quick	FAT	32768	on
Span	1000	quick	NTFS	65536	off
Primary	500	slow	FAT32	4096	off
Mirror	40000	slow	FAT32	4096	off
Mirror	10	slow	FAT32	1024	off
Logical	10000	quick	FAT	4096	off
Span	5000	slow	FAT	8192	off
RAID-5	40000	quick	FAT32	512	on
Primary	5000	quick	NTFS	1024	off
Mirror	100	slow	FAT32	16384	off

The first line is the header, and values/columns are separated by tabulator characters (tabs).
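
For using the output in automated tests, the tab-separated format is easy to read in. A minimal Go sketch, assuming the output has already been loaded into a string. parsePICT is a name made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parsePICT splits tab-separated PICT output into the header row and the
// test case rows. Empty lines are skipped.
func parsePICT(output string) (header []string, rows [][]string) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimRight(sc.Text(), "\r")
		if strings.TrimSpace(line) == "" {
			continue
		}
		fields := strings.Split(line, "\t")
		if header == nil {
			header = fields
			continue
		}
		rows = append(rows, fields)
	}
	return header, rows
}

func main() {
	// Header and the first two test cases from the output above.
	out := "Type\tSize\tFormat method\tFile system\tCluster size\tCompression\n" +
		"Stripe\t100\tquick\tFAT32\t1024\ton\n" +
		"Logical\t10000\tslow\tNTFS\t512\toff\n"
	header, rows := parsePICT(out)
	fmt.Println(len(header), "columns,", len(rows), "test cases")
}
```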

The output above is 62 generated combinations/test cases as evidenced by:

wc -l myoutputs/example1.txt 
      63 myoutputs/example1.txt

(wc -l counts lines, and the first line is the header, so I subtract 1)

To produce all 3-way combinations with PICT, the syntax is:

./pict mymodels/example1.pict >myoutputs/example1.txt /o:3

which generates 392 combinations/test cases:

wc -l myoutputs/example1.txt 
      393 myoutputs/example1.txt
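
To put the 62 and 392 in context, here is a small Go calculation. The exhaustive test count is the product of all value counts in the model. A t-way set can never be smaller than the product of the t largest value counts, since every combination of those values needs a test of its own. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// exhaustive returns the number of tests needed to try every combination.
func exhaustive(sizes []int) int {
	n := 1
	for _, s := range sizes {
		n *= s
	}
	return n
}

// lowerBound returns the product of the t largest value counts, which is
// the minimum possible size of a t-way test set.
func lowerBound(sizes []int, t int) int {
	sorted := append([]int(nil), sizes...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	n := 1
	for i := 0; i < t && i < len(sorted); i++ {
		n *= sorted[i]
	}
	return n
}

func main() {
	// Value counts of the example model: Type, Size, Format method,
	// File system, Cluster size, Compression.
	sizes := []int{7, 7, 2, 3, 8, 2}
	fmt.Println("exhaustive:", exhaustive(sizes))       // 4704
	fmt.Println("2-way minimum:", lowerBound(sizes, 2)) // 56
	fmt.Println("3-way minimum:", lowerBound(sizes, 3)) // 392
}
```

So the 62 pairwise tests are close to the minimum of 56, and the 392 tests for 3-way match the minimum exactly, compared to 4704 for trying everything.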

I find the PICT command-line syntax a bit odd, as parameters have to be the last elements on the line, and they are identified by these strange symbols like “/o:”. But it works, so great.

Constraints

Of course, not all combinations are always valid. So PICT has extensive support to define constraints on the generator model, to limit what kind of combinations PICT generates. The PICT documentation page has lots of good examples. This part actually seems nicely documented. But let’s try a few just to see what happens. The basic example from the page:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 4096;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example2.pict >myoutputs/example2.txt

wc -l myoutputs/example2.txt 
      63 myoutputs/example2.txt

So the same number of tests. The contents:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	500	slow	NTFS	1024	on
Primary	500	quick	FAT32	512	off
Single	10	slow	FAT	1024	off
Single	5000	quick	FAT32	32768	on
Span	40000	quick	NTFS	16384	off
Mirror	40000	slow	NTFS	512	on
RAID-5	100	quick	FAT	8192	on
Logical	500	slow	FAT	2048	off
Span	10000	slow	FAT32	1024	on
Logical	1000	slow	FAT32	16384	on
Span	1000	quick	FAT	512	off
Primary	10	quick	NTFS	1024	on
Mirror	1000	quick	NTFS	4096	off
RAID-5	40000	slow	NTFS	1024	off
Single	40000	slow	NTFS	8192	off
Stripe	10	slow	FAT32	4096	on
Stripe	40000	quick	NTFS	2048	on
Primary	100	slow	NTFS	32768	off
Stripe	500	quick	FAT	16384	off
RAID-5	1000	quick	FAT32	2048	off
Mirror	10	quick	FAT	65536	off
Logical	40000	quick	NTFS	4096	on
RAID-5	5000	slow	NTFS	512	off
Stripe	5000	slow	FAT32	65536	on
Span	10	quick	FAT32	2048	off
Logical	10000	quick	NTFS	65536	off
Primary	1000	slow	FAT	65536	off
Mirror	500	quick	FAT	32768	on
Single	100	quick	FAT32	512	on
Mirror	5000	slow	FAT32	2048	on
Mirror	100	quick	NTFS	2048	on
Logical	5000	quick	FAT32	8192	off
Logical	100	slow	FAT32	1024	on
Primary	100	quick	FAT32	16384	off
Primary	10000	quick	FAT32	2048	on
RAID-5	10	slow	FAT	32768	off
Mirror	10	quick	FAT	16384	on
Single	500	slow	FAT	4096	on
Span	500	slow	FAT32	8192	on
Stripe	10000	quick	FAT32	32768	off
Logical	1000	slow	NTFS	32768	on
Single	10000	slow	NTFS	16384	off
Span	100	slow	FAT32	4096	on
Stripe	1000	slow	NTFS	8192	on
Span	5000	quick	NTFS	32768	on
Primary	5000	slow	FAT32	4096	off
RAID-5	100	slow	FAT	65536	off
RAID-5	10000	slow	FAT32	4096	on
Single	1000	quick	FAT	1024	on
Mirror	10	quick	FAT	1024	on
Logical	5000	slow	FAT32	1024	off
Single	500	slow	FAT32	65536	off
Logical	10	quick	NTFS	512	on
Single	1000	slow	FAT	2048	off
Mirror	10000	quick	NTFS	8192	on
Primary	10	quick	FAT32	8192	on
Primary	40000	slow	NTFS	32768	off
Stripe	100	slow	FAT	512	off
Mirror	10000	slow	FAT32	512	on
RAID-5	5000	quick	NTFS	16384	off
Span	40000	quick	NTFS	65536	on
RAID-5	500	quick	FAT	4096	on

In the “Size” column vs the “File system” column, the “FAT” file system type now always has a size smaller than 4096. So it works as expected. I have to admit, I found the value 4096 very confusing here, since there is no option of 4096 in the input model for “Size” but there is for “Cluster size”. I was looking at the wrong column initially, wondering why the constraint was not working. But it works, it is just a slightly confusing example.
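
Instead of eyeballing the columns, the check can be scripted. A minimal Go sketch, assuming the rows are already split into fields and that the columns are in the order of the model above (Size in the second column, File system in the fourth). The function name violations is made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// violations returns the rows that use file system fs with a Size larger
// than maxSize. A Size that is not a number also counts as a violation.
// Assumption: Size is column 1 and File system is column 3 (zero-based).
func violations(rows [][]string, fs string, maxSize int) [][]string {
	var bad [][]string
	for _, r := range rows {
		if len(r) < 4 || r[3] != fs {
			continue
		}
		size, err := strconv.Atoi(r[1])
		if err != nil || size > maxSize {
			bad = append(bad, r)
		}
	}
	return bad
}

func main() {
	rows := [][]string{
		// Two rows from the generated output above.
		{"Single", "10", "slow", "FAT", "1024", "off"},
		{"Span", "40000", "quick", "NTFS", "16384", "off"},
		// A hypothetical row that breaks the FAT constraint.
		{"Span", "5000", "quick", "FAT", "512", "on"},
	}
	fmt.Println(len(violations(rows, "FAT", 4096)), "violating row(s)")
}
```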

Similarly, 3-way combinations produce the same number of tests (as it did without any constraints) even with these constraints:

./pict mymodels/example2.pict >myoutputs/example2.txt /o:3

wc -l myoutputs/example2.txt 
     393 myoutputs/example2.txt

To experiment a bit more, I set a limit on FAT size to be 100 or less:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 100;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example3.pict >myoutputs/example3.txt

wc -l myoutputs/example3.txt 
      62 myoutputs/example3.txt

./pict mymodels/example3.pict >myoutputs/example3.txt /o:3
wc -l myoutputs/example3.txt 
     397 myoutputs/example3.txt

What happened here?

Running the 2-way generator produces 61 tests. So the number of combinations generated was finally reduced by one with the additional constraint.

Running the 3-way generator produces 396 tests. So the number of tests/combinations generated was increased by 4, compared to the 3-way generator without this constraint. Which is odd, as I would expect the number of tests to go down when there are fewer options. In fact, you could get a smaller number of tests just by taking the 392 tests from the previous generator run with fewer constraints, then taking every line with “FAT” for “File system”, and if the “Size” for those is bigger than 100, replacing it with either 100 or 10. This would be a max of 392 as it was before.

My guess is this is because building a minimal set of inputs that covers all requested combinations is a very hard problem. I believe in computer science this would be called an NP-hard problem (or so I gather from the academic literature for combinatorial testing, even if they seem to call the test set a “covering array” and other academic tricks). So no efficient algorithm is known that would be guaranteed to produce the optimal result. The generator then has to accommodate all the possible constraints in its code, and ends up using some heuristics here that result in a slightly bigger set. It is still likely a very nicely optimized set. Glad it’s not me having to write those algorithms :). I just use them and complain :).

PICT has a bunch of other ways to define conditional constraints with the use of IF, THEN, ELSE, NOT, OR, AND statements. The docs cover that nicely. So let’s not go there.

The Naming Trick

Something I found interesting is a way to build models by naming different items separately, and constraining them separately:

#
# Machine 1
#
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

#
# Machine 2
#
OS_2:   Win7, Win8, Win10
SKU_2:  Home, Pro
LANG_2: English, Spanish, Chinese, Hindi

IF [LANG_1] = [LANG_2]
THEN [OS_1] <> [OS_2] AND [SKU_1] <> [SKU_2];

Here we have two items (“machines”) with the same three properties (“OS”, “SKU”, “LANG”). However, by numbering the properties, the generator sees them as different. From this, the generator can now build combinations of different two-machine configurations, using just the basic syntax and no need to tweak the generator itself. The only difference between the two is that “Machine 2” can have one additional language (“Hindi”).

The constraint at the end also nicely ensures that if the generated configurations have the same language, the OS and SKU should be different.

Scaling these “machine” combinations to a large number of machines would require a different type of an approach. Since it is doubtful anyone would like to write a model with 100 machines, each separately labeled. No idea what modelling approach would be the best for that, but right now I don’t have a big requirement for it, so not going there. Maybe a different approach of having the generator produce a more abstract set of combinations, and map those to large number of “machines” somehow.

Repetition and Value References

There is quite a bit of repetition in the above model with both machines repeating all the same parameter values. PICT has a way to address this by referencing values defined for other parameters:

#
# Machine 1
#
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

#
# Machine 2
#
OS_2:   <OS_1>
SKU_2:  <SKU_1>
LANG_2: <LANG_1>, Hindi

So in this case, “machine 2” is repeating the values from “machine 1”, and changing them in “machine 1” also changes them in “machine 2”. Sometimes that is good, other times maybe not. Because changing one thing would change many, and you might not remember that every time. On the other hand, you would not want to be manually updating all items with the same info every time. But a nice feature to have if you need it.

Data Types

With regards to variable types, PICT supports numbers and strings. So this is given as an example model:

Size:  1, 2, 3, 4, 5
Value: a, b, c, d

IF [Size] > 3 THEN [Value] > "b";

I guess the two types are there because you can then define different types of constraints on them. For example, “Size” > 3 makes sense. The part of “Value” > “b” a bit less. So let’s try that:

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      17 myoutputs/example4.txt

The output looks like this:

Size	Value
3	a
2	c
1	c
2	b
2	a
1	d
1	a
3	b
4	d
2	d
3	d
1	b
5	c
3	c
4	c
5	d

And here, if “Size” equals 4 or 5 (so is >3), “Value” is always “c” or “d”. The PICT docs state “String comparison is lexicographical and case-insensitive by default”. So [> “b”] just refers to letters coming after “b”, which equals “c” and “d” in the choices in this model. It seems a bit odd to define such comparisons against text in a model, but I guess it can help make a model more readable if you can represent values as numbers or strings, and define constraints on them in a similar way.
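
A rough Go approximation of that comparison rule, to show which values pass [> “b”]. This only mimics the documented "lexicographical and case-insensitive" behavior by lowercasing both sides. How PICT implements it internally is not something I have checked. The function name greaterFold is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// greaterFold reports whether a sorts after b when case is ignored.
func greaterFold(a, b string) bool {
	return strings.ToLower(a) > strings.ToLower(b)
}

func main() {
	// The "Value" choices from the model above.
	var allowed []string
	for _, v := range []string{"a", "b", "c", "d"} {
		if greaterFold(v, "b") {
			allowed = append(allowed, v)
		}
	}
	fmt.Println(allowed) // [c d]
}
```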

To verify, I try a slightly modified model:

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      13 myoutputs/example4.txt

So, the number of tests is reduced from 16 to 12. Results in the following output:

Size	Value
5	c
2	c
1	d
4	d
1	b
4	c
3	d
3	c
2	d
1	c
1	a
5	d

Which confirms that lines (tests) with Size > 2 now have only letters “c” or “d” in them. This naturally also limits the number of available combinations, hence the reduced test set.

Extra Features

There are some nice features that are nicely explained in the PICT docs:

  • Submodels: Refers to defining levels of combinations per test. For example, 2-way combinations of OS with all others, and 3-way combination of File System Type with all others, at the same time.
  • Aliasing: You can give the same parameter several names and all are treated the same. Not sure why you want to do that but anyway.
  • Weighting: Since the full set of combinations will have more of some values anyway, this can be used to set preference for specific ones.

Negative Testing / Erroneous Values

A few more interesting ones are “negative testing” and “seeding”. So first negative testing. Negative testing refers to having a set of exclusive values. So those values should never appear together. This is because each of them is expected to produce an error. So you want to make sure the error they produce is visible and not “masked” (hidden) by some other erroneous value.

The example model from PICT docs, with a small modification to name the invalid values differently:

#
# Trivial model for SumSquareRoots
#

A: ~-1, 0, 1, 2
B: ~-2, 0, 1, 2

Running it, we get:

./pict mymodels/example5.pict >myoutputs/example5.txt

wc -l myoutputs/example5.txt 
      16 myoutputs/example5.txt
A	B
0	2
0	1
1	2
2	1
1	0
2	0
1	1
2	2
0	0
0	~-2
1	~-2
~-1	0
~-1	1
2	~-2
~-1	2

The negative value is prefixed with “~”, and the results show combinations of the two negative values with all possible values of the other variable. So if A is -1, it is combined with 0, 1, 2 for B. If B is -2 it is combined with 0, 1, 2 for A. But -1 and -2 are never paired, to avoid one “faulty” variable masking the other one. I find having the “~” added everywhere a bit distracting. But I guess you could parse around it, not a real issue.
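
Parsing around the “~” could look something like this minimal Go sketch. It counts the marked values per row, to confirm there is at most one, and strips the marker before the value is used. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// invalidCount returns how many values in a row carry the "~" marker.
func invalidCount(row []string) int {
	n := 0
	for _, v := range row {
		if strings.HasPrefix(v, "~") {
			n++
		}
	}
	return n
}

// stripMarker removes the "~" marker, leaving the actual value.
func stripMarker(v string) string {
	return strings.TrimPrefix(v, "~")
}

func main() {
	// Three rows from the generated output above.
	rows := [][]string{{"0", "~-2"}, {"~-1", "0"}, {"1", "1"}}
	for _, r := range rows {
		fmt.Println(stripMarker(r[0]), stripMarker(r[1]), "invalid values:", invalidCount(r))
	}
}
```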

Of course, there is nothing to stop us from setting the set of possible values to include -1 and -2, and get combinations of several “negative” values. Let’s try:

A: -1, 0, 1, 2
B: -2, 0, 1, 2
./pict mymodels/example6.pict >myoutputs/example6.txt
wc -l myoutputs/example6.txt 
      17 myoutputs/example6.txt
A	B
1	-2
2	0
1	0
-1	0
0	-2
2	1
-1	-2
0	0
1	2
-1	2
0	2
2	-2
1	1
-1	1
0	1
2	2

So there we go. This produced one test more than the previous one. And that would be the one where both the negatives are present. Line with “-1” and “-2” together.

Overall, the “~” notation seems like just a way to avoid having a set of values appear together. Convenient, and a good way to optimize more when you have large models, big input spaces, slow tests, difficult problem reports, etc.

Seeding / Forcing Tests In

Seeding. When I hear seeding in test generation, I think about the seed value for a random number generator, because often those are used to help generate tests. Well, with PICT it actually means you can predefine a set of combinations that need to be a part of the final test set.

So let’s try with the first example model from above:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

The seed files should be in the same format as the output produced by PICT. Let’s say I want to try all types with all file systems, using the smallest size. So I try with this:

Type	Size	Format method	File system	Cluster size	Compression
Primary	10		FAT32		on
Logical	10		FAT32		on
Single	10		FAT32		on
Span	10		FAT32		on
Stripe	10		FAT32		on
Mirror	10		FAT32		on
RAID-5	10		FAT32		on
Primary	10		FAT		on
Logical	10		FAT		on
Single	10		FAT		on
Span	10		FAT		on
Stripe	10		FAT		on
Mirror	10		FAT		on
RAID-5	10		FAT		on
Primary	10		NTFS		on
Logical	10		NTFS		on
Single	10		NTFS		on
Span	10		NTFS		on
Stripe	10		NTFS		on
Mirror	10		NTFS		on
RAID-5	10		NTFS		on

To run it:

./pict mymodels/example7.pict /e:mymodels/example7.seed >myoutputs/example7.txt
 wc -l myoutputs/example7.txt 
      73 myoutputs/example7.txt

So in the beginning of this post, the initial model generated 62 combinations. With this seed file, some forced repetition is there and the size goes up to 72. Still not that much bigger, but I guess shows something about how nice it is to have a combinatorial test tool to optimize this type of test set for you.

The actual output:

Type	Size	Format method	File system	Cluster size	Compression
Primary	10	quick	FAT32	2048	on
Logical	10	slow	FAT32	16384	on
Single	10	slow	FAT32	65536	on
Span	10	quick	FAT32	1024	on
Stripe	10	quick	FAT32	8192	on
Mirror	10	quick	FAT32	512	on
RAID-5	10	slow	FAT32	32768	on
Primary	10	slow	FAT	4096	on
Logical	10	quick	FAT	1024	on
Single	10	quick	FAT	32768	on
Span	10	slow	FAT	512	on
Stripe	10	slow	FAT	16384	on
Mirror	10	slow	FAT	8192	on
RAID-5	10	slow	FAT	2048	on
Primary	10	quick	NTFS	65536	on
Logical	10	quick	NTFS	4096	on
Single	10	slow	NTFS	16384	on
Span	10	quick	NTFS	32768	on
Stripe	10	slow	NTFS	1024	on
Mirror	10	slow	NTFS	2048	on
RAID-5	10	quick	NTFS	512	on
Span	40000	slow	FAT	65536	off
Single	5000	quick	NTFS	8192	off
Mirror	1000	quick	FAT32	4096	off
Stripe	100	slow	FAT	32768	off
Primary	500	slow	FAT	512	off
Primary	40000	quick	NTFS	8192	on
Logical	10000	quick	NTFS	32768	off
RAID-5	40000	slow	FAT32	1024	off
Span	100	quick	NTFS	8192	on
Mirror	10000	slow	FAT32	16384	off
Logical	5000	slow	FAT	512	on
Primary	1000	slow	FAT	1024	on
Mirror	5000	quick	FAT32	1024	on
Logical	1000	quick	NTFS	32768	on
Single	40000	slow	FAT32	512	on
Stripe	40000	quick	FAT	16384	on
Logical	100	quick	FAT32	2048	off
Single	100	quick	FAT32	1024	off
Primary	5000	quick	NTFS	32768	off
Single	40000	slow	NTFS	2048	on
Logical	500	quick	FAT32	8192	on
Single	500	slow	NTFS	4096	on
Span	500	quick	FAT32	16384	on
Primary	100	quick	FAT32	512	off
Stripe	1000	slow	FAT32	2048	on
RAID-5	10000	quick	FAT	8192	on
Stripe	10000	slow	NTFS	512	off
Stripe	5000	quick	FAT	65536	on
Mirror	40000	slow	NTFS	32768	on
Primary	10000	quick	NTFS	1024	on
RAID-5	100	quick	FAT	16384	off
Mirror	500	quick	NTFS	1024	on
Single	1000	slow	FAT32	512	on
Span	100	slow	FAT32	4096	off
Span	5000	slow	NTFS	2048	on
RAID-5	40000	slow	FAT	4096	off
Span	1000	slow	FAT32	16384	on
Mirror	100	quick	FAT	65536	on
Single	10000	slow	FAT	4096	off
RAID-5	1000	slow	NTFS	65536	off
Span	10000	slow	NTFS	65536	on
Span	1000	slow	FAT32	8192	off
RAID-5	500	quick	NTFS	32768	off
Stripe	500	slow	FAT	2048	off
RAID-5	5000	slow	NTFS	16384	on
Stripe	5000	slow	FAT32	4096	off
Logical	10	slow	FAT	65536	off
RAID-5	10000	quick	NTFS	2048	on
Primary	1000	slow	FAT	16384	off
Logical	40000	quick	FAT32	8192	on
Primary	500	quick	FAT	65536	on

This output starts with the seeds given, and PICT has done its best to fill in the blanks with such values as to still minimize the test numbers while meeting the combinatorial coverage requirements.
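
To check that every seed actually made it into the output, the seed rows can be matched against the generated rows, treating the blank seed fields as wildcards. A minimal Go sketch, assuming both files are already split into rows of fields. The function names are made up for this example.

```go
package main

import "fmt"

// matchesSeed reports whether row agrees with seed on every field that the
// seed fills in. Empty seed fields match anything.
func matchesSeed(seed, row []string) bool {
	if len(seed) != len(row) {
		return false
	}
	for i, v := range seed {
		if v != "" && v != row[i] {
			return false
		}
	}
	return true
}

// allSeedsPresent reports whether every seed is matched by at least one row.
func allSeedsPresent(seeds, rows [][]string) bool {
	for _, s := range seeds {
		found := false
		for _, r := range rows {
			if matchesSeed(s, r) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// Two seed rows and three output rows from the listings above.
	seeds := [][]string{
		{"Primary", "10", "", "FAT32", "", "on"},
		{"Logical", "10", "", "FAT32", "", "on"},
	}
	rows := [][]string{
		{"Primary", "10", "quick", "FAT32", "2048", "on"},
		{"Logical", "10", "slow", "FAT32", "16384", "on"},
		{"Span", "40000", "slow", "FAT", "65536", "off"},
	}
	fmt.Println(allSeedsPresent(seeds, rows))
}
```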

Personal Thoughts

Searching for PICT and pairwise testing or combinatorial testing brings up a bunch of results and reasonably good articles on the topic. Maybe even more of such practice oriented ones than model-based testing. Maybe because it is simpler to apply, and thus easier to pick up and go in practice?

For example, this has a few good points. One is to use an iterative process to build the input models. So as with everything else, not to expect to get it all perfectly right from the first try. Another is to consider invariants for test oracles. So things that should always hold, such as two nodes in a distributed system never being in a conflicting state when an operation involving both is done. Of course, this would also apply to any other type of testing. The article seems to consider this also from a hierarchical viewpoint, checking the strictest or most critical ones first.

Another good point in that article is to use readable names for the values. I guess sometimes people could use the PICT output as such, to define test configurations and the like for manual testing. I would maybe consider using them more as input for automated test execution, to define parameter values to cover. In such cases, it would be enough to give each value a short name such as “A”, “A1”, or “1”. But looking at the model and the output, it would be difficult to tell which value maps to which symbol. Readable names are just as parseable for the computer but much more readable for the human expert.

Combining with Sequential Models

So this is all nice and shiny, but the examples are actually quite simple test scenarios. There are no complex dependencies between them, no complex state that defines what parameters and values are available, and so on. It mostly seems to vary around what combinations of software or system configurations should be used in testing.

I have worked plenty with model-based testing myself (see OSMO), and actually have talked to some people who have done combinations of combinatorial input generation and model-based testing. I can see how this could be interesting, to identify a set of injection points for parameters and values in a MBT model, and use a combinatorial test data generator to build data sets for those injection points. Likely doing some more of this in practice would reveal good insights on what works and what could be done to make the match even better. Maybe someday.

In any case, I am sure combining combinatorial test datasets would also work great with other types of sequences as well. I think this could make a very interesting and practical research topic. Again, maybe someday..

Bye Now

In general, this area seems to have great tools for the basic test generation, but missing some in-depth experiences and guides for how to apply to more complex software. Together with sequential test cases and test generators.

A simpler, yet interesting topic to do would be to integrate the PICT type generator directly with the test environment. Run the combinatorial generator from this during the test runs, and have it randomize the combinations in a bit different ways during different runs. While still maintaining the overall combinatorial coverage.
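
A minimal Go sketch of what such an integration could start from: building the PICT command line with a different seed for each run and executing the binary from the test code. The /o: option is the one used earlier in this post. The /r:N option is, as far as I understand the PICT documentation, the one that randomizes generation with seed N, so treat it as an assumption to verify. The function names and the seed value are made up for this example.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// pictArgs builds the argument list for one PICT run.
// Assumption: "/r:N" randomizes the generation with seed N.
func pictArgs(model string, order, seed int) []string {
	return []string{model, "/o:" + strconv.Itoa(order), "/r:" + strconv.Itoa(seed)}
}

// runPICT executes the pict binary and returns its tab-separated output.
func runPICT(binary, model string, order, seed int) (string, error) {
	out, err := exec.Command(binary, pictArgs(model, order, seed)...).Output()
	return string(out), err
}

func main() {
	// Only prints the arguments here. Call runPICT("./pict", ...) in an
	// environment where the compiled pict binary is available.
	fmt.Println(pictArgs("mymodels/example1.pict", 2, 42))
}
```

Logging the seed with each test run would make it possible to regenerate the same combinations when a failure needs to be reproduced.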

Finnish Topic Modelling

Previously I wrote about a few experiments I ran with topic-modelling. I briefly glossed over having some results for a set of Finnish text as an example of a smaller dataset. This is a bit deeper look into that.

I use two datasets, the Finnish Wikipedia dump and the city of Oulu board minutes. Same ones I used before. Previously I covered topic modelling more generally, so I won’t go into too much detail here. To summarize, topic modelling algorithms (of which LDA or Latent Dirichlet Allocation is used here) find sets of words with different distributions over sets of documents. These are then called the “topics” discussed in those documents.

This post looks at how to use topic models for a different language (besides English) and what one could maybe do with the results.

Lemmatize (turn words into baseforms before use) or not? I choose to lemmatize for topic modelling. This seems to be the general consensus when looking up info on topic modelling, and in my experience it just gives better results as the same word appears only once. I covered POS tagging previously, and I believe it would be useful to apply here as well, but I don’t. Mostly because it is not needed to test these concepts, and I find the results are good enough without adding POS tagging to the mix (which has its issues as I discussed before). Simplicity is nice.

I used the Python Gensim package for building the topic models. As input, I used the Finnish Wikipedia text and the city of Oulu board minutes texts. I used my existing text extractor and lemmatizer for these (to get the raw text out of the HTML pages and PDF docs, and to baseform them, as discussed in my previous posts). I dumped the lemmatized raw text into files using slight modifications of my previous Java code, and then read the docs from those files as input to Gensim in a Python script.

I started with the Finnish Wikipedia dump, using Gensim to provide 50 topics, with 1 pass over the corpus. The first topics that I got (topic0 to topic10):

  • topic0=focus[19565] var[8893] liivi[7391] luku[6072] html[5451] murre[3868] verkkoversio[3657] alku[3313] joten[2734] http[2685]
  • topic1=viro[63337] substantiivi[20786] gen[19396] part[14778] taivutus[13692] tyyppi[6592] täysi[5804] taivutustyyppi[5356] liite[4270] rakenne[3227]
  • topic2=isku[27195] pieni[10315] tms[7445] aine[5807] väri[5716] raha[4629] suuri[4383] helppo[4324] saattaa[4044] heprea[3129]
  • topic3=suomi[89106] suku[84950] substantiivi[70654] pudottaa[59703] kasvi[46085] käännös[37875] luokka[35566] sana[33868] kieli[32850] taivutusmuoto[32067]
  • topic4=ohjaus[129425] white[9304] off[8670] black[6825] red[5066] sotilas[4893] fraasi[4835] yellow[3943] perinteinen[3744] flycatcher[3735]
  • topic5=lati[48738] eesti[25987] www[17839] http[17073] keele[15733] eki[12421] lähde[11306] dict[11104] sõnaraamat[10648] tallinn[8504]
  • topic6=suomi[534914] käännös[292690] substantiivi[273243] aihe[256126] muualla[254788] sana[194213] liittyvä[193298] etymologi[164158] viite[104417] kieli[102489]
  • topic7=italia[66367] substantiivi[52038] japani[27988] inarinsaame[9464] kohta[7433] yhteys[7071] vaatekappale[5553] rinnakkaismuoto[5469] taas[4986] voimakas[3912]
  • topic8=sana[548232] liittyvä[493888] substantiivi[298421] ruotsi[164717] synonyymi[118244] alas[75430] etymologi[64170] liikuttaa[38058] johdos[25603] yhdyssana[24943]
  • topic9=juuri[3794] des[3209] jumala[1799] tadžikki[1686] tuntea[1639] tekijä[1526] tulo[1523] mitta[1337] jatkuva[1329] levy[1197]
  • topic10=törmätä[22942] user[2374] sur[1664] self[1643] hallita[1447] voittaa[1243] piste[1178] data[1118] harjoittaa[939] jstak[886]

The format of the topic list I used here is “topicX=word1[count] word2[count]”, where X is the number of the topic, word1 is the first word in the topic, word2 the second, and so on. The [count] is how many times the word was associated with the topic in different documents. Consider it the strength, weight, or whatever of the word in the topic.
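To make the format concrete, here is a minimal sketch of parsing one such line back into a structure that is easier to compare across runs. The function name `parse_topic_line` is made up for illustration; it is not part of Gensim or of the code used for this post.

```python
import re

# One "word[count]" entry; the word may contain any non-space characters,
# which also covers the garbage tokens such as "url=http[7540]".
ENTRY = re.compile(r"(\S+)\[(\d+)\]")

def parse_topic_line(line):
    """Parse 'topicX=word1[count] word2[count]' into (name, [(word, count), ...])."""
    # Split on the first "=" only, since some words contain "=" themselves.
    name, _, rest = line.partition("=")
    words = [(word, int(count)) for word, count in ENTRY.findall(rest)]
    # Drop surrounding whitespace and a possible leading list bullet.
    return name.strip(" \t•"), words

name, words = parse_topic_line("topic1=kunta[52] palvelu[46] asiakaspalvelu[41]")
print(name, words)
```

With the topics in this form it is easy to, for example, sum the counts per topic and compare strong and weak topics.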

So just a few notes on the above topic list:

  • topic0 = mostly website related terms, interleaved with a few odd ones. Examples of odd ones; “liivi” = vest, “luku” = number/chapter (POS tagging would help differentiate), “murre” = dialect.
  • topic1 = mostly Finnish language related terms. “viro” = estonia = slightly odd to have here. It is the closest related language to Finnish but still..
  • topic3 = another Finnish language related topic. Odd one here is “kasvi” = plant. Generally this seems to be more related to words and their forms, whereas topic1 is maybe more about structure and relations.
  • topic5 = estonia related

Overall, I think this would improve given more passes over the corpus to train the model. This would give the algorithm more time and data to refine the model. I only ran it with one pass here since the training for more topics and with more passes started taking days and I did not have the resources to go there.

My guess is also that with more data and broader concepts (Wikipedia covering pretty much every topic there is..) you would also need more topics than the 50 I used here. However, I had to limit the size due to time and resource constraints. Gensim probably also has more advanced tuning options (e.g., parallel runs) that would benefit the speed. So I tried a few more sizes and passes with the smaller Oulu city board dataset, as it was faster to run.

Some topics for the city of Oulu board minutes, run for 20 topics and 20 passes over the training data:

  • topic0=oulu[2096] kaupunki[1383] kaupunginhallitus[1261] 2013[854] päivämäärä[575] vuosi[446] päätösesitys[423] jäsen[405] hallitus[391] tieto[387]
  • topic1=kunta[52] palvelu[46] asiakaspalvelu[41] yhteinen[38] viranomainen[25] laki[24] valtio[22] myös[20] asiakaspalvelupiste[19] kaupallinen[17]
  • topic2=oulu[126] palvelu[113] kaupunki[113] koulu[89] tukea[87] edistää[71] vuosi[71] osa[64] nuori[63] toiminta[61]
  • topic3=tontti[490] kaupunki[460] oulu[339] asemakaava[249] rakennus[241] kaupunginhallitus[234] päivämäärä[212] yhdyskuntalautakunta[206] muutos[191] alue[179]
  • topic5=kaupunginhallitus[1210] päätös[1074] jäsen[861] oulu[811] kaupunki[681] pöytäkirja[653] klo[429] päivämäärä[409] oikaisuvaatimus[404] matti[316]
  • topic6=000[71] 2012[28] oulu[22] muu[20] tilikausi[16] vuosi[16] yhde[15] kunta[14] 2011[13] 00000[13]
  • topic8=alue[228] asemakaava[96] rakentaa[73] tulla[58] oleva[56] rakennus[55] merkittävä[53] kortteli[53] oulunsalo[50] nykyinen[48]
  • topic10=asiakirjat.ouka.fi[15107] ktwebbin[15105] 2016[7773] eet[7570] pk_asil_tweb.htm?[7551] ktwebscr[7550] dbisa.dll[7550] url=http[7540] doctype[7540] =3&docid[7540]
  • topic11=yhtiö[31] osake[18] osakas[11] energia[10] hallitus[10] 18.11.2013[8] liite[7] lomautus[6] sähkö[6] osakassopimus[5]
  • topic12=13.05.2013[13] perlacon[8] kuntatalousfoorumi[8] =1418[6] meeting_date=21.3.2013[6] =2070[6] meeting_date=28.5.2013[6] =11358[5] meeting_date=3.10.2016[5] -31.8.2015[4]
  • topic13=001[19] oulu[11] 002[5] kaupunki[4] sivu[3] ���[3] palvelu[3] the[3] asua[2] and[2]

Some notes on the topics above:

  • The word “oulu” repeats in most of the topics. This is quite natural as all the documents are from the board of the city of Oulu. Depending on the use case for the topics, it might be useful to add this word to the list of words to be removed in the pre-cleaning phase for the documents before running the topic modelling algorithm. Or it might be useful information, along with the weight of the word inside the topic. Depends.
  • topic0 = generally about the board structure. For example, “kaupunki”=city, “kaupunginhallitus”=city board, “päivämäärä”=date, “päätösesitys”=proposal for decision.
  • topic1 = Mostly city service related words. For example, “kunta” = municipality, “palvelu” = service, “asiakaspalvelu” = customer service. Also “myös” = also, so something to add to the cleaners again.
  • topic2 = School related. For example, “koulu” = school, “tukea” = support, … Sharing again common words such as “kaupunki” = city, which may also be considered for removal or not depending on the case.
  • topic3 = City area planning related. For example, “tontti” = plot of land, “asemakaava” = zoning plan, …
  • In general quite good and focused topics here, so I think quite a good result overall. Some exceptions to consider:
  • topic10 = mostly garbage related to HTML formatting and website link structures. Still a real topic of course, so nicely identified.. I think it is something to consider adding to the cleaning list for pre-processing.
  • topic12 = Seems related to some city finance related consultation (perlacon seems to be such a company) and an associated event (the forum). With a bunch of meeting dates.
  • topic13 = unclear garbage
  • So in general, I guess reasonably good results but in real applications, several iterations of fine-tuning the words, the topic modelling algorithm parameters, etc. based on the results would be very useful.

So that was the city minutes topics for a smaller set of topics and more passes. What does it look like for 100 topics, and how does the number of passes over the corpus affect the larger size? More passes should give the algorithm more time to refine the topics, but smaller datasets might not have so many good topics..

For 100 topics, 1 pass, first topics (topic0 to topic10):

  • topic0=oulu[55] kaupunki[22] 000[20] sivu[14] palvelu[14] alue[13] vuosi[13] muu[11] uusi[11] tavoite[9]
  • topic1=kaupunki[18] oulu[17] jäsen[15] 000[10] kaupunginhallitus[7] kaupunginjohtaja[6] klo[6] muu[5] vuosi[5] takaus[4]
  • topic2=hallitus[158] oulu[151] 25.03.2013[135] kaupunginhallitus[112] jäsen[105] varsinainen[82] tilintarkastaja[79] kaupunki[75] valita[70] yhtiökokousedustaja[50]
  • topic3=kuntalisä[19] oulu[16] palkkatuki[15] kaupunki[14] tervahovi[13] henkilö[12] tukea[12] yritys[10] kaupunginhallitus[10] työtön[9]
  • topic4=koulu[37] oulu[7] sahantie[5] 000[5] äänestyspaikka[4] maikkulan[4] kaupunki[4] kirjasto[4] monitoimitalo[3] kello[3]
  • topic5=oulu[338] kaupunki[204] euro[154] kaupunginhallitus[143] 2013[105] vuosi[96] milj[82] palvelu[77] kunta[71] uusi[64]
  • topic6=000[8] oulu[7] kaupunki[4] vuosi[3] 2012[3] muu[3] kunta[2] muutos[2] 2013[2] sivu[1]
  • topic7=000[5] 26.03.2013[4] oulu[3] 2012[3] kunta[2] vuosi[2] kirjastojärjestelmä[2] muu[1] kaupunki[1] muutos[1]
  • topic8=oulu[471] kaupunki[268] kaupunginhallitus[227] 2013[137] päivämäärä[97] päätös[93] vuosi[71] tieto[67] 000[66] päätösesitys[64]
  • topic9=oulu[5] lomautus[3] 000[3] kaupunki[2] säästötoimenpidevapaa[1] vuosi[1] kunta[1] kaupunginhallitus[1] sivu[1] henkilöstö[1]
  • topic10=oulu[123] kaupunki[82] alue[63] sivu[43] rakennus[42] asemakaava[39] vuosi[38] tontti[38] 2013[35] osa[35]

Without going too much into translating every word, I would say these results are too spread out, so from this, for this dataset, it seems a smaller set of topics would do better. This also seems to be visible in the word counts/strengths in the [square brackets]. The topics with small weights also seem pretty poor topics, while the ones with bigger weights look better (just my opinion of course :)). Maybe something to consider when trying to explore the number of topics etc.

And the same run, this time with 20 passes over the corpus (100 topics and 10 first ones shown):

  • topic0=oulu[138] kaupunki[128] palvelu[123] toiminta[92] kehittää[73] myös[72] tavoite[62] osa[55] vuosi[50] toteuttaa[44]
  • topic1=-seurantatieto[0] 2008-2010[0] =30065[0] =170189[0] =257121[0] =38760[0] =13408[0] oulu[0] 000[0] kaupunki[0]
  • topic2=harmaa[2] tilaajavastuulaki[1] tilaajavastuu.fi[1] torjunta[1] -palvelu[1] talous[0] harmaantalous[0] -30.4.2014[0] hankintayksikkö[0] kilpailu[0]
  • topic3=juhlavuosi[14] 15.45[11] perussopimus[9] reilu[7] kauppa[6] juhlatoimikunta[6] työpaja[6] 24.2.2014[6] 18.48[5] tapahtumatuki[4]
  • topic4=kokous[762] kaupunginhallitus[591] päätös[537] pöytäkirja[536] työjärjestys[362] hyväksyä[362] tarkastaja[360] esityslista[239] valin[188] päätösvaltaisuus[185]
  • topic5=koulu[130] sivistys-[35] suuralue[28] perusopetus[25] tilakeskus[24] kulttuurilautakunta[22] järjestää[22] korvensuora[18] päiväkota[17] päiväkoti[17]
  • topic6=piste[24] hanke[16] toimittaja[12] hankesuunnitelma[12] tila[12] toteuttaa[11] hiukkavaara[10] hyvinvointikeskus[10] tilakeskus[10] monitoimitalo[9]
  • topic7=tiedekeskus[3] museo-[2] prosenttipohjainen[2] taidehankinta[1] uudisrakennushanke[1] hankintamääräraha[1] prosenttitaide[1] hankintaprosessi[0] toteutusajankohta[0] ulosvuokrattava[0]
  • topic8=euro[323] milj[191] vuosi[150] oulu[107] talousarvio[100] tilinpäätös[94] kaupunginhallitus[83] kaupunki[79] 2012[73] 2013[68]
  • topic9=päätös[653] oikaisuvaatimus[335] oulu[295] kaupunki[218] päivä[215] voi[211] kaupunginhallitus[208] posti[187] pöytäkirja[161] viimeinen[154]

Even the smaller topics here seem much better now with the increase in passes over the corpus. So perhaps the real difference just comes from having enough passes over the data, giving the algorithms more time and data to refine the models. At least I would not try without multiple passes based on comparing the results here of 1 vs 20 passes.

For example, topic2 here has small numbers but still all items seem related to grey market economy. Similarly, topic7 has small numbers but the words are mostly related to arts and culture.

So to summarize, it seems lemmatizing your words, exploring your parameters, and making sure you have a decent amount of data and a decent number of passes for the algorithm are all good points. And properly cleaning your data, and iterating over the process many times to get these right (well, as “right” as you can).

To answer my “research questions” from the beginning: topic modelling for different languages and use cases for topic modelling.

First, lemmatize all your data (I prefer it over stemming but it can be more resource intensive). Clean all your data from the typical stopwords for your language, but also for your dataset and domain. Run the models and analysis several times, and keep refining your list of removed words to clean also based on your use case, your dataset and your domain. Also likely need to consider domain specific lemmatization rules as I already discussed with POS tagging.
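The cleaning loop described above could be sketched roughly as follows. This is only an illustration with made-up helper names (`remove_stopwords`, `stopword_candidates`) and a made-up 80% threshold; the idea is to remove the known stopwords, then list the words that still appear in nearly every document (like “oulu” in the board minutes) as candidates for manual review.

```python
from collections import Counter

def remove_stopwords(docs, stopwords):
    """Drop stopwords from each tokenized (already lemmatized) document."""
    return [[tok for tok in doc if tok not in stopwords] for doc in docs]

def stopword_candidates(docs, min_share=0.8):
    """Tokens present in at least min_share of the documents, for manual review."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    limit = min_share * len(docs)
    return sorted(tok for tok, n in doc_freq.items() if n >= limit)

docs = [
    ["oulu", "kaupunki", "koulu", "myös"],
    ["oulu", "tontti", "asemakaava"],
    ["oulu", "kaupunki", "palvelu", "myös"],
]
general_stopwords = {"myös"}  # typical language-level stopword list
cleaned = remove_stopwords(docs, general_stopwords)
print(stopword_candidates(cleaned))
```

Whether a candidate such as “oulu” actually gets removed still depends on the use case, so the output is meant as input for a human decision, not an automatic filter.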

Secondly, what use cases did I find looking at topic modelling use cases online? Actually, it seems really hard to find concrete actual reports of uses for topic models. Quora has usually been promising but not so much this time. So I looked at reports in the published research papers instead, trying to see if any companies were involved as well.

Some potential use cases from research papers:

Bug localization, as in finding locations of bugs in source code is investigated here. Source code (comments, source code identifiers, etc) is modelled as topics, which are mapped to a query created from a bug report.

Matching duplicates of documents in here. Topic distributions over bug reports are used to suggest duplicate bug reports. Not exact duplicates but describing the same bug. If the topic distributions are close, flag them as potentially discussing the same “topic” (bug).
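A minimal sketch of what “close topic distributions” could mean in practice. The Hellinger distance used here is one common choice for comparing probability distributions, not necessarily what the paper used, and the report ids, distributions and threshold are all made up.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions: 0 = identical, 1 = no overlap."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def possible_duplicates(reports, threshold=0.2):
    """Return pairs of report ids whose topic distributions are closer than threshold."""
    ids = sorted(reports)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if hellinger(reports[first], reports[second]) < threshold:
                pairs.append((first, second))
    return pairs

# Hypothetical topic distributions over three topics for three bug reports.
reports = {
    "bug-1": [0.70, 0.20, 0.10],
    "bug-2": [0.65, 0.25, 0.10],
    "bug-3": [0.05, 0.05, 0.90],
}
print(possible_duplicates(reports))
```

The same distance could be used the other way around for the anomaly detection type of use case: flag an item when its distance from the usual distribution goes above a threshold.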

Ericsson has used topic models to map incoming bug reports to specific components. To make resolving bugs easier and faster by automatically assigning them to (correct) teams for resolution. Large historical datasets of bug reports and their assignments to components are used to learn the topic models. Topic distributions of incoming bug reports are used to give probability rankings for the bug report describing a specific component, in comparison to topic distributions of previous bug reports for that component. Topic distributions are also used as explanatory data to present to the expert looking at the classification results. Later, different approaches are reported at Ericsson as well. So just to remind that topic models are not the answer to everything, even if useful components and worth a try in places.

In cyber security, this uses topic models to describe users’ activity as distributions over the different topics. Learn topic models from user activity logs, describe each user’s typical activity as a topic distribution. If a log entry (e.g., session?) diverges too much from this topic distribution for the user, flag it as an anomaly to investigate. I would expect simpler things could work for this as well, but as input for anomaly detection, an interesting thought.

Tweet analysis is popular in NLP. This is an example of high-level tweet topic classification: Politics, sports, science, … Useful input for recommendations etc., I am sure. A more targeted domain specific example is of using topics in Typhoon related tweet analysis and classification: Worried, damage, food, rescue operations, flood, … useful input for situation awareness, I would expect. As far as I understood, topic models were generated, labeled, and then users (or tweets) assigned to the (high-level) topics by topic distributions. Tweets are very small documents, so that is something to consider, as discussed in those papers.

Use of topic models in biomedicine for text analysis. To find patterns (topic distributions) in papers discussing specific genes, for example. Could work more broadly as one tool to explore research in an area, to find clusters of concepts in broad sets of research papers on a specific “topic” (here, research on a specific gene). Of course, there likely exist a number of other techniques to investigate for that as well, but topic models could have potential.

Generally labelling and categorizing large numbers of historical/archival documents to assist users in search. Build topic models, have experts review them, and give the topics labels. Then label your documents based on their topic distributions.

Bit further outside the box, split songs into segments based on their acoustic properties, and use topic modelling to identify different categories/types of music in large song databases. Then explore the popularity of such categories/types over time based on topic distributions over time. So the segments are your words, and the songs are your documents.

Finding duplicate images in large data sets. Use image features as words, and images as documents. Build topic models from all the images, and find similar types of images by their topic distributions. Features could be edges, or even abstract ones such as those learned by something like a convolutional neural net. Assists in image search I guess..

Most of these uses seem to be various types of search assistance, with a few odd ones thinking outside the box. With a decent understanding, and some exploration, I think topic models can be useful in many places. The academics would say “dude XYZ would work just as well”. Sure, but if it does the job for me, and is simple and easy to apply..

Word2Vec with some Finnish NLP

To get a better view of the popular Word2Vec algorithm and its applications in different contexts, I ran experiments on Finnish language and Word2vec. Let’s see.

I used two datasets. The first one is the traditional Wikipedia dump. I got the Wikipedia dump for the Finnish version from October 20th, because I ran the first experiments around that time. The second dataset was the Board minutes for the City of Oulu for the past few years.

After running my cleaning code on the Wikipedia dump it reported 600783 sentences and 6778245 words for the cleaned dump. Cleaning here refers to removing all the extra formatting, HTML tagging, etc. Sentences were tokenized using Voikko. For the Board minutes the similar metrics were 4582 documents, 358711 sentences, and 986523 words. Most interesting, yes?

For running Word2vec I used the Deeplearning4J implementation. You can find the example code I used on Github.

Again I have this question of whether to use lemmatization or not. Do I run the algorithm on baseformed words or just unprocessed words in different forms?

Some prefer to run it after lemmatization, while generally the articles on word2vec say nothing on the topic but rather seem to run it on raw text. This description of a similar algorithm actually shows an example of mapping “frog” to “frogs”, further indicating use of raw text. I guess if you have really lots of data and a language that does not have a huge number of forms for different words, that makes more sense. Or if you find relations between forms of words more interesting.

For me, Finnish has so many forms of words (morphologies or whatever they should be called?) and generally I don’t expect to run with hundreds of billions of words of data, so I tried both ways (with and without lemmatization) to see. With my limited data and the properties of the Finnish language I would just go with lemmatization really, but it is always interesting to try and see.

Some results for my experiments:

Wikipedia without lemmatization, looking for the closest words to “auto”, which is Finnish for “car”. Top 10 results along with similarity score:

  • auto vs kuorma = 0.6297630071640015
  • auto vs akselin = 0.5929439067840576
  • auto vs auton = 0.5811734199523926
  • auto vs bussi = 0.5807990431785583
  • auto vs rekka = 0.578578531742096
  • auto vs linja = 0.5748337507247925
  • auto vs työ = 0.562477171421051
  • auto vs autonkuljettaja = 0.5613142848014832
  • auto vs rekkajono = 0.5595266222953796
  • auto vs moottorin = 0.5471497774124146

Words from above translated:

  • kuorma = load
  • akselin = axle’s
  • auton = car’s
  • bussi = bus
  • rekka = truck
  • linja = line
  • työ = work
  • autonkuljettaja = car driver
  • rekkajono = truck queue
  • moottorin = engine’s

A similarity score of 1 would mean a perfect match, and a score around 0 means no relation (the scores are presumably cosine similarities, so they could also be negative). Word2vec builds a model representing the position of words in “vector-space”. This is inferred from “word-embeddings”. This sounds fancy, and as usual, it is difficult to find a simple explanation of what is done. I view it as taking typically 100-300 numbers to represent each word’s relation in the “word-space”. These get adjusted by the algorithm as it goes through all the sentences and records each word’s relation to other words in those sentences. Probably all wrong in that explanation but until someone gives a better one..
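As a rough illustration of where such a score comes from, here is a minimal sketch of cosine similarity between two word vectors. It assumes the reported score is cosine similarity, which is the usual choice in word2vec tools; the three-number vectors are made up, real ones would have 100-300 numbers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, only to show the mechanics.
auto = [0.9, 0.1, 0.3]
rekka = [0.8, 0.2, 0.4]
jumala = [-0.2, 0.9, -0.1]

print(cosine_similarity(auto, rekka))   # close to 1: similar direction
print(cosine_similarity(auto, jumala))  # much lower: different direction
```

The “closest words” lists are then just all the other words in the vocabulary sorted by this score.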

To preprocess the documents for word2vec, I split the documents to sentences to give the words a more meaningful context (a sentence vs just any surrounding words). There are other similar techniques, such as Glove, that may work better with more global “context” than a sentence. But anyway this time I was playing with Word2vec, which I think is also interesting for many things. It also has lots of implementations and popularity.

Looking at the results above, there is the word “auton”, translating to “car’s”. The Finnish language has a large number of forms that different words can take. So, sometimes, it may be good to lemmatize to see what the meaning of the word better maps to vs matching forms of words. So I lemmatize with Voikko, the Finnish language lemmatizer, again. Re-run of above, top-10:

  • auto vs ajoneuvo = 0.7123048901557922
  • auto vs juna = 0.6993820667266846
  • auto vs rekka = 0.6949941515922546
  • auto vs ajaa = 0.6905277967453003
  • auto vs matkustaja = 0.6886627674102783
  • auto vs tarkoitettu = 0.66249680519104
  • auto vs rakennettu = 0.6570218801498413
  • auto vs kuljetus = 0.6499230861663818
  • auto vs rakennus = 0.6315782070159912
  • auto vs alus = 0.6273047924041748

Meanings of the words in English:

  • ajoneuvo = vehicle
  • juna = train
  • rekka = truck
  • ajaa = drive
  • matkustaja = passenger
  • tarkoitettu = meant
  • rakennettu = built
  • kuljetus = transport
  • rakennus = building
  • alus = ship

So generally these mappings make some sense. Not sure about those building words. Some deeper exploration would probably help..

Some people also came up with the idea of POS tagging before running word2vec. Called it Sense2Vec and whatever. Just so you could better differentiate how different meanings of a word map differently. So I tried POS tagging with the tagger I implemented before. Results:

  • auto_N vs juna_N = 0.7195479869842529
  • auto_N vs ajoneuvo_N = 0.6762610077857971
  • auto_N vs alus_N = 0.6689988970756531
  • auto_N vs kone_N = 0.6615594029426575
  • auto_N vs kuorma_N = 0.6477057933807373
  • auto_N vs tie_N = 0.6470917463302612
  • auto_N vs seinä_N = 0.6453390717506409
  • auto_N vs kuljettaja_N = 0.6449363827705383
  • auto_N vs matka_N = 0.6337422728538513
  • auto_N vs pää_N = 0.6313328146934509

Meanings of the words in English:

  • juna = train
  • ajoneuvo = vehicle
  • alus = ship
  • kone = machine
  • kuorma = load
  • tie = road
  • seinä = wall
  • kuljettaja = driver
  • matka = trip
  • pää = head

soo… The weirdest ones here are the wall and head parts. Perhaps again a deeper exploration would tell more. The rest seem to make some sense just by looking.

And to do the same for the City of Oulu Board minutes. Now looking for a specific word for the domain. The word being “serviisi”, which is the city office responsible for food production for different facilities and schools. This time lemmatization was applied for all results. Results:

  • serviisi vs tietotekniikka = 0.7979459762573242
  • serviisi vs työterveys = 0.7201094031333923
  • serviisi vs pelastusliikelaitos = 0.6803742051124573
  • serviisi vs kehittämisvisio = 0.678106427192688
  • serviisi vs liikel = 0.6737961769104004
  • serviisi vs jätehuolto = 0.6682301163673401
  • serviisi vs serviisin = 0.6641604900360107
  • serviisi vs konttori = 0.6479293704032898
  • serviisi vs efekto = 0.6455909013748169
  • serviisi vs atksla = 0.6436249017715454

Because “serviisi” is a very domain specific word/name here, the general purpose Finnish lemmatization does not work for it. This is why “serviisin” is there again. To fix this, I added this and some other basic forms of the word to the list of custom spellings recognized by my lemmatizer tool. That is, using Voikko but if not found trying a lookup in a custom list. And if still not found, writing a list of all unrecognized words sorted by highest frequency first (to allow augmenting the custom list more effectively).
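The lookup order described above could be sketched like this. The `dictionary_lookup` callable is only a stand-in for a Voikko-based lookup (it is not the real Voikko API), and the toy dictionary and custom list are made up for the example.

```python
from collections import Counter

def make_lemmatizer(dictionary_lookup, custom_forms):
    """Build a lemmatize(word) function with a custom fallback list.

    dictionary_lookup: callable returning a list of baseforms (empty if unknown);
                       a stand-in for the general purpose lemmatizer.
    custom_forms:      dict mapping domain specific word forms to baseforms.
    """
    unrecognized = Counter()

    def lemmatize(word):
        baseforms = dictionary_lookup(word)
        if baseforms:
            return baseforms[0]        # take the first baseform offered
        if word in custom_forms:
            return custom_forms[word]  # domain specific fallback
        unrecognized[word] += 1        # remember for later review
        return word

    return lemmatize, unrecognized

toy_dictionary = {"auton": ["auto"], "koulun": ["koulu"]}
custom = {"serviisin": "serviisi", "serviisille": "serviisi"}
lemmatize, unrecognized = make_lemmatizer(lambda w: toy_dictionary.get(w, []), custom)

tokens = ["auton", "serviisin", "atksla", "liikel", "atksla"]
lemmas = [lemmatize(t) for t in tokens]
print(lemmas)
print(unrecognized.most_common())  # highest frequency first
```

Sorting the unrecognized words by frequency means the most common misses are fixed first when extending the custom list.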

Results after change:

  • serviisi vs tietotekniikka = 0.8719592094421387
  • serviisi vs työterveys = 0.7782909870147705
  • serviisi vs johtokunta = 0.695137619972229
  • serviisi vs liikelaitos = 0.6921887397766113
  • serviisi vs 19.6.213 = 0.6853622794151306
  • serviisi vs tilakeskus = 0.673351526260376
  • serviisi vs jätehuolto = 0.6718368530273438
  • serviisi vs pelastusliikelaitos = 0.6589146852493286
  • serviisi vs oulu-koilismaan = 0.6495324969291687
  • serviisi vs bid=2300 = 0.6414187550544739

Or another run:

  • serviisi vs tietotekniikka = 0.864517867565155
  • serviisi vs työterveys = 0.7482070326805115
  • serviisi vs pelastusliikelaitos = 0.7050554156303406
  • serviisi vs liikelaitos = 0.6591876149177551
  • serviisi vs oulu-koillismaa = 0.6580390334129333
  • serviisi vs bid=2300 = 0.6545186638832092
  • serviisi vs bid=2379 = 0.6458192467689514
  • serviisi vs johtokunta = 0.6431671380996704
  • serviisi vs rakennusomaisuus = 0.6401894092559814
  • serviisi vs tilakeskus = 0.6375274062156677

So what are all these?

  • tietotekniikka = city office for ICT
  • työterveys = occupational health services
  • liikelaitos = company
  • johtokunta = board (of directors)
  • konttori = office
  • tilakeskus = space center
  • pelastusliikelaitos = emergency office
  • energia = energy
  • oulu-koilismaan = name of area surrounding the city
  • bid=2300 is an identifier for one of the Serviisi board meeting minutes main pages.
  • 19.6.213 seems to be a typoed date and could at least be found in one of the documents listing decisions by different city boards.

So almost all of these words that “serviisi” is found to be closest to are other city offices/companies responsible for different aspects of the city. Such as ICT, energy, office space, emergency response, or occupational health. Makes sense.

OK, so much for the experimental runs. I should summarize something about this.

The wikipedia results seem to give slightly better results in terms of the words it suggests being valid words. For the city board minutes I should probably filter more based on presence of special characters and numbers. Maybe this is the case for larger datasets vs smaller ones, where the “garbage” more easily drowns in the larger sea of data. Don’t know.

The word2vec algorithm also has a set of parameters to tune, which probably would be worth more investigation to get more optimized results for these different types of datasets. I simply used the same settings for both the city minutes and Wikipedia. Yet due to size differences, likely it would be interesting to play at least with the size of the vector space. For example, bigger datasets might benefit more from having a bigger vector space, which should enable them to express richer relations between different words. For smaller sets, a smaller space might be better. Similarly, number of processing iterations, minimum word frequencies etc should be tried a bit more. For me the goal here was to get a general idea on how this works and how to use it with Finnish datasets. For this, these experiments are enough.

If you read up on any articles on Word2Vec you will likely also see the hype on the ability to do equations such as “king – man + woman” = “queen”. These are from training on large English corpora. It simply says that the relation of the word “queen” to the word “woman” in sentences is typically the same as the relation of the word “king” to “man”. But then this is often the only or one of very few examples ever. Looking at the city minutes example here, since “serviisi” seems to map closest to all the other offices/companies of the city, what do we get if we run the arithmetic on “serviisi-liikelaitos” (so liikelaitos would be the common concept of the office/company)? I got things like “city traffic”, “reduce”, “children home”, “citizen specific”, “greenhouse gas”. Not really useful. So this seems most useful as a potential tool for exploration, but I cannot really say which part gives useful results when. But of course, it is nice to report on the interesting abstractions it finds, not on boring fails.
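The arithmetic itself is simple to sketch. Below is a minimal illustration with made-up two-dimensional vectors, hand-picked so that the analogy works; real embeddings are learned from data and are nowhere near this tidy. The function name `analogy` is also just for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative), skipping the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    skip = set(positive) | set(negative)
    scored = [(cosine(target, vec), word) for word, vec in vectors.items() if word not in skip]
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]

# Toy vectors: first number is "royalty", second is "gender".
toy = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

With learned vectors the nearest words to the result vector can be anything, which fits the not so useful “serviisi-liikelaitos” results above.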

I think lemmatization in these cases I showed here makes sense. I have no interest in just knowing that a singular form of a word is related to a plural form of the same word. But I guess in some use cases that could be valid. Of course, for proper lemmatization you might also wish to first do POS tagging to be able to choose the correct baseforms from all the options presented. In this case I just took the first baseform from the list Voikko gives for each word.

Tokenization could also be of more interest. Finnish language has a lot of compound words, some of which are visible in the above examples. For example, “kuorma-auto”, and “linja-auto” for the wikipedia example. Or the different “liikelaitos” combinations for the city of Oulu version. Further n-grams (combinations of words) would be useful to investigate further. For example, “energia” in the city example could easily be related to the city power company called “Oulun Energia”. Many similar examples likely can be found all over any language and domain vocabulary.
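One simple way to start looking at such word combinations is to count adjacent word pairs and review the frequent ones. A minimal sketch, with made-up example sentences and a made-up `min_count` threshold:

```python
from collections import Counter

def count_bigrams(sentences):
    """Count adjacent word pairs over tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def frequent_bigrams(sentences, min_count=2):
    """Bigrams seen at least min_count times, most frequent first."""
    return [(pair, n) for pair, n in count_bigrams(sentences).most_common() if n >= min_count]

sentences = [
    ["oulun", "energia", "osakassopimus"],
    ["kaupunginhallitus", "oulun", "energia"],
    ["oulun", "kaupunki", "energia"],
]
print(frequent_bigrams(sentences))
```

Frequent pairs such as “oulun energia” could then be joined into a single token before training, so the model treats them as one concept.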

Further custom spelling would also be useful. For example, “oulu-koilismaan” above could be spelled as “oulu-koillismaan”. And it could further be baseformed with other forms of itself as “oulu-koillismaa”. Collecting these from the unrecognized words should make this relatively easy, and filtering out the low-frequency occurrences of the words.

So perhaps the most interesting question, What is this good for?

Not synonym search. Somehow over time I got the idea word2vec could give you some kind of synonyms and such. Clearly it is not for that but rather to identify words used around similar concepts and the like.

So generally I can see it could be useful for exploring related concepts in documents. Or generally exploring datasets and building concept maps, search definitions, etc. More as an input to the human expert work rather than fully automated, as the results vary quite a bit.

Some interesting applications I found while looking at this:

  • Word2vec in Google type search, as well as search in general.
  • Exploring associations between medical terms. Perhaps helpful to identify new links you did not think of before? Likely would match other similar domains as well.
  • Mapping words in different languages together.
  • Spotify mapping similar songs together via treating songs as words and playlists as sentences.
  • Someone tried it on sentiment analysis. Not really sure how useful that was as I just skimmed the article but in general I can see how it could be useful to find different types of words related to sentiments. As before, not necessarily as automated input but rather as input to an expert to build more detailed models.
  • Using the similarity score weights as means to find different topics. Maybe you could combine this with topic modelling and then look for diversity of topics?
  • Product recommendations by using products as words and sequences of purchases as sentences. Not sure how big is the meaning of purchase order but interesting idea.
  • Bet recommendations by modelling bets made by users as bet targets being words and sequences of bets sentences, finding similarities with other bets to recommend.

So that was mostly that. Similar tools exist for many platforms, whatever gives you the kicks. For example, Voikko has some python module on github to use and Gensim is a nice tool for many NLP processing tasks, including Word2Vec on python.

There are also lots of datasets, especially for the English language, to use as pretrained word2vec models. For example, Facebook’s FastText, Stanford’s Glove datasets, Google news corpus from here. Anyway, some simple internet searches should turn up many such to use, which I think is useful for general purpose results. For more detailed domain specific ones, training your own is good, as I did here for the city minutes.

Many tools can also take in word vector models built with some other tool. For example, deeplearning4j mentions import of Glove models and Gensim lists support for FastText, VarEmbed and WordRank. So once you have some good idea of what such models can do and how to use them, building combinations of these is probably not too hard.

Finnish POS tagging part 2

Previously I wrote about Building a Finnish POS tagger. This post is to elaborate a bit on training with OpenNLP, which I skimmed last time, put the code for it out, and do some additional tests on it.

I am again using the Finnish Treebank to get 4.4M pre-tagged sentences to train on. Start with a Python script to transform the Treebank XML into an OpenNLP suitable format. A short example of the output below, in the format OpenNLP takes as input (at least in the configuration I used). One line contains one sentence, each word with associated POS tag, word and tag separated with an underscore “_”.

  • 1_Num artikla_N Nimi_N ja_CC tarkoitus_N
  • Hankintakeskukseen_N sovelletaan_V perustamissopimuksen_N ja_CC tämän_Pron perussäännön_N määräyksiä_N ._Punct
  • Hankintakeskuksen_N toiminnan_N kestolle_N ei_V aseteta_V määräaikaa_N ._Punct

The tags have been assigned by human experts who provide the Treebank. The whole Treebank file is parsed and output similar to above is generated by the Python script.
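
The output format described above can be sketched in a few lines of Python. This is only an illustration of the "word_tag" line format, not the actual conversion script: the Treebank XML parsing is left out, and the function name `to_opennlp_line` is made up for this example.

```python
def to_opennlp_line(tagged_words):
    """Join (word, tag) pairs into one 'word_tag word_tag ...' training line."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged_words)

# One sentence as (word, POS tag) pairs, copied from the example output above.
sentence = [
    ("Hankintakeskuksen", "N"), ("toiminnan", "N"), ("kestolle", "N"),
    ("ei", "V"), ("aseteta", "V"), ("määräaikaa", "N"), (".", "Punct"),
]

print(to_opennlp_line(sentence))
```

Each sentence becomes one such line in the training file.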

Check Github for the code to train the OpenNLP tagger. Or use the command line options.

Previously I described the test results using the Treebank data with a train/test split, showing reasonably good results. However, how well does it work in practice with some simple test sentences? Does it matter how the training and tagger input data is pre-processed? What do I mean by pre-processed?

Stemming and lemmatization are two basic transformations that are often used in NLP. Stemming is a process of cutting the ending of a word to get a simple version that matches all different forms of the word. The result is not always a real “word”. For example, “argue”, “arguing”, “argus” could all stem to “argu”. Lemmatization on the other hand produces more “real” words (the Wikipedia link describes it as producing the dictionary base forms).
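
To make the difference concrete, here is a toy sketch. The suffix list and the lemma table are made up for this example and are far simpler than any real stemmer or lemmatizer; they only show that a stemmer chops endings while a lemmatizer looks up a dictionary form.

```python
# Toy suffix list, checked in order. Not a real stemming algorithm.
SUFFIXES = ("ing", "es", "ed", "e", "s")

# Toy lemma table. A real lemmatizer uses a full dictionary and word context.
LEMMAS = {"arguing": "argue", "argued": "argue", "argues": "argue"}

def toy_stem(word):
    """Cut the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)

for word in ("argue", "arguing", "argus"):
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The stems are not real words, while the lemmas are.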

A related question that came to my mind: Does it matter if you stem/lemmatize your words you give as input to the tagger to train and test? I could not find a good answer on Google. One question on Stack Overflow about stemming vs POS tagging. And the response seems to be not to give an answer but riddles… Who would’ve guessed about the machine learning community? 😛

Well, reading the discussion and other answers on the StackOverflow page seems to suggest not to stem before POS tagging. And the Wikipedia pages on stemming and lemmatization describe the difference as lemmatization requiring the context (the POS tag) to properly function. Which makes sense, since words can have multiple meanings depending on their context (part of speech). So therefore we should probably conclude that it is better to not stem or lemmatize before training a POS tagger (or using it I guess). But common sense never stopped us before, so let’s try it.

To see for myself, I tried to train and use the tagger with some different configurations:

  • Tagger: Plain = Takes words in the sentence and tries to POS tag them as is. Not stemmed, not lemmatized, just as they are.
  • Tagger: Voikko = Takes words in the sentence, converts them to baseform (lemma?), reconstructs the sentence from the baseformed words. You can see the actual results and effect in the output column in the results table below.
  • Trained on: 100k = The tagger was trained on the first 100k sentences in the Finnish Treebank.
  • Trained on: 4M = The tagger was trained on the first 4M sentences in the Finnish Treebank.
  • Trained on: basecol = The tagger was trained on baseform column of the treebank.
  • Trained on: col1 = The tagger was trained on column 1 of the treebank, containing the unprocessed words (no baseforming or anything else).
  • Trained on: voikko = The tagger was trained on column 1 of the treebank, but before training all words in the sentence were baseformed using Voikko. Similar to “Tagger: Voikko” but for training data.
  • Input: The input sentence fed to the tagger. This was split to an array on whitespace, as the OpenNLP tagger takes an array of words for sentence as input.
  • Output: The output from the tagger, formatted as word_tag. Word = the word given to the tagger as input for that part of the sentence, tag = the POS tag assigned by the tagger for that word.
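
The “Tagger: Voikko” preprocessing can be sketched roughly as below. The `BASEFORMS` table is a hypothetical stand-in for a real Voikko lookup (the actual Voikko API is not shown here), and picking the first candidate for each word is a simplification. Words missing from the table are passed through unchanged.

```python
# Hypothetical stand-in for a baseform lookup: word -> candidate baseforms.
BASEFORMS = {
    "junassa": ["juna"],
    "on": ["olla"],
}

def baseform_sentence(sentence, baseforms=BASEFORMS):
    """Split on whitespace and replace each word with its first baseform."""
    result = []
    for word in sentence.split():
        candidates = baseforms.get(word, [])
        # Keep the word as is when no baseform is known.
        result.append(candidates[0] if candidates else word)
    return result

# The resulting word array is what would be handed to the tagger.
print(baseform_sentence("junassa on vessa"))
```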

So the Treebank actually has a “baseform” column that is described in the Treebank docs as having the baseform of each word. However, I do not have the tool used for the Treebank to baseform the words. Maybe it was manually done by the people who also tagged the sentences. Don’t know. I use Voikko as a tool to baseform words.

I still wanted to try the use of the baseform column in the Treebank so I ran all the words (baseform col and col1) in the Treebank through Voikko to see if it would recognize them. Recorded all the misses and sorted them from highest occurrence count to lowest. This showed me that the Treebank has its own “oddities”. Some examples:

  • “merkittävä” becomes “merkittää”
  • “päivästä” becomes “päivänen”
  • “työpaikkoja” becomes “työ#paikko”

These are just a few examples of highly occurring and odd looking baseforms in the Treebank. None of these, in my opinion, map quite directly to understandable Finnish words. And Voikko provides different results (gives different baseform for “merkittävä”, “päivästä”, etc), so the two baseforming approaches would not match. I wanted results that I felt I could show to people who would understand what they meant. On the other hand, some of the words in the Treebank are quite domain-specific and valid but Voikko does not recognize them. Common Treebank examples of this include “CN-koodeihin”, “CN-koodiin”, “ETY-tyyppihyväksynnän”, “ETY-tyyppihyväksyntään”, “läsnäollessa”. Treebank has valid baseforms for these but Voikko does not recognize these specific ones.
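
Counting the misses and sorting them by occurrence can be sketched with a `Counter`. The `is_known` callable here is a hypothetical stand-in for asking Voikko whether it recognizes a word, and the word list and counts are made up for the example.

```python
from collections import Counter

def count_unrecognized(words, is_known):
    """Count words the recognizer rejects, most frequent first."""
    misses = Counter(word for word in words if not is_known(word))
    return misses.most_common()

# Made-up stand-in for the recognizer: a small set of known words.
known = {"juna", "olla", "vesi"}
words = ["juna", "työ#paikko", "merkittää", "työ#paikko",
         "olla", "päivänen", "työ#paikko", "merkittää"]

ranking = count_unrecognized(words, known.__contains__)
for word, count in ranking:
    print(count, word)
```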

So I just tried it with the different configuration versions above, as illustrated in the results table below:

Tagger | Trained on | Input | Output
Plain | 100k basecol | junassa on vessa | junassa_N on_A vessa_N
Plain | 100k basecol | tuli tuli tulipesästä | tuli_N tuli_N tulipesästä_V
Plain | 100k basecol | voi on maukasta leivän päällä | voi_N on_A maukasta_N leivän_PrfPrc päällä_Abbr
Plain | 100k basecol | juodaan jaffaa ladassa | juodaan_Unkwn jaffaa_Punct ladassa_Unkwn
Plain | 100k basecol | liika vesi vesitti kilpailun | liika_N vesi_N vesitti_N kilpailun_Abbr
Plain | 100k basecol | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syynä_N ovat_Unkwn todennäköisimmin_Adv rutiininomaiset_Unkwn tietokannan_Unkwn ylläpitotoimet._Punct
Plain | 100k basecol | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct miksi_Unkwn sinulla_V on_Unkwn tuollaisia_Unkwn lauseita_Unkwn tuossa?_Punct
Plain | 100k basecol | no kun jotain piti keksiä :( | no_Interj kun_CS jotain_Adv piti_N keksiä_V :(_Punct
Plain | 100k col1 | junassa on vessa | junassa_N on_V vessa_N
Plain | 100k col1 | tuli tuli tulipesästä | tuli_V tuli_V tulipesästä_N
Plain | 100k col1 | voi on maukasta leivän päällä | voi_V on_V maukasta_N leivän_N päällä_N
Plain | 100k col1 | juodaan jaffaa ladassa | juodaan_V jaffaa_CC ladassa_N
Plain | 100k col1 | liika vesi vesitti kilpailun | liika_N vesi_N vesitti_V kilpailun_N
Plain | 100k col1 | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syynä_N ovat_V todennäköisimmin_Adv rutiininomaiset_A tietokannan_N ylläpitotoimet._Punct
Plain | 100k col1 | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct miksi_N sinulla_N on_V tuollaisia_A lauseita_N tuossa?_Punct
Plain | 100k col1 | no kun jotain piti keksiä :( | no_Abbr kun_CS jotain_Pron piti_V keksiä_A :(_Punct
Voikko | 100k voikko | junassa on vessa | juna_N olla_V vessa_N
Voikko | 100k voikko | tuli tuli tulipesästä | tuli_V tuli_N tulipesä_N
Voikko | 100k voikko | voi on maukasta leivän päällä | voi_V olla_V maukas_N leipä_N pää_N
Voikko | 100k voikko | juodaan jaffaa ladassa | juoda_V jaffa_CC lada_V
Voikko | 100k voikko | liika vesi vesitti kilpailun | liika_Adv vesi_N vesittää_V kilpailu_N
Voikko | 100k voikko | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syy_N olla_V todennäköinen_Adv rutiininomainen_A tietokanta_N ylläpitotoimet._Punct
Voikko | 100k voikko | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct mikä_Pron sinä_N olla_V tuollainen_A lause_N tuossa?_Punct
Voikko | 100k voikko | no kun jotain piti keksiä :( | no_Interj kun_CS jokin_Pron pitää_V keksi_N :(_Punct
Voikko | 100k basecol | junassa on vessa | juna_N olla_V vessa_Unkwn
Voikko | 100k basecol | tuli tuli tulipesästä | tuli_N tuli_N tulipesä_N
Voikko | 100k basecol | voi on maukasta leivän päällä | voi_N olla_V maukas_N leipä_N pää_N
Voikko | 100k basecol | juodaan jaffaa ladassa | juoda_PrsPrc jaffa_CC lada_PrsPrc
Voikko | 100k basecol | liika vesi vesitti kilpailun | liika_N vesi_N vesittää_V kilpailu_N
Voikko | 100k basecol | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syy_N olla_V todennäköinen_Adv rutiininomainen_A tietokanta_N ylläpitotoimet._Punct
Voikko | 100k basecol | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct mikä_Pron sinä_Pron olla_V tuollainen_A lause_N tuossa?_Punct
Voikko | 100k basecol | no kun jotain piti keksiä :( | no_Interj kun_CS jokin_Pron pitää_V keksi_Adv :(_Punct
Plain | 4M basecol | junassa on vessa | junassa_Unkwn on_Unkwn vessa_Unkwn
Plain | 4M basecol | tuli tuli tulipesästä | tuli_N tuli_N tulipesästä_Punct
Plain | 4M basecol | voi on maukasta leivän päällä | voi_N on_V maukasta_Unkwn leivän_Abbr päällä_Abbr
Plain | 4M basecol | juodaan jaffaa ladassa | juodaan_Unkwn jaffaa_Punct ladassa_Unkwn
Plain | 4M basecol | liika vesi vesitti kilpailun | liika_N vesi_N vesitti_N kilpailun_Abbr
Plain | 4M basecol | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syynä_A ovat_Unkwn todennäköisimmin_Adv rutiininomaiset_Unkwn tietokannan_Adv ylläpitotoimet._Abbr
Plain | 4M basecol | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct miksi_Unkwn sinulla_Unkwn on_Unkwn tuollaisia_Unkwn lauseita_Punct tuossa?_Punct
Plain | 4M basecol | no kun jotain piti keksiä :( | no_Interj kun_Punct jotain_Adv piti_N keksiä_PrfPrc :(_Punct
Plain | 4M col1 | junassa on vessa | junassa_N on_V vessa_N
Plain | 4M col1 | tuli tuli tulipesästä | tuli_V tuli_V tulipesästä_N
Plain | 4M col1 | voi on maukasta leivän päällä | voi_V on_V maukasta_A leivän_N päällä_N
Plain | 4M col1 | juodaan jaffaa ladassa | juodaan_V jaffaa_V ladassa_PrsPrc
Plain | 4M col1 | liika vesi vesitti kilpailun | liika_A vesi_N vesitti_V kilpailun_N
Plain | 4M col1 | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syynä_N ovat_V todennäköisimmin_Adv rutiininomaiset_A tietokannan_N ylläpitotoimet._Punct
Plain | 4M col1 | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct miksi_Pron sinulla_Pron on_V tuollaisia_A lauseita_N tuossa?_Punct
Plain | 4M col1 | no kun jotain piti keksiä :( | no_Interj kun_Punct jotain_Pron piti_V keksiä_N :(_Punct
Voikko | 4M col1 | junassa on vessa | juna_N olla_V vessa_N
Voikko | 4M col1 | tuli tuli tulipesästä | tuli_V tuli_V tulipesä_N
Voikko | 4M col1 | voi on maukasta leivän päällä | voi_V olla_V maukas_A leipä_N pää_N
Voikko | 4M col1 | juodaan jaffaa ladassa | juoda_V jaffa_Num lada_V
Voikko | 4M col1 | liika vesi vesitti kilpailun | liika_A vesi_N vesittää_V kilpailu_N
Voikko | 4M col1 | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syy_N olla_V todennäköinen_A rutiininomainen_A tietokanta_N ylläpitotoimet._Punct
Voikko | 4M col1 | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct mikä_Pron sinä_Pron olla_V tuollainen_A lause_N tuossa?_Punct
Voikko | 4M col1 | no kun jotain piti keksiä :( | no_Interj kun_Punct jokin_Pron pitää_V keksi_N :(_Punct
Voikko | 4M voikko | junassa on vessa | juna_N olla_V vessa_N
Voikko | 4M voikko | tuli tuli tulipesästä | tuli_V tuli_N tulipesä_N
Voikko | 4M voikko | voi on maukasta leivän päällä | voi_V olla_V maukas_A leipä_N pää_N
Voikko | 4M voikko | juodaan jaffaa ladassa | juoda_N jaffa_N lada_N
Voikko | 4M voikko | liika vesi vesitti kilpailun | liika_Adv vesi_N vesittää_V kilpailu_N
Voikko | 4M voikko | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syy_N olla_V todennäköinen_A rutiininomainen_A tietokanta_N ylläpitotoimet._Punct
Voikko | 4M voikko | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct mikä_Pron sinä_Pron olla_V tuollainen_A lause_N tuossa?_Punct
Voikko | 4M voikko | no kun jotain piti keksiä :( | no_Interj kun_Punct jokin_Pron pitää_V keksi_N :(_Punct
Voikko | 4M basecol | junassa on vessa | juna_N olla_V vessa_N
Voikko | 4M basecol | tuli tuli tulipesästä | tuli_N tuli_N tulipesä_N
Voikko | 4M basecol | voi on maukasta leivän päällä | voi_N olla_V maukas_A leipä_N pää_N
Voikko | 4M basecol | juodaan jaffaa ladassa | juoda_V jaffa_N lada_V
Voikko | 4M basecol | liika vesi vesitti kilpailun | liika_N vesi_N vesittää_V kilpailu_N
Voikko | 4M basecol | syynä ovat todennäköisimmin rutiininomaiset tietokannan ylläpitotoimet. | syy_N olla_V todennäköinen_Adv rutiininomainen_A tietokanta_N ylläpitotoimet._Punct
Voikko | 4M basecol | teemu, miksi sinulla on tuollaisia lauseita tuossa? | teemu,_Punct mikä_Pron sinä_Pron olla_V tuollainen_A lause_N tuossa?_Punct
Voikko | 4M basecol | no kun jotain piti keksiä :( | no_Interj kun_CS jokin_Pron pitää_V keksi_N :(_Punct

You can find all the POS tags etc. listed and explained in the Treebank Manual. Here are most of the above:

  • N = Noun
  • V = Verb
  • PrfPrc = Past participle
  • A = Adjective
  • CS = Subordinating conjunction
  • Abbr = Abbreviation
  • Num = Numeral
  • Punct = Punctuation
  • Adv = Adverb
  • Unkwn = Unknown

Some of these (CS, PrfPrc, Adv, …) are a bit more detailed than I ever want to get after leaving primary school 100 years ago. That is to say, I have no idea what they mean. Luckily I am really only interested in the POS tag as input to other algorithms so I don’t really care what they are as long as they are correct and help to differentiate the words in context. Of course, with my lack of knowledge of the language nuances and academic details of all those tags, I am not very good at judging the correctness of the taggings above. But a few notes anyway:

  • Using the baseform column from the Treebank to train the tagger and to tag unprocessed sentences (tagger “plain”): Lots of unknowns and failed taggings in general. Size of training corpus makes little difference.
  • Using Treebank col 1 to train and the “plain” tagger gives better results. Still it has some issues but most general cases are not too bad.
  • Baseforming all words in the sentence to be tagged with Voikko (tagger “Voikko”) and using col 1 to train results in about similar performance as “plain” tagger with col 1.
  • Tagger “Voikko” with training type “voikko” and 4M sentences seems to give the best match. It has some issues though.
  • Baseforming the sentence to tag with Voikko has a chicken and egg problem (as mentioned in the Wikipedia links I put high above). You can get multiple baseforms for a word, depending on what POS the word is. If you need to define this to do POS tagging, then how do you pick which one to use? For example “keksiä” in Finnish refers to “innovating” but could also mean “cookie”. Here, I just used the first baseform of a word given by Voikko, which for “keksiä” just happens to be the one for “cookie”. The correct one in this case would be the “innovating” one.
  • As there are two different baseforming approaches here (Voikko and Treebank baseform col), mixing them causes worse results than using a unified baseforming approach (Voikko for both training and later tagging). So better to stick with just the same baseformer/lemmatizer for all data.
  • Special elements such as smileys would need to be trained separately :). Here they are just treated as punctuation.
  • “Jaffa” is a Finnish drink. It gets classified here correctly as N but also as numerical, punctuation, or verb. Maybe too rare a word or something? Numerical and punctuation are still odd.
  • Splitting with whitespace here causes issues with sentences ending in punctuation. The last words of sentences with “.”, “?”, or such, end up classified as “Punct”. Better splitting (tokenization) needed. Since punctuation is also included in the tagger training data, it should not be just discarded though, as I guess it can provide valuable context for the rest of the words.
  • Some of my test sentences I made up to be difficult to POS tag, and with very limited sentences above, this is likely not a generally representative case. For example, “Tuli tuli” can be translated as “Fire came” (intent here), “Fire fire”, “It came it came”, and probably valid taggings would also be “N V N”, “V N N”, “N N N”, “V V V”. Some of it might even be difficult for humans without broader context, although the “tulipesä” (fireplace) would likely tip people off. Similarly “voi” could also be translated as “butter” (intent here) or “could”.
  • Much bigger tests would be very useful to categorize what can be tagged right, what causes issues, etc.
  • It would also be useful to have a system available to choose whether the sentence was tagged right or not, and to retrain further the tagger with the errors. Maybe use a generator to build further examples of such errors.
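
As for the whitespace splitting issue noted in the list above, a slightly better tokenization could be sketched with a regular expression that separates punctuation from words. This is just one possible approach and not what was used for the results here. The pattern is an assumption: it keeps hyphenated words such as “CN-koodeihin” together and treats any run of other symbols (including a smiley) as its own token.

```python
import re

# A word (optionally hyphenated), or a run of non-word, non-space symbols.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]+")

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize("teemu, miksi sinulla on tuollaisia lauseita tuossa?"))
print(tokenize("no kun jotain piti keksiä :("))
```

With this, “tuossa?” becomes the two tokens “tuossa” and “?”, so the word itself can get a proper tag while the punctuation is still there as context.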

So I guess the better configurations here can do a reasonable job of tagging most sentences, as illustrated by these results and the ones I listed before (the accuracy test on Treebank test/train split).

Most obviously, words with multiple meanings (possible POS tags) still require some more tuning. Maybe something with broader context (e.g., previous sentences, following sentences, iterations, probabilistic approaches,..?)?

I am not so familiar with all the works, such as Google’s Parsey McParseface. Because you know, it’s deep learning and that is all the rage, right? 🙂 Would be interesting to try, but the whole setup is more than I can do right now.

Better tuning of OpenNLP parameters might also help if I had more expertise on that, and on how they map to Finnish language peculiarities. In general, I am sure I am missing plenty of magic tricks the NLP gurus could use.

In general, I guess it is most likely just better to train the tagger before lemmatization/baseforming as noted before.

What more can I summarize here? Not much, not further than the bullets and points above. But this may provide a useful starting point for those interested in POS tagging for Finnish. Possibly useful points for some other languages as well..

Topicmodels, topicmodels, …

I have previously done some topic modelling using LDA (Latent Dirichlet Allocation). Back then I used a nice video from some nice guy but somehow could not find the video with search engines anymore. Too bad. Implemented LDA in Java back then based on that tutorial. I learned how it works, not why it works. Still don’t quite get why the set of topics emerges from the algorithm.

Actually I found a reasonably good explanation on Quora. Well, it is a good one if you already know most of how LDA works. Eh. Also a tutorial briefly summarizing how online LDA works, which is a nice improvement, and I guess what the tools use these days.

The number of topics LDA produces is given as a parameter, and it is always a bit of a puzzle for me how to pick the best number of topics. Googling for it, I found various references to using “perplexity” to choose the best number of topics. I still have not found a good “for dummies” explanation for what that really means in practice for LDA, or how to implement it. Maybe some of the libs out there will do it for me? Python seems all the rage in data science these days, because whatever. So after a few searches, gensim it is.
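
For what it is worth, the general textbook definition of perplexity is simple enough to sketch: it is the exponential of the negative average log-probability the model gives to each word in some held-out text. The sketch below is only that general definition with made-up probabilities. It is not Gensim’s implementation, and how LDA arrives at the per-word probabilities is not covered.

```python
import math

def perplexity(word_probabilities):
    """exp of the negative mean log-probability over all words."""
    log_likelihood = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_likelihood / len(word_probabilities))

# Made-up per-word probabilities from two imaginary models.
weak_model = [0.25] * 8    # as unsure as picking among 4 equally likely words
better_model = [0.5] * 8   # as unsure as picking among 2 equally likely words

print(perplexity(weak_model))
print(perplexity(better_model))
```

A lower value means the model is less “surprised” by the text, which is why it gets suggested for comparing models with different numbers of topics.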

Gensim seems to have some perplexity options and a bunch of weird formulas to apply. Is it so hard to write some simple docs and explain these things? I guess nobody pays people to do it, and doing for free would just go against the goal of making oneself important. Sort of makes sense, and applies to most OSS software I have used. Or maybe I am just bad at using stuff.

Anyway. There is also something called topic coherence in Gensim. This is supposed to be some way to evaluate the number of topics. Somehow the explanation does not work for me. I did not quite grasp how it works for real. So I just gave it a try to see what I get, that would be most important for me regardless.

I start with the English wikipedia (I used a May 2017 dump). Because it is sorta big and I can put the results here, everyone knows it and it’s public data. Gensim nicely comes with a script to parse it for dictionary and corpus:

python -m gensim.scripts.make_wiki

Then some code to build different sizes of topic models (25 to 200 topics in 25 topic size increments)

import logging, gensim, bz2
import os, sys

#http://stackoverflow.com/questions/13733552/logger-configuration-to-log-to-file-and-print-to-stdout
#https://aykutakin.wordpress.com/2013/08/06/logging-to-console-and-file-in-python/
#configure_log function reconfigures python logging to write to the specific directory for the analysis size. so lda25 log goes into lda25 dir
def configure_log(log_path, log_name):
    logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
    rootLogger = logging.getLogger()
    rootLogger.setLevel(logging.INFO)

    #http://stackoverflow.com/questions/12034393/import-side-effects-on-logging-how-to-reset-the-logging-module
    #http://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list
    #need to copy the list of handlers or we will be iterating what we are modifying and it will fail to work as intended
    handlers_to_remove = rootLogger.handlers[:]
    for handler in handlers_to_remove:
        rootLogger.removeHandler(handler)
        
    filters_to_remove = rootLogger.filters[:]
    for filter in filters_to_remove:
        rootLogger.removeFilter(filter)

    fileHandler = logging.FileHandler("{0}/{1}.log".format(log_path, log_name))
    fileHandler.setFormatter(logFormatter)
    rootLogger.addHandler(fileHandler)

    consoleHandler = logging.StreamHandler(sys.stdout)
    consoleHandler.setFormatter(logFormatter)
    rootLogger.addHandler(consoleHandler)

#load wikipedia dictionary. this gets generated by the gensim wikipedia script
id2word = gensim.corpora.Dictionary.load_from_text('wikires_wordids.txt.bz2')
#and the wikipedia corpus
mm = gensim.corpora.MmCorpus('wikires_tfidf.mm')

sizes = [25, 50, 75, 100, 125, 150, 175, 200]

#ensure_dir makes sure a given path exists, creating if needed
def ensure_dir(file_path):
    directory = os.path.dirname(file_path)
    if not os.path.exists(directory):
        os.makedirs(directory)

#run gensim LDA using autotuning for the hyperparameters
def run_auto():
    for size in sizes:
        dir = "lda_auto"+str(size)+"/"
        ensure_dir(dir)
        configure_log(dir, "lda_auto"+str(size))
        lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=size, update_every=1, chunksize=10000, passes=1, alpha="auto", eta="auto")
        lda.print_topics(20)
        lda.save(dir+"a_model"+str(size)+".lda")

#run gensim LDA using default values for the hyperparameters
def run_default():
    for size in sizes:
        dir = "lda"+str(size)+"/"
        ensure_dir(dir)
        configure_log(dir, "lda"+str(size))
        lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=size, update_every=1, chunksize=10000, passes=1)
        lda.print_topics(20)
        lda.save(dir+"model"+str(size)+".lda")

run_default()
run_auto()

The code above drops a set of 8 different sized topic models (one per entry in the sizes list) into matching directories. Both for default parameters and autotuned parameters. Takes a while to run. The machine I ran it on has 32GB RAM and a quad-core Core i7 processor (hyperthreads to 8 virtual cores). Resource use? I actually found the Gensim implementations are quite nicely optimized not to take huge amounts of memory, and they also pretty much make use of all the cores in a system. Except perhaps the topic coherence ones that seemed to run single core still. Perhaps because they seem relatively new?

My first mistake in this regard was to think of LDA as a single-core solution. I implemented the original algorithm some time back, and did not see it becoming anything else. But the online version seems to batch it in pieces, which I guess makes it more parallelizable. And the Gensim docs also nicely describe how running this online algorithm now also merges the results in a way that you don’t necessarily need to run large numbers of passes (iterations) over the corpus to converge on a better model. Chunksize 10000 in the above code seems to cause this merge after each 10000 docs, and with Wikipedia having about 4 million articles, this amounts to quite a few merges. Maybe somewhat equal to iterations of old.
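
The “quite a few merges” can be put into rough numbers. The sketch below assumes one merge per `update_every` chunks, which is a reading of the parameter names rather than something checked against the Gensim source, and the document count is the approximate 4 million mentioned above.

```python
import math

def count_model_updates(num_docs, chunksize, passes=1, update_every=1):
    """Estimate how many merges happen over a full training run."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return updates_per_pass * passes

# About 4 million articles, chunksize 10000, a single pass.
print(count_model_updates(4_000_000, 10_000))
```

So a single pass over the corpus would already give some hundreds of merges under these assumptions.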

With logging enabled, Gensim prints some texts about “topic diff” between each batch and merge. This seems to indicate how much the topic model changed between the runs. So I plotted the topic diff for the wikipedia run (when generating the LDA models), to see how much the topics drift during the run. See figure below for the sizes I used (one subplot per size), using Gensim default LDA parameters:

[Figure: lda_grid]

And for using the autotuned parameters:

[Figure: lda_a_grid]

From this, it seems the topic model actually pretty much “converges” quite early in the process. That is, the topic diff goes down to a small number and the topics become quite stable across merges/iterations. Maybe because there is so much data in this dataset? And the autotuned version seems much more direct to converge. So I will use that later.
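
Pulling the topic diff values out of the log can be sketched with a regular expression. The sample line below is made up: it assumes the log message contains “topic diff=” followed by a number and a comma, and then “rho=” followed by a number, with the timestamp and thread prefix coming from the log formatter configured earlier.

```python
import re

# Assumed message shape: "... topic diff=<number>, rho=<number>"
LINE_PATTERN = re.compile(r"topic diff=([0-9.eE+-]+), rho=([0-9.eE+-]+)")

def parse_topic_diff(line):
    """Return (topic_diff, rho) for a matching log line, else None."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Hypothetical log lines, not copied from a real run.
sample = "2017-06-01 12:00:00,000 [MainThread  ] [INFO ]  topic diff=0.123456, rho=0.100000"
other = "2017-06-01 12:00:01,000 [MainThread  ] [INFO ]  some other message"

print(parse_topic_diff(sample))
print(parse_topic_diff(other))
```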

After this, I ran the same analysis on a bunch of document sets I have from different Finnish organizations. I won’t be putting the exact data for those documents online here, but I will show some statistics on the runs and the models produced, as well as my feeling from looking at the topics generated and the stats. Some stats when running the autotuned version (because the autotuned seemed to converge faster and about equally on quality on wikipedia):

type id | doc count
1 | 3651
2 | 1930
3 | 679
4 | 5596
5 | 1058
6 | 343
7 | 228
8 | 1069
9 | 333
10 | 213
11 | 279
12 | 316
13 | 592
14 | 397
15 | 104
16 | 1076
17 | 1648

Since these have a very small number of documents when compared to Wikipedia, I ran the Gensim LDA model generator for them in the online mode using a batch size of 1000. Separately with 10 iterations and 100 iterations to get some comparable data on the impact of iteration counts. Listing all 3×3 grids for the 17 document sets would be a bit much to show here. So after looking at them, I figured they were mostly similar but with maybe a few minor differences. So I picked three types (based on my feelings when looking at the figures):

Type 1 (this grid is for doc set with type id 6 from above):
10 iterations:
[Figure: t6_bd_lda_a_grid]

100 iterations:
[Figure: t6_bd_lda_a100_grid]

Type 2 (this grid is for doc set with type id 5 from above):
10 iterations:
[Figure: t5_j_lda_a_grid]

100 iterations:
[Figure: t5_j_lda_a100_grid]

Type 3 (this grid is for doc set with type id 7 from above):
10 iterations:
[Figure: t7_sd_lda_a_grid]

100 iterations:
[Figure: t7_sd_lda_a100_grid]

Remember, the types are just something I made up myself. I chose Type 1 to refer to models where there was a big difference from 10 iterations to 100 iterations in the final topic diff for the 25 topic run. In the example Type 1 figures here (for doc type 6), the 10 iteration run gets to around 0.25 final diff. In my set for type 1, document sets 2, 16, and 17 had the biggest diff of about 0.5 in the end after 10 iterations. Document sets 3, 6, 9, 12, 13, and 14 were close to 0.2 diff after 10 iterations. Document sets 10 and 11 were close to 0.1 diff for 10 iterations. Each of these was close to 0 final diff after 100 iterations.

Type 2 refers to models where the 25 topics line has a noticeable “jiggly” effect to it. Maybe this is between the iterations (or “passes”)? Not sure how Gensim restarts iterations, so could have something to do with it. Topics for document sets 5 and 8 had the biggest such effects, as also shown in the Type 2 figure above for document set 5. For document sets 1 and 4, the effect was smaller but still seemed to be there.

Type 3 refers to models where there was no big difference in final topic diff in 10 vs 100 iterations. This was just the models for document sets 7 and 15. These are also the two smallest document sets (least docs). Maybe smaller sets converge better with fewer iterations?

Looking at the document type count table above, there is no clear correlation with document count and the types of figures (1,2,3) I used above. There could be other differences in properties of the documents (e.g., length, number of real distinct topics embedded in each). Not in my scope to investigate further, but the reasons could be anything, what do I know.

The properties I used to select the types are mostly visible in the smaller number of topics. With higher number of topics they all seem quite similar. Maybe the algorithm has to work harder to fit the data into fewer topics? Or maybe I just have so little data there that larger number of topics always produces garbage topics uniformly? No idea, really.

The code I used to run this is here:

__author__ = 'teemu kanstren'

#loads docs from es and runs lda on those, saves the model

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from gensim.corpora.dictionary import Dictionary
from gensim import corpora
import gensim
import logging, sys, os

#configure logging for gensim and other packages to write to correct dir and with given log file name
def configure_log(log_path, log_name):
    #this is the same code as before for this function so not repeating here..
    pass

#ensures a dir exists
def ensure_dir(file_path):
    directory = os.path.dirname(file_path)
    if not os.path.exists(directory):
        os.makedirs(directory)

configure_log(".", "teemu")

es = Elasticsearch()

indices=es.indices.get_alias().keys()
print(indices)

#get mapping for the index we are interested in
mapping = es.indices.get_mapping("my_index")
print(mapping)

#find all document types in the mapping
keys = mapping["my_index"]["mappings"].keys()
types = [key for key in keys]
print(types)

fields = es.indices.get_field_mapping(index="my_type", fields="*")
print(fields)

#https://marcobonzanini.com/2015/02/02/how-to-query-elasticsearch-with-python/

def process_search(s, dirname, filename):
    dir = "output/"+dirname+"/"
    ensure_dir(dir)
    count = 0

    dict = Dictionary()

    for hit in s.scan():
        #    print(hit.meta.score, hit.file_name)
        #    print(count)
        #skip file if we are lazy with the query writing and potentially loading too many and need a specific field
        if "my_contents" not in hit: continue
        count += 1
        # update dictionary with document words
        dict.doc2bow(hit.my_contents.split(), allow_update=True)

    print(count)
    print(dict)

    corpus = []
    for hit in s.scan():
        if "my_contents" not in hit: continue
        line = dict.doc2bow(hit.my_contents.split())
        corpus.append(line)

    dict.save(dir+filename+"_hellome.dict")
    corpora.MmCorpus.serialize(dir+filename+'_hellome-corpus.mm', corpus)

    # exit()

    sizes = [25, 50, 75, 100, 125, 150, 175, 200]
    for size in sizes:
        configure_log(dir, filename+"_lda_auto" + str(size))
        lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dict, num_topics=size, update_every=1, chunksize=1000, passes=10, alpha="auto", eta="auto")
        lda.print_topics(size)
        lda.save(dir + filename+"_a_model" + str(size) + ".lda")

    for size in sizes:
        configure_log(dir, filename+"_lda_auto_100" + str(size))
        lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dict, num_topics=size, update_every=1, chunksize=1000, passes=100, alpha="auto", eta="auto")
        lda.print_topics(size)
        lda.save(dir + filename+"_a_model_100" + str(size) + ".lda")

for type in types:
    #this is simply if you want to combine several, so the ES query is just a list for doc_type
    s = Search(using=es, index="my_index", doc_type=[type, type+"_extra_field"]) \
        .query("match_all").sort("doc_id")
    process_search(s, type, type)
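
The two loops over the search results above first build the dictionary and then convert each document into a bag of words. What that conversion produces can be illustrated in plain Python. This is only a sketch of the idea with made-up helper names: Gensim’s Dictionary assigns ids and handles updates in its own way.

```python
from collections import Counter

def build_vocabulary(documents):
    """First pass: give each distinct token an integer id."""
    token_to_id = {}
    for document in documents:
        for token in document.split():
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(document, token_to_id):
    """Second pass: turn a document into sorted (token id, count) pairs."""
    counts = Counter(token_to_id[t] for t in document.split() if t in token_to_id)
    return sorted(counts.items())

documents = ["voi on maukasta", "voi voi"]
vocabulary = build_vocabulary(documents)
corpus = [to_bag_of_words(d, vocabulary) for d in documents]
print(vocabulary)
print(corpus)
```

The LDA model then only sees these id and count pairs, not the original word order.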

And to plot it:

__author__ = 'teemu kanstren'

import matplotlib.pyplot as plt
import sys

dirname=sys.argv[1]

sizes = [25, 50, 75, 100, 125, 150, 175, 200]

def read_log_data(fileprefix):
    log_data = []
    for size in sizes:
        #http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python
        with open(fileprefix+str(size)+".log") as f:
            topic_diffs = []
            rhos = []
            iterations = []
            td_str = "topic diff="
            td_str_len = len(td_str)
            rho_str ="rho="
            rho_str_len = len(rho_str)
            i = 0
            for line in f:
                ti = line.find(td_str)
                ri = line.find(rho_str)
                if ti > 0 and ri > 0:
                    iterations.append(i)
                    i += 1
                    ti += td_str_len
                    ri += rho_str_len
                    te = line.index(",", ti)
                    re = len(line)
                    topic_diff = float(line[ti:te])
                    rho = float(line[ri:])
                    topic_diffs.append(topic_diff)
                    rhos.append(rho)
            log_data.append((iterations, topic_diffs, rhos))
            print("topic diffs:"+str(topic_diffs))
            print("rhos:"+str(rhos))
    return log_data

def create_plot(log_datum, row, col, topic_n, axarr):
    iterations = log_datum[0]
    topic_diffs = log_datum[1]
    rhos = log_datum[2]
    axarr[row, col].plot(iterations[1:], topic_diffs[1:])
    axarr[row, col].plot(iterations[1:], rhos[1:])
    axarr[row, col].set_title('LDA'+str(topic_n))

def create_plots(suffix):
    #create the figure and its 3x3 grid in one call; a separate plt.figure() would leave an extra empty figure behind
    f, axarr = plt.subplots(3, 3, figsize=(18.5, 10.5))

    log_data = read_log_data(dirname+"/"+dirname+suffix)
    #log_data2 = read_log_data(dirname+"/"+dirname+"_lda_auto_100")

    row = 0
    col = 0
    for idx, val in enumerate(log_data):
        create_plot(log_data[idx], row, col, sizes[idx], axarr)
        col += 1
        if col >= 3:
            col = 0
            row += 1

    # Fine-tune figure; make subplots farther from each other.
    f.subplots_adjust(hspace=0.3)

    plt.gcf().set_size_inches(18.5, 10.5)

create_plots("_lda_auto")
plt.savefig(dirname+'/lda_a_grid.png', bbox_inches='tight', dpi=200)
plt.savefig(dirname+'/lda_a_grid.pdf', bbox_inches='tight', dpi=200)

create_plots("_lda_auto_100")
plt.savefig(dirname+'/lda_a100_grid.png', bbox_inches='tight', dpi=200)
plt.savefig(dirname+'/lda_a100_grid.pdf', bbox_inches='tight', dpi=200)
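
The substring searching in read_log_data above could also be done with a regular expression. This is just a minimal sketch, assuming the log lines look like "topic diff=<number>, rho=<number>", which is the shape the script above searches for. The parse_convergence name and the sample lines are made up for illustration, they are not from a real run:

```python
import re

# Assumed line shape, inferred from the substring search in the script above:
# "... topic diff=<float>, rho=<float>"
LINE_RE = re.compile(r"topic diff=([0-9.eE+-]+), rho=([0-9.eE+-]+)")

def parse_convergence(lines):
    """Return (topic_diffs, rhos) collected from an iterable of log lines."""
    topic_diffs, rhos = [], []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            topic_diffs.append(float(match.group(1)))
            rhos.append(float(match.group(2)))
    return topic_diffs, rhos

# Made-up sample lines for illustration only.
sample = [
    "2017-01-01 12:00:00 [MainThread] [INFO ]  topic diff=0.512345, rho=0.577350",
    "2017-01-01 12:00:01 [MainThread] [INFO ]  some unrelated message",
    "2017-01-01 12:00:02 [MainThread] [INFO ]  topic diff=0.254321, rho=0.500000",
]
diffs, rhos = parse_convergence(sample)
print(diffs, rhos)
```

Since an open file is also an iterable of lines, the same function should work directly on a log file handle.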

And once the models are built, the Gensim coherence estimator can be run to evaluate which of these is best according to Gensim. I used the u_mass measure here, since it does not require the corpus to be reloaded. According to this website, others such as c_v are more accurate while u_mass is faster. For my experiments I am just looking for a general feel for how useful the coherence measure is. If I had more motivation and resources I might try the others as well. Mostly resources, since my results are not too good and it would be interesting to explore further to improve them. But let's not jump too far ahead. Code:

__author__ = 'teemu kanstren'

from gensim.models.coherencemodel import CoherenceModel
import logging
import gensim, sys

dirname = sys.argv[1]
size = int(sys.argv[2])
dir = dirname+"/"

#first set up python logging to go into the separate subdir+filename for the given dirname and size
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.DEBUG)

fileHandler = logging.FileHandler(dir+"coherence"+str(size)+".log") #log name
fileHandler.setFormatter(logFormatter)
rootLogger.addHandler(fileHandler)

consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)

log = logging.getLogger("bob") #this ("bob") can be whatever but do check python docs

log.info("calculating coherence for size:"+str(size))

log.info("loading dictionary")
dictionary = gensim.corpora.Dictionary.load(dir+dirname+'_hellome.dict')
log.info("loading corpus")
corpus = gensim.corpora.MmCorpus(dir+dirname+'_hellome-corpus.mm')
log.info("loading previously generated lda model")
lda = gensim.models.ldamodel.LdaModel.load(dir+dirname+'_a_model'+str(size)+'.lda')

log.info("building coherence model")
cm = CoherenceModel(model=lda, corpus=corpus, coherence='u_mass')
log.info("cm built, getting coherence")
c = cm.get_coherence() #this is the part that seems to do the calculation and takes a while
log.info("done, c="+str(c))
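
To get a feel for what a u_mass style score actually measures, here is a minimal pure-Python sketch of the idea for a single topic: for every pair of top words, check how often the lower-ranked word appears in the same documents as the higher-ranked one. This follows the commonly cited document co-occurrence formula with +1 smoothing. It is not Gensim's code, and Gensim's own implementation differs in details such as smoothing and aggregation, so the numbers will not match. The umass_coherence function and the toy documents are my own illustration:

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: the topic's words, most probable first.
    documents: list of tokenized documents (lists of words).
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        # +1 smoothing avoids log(0) when the pair never co-occurs
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

# Toy documents, made up for illustration.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["car", "road"],
    ["car", "road", "dog"],
]
coherent = umass_coherence(["cat", "dog"], docs)
mixed = umass_coherence(["cat", "road"], docs)
print(coherent, mixed)
```

Words that show up together in the same documents give a higher (less negative) score than words that never meet, which is the intuition behind picking the model with the highest coherence.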

And to plot it:

__author__ = 'teemu kanstren'

import sys
import matplotlib

#this statement needs to be before importing pyplot if wanting to run in headless mode
matplotlib.use('Agg')
sizes = [25, 50, 75, 100, 125, 150, 175, 200]

import matplotlib.pyplot as plt
from os import walk

dirname=sys.argv[1]

def read_log_data(dirname):
    fileprefix = dirname+"/coherence"
    iterations = []
    for size in sizes:
        #http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python
        with open(fileprefix+str(size)+".log") as f:
            target_str = " c="
            target_str_len = len(target_str)
            i = 0
            for line in f:
                ti = line.find(target_str)
                if ti > 0:
                    start_i = ti+target_str_len
                    #convert to float so matplotlib plots numbers instead of treating the values as text labels
                    iterations.append(float(line[start_i:]))
                    i += 1
    return iterations

data = read_log_data(dirname)
print(data)

f, ax = plt.subplots()
ax.plot(sizes, data)
ax.set_title('Coherence 10 iterations')
plt.savefig(dirname+'_lda.png', bbox_inches='tight', dpi=200)

And the results for each of the document sets:

Doc set id 10 iterations 100 iterations
1 t1_lda_10 t1_lda_100
2 t2_lda_10 t2_lda_100
3 t3_lda_10 t3_lda_100
4 t4_lda_10 t4_lda_100
5 t5_lda_10 t5_lda_100
6 t6_lda_10 t6_lda_100
7 t7_lda_10 t7_lda_100
8 t8_lda_10 t8_lda_100
9 t9_lda_10 t9_lda_100
10 t10_lda_10 t10_lda_100
11 t11_lda_10 t11_lda_100
12 t12_lda_10 t12_lda_100
13 t13_lda_10 t13_lda_100
14 t14_lda_10 t14_lda_100
15 t15_lda_10 t15_lda_100
16 t16_lda_10 t16_lda_10
17 t17_lda_10 t17_lda_10

So how does all this feel when I load the topics up and look at them?

Have to say, maybe not very excited. Mostly the topics make at least some sense, but many of those coherence measures show higher values for bigger numbers of topics. For example, the 100 iteration coherence for document sets 7 and 15 suggests that a set of around 150 topics would be great. Doc set 15 even has fewer documents than that. Manually looking at the generated topics, a large number of them are actually almost the same topic. They have mostly the same words, and very low weights for topics/words, meaning very few words in the docs got assigned to the topics. So it would seem that for most purposes these document sets are better served by a lower number of topics. Unless maybe you want to capture really fine-grained differences in topics. Not sure what that would be good for, but maybe it has some use cases.
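
I did this check by eye, but the "almost the same topics" observation could also be checked with a few lines of code. The following is a minimal sketch of one way to do it, comparing the top word lists of each topic pair with Jaccard similarity. The function names, the 0.5 threshold and the toy topics are all assumptions for illustration, not something I ran on the actual data:

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words two word lists have in common (0.0 - 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate_topics(topic_words, threshold=0.5):
    """topic_words: dict of topic id -> list of top words.

    Returns (id_a, id_b, similarity) for each pair at or above threshold.
    """
    pairs = []
    for (id_a, words_a), (id_b, words_b) in combinations(sorted(topic_words.items()), 2):
        similarity = jaccard(words_a, words_b)
        if similarity >= threshold:
            pairs.append((id_a, id_b, similarity))
    return pairs

# Toy topics, made up for illustration.
topics = {
    0: ["river", "lake", "park", "creek"],
    1: ["river", "lake", "park", "forest"],
    2: ["album", "song", "band", "chart"],
}
duplicates = near_duplicate_topics(topics)
print(duplicates)
```

With a real model the word lists could come from something like lda.show_topic(t, topn=20), and a long list of near-duplicate pairs would be a hint that the topic count is too high.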

So if a smaller number of topics is better, maybe I need to try an even smaller number of topics. Seems reasonable given the smallish number of documents I have. Like number of topics at 5, 10, 15, 20. See where that takes me. Here we go:

Doc set id coherence (autotuned parameters, 100 iterations)
1 t1s_lda_100
2 t2s_lda_100
3 t3s_lda_100
4 t4s_lda_100
5 t5s_lda_100
6 t6s_lda_100
7 t7s_lda_100
8 t8s_lda_100
9 t9s_lda_100
10 t10s_lda_100
11 t11s_lda_100
12 t12s_lda_100
13 t13s_lda_100
14 t14s_lda_100
15 t15s_lda_100
16 t16s_lda_10
17 t17_lda_10

Comparing these figures with the earlier ones for topic counts 25-200, the lower numbers of topics generally scored better here. Just for a quick comparison, most of these 2-20 sizes have their highest score close to -0.5 to -0.7, while the best scores for 25-200 were closer to -1.0. The exception is again document set 15, which trolls us once more with a value close to -0.8 at both 3 and 150 topics. Eh.

For final comparison and seeing what I think of the topics found at different sizes, I simply manually examined the topics by printing them to files like so:

__author__ = 'teemu kanstren'

from gensim.models import LdaModel
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from collections import defaultdict

import gensim
import operator, logging, sys

def configure_log(log_path, log_name):
    #again, this configure_log is the same as in previous samples so not repeating..
    pass #placeholder so the file still parses; paste the real body here

def process_lda_model(dict, model_file, topic_count, docs):
    log = logging.getLogger("bob")

    lda = LdaModel.load(model_file, mmap='r')
    topic_words = {}
    for t in range(topic_count):
        # top is now list of tuples (word, probability). topn=number of words to take
        top = lda.show_topic(t, topn=100)
        topic_words[t] = top

    #now calculate the size (or "relevance") of each topic.
    #meaning how large a portion of all docs was assigned to each topic.

    topic_sizes = defaultdict(int)

    for doc in docs:
        doc_bow = dict.doc2bow(doc)
        dist = lda[doc_bow]
        for topic_word in dist:
            #count topic sizes by summing the percentage of all words in all docs assigned to that topic
            #(note: instances of one word can be in different topics across the doc)
            topic_id = topic_word[0]
            percent = topic_word[1]
            topic_sizes[topic_id] += percent
    log.info("sized topics")

    #now calculate the size (or "relevance") of each word in each topic in relation to other topics
    #so if word "hello" is 90% of topic A, which is itself 90% of all docs, "hello" gets a size of 0.9*0.9 for topic A

    topic_words_weighted = {}
    for t in range(topic_count):
        t_words = topic_words[t] #get the top words for this topic as stored before
        topic_size = topic_sizes[t] #the weight/size/relevance of this topic as calculated before
        tw_words = [] #to hold list of weighted words for this topic
        topic_words_weighted[t] = tw_words
        for word, percent in t_words:
            my_tuple = (word, percent * topic_size)
            tw_words.append(my_tuple)

    log.info("sized words")

    #sort the topics in numerical order so sorted_topics contains them in order topic 0, topic 1, topic 2, ...
    sorted_topics = sorted(topic_sizes.items(), key=operator.itemgetter(0))

    #finally, create a nice file to write it all out in my favourite format
    file_data = ""

    for topic in sorted_topics:
        topic_id = topic[0]
        file_data += "topic"+str(topic_id)+"="
        tww = topic_words_weighted[topic_id]
        for tw in tww:
            #word sizes are floats, and typically quite small ones. like 0-10 or so. 
            #multiply by 100 to give the values some diff when converted to ints
            word_size = int(tw[1]*100)
            file_data += tw[0]+"["+str(word_size)+"] "
        file_data += "\n"

    log.info("built file data")
    print(file_data)
    return file_data

#create the weighted word list for the docs given by the elasticsearch query stored in "s"
#assume lda models are stored under "dirname" in "fname" with specific extensions
def process_model(s, dirname, fname):
    log = logging.getLogger("bob")
    configure_log(dirname, dirname+"_topicbulklister.log")
    dict = gensim.corpora.Dictionary.load(dirname+"/"+dirname+'_hellome.dict')
    docs = []
    count = 0
    for hit in s.scan():
        count += 1
        #taking the lazy way out here, loading all docs into memory for processing
        #mostly because my doc sets are small and i got tired of optimizing everything when no real need
        #of course, it would be nice to have an example of doing it right for real cases later..
        docs.append(hit.contents.split())

    log.info("loaded "+str(count)+" docs for:" + dirname)
    sizes = [25, 50, 75, 100, 125, 150, 175, 200]
    for size in sizes:
        #these would be models with 10 iterations
        log.info("processing model size:" + str(size))
        model_file = dirname +"/"+ fname + "_a_model" + str(size) + ".lda"
        file_data = process_lda_model(dict, model_file, size, docs)
        f = open(dirname+"/topics_a"+str(size)+".txt", 'w')
        f.write(file_data)
        f.close()

        #these would be models run with 100 iterations
        log.info("processing a100_model size:" + str(size))
        model_file = dirname +"/"+ fname + "_a_model_100" + str(size) + ".lda"
        file_data = process_lda_model(dict, model_file, size, docs)
        f = open(dirname+"/topics_a100_"+str(size)+".txt", 'w')
        f.write(file_data)
        f.close()


es = Elasticsearch()

#https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
#http://miningthedetails.com/blog/python/lda/GensimLDA/
#https://groups.google.com/forum/#!topic/gensim/s4OivwKdfng

mapping = es.indices.get_mapping("my_index")
# find all document types in the mapping
keys = mapping["my_index"]["mappings"].keys()
types = [key for key in keys]

for type in types:
    #this is simply if you want to combine several, so the ES query is just a list for doc_type
    s = Search(using=es, index="my_index", doc_type=[type, type+"_extra_field"]) \
        .query("match_all").sort("doc_id")
    process_model(s, type, type)
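
To make the weighting in process_lda_model a bit more concrete, here is a tiny self-contained sketch of the same two steps with made-up numbers: first sum up the per-document topic shares into topic sizes, then multiply each word probability by its topic's size. The data here is hypothetical and only mirrors the shapes used above (lists of (topic_id, share) and (word, probability) tuples):

```python
from collections import defaultdict

# Hypothetical per-document topic distributions: lists of (topic_id, share).
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.8), (1, 0.2)],
]
# Hypothetical top words per topic: lists of (word, probability).
topic_words = {
    0: [("hello", 0.9), ("world", 0.1)],
    1: [("goodbye", 0.5), ("moon", 0.5)],
}

# Step 1: the size of a topic is the sum of its shares over all docs.
topic_sizes = defaultdict(float)
for dist in doc_topics:
    for topic_id, share in dist:
        topic_sizes[topic_id] += share

# Step 2: weight each word by the size of the topic it belongs to.
weighted = {
    t: [(word, prob * topic_sizes[t]) for word, prob in words]
    for t, words in topic_words.items()
}
print(dict(topic_sizes))
print(weighted)
```

Note that the sizes are sums and not averages, so they grow with the number of documents. That is also why the weights in the topic listings can get quite large for a big corpus.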

After dumping all my doc sets (1-17) like this, and looking at the ones getting the highest and lowest coherence values, I could not really say that the topics were any better where the coherence values were highest. Certainly for these small document sets, the smaller topic counts were better if looking for clearly distinct topics, which I think is what most people would look for. So I am sure there is some value here. And trying out the more accurate coherence metrics such as c_v (as discussed earlier in this post) would probably give better results. Maybe someday.

Alternatively, for a more visual exploration, there is also the option to use the LDAvis package. Wikipedia example:

__author__ = 'teemu kanstren'

import gensim
import pyLDAvis.gensim
import sys
import logging

size = int(sys.argv[1])
dir = "lda"+str(size)+"/"

logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s]  %(message)s")
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.DEBUG)

fileHandler = logging.FileHandler(dir+"ldavis"+str(size)+".log")
fileHandler.setFormatter(logFormatter)
rootLogger.addHandler(fileHandler)

consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
rootLogger.addHandler(consoleHandler)

log = logging.getLogger("bob")

log.info("processing model size:"+str(size))

log.info("loading dictionary")
dictionary = gensim.corpora.Dictionary.load_from_text('wikires_wordids.txt.bz2')
log.info("loading corpus")
corpus = gensim.corpora.MmCorpus('wikires_tfidf.mm')
log.info("loading lda")
lda = gensim.models.ldamodel.LdaModel.load(dir+'model'+str(size)+'.lda')

log.info("preparing model")
p = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
log.info("saving HTML")
pyLDAvis.save_html(p, dir+'lda'+str(size)+'.html')
log.info("done")

This dumps the whole LDAvis thing into an HTML file you can then load up any time later and play with. The nice thing about this is that it can be run on a headless remote server, and it produces a single HTML file (a bit large but anyway). This HTML file can then be downloaded and opened as a local file. So no webserver is needed anywhere, and the interactive visualization can be shared as a single file.

How does it look? To continue avoiding dumping the Finnish datasets here, I use examples for 25, 100 and 200 topics from Wikipedia:

25:
ldavis25

100:
ldavis100

200:
ldavis200

The first (and biggest) topic in the list of 25 is related to movies. Same for the 100 topics. In 200 topics, music takes the first spot. In 200, the second is about novels (book), third football, and finally movies come fourth.

In the LDAvis figure here for 25 topics, the cluster of four smaller ones on the right is related to Asian countries. In the topic word list below for 25 topics, these are topics 4, 14, 16, and 20. The numbering is just different because they are ordered differently. The LDAvis figure above for 200 topics also has a cluster of small ones on the left, with many of those for countries/states but also some for other topics such as chess, church, weightlifting and more. I am sure this would also be an interesting topic to study, why PCA groups them together.

In general, there are a number of parameters to play with in LDAvis, and I don’t pretend to know all of/about them. For example, you can cycle through the topics using the controls on the top as well. A handy tool for topic exploration.

But I do also prefer just using the textual outputs of the topics as shown below, to see a large number of topics at once instead of cycling through them one at a time. Maybe some combination would work best.

The 25 and 100 topics from Wikipedia, produced with my text output code above:

25 Wikipedia topics (I manually tried to cut these to the top 20 words from the 100 I printed, so it's ~20 words each):

topic0=missouri[103342] wisconsin[87078] iowa[73418] virginia[70289] illinois[69885] arkansas[69130] carolina[68071] michigan[65583] ohio[60676] texas[60572] community[57331] washington[56913] indiana[54950] oregon[50765] florida[49446] district[46548] tennessee[46349] georgia[45458] california[45178] minnesota[45132] 
topic1=radio[76316] fm[67433] tv[52798] station[48613] channel[45489] television[39179] news[38537] broadcast[32143] broadcasting[27691] suffusion[26570] show[25864] am[24653] intelsat[24026] network[23193] owned[22635] pm[20234] presenter[17802] format[15873] program[15775] satellite[15183] 
topic2=village[221007] river[158933] district[140531] population[132051] km[116103] lake[111496] census[93802] island[90835] workers[84431] mountain[78471] park[74170] municipality[69123] creek[66698] reserve[65274] villages[63916] region[62479] road[61653] forest[61465] nearest[58958] town[58572] 
topic3=historic[213169] building[206192] station[165629] railway[147564] church[125265] register[124965] listed[100791] places[100328] street[90542] buildings[90221] brick[79067] roof[76969] bridge[72999] story[70627] road[62897] tower[62769] style[59430] district[57602] construction[57450] stone[54772] 
topic4=bangladesh[60954] india[47358] indian[43847] singh[37676] delhi[22949] kumar[22606] ludhiana[22469] sarpanch[21241] punjab[21091] bengal[19677] dhaka[19650] nepal[18896] hindi[17747] maharashtra[14853] raj[14251] bengali[14100] mumbai[13728] assam[13684] ram[12357] bangladeshi[11979]
topic5=mollusca[32685] mandal[26963] vijayawada[25245] space[22879] physics[18043] earth[17861] satellite[17756] ngc[17519] theory[17341] mathematics[17237] mathematical[16721] star[16649] subsp[15672] orbit[15671] solar[15360] indistinct[15112] purplish[14796] quantum[14606] observatory[14220] fascia[14179] 
topic6=art[151996] museum[105653] gallery[67696] painting[64919] artist[50553] exhibition[50485] painter[46831] paintings[42141] arts[41286] jpg[36333] artists[35557] sculpture[34781] temple[33499] exhibitions[33377] works[33171] collection[32549] meyrick[31776] fine[28686] file[26993] exhibited[26811] 
topic7=la[148371] le[80186] french[73902] german[71216] des[69606] italian[61779] der[61079] paris[60412] du[55323] del[54742] et[53219] spanish[53162] france[51363] jean[51084] von[49691] les[48917] el[46300] di[45814] josé[44796] und[42241] 
topic8=orchestra[44592] opera[35813] composer[34666] piano[29442] symphony[24092] conductor[18077] ballet[17176] violin[16627] choir[16239] musical[14962] pianist[14198] ensemble[13952] performed[13784] soprano[13767] composition[13440] concert[13162] concerto[13151] festival[13137] agder[12684] quartet[11731] 
topic9=episode[105985] films[103371] award[100538] television[100108] directed[93229] cast[92256] tv[91624] awards[90325] festival[89783] actor[88894] novel[84196] role[83970] drama[81154] theatre[79356] actress[79129] story[78659] director[78350] book[77382] episodes[71225] show[67188] 
topic10=research[125717] professor[103512] education[98381] science[92598] institute[90969] society[70253] students[68236] medical[66741] journal[66189] women[63658] studies[62359] award[60876] health[58041] sciences[56046] degree[55455] social[53099] engineering[49884] association[48622] director[48312] department[48123]
topic11=game[76288] software[49848] tamil[47686] india[47314] data[37736] business[33861] mobile[31384] indian[31195] app[31181] companies[30977] bank[30409] services[30306] million[30044] users[29862] http[29824] com[29450] technology[28942] founded[28176] platform[27679] online[27549]
topic12=bishop[183424] church[156784] catholic[100783] roman[92168] cathedral[61440] pope[58868] diocese[56411] priest[48203] king[48191] archbishop[45448] saint[37550] titular[35613] ordained[34872] religious[32680] papacy[32656] appointed[32149] consecrated[32043] prelate[31535] ancient[30725] holy[30355] 
topic13=scottish[70752] london[66229] sir[61661] william[57129] edinburgh[55924] married[53191] england[53094] scotland[50931] royal[50117] wales[49541] son[44603] ireland[42954] educated[38190] glasgow[37848] thomas[37705] henry[35982] george[35583] james[35313] daughter[35242] irish[34458]
topic14=hong[52137] kong[46242] korean[42778] kim[39075] norwegian[38362] chinese[35425] peakposition[34595] korea[33229] swedish[31117] china[27414] taiwan[26788] lee[25540] thailand[25034] qualifier[23037] thai[21629] norway[20943] jung[19624] min[19478] bangkok[19244] chen[18922] 
topic15=album[361182] song[256952] chart[187401] band[164060] track[134572] vocals[118688] guitar[109699] label[100924] songs[98269] listing[97232] you[96709] studio[95459] records[91948] albums[91396] charts[90836] release[86768] singles[84327] video[82745] singer[80484] bass[76230] 
topic16=japan[58447] japanese[57450] tokyo[33408] termen[29464] albanian[24897] anime[19283] albania[18629] fuji[17357] manga[15986] prefecture[15531] tbs[15178] osaka[12928] ntv[10843] tirana[9896] kyoto[8861] nagano[8601] ni[8014] nippon[7697] kazakhstan[7353] niigata[6965]
topic17=army[114556] regiment[93198] military[74472] navy[70157] division[68054] aircraft[65031] air[65002] ship[64259] infantry[63562] brigade[55079] commander[54114] battle[52332] corps[51897] command[49360] force[46507] naval[46317] forces[45781] battalion[44770] officer[41944] ships[41202]
topic18=al[101883] russian[82869] pakistan[49870] ukrainian[43693] ali[43688] sri[43566] khan[40688] soviet[39267] turkish[38736] moscow[37860] ukraine[35605] iran[35236] polish[33661] russia[30179] islamic[29989] indian[29284] mosque[28952] india[27978] turkey[27629] constituency[27275] 
topic19=league[349638] football[331012] cup[244141] club[238609] tournament[234016] championships[208445] championship[207165] round[184733] player[168229] goals[165863] games[164860] women[158768] coach[156908] basketball[153931] teams[149110] apps[147840] division[143359] professional[125977] match[125015] fc[120375] 
topic20=serbian[38639] china[36968] chinese[35505] serbia[28052] li[24803] croatian[22860] bosnia[19552] belgrade[19534] zhang[19404] wang[19349] herzegovina[16988] segunda[16327] greek[15680] croatia[15431] beijing[14989] rebounds[14775] liu[14475] yugoslav[13266] zagreb[12612] chen[12575]
topic21=engine[39614] energy[35303] power[34124] protein[33132] car[31071] model[30585] cells[29632] gas[27840] design[27833] production[27820] plant[27718] water[27239] system[24462] weight[24101] chemical[22907] acid[21817] gene[21771] cars[21734] type[21152] development[20916]
topic22=party[223593] election[194792] minister[116518] president[104109] elected[97838] council[93701] law[89045] democratic[88397] court[86376] elections[82934] political[81414] assembly[77292] votes[71977] politician[67919] committee[67674] parliament[67278] secretary[65255] union[63703] legislative[63331] police[61634] 
topic23=species[273887] genus[107924] fuscous[90171] mm[89481] forewings[76966] moth[71553] hindwings[67873] described[64660] grows[61203] wingspan[60843] dark[60647] costa[58702] grey[58502] shrub[58279] flowers[57526] ochreous[51859] australia[50788] description[48983] brown[48945] whitish[47779]
topic24=mf[54407] df[43579] outscored[43032] michael[27588] george[27293] james[27233] david[26909] cast[26476] robert[25935] paul[24548] jack[21916] william[21808] smith[21616] peter[21466] richard[20760] frank[20154] ap[19931] tom[19720] joe[18140] directed[18092] 
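
I cut the lines above by hand, but the same trimming could be done with a small helper. This is just a sketch, assuming the "topicN=word[weight] word[weight] ..." line format written by my text output code above. The keep_top_words name is made up for this example:

```python
def keep_top_words(line, n=20):
    """Trim a 'topicN=word[weight] word[weight] ...' line to its first n entries."""
    name, _, rest = line.partition("=")
    entries = rest.split()  # entries are whitespace separated and already sorted by weight
    return name + "=" + " ".join(entries[:n])

# First few entries of topic0 from the listing above.
sample = "topic0=missouri[103342] wisconsin[87078] iowa[73418] virginia[70289]"
trimmed = keep_top_words(sample, n=2)
print(trimmed)
```

Running this over each line of one of the topics_a*.txt files would give the shortened listing without any manual editing.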

100 Wikipedia topics (too many topics here so did not manually try to cut it):

topic0=ufc[14709] cornwall[6614] akron[5052] quercus[5002] choke[3639] viaduct[3550] diablos[3463] nani[3381] cornish[3153] hokuriku[3095] zombie[2958] amarillo[2874] quezon[2823] cove[2805] shingle[2664] llanelli[2557] hyeon[2525] lubbock[2443] shooto[2318] bacolod[2253] boku[2209] devonport[2175] belltower[2106] aru[2044] tachi[2000] watashi[1924] quilt[1917] viterbo[1905] aki[1894] grahamstown[1894] angelica[1864] grosvenor[1835] jiu[1812] kacper[1745] yarmouth[1715] volgograd[1706] naru[1694] ives[1686] tomsk[1679] lawton[1665] chinatown[1615] vulgare[1612] bonifacio[1592] chelmsford[1574] pasco[1572] falmouth[1571] dorchester[1557] talmadge[1554] arnheim[1551] jitsu[1544] lunenburg[1542] carousel[1542] truro[1522] zombies[1518] herrero[1509] redruth[1474] brera[1468] águila[1443] rockville[1438] roswell[1434] atif[1417] devon[1417] christi[1411] alston[1404] lenox[1386] anata[1385] llm[1381] usta[1372] mana[1369] mojave[1362] kore[1331] gracie[1328] petrucci[1327] markham[1316] rockaway[1314] laredo[1314] mccord[1313] sherborne[1298] koti[1283] dutchess[1277] riggs[1252] barnstaple[1237] coney[1232] kono[1228] yell[1213] galán[1210] farris[1206] kanto[1205] mcallen[1203] winona[1183] tsa[1170] glitch[1157] buller[1155] nationaal[1152] bia[1144] sphagnum[1139] launceston[1132] bernardino[1116] woodbine[1111] reale[1110] 
topic1=russian[86175] bwf[66598] soviet[41495] moscow[40087] russia[36133] ukrainian[31745] ukraine[28202] hurdles[19550] vladimir[19430] armenian[14691] petersburg[14496] kazakhstan[12926] azerbaijan[12492] ussr[11904] saint[11006] mikhail[10926] armenia[10709] belarusian[10337] ivan[10284] nikolai[10273] alexander[10230] kiev[10169] sergey[9644] latvian[9465] ru[8885] union[8538] aleksandr[8422] georgian[8237] leningrad[7996] sergei[7734] на[7652] freestyle[7589] belarus[7582] azerbaijani[7228] dmitry[7075] latvia[6959] lenin[6831] riga[6595] boris[6426] lithuanian[6339] rostov[5940] andrei[5905] ssr[5866] konstantin[5819] backstroke[5784] pavel[5769] kazan[5688] oleg[5596] yuri[5595] igor[5324] stanislaus[5299] federation[5248] alexey[5109] viktor[5068] bolsheviks[4999] leonid[4986] lithuania[4954] republic[4886] stalin[4885] vasily[4867] pyotr[4826] crimea[4793] duma[4771] romanov[4737] featherweight[4621] almaty[4603] kyrgyzstan[4521] kazakh[4472] anna[4363] medley[4305] flanker[4203] uzbekistan[4177] olga[4136] caucasus[3973] botswana[3968] purge[3891] imperial[3857] по[3840] putin[3738] turkmenistan[3635] ivanov[3623] novgorod[3586] ural[3449] anastasia[3407] siberian[3393] alexei[3208] flyweight[3108] doubles[3024] bantamweight[3015] poltava[3001] empire[2943] surname[2928] maxim[2927] ufa[2924] greek[2923] graduated[2911] georgi[2906] disbanded[2905] player[2892] siberia[2885] 
topic2=acacia[26697] suffused[18767] oblique[10993] fifths[10506] fourths[9659] ell[9292] estrogen[6430] certifications[6119] testosterone[5572] estradiol[5196] snep[4692] blackish[4586] fimi[4473] lh[3303] androgen[3157] ultratop[3129] umass[3044] aas[2972] nz[2849] ant[2830] anabolic[2662] steroid[2532] lista[2444] crib[2338] fabricius[2304] thi[2139] progesterone[2123] ifpi[2055] bpi[2033] vg[1878] giannis[1821] nirmal[1821] pinball[1813] nirmala[1755] hitparade[1694] stinging[1642] kelso[1619] estrogens[1591] suomen[1570] bình[1529] invicta[1528] saito[1517] artem[1497] anh[1487] bp[1457] occ[1394] transporter[1390] nh[1389] wallaroo[1360] sixths[1332] iosif[1322] alcorn[1319] petiole[1303] ethyl[1276] educationist[1258] tran[1254] scoreless[1251] entomologist[1248] paw[1243] grayish[1232] professorships[1180] oriya[1174] intermedia[1171] staudinger[1164] wallonia[1137] hasbro[1112] pce[1087] danang[1081] rasa[1061] bpm[1057] bombus[1046] alder[1034] platformer[1022] amer[1017] đồng[1002] subunit[988] lindner[980] ios[975] ngai[969] basheer[965] bindi[957] gorman[952] hòa[948] oud[940] setar[935] panjab[934] nettles[933] brunner[902] cheetahs[902] bathinda[900] dawley[891] neuro[887] ahr[885] steroids[882] parsecs[880] dimethyl[875] dur[874] sahni[873] falcón[872] ura[871] 
topic3=village[109355] van[97424] dutch[86857] district[81290] municipality[78818] census[68729] population[66663] netherlands[51807] administrative[43905] amsterdam[37228] belgian[33340] settlement[32358] town[28706] province[28269] antwerp[26004] governorate[25952] rural[24730] belgium[23748] villages[23039] region[21933] inhabitants[21374] urban[21286] municipalities[20903] municipal[20388] utrecht[19781] het[19129] km[18919] brussels[18885] community[18634] der[18300] geography[17924] canton[17439] ghent[16428] rotterdam[15092] reorganisation[15005] jan[14911] flemish[14900] jpg[14850] flanders[14518] seat[14282] localities[14236] den[14032] liège[14008] leuven[13419] settlements[13306] republic[13264] willem[13032] file[12861] zambia[12806] division[12731] hague[12532] groningen[12448] center[12404] towns[12052] according[11938] river[11890] craftsman[11795] districts[11787] northern[11585] en[11513] leiden[11433] openstreetmap[11282] pieter[11231] haarlem[11033] consists[10939] nl[10694] holland[10557] divisions[10547] cities[10216] sint[10013] frans[9621] centre[9394] created[9380] bureau[9330] brabant[9082] church[8803] okrug[8795] bruges[8710] situated[8688] demographics[8589] suriname[8569] capital[8311] een[8257] effect[8099] surinamese[8045] sdf[8013] mechelen[8000] nijmegen[7975] zambian[7888] nederland[7879] limburg[7875] jurisdiction[7788] land[7736] divided[7722] delft[7710] voor[7706] central[7479] border[7408] norway[7344] arti[7342] 
topic4=river[144700] lake[113130] park[85940] creek[80149] island[75102] mountain[74717] forest[64862] reserve[62474] water[52777] site[50826] conservation[47660] stream[46429] valley[40905] region[40730] flows[40716] tributary[37487] mountains[37008] land[36780] bay[36438] nature[34530] lighthouse[34487] wildlife[34463] village[34400] rivers[33915] sea[33073] km[31437] species[31426] natural[30547] lies[29973] northern[29535] district[29350] protected[28939] range[28891] areas[28859] basin[28198] mount[27906] locality[27849] western[27748] southern[26962] province[26934] birds[26082] cave[25641] coast[25198] islands[25112] trail[24699] trees[24513] hipped[24159] elevation[24137] australia[23805] situated[23769] hill[23518] eastern[23385] town[23381] meters[23323] road[23186] southwest[22865] northwest[22762] confluence[22391] dam[22130] peak[21994] fish[21661] municipality[21330] northeast[21243] beach[21191] lakes[21074] peninsula[21034] flora[20955] rock[20762] forests[20256] above[20159] location[19914] point[19840] summit[19748] southeast[19698] fishing[19559] reservoir[19511] fauna[19157] jpg[18836] archaeological[18566] approximately[18498] border[18305] andes[18256] hills[17980] mouth[17872] geography[17837] canyon[17780] route[17663] formation[17231] climate[17134] blooms[17048] vegetation[17008] level[16838] parks[16303] access[16235] population[16118] cattle[16090] woodland[15944] source[15750] height[15743] rocks[15725] 
topic5=zealand[44443] fa[27430] auckland[23845] manchester[21181] england[21153] london[20403] town[19447] wellington[17898] yorkshire[17373] leeds[17219] councillors[15904] sheffield[15658] liverpool[15104] christchurch[14995] lancashire[14533] canterbury[14082] borough[14068] ward[13976] bradford[13950] nottingham[13657] archdeacon[13336] wales[13281] leicester[13136] bristol[13019] cardiff[12311] birmingham[12167] hibernian[12056] wards[11719] halifax[10861] midlothian[10837] park[10572] ontario[10460] scotia[10200] midlands[9725] newcastle[9638] welsh[9357] nova[9323] hull[8834] bowls[8702] council[8461] oldham[8461] durham[8328] otago[8252] scorers[8251] hon[8014] newfoundland[7959] essex[7799] brighton[7590] educated[7461] coventry[7390] chelsea[7379] unionist[7304] curling[7250] alberta[6996] stoke[6928] sunderland[6889] redistribution[6884] plymouth[6864] aston[6796] dunedin[6771] lib[6758] kingston[6705] exeter[6693] huddersfield[6684] attendance[6603] salford[6514] peterborough[6475] swindon[6446] middlesbrough[6411] watford[6360] cambridge[6354] bolton[6320] barrow[6313] bucurești[6260] scorer[6260] johnstone[6153] ipswich[6129] cheshire[6079] ireland[6052] barnet[6045] vale[5989] preston[5941] prop[5911] charlton[5889] wolverhampton[5832] southend[5626] northern[5527] manitoba[5507] davies[5486] athletic[5475] kensington[5470] canadian[5418] oxford[5417] ham[5371] stockport[5357] canada[5355] wembley[5321] queensland[5298] score[5277] sutton[5216] 
topic6=orchestra[35771] opera[30905] composer[28564] piano[21415] symphony[19113] festival[16541] ballet[15497] theatre[15462] gymnastics[15316] conductor[14614] musical[13243] ensemble[12998] choir[12860] violin[12808] performed[11796] pianist[11778] dance[11451] soprano[10907] concert[10711] directed[10634] gymnast[10602] conservatory[10567] concerto[10255] cast[10094] quartet[9296] composition[9177] frau[9061] theater[9058] starring[8832] classical[8608] op[8607] philharmonic[8498] studied[8125] chamber[8068] vaudeville[7625] director[7610] bach[7564] singer[7434] composed[7383] telenovela[7377] prize[7369] composers[7184] gma[7140] yoo[7122] teatro[7120] cbn[7016] abs[6888] works[6756] competition[6739] cello[6736] violinist[6604] artistic[6597] bibliography[6591] organist[6574] maria[6515] rhythmic[6462] drama[6402] dancer[6329] concerts[6252] soloist[6192] string[6182] jazz[6062] concise[5914] libretto[5852] clarinet[5828] premiere[5705] flute[5678] performance[5640] italian[5627] viola[5607] choral[5483] act[5412] anna[5255] rmnz[5235] mozart[5139] cinema[5133] teacher[5116] solo[5111] performances[5092] sonata[5090] la[5068] compositions[5061] tenor[5038] conducted[5029] ehf[5003] elena[4982] screened[4848] orchestras[4834] voice[4775] orchestral[4736] singing[4682] di[4632] premiered[4591] piece[4575] beethoven[4508] folkloric[4503] acts[4500] comedy[4469] silent[4419] performing[4416] 
topic7=missouri[89320] wisconsin[67139] iowa[59342] community[58234] virginia[50291] carolina[48249] illinois[46772] unincorporated[46333] porch[45947] vermont[43797] ohio[42105] maine[40117] arkansas[38441] tennessee[37574] railroad[37546] oregon[36736] indiana[34821] texas[33894] office[33652] alabama[32494] italianate[32284] post[31448] mississippi[31417] washington[29996] georgia[29282] pennsylvania[28409] kentucky[27526] kansas[26582] florida[23025] creek[22903] louisiana[22693] michigan[22401] massachusetts[22141] nc[22074] township[21957] maryland[21784] district[21039] town[20622] dakota[20567] oklahoma[20531] established[20255] nebraska[19566] jersey[18595] chicago[18236] remained[17690] minnesota[17594] operation[17364] louisville[17253] elementary[16629] historic[16451] schools[16357] franklin[16264] california[16201] moved[16194] delaware[15531] portland[15413] utah[15250] colorado[14996] route[14978] springs[14731] jefferson[14595] cemetery[14551] river[14545] milwaukee[14238] sec[14093] madison[14071] connecticut[14020] nashville[13944] miles[13489] william[13316] fort[13311] sioux[13247] lake[13246] jackson[13224] richmond[13101] charleston[13067] arizona[13029] lincoln[12906] bays[12777] burlington[12403] hill[12385] baltimore[12348] farm[12217] montgomery[12182] hampshire[12072] counties[11942] register[11682] ozarks[11622] nevada[11563] ld[11247] wyoming[11205] salem[11059] rhode[11049] fbs[10974] center[10953] valley[10821] farmstead[10779] orleans[10764] grove[10741] monroe[10638] 
topic8=league[263636] cup[220871] club[218025] championships[181957] football[161841] goals[159024] apps[146890] round[133792] championship[125108] women[118141] tournament[117783] fc[111936] teams[109893] player[106082] rugby[100262] games[97071] match[96036] rank[87517] draw[85827] division[85120] plays[83401] olympics[82290] event[81582] men[81047] competition[79039] footballer[77007] medal[76548] competed[75988] matches[75155] professional[72169] debut[71098] finals[69955] profile[68986] stadium[68557] metres[66188] champions[64770] summer[64106] results[63938] points[62208] european[61932] squad[58958] bronze[58547] players[56702] junior[56506] score[55013] playing[54854] olympic[54736] premier[51783] youth[51510] athlete[51163] liga[50601] gold[49455] statistics[46849] athletics[46726] volleyball[45992] champion[45221] qualified[44915] sports[43907] win[43020] silver[42695] qualification[42439] scored[42194] indoor[42081] loan[41377] play[41314] competitions[41285] winner[40405] heat[39794] qualifying[38926] clubs[38828] nationality[38800] coach[38685] winners[38451] midfielder[38392] runner[37707] nd[37468] opponent[37340] goal[36729] side[36602] badminton[35684] senior[35359] semi[35164] seeds[34921] rd[34874] challenge[34836] result[34777] uefa[34595] finished[34304] relay[33733] table[33213] record[32957] game[32393] appearances[32378] represented[32304] super[32271] sport[32164] title[31349] half[31237] level[30897] signed[30558] 
topic9=album[361508] song[250314] chart[187669] band[164009] track[132615] vocals[118988] guitar[110313] label[100615] songs[97321] listing[96822] studio[92239] albums[91577] records[90983] charts[90979] you[90448] singles[81265] release[81136] bass[77378] singer[77132] video[75510] billboard[75083] recorded[72940] tracks[70983] ep[70211] jazz[69051] drums[67918] rock[64948] love[63662] me[62255] recording[58899] artist[58120] digital[56808] cd[56142] download[54913] peakposition[53856] personnel[53695] pop[51970] live[50545] my[50247] producer[50013] featuring[49763] debut[48787] discography[48209] songwriter[43493] piano[41466] hot[40221] performed[39497] tour[39486] record[39388] written[37762] lead[37412] us[35568] peak[35432] dj[34982] hop[34931] saxophone[34907] reception[34621] blues[34225] sound[34024] peaked[33650] format[33552] hip[33443] lyrics[33059] remix[33036] solo[33027] dance[33013] artists[32878] date[32599] production[32403] performance[31775] eurovision[31580] title[30821] radio[30676] musician[30435] your[30340] version[30125] produced[29797] youtube[29292] we[29176] percussion[29166] uk[28805] allmusic[28730] musical[28721] guitarist[28573] keyboards[27759] don[27603] aria[27485] musicians[27099] backing[26959] background[26771] featured[26715] cover[26286] recordings[26191] mixing[25613] hit[25584] termen[25279] reached[25219] rapper[24782] duo[24744] weekly[23316] 
topic10=philippines[28452] philippine[20831] manila[16754] filipino[16695] language[9832] tag[8157] wwe[7682] ng[7374] eaves[7166] yerevan[6121] ang[6017] languages[5988] davao[5939] sunil[5607] och[5131] deaf[5115] clapboard[5052] nwa[4958] lucha[4893] mindanao[4829] deepak[4793] smokehouse[4421] rizal[4205] enugu[4124] sa[4020] aquino[3904] luzon[3781] assamese[3741] spinnin[3712] dialect[3707] frescoed[3535] mahi[3499] feu[3460] fayard[3408] anambra[3360] ni[3142] spoken[3039] venu[2987] sveriges[2965] laguna[2961] corazón[2924] kya[2879] zamboanga[2868] dialects[2824] belles[2809] oaxaca[2801] ghar[2783] libre[2758] akshay[2740] njpw[2702] madhav[2650] sanam[2624] dictionary[2621] sab[2621] speakers[2611] för[2517] universel[2491] cuenca[2478] filipinos[2462] word[2460] metro[2453] ka[2440] na[2411] vowel[2400] arroyo[2375] abia[2371] gucci[2333] naga[2324] cagayan[2297] nisha[2273] researchgate[2265] occidental[2205] sta[2172] tawi[2160] anupam[2102] wcw[2095] más[2087] words[2060] names[2016] visayas[2014] marcos[1973] minori[1970] hombre[1944] moro[1929] ett[1893] mo[1889] phonology[1883] sur[1881] ahrar[1875] det[1847] smackdown[1841] wrestled[1839] piya[1824] cervantes[1807] heures[1804] fils[1773] chua[1771] uppsala[1765] cotabato[1754] jose[1738] 
topic11=al[90513] ali[31307] islamic[29672] pakistan[28932] iran[27595] khan[26711] iranian[23927] mosque[23089] arab[21416] ahmed[19995] mohammad[19483] ibn[19123] syria[17976] thai[17756] abu[17592] muhammad[17546] saudi[17159] iraq[16853] arabic[16492] muslim[15818] pakistani[15164] thailand[14804] islam[14777] bangkok[14755] egypt[13893] ahmad[13464] el[13382] abdul[13079] mohamed[12695] iraqi[12160] afghanistan[11857] sheikh[11689] egyptian[11666] persian[11621] bin[11101] hassan[10870] shah[10687] aleppo[10406] arabia[10209] abdullah[9898] mohammed[9752] kuwait[8719] cairo[8577] ibrahim[8553] yemen[8281] rahman[8137] raion[8013] dubai[7739] afghan[7635] syed[7622] emirates[7459] sudan[7269] nakhon[7111] hasan[6897] bahrain[6754] muslims[6663] mirza[6591] imam[6348] baghdad[6345] hussein[6323] jordan[6263] morocco[6255] ismail[6228] maccabi[5926] sidi[5778] amir[5768] oman[5754] reza[5687] moroccan[5481] islamabad[5477] taliban[5308] sharif[5299] abd[5244] libya[5225] malik[5205] khalid[5157] shia[5146] province[5144] ul[5144] damascus[5095] sultan[5062] omar[4899] karim[4646] rashid[4639] hamid[4607] algeria[4550] medina[4479] khalifa[4474] arabian[4340] kabul[4328] mahmoud[4320] khaled[4293] din[4197] amin[4195] ambassador[4127] lebanese[4106] minister[4069] lebanon[4042] tunisia[4024] dhabi[4014] 
topic12=station[187036] railway[171967] bangladesh[82225] train[39954] trains[34326] road[31680] rail[28715] dhaka[28236] bus[28011] metro[27462] stations[27297] opened[26982] express[26722] junction[26120] passenger[25978] km[25396] uganda[24995] services[23384] district[22797] depot[22301] airport[21027] transport[20602] railways[19291] vijayawada[18672] platform[17594] town[16795] bangladeshi[16680] route[16098] village[15227] transit[15146] closed[14560] gauge[14403] cultivators[14045] situated[13798] traffic[13619] lines[13440] operated[13272] platforms[12980] section[12593] townland[12534] passengers[12458] construction[12209] branch[11973] govt[11924] kolkata[11916] terminus[11701] bengal[11294] delhi[11245] class[11063] jaipur[10823] halt[10439] chittagong[10422] terminal[10377] freight[10366] track[10272] wales[10244] via[9966] buses[9787] tram[9691] central[9564] cambridgeshire[9442] hossain[9361] kampala[9242] division[9027] transportation[8939] tangail[8936] queensland[8665] india[8591] derbyshire[8503] bengali[8454] nearest[8430] street[8391] ugandan[8334] goods[8255] stop[8187] shaheed[8149] side[8090] upazila[8027] aged[7874] railroad[7787] location[7783] western[7779] tracks[7708] rapid[7579] saurashtra[7539] projecting[7534] curacy[7534] zone[7478] household[7465] routes[7358] tramways[7352] chowdhury[7313] howrah[7290] facilities[7193] coast[7177] southern[7130] eastern[7109] code[7077] bridge[7072] trams[7035] 
topic13=pcc[11934] subterminal[6771] palsy[6568] wrexham[6258] pls[4319] aif[3312] antibody[3135] pd[3030] mykolaiv[2966] burrell[2901] manish[2722] cardiff[2700] sclerosis[2541] axillary[2492] zeller[2371] drooping[2334] motte[2240] psl[2221] toxin[2172] merthyr[2136] vejle[2089] sogn[2065] monmouthshire[2046] rhondda[2018] carcinoma[1969] bot[1878] caerphilly[1862] bridgend[1802] carmichael[1766] taf[1760] pk[1744] distal[1735] monoclonal[1731] sajid[1711] nines[1708] melanoma[1696] dbu[1672] nci[1656] physiotherapy[1615] blum[1610] mdm[1557] dione[1540] cervical[1535] mutations[1531] lymphoma[1528] antibodies[1521] snr[1516] selectivity[1474] tumour[1465] llandaff[1418] thyroid[1411] nanoparticles[1404] lesions[1388] bcl[1368] glamorgan[1354] whitchurch[1338] cynon[1332] ortho[1331] pkr[1306] jacobson[1293] marrow[1288] castell[1285] sternberg[1275] vertebrae[1272] transcriptional[1265] cdt[1262] chemotherapy[1262] apoptosis[1257] chirk[1253] nrg[1238] gait[1233] holyhead[1204] sma[1199] siegel[1194] protease[1175] janssen[1172] nanomaterials[1171] kazuma[1139] epstein[1129] taff[1129] gwilym[1117] akt[1106] tecnico[1100] proximal[1098] dystrophy[1098] orpheum[1087] therapist[1085] genital[1081] epo[1076] tia[1073] idw[1050] ord[1044] hpv[1034] arbeiter[1033] prognosis[1012] parañaque[1008] humerus[1008] autoimmune[1005] insulin[1004] horner[1002] 
topic14=surname[20701] david[19433] michael[19124] james[17093] player[16746] paul[15780] robert[15362] george[15139] jack[14993] tom[14721] smith[14293] steve[13863] joe[13580] peter[13297] frank[13200] mark[13122] richard[12767] chris[12041] jim[11898] tackles[11871] politician[11632] mike[11316] scott[11261] ryan[11178] aggies[11108] williams[11060] taylor[11014] william[10890] bill[10802] bob[10345] jones[10341] martin[10272] lee[10254] kevin[10116] footballer[10112] harry[10093] davis[10020] allen[9822] brian[9802] barry[9757] halfback[9709] tony[9671] ben[9667] charles[9366] jr[9327] australian[9136] sam[9127] wilson[9057] gary[8964] directed[8853] andrew[8736] starring[8699] johnson[8542] fred[8369] canadian[8277] brown[8274] thomas[8158] alex[8135] billy[7902] ian[7864] mitchell[7824] matt[7693] jason[7627] tim[7620] jimmy[7593] alan[7522] pat[7501] brien[7473] kelly[7447] actor[7428] graham[7376] stephen[7317] lewis[7294] miller[7233] murphy[7216] van[7187] eddie[7182] daniel[7142] ray[7095] craig[7089] refer[6923] anderson[6893] moore[6886] nick[6855] jeff[6845] gordon[6767] eric[6735] dave[6596] howard[6559] anthony[6537] ross[6521] bruce[6515] linda[6466] matthew[6455] russell[6406] henry[6397] snooker[6389] patrick[6363] calli[6232] joseph[6232] 
topic15=hungarian[25236] hungary[16810] budapest[14378] kor[11210] eun[7707] mediacorp[6145] samsung[5732] magyar[4881] tc[4611] nemzeti[4602] istván[4342] faroese[3953] lászló[3950] koi[3766] ferenc[3711] ffu[3648] nagy[3300] nokia[3287] aac[2833] smartphone[2775] gábor[2675] sándor[2665] péter[2639] és[2523] encryption[2506] iot[2504] callsign[2424] ktv[2373] farkas[2273] usr[2232] yoshimoto[2224] szabó[2216] hu[2210] sidelight[2197] brickwork[2156] myx[2120] kento[2098] esperanto[2065] profesional[2044] se[2007] canoeist[1997] combinator[1989] zoltán[1979] afd[1966] lajos[1952] andrás[1928] szabolcs[1920] szeged[1919] militare[1879] zemplén[1873] arad[1860] vas[1838] yume[1811] tsubasa[1810] sia[1809] snapdragon[1797] huawei[1795] miklós[1771] bt[1771] qa[1767] ini[1764] ando[1686] wma[1685] tok[1685] győr[1683] tibor[1658] reg[1629] károly[1598] airtel[1537] bács[1526] dab[1522] tdt[1505] lexikon[1504] fujifilm[1459] ong[1458] ogura[1453] artforum[1435] sms[1394] erb[1392] pécs[1379] torun[1375] wearable[1373] cbr[1349] asp[1348] ege[1336] itu[1333] wifi[1322] ob[1307] messaging[1305] nsa[1305] kodak[1303] veszprém[1302] zhe[1301] voip[1291] mária[1286] lz[1281] eto[1275] thieme[1265] tr[1259] verizon[1259] 
topic16=vidhan[16251] damselfly[12497] csx[10474] mla[10186] ethiopia[6522] ethiopian[5162] sena[5101] kalyan[4929] shiv[4922] mandir[4507] branchlets[4407] breuning[4337] melaleuca[4113] addis[3970] dnq[3943] gables[3924] inmates[3707] thane[3319] ababa[3294] greensboro[3225] hrs[3046] vihar[3014] djibouti[3000] uab[2899] shelby[2848] wcc[2751] chakravarthy[2621] bahujan[2507] словарь[2506] gamecocks[2393] psychical[2351] modesto[2317] gauri[2261] bandra[2255] eucalypt[2245] palghar[2207] jayachandran[2110] liliana[2098] fayette[2093] roja[2045] kathi[2039] curran[2014] pfeiffer[2003] aparna[1990] nashik[1961] potts[1913] byard[1890] somali[1871] dusted[1857] sash[1813] knepper[1784] storekeeper[1755] anant[1710] bsp[1685] merrimack[1677] sawant[1676] chelyabinsk[1640] samford[1637] boardman[1607] tobin[1607] calhoun[1586] adama[1582] psd[1581] taft[1577] septa[1572] swp[1570] ashland[1568] bronson[1564] zootopia[1560] troup[1559] paas[1536] tana[1529] trenton[1519] sheva[1518] donati[1512] subiaco[1509] etv[1500] decatur[1496] spiritualist[1494] corcoran[1485] sarita[1475] milford[1470] dedham[1468] jaki[1462] igcse[1453] roxbury[1450] rosenwald[1429] yougov[1426] amal[1422] dieterle[1412] halfpipe[1412] carver[1392] nadya[1390] sion[1388] ossetia[1387] gunnarsson[1378] argento[1371] timi[1364] wilmington[1362] ncp[1357] 
topic17=wta[22884] mathematics[19980] mathematical[18654] theory[15908] martina[12298] equations[11340] geometry[11221] mathematician[11124] quantum[10372] equation[10254] graph[10104] function[9940] theorem[9247] differential[8559] algorithm[8528] ibadan[8515] problem[7394] algebra[7304] finite[7232] linear[7083] functions[6858] algebraic[6848] probability[6793] lucie[6597] space[6190] navratilova[6141] analysis[6099] shrestha[5983] algorithms[5853] method[5820] matrix[5807] dimensional[5764] computational[5708] numerical[5632] vector[5614] model[5299] graphs[5052] variables[5006] stubbs[4987] evert[4908] vertex[4726] topology[4581] solution[4574] methods[4558] nonlinear[4433] random[4397] value[4352] distribution[4330] hingis[4328] we[4326] physics[4320] given[4295] mathematicians[4236] partial[4228] optimization[4201] defined[4182] biju[4122] bolded[4115] constant[3980] obscurely[3976] example[3944] vertices[3910] dynamics[3853] mechanics[3847] sania[3833] point[3830] numbers[3799] problems[3781] variable[3685] values[3658] case[3556] discrete[3543] sequence[3529] polynomial[3510] sum[3501] complex[3493] plane[3484] jelena[3472] applications[3450] geometric[3435] approximation[3382] let[3361] metric[3332] fluid[3276] properties[3273] formula[3252] topological[3252] models[3215] cube[3207] petrova[3184] statistical[3136] triangle[3119] definition[3108] integer[3095] proof[3094] hyperbolic[3076] symmetry[3041] triangles[3038] destino[3029] field[3021] 
topic18=cricket[70887] cricketer[25748] matches[20404] puerto[20315] class[19353] wickets[17806] match[16789] venezuela[15988] campeonato[15007] rico[14417] innings[13909] runs[13602] batsman[13525] odi[13431] bowler[13413] wicket[11960] trinidad[11463] icc[11291] debut[11263] arm[11172] bowling[10778] right[10526] colombia[10297] tobago[10118] trophy[9932] cuba[9050] muisca[9012] rica[8974] costa[8634] cricketarchive[8589] overs[8499] handed[8115] rican[7947] scored[7937] twenty[7854] caribbean[7792] mirren[7678] kent[7570] indies[7422] ranji[7247] honduras[7244] test[7216] kilmarnock[7171] panama[7041] cuban[7038] caracas[7011] venezuelan[7004] sri[6832] batting[6760] ground[6622] espncricinfo[6605] partick[6577] clube[6541] changsha[6514] barbados[6181] que[6082] mcc[6022] dominican[5956] lanka[5949] warwickshire[5856] guyana[5742] uruguayan[5675] raith[5610] nicaragua[5533] cricketers[5523] gómez[5511] middlesex[5377] medium[5366] julio[5197] xi[5186] surrey[5160] rovers[5123] ecuador[5067] sussex[5065] colegio[4853] highest[4805] uruguay[4732] bowled[4678] balls[4600] futebol[4569] nottinghamshire[4539] scorer[4337] leicestershire[4325] vida[4318] guatemala[4254] domestic[4201] herrera[4196] honduran[4071] jamaica[4011] joaquín[3955] pakistan[3931] liberia[3751] fast[3713] mendoza[3650] glamorgan[3606] campos[3602] rivas[3522] guadalajara[3515] havana[3489] zimbabwe[3468] 
topic19=communes[17975] brewery[13119] lyon[12060] beer[11272] france[10976] french[8692] saint[8391] toulouse[8128] senegal[7447] faso[7334] burkina[7199] commune[6923] chargé[6803] affaires[6650] department[6298] benin[5931] brewing[5898] marseille[5866] michelin[5823] loire[5758] senegalese[5614] baku[5591] metz[5380] dsq[5101] dakar[5032] rouen[5018] haute[4886] lettres[4601] vie[4553] château[4362] calais[4288] podiums[4160] grenoble[4146] tarn[4120] havre[4050] littérature[3972] autres[3915] bordeaux[3893] maison[3718] aix[3532] ale[3522] beers[3471] le[3435] digitisation[3398] caen[3298] étienne[3276] niger[3275] chn[3272] techcrunch[3264] arrondissement[3221] cfa[3186] perpignan[3121] la[3018] arras[3000] ajaccio[2971] breweries[2961] chef[2959] margined[2911] terre[2859] troyes[2818] québec[2791] mali[2772] gaston[2709] abidjan[2700] sur[2666] derulo[2637] seigneur[2625] aliyev[2602] brewer[2600] chapelle[2561] ind[2479] auxerre[2473] guerre[2459] beatport[2430] du[2419] reims[2355] et[2333] en[2311] poulsen[2306] nigerien[2301] autódromo[2275] mauritania[2251] vieux[2238] guadeloupe[2237] bhr[2227] jenn[2223] dieu[2205] loup[2159] collège[2148] nord[2148] xixe[2147] département[2126] fih[2117] fournier[2115] flávio[2075] porte[2073] redlands[2056] seine[2049] clément[2021] pape[2002] 
topic20=bridge[81869] highway[51163] road[48105] route[44978] slate[21135] bays[19114] odonata[18852] bridges[15837] intersection[15582] gastropods[13944] farmhouse[13383] terminus[12993] arched[12851] crosses[12836] river[12707] tunnel[12539] curves[11325] truss[11240] sr[11008] us[10609] border[10163] junction[10083] traffic[9801] expressway[9762] intersects[9092] whorls[9050] creek[9016] roofed[8883] span[8882] vermont[8873] crossing[8738] intersections[8593] connects[8569] northeast[8519] runs[8470] travels[8467] roads[8388] begins[8290] highways[8208] interchange[7978] continues[7808] sc[7653] wfc[7244] description[7082] kentucky[6930] northwest[6882] street[6792] outbuildings[6730] avenue[6590] homestead[6586] sh[6449] motorway[6351] spans[6313] passes[6059] octagonal[6035] lane[5832] length[5814] toll[5746] arch[5618] northern[5475] roadway[5473] enters[5210] footpath[5122] raipur[5026] roofs[4995] southeast[4956] section[4909] ends[4868] meadows[4855] geograph[4723] adac[4604] lanes[4576] cottages[4563] eastern[4505] devonian[4428] carries[4427] steeply[4304] rural[4300] interstate[4217] construction[4126] km[4085] transportation[4084] southern[4079] rfu[4067] sills[3996] connecting[3987] paleontology[3924] concrete[3854] crossings[3802] bypass[3798] cambrian[3787] parkway[3766] deck[3763] covered[3710] junctions[3700] stratigraphy[3695] quarried[3654] ordovician[3636] ammonites[3609] segment[3564] 
topic21=soo[11930] hee[10040] idaho[10022] yoon[9916] jae[9198] dong[8736] kang[7681] kyung[7475] joo[7268] namibia[6703] africa[6399] seung[6155] jeong[5723] boise[5592] wac[5543] natal[4922] african[4544] pretoria[4390] kwazulu[4226] cameroun[4148] cape[4008] sang[3945] ahn[3921] hwan[3871] namibian[3686] hae[3534] tubercles[3508] hyo[3402] rogério[3391] spokane[3354] ju[3289] lesotho[3179] baek[3097] transvaal[2979] agarwal[2916] bae[2912] chae[2813] ting[2758] cassa[2733] sik[2694] apartheid[2667] comers[2647] cho[2565] gu[2522] grenada[2519] sook[2443] risparmio[2380] sarsfield[2342] uddin[2287] watanabe[2234] lê[2176] anura[2152] malian[2132] rockingham[2065] kwang[2054] telenovelas[2051] matti[2040] divya[1972] gi[1946] regionale[1941] stellenbosch[1933] за[1921] miho[1905] grenadian[1887] jeeva[1870] afrikaans[1854] everard[1848] anc[1783] aya[1781] subramaniam[1772] gyu[1767] carnarvon[1763] cardona[1749] kyun[1730] maki[1705] whitman[1677] stapleton[1676] fondazione[1670] amphibia[1645] momo[1614] reykjavik[1601] diop[1594] tremblay[1581] custer[1568] hailey[1563] hsien[1553] mateus[1548] walla[1530] eparch[1526] jeremih[1517] saa[1453] chieti[1450] keough[1436] dimitar[1421] bloemfontein[1402] grenadines[1399] yun[1393] dal[1390] président[1387] muller[1382] 
topic22=czech[39417] prague[22089] fivb[19394] slovak[13613] steeplechase[9901] susheela[9603] bratislava[8232] hc[8147] republic[7745] slovakia[7266] czechoslovakia[7142] czechoslovak[7131] blocker[7112] bhosle[6804] brno[6514] iihf[6327] sv[5948] yesudas[5609] jiří[5353] arun[5225] soundararajan[5212] petr[4895] václav[4784] dq[4670] rajan[4561] sangeet[4520] ghantasala[4500] mani[4338] ostrava[4169] cz[4154] kk[3985] wr[3975] josef[3956] bohemia[3916] iyer[3832] miloš[3808] praha[3792] karel[3621] maharaj[3512] tochter[3417] biswas[3399] jan[3365] srinivasan[3357] suman[3346] bhojpuri[3316] františek[3183] maa[3172] jana[3052] pavel[2892] vladimír[2887] tomáš[2881] slavia[2833] naresh[2827] andrej[2784] mahadev[2759] zeman[2746] moravia[2732] sk[2715] jaroslav[2688] mfk[2643] aravind[2641] michal[2616] jozef[2609] sparta[2591] antonín[2559] ilaiyaraaja[2559] nad[2538] shaan[2482] miroslav[2474] balakrishna[2447] ttt[2272] plzeň[2256] bohemian[2241] olomouc[2207] kunal[2197] ján[2148] grambling[2136] vestnik[2117] škoda[2098] anagennisi[2074] zdeněk[2054] liberec[2033] galindo[2016] pallavi[2007] krishnamurthy[1999] garg[1961] raga[1916] usha[1859] regionals[1850] ladislav[1838] sonu[1837] tabla[1823] jayaraman[1823] pradhan[1811] ashwath[1789] mulher[1786] jakub[1765] tatran[1760] jawahar[1753] federación[1738] 
topic23=administrated[25161] power[21522] locomotives[18333] storeys[17026] plant[15320] locomotive[15079] cornice[14926] class[13844] mine[13587] coal[13500] mw[12841] dam[11890] gas[10465] obliquely[10410] mining[10378] steam[9735] fremantle[9610] electric[9046] energy[8999] capacity[8502] diesel[8337] electricity[8170] hydroelectric[7501] malta[7468] solar[7368] mines[7089] cars[6783] oil[6654] kw[6479] railways[6440] station[6066] maltese[6004] steel[5801] railway[5757] engine[5705] wind[5515] nuclear[5340] car[5184] iron[5114] subcostal[4836] construction[4822] ore[4779] busan[4573] production[4456] vr[4376] tender[4146] gwangju[4126] reservoir[4096] dockyard[4091] turbine[4069] colliery[4029] engines[4011] water[4003] project[3948] tenders[3668] operated[3635] tons[3520] type[3417] rabbitohs[3405] furnace[3373] turbines[3311] quickie[3287] tenths[3259] boiler[3258] traction[3248] fuel[3171] foundry[3128] storage[3049] storefront[3041] miners[3029] transversely[3021] renumbered[3000] tonnes[2949] factory[2948] gozo[2918] generation[2903] bracketed[2900] tank[2898] installed[2861] kv[2856] dfl[2815] renewable[2796] units[2763] fuzhou[2760] owned[2724] grid[2722] fluted[2712] snapchat[2692] copper[2689] mill[2676] valletta[2660] semicircular[2640] petroleum[2631] shaft[2618] delivered[2610] wheel[2605] wheels[2574] supply[2535] stations[2471] facility[2462] 
topic24=game[92200] software[56005] data[39318] app[37672] users[32711] mobile[30274] video[26346] games[25900] user[25583] computer[25037] android[23089] system[22762] player[22091] platform[21675] microsoft[20663] web[20518] code[19984] developed[19919] google[19621] windows[19399] technology[19255] digital[18659] online[18543] systems[18467] players[17161] cloud[16571] content[16469] application[16324] internet[15098] features[14897] available[14828] version[14605] devices[14470] using[14464] development[14431] startup[14063] information[13345] open[13127] virtual[13112] design[13060] allows[12975] network[12855] gaming[12618] product[12379] source[12311] project[12289] device[12216] ibm[11896] access[11812] interactive[11623] linux[11611] applications[11551] developers[11253] server[11185] card[11161] pc[10865] developer[10722] tools[10636] interface[10298] phone[10269] mode[10258] camera[10224] kickstarter[10093] com[9914] apple[9883] security[9826] computing[9746] os[9715] free[9705] cards[9688] announced[9654] page[9627] graphics[9560] hardware[9515] launched[9419] release[9413] file[9381] xbox[9332] products[9133] model[9080] designed[9061] create[9044] support[9027] http[8836] machine[8733] search[8653] programming[8604] uses[8533] control[8428] provides[8352] different[8212] database[8147] management[8125] files[8046] platforms[8022] memory[7969] technologies[7915] tool[7889] storage[7856] electronic[7810] 
topic25=football[168081] basketball[130011] coach[119979] conference[103829] ncaa[103399] league[83100] tournament[79026] nfl[75899] games[67297] head[64610] game[61909] yards[60557] record[60153] baseball[57477] division[56715] player[56233] schedule[53107] draft[50894] opponents[46909] represented[45635] finished[45572] soccer[44432] michigan[43819] outscored[43125] stadium[43102] overall[40673] professional[39639] round[39500] regular[38094] big[37131] players[36514] championship[35566] athletic[34236] men[33883] play[33235] texas[33105] teams[32709] women[32124] roster[31482] bowl[29882] california[29635] signed[29311] san[28450] florida[27678] tigers[27384] defensive[26615] arena[26332] compiled[26119] arizona[26052] led[25717] field[25675] points[25571] mac[24176] senior[24033] junior[23842] selected[23249] bio[22776] carolina[22588] hockey[22445] drafted[22315] standings[21219] rushing[21166] coaching[20760] pm[20725] nba[20101] cal[19849] sports[19802] playoffs[19793] seasons[19673] wins[19576] fiba[19400] eagles[19277] center[18727] week[18637] miami[18571] ohio[18449] vs[18433] georgia[18415] sophomore[18202] attended[18049] tackle[17990] guard[17912] chicago[17912] losses[17886] indiana[17777] washington[17775] missouri[17702] tie[17657] win[16981] illinois[16910] bracket[16848] finish[16704] minnesota[16690] association[16048] kansas[15981] ten[15802] lost[15738] coaches[15738] diego[15485] broncos[15482] 
topic26=costal[33443] mollusk[27646] protein[27158] cancer[24854] disease[23213] cell[21830] treatment[20435] gene[20141] drug[19786] cells[18736] clinical[18618] patients[18287] medical[15996] virus[14914] health[14070] brain[13899] proteins[13141] dna[13061] diseases[12475] human[11691] receptor[11244] genetic[10699] blood[10497] bacteria[10478] patient[10405] effects[10392] genes[10201] paralympics[10161] marijuana[9937] genome[9770] syndrome[9732] subtotal[9479] disability[9440] rna[9369] drugs[9356] humans[9335] tissue[9292] medicine[8943] research[8745] therapy[8697] symptoms[8597] bacterial[8540] activity[8438] molecular[8421] function[8343] infection[8324] gastropod[8120] disorders[8035] type[7917] surgery[7887] cerebral[7685] skin[7434] indexed[7407] associated[7295] study[7138] pain[7133] strain[7120] abstracted[7089] animal[6954] disorder[6904] expression[6864] non[6681] tumor[6618] acid[6536] impairment[6535] breast[6341] related[6327] development[6253] viral[6175] hiv[6123] muscle[6105] diagnosis[6100] bone[6052] trials[5994] binding[5957] isolated[5955] mollusc[5934] amino[5873] cause[5866] transcription[5854] risk[5850] animals[5834] growth[5831] liver[5828] encoded[5788] membrane[5774] studies[5742] biological[5692] viruses[5686] slug[5665] specific[5664] receptors[5542] phase[5516] inhibitor[5476] host[5448] vaccine[5440] lung[5304] classification[5265] bac[5252] species[5248] 
topic27=anime[16382] japanese[13782] oricon[13299] manga[12823] japan[10491] jammu[10016] kashmir[9667] nakodar[9023] ntv[8552] tba[7103] ni[6954] nhk[6796] theme[6105] asahi[6018] tokyo[5720] anjali[3907] mato[3671] akita[3499] kottayam[3336] kannur[3322] ga[3285] volumes[3237] sakura[3188] khalsa[3115] ultraman[3102] priya[3093] idol[3083] tv[3043] mello[3026] ending[2961] yokohama[2939] uta[2898] ai[2898] kimi[2897] suzuki[2849] nana[2834] siddharth[2823] cerrado[2793] ishq[2723] shōnen[2676] suraj[2617] takahashi[2539] minami[2480] odia[2461] ishikawa[2435] gilgit[2431] sagar[2379] viswanathan[2332] kita[2308] ame[2286] sundaram[2261] avex[2144] giri[2143] seema[2134] volume[2129] várzea[2110] yamamoto[2108] kashmiri[2101] animation[2082] tocantins[2050] mata[2020] deen[2016] kishan[1998] raza[1998] amparo[1930] manaus[1893] tabi[1848] não[1839] ghulam[1836] atsushi[1799] ondo[1795] illustrated[1795] kobe[1780] storyboard[1768] hiroki[1758] tankōbon[1751] kubo[1745] rondônia[1731] cba[1727] dhillon[1718] azad[1716] weekly[1708] amala[1703] komatsu[1699] professionnelle[1698] serialized[1683] satyam[1673] dub[1647] puccini[1645] multan[1629] adaptation[1592] nami[1542] diya[1537] kumi[1526] toei[1525] aaj[1504] trax[1502] lovely[1492] uday[1478] shueisha[1451] 
topic28=korean[35847] kim[30051] korea[29837] lee[14779] jung[14661] skating[14240] min[12254] jin[11575] hyun[10194] isil[9998] ji[9854] seoul[9568] choi[8645] woo[8191] cha[7692] seo[7518] han[7183] champ[7146] hangul[7079] skate[7043] tcr[6875] tae[6806] ho[6804] sung[6786] jang[6767] tehran[6548] yeon[6521] skater[6485] zh[6296] jong[6224] ri[6025] park[6014] yong[5749] isu[5572] joon[5053] shin[4997] joseon[4669] hanja[4664] young[4536] figure[4510] dns[4338] ye[4329] mi[4156] iran[4082] incheon[3969] ktm[3962] mrt[3765] konitz[3660] bab[3655] nam[3650] oh[3619] hwa[3537] il[3496] jo[3399] abbas[3363] tabriz[3329] assyrian[3224] medalist[3188] hong[3174] ki[3133] prix[3105] seong[3090] lrt[3071] na[3039] dae[3020] jun[2967] yeong[2903] acb[2896] ara[2863] su[2801] olímpico[2782] yi[2705] yang[2614] sun[2611] seon[2501] hao[2499] se[2430] ae[2293] fs[2255] pyongyang[2240] wang[2230] chang[2224] ra[2201] ro[2136] geun[2115] fam[2107] chong[2087] jilin[2062] baloncesto[1975] pang[1975] saff[1937] roh[1930] raqqa[1915] nat[1914] bala[1868] yanbian[1848] bahá[1830] irib[1818] province[1777] shiraz[1765] 
topic29=wildcards[10346] efl[8624] ipc[7423] gambia[6841] ecac[6205] bitcoin[5972] fcs[5506] thani[5111] torneo[4644] gambian[4098] colgate[3635] pf[3454] beira[2869] carex[2780] apache[2684] jiangxi[2670] mozambique[2628] js[2609] sedge[2475] maputo[2459] villes[2417] utsa[2363] selfie[2338] batten[2233] ziyang[2168] taka[2033] iphone[2029] jur[1931] jax[1897] mussel[1879] nt[1870] yeo[1826] swaziland[1813] jaro[1790] pattaya[1750] swazi[1745] mozambican[1742] mady[1728] ipad[1675] dp[1653] marques[1647] brazzaville[1638] argentino[1547] unimproved[1525] nla[1512] wp[1501] password[1484] ange[1454] baruah[1453] kora[1452] harpe[1435] bourne[1430] bangui[1430] io[1415] noticias[1398] authentication[1385] malawian[1382] friedl[1382] bluetooth[1381] lv[1372] toomey[1353] suriya[1334] audiovisual[1320] botoșani[1313] bom[1277] mamadou[1270] cutler[1259] trina[1256] redshirting[1249] torrent[1235] halliday[1227] aiaw[1226] mccloud[1222] foss[1220] php[1204] tunde[1202] nanchang[1183] comix[1176] pmpc[1163] css[1162] lamin[1155] navale[1153] longerons[1150] noire[1145] apps[1145] folder[1144] hoyas[1134] mazur[1114] gnat[1100] zeke[1099] handa[1096] hacker[1086] ssl[1079] arends[1070] ocr[1067] mirosław[1055] passwords[1055] hendra[1052] engström[1050] backend[1048] 
topic30=historic[193472] building[179428] register[113178] places[90102] listed[88132] buildings[79081] street[75166] brick[69329] roof[67461] story[62903] tower[52445] style[51431] hotel[49037] architecture[48015] stone[46478] district[46319] revival[44738] hall[40948] church[40597] architect[40090] windows[39147] designed[38031] contributing[37870] floor[36623] construction[35864] park[35037] structure[34527] property[34467] jpg[34109] gable[33391] frame[32457] square[31410] site[31288] facade[30480] front[30413] entrance[29545] side[29462] houses[28506] constructed[27928] architectural[27149] center[26613] design[26172] listings[25793] town[25776] added[25669] features[25651] library[25337] museum[25245] heritage[25105] file[24933] residential[24332] monument[24328] interior[23329] barn[23091] walls[23080] road[22749] room[22435] opened[22346] dwelling[21879] avenue[21874] queensland[21736] bay[21618] castle[21502] central[21111] memorial[20558] arkansas[20480] grade[20276] complex[20266] courthouse[20261] commercial[20261] window[20093] corner[20039] mill[20002] wall[19997] restaurant[19843] wood[19638] farm[19548] rooms[19356] residence[19224] store[19116] rectangular[18970] rear[18560] architects[18369] block[18136] demolished[17979] concrete[17959] centre[17855] office[17202] timber[17167] originally[16922] feet[16856] completed[16784] columns[16679] limestone[16569] plan[16388] garden[16332] location[16292] indiana[15871] cemetery[15863] floors[15857] 
topic31=depressariidae[18966] gelechiidae[18681] tornus[16393] blooms[14372] spots[13740] transverse[11411] margin[11176] dot[10092] markings[8052] elegans[7697] turrids[7547] lecithoceridae[7223] mohan[7141] queensland[5973] phagwara[5929] rajkumar[5473] anal[5347] subspecies[5228] xyloryctidae[5205] turridae[5192] weatherboard[4475] faint[4401] radha[4392] dull[4199] uchicago[4103] botany[3967] lepidoptera[3943] suture[3734] animalia[3702] durga[3701] streaks[3638] moths[3565] surya[3556] chitra[3465] modi[3428] rounded[3417] zoology[3386] guiana[3231] cand[3210] puri[3015] autostichidae[2988] kimberley[2954] tick[2929] sangeetha[2858] undefined[2777] upanishad[2738] dichomeris[2729] crab[2723] guntur[2699] mycologist[2696] sarma[2683] gujarat[2631] edu[2626] hanuman[2615] elina[2588] puillandre[2563] mushroom[2560] bissau[2540] sinuate[2526] rosy[2500] entomology[2486] thakur[2483] kanchana[2449] gecko[2396] nee[2373] tanjong[2365] drosophila[2317] meenakshi[2235] lip[2219] replication[2193] vasu[2172] attenuated[2157] purple[2114] girish[2082] attains[2061] genomic[2039] sastry[2007] tentacles[1990] mohanty[1986] upendra[1942] shuai[1906] raja[1902] prabha[1900] butterfly[1879] karna[1862] drepanidae[1852] mathura[1835] parasitology[1833] buhari[1823] ramachandran[1818] veera[1809] uma[1787] chakra[1773] ajit[1758] genomes[1738] venkateswara[1717] rajputs[1716] indiewire[1696] radhakrishnan[1693] kutch[1692] 
topic32=ludhiana[27629] congo[6766] ipsc[5463] ellington[5445] braxton[5417] motherwell[5190] congolese[4947] wcha[4442] roach[4309] mehldau[3485] lehigh[3285] gillespie[3269] dizzy[3139] thelonious[3039] getz[3035] mcbride[2802] mcintosh[2749] cohn[2699] bop[2652] niu[2651] brubeck[2642] verve[2548] jobim[2537] sims[2525] hons[2489] adderley[2488] rensselaer[2458] campground[2458] wabash[2441] hodges[2366] rollins[2360] binghamton[2353] kalev[2327] susquehanna[2274] kinshasa[2251] goswami[2250] rcd[2220] pinal[2202] juvenil[2155] drc[2055] smt[2036] kohler[2023] deshpande[2022] operetta[2018] trombonist[2014] vibraphonist[1982] schuyler[1973] utica[1933] bennington[1915] bachelors[1894] malone[1875] rutland[1866] zoot[1857] erie[1857] eldridge[1851] sandnes[1779] lacy[1764] fisk[1743] giuffre[1741] humphreys[1739] blanchard[1729] mlc[1706] ucl[1704] horvath[1698] macklemore[1695] erling[1684] mance[1679] nao[1675] rah[1667] jeunes[1666] hurley[1664] adirondack[1640] banaras[1627] bley[1617] lackawanna[1599] crouch[1581] fairleigh[1566] dewey[1551] mulligan[1545] highschool[1544] asu[1541] diliman[1521] suny[1494] acha[1488] snapper[1487] tamar[1480] lipscomb[1477] zorn[1472] haynes[1453] dutton[1450] shim[1447] lombardo[1447] ruff[1446] scranton[1437] bhagwan[1434] monk[1430] swarthmore[1421] laine[1407] sayre[1406] cheatham[1403] 
topic33=ky[21829] segunda[16119] alaska[13736] whorl[9457] oblast[8538] lsu[6648] deanery[6135] parishad[5905] bardhaman[5176] pct[4973] apa[4624] porches[4565] selo[4450] potomac[4144] yukon[4066] transom[3943] macon[3584] purulia[3258] anchorage[3222] gabon[3144] gana[3085] banga[3071] littéraire[2998] krai[2983] roanoke[2935] bolivar[2911] confluent[2881] bankura[2799] trinamool[2766] fredericksburg[2559] fairbanks[2440] taney[2405] appomattox[2333] midshipmen[2313] bethel[2210] chattanooga[2165] vassar[2146] swanson[2119] máximo[2081] diocesan[2025] juneau[2001] burdwan[1981] piney[1932] boonville[1888] danville[1883] eritrea[1871] pmc[1858] natchez[1834] greenfield[1808] meade[1800] charlottesville[1798] archdeaconry[1767] positio[1685] sampson[1648] cuny[1628] hardt[1626] nome[1619] sáenz[1617] mckinsey[1613] eritrean[1600] photojournalist[1563] anam[1554] manassas[1538] becca[1518] antietam[1515] tiller[1509] flor[1505] residentiary[1479] sarasota[1474] kamchatka[1453] asmara[1449] tyumen[1442] territorial[1434] wheeling[1427] ibarra[1425] scc[1420] rajeswari[1414] mim[1407] linares[1396] kirti[1393] everglades[1390] garo[1387] hooker[1385] faridpur[1383] parganas[1381] haley[1367] valdez[1345] odell[1341] hopson[1340] mccallum[1336] juventud[1331] bashkortostan[1329] haines[1329] stedman[1328] sigman[1322] townsite[1311] esta[1302] ase[1283] yancey[1281] mikhailovich[1279] 
topic34=são[30792] brazilian[28272] brazil[26472] da[23147] portuguese[21663] paulo[19983] rio[17515] janeiro[16598] do[14436] verde[14094] portugal[13943] silva[12169] praia[11930] cape[10281] porto[9975] paulista[9747] joão[9311] dos[8387] santos[8246] josé[6865] grande[6854] brasil[6452] santo[6088] vitória[6037] lisbon[5796] antónio[5779] pereira[5651] amazonas[5633] oliveira[5531] pedro[5384] verdean[5197] serra[5050] ribeira[5050] ferreira[5009] fogo[5006] sul[4899] carlos[4751] paraná[4710] souza[4601] bahia[4568] maria[4482] gomes[4425] das[4405] vicente[4267] novo[4229] pará[4080] luiz[4072] rodrigues[4056] minas[4052] almeida[4045] martins[3975] vila[3948] santa[3916] mendes[3836] luís[3830] santiago[3794] ponta[3760] antão[3720] lopes[3666] dias[3650] amazon[3540] island[3536] guimarães[3534] mindelo[3473] boa[3463] madeira[3462] ramos[3421] filipe[3397] gerais[3391] vasco[3327] jorge[3325] fernando[3299] garcia[3284] costa[3267] os[3205] sal[3187] maio[3163] brasileiro[3069] globo[3051] carvalho[3028] nicolau[3011] catarina[2993] goa[2947] joaquim[2935] augusto[2921] henrique[2900] quito[2893] manuel[2891] botafogo[2875] sousa[2864] andrade[2829] cardoso[2823] rocha[2780] antônio[2773] cruz[2747] sawan[2733] fonseca[2730] brava[2698] alegre[2684] monteiro[2673] 
topic35=german[84172] der[60185] von[50600] und[47759] berlin[37595] germany[35106] die[30779] hans[24378] für[18196] munich[17907] hamburg[17775] karl[17350] austrian[17069] vienna[15894] des[15042] im[14835] friedrich[14703] leipzig[14059] das[13599] johann[13314] wilhelm[13097] heinrich[12632] bundesliga[12613] austria[12586] franz[12337] zur[11947] georg[11765] frankfurt[11725] verlag[11711] ernst[11659] hermann[11196] fritz[11108] ein[10913] ludwig[10645] deutsche[10470] otto[10185] isbn[9480] rudolf[9401] stuttgart[9059] bavaria[8413] baden[8380] werner[8373] wien[8284] geschichte[8155] carl[8090] cologne[8081] wolfgang[7677] rhine[7604] aus[7585] bonn[7571] zu[7306] saxony[7149] nazi[7128] max[7057] auf[7041] erich[6807] heinz[6711] dem[6688] walter[6657] den[6607] josef[6540] bremen[6539] gustav[6514] prussian[6484] württemberg[6436] huber[6361] mit[6315] düsseldorf[6305] müller[6279] eine[6228] weimar[6117] johannes[6082] münchen[5845] heidelberg[5771] bavarian[5667] kurt[5575] am[5530] bahn[5490] klaus[5483] mainz[5445] zum[5422] swiss[5387] deutschen[5295] graz[5236] jena[5235] spd[5213] dfb[5183] bei[5115] zürich[5087] theodor[5049] richter[4960] münster[4950] adolf[4914] brandenburg[4892] gerhard[4887] christoph[4797] paul[4736] fischer[4617] schmidt[4584] rhineland[4560] 
topic36=stakes[22960] jalandhar[20788] barcelona[15972] horse[15099] kenya[8786] kenyan[8683] amritsar[8654] race[8209] racing[7840] lengths[7382] catalan[7077] catalonia[6824] derby[6576] horses[6528] handicap[6371] filly[6301] colt[6248] jockey[6147] trainer[6020] stud[5769] races[5603] dressage[4930] stallion[4688] winner[4479] nairobi[4447] trained[4320] bred[4249] ridden[4186] triathlon[4065] stable[3992] mile[3949] runners[3928] equestrian[3809] thoroughbred[3805] win[3613] breeders[3611] breeding[3609] ayr[3498] sire[3386] prix[3380] keelboat[3357] sailboat[3241] andorra[3205] mallorca[3179] eventing[3172] epsom[3023] girona[2869] ernakulam[2813] fillies[2811] broodmare[2807] oaks[2801] run[2775] mare[2680] inflorescences[2666] kentucky[2640] hnl[2638] winners[2628] bhattacharya[2615] distance[2601] pedigree[2542] turf[2539] ironman[2492] catalunya[2431] reus[2310] sant[2252] harness[2219] josep[2158] foals[2136] tarragona[2120] rider[2084] mares[2070] farm[2064] dam[2056] balearic[1990] flat[1948] stables[1938] grizzlies[1927] churchill[1879] miquel[1858] gakuen[1828] pounds[1816] lleida[1815] grade[1798] jaume[1797] cup[1785] reina[1775] coloma[1761] gelding[1759] weld[1742] farrington[1741] complutense[1741] francesc[1729] bay[1727] maiden[1713] belmont[1704] park[1680] carruthers[1669] winning[1667] sabadell[1656] pace[1636] 
topic37=pb[13434] sb[13170] trump[10652] tunisian[10576] mustangs[10395] rook[10320] rider[8933] cyclist[8862] tunisia[7723] tunis[7608] nas[6726] vuelta[6583] caf[6415] pot[5088] darts[4954] bike[4877] bicycle[4828] fb[4675] poker[4496] mustang[4487] tn[3641] runway[3311] pekan[3117] cyclists[2975] doping[2797] abeokuta[2499] allentown[2445] amish[2434] cycling[2401] kao[2236] sdn[2185] sneha[2114] bmx[2104] finley[2091] puteri[2090] drone[2047] bicycles[2028] casino[2007] bikes[2002] inactivated[1966] bf[1885] itt[1862] slc[1832] yasir[1824] dart[1812] bhd[1794] nav[1769] kf[1763] jg[1734] psm[1712] sprinters[1682] kh[1677] sse[1670] criterium[1640] meritorious[1617] redesignated[1605] cycliste[1599] motocross[1546] cr[1543] fargo[1541] vb[1533] tt[1517] aoa[1515] akmal[1507] awang[1478] elgin[1474] estero[1469] rc[1464] mennonite[1447] neuwied[1439] tamworth[1437] nv[1416] sukhoi[1416] dunlop[1412] curtiss[1411] leighton[1386] bhavana[1361] nb[1328] aces[1318] fas[1316] cx[1305] fédération[1296] mosman[1296] saiful[1291] minesweepers[1289] fp[1288] ati[1262] oa[1252] fokker[1242] mcconnell[1229] cst[1217] bayonne[1216] koe[1211] anak[1200] uavs[1193] asphalt[1181] rr[1181] atv[1175] iata[1167] dh[1165] 
topic38=research[66060] education[65640] science[57921] institute[57702] professor[56700] students[52411] medical[49202] engineering[41422] hospital[38642] degree[37801] sciences[37695] award[36374] society[35746] academy[34619] medicine[31914] technology[30589] department[30450] health[30265] faculty[29353] fellow[29137] director[27706] president[26143] physics[25070] studies[24831] dr[23844] schools[23826] bachelor[23393] awarded[23144] association[22712] academic[22641] phd[20981] secondary[20952] indian[19839] women[19779] campus[19737] board[19691] india[19101] chemistry[19072] master[18978] worked[18933] program[18815] awards[18735] teaching[18308] graduate[18277] courses[18133] scientific[18030] center[17551] graduated[17412] student[17315] educational[16940] arts[16818] council[16160] foundation[16112] mathematics[15962] chair[15729] training[15628] laboratory[15425] contributions[15016] study[15005] computer[14526] ph[14491] founded[14480] prize[14461] management[14224] vice[14122] established[14103] teachers[13652] institution[13455] biology[13442] studied[13399] development[13181] doctorate[13003] teacher[12624] fellowship[12577] senior[12391] assistant[12347] scientist[12345] associate[12340] programs[12158] harvard[12135] girls[12115] earned[12036] undergraduate[12035] children[12034] library[11783] appointed[11751] elected[11744] economics[11678] committee[11564] higher[11466] lecturer[11451] california[11352] universities[11350] centre[11176] taught[11065] dean[11048] technical[11032] social[11004] medal[10968] doctoral[10926] 
topic39=hai[8470] gaon[7563] asha[7267] kapoor[6431] khan[6388] bollywood[6282] ki[6257] hindi[6214] hum[6039] dil[5762] ek[5625] rani[5603] zee[5353] kristiansand[5229] ke[4966] kumar[4819] prem[4813] kamal[4777] patel[4442] mehta[4285] pandit[4277] vest[4251] bengali[4066] kriegsmarine[4011] dutta[3883] meena[3864] kishore[3779] wunderlich[3687] ravindra[3601] shree[3445] bhatt[3410] mein[3293] bir[3278] lata[3276] begum[3272] rafi[3221] pandey[3204] yeh[3201] uttarakhand[3195] lund[3181] ashok[3159] na[3124] malmö[3097] priyanka[3068] vidya[3003] se[2930] oberleutnant[2901] sinha[2890] varun[2885] gaurav[2823] dey[2784] chandran[2784] supercharged[2743] tum[2742] eifel[2739] aur[2734] freiburg[2688] mangeshkar[2621] pyaar[2590] dinesh[2553] khanna[2549] tiwari[2535] satya[2526] aman[2473] bhai[2448] gaya[2436] malhotra[2386] mera[2366] deepa[2295] mohammed[2267] arora[2262] hatun[2260] castleford[2259] lyricist[2259] cine[2230] kiel[2225] ashish[2220] astana[2203] onna[2198] chopra[2179] sameer[2170] schweiz[2159] voss[2134] narendra[2134] teri[2086] beşiktaş[2086] hoon[2071] govind[2047] mukesh[2047] fri[2047] sultana[2045] saratov[2006] nicosia[1955] pyar[1946] kanta[1933] bhi[1931] sivan[1920] rennes[1915] hain[1897] bursa[1875] 
topic40=business[62311] bank[45296] management[42236] founded[39588] development[39578] million[38854] services[38106] companies[37701] nigeria[36136] ceo[32953] investment[31178] financial[30730] industry[29560] global[28739] products[27817] market[27164] economic[26174] organization[25446] nigerian[25403] executive[24896] firm[24463] board[23795] countries[23356] technology[22978] marketing[22843] social[22456] billion[22408] capital[22361] project[22334] fund[21965] president[21728] trade[21066] co[21000] africa[20931] chairman[20698] foundation[20602] food[20587] health[20255] finance[19930] media[19487] energy[19432] founder[19318] community[19003] director[18935] uk[18182] entrepreneur[18169] projects[18150] policy[17883] ltd[17875] education[17753] brand[17737] private[17729] funding[17581] association[17550] agency[17412] exchange[17407] sector[17367] corporation[17295] established[17162] india[16963] largest[16918] partners[16750] wheatbelt[16684] corporate[16636] insurance[16540] information[16275] program[16190] inc[16165] ministry[16131] online[16063] launched[16011] research[15944] european[15927] department[15868] office[15320] employees[15231] tax[15116] support[15069] limited[14933] banking[14904] owned[14863] stock[14791] organizations[14747] network[14505] security[14413] profit[14303] chief[14171] activities[13805] relations[13628] us[13620] businesses[13484] awards[13418] acquired[13364] resources[13349] non[13274] investors[13187] venture[13118] headquartered[13105] provides[13096] environmental[13068] 
topic41=church[184207] bishop[142661] catholic[79829] roman[57586] cathedral[49352] diocese[48368] pope[45495] parish[40096] priest[36907] archbishop[34780] ordained[29370] saint[28537] chapel[27905] titular[27380] consecrated[26688] papacy[25253] prelate[24380] appointed[22328] holy[20630] monastery[19511] churches[19174] cardinal[19067] religious[17061] apostolic[17038] episcopal[16749] italy[16431] abbey[16390] giovanni[16301] bishopric[15664] mary[14738] christian[14601] christ[14522] biography[14335] catholicism[14280] seminary[14230] san[14120] congregation[13975] convent[13940] bishops[13781] saints[13735] vicar[13669] di[13663] theological[13334] jesus[12991] rome[12699] virgin[12675] paul[12564] orthodox[12463] anglican[12396] francesco[12108] latin[12077] altar[12045] missionary[12002] baptist[11974] theology[11936] basilica[11743] lady[11291] rector[11271] organ[11034] province[11005] maria[10881] our[10626] dedicated[10464] italian[10344] pastor[10288] madonna[9993] rev[9860] nave[9757] town[9320] santa[9233] ezekiel[9035] baroque[8993] antonio[8823] pietro[8685] palazzo[8597] blessed[8580] mission[8565] temple[8542] rite[8488] della[8426] chaplain[8137] painted[8127] ecclesiastical[8107] francis[8035] co[8026] joseph[8007] depicting[8000] pius[7995] patriarch[7985] carthage[7761] frescoes[7731] altarpiece[7703] fr[7603] jesuit[7503] lutheran[7486] building[7470] founded[7466] battista[7298] ancient[7241] abbot[7204] 
topic42=journal[58343] professor[44799] research[42675] book[35563] studies[31926] editor[27187] science[25575] social[25242] philosophy[23378] press[23086] books[22043] society[21390] publications[20222] theory[19708] academic[18954] psychology[18327] isbn[17815] author[17415] articles[16041] political[15678] institute[15406] study[15332] sciences[15327] economics[15036] scientific[14726] pp[14136] reviewed[13987] review[13396] phd[13359] peer[13080] oxford[13006] language[12736] literature[12097] edited[11417] culture[11187] works[11033] cambridge[10854] sociology[10731] law[10729] historian[10691] education[10472] politics[10296] harvard[10228] citation[10126] journals[10101] vol[10000] ph[9973] analysis[9948] fellow[9891] human[9769] scholar[9752] faculty[9704] how[9689] association[9562] ed[9484] cultural[9309] co[9149] policy[9043] anthropology[8934] impact[8916] studied[8829] media[8742] thesis[8721] chief[8467] gender[8427] historical[8409] publication[8384] associate[8147] volume[8032] quarterly[7987] ethics[7948] papers[7935] taught[7931] linguistics[7922] publishing[7867] authored[7826] teaching[7793] feminist[7726] knowledge[7716] issues[7674] selected[7664] dissertation[7630] modern[7486] topics[7470] economic[7424] edition[7363] reports[7255] eds[7229] religion[7225] editorial[7140] lecturer[7126] department[7103] yanow[7080] bibliography[7017] doctorate[6960] humanities[6943] women[6786] london[6779] prize[6754] covering[6705] 
topic43=doubles[62010] singles[38664] tennis[30510] tournament[27274] semifinals[24132] atp[23542] qualifier[23072] runner[20096] clay[18443] quarterfinals[18148] entrants[16119] nr[15924] itf[15746] tournaments[15450] heats[15356] winner[13941] hard[12340] challenger[11742] open[11226] ranking[11085] partner[11048] semifinal[10665] seed[10456] prix[10207] qf[9065] seeded[8771] courts[8385] lil[8120] tour[8040] rankings[7760] contestant[7207] finals[6946] bye[6738] qualifying[6712] slam[6267] surface[6133] grand[6068] airdate[5986] eremophila[5964] title[5914] sf[5834] loser[5731] imdb[5660] contestants[5552] seeding[5168] carpet[5015] fastest[4736] wimbledon[4631] elena[4626] partnering[4617] teaser[4483] quarterfinal[4461] partnered[4219] women[4198] runners[4031] forster[3975] grass[3675] defeated[3626] titles[3608] wildcard[3537] northridge[3525] mixed[3469] davis[3443] michelle[3383] kendrick[3293] danielle[3148] yana[3119] gaga[3067] anna[3053] nicole[3053] laura[2949] petra[2900] maria[2894] sets[2893] finalists[2776] edm[2762] kristina[2688] andrea[2670] julia[2651] paula[2648] olga[2638] femina[2627] anastasiya[2622] snoop[2588] simona[2588] lukáš[2570] jessica[2550] soler[2535] stefani[2514] cr[2511] mullins[2471] jennifer[2467] sandra[2451] janeiro[2451] outcome[2445] billie[2436] samantha[2414] masters[2352] raven[2340] kwok[2322] 
topic44=roman[10874] rome[9933] portico[9040] lucius[8415] gaius[8213] consul[7477] goalscorers[6900] bc[6832] marcus[6377] gens[5447] quintus[5062] ovate[4888] civitas[4847] walsingham[4697] publius[4451] villanova[4309] ad[4131] stuccoed[3961] titus[3890] doric[3776] cicero[3468] premio[2967] aquila[2792] racine[2785] balustrade[2755] caesar[2632] fireboat[2608] virtus[2540] canvases[2500] naft[2421] galleria[2402] freedman[2291] lnb[2280] gladiators[2254] julius[2245] minerva[2169] omnium[2165] aulus[2139] cassius[2030] église[2022] severus[1985] gnaeus[1976] romaine[1976] palladian[1964] larissa[1933] tribune[1923] dio[1921] suffect[1899] poésie[1876] prefect[1868] daphne[1859] âge[1830] claudius[1822] ettore[1813] cornelius[1791] palmyra[1789] teniers[1787] ancient[1775] pavić[1773] barberini[1755] captaining[1742] poli[1722] alii[1704] urbino[1696] renzo[1676] bene[1675] blaenau[1672] adonis[1663] mazandaran[1661] consulship[1650] komnenos[1639] lodz[1635] morelli[1624] nazionale[1622] castellan[1613] consular[1606] pliny[1593] francesca[1576] doria[1572] tiberius[1560] monti[1548] legate[1546] leaden[1543] maximus[1522] antonius[1520] nero[1495] nuovo[1492] secundus[1482] editore[1481] lamia[1474] altieri[1470] tacitus[1469] manrique[1461] italia[1460] tentatively[1457] moselle[1452] paget[1451] conti[1441] scipione[1441] proconsul[1427] 
topic45=cast[106746] episode[104407] television[101916] tv[97481] films[97428] directed[95271] actor[86329] award[84352] awards[83283] role[80312] festival[77785] actress[77215] show[77209] drama[76848] director[72861] episodes[71451] filmography[63669] comedy[63262] production[61761] story[61191] plot[60452] theatre[60104] movie[59504] title[55736] love[55489] documentary[51693] producer[50330] starring[45898] short[45616] man[44930] produced[44609] miss[43617] roles[42418] written[41939] novel[41055] girl[40519] stars[39199] character[38285] young[37743] cinema[37252] play[37030] mother[35824] star[34945] premiered[34559] feature[34062] you[33957] reception[32984] appeared[32162] aired[32037] my[31905] writer[31751] father[30700] woman[30434] lead[30232] nominated[29931] characters[29326] release[28684] night[28631] stage[28334] acting[28280] supporting[27766] productions[27653] book[27639] entertainment[26206] co[26185] reviews[25469] theater[24723] channel[24384] broadcast[24226] selected[23074] video[22948] horror[22891] black[22884] voice[22818] animated[22709] wrote[22704] guest[22695] debut[22683] special[22518] boy[22165] live[22153] get[21959] network[21814] shows[21463] go[21445] filming[21406] friend[21402] pictures[21348] critics[21279] category[21205] starred[21168] children[21122] thriller[21068] wife[20865] nominations[20865] worked[20852] mr[20666] screenplay[20657] actors[20642] premiere[20495] 
topic46=party[204527] election[190588] elected[103424] assembly[97060] minister[91615] politician[82753] democratic[82621] elections[80227] council[73314] president[72764] votes[72189] republican[70852] legislative[68967] parliament[60172] candidate[57620] district[54525] political[53260] secretary[50612] constituency[48971] senate[48504] electoral[46748] committee[46693] mayor[46247] vote[44440] deputy[43741] representatives[43491] seat[43159] governor[42276] law[41649] liberal[41624] seats[40976] presidential[39158] candidates[38596] incumbent[38495] appointed[35910] union[35228] term[34940] results[34404] labour[33338] court[32926] representative[32692] chairman[32324] parliamentary[31656] labor[30604] affairs[30478] congress[30264] office[30070] vice[28940] cabinet[28177] justice[27945] sarpanch[27449] ambassador[27056] trump[25821] attorney[25269] leader[24881] democrat[23762] ministry[23742] conservative[23072] representing[22864] primary[22864] voters[22802] independent[22416] prime[22211] senator[21688] legislature[21617] politics[21590] worked[21353] socialist[21297] supreme[21184] degree[21151] graduated[20869] board[20853] chief[20760] judge[20434] education[19355] federal[19307] communist[19162] campaign[19050] executive[18950] lawyer[18823] social[18245] polling[17995] coalition[17986] re[17921] commission[17838] alliance[17605] represented[17408] defeated[17367] ran[17299] wisconsin[17248] commissioner[17087] serving[16986] joined[16954] position[16900] foreign[16606] voting[16471] parties[16068] municipal[15382] director[15360] mp[15269] 
topic47=japan[41130] japanese[37224] tokyo[21680] myanmar[16827] fuji[14322] prefecture[13669] mbc[11758] subdistrict[10368] shenzhen[10012] osaka[9678] peng[8471] nagano[7475] kyoto[7440] universiade[7133] burmese[7043] xiang[6925] burma[6891] lim[6834] feng[6677] tianjin[6492] yangon[6285] dalian[6170] jiangsu[6106] aung[5991] hokkaido[5983] niigata[5809] pilbara[5808] seok[5607] wa[5516] sbs[5503] nippon[5423] xie[5283] nagoya[5189] prefectural[5145] fukuoka[5139] asahi[4967] kalgoorlie[4941] hiroshima[4732] sumo[4543] edo[4385] ono[4227] anhui[4204] yao[4096] sapporo[4051] nhk[4048] buri[4023] shan[3936] nsw[3925] lei[3876] okinawa[3854] nakamura[3853] guizhou[3843] maung[3638] prema[3638] kyaw[3614] fukushima[3565] harbin[3551] chiba[3528] myung[3455] hainan[3433] meiji[3408] lateritic[3254] nara[3173] yamaguchi[3160] mandalay[3092] ganesan[3077] yunnan[3061] tokugawa[3059] zhi[3036] ldp[2998] qingdao[2997] zhuang[2914] china[2867] nagasaki[2860] emperor[2849] saitama[2821] sho[2801] dai[2782] sakai[2717] sendai[2697] aditi[2673] ji[2647] tun[2635] ito[2624] hsiao[2620] okayama[2617] haruka[2601] nan[2584] kagoshima[2575] maeda[2564] myint[2561] ningbo[2559] nihon[2555] xun[2510] sakhalin[2454] dong[2451] kanagawa[2430] keqiang[2417] kumamoto[2411] ku[2396] 
topic48=ochreous[46314] blackish[29099] yellowish[11788] vevo[6894] sheathed[5193] paler[4989] elongate[4490] buff[4326] everest[4287] leumit[4130] parramatta[3939] faintly[3568] sssi[3301] shining[3262] bluish[3190] mountaineers[3075] beitar[3057] blotches[2975] specks[2516] ariana[2466] sikkim[2432] collie[2078] crosse[2068] washburn[1992] pubmed[1909] sriram[1893] britney[1877] lilac[1801] osu[1757] mcnulty[1717] lindley[1707] hut[1700] streaks[1683] stripe[1682] waterfalls[1664] stillwater[1660] eardley[1644] jamieson[1641] azalea[1641] flume[1637] neogene[1611] pizza[1582] burger[1570] sportswear[1563] cassie[1547] subpopulations[1509] cofounder[1456] cyrille[1438] loir[1436] effie[1430] csiro[1421] gonville[1395] tinashe[1389] ashanti[1380] sturt[1368] emery[1366] hadassah[1350] behar[1348] minden[1347] tepals[1347] trekking[1324] azar[1306] murrumbidgee[1305] brickell[1302] bondi[1288] nir[1250] naas[1243] ashdod[1242] faint[1230] duffield[1227] whorls[1220] electropop[1215] werft[1204] guttenberg[1203] earlham[1202] hornbeam[1197] fries[1192] fasano[1192] badger[1183] sunfish[1180] riverton[1176] backstreet[1176] cramer[1173] dinoflagellates[1165] charly[1163] alana[1158] thallus[1149] sandwich[1147] tunbridge[1146] mcdougall[1138] hanni[1134] swimwear[1118] sherpa[1115] flaky[1113] cobbles[1109] heide[1103] prodromus[1089] rihanna[1055] caldera[1045] prathap[1045] 
topic49=army[98174] regiment[81470] military[62236] division[55636] infantry[55247] brigade[46857] corps[45109] commander[44190] battalion[38875] battle[38055] forces[36243] officer[35815] lieutenant[33570] artillery[33461] colonel[32259] command[32030] force[29849] rifle[26435] scout[24754] medal[23391] nd[22853] staff[21355] air[20957] unit[19476] troops[18918] chief[18888] promoted[18710] fought[18563] rd[18153] royal[17821] soldiers[17722] defence[17525] cavalry[16788] awarded[16734] civil[16662] commanded[16140] rank[15876] armed[15173] killed[14694] fort[14384] officers[14371] wounded[14349] units[13902] training[13870] operation[13623] camp[13384] guards[13136] operations[12895] german[12845] captain[12806] attack[12736] guard[12658] defense[12600] commanding[12460] drilliidae[11719] naval[11677] offensive[11584] combat[11520] brigadier[11499] front[11355] battery[11339] enlisted[11331] appointed[11282] tank[10776] navy[10681] campaign[10637] reserve[10443] intelligence[10280] fire[10083] scouting[10016] headquarters[9898] soviet[9865] deputy[9795] formed[9726] men[9679] regiments[9604] transferred[9584] cross[9561] commanders[9381] duty[9354] captured[9280] marine[9258] siege[9226] police[9112] security[9049] red[8994] joined[8910] volunteer[8893] squadron[8820] fighting[8796] assigned[8741] base[8726] soldier[8539] scouts[8409] px[8358] action[8314] field[8305] commando[8210] garrison[8149] legion[7905] 
topic50=kg[79138] wrestling[31050] win[21009] boxing[19512] heavyweight[16104] tko[14780] fight[13618] freestyle[13026] championship[12766] nov[11047] tbs[10817] title[9870] event[9570] mar[9536] sep[9387] judo[9336] champion[9277] middleweight[8683] match[8662] curling[8652] wrestler[8641] decision[8602] professional[8457] oct[8255] boxer[8105] ko[8039] dec[7854] feb[7775] unanimous[7722] tcu[7684] weight[7644] martial[7516] lightweight[7301] aug[7284] vs[7171] bout[7149] ud[6876] hawaii[6798] jun[6761] pts[6706] wrestlers[6318] hawaiian[6205] welterweight[5947] hispanicized[5860] submission[5816] loss[5806] ring[5553] ref[5506] round[5432] nbl[5431] taekwondo[5288] defeated[5191] quarterfinals[4985] div[4667] pro[4634] wbc[4613] iwrg[4488] fighting[4460] punches[4397] mma[4286] karate[4277] mixed[4154] kickboxing[4131] tournament[3985] light[3914] fights[3845] championships[3693] super[3671] sounders[3592] gnis[3588] record[3566] fencing[3376] fighter[3294] jan[3281] donbass[3236] doncaster[3188] night[3125] rua[3017] trapani[2936] opponent[2933] quota[2918] honolulu[2880] arts[2828] ultimate[2820] promotion[2791] cage[2696] avenida[2695] arena[2616] boxers[2600] fought[2575] events[2568] fighters[2567] matches[2565] kumite[2556] knockout[2546] date[2522] olivera[2499] lost[2492] amateur[2476] results[2467] 
topic51=gameplay[12152] pune[10372] lviv[10075] playstation[9759] panchayati[8950] bydgoszcz[5898] cuttack[5221] rk[5034] nintendo[5004] singha[4280] metacritic[3723] multiplayer[3593] srinagar[3546] ivano[3434] ds[3412] balaji[3353] mohun[3196] sega[3195] udaipur[3162] bhat[3143] ps[3143] oblast[3130] pokémon[3019] mladost[2932] frankivsk[2892] vimeo[2848] tucumán[2826] subotica[2796] ternopil[2610] sloboda[2568] sachin[2469] wii[2468] banja[2384] sxsw[2308] vadodara[2254] naves[2227] veronika[2216] petar[2209] katarzyna[2188] namco[2134] gabi[2133] cider[2106] hsinchu[2105] pölten[2083] chernivtsi[2040] zala[1937] sudheer[1936] wiz[1929] dk[1900] maratha[1878] marjan[1862] ulica[1856] adria[1796] bandai[1780] uzhhorod[1772] rathore[1742] gdansk[1707] épinal[1704] luv[1692] ua[1681] deconsecrated[1679] satara[1650] gornji[1646] hazrat[1625] jat[1600] gophers[1589] gdynia[1573] ato[1566] gazeta[1511] sanda[1486] möller[1478] yuko[1477] malwa[1461] nizamuddin[1457] mesto[1446] quilmes[1443] scooby[1434] xtreme[1421] chanda[1421] misiones[1410] fk[1404] saša[1404] vita[1394] hoshi[1392] jagir[1387] sopot[1372] choo[1367] pkp[1365] nitro[1356] moti[1340] sonic[1334] chhatrapati[1325] zielona[1320] jp[1312] mahal[1294] kole[1291] arakawa[1287] valjevo[1279] ahmednagar[1266] rajaram[1238] 
topic52=ngc[15404] star[13616] galaxy[13414] observatory[12048] solar[11438] planets[11073] planet[9815] eclipse[9471] constellation[9445] telescope[9154] earth[9134] astronomy[8733] sun[8447] asteroid[7412] magnitude[7253] stars[7208] minor[6669] astronomical[6647] astronomer[6222] discovered[5866] kitt[5595] xo[5532] orbit[5424] cluster[5329] galaxies[5087] mass[4957] stellar[4842] jupiter[4807] dwarf[4684] wsl[4507] cowdenbeath[4457] eclipses[4239] astrophysics[4224] tarun[4077] comet[4030] light[3995] kepler[3970] binary[3950] copulatory[3612] comets[3588] type[3526] pocock[3487] radius[3404] milky[3394] planetary[3384] cet[3338] discovery[3327] exoplanet[3325] observations[3295] galileo[3205] spiral[3187] survey[3166] system[3138] orbital[3114] orbits[3082] discoverers[2982] supernova[2970] prahran[2949] galactic[2930] distance[2878] spectral[2876] universe[2853] apparent[2803] au[2802] nebula[2782] herschel[2758] hubble[2748] hd[2747] visible[2738] temperature[2726] diameter[2684] hr[2651] infrared[2647] orbiting[2630] object[2623] variable[2608] faint[2565] yuka[2555] asteroids[2554] docent[2554] sigma[2524] moon[2464] globular[2456] shrubland[2429] neptune[2415] space[2385] brightness[2381] sn[2377] nasa[2357] gamma[2337] lambda[2332] roofline[2310] objects[2308] galapagos[2280] alpha[2277] anoop[2226] telescopes[2215] sky[2167] eso[2140] catalogue[2139] 
topic53=french[25749] jean[23543] france[14458] marie[14327] pierre[13758] philippe[12212] louis[12060] iaaf[11850] births[11435] françois[10502] count[10425] paris[10372] nationality[10049] events[9469] antoine[9347] jacques[9332] charles[9175] nicolas[8976] la[8541] deaths[8489] marcel[8426] rank[8158] le[8019] alexandre[7824] stade[7766] michel[7442] andré[7205] anna[7069] gallimard[7054] henri[6917] uci[6654] van[6433] gérard[6196] irina[6184] maria[6181] maurice[6139] bnf[6114] denis[6063] belgian[6013] olivier[5988] directed[5935] jeanne[5932] laurent[5929] jules[5823] madame[5770] claude[5752] saint[5614] ekaterina[5505] guillaume[5467] yves[5466] cast[5439] baptiste[5422] albert[5400] éd[5394] luxembourg[5383] furlongs[5322] paul[5310] robert[5306] universiade[5232] du[5177] sophie[5165] anne[4992] bests[4933] haiti[4844] bibliography[4814] rené[4814] married[4795] sainte[4785] submissions[4740] léon[4727] françoise[4713] armand[4698] foreign[4629] louise[4619] opéra[4583] von[4520] joseph[4500] brussels[4461] christophe[4460] georges[4434] germain[4409] frédéric[4401] andrey[4398] haitian[4382] prince[4284] elisabeth[4272] weeknd[4195] madeleine[4063] actes[4062] provence[4056] victor[3986] christian[3910] martin[3879] simon[3878] belgium[3858] sud[3849] andreas[3827] catherine[3800] comique[3792] marguerite[3754] 
topic54=consecrators[14133] consecrator[13244] auxiliary[11339] vietnam[10669] beatification[8909] vietnamese[8829] soundcloud[7535] archdiocese[6823] cambodia[6792] lough[5978] lega[5459] archery[4938] minh[4898] bari[4523] cambodian[4395] suffragan[3764] novara[3731] nguyen[3635] raghu[3566] palermo[3518] phnom[3453] serie[3325] beatified[3325] vicariate[3303] penh[3274] presbytery[3250] bodo[3155] khmer[3138] priory[3094] naver[3086] coolgardie[3085] recurve[3075] dewi[3068] soissons[2924] aspx[2855] hanoi[2852] kampong[2780] santi[2708] thanh[2685] calcio[2644] livorno[2638] catania[2632] venerable[2611] spezia[2609] buttresses[2589] negros[2523] belfry[2520] viet[2391] mac[2296] bareilly[2252] rectory[2227] vlad[2157] avellino[2126] varese[2123] ascoli[2090] rimini[2073] apr[2038] nam[2035] saigon[2019] ancona[1996] hagiography[1991] cagliari[1946] lecce[1943] dijk[1937] chi[1915] hoa[1865] exarchate[1853] ho[1847] ros[1834] mag[1831] mauretania[1820] carlist[1791] sasi[1781] udine[1758] bunga[1748] salerno[1719] uí[1692] laos[1689] malo[1687] huan[1668] severino[1664] harwich[1650] vercelli[1648] cheong[1648] indochina[1640] abbess[1611] kotte[1603] isola[1600] lop[1572] taranto[1570] myra[1560] livio[1543] kartli[1538] messina[1480] ní[1474] quang[1473] albans[1438] alessandria[1436] imperii[1402] progresso[1398] 
topic55=px[71769] rebounds[18292] freshman[14594] discogs[11613] hornets[10175] redshirt[9128] assists[7193] streptomyces[6854] bobcats[6790] steals[4649] aba[3939] gators[3740] ucf[3550] cavaliers[3096] georgetown[2769] conway[2601] lakers[2535] shl[2379] radford[2285] gator[2277] dynamos[2237] pnp[2141] zaria[2130] vmi[2085] hodge[2010] olsson[2001] nfb[1994] pacers[1990] ewing[1930] correia[1928] barros[1920] tor[1789] girton[1763] blazers[1730] donny[1645] ojo[1615] ginebra[1600] gatorade[1600] spg[1567] ima[1498] ridgeway[1452] semifinals[1354] padmanabhan[1326] hakeem[1319] yosuke[1309] polyhedron[1302] mcguire[1300] adidas[1291] efes[1275] scoring[1255] knicks[1236] ade[1205] layla[1173] penicillium[1169] buzzer[1168] phelan[1166] majored[1162] antigen[1135] hv[1120] feni[1111] finke[1100] honeycomb[1085] arifin[1084] antibiotic[1072] nahuel[1059] cohomology[1054] nordre[1028] mussoorie[1028] lejeune[1021] bowers[1018] pero[1016] mage[1005] homotopy[1002] polyhedra[981] hanka[976] nagel[972] lebron[968] guedes[951] vanya[936] cowritten[926] easement[923] ppg[909] glaucoma[875] sabo[875] gainesville[870] mobygames[858] antigens[856] bower[850] ppv[850] bharatpur[849] verdier[842] fgm[839] tallied[836] stainton[828] iverson[826] movimento[813] augsburger[793] schooler[790] ketone[790] romberg[779] 
topic56=king[43978] castle[30534] son[29120] prince[21403] ottoman[21204] empire[20991] emperor[19626] dynasty[19399] reign[18713] kingdom[17673] battle[15445] princely[14860] sultan[14258] ruler[13920] clan[13605] brother[13317] throne[12781] ruled[12726] daughter[12670] sources[12588] bc[12225] governor[12115] succeeded[11997] iii[11803] count[11731] father[11706] byzantine[11670] ce[11662] queen[11425] imperial[11397] royal[10656] kirkus[10163] princess[10148] army[10109] title[8827] treaty[8485] rajput[8472] fortress[8421] siege[8378] inscription[8357] married[8309] noble[8255] mentioned[8167] wife[8096] according[8055] khan[7994] palace[7994] sons[7884] tribe[7788] pasha[7757] rebellion[7724] rule[7605] military[7603] killed[7522] shah[7500] defeated[7309] duke[7274] safavid[7235] mother[7206] gujarat[7081] iv[7067] lands[6761] sent[6752] period[6695] chief[6638] ancient[6572] forces[6537] rulers[6399] crown[6192] kings[6189] captured[6163] court[6111] land[6100] consort[5992] medieval[5964] region[5919] capital[5906] bce[5899] probably[5879] troops[5831] sultanate[5684] conquest[5662] constantinople[5622] armenian[5595] georgian[5503] territory[5494] descendants[5464] roman[5439] bibliography[5433] revolt[5431] tbilisi[5377] successor[5363] led[5310] heir[5294] raja[5292] tribute[5289] ottomans[5242] town[5236] fort[5210] lord[5206] 
topic57=police[53815] women[53746] court[50040] law[44328] rights[41621] act[31618] prison[30057] case[29132] said[27749] political[27647] arrested[26013] party[23626] cannabis[23173] movement[23086] justice[21937] legal[21594] criminal[20966] president[20236] security[19543] union[19465] anti[19453] killed[19263] civil[18308] social[18277] violence[18180] murder[17823] attack[17412] children[17097] activist[16755] human[16734] investigation[16437] supreme[16358] organization[16272] workers[16214] men[16101] support[15808] crime[15741] trial[15604] stated[15598] we[15350] victims[15134] should[14555] authorities[14407] laws[14316] sentenced[14316] minister[14231] sexual[14217] freedom[14194] according[14142] committee[14129] accused[14099] cases[14089] right[14060] reported[14054] communist[13925] report[13846] officers[13597] claimed[13544] sex[13542] woman[13525] black[13475] protest[13417] military[13294] led[13251] media[13042] federal[12957] protests[12784] involved[12734] campaign[12555] decision[12538] lgbt[12482] arrest[12406] news[12193] incident[12060] strike[11974] community[11953] constitution[11912] officials[11865] illegal[11858] gay[11673] without[11601] council[11333] african[11292] prisoners[11200] sent[11138] commission[11094] charges[11059] newspaper[11055] camp[11029] leader[10947] person[10909] working[10897] department[10781] months[10748] issues[10725] article[10715] gender[10605] even[10581] peace[10566] bill[10543] 
topic58=scottish[61501] scotland[38163] edinburgh[33734] glasgow[31158] dundee[14221] aberdeen[13611] stirling[8945] thistle[8823] celtic[8718] dumbarton[8159] falkirk[8037] rangers[7864] dunfermline[7271] morton[6729] clyde[6309] loch[6286] hamilton[5689] queen[5447] clydebank[5193] albion[5153] inverness[4936] scots[4636] leith[4547] cairn[4362] lanark[4193] snp[4153] sutherland[4108] irvine[4049] berwick[3971] ayrshire[3930] highland[3910] argyll[3858] greenock[3855] alef[3854] sidings[3849] gaelic[3802] livingston[3684] galloway[3629] woolwich[3563] thomson[3533] blyth[3379] macdonald[3337] heart[3282] lothian[3152] arbroath[3139] gaels[2984] mackay[2963] fusiliers[2942] highlanders[2916] shetland[2876] flinders[2839] montrose[2821] hay[2792] ross[2786] hussars[2630] gazetteer[2599] faisalabad[2585] buchan[2573] orkney[2570] dunbar[2520] forsyth[2502] caledonian[2496] lanarkshire[2491] watt[2465] wizkid[2464] millar[2424] moray[2413] dumfries[2399] stela[2386] swinton[2360] bertie[2323] giza[2309] ratan[2285] dundas[2275] burgh[2264] fixtures[2263] andrews[2239] maitland[2236] aldershot[2213] infirmary[2191] balfour[2170] mckenzie[2163] blackwall[2129] régiment[2097] paisley[2093] melville[2069] barr[2043] emus[2037] guthrie[2031] macleod[2022] dela[2022] abercromby[2007] grafton[1992] blackwater[1988] caithness[1948] skye[1940] wingate[1939] strathclyde[1898] kerr[1878] baird[1830] 
topic59=gastropod[19895] snail[18727] brownish[17204] mandal[14931] fk[14810] dass[11818] spartak[8635] attains[8418] kathiawar[6662] ferruginous[5445] junagadh[4835] spacewatch[4228] trampoline[3803] dir[3669] saha[3441] submarginal[3178] manju[2974] bhavnagar[2837] bruckner[2815] baroda[2757] kantor[2573] hordaland[2556] iow[2548] nagendra[2488] ist[2481] yellowish[2432] birla[2380] lokesh[2370] preto[2344] poulenc[2312] kavi[2195] ndr[2180] sandown[2167] northam[2147] tarento[2109] mys[1991] eel[1924] catalina[1904] mau[1894] unplaced[1894] sohn[1892] longitudinally[1846] pavan[1830] të[1823] sudhakar[1807] wroclaw[1800] figs[1754] aerobatic[1746] gott[1738] uns[1730] partizani[1723] molluscs[1698] iridescent[1686] unser[1672] darter[1667] baumann[1659] freund[1638] alles[1637] mathur[1624] dürr[1617] sl[1603] centimeters[1592] graphene[1564] ryde[1561] schröder[1553] magazin[1542] fröhlich[1530] schauer[1528] bicolor[1526] sucker[1516] dich[1498] trond[1465] kanwar[1457] trøndelag[1455] wight[1430] ikaw[1410] förster[1404] gedichte[1390] ribbed[1385] geckos[1381] goby[1370] guppy[1368] hoff[1329] tereza[1327] riffles[1320] vidarbha[1305] laramie[1297] hau[1290] wingfield[1284] spots[1272] similis[1261] aerobatics[1260] hurwitz[1245] pumila[1240] rad[1225] ornata[1212] translucent[1211] kleiner[1206] stingray[1200] spirally[1197] 
topic60=chemical[16136] acid[13703] chemistry[13003] organic[10597] compound[10029] reaction[9887] synthesis[9552] carbon[7907] mineral[7890] compounds[7635] mnet[7138] acetate[6764] minerals[6709] ohl[6649] formula[6460] ch[6084] reactions[5807] hydrogen[5689] sodium[5420] structure[5331] newnham[5046] atoms[4973] methyl[4811] molecules[4732] ester[4562] gravelly[4526] oxide[4206] sulfate[4086] chloride[4062] nitrogen[4033] și[3964] molecular[3930] properties[3929] acids[3882] oxygen[3875] metal[3821] ion[3791] ovoid[3760] salt[3716] sulfur[3693] chemist[3662] pv[3611] atom[3597] oxidation[3544] ions[3540] liquid[3537] cl[3519] fluoride[3499] potassium[3393] enzyme[3387] solid[3342] crystal[3335] synthetic[3249] bond[3238] în[3182] electron[3163] phosphate[3139] polymers[3068] ligand[2983] molecule[2975] premios[2973] crystals[2925] solution[2922] lithium[2912] lng[2896] dota[2849] cobalt[2832] water[2782] coli[2768] complexes[2758] mg[2744] temperature[2725] iron[2719] na[2699] gsk[2672] complex[2638] abl[2631] nickel[2615] pharma[2553] ether[2549] dioxide[2540] synthesized[2535] gheorghe[2503] chromosome[2487] uranium[2474] catalyst[2460] salts[2458] metals[2451] soluble[2391] ring[2388] derivatives[2387] cu[2378] bromide[2358] copper[2355] zinc[2344] gas[2343] grigore[2295] nitrate[2290] catalytic[2284] icm[2280] 
topic61=cdp[12920] wine[10532] dinamo[9196] inseason[8298] inseries[7959] friendlies[7098] cska[6664] libero[5700] albion[5419] obispo[5218] winery[4988] hove[4730] fernandes[4675] northampton[4599] fulham[4481] notts[4238] chul[3927] sathya[3631] loughborough[3550] wines[3539] bournemouth[3448] gillingham[3298] muñoz[3252] luton[3245] swi[3228] everton[3221] marítimo[3113] hyaline[3050] grape[2930] bromwich[2898] kaiserslautern[2759] solberg[2643] vineyard[2465] cletus[2389] pref[2328] ava[2293] éditeur[2272] foursquare[2182] coursed[2180] shipley[2179] westlake[2164] tulloch[2138] hotspur[2127] golson[2116] soria[2088] badajoz[2079] vineyards[2068] maur[2036] hallam[2016] morecambe[1993] tóth[1957] metzger[1946] quilts[1926] ondrej[1919] glossop[1885] sportif[1883] ángela[1878] kimball[1828] rfa[1827] marriott[1817] schalke[1816] grapes[1815] cáceres[1804] boldklub[1790] grays[1783] bainbridge[1738] ashfield[1714] leiria[1700] roldán[1679] heck[1676] domaine[1666] merritt[1644] emerita[1629] napa[1541] barış[1539] howland[1533] tain[1531] apuestas[1530] hooch[1506] strýcová[1485] marwar[1475] blagoevgrad[1473] robles[1472] moline[1456] agostini[1455] rolland[1455] uesugi[1450] cellar[1439] burbank[1438] germantown[1429] pavol[1417] simi[1390] crum[1388] bsk[1383] linnea[1367] talavera[1366] chitti[1359] virgilio[1355] hitchin[1355] navarra[1339] 
topic62=nepal[23637] nepali[10959] grevillea[9052] swiss[8915] kathmandu[7680] basel[7447] canton[7003] nepalese[6466] bogotá[5626] rana[5335] mendis[4567] bahadur[4414] grimsby[4340] coins[4262] switzerland[4250] gopi[4236] coin[3906] bern[3872] thapa[3726] ukip[3501] roshan[3406] malla[3181] zürich[3148] duleep[3104] terriers[3017] medellín[2860] akash[2662] nasl[2605] hazare[2519] bochum[2448] argyle[2440] bundestag[2324] aomori[2312] annapurna[2308] volley[2282] nrw[2204] jäger[2134] kiran[2113] leb[2113] mint[2078] boyacá[2069] tiempo[2056] lucerne[2025] germaniawerft[1963] pratap[1946] gorkha[1915] oeste[1903] colombian[1899] bretagne[1894] kunwar[1892] tranmere[1888] laxmi[1850] banknotes[1812] farooq[1809] plon[1772] joakim[1737] domínguez[1698] bhanu[1676] gurung[1670] maoist[1641] minted[1640] pml[1626] redhawks[1622] zug[1600] wycombe[1596] cantons[1564] jons[1551] sita[1546] tapia[1546] durbar[1528] cundinamarca[1527] venegas[1525] kanazawa[1514] cantonal[1511] banknote[1508] leduc[1499] aeg[1476] socorro[1453] siècles[1436] caldas[1436] bikram[1412] sme[1394] antioquia[1380] yala[1364] baig[1364] carril[1355] cali[1353] worthing[1341] schmid[1330] rampur[1322] paisa[1315] hampden[1284] lalitpur[1282] restrepo[1281] jaeger[1280] jutra[1280] zeiten[1260] uribe[1259] tunja[1255] pati[1250] 
topic63=radio[68678] fm[63157] station[44147] tv[39087] channel[38915] news[32102] television[27641] broadcasting[25322] broadcast[25274] am[23622] owned[21525] network[17681] programming[14418] format[13922] pm[13739] stations[13247] show[12132] grupo[11974] naia[11804] watts[11577] program[10676] broadcasts[10675] media[10117] mhz[10040] khz[10025] sports[9092] licensed[8948] bbc[8933] digital[8918] channels[8877] programs[8488] host[7700] cable[7612] hossein[7311] evansville[7261] hosted[7061] carries[7029] launched[6812] nuytsia[6754] paraglider[6696] purdue[6509] abc[6414] cbc[6393] aired[6384] hd[6268] facebook[6082] https[6058] utep[5974] airs[5959] talk[5836] anchor[5717] ultralight[5612] satellite[5589] tehran[5566] air[5529] broadcaster[5450] cbs[5336] coverage[5310] transmitter[5221] nbc[5189] communications[5084] qom[5045] frequency[4997] entertainment[4983] shows[4893] operated[4837] saas[4779] programme[4739] boilermakers[4695] maxi[4643] live[4476] morning[4428] roubaix[4351] ary[4260] jtbc[4256] networks[4238] call[4146] fcc[4128] cnn[4115] sold[4103] current[4102] conus[4098] sony[4082] fox[4082] daily[4067] affiliate[4050] sky[4035] moved[3962] power[3900] quiz[3890] programmes[3882] esteghlal[3875] broadcasters[3774] www[3678] hour[3648] newspaper[3621] npsl[3617] corporation[3603] sign[3589] monday[3469] 
topic64=estonian[19662] pld[18647] pts[18624] tallinn[10902] gf[9847] estonia[8625] ga[8340] sheeran[6029] pim[5582] jamaica[5159] tartu[5101] diff[4776] sanremo[3993] srinivas[3886] jamaican[3858] chekhov[3838] reggae[3643] jeeves[3430] santosh[3315] anuradha[3281] margarete[3227] ska[3050] eesti[3003] karla[2943] dmytro[2924] tatiana[2881] chemnitz[2789] luhansk[2766] ivanovich[2760] gertrud[2721] ucd[2716] meri[2693] rsfsr[2684] shelly[2498] maroons[2472] dancehall[2459] kulkarni[2425] stepan[2350] galina[2313] gorky[2302] andriy[2299] nikolay[2231] meistriliiga[2148] erdmann[2145] mikkel[2134] boland[2120] shelbourne[2105] wooster[2030] duisburg[2009] popov[1964] guelph[1931] uyezd[1930] dtv[1920] от[1889] manne[1874] nayak[1843] ceramist[1837] semyon[1825] jaan[1821] kavya[1803] decca[1783] frauen[1778] sociedad[1774] года[1729] mishra[1728] webby[1716] fyodor[1708] oriel[1692] krauss[1684] rté[1678] milly[1676] placings[1663] vogel[1653] zwickau[1642] montpelier[1632] carrasco[1619] ifa[1615] verónica[1615] styne[1592] shakib[1581] artiste[1549] bolshoi[1529] bola[1508] bernt[1471] elmo[1454] eupen[1422] paulson[1387] shakey[1383] metcalfe[1375] brașov[1367] alka[1358] grigory[1354] arkhangelsk[1352] komsomol[1347] volkov[1346] yuna[1334] raghav[1331] bassey[1329] suzana[1323] televote[1319] 
topic65=turkish[39563] turkey[30760] cypriot[16832] istanbul[15260] ankara[10336] cyprus[9680] eparchy[6940] sarajevo[5938] zmir[5443] mehmet[4432] bendigo[4323] mustafa[4304] belediyespor[4257] adana[3956] chp[3708] viic[3594] ottoman[3553] ballarat[3415] krasnodar[3327] atatürk[3302] ahmet[3287] kemal[3255] limassol[3249] izmir[3233] spor[3069] gippsland[2781] masjid[2737] alp[2693] saeed[2685] konya[2675] ashraf[2600] marmara[2591] akp[2535] yemeni[2518] goulburn[2516] geelong[2500] anatolia[2441] kara[2436] bey[2431] eskişehir[2375] stv[2359] adil[2354] sana[2354] gauchos[2324] habib[2316] aoi[2273] siddiqui[2267] tff[2209] abubakar[2188] cochin[2185] bracknell[2185] trivandrum[2149] trabzon[2139] gazi[2121] swat[2116] sudhir[2080] alam[2070] province[2070] ali[2048] balıkesir[2044] podemos[2021] homs[1996] larnaca[1995] emre[1968] anatolian[1965] yekaterinburg[1942] batumi[1925] travancore[1924] mosque[1916] faisal[1910] pasha[1906] kaya[1865] junín[1864] shabab[1859] manchukuo[1819] khalil[1803] salam[1803] shahid[1795] shaheen[1785] adel[1776] faiz[1772] anadolu[1759] mhp[1759] hasan[1758] sahel[1756] oakes[1749] lucía[1693] samsun[1691] shaikh[1674] dsb[1650] nanterre[1636] türk[1626] kayseri[1615] gulshan[1614] vizier[1609] muğla[1606] cecilie[1601] juba[1594] brookmeyer[1591] mian[1590] 
topic66=fresno[13764] greyhound[10260] javelin[7619] dragonfly[7386] steroidal[5100] fullerton[5020] nguyễn[4844] bieber[4753] dog[4131] renard[4106] rohit[4029] dogs[3619] bearcats[3106] putra[2677] fawn[2485] vibraphone[2452] hwang[2433] sinaloa[2413] văn[2347] kamala[2200] mondo[2179] stetson[2054] klamath[2018] trần[1984] tonbridge[1900] hoàng[1895] viswanath[1878] marin[1857] greyhounds[1838] hare[1833] mcgowan[1773] gotra[1742] prodrug[1725] grimes[1702] spars[1702] zedd[1701] kern[1641] nik[1580] не[1551] hawker[1537] philpott[1509] showbiz[1504] trouser[1485] bonny[1472] sykes[1471] mavis[1468] markings[1458] medico[1440] bonita[1440] csu[1438] sequoia[1434] jui[1412] hanley[1398] hamer[1398] hoosier[1386] sap[1382] stitt[1377] tailplane[1358] bland[1353] yosemite[1352] gia[1343] furlong[1340] youn[1319] grocer[1319] kjeld[1311] ngọc[1295] furies[1271] tuolumne[1267] hồ[1246] leek[1246] spar[1232] phạm[1230] plywood[1195] orb[1191] roadrunner[1191] kramer[1184] jarman[1182] pandan[1182] kokomo[1176] chaz[1160] vanna[1144] monserrat[1140] inglewood[1134] beale[1127] ultron[1118] mattress[1112] joaquin[1112] netto[1105] wylie[1105] hickman[1095] hutchings[1078] davie[1070] shimmy[1056] allman[1052] lederer[1037] kaan[1036] fearless[1035] bessie[1021] tunstall[1015] linh[1012] 
topic67=paralympics[21536] ccaa[18081] agder[12558] brentford[6189] palearctic[4950] på[4141] pinkish[3514] islet[3361] exo[3350] metalcore[3151] manuela[2935] androgenic[2906] commodores[2860] odham[2859] gotland[2810] gowda[2685] figueroa[2520] kafr[2310] loma[2138] petersen[2125] pinoy[1999] superfast[1990] shinee[1972] streaked[1967] siempre[1930] nassar[1923] mep[1921] millennials[1881] shinde[1859] bajwa[1840] michaud[1801] boyband[1801] ssp[1773] julián[1753] señor[1738] heraklion[1727] hutt[1715] quartzite[1708] voronezh[1696] sculpin[1619] egger[1612] moser[1593] muerte[1586] combinatorics[1561] eugenia[1526] dorsally[1501] öztürk[1483] pineda[1479] indica[1472] omg[1465] mcalister[1410] vermillion[1396] nunatak[1389] capsid[1377] marquez[1356] aggarwal[1333] oulu[1313] jono[1312] amigos[1303] tijuca[1300] coker[1297] rethymno[1290] farrugia[1289] highbury[1268] alcalde[1261] garay[1218] janka[1217] aves[1192] bahía[1158] carvajal[1156] inermis[1153] quintero[1149] keating[1140] sombra[1131] sobre[1122] brevard[1118] yuval[1117] bohol[1110] lua[1102] acuña[1098] mackey[1096] cen[1087] acero[1080] alon[1069] greifswald[1065] downie[1062] rosalia[1059] vamos[1048] cayley[1047] futuro[1045] fama[1035] macgyver[1016] sangre[1014] byers[1007] småland[1004] nati[990] burch[965] viralzone[964] marañón[952] ofi[948] 
topic68=bandcamp[8826] jee[7305] tibetan[5637] mixtapes[5307] wd[5126] volkswagen[5120] mercedes[4802] benz[3995] psa[3798] taichung[3758] mahindra[3749] jeon[3620] opel[3555] lama[3330] navajo[3116] romeo[3096] tibet[3057] alfa[2825] xd[2696] mbs[2675] sewanee[2646] hyundai[2572] lego[2356] jaguar[2261] volvo[2229] changchun[2197] katya[2185] bts[2152] muthu[2112] chrysler[2079] lms[2077] vyas[2055] myeon[2051] amg[2029] khao[2016] gakuin[1993] tesla[1971] lubin[1921] turbo[1887] étoile[1887] supercar[1818] hokkaidō[1799] yamagata[1771] jinja[1763] subaru[1748] rinpoche[1723] lj[1722] lamborghini[1699] wolfsburg[1693] fader[1670] aiff[1669] albarn[1667] soko[1658] ods[1658] shao[1651] mitsubishi[1643] evgeniya[1638] erc[1631] surendra[1621] stig[1617] aberdeenshire[1577] mes[1577] rst[1569] dalai[1559] chameleon[1558] suwon[1536] zou[1519] ehime[1511] jeonju[1505] kuroda[1495] sema[1473] suan[1462] cummins[1450] mino[1430] anju[1423] muay[1409] kalu[1404] tiverton[1393] tomo[1392] roxanne[1392] wg[1380] fo[1354] bap[1343] mgr[1339] wheelbase[1335] cuentos[1310] huntly[1293] hsing[1292] knowle[1281] wyeth[1278] cmc[1273] naz[1272] jairam[1270] gl[1263] ordinariate[1255] trayvon[1253] kho[1250] comer[1249] gorillaz[1249] falcone[1208] 
topic69=ireland[53962] irish[42049] dublin[25400] hockey[24202] uci[22086] usl[20943] cork[19614] blotch[13495] munster[12579] galway[12321] gaelic[12242] icelandic[12101] leinster[12005] nhl[11835] pts[11596] tipperary[11523] limerick[11309] ice[11280] gp[10632] rakyat[10524] championship[10345] senior[10234] ulster[9570] dewan[8960] kilkenny[8806] bn[8768] gaa[8541] belfast[8297] lokomotiv[8061] waterford[8016] mayo[7571] connacht[7192] townlands[7123] goaltender[7102] iceland[7009] meath[6897] barony[6702] cavan[6699] sanath[6176] umno[6126] concacaf[5986] pdl[5949] kerry[5847] ie[5659] totals[5371] bruins[5195] derry[5158] donegal[5047] clare[4937] wexford[4898] sligo[4891] greenlandic[4717] playoffs[4423] antrim[4396] offaly[4361] neill[4359] dap[4319] kildare[4247] westmeath[4239] wicklow[4226] tyrone[4212] agg[4174] longford[4088] dundalk[4066] goalkeeper[4014] schuckert[3990] ofoverall[3888] cyclo[3816] kickers[3784] whl[3680] armagh[3652] monaghan[3620] roscommon[3554] patrick[3504] keane[3439] larsson[3395] na[3384] carlow[3338] geylang[3310] bk[3301] louth[3295] persson[3289] flyers[3133] muda[3104] debutant[3077] reykjavík[3068] ik[3059] siddique[3046] kells[3038] otl[3026] antigua[3010] gf[2962] leitrim[2946] mohamad[2907] toros[2887] mac[2846] mahathir[2835] connell[2819] kampong[2819] onn[2771] 
topic70=meyrick[28883] ghana[19003] mexico[15057] stigmata[12377] mexican[11865] pakistan[10731] ghanaian[9260] arunachal[8402] karachi[7206] sindh[6834] svg[6384] méxico[5867] accra[5295] khyber[5281] pakhtunkhwa[5132] kalan[4968] aztecs[4933] manipur[4871] lahore[4811] stenoma[4497] veracruz[4447] cantonment[4338] puebla[4098] ciudad[3749] ruiz[3746] monterrey[3713] chak[3656] mujer[3620] jalisco[3617] álvarez[3529] yucatán[3462] khurd[3431] gila[3402] nagaland[3311] mizoram[3283] peshawar[3195] pima[3170] sindhi[3167] icon[3144] volta[3114] chiapas[3076] artes[3073] michoacán[3044] stabling[3033] quetta[3032] racecourse[3001] unidos[2963] chihuahua[2949] escobar[2882] paso[2881] ayala[2871] erdoğan[2845] rubio[2841] cebu[2763] sonora[2722] lès[2718] balochistan[2701] baloch[2644] belize[2624] coahuila[2619] azam[2583] valdés[2570] pavn[2533] marne[2505] maya[2503] rawalpindi[2482] bagh[2458] hacienda[2456] mahavidyalaya[2413] southgate[2410] kabaddi[2396] hidalgo[2366] gonzalez[2324] kst[2317] rawat[2273] guadalupe[2245] recep[2244] acosta[2242] morelos[2241] querétaro[2193] punjab[2175] tripura[2165] nieto[2144] guanajuato[2120] meghalaya[2083] bellas[2074] ascot[2054] raphaël[2040] guerrero[2017] matheus[1985] cárdenas[1965] mexicano[1956] salas[1951] foaled[1933] mandi[1921] renaud[1914] pumas[1912] contreras[1886] isidro[1883] pueblo[1883] 
topic71=fascia[12441] bergfelder[9107] xu[7879] spotify[4734] ferrari[4615] matsumoto[3660] mei[3586] jia[3303] sina[3183] luo[3151] ueda[2971] yamada[2967] meera[2882] suresh[2694] shimizu[2670] asom[2595] lazer[2498] lola[2497] yoshida[2425] saroja[2403] ren[2393] longhorns[2390] hana[2380] chou[2311] zhen[2300] maserati[2285] ning[2272] miki[2224] vcu[2155] aoki[2091] vk[2084] amon[2062] peuples[2028] masaki[2009] japanese[1974] dandenong[1962] vauxhall[1916] qiao[1899] prêmio[1857] nico[1843] horan[1803] midwife[1787] kenji[1763] hashimoto[1742] qiu[1733] lian[1728] midwives[1724] mercedes[1711] jt[1698] daisuke[1679] midwifery[1670] natsu[1636] inoue[1625] bu[1615] dougherty[1613] bugatti[1596] jurek[1593] cctv[1586] hj[1562] sasaki[1560] hà[1539] lifes[1515] wada[1511] masahiro[1508] jacky[1485] chiu[1475] hom[1471] toho[1469] murtagh[1468] otoko[1462] hou[1419] kon[1408] tatort[1403] elim[1372] toki[1359] takashi[1356] itō[1352] berlinale[1348] linkin[1340] brabham[1333] nishikawa[1332] satoshi[1328] takeshi[1327] ichikawa[1323] ananya[1300] aligarh[1296] yo[1294] kadokawa[1289] mahalakshmi[1287] kazuki[1281] chantal[1278] satomi[1276] dadi[1263] yoshio[1258] immortals[1257] jma[1253] nakagawa[1243] cui[1239] uchida[1233] teasers[1228] 
topic72=romanian[22300] romania[12506] bucharest[11770] mohd[7098] balu[6943] rogaland[6286] jalan[6021] esports[5266] taman[5248] iași[4728] alexandru[4554] ion[4471] moldova[4300] moldovan[4270] nicolás[4049] constantin[3836] biathlon[3800] nicolae[3741] chișinău[3174] odessa[3036] bø[2777] vasile[2746] revista[2712] manolo[2657] galician[2638] kotor[2629] nordland[2583] lugo[2503] voz[2501] nadezhda[2495] música[2472] grete[2467] carioca[2384] oficial[2271] transylvania[2270] compostela[2252] marín[2240] galați[2221] mihail[2196] perdana[2172] sevilla[2167] yeovil[2141] ediciones[2051] karlsson[2050] siti[2025] editura[2012] amador[1962] haugesund[1953] liceo[1952] din[1950] tawny[1938] pula[1924] ortiz[1923] vigo[1921] borja[1913] tanjung[1885] galicia[1849] misaki[1834] mircea[1826] bogdan[1821] filho[1811] baru[1810] franziska[1808] nag[1807] rojo[1807] duda[1804] veiga[1800] aik[1796] otero[1791] dimitrie[1783] ríos[1782] seng[1773] dato[1757] moldavia[1751] rika[1731] hikari[1712] elin[1700] montoya[1695] mota[1677] rogelio[1674] televisión[1670] katja[1664] pamplona[1663] hulu[1659] ștefan[1653] jasmin[1653] merlo[1647] ulla[1639] svendsen[1627] viața[1618] salgado[1605] nilsson[1599] kahani[1597] tiraspol[1596] roque[1590] moldavian[1578] ustad[1568] besar[1559] blanco[1557] celta[1555] 
topic73=racing[61878] race[52831] tour[43002] golf[25206] ret[24978] championship[24516] stage[23978] car[23081] driver[19581] colspan[19181] ford[16548] prix[16362] speedway[15767] motorsport[15547] sprint[15405] laps[15243] formula[14896] points[14772] chevrolet[14642] cycling[14120] road[14077] races[13845] lap[13593] motorsports[13454] championships[13411] undrafted[12674] classification[12343] px[12093] pga[12073] grand[12026] gt[11409] standings[11316] honda[11276] overall[11150] rowspan[11090] finished[11048] results[10720] trial[10558] bib[10259] nd[10120] open[10103] rd[10087] circuit[10068] cars[9856] drivers[9730] cup[9287] motorcycle[9269] motor[9223] finish[9189] rally[9181] winner[9175] riders[8832] nascar[8759] strokes[8631] pos[8620] toyota[8490] classic[8431] renault[8121] cc[8101] event[7862] bmw[7831] racer[7278] track[7090] fastest[7041] pole[7034] course[6978] wins[6966] par[6767] challenge[6562] porsche[6319] class[6291] champion[6288] km[6245] stages[6048] gp[5967] holden[5914] giro[5890] heats[5783] raced[5568] driving[5510] raceway[5507] fia[5216] nhra[5136] speed[5135] european[5107] peugeot[5065] finishing[5057] professional[5028] favourite[4995] jersey[4977] ahl[4935] wrc[4766] teams[4738] pf[4702] winning[4694] nissan[4692] audi[4664] pro[4575] pepperdine[4549] bests[4547] 
topic74=wk[10860] barron[8910] netflix[7235] quoins[6680] chloe[6259] mercer[5273] crosby[5172] mcfarland[4816] hart[4752] grimm[4567] macerata[4563] evan[4362] wnbl[4272] rhinos[4034] newman[3969] drake[3852] fiddle[3797] synths[3630] thrones[3624] yds[3618] jerome[3595] maggie[3547] showtime[3519] dexter[3507] trumpeter[3501] polydor[3457] george[3435] sundance[3425] jimmy[3355] stars[3282] nash[3221] melton[3185] hammerstein[3147] tatjana[3147] savage[3144] irving[3135] certifications[3121] gordon[3114] doll[3097] harry[3064] harold[3062] vampire[3056] jazz[3056] tenor[3037] cast[3025] clapton[3017] frank[3014] peterson[3004] arlen[2994] ballard[2985] mack[2956] mccartney[2945] tomatoes[2918] purcell[2916] mastered[2915] joe[2894] ref[2890] meghan[2859] johnny[2855] nickelodeon[2849] flanagan[2841] wook[2793] finn[2768] saxophone[2748] cw[2743] roar[2743] nme[2739] berklee[2718] jonny[2708] weir[2692] rebecca[2691] django[2678] billy[2677] carlson[2672] lucifer[2669] jack[2667] thom[2649] jill[2644] brennan[2641] teddy[2636] lennon[2623] trainor[2620] corbin[2615] maynard[2611] duggan[2577] reeves[2563] ned[2551] walton[2545] lew[2542] torrens[2539] boogie[2533] clarinets[2531] starr[2516] parker[2508] alright[2506] patterson[2480] kenton[2468] freddie[2464] jazztimes[2459] graham[2449] 
topic75=align[68629] myrtaceae[26487] weightlifting[25240] jerk[22803] snatch[22460] tbd[20073] weightlifter[13566] purplish[13000] text[11057] myrtle[8911] till[8443] bar[8019] right[7879] color[6504] powerlifting[6354] ranchi[5916] style[5700] lbs[5065] kitts[4731] width[4163] id[3906] kalpana[3860] bodybuilding[3752] nevis[3638] roundish[3621] hijo[3444] olympia[3365] alyssa[2909] longlisted[2705] ifbb[2684] oligocene[2679] value[2636] schultz[2551] arild[2450] gills[2442] figwort[2398] pettis[2318] squat[2268] godoy[2196] rubiaceae[2172] rafał[2118] legend[2117] baena[2083] naeem[2071] iwf[2052] valentín[2020] arriba[2020] boulenger[2013] kongsberg[2011] nikolov[1923] oskaloosa[1920] shibuya[1902] bakker[1884] lillie[1840] height[1809] paleocene[1784] dinos[1755] damián[1742] shortlisted[1703] rgb[1695] sattler[1643] verdugo[1615] keselowski[1596] yuichi[1577] toth[1567] cosplay[1559] minuta[1551] bodybuilder[1516] barra[1516] papi[1515] fenn[1515] lusk[1508] itis[1507] variably[1487] center[1478] kristoff[1471] shading[1467] carthy[1463] maite[1460] cornejo[1426] malu[1400] sanyal[1398] hibiscus[1396] portela[1385] cauvery[1384] teamsters[1365] velvety[1345] orientation[1333] increment[1323] girdle[1320] physique[1318] tahar[1310] megami[1294] salar[1293] seale[1274] ninjas[1247] lykke[1243] timeaxis[1226] lomond[1224] wildcards[1218] 
topic76=hurling[13052] tanzania[8050] nk[6892] kaur[5747] doha[4922] dar[4595] radiata[4578] judoka[4399] haque[4060] mehdi[3454] llb[3441] makerere[3349] bundaberg[3290] salaam[3110] amman[2958] tanzanian[2911] es[2723] bou[2442] kyu[2252] izumi[2194] yui[2190] hnk[2145] majid[2102] zanzibar[2043] gorica[2013] hamad[1981] meerut[1968] osijek[1961] kano[1845] sava[1827] reece[1805] armbar[1799] aydın[1795] juma[1764] abdulla[1763] bagrat[1709] parsonage[1674] ilija[1612] saki[1595] takeda[1570] naturelle[1475] spurius[1449] hervé[1448] anderlecht[1426] prins[1424] paras[1419] barquisimeto[1382] bamba[1347] zolder[1338] meagher[1328] koda[1303] kaneko[1301] sahil[1299] liwa[1299] miura[1298] thorp[1287] unnikrishnan[1283] haiku[1261] ippon[1237] aga[1223] baghdadi[1223] leica[1216] ragnhild[1209] orrell[1204] asst[1168] roni[1162] charleville[1162] mahmood[1146] kühne[1133] waza[1130] amalfi[1117] tori[1106] charleroi[1092] lancia[1089] grote[1080] marcelino[1078] masa[1076] suzan[1068] castroneves[1061] haren[1055] wickham[1054] melford[1052] taos[1048] tanganyika[1046] manama[1046] kyoko[1039] bronte[1036] ahsan[1035] oni[1034] silke[1007] saigo[1007] swahili[1007] insecta[994] taku[986] audun[982] pastore[975] berchem[969] sugimoto[957] umag[953] fortes[951] 
topic77=norwegian[55424] israel[37991] norway[35185] israeli[33172] oslo[22689] aviv[17411] tel[17358] hapoel[15719] jerusalem[14067] og[13351] palestinian[12041] lebanon[11778] palestine[11298] bergen[9468] fjord[8950] beirut[8821] lebanese[8499] saskatchewan[8314] thorell[8223] stavanger[7957] cfl[7828] syrian[7775] haifa[7772] jewish[7571] hebrew[6959] olav[6782] jordanian[6545] trondheim[5323] calgary[4956] hye[4907] norsk[4553] levi[4482] roughriders[4477] tuc[4284] argonauts[4249] cadastral[4198] gaza[4115] telemark[4106] wnit[4074] haddad[4002] edmonton[3949] winnipeg[3901] eskimos[3761] knut[3661] tromsø[3624] bet[3604] arab[3589] norske[3560] nrk[3540] saskatoon[3474] brampton[3465] stampeders[3463] tikva[3450] petah[3445] mads[3242] idf[3062] beit[3034] moshe[3030] cohen[3015] kirke[3010] ammonite[2994] jaffa[2963] zionist[2885] ole[2846] sham[2825] municipality[2774] regina[2736] bjørn[2734] knesset[2717] alouettes[2715] roadrunners[2687] byes[2682] assad[2655] cats[2634] kristiania[2603] kiryat[2571] arne[2511] netanya[2501] christiania[2484] paus[2467] kjell[2428] johanne[2417] inger[2411] hamas[2371] copse[2370] redblacks[2362] helge[2347] melkite[2330] sidon[2265] bhaskaran[2233] sverdlovsk[2232] iaa[2228] palestinians[2227] terje[2217] nazareth[2173] redistributed[2135] artzit[2110] storting[2099] hezbollah[2098] johansen[2096] 
topic78=ship[63684] navy[54900] ships[40412] boat[34181] naval[31140] vessel[24679] submarine[23670] dnf[22584] islands[21722] class[21198] hms[20871] fleet[18668] vessels[17600] port[17165] gun[16738] sea[16412] guns[16058] crew[15820] island[15657] launched[15257] sailing[15238] admiral[14945] boats[14825] royal[14210] lst[13678] torpedo[12938] submerged[12881] hull[12702] coast[12636] captain[12133] cargo[12081] citations[12031] maritime[12022] papua[11786] commissioned[11464] patrol[11224] shipyard[10999] sunk[10797] sailed[10565] command[10486] french[10421] ss[10338] destroyer[10195] guinea[10103] speed[9977] bay[9885] voyage[9594] laid[9510] convoy[9311] heatseekers[9289] marine[9103] type[9038] pacific[9015] keel[8934] submarines[8745] shipping[8714] tons[8526] cruiser[8484] beam[8289] fitted[8133] uss[8024] sold[7838] squadron[7756] steam[7602] sank[7434] sloop[7381] torpedoes[7240] captured[7210] admiralty[7163] ocean[7128] design[7117] ordered[7018] german[6982] harbour[6958] frigate[6929] draught[6919] deck[6894] length[6796] pounder[6741] sail[6730] arrived[6729] scrapped[6585] engines[6570] renamed[6553] gibraltar[6540] flotilla[6513] construction[6398] merchant[6345] reef[6303] yard[6277] shipbuilding[6266] armament[6248] ferry[6217] bow[6164] atlantic[6150] surface[5986] aboard[5904] transferred[5766] operation[5732] destroyers[5725] 
topic79=satellite[24459] antarctic[22725] intelsat[21189] space[14227] antarctica[13286] earthquake[11480] launch[10728] satellites[10241] storm[9803] mars[8988] utc[8736] nasa[8512] orbit[8430] earth[8387] glacier[8144] abstracting[8069] booklist[8001] crater[7995] tornado[7251] km[6914] spacecraft[6650] polar[6531] diptera[6407] rocket[6398] cyclone[6357] ocean[6335] ice[5947] damage[5802] ukr[5701] magnitude[5677] tropical[5661] geological[5595] launched[5419] scopus[5403] weather[5389] skylab[5254] mission[5242] hurricane[5034] laverne[4690] matadors[4674] expedition[4658] asc[4532] geostationary[4406] research[4405] arctic[4292] geology[4244] station[4181] krasnoyarsk[4150] atmospheric[4010] struck[3881] headland[3840] geophysical[3783] winds[3757] map[3709] iss[3512] payload[3450] greenland[3394] lunar[3314] ene[3311] spaceflight[3267] bsc[3242] climate[3184] seismic[3154] ospreys[3147] tornadoes[3127] intensity[3114] moon[3085] orbital[3076] data[3074] msc[3070] canaveral[3059] cape[3058] maps[2992] survey[2988] depth[2979] scale[2895] kg[2892] mapping[2868] band[2825] occurred[2823] farhan[2789] baia[2784] soyuz[2776] weibo[2773] impact[2763] system[2749] astronauts[2745] communications[2731] peninsula[2711] slough[2698] apollo[2697] koehler[2695] oceanography[2652] aşk[2632] scientific[2605] sandbox[2558] girija[2544] incubator[2535] энциклопедия[2533] murchison[2521] 
topic80=temple[59867] sri[29970] sinhala[23072] god[22275] gabled[20681] ancient[19881] jewish[19579] lankan[19299] hebrew[18544] text[16623] lanka[15110] rabbi[14935] buddhist[13386] inscription[13222] verse[12739] bible[12178] chapter[11930] translation[10968] goddess[10905] king[10824] translations[10112] lord[10060] greek[9957] synagogue[9736] inscriptions[9665] book[9585] isaiah[9564] deity[9469] verses[9328] temples[8916] hindu[8508] bc[8383] religious[8370] manuscript[8324] poem[8245] ad[8055] language[7919] manuscripts[7868] tomb[7829] crowdfunding[7447] yoga[7417] ceylon[7410] written[7267] mythology[7166] buddha[6920] word[6882] testament[6849] jews[6838] sarath[6786] latin[6750] sanskrit[6688] israel[6605] religion[6510] statue[6490] colombo[6385] according[6347] shrine[6277] ritual[6199] gods[6149] tradition[6072] texts[6008] vihara[5848] imma[5805] archaeological[5768] worship[5570] festival[5461] buddhism[5354] spiritual[5352] shall[5348] commentary[5334] ce[5331] maha[5283] translated[5255] extant[5234] christian[5214] son[5209] believed[5199] jerusalem[5136] sacred[5083] codex[5009] prophet[5000] means[4985] form[4958] medieval[4916] period[4838] tamil[4780] dedicated[4780] meaning[4776] origin[4754] rituals[4740] version[4697] composed[4654] jain[4625] said[4606] man[4520] deities[4471] legend[4419] divine[4383] mentioned[4356] egypt[4341] 
topic81=william[58108] sir[54551] married[53479] london[44852] son[41545] henry[37279] thomas[36992] george[36954] daughter[35766] james[34902] mary[33129] australian[31112] royal[30248] charles[30140] england[29537] australia[28183] edward[28108] wife[26559] lord[26497] robert[26450] wales[25954] elizabeth[25938] educated[24313] oxford[24305] queensland[23391] cambridge[22885] father[21559] richard[20869] sydney[20497] adelaide[20319] children[19188] margaret[18739] earl[18399] king[17868] appointed[17463] née[17395] melbourne[16513] brother[16426] mrs[16346] arthur[16328] sons[15707] walter[14458] frederick[14363] lady[14077] baron[13985] society[13963] francis[13946] victoria[13851] alexander[13740] daughters[13665] queen[13635] jane[13567] elected[13515] politician[13482] hugh[13263] smith[13119] edinburgh[13076] buried[12881] brisbane[12484] parliament[12287] samuel[12231] irish[11934] baronet[11877] whom[11820] anne[11566] hall[11395] church[11346] street[11163] david[11163] joseph[11139] sheriff[11103] aged[10936] welsh[10723] perth[10649] lived[10448] alice[10383] goble[10379] estate[10193] clerk[10133] victorian[10105] duke[10085] council[10048] ireland[10009] eldest[9837] secretary[9828] cemetery[9701] moved[9640] bibliography[9636] devon[9636] captain[9526] legislative[9512] peter[9371] councillor[9355] alfred[9347] mp[9234] ann[9184] merchant[9167] frse[9009] jones[8952] marriage[8949] 
topic82=greek[18199] greece[10392] madrasa[9325] gasser[6373] kurdistan[5984] kurdish[5903] thessaloniki[5279] hebei[4734] ef[4450] uaap[4283] bimonthly[3527] vosges[2929] nea[2649] hubei[2600] heilongjiang[2539] georgios[2348] crete[2339] pkk[2318] erzurum[2197] castletown[2149] patras[2107] burgen[2078] brito[2067] vorarlberg[2037] mosul[1873] queenstown[1863] dt[1816] kurds[1814] ioannis[1798] schenkel[1743] urmia[1656] mankato[1634] tripoli[1633] bootcamp[1623] künstler[1550] magpies[1546] bes[1531] ponomarev[1519] basra[1484] benghazi[1472] erbil[1466] nikolaos[1447] bemidji[1442] homonymous[1442] duluth[1430] keck[1413] yazidi[1408] peshmerga[1388] agus[1361] ogham[1343] corfu[1333] iza[1322] dimitrios[1318] ohno[1301] smyrna[1298] marisol[1291] nour[1290] selim[1274] leyla[1267] alona[1262] ano[1260] aegean[1253] shijiazhuang[1251] kyra[1243] jalal[1215] kermanshah[1213] wenzhou[1212] kavala[1183] cooley[1180] österreich[1176] schwab[1154] hy[1149] dara[1127] alexandrina[1106] qazi[1099] milos[1091] nabi[1080] tirol[1079] batra[1074] ioannina[1055] rolle[1044] strang[1042] magh[1042] thessaly[1031] fergus[1029] ionian[1029] scrope[1012] agios[1008] mcgregor[997] boaz[994] chios[980] peloponnese[970] achaea[965] mardin[964] tavistock[940] amrit[922] ilam[918] erdogan[903] natt[902] epirus[894] 
topic83=chinese[65519] china[63445] hong[44555] kong[42105] li[32461] malaysia[30882] taiwan[30324] wang[26335] chen[26005] zhang[22924] liu[18768] beijing[18737] taipei[16889] shanghai[16506] singapore[15356] yang[15281] lin[14596] yu[14089] wu[13459] huang[12099] taiwanese[12030] yuan[11990] wei[11893] malaysian[11797] zhou[11523] tang[10746] chan[10489] wong[10483] zhao[10269] zhu[10261] cheng[10173] lu[9929] ming[9900] penang[9763] kuala[9356] hunan[8829] chang[8738] guangzhou[8652] han[8598] yi[8494] sarawak[8485] wen[8378] lee[8194] macau[8108] asian[7974] sungai[7920] yan[7910] tan[7737] nanjing[7474] selangor[7372] ying[7257] chung[7154] qing[7053] lumpur[6970] asia[6932] zheng[6914] dynasty[6813] chu[6774] xiao[6760] brunei[6738] jiang[6719] ma[6643] guangdong[6550] jin[6488] hui[6196] mongolia[6149] fu[6119] province[6080] chun[6065] thailand[6053] wan[5962] yin[5944] liang[5929] sabah[5863] hu[5735] kampung[5635] ching[5607] republic[5529] johor[5529] zhejiang[5440] bukit[5349] kai[5313] mersin[5261] tong[5249] tai[5225] ho[5209] hua[5194] jing[5176] chiang[5144] gao[5128] chi[5122] mandarin[5090] guo[5033] kota[4906] cheung[4865] mongolian[4862] henan[4849] shi[4777] sichuan[4767] sun[4729] 
topic84=polish[93427] poland[52838] warsaw[37579] ski[23365] slalom[22392] kraków[18287] skiing[15553] vilnius[14536] alpine[13736] lithuanian[13590] cev[12782] plovdiv[11709] andrzej[11534] lithuania[11270] poznań[9815] stanisław[9656] jan[9501] downhill[8964] józef[8434] piotr[8374] voivodeship[8236] wrocław[7951] hs[7736] innsbruck[7686] michał[7616] salzburg[7599] jerzy[7184] gdańsk[7143] lublin[7106] łódź[6650] maciej[6633] skier[6586] kazimierz[6508] szczecin[6495] paweł[6494] tadeusz[6487] wisła[6464] krzysztof[6419] canoe[6298] minsk[6297] cross[6230] zindagi[6123] wojciech[5985] henryk[5978] marcin[5861] tomasz[5789] władysław[5770] polonia[5643] giant[5600] jacek[5521] łukasz[5486] aleksander[5443] linz[5437] jakub[5364] silesian[5191] winter[5008] franciszek[4958] marek[4933] austrian[4737] witold[4635] zbigniew[4633] silesia[4572] stal[4571] alps[4496] cherno[4459] karpaty[4444] stara[4439] adam[4438] polska[4424] tyrol[4369] sprint[4363] styria[4307] mountaineering[4287] karol[4248] austria[4242] lwów[4174] ghetto[4113] stefan[4084] grzegorz[4068] zagora[4058] heo[4045] cracow[4030] ewa[3981] katowice[3971] antoni[3964] ryszard[3943] agnieszka[3937] bastogne[3919] sejm[3865] poles[3857] lech[3789] jagiellonian[3782] garmisch[3772] jumping[3713] azs[3707] góra[3706] pomeranian[3698] partenkirchen[3616] ukrainian[3500] polski[3461] 
topic85=species[253309] genus[102197] mm[88244] forewings[76027] moth[70521] hindwings[67044] described[62031] wingspan[60098] dark[60072] grows[59966] grey[57691] flowers[56789] costa[56485] brown[47868] description[47822] australia[47817] whitish[47224] yellow[44206] endemic[43604] pale[42811] white[42627] marine[42097] distribution[40628] plant[39766] leaves[33514] western[31663] black[31192] dorsum[30684] base[30352] caladenia[29181] apex[28969] habitat[28895] sea[25712] length[25518] scales[25182] occurs[24995] spiders[24664] middle[24407] shell[23871] dorsal[23838] spot[23596] flowering[23298] height[22965] mollusca[22424] extinct[21458] native[21214] typically[20841] larvae[20816] tree[20709] leaf[20679] dots[20664] wide[20575] plants[20516] shaped[20310] cm[19854] taxonomy[19378] red[19219] genera[18811] erect[18331] wing[18205] basal[18054] commonly[17898] specimen[17681] colour[17561] fossil[17445] apical[17186] pink[16579] recorded[16339] gastropoda[16192] cream[16130] adults[16107] cell[15941] orange[15809] slightly[15700] green[15595] plical[15425] edge[15182] specimens[14721] reddish[14666] africa[14574] spider[14519] narrow[14509] petals[14464] regions[14411] beyond[14391] sandy[14241] flower[14159] disc[14139] southern[14036] veins[13995] males[13914] irregular[13901] light[13857] fruit[13855] contains[13801] subfamily[13588] belonging[13574] tall[13506] hairs[13499] posterior[13478] 
topic86=la[64653] spanish[52400] del[49322] italian[45906] el[45787] josé[34142] di[33055] juan[32014] spain[29987] san[27195] maría[26311] madrid[25332] división[25011] luis[24600] antonio[23787] carlos[21889] argentine[21017] argentina[20735] indonesia[19364] manuel[18463] chile[18216] buenos[18211] aires[18101] italy[16817] miguel[15970] santiago[15849] mexican[15763] pedro[15325] mexico[15097] indonesian[14908] gonzález[14717] en[14115] garcía[14097] peru[13962] los[13921] fernando[13775] alberto[13039] francisco[12990] il[12803] roberto[12585] santa[12478] rodríguez[12409] las[12264] barcelona[11820] jorge[11756] lópez[11750] fernández[11682] della[11524] nacional[11423] mario[11043] martínez[11021] rafael[10939] valencia[10862] jakarta[10799] colombia[10618] lima[10513] sánchez[10347] cruz[10137] quechua[10130] león[10080] rey[10059] franco[9779] chilean[9640] bolivia[9465] rosa[9396] pérez[9348] pablo[9323] berghahn[9126] universidad[9053] diego[9003] martín[8901] province[8881] domingo[8881] amor[8751] una[8742] sergio[8733] alejandro[8721] enrique[8679] marco[8585] basque[8549] javier[8549] cf[8516] carlo[8437] castro[8414] salvador[8359] giuseppe[8325] andrés[8102] eduardo[8100] ángel[8090] concession[8081] maria[8045] ana[8015] alfonso[7993] real[7990] toledo[7956] un[7918] córdoba[7753] paolo[7584] santo[7536] rosario[7507] 
topic87=serbian[43082] serbia[31513] albanian[25168] croatian[24905] belgrade[21130] bosnia[20467] herzegovina[18061] albania[17402] fiba[17081] croatia[16740] yugoslav[15061] futsal[14993] yugoslavia[14987] kosovo[14855] zagreb[13447] slovenian[12505] montenegro[11604] bosnian[11468] macedonia[10773] macedonian[10248] slovenia[9954] montenegrin[8559] novi[7977] nikola[7797] ita[7458] verandah[6907] ger[6524] barangay[6360] gbr[6173] serbs[6164] ljubljana[6084] vojvodina[5810] skopje[5795] swe[5750] serb[5728] luka[5671] moto[5456] jpn[5436] rijeka[5232] eurocup[5202] fra[4994] zvezda[4988] aleksandar[4686] dallara[4495] balkan[4310] marko[4310] podgorica[4289] bagan[3819] albanians[3669] ned[3502] aus[3499] warszawa[3496] republic[3474] sad[3447] ivan[3411] arg[3376] pts[3337] cze[3336] orf[3184] olimpia[3090] espanyol[3090] dušan[3009] mujeres[3003] slovene[2998] maribor[2993] croats[2991] basketball[2944] spa[2926] za[2888] vardar[2700] aut[2697] cyrillic[2682] outbuilding[2638] stefan[2607] mirna[2602] josip[2599] zoran[2552] radhika[2550] kumanovo[2537] olimpija[2508] pazar[2476] milan[2432] prvaliga[2378] pristina[2374] lazar[2346] esp[2342] republika[2333] motogp[2330] mal[2308] nll[2301] righthanded[2283] bulgaria[2276] grinstead[2218] por[2167] slobodan[2165] rsm[2120] matej[2079] mostar[2073] podium[2051] clapboards[2047] 
topic88=bulgarian[24878] serie[17571] sofia[14389] aarhus[13646] bulgaria[12981] ukrainian[12680] kyiv[8063] donetsk[7817] oblast[7628] banca[7476] kharkiv[5878] italia[5725] maccabi[5713] dynamo[5380] sakha[5275] italian[4960] ukraine[4887] brescia[4824] roma[4719] paok[4564] matchday[4522] stanbul[4501] levski[4413] coppa[4348] eintracht[4278] oleksandr[4249] varna[4194] vidyalaya[4104] спб[3993] juventus[3911] bayern[3773] milan[3744] milano[3720] superleague[3638] italy[3596] radnički[3520] di[3456] siva[3378] goalkeeper[3376] zenit[3325] venezia[3285] niš[3237] torino[3212] chernihiv[3180] dnipro[3172] concessionaire[3163] napoli[3123] coadjutor[3077] slovan[3059] dnipropetrovsk[2946] primavera[2938] fc[2891] shakhtar[2890] europa[2861] bergamo[2790] lazio[2777] tavares[2757] stadion[2740] pilipinas[2711] viljandi[2706] ateneo[2645] vicenza[2630] toto[2618] nandini[2600] pescara[2556] galatasaray[2550] tarnovo[2470] treviso[2428] sudha[2406] perugia[2398] bilal[2383] werder[2359] satyanarayana[2338] kaluga[2331] fenerbahçe[2319] verona[2308] burgas[2299] psv[2287] marini[2281] locality[2244] betis[2207] lynsey[2197] borisov[2187] loaned[2163] admira[2148] unni[2045] genoa[2004] nika[1989] hbf[1981] aleksey[1970] padova[1958] grasshopper[1952] cisterns[1947] beda[1921] fsv[1918] sturm[1901] nac[1892] viktoria[1850] cristian[1829] todor[1826] 
topic89=shrub[40829] discal[22612] orchid[15459] sepals[14596] labellum[13887] suffusion[13199] soils[12592] esperance[11960] hairy[11638] subgenus[10143] chancel[8933] epithet[8677] tinged[8459] subsp[7814] purple[7544] cloudy[6903] glabrous[6640] huskies[6014] sepal[5924] florets[5350] subspecies[5050] stamens[4884] deciduous[4798] connecticut[4663] aff[4291] herbarium[4272] geraldton[3951] sportive[3870] orchidaceae[3671] litchfield[3476] loamy[3333] creamy[3332] downwards[3100] tasmania[3036] petals[3009] glandular[2986] aisle[2713] vestry[2621] daisy[2573] petal[2545] spike[2431] acer[2247] basally[2191] borne[2156] woody[2137] conidae[2126] succulent[2126] leaflets[2100] flowered[2089] edges[2053] murugan[2049] mangrove[2040] saheb[2027] tuft[2006] raceme[2004] bracts[1999] prostrate[1965] markings[1924] bačka[1898] sacristy[1897] cones[1846] grosseto[1843] bushy[1837] woolly[1827] srikanth[1759] obovate[1746] quinnipiac[1722] konak[1696] horticulture[1680] ypg[1671] avon[1667] calyx[1652] bridgeport[1646] praveen[1596] fleshy[1556] elongate[1547] deadpool[1544] stalks[1527] petioles[1525] latif[1506] shingles[1497] trang[1471] glebe[1469] leathery[1468] shrubs[1465] bog[1435] storrs[1416] abou[1407] calcareous[1401] tapering[1399] underside[1390] fern[1358] rhizomes[1356] allium[1339] stipe[1336] solidago[1334] lobed[1317] monotypic[1312] fallujah[1304] creeper[1290] 
topic90=art[150497] museum[89238] gallery[56646] arts[54665] book[54202] painting[53902] works[53395] artist[53308] collection[46327] exhibition[45469] poetry[41290] magazine[40565] painter[39899] books[38773] novel[37909] artists[37145] paintings[34245] writer[33939] literary[31673] exhibitions[30051] literature[27964] stories[27832] award[27070] prize[26664] fiction[26242] library[26084] fine[25938] studied[25685] poems[25288] author[25180] women[25084] poet[24602] london[24271] editor[24167] worked[24116] portrait[23934] sculpture[23916] contemporary[23813] collections[23711] exhibited[23144] isbn[22700] design[21884] writing[21874] photography[21311] wrote[21009] academy[19564] fashion[19252] press[19033] paris[18530] children[18053] novels[17918] biography[17884] writers[17668] short[17372] newspaper[16980] photographer[16827] portraits[16702] publishing[16642] jpg[16028] culture[15909] publications[15543] society[15364] story[15362] journalist[15267] artistic[15011] painted[14994] awards[14916] moved[14621] cultural[14224] designer[14128] publication[13730] taught[13663] modern[13552] publisher[13397] visual[13302] sculptor[13274] institute[13104] father[12990] style[12966] studio[12544] curator[12410] edition[12387] working[12372] photographs[12337] festival[12275] paper[12265] landscape[12162] magazines[11965] edited[11927] critic[11912] created[11825] drawing[11792] novelist[11769] graphic[11756] woman[11734] translated[11734] illustrator[11673] married[11591] young[11580] creative[11571] 
topic91=ap[24037] bulldogs[12630] pac[10588] drexel[8789] intercollegiate[7965] comédie[6783] pdc[6692] sacks[6627] auburn[6497] caa[5989] sacramento[5821] byu[5297] receptions[4567] nxt[4524] collegiate[4415] bengals[4352] jaguars[4147] offense[4080] tulane[3822] tbc[3742] kickoff[3189] waived[2893] recruiting[2859] td[2846] fumbles[2757] crimson[2723] linemen[2652] scrum[2652] linebackers[2625] quarterbacks[2501] statesboro[2498] sfl[2449] estudiantes[2445] fl[2417] maly[2400] tiebreaker[2363] youngstown[2335] subba[2304] aris[2295] dartmouth[2281] tar[2270] frazione[2225] karina[2223] garonne[2187] strayhorn[2089] quintana[2066] int[2036] calle[2011] beasley[2000] tú[1974] devils[1973] mattia[1963] buckeyes[1886] suter[1871] sed[1853] sera[1849] fabienne[1832] tiebreakers[1826] selby[1808] mejor[1772] michèle[1766] reardon[1733] zimmerman[1707] sivakumar[1672] ident[1648] jusqu[1598] balestier[1557] colima[1531] defensed[1526] rereleased[1518] neves[1514] pritchard[1499] dillon[1481] meek[1474] graff[1444] heathcote[1442] poonam[1438] matías[1429] lombardi[1427] bulldog[1427] imagen[1422] brigham[1413] dl[1410] biagio[1407] sofía[1405] tomlinson[1399] valdosta[1373] folsom[1365] ubc[1355] shingo[1349] donahue[1333] dons[1315] corriere[1303] bree[1293] dodson[1292] limoges[1289] kemp[1288] aldridge[1277] ángeles[1272] stp[1268] 
topic92=india[102017] indian[74393] village[61563] fuscous[61128] pradesh[48007] singh[46439] tamil[41309] workers[36281] district[34999] population[33637] punjab[32825] rao[26056] literacy[25563] marginal[23217] sabha[23071] sri[22832] delhi[22383] uttar[22291] kerala[22142] census[21797] maharashtra[20994] nearest[20833] tehsil[20817] km[19869] telugu[19609] villages[19479] chandigarh[19217] janata[18838] kumar[18771] constituency[18709] malayalam[17984] andhra[17585] karnataka[17122] kapurthala[17007] bharatiya[16897] caste[16520] mumbai[16452] hindi[16084] raj[15043] ram[14816] bengal[14558] kannada[14537] krishna[13750] nadu[13667] rate[13484] chennai[13399] congress[12828] females[12690] prasad[12449] sharma[12367] demographics[12188] assam[12116] males[12015] male[11788] away[11787] marathi[11749] devi[11726] lok[11545] assembly[11543] airport[11437] madhya[11400] guru[11186] goa[11166] nagar[11159] punjabi[11049] bihar[11033] headquarter[10739] hyderabad[10555] bangalore[10365] average[10312] schedule[10220] legislative[10073] block[10057] female[9986] rajasthan[9956] per[9846] shankar[9651] temple[9600] streak[9574] gujarati[9488] mysore[9392] shri[9327] tribe[9268] composed[8964] shiva[8863] ratio[8626] reddy[8608] telangana[8581] bjp[8480] odisha[8356] rupees[8329] haryana[8203] ravi[8126] children[8125] labourers[7917] chandra[7879] hindu[7740] scheduled[7712] language[7470] sex[7454] 
topic93=swedish[50639] danish[37016] finnish[33286] sweden[31572] finland[25469] denmark[22607] copenhagen[19946] stockholm[19301] helsinki[12805] lagos[10783] pickard[9379] hansen[8688] nordic[8255] johan[7806] gothenburg[7090] townsville[6903] nrl[6837] greenland[6639] jensen[6353] norwegian[6167] erik[6119] anders[5858] lars[5660] norway[5538] carl[5490] andersson[5474] turku[5444] tampere[5239] henrik[5127] gustaf[5059] nielsen[5026] boca[4804] tomé[4797] magnus[4657] frederik[4541] svenska[4538] jens[4505] sven[4381] axel[4376] lahti[4268] bandy[4222] nils[4181] af[4103] hans[4019] johansson[4008] niels[3959] dansk[3941] olsen[3932] penrith[3766] warrington[3763] mineiro[3682] rochdale[3657] den[3592] pokal[3554] larsen[3527] faroe[3505] chesterfield[3489] gascoyne[3431] rasmus[3421] gunnar[3338] illawarra[3334] príncipe[3296] rotherham[3266] roosters[3217] eriksson[3172] sami[3168] accrington[3154] om[3127] bengt[3123] sofie[3054] ab[3046] colo[3008] scandinavian[2998] med[2976] leif[2975] björn[2941] en[2898] dahl[2889] svensson[2878] arne[2870] wakefield[2845] pekka[2814] primera[2788] andreas[2760] rovers[2748] ludvig[2737] helsingør[2610] ifk[2580] ratcliffe[2574] lindberg[2546] ola[2532] svalbard[2518] soares[2499] manly[2495] jul[2432] bengtsson[2401] lise[2380] göteborg[2361] åland[2352] jonas[2351] 
topic94=mf[92936] df[73533] fw[71987] aircraft[66593] airport[44520] air[42246] gk[33049] flight[29074] wing[28045] engine[25589] aviation[24055] soccerway[23381] squadron[22859] pilot[22179] cb[20868] glider[16985] pilots[16919] design[16147] cm[14986] flying[14823] airlines[14200] lb[14159] model[13945] raf[13310] rb[13139] designed[12589] weight[12431] airline[11646] airports[10365] rw[10352] force[9816] yacht[9589] lw[9506] cf[9492] fuselage[9384] missile[9305] cylinder[9148] specifications[9080] fighter[9012] span[8951] landing[8929] flights[8823] ratio[8581] range[8446] transport[8431] mandals[8343] fly[8302] crash[8224] engines[8217] undisclosed[8214] crew[7804] boeing[7737] diesel[7670] goalkeepers[7604] helicopter[7428] development[7309] aspect[7210] powered[7169] production[7105] base[7060] radar[7034] mounted[6809] training[6808] airfield[6733] operational[6578] jet[6560] free[6550] vehicle[6541] crashed[6534] flew[6529] rudder[6520] sized[6442] ourairports[6416] plane[6360] yachts[6226] passengers[6206] cells[6199] certified[6169] gear[6102] wings[6045] dm[5943] operations[5914] fuel[5799] vehicles[5713] cockpit[5700] squadrons[5646] transfer[5607] mirage[5585] fleet[5572] speed[5454] airways[5453] type[5354] zimbabwe[5347] produced[5343] accident[5306] propeller[5283] fixed[5266] rm[5203] hb[5002] tank[4995] 
topic95=fiji[6459] cotta[4755] antalya[4267] karthik[3835] sundar[3742] burundi[3621] hartlepool[3414] mcfarlane[3286] fijian[3244] maricopa[2823] agnew[2821] workington[2821] lichfield[2408] madhavan[2367] pinkney[2323] prebendary[2321] minogue[2212] broughton[2189] kylie[2188] vanuatu[2109] burundian[2094] carafa[2054] mara[1955] flo[1931] fragmenta[1895] hedley[1888] sidhu[1826] nomen[1799] ochraceous[1780] suva[1738] lidia[1728] plebs[1711] nicholls[1679] clemons[1648] burnett[1636] lucian[1628] sextus[1627] goff[1605] kupfer[1589] feroz[1538] sanju[1526] plenipotentiary[1520] kristine[1502] bde[1478] rajat[1476] spiro[1441] yeomanry[1425] jayaprakash[1397] wea[1394] eliminator[1385] sudarshan[1364] kincaid[1355] craiova[1347] wheaton[1344] waco[1334] plebeian[1334] mohabbat[1327] danforth[1321] brandi[1310] whisper[1303] elke[1289] leary[1281] decorah[1279] ako[1251] ramazan[1238] theodosia[1232] gonzales[1228] dayna[1207] philippi[1189] bahu[1181] praenomen[1151] prod[1103] paca[1061] septimus[1061] bujumbura[1058] arusha[1050] conservator[1032] crassus[1013] horváth[993] liviu[982] englisch[975] mccullough[973] frankel[964] nadi[954] solis[942] pompeius[937] tetyana[905] chisnall[893] tpb[892] cyndi[881] mikel[862] menderes[859] bodrum[853] rusk[850] harriett[849] alli[848] yoshioka[839] aurelia[828] spartacus[828] dumitrescu[822] 
topic96=greyish[20228] energy[15027] system[14604] water[13820] mm[13174] using[12704] materials[12224] sprinkled[11925] optical[11837] surface[11739] systems[11432] design[10580] model[10321] temperature[10260] process[10200] light[10087] magnetic[9542] material[9362] type[9206] low[9028] different[8533] heat[8491] method[8277] lens[8060] pressure[7897] metal[7786] carbon[7676] laser[7592] flow[7554] malware[7546] chamfered[7522] speed[7380] air[7314] test[7294] termite[7277] applications[7277] gas[7243] radiation[7238] power[7218] instrument[7200] control[7192] particles[7173] body[7109] electric[7108] production[7103] physics[7068] device[7040] developed[6959] models[6955] range[6914] technology[6690] technique[6667] vehicle[6661] signal[6656] physical[6634] designed[6600] field[6528] liquid[6495] streak[6430] machine[6402] size[6386] patent[6383] frequency[6380] effect[6366] steel[6347] devices[6297] nuclear[6236] layer[6192] standard[6158] mass[6099] electrical[6073] electron[6057] components[6043] uses[6031] plastic[6000] similar[5947] mechanical[5926] metadatabase[5903] equipment[5876] soil[5855] tube[5821] measurement[5789] techniques[5686] dive[5684] particle[5656] solar[5652] instruments[5651] polymer[5643] weight[5598] methods[5565] laboratory[5563] manufactured[5545] quantum[5516] axle[5499] plasma[5487] data[5485] beam[5470] thermal[5462] processes[5442] corrugated[5382] 
topic97=la[55765] le[54186] des[44276] du[41316] et[39682] paris[39461] les[38004] french[36498] france[23880] jean[20375] sur[17159] éditions[16258] théâtre[14794] prix[14576] en[14337] pierre[14210] saint[13256] école[11868] un[11634] ligue[11538] quebec[10960] au[10830] histoire[10790] une[10766] ou[10702] dans[9758] française[9639] michel[8844] jacques[8424] pour[8228] académie[8204] superliga[7843] françois[7586] fr[7528] henri[7457] georges[7247] algerian[7054] société[6932] français[6634] dictionnaire[6497] andré[6307] lycée[6245] claude[6059] monde[6020] est[5963] montreal[5957] nationale[5902] louis[5788] marie[5713] rue[5711] musée[5529] aux[5266] avec[5220] deux[4927] par[4889] petit[4426] siècle[4295] musique[4217] rené[4216] alain[4168] études[4164] sous[4152] pas[4150] université[4128] amour[4113] qui[4001] beaux[3974] eugène[3954] homme[3906] pincode[3810] canton[3727] montréal[3668] galerie[3667] monaco[3651] grand[3595] nouvelle[3582] algeria[3537] seine[3509] je[3502] temps[3483] bibliothèque[3483] strasbourg[3482] hôtel[3435] superdraft[3414] nuit[3398] grasset[3392] femme[3366] mer[3337] lausanne[3320] ville[3202] émile[3198] seuil[3133] hélène[3124] auguste[3083] nantes[3062] trois[3040] nouvelles[3018] paul[2992] historique[2977] palais[2966] 
topic98=taluka[17523] vijay[13273] raja[11467] babu[11436] panchayat[11055] prakash[8691] soundtrack[8529] taluk[7855] arjun[7137] joshi[6951] gujarat[6631] vijaya[6384] sahitya[6366] nrhp[6275] ramesh[6044] mukherjee[5965] nair[5932] mangalore[5329] chatterjee[5007] laterite[4830] akademi[4783] leela[4737] rajesh[4692] ganesh[4682] vikram[4629] playback[4568] sai[4481] leung[4418] rahul[4374] manoj[4303] jai[4219] deva[4187] bengaluru[4065] lam[4046] narayan[4021] pvt[3910] vinod[3873] prabhu[3839] ghosh[3805] banerjee[3794] ranga[3770] pooja[3741] mansard[3733] mahesh[3600] sanjay[3501] varma[3498] desai[3476] jaya[3396] flugelhorn[3389] wai[3364] samajwadi[3349] gopal[3294] mahendra[3254] rajya[3243] vivek[3227] anil[3210] sarkar[3185] directorial[3142] jeevan[3063] filmfare[3050] madhu[3020] abhishek[2923] rajiv[2909] bombay[2799] kala[2789] shyam[2782] lai[2719] ka[2717] aditya[2670] thomasville[2667] yuen[2667] lakh[2650] ovc[2645] rishi[2610] roy[2554] siu[2551] veena[2503] erigeron[2403] prasanna[2377] paschim[2365] cbse[2338] avengers[2314] ghats[2299] grossed[2244] rekha[2235] doordarshan[2215] ramanathan[2215] gandhi[2214] srinivasa[2213] jag[2190] kumar[2175] allahabad[2159] nikhil[2159] kolhapur[2152] dialogues[2107] kung[2068] srivastava[2065] vasantha[2042] indian[2035] shekhar[2035] 
topic99=série[23900] primera[15962] benfica[12387] spartans[9105] académica[8667] basket[7476] gd[7380] estádio[7267] sf[6919] xxx[6868] upi[6102] taça[6095] hoosiers[5591] rsssf[5387] desportivo[5068] wnba[4926] murali[4572] greek[4406] boavista[4246] bajnokság[4126] braga[3931] qatari[3817] chainsmokers[3751] trofeo[3707] rower[3459] libertadores[3216] fours[3168] trojans[3057] mx[2955] unc[2954] apertura[2901] athens[2863] supercup[2847] sg[2711] inna[2573] fiu[2538] ahly[2480] dozois[2348] guadalajara[2327] clausura[2323] zeus[2315] divisão[2280] pf[2280] xx[2239] cruzeiro[2214] bernardi[2209] ghazal[2187] argos[2151] uber[2074] pella[2063] națională[1994] unam[1929] ppg[1903] spartan[1894] zim[1853] prep[1830] mythology[1807] román[1803] pfc[1800] ue[1785] coxed[1769] aguirre[1763] araújo[1711] coxless[1702] pg[1691] substitutions[1688] trojan[1685] copa[1681] mangala[1653] scorer[1646] apollodorus[1634] ekstraklasa[1619] returner[1583] nymph[1566] mexicali[1558] vieira[1556] ap[1549] jk[1539] azul[1531] jahrhundert[1528] attica[1505] ammar[1500] sprinting[1491] goalkeeper[1485] ribeiro[1479] toluca[1472] dungeon[1471] celina[1465] starter[1463] diogo[1463] gujrat[1451] asociación[1446] ga[1443] lucero[1437] salah[1436] ucsb[1429] shihab[1425] apg[1410] maia[1405] thiem[1404] 

I find these Wikipedia topics to be pretty good and clear. More data seems to give better topics. I am still running the cohesion metrics for the Wikipedia models. Even though u_mass is supposed to be the faster metric, it took me 4 days to run it for just the 25-topic model on Wikipedia. So it would take me weeks to run it for all the topic counts from 25 to 200. If I ever finish it, maybe I will post an update.

I am sure there would be lots of interesting things to explore in Wikipedia by increasing the topic counts, looking at the relations between topics, how they evolve as the numbers increase, and so on. Unfortunately, I am not paid for this and have too many other things to do.

So if I wanted to apply topic models, what would I do right now (NLP is getting lots of attention, so who knows in a few years..)? Try a number of different topic distributions and parameters if possible, look at the models manually both in text and visually, and pick a nice configuration. It really depends on whether the topics are meant for human consumption as such or just as some form of automated input.

If I needed to model large numbers of separate sets that are evolving over time, I might just use the cohesion metrics along with some heuristics (e.g., number of docs vs number of topics) to make automated choices, run the things as micro-services at intervals and use the results automatically. Tune as needed over time.

Fewer and more static sets might benefit from more tailored approaches.

Too long post, too much to do.