IntentHQ Engineering Blog

Automatic topic-modelling with Latent Dirichlet Allocation

LDA automatically assigns topics to text documents. How is it done? What are its limitations? What is the best open-source library to use in your code?

At IntentHQ we deal with a huge amount of data on a daily basis. Much of this is textual as we try to get an intimate understanding of articles, web pages, blog posts and opinions. Fortunately, text mining, a branch of data science, gives us a hand.

In this post we’re going to describe how topics can be automatically assigned to text documents; this process is named, unsurprisingly, topic-modelling. It works like fuzzy (or soft) clustering: rather than being strictly assigned to its single dominant topic, each document receives a weighted mixture of topics.

Let’s start with an example: the ideal topic-modelling outcome for Shakespeare’s Romeo & Juliet would be a topic composition of roughly 50% tragedy and 50% romance. Topics like social networks, football and Indian cuisine obviously don’t appear in the play, so their weights would all be 0%.

One of the most advanced algorithms for doing topic-modelling is Latent Dirichlet Allocation (or LDA), a probabilistic model developed by Blei, Ng and Jordan in 2003. LDA is an iterative algorithm which requires only three parameters to run: the number of topics and the two Dirichlet priors (usually called alpha and beta). When they’re chosen properly, its accuracy is pretty high. Unfortunately, one of the required parameters is the number of topics: exactly as happens with K-means, choosing it requires deep a-priori knowledge of the dataset.
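To make these parameters concrete, here is a minimal sketch using Gensim (one of the libraries discussed later in this post). The toy documents and parameter values are purely illustrative, not a recommendation:

```python
from gensim import corpora, models

# Toy corpus: each document is already tokenised (purely illustrative data).
documents = [
    ["romance", "love", "families", "verona"],
    ["tragedy", "death", "poison", "verona"],
    ["football", "match", "goal", "league"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# The three knobs mentioned above: the number of topics and the two Dirichlet priors.
lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=2,   # must be fixed a priori, exactly like K in K-means
    alpha="auto",   # document-topic prior
    eta="auto",     # topic-word prior (beta in the literature)
    passes=10,
)
print(lda.print_topics())
```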

A good measure to evaluate the performance of LDA is perplexity. This measure comes from information theory and quantifies how well a probability distribution predicts an observed sample. To evaluate an LDA model, a document is split in two halves. The first half is fed into LDA to compute its topic composition; from that composition, the model’s word distribution is estimated. This distribution is then compared with the word distribution of the second half of the document, and a measure of distance between the two is extracted. In practice, perplexity is often used this way to select the best number of topics for the LDA model.
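In practice, libraries usually report perplexity on a set of held-out documents rather than through the exact document-splitting procedure described above. As a hedged sketch continuing from the Gensim example (it reuses the `lda` model and `dictionary`, and the held-out documents are made up), Gensim exposes a per-word likelihood bound that can be turned into a perplexity value:

```python
# Continues from the previous sketch: reuses `lda` and `dictionary`.
heldout_docs = [
    ["love", "death", "verona"],
    ["goal", "league", "match"],
]
heldout_corpus = [dictionary.doc2bow(doc) for doc in heldout_docs]

# log_perplexity returns a per-word likelihood bound (log base 2);
# the corresponding perplexity is 2^(-bound). Lower is better.
bound = lda.log_perplexity(heldout_corpus)
print(2 ** (-bound))
```

Training several models with different numbers of topics and keeping the one with the lowest held-out perplexity is the usual way to apply this in practice.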

Under the hood, LDA models both the topics-per-document and the words-per-topic distributions as Dirichlet distributions (that’s why Dirichlet appears in its name). Inference is usually performed with a Markov Chain Monte Carlo (MCMC) method known as Gibbs sampling, which draws samples to approximate the stationary distribution of the underlying Markov chain; this makes the whole process iterative, pretty fast, convergent and accurate. The math behind LDA is fairly complex, but a simple example of how LDA works is contained in this video presentation by David Mimno, a world-class researcher in topic-modelling:

Topic Modeling Workshop: Mimno from MITH in MD.
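To make the Gibbs-sampling idea a bit more concrete before the TL;DR, here is a toy collapsed Gibbs sampler for LDA written from scratch with NumPy. It is only a sketch for intuition: the function name, the data layout (documents as lists of word ids) and the fixed number of iterations are our own simplifications, not part of any library.

```python
import numpy as np

def toy_collapsed_gibbs(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns point estimates of theta (doc-topic) and phi (topic-word)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, vocab_size))  # word counts per topic
    nk = np.zeros(n_topics)                 # total words per topic
    z = []                                  # current topic of every word

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts...
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # ...and resample the topic from its full conditional:
                # p(z=k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    # Posterior point estimates of the two distributions of interest.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```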

Start TL;DR

For the bravest, this is the graphical representation of LDA in plate notation: grey circles represent the observed variables, while latent (also called hidden) ones are white. Boxes represent repeated collections of variables.

Graphical representation of LDA

[Image taken from Wikipedia, CC-3.0 licensed]

Parameters of the model:

α is the Dirichlet prior on the per-document topic distribution θ, and β is the Dirichlet prior on the per-topic word distribution φ; θ and φ are therefore Dirichlet-distributed, while the topic assignments z and the observed words w are multinomial draws.
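In symbols, this corresponds to the standard generative process of LDA:

```latex
\begin{aligned}
\theta_d  &\sim \mathrm{Dirichlet}(\alpha)              && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)               && \text{word distribution of topic } k \\
z_{d,n}   &\sim \mathrm{Multinomial}(\theta_d)          && \text{topic of the } n\text{-th word of } d \\
w_{d,n}   &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```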

End TL;DR

On the Internet there are plenty of libraries able to perform topic-modelling through LDA. Note that the acronym LDA also refers to another technique with the same initials (Linear Discriminant Analysis), but the two algorithms are completely unrelated. Below is our point of view on some open-source Latent Dirichlet Allocation implementations, pointing out strengths and weaknesses as well as ease of installation and use, and scalability; a short Gensim usage sketch follows the list.

MALLET and TMT

Y!LDA

GraphLab

Gensim
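As a taste of how these libraries are used in code, here is how a trained Gensim model can be queried for the topic composition of an unseen document; this sketch reuses the `lda` model and `dictionary` from the earlier example, and the other libraries offer similar functionality through different interfaces:

```python
# Reuses `lda` and `dictionary` from the earlier Gensim sketch.
new_doc = ["star", "crossed", "lovers", "die", "in", "verona"]
bow = dictionary.doc2bow(new_doc)   # words unknown to the dictionary are ignored

# Topic composition of the unseen document, e.g. [(0, 0.93), (1, 0.07)].
print(lda.get_document_topics(bow))

# Top words of each topic, with their weights.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```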

LDA limitations: what’s next?

Although LDA is a great algorithm for topic-modelling, it still has some limitations, mainly because it has become popular and widely available only recently. One major limitation comes from its underlying unigram text model: LDA doesn’t consider the mutual position of the words in the document. Documents like “Man, I love this can” and “I can love this man” are probably modelled the same way. Admittedly, for longer documents, mismatching topics in this way is harder. To overcome this limitation, at the cost of almost squaring the complexity, you can use 2-grams (or N-grams) along with 1-grams, as in the sketch below.
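As a hedged sketch of the N-gram workaround, Gensim ships a Phrases model that promotes frequent word pairs to single bigram tokens before LDA training; the thresholds below are toy values chosen only so this tiny example produces a bigram:

```python
from gensim.models.phrases import Phrases, Phraser

# Tokenised documents, as in the earlier sketches (toy data).
tokenised_docs = [
    ["man", "i", "love", "this", "can"],
    ["i", "can", "love", "this", "man"],
]

# Toy thresholds so this tiny corpus produces a bigram; use larger
# min_count/threshold values on a real corpus.
bigram = Phraser(Phrases(tokenised_docs, min_count=1, threshold=1))
docs_with_bigrams = [bigram[doc] for doc in tokenised_docs]
print(docs_with_bigrams)   # frequent pairs such as "love_this" become single tokens
```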

Another weakness of LDA lies in the composition of the topics: they overlap. In fact, you can find the same word in multiple topics (the word “can” in the example above is an obvious case). The generated topics are therefore not independent and orthogonal, unlike, for example, the components of a PCA decomposition. This implies that you must pay lots of attention while dealing with them (e.g. don’t use cosine similarity); a probability-aware distance, as sketched below, is a safer choice.
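For instance, the Hellinger distance (available in Gensim’s matutils) compares two topic distributions as probability vectors; the two document vectors below are made-up values:

```python
from gensim.matutils import hellinger

# Topic distributions of two documents over 2 topics (made-up values).
doc_a = [(0, 0.9), (1, 0.1)]
doc_b = [(0, 0.2), (1, 0.8)]

# Hellinger distance: 0 for identical distributions, close to 1 for very
# different ones. A better fit for probability vectors than cosine similarity.
print(hellinger(doc_a, doc_b))
```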

For a more structured approach - especially if the topic composition is very misleading - you might consider the hierarchical variation of LDA, named H-LDA (or simply Hierarchical LDA). In H-LDA, topics are joined together in a hierarchy by using a Nested Chinese Restaurant Process (NCRP). This model is more complex than LDA, and a full description is beyond the scope of this blog entry, but if you’d like an idea of the possible output, here it is. Don’t forget that we’re still in the probabilistic world: each node of the H-LDA tree is a topic distribution.

Example of a topic hierarchy produced by H-LDA

[Image taken from the original paper on H-LDA: Blei, Jordan, Griffiths, Tenenbaum, © MIT Press, 2004]

If you’re very curious about LDA, I’ve prepared a quick example here.

Enjoy!

Join the discussion on Hacker News.