Topic Modeling: Deriving Insight From Large Volumes of Unstructured Data
The rise of social networks has led to an increase in unstructured data available for analysis, with a large proportion of this data being in text format such as tweets, blog posts, and Facebook posts. This data has a wide range of applications; for example, it is often used in marketing to understand people’s opinions on a new product or campaign, or to learn more about the target market for a particular brand.
When dealing with large volumes of unstructured text data, it can be difficult to extract useful information efficiently and effectively. There is almost always too much data to read through manually, so a method is needed that will extract the relevant information from the data and summarise it in a useful way.
Topic modelling is one method of doing this. Topic modelling is a technique that can automatically identify topics (groups of commonly co-occurring words) within a set of documents (e.g. tweets, blog posts, emails).
An effective topic model should output a number of very distinct groups of related words, each easily identifiable as belonging to the same subject. For example, if the topic model were trained on thousands of tweets related to diet, one group of words might include “gluten”, “glutenfree”, “coeliac”, and “intolerance”, which would correspond to a “gluten free diet” topic. Another group of words might be “vegan”, “dairyfree”, and “meatfree”, which would represent a “vegan diet” topic.
Latent Dirichlet Allocation (LDA) is one of the most popular approaches to topic modelling, and is the approach discussed here.
The first step is to collect and prepare the documents to be analysed. The text within the documents should be cleaned so that the words that define each topic make sense, and would be relevant only to that topic. Usernames, URLs, symbols and common words (e.g. and, or, I, a, etc.) should all be removed before running the model.
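The cleaning step can be sketched in Python. The regular expressions and the tiny stopword list below are illustrative only; a real pipeline would use a much fuller stopword list (e.g. from NLTK or spaCy).

```python
import re

# A minimal set of stopwords for illustration; production pipelines
# typically use a fuller list from a library such as NLTK or spaCy.
STOPWORDS = {"and", "or", "i", "a", "the", "to", "of", "is", "in"}

def clean_tweet(text):
    """Strip usernames, URLs, and symbols, then drop common stopwords."""
    text = re.sub(r"@\w+", " ", text)          # remove usernames
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove symbols and digits
    return [w for w in text.lower().split() if w not in STOPWORDS]

clean_tweet("@dieter I love a #glutenfree diet! https://example.com")
# → ['love', 'glutenfree', 'diet']
```

Each cleaned document is reduced to a list of meaningful tokens, which is the form the topic model works with.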
These cleaned documents are then passed to the topic model. The model iterates through all of the words in each document and identifies words that occur together frequently. Every document is iterated over until the model becomes internally consistent (i.e. it does not change how words are allocated to topics during subsequent iterations).
The model outputs lists of frequently co-occurring words in the documents, along with the probability of each word belonging to that list. Each of these lists represents a topic. These topics can be visualised in a way that shows their relative sizes and how distinct they are from one another.
This can be helpful in determining the overlap between topics, which may indicate whether any of them should be merged into a single topic, and which topics are the most common within the documents. However, interpreting these lists of words as meaningful topics is largely a manual process, and it can be difficult if the words in a list are too common or do not seem strongly related to one another.