Sentiment Analysis – a trawl through the techniques

Some colleagues and I first looked at sentiment analysis of tweets in 2011. Our task was to assess the effectiveness of a marketing campaign. The results showed that the campaign generated significant net-positive sentiment during its life, and we were able to monitor and display the sentiment in near real time. The timeline graph in Fig. 1 records both positive and negative sentiment measurements. The sharp drop-off in activity after the end of the campaign is evident.

 Fig. 1 – Sentiment Analysis of Tweets Mentioning the Arthur’s Day Campaign (#ArthursDay)

Here I am going to look at a number of approaches to evaluating the sentiment of text data. I will start with an online, purely cut-and-paste approach, which involves no programming. Later, I will add commentary and examples of more formal approaches, using more sophisticated tools such as sentiment lexicons and text-processing languages (Python and R).

  1. A quick, online, cut-and-paste hack – no programming involved.
  2. An Excel hack [pending]
  3. Sentiment analysis using Python, the AFINN lexicon and the NLTK toolkit [pending]

In all cases, I am going to use Twitter as the source of the text data. I am going to track the sentiment around both the US and England soccer teams as the FIFA World Cup 2014 approaches. The hashtag for the US national team is #USMNT and it has a rallying hashtag also, #1N1T (One Nation, One Team), which may be helpful in separating supporters’ tweets from opponents’ tweets, if desired. U.S. Soccer’s Twitter account is @USsoccer. The Twitter account of England’s Football Association is @England. The team hashtag is #3Lions, representing the 3 lions crest on the team uniform.

A prerequisite to using the Twitter search API is to click through to the developer section of the Twitter website and register for the necessary authentication tokens. These tokens are required to access Twitter data via the API.

1. An online cut-and-paste hack

  • Enter your search terms in the Twitter search console at: https://dev.twitter.com/console
  • I searched for the 20 most recent tweets in English that included #USMNT and #1N1T but did not include ‘@USsoccer’.
  • My complete search URL, essentially generated by the console, was: https://api.twitter.com/1.1/search/tweets.json?q=%23USMNT%26%231N1T-%40USsoccer&lang=en&result_type=recent&count=20 The Twitter search API expects URL encoding (aka percent-encoding) in the search term. For example, the hash sign (#) is represented by %23 in the search expression. I referred to the tables at http://en.wikipedia.org/wiki/Percent-encoding when entering some of the search information in this way.
  • When the search has completed, you will see the output in the console window, structured as JSON data. Click on the SNAPSHOT button to access another button labelled ‘View Raw Data’. Once this is clicked, the raw data can be viewed and then saved as a txt file. Delete all header and trailing text surrounding the JSON data in the txt file and resave the file.
  • I then used an online resource, http://konklone.io/json/, to convert the JSON data in the txt file to CSV format. I simply copied and pasted the txt file data into the data entry box and saved the generated output as a CSV file.
  • Finally, I used a well-regarded online sentiment evaluation tool, from NLTK.org, to assess the sentiment of each of the 20 tweets. As before, I copied and pasted each tweet into the data entry box in turn and pressed the ‘Analyze’ button to get an evaluation of the sentiment of the tweet as ‘Positive’, ‘Neutral’ or ‘Negative’.
  • In this instance, possibly because we are still 4 weeks from the World Cup, sentiment was pretty low-key: 14 tweets were graded ‘Neutral’, 4 were ‘Positive’ and 2 were ‘Negative’.
  • That’s it. No programming involved, and after a number of copy-and-paste steps and a few saved files, we have a set of tweets graded for sentiment.
  • What about the results? Based on a read-through of the 20 #USMNT&#1N1T tweets, my interpretation is that 19 of the 20 tweets were classified properly. I interpreted one of the negatively classified tweets as positive. The misclassification was likely due to misinterpretation of a slang term.
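
For comparison, the first few steps above (building the percent-encoded search URL and flattening the saved JSON into CSV) could be sketched in Python. This is an illustrative sketch, not the process used above. The function names are hypothetical, and the field names ("statuses", "created_at", "text", "user", "screen_name") are assumptions about the layout of the saved search response. Authentication is not shown.

```python
import csv
import io
import json
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, lang="en", result_type="recent", count=20):
    """Percent-encode the query and append the other search parameters."""
    params = {"q": query, "lang": lang, "result_type": result_type, "count": count}
    # urlencode percent-encodes reserved characters such as # (%23) and @ (%40)
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def tweets_json_to_csv(raw_json):
    """Flatten a saved search response into CSV text, one row per tweet.

    Assumption: tweets sit in a top-level "statuses" list and each tweet has
    "created_at", "text" and a nested "user" object with "screen_name".
    """
    statuses = json.loads(raw_json).get("statuses", [])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["created_at", "screen_name", "text"])
    for tweet in statuses:
        writer.writerow([
            tweet.get("created_at", ""),
            tweet.get("user", {}).get("screen_name", ""),
            tweet.get("text", ""),
        ])
    return buffer.getvalue()


search_url = build_search_url("#USMNT&#1N1T-@USsoccer")
print(search_url)
```

Printing search_url reproduces the search URL quoted in the list above, which is a handy check that the encoding is doing what the console did.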

What’s useful about this approach?

It’s fine for doing quick sentiment analysis of a low volume of tweets and for getting a feel for some of the steps involved in sentiment analysis. The process is reasonably repeatable and sentiment can be evaluated without any programming. The classifier was assumed to be reliable based on the established reputation of the Natural Language Toolkit (NLTK.org) team, so no work was involved in creating a classifier.

What can be improved?

A lot. There are established good practices in text analytics that would be worth applying here, such as converting all text to lower case and word-stemming.
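
To make the lower-casing and stemming idea concrete, here is a minimal Python sketch. The suffix-stripping function is a deliberately crude stand-in for a real stemmer (such as the Porter stemmer) and the three-suffix list is purely illustrative. All names here are hypothetical and nothing like this was applied to the 20 tweets above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[#@]?\w+")
SUFFIXES = ("ing", "ed", "s")  # tiny illustrative list, not a real stemming rule set


def crude_stem(token):
    """Strip one common suffix; a stand-in for a proper stemming algorithm."""
    if token.startswith(("#", "@")):
        return token  # leave hashtags and mentions untouched
    for suffix in SUFFIXES:
        # keep at least three characters so short words are not mangled
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalise(text):
    """Lower-case the text, drop URLs, split into tokens and stem each one."""
    text = URL_PATTERN.sub("", text.lower())
    return [crude_stem(token) for token in TOKEN_PATTERN.findall(text)]


print(normalise("Loving the #USMNT goals! http://t.co/abc"))
```

A crude stemmer like this will over- and under-strip on real tweets, so in practice a tested stemmer from a text-processing library would be the better choice.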

More thought needs to be given to the ‘business’ objective at hand. For example, how should we handle retweets? Should we take steps to capture tweets from ‘independent’ tweeters only, e.g. excluding @USsoccer? What volume of tweets represents a useful sample for the purposes of measuring sentiment? Is the geocode data of the tweets of value?
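
One possible answer to the retweet and ‘independent tweeter’ questions is to filter the tweets before scoring them. The sketch below is only one option. It assumes each tweet is a dictionary with a "text" field and a nested "user" object carrying "screen_name", and that retweets either carry a "retweeted_status" field or start with "RT @". The function names and the sample tweets are invented for illustration.

```python
def is_retweet(tweet):
    """Treat a tweet as a retweet if it is flagged as one or uses the RT prefix."""
    return "retweeted_status" in tweet or tweet.get("text", "").startswith("RT @")


def independent_originals(tweets, excluded_accounts=("USsoccer",)):
    """Keep original tweets whose author is not on the excluded list."""
    excluded = {name.lower() for name in excluded_accounts}
    kept = []
    for tweet in tweets:
        author = tweet.get("user", {}).get("screen_name", "").lower()
        if is_retweet(tweet) or author in excluded:
            continue
        kept.append(tweet)
    return kept


# Invented sample tweets, for illustration only
sample_tweets = [
    {"text": "Can't wait for Brazil! #USMNT #1N1T", "user": {"screen_name": "fan_one"}},
    {"text": "RT @fan_one: Can't wait for Brazil! #USMNT #1N1T", "user": {"screen_name": "fan_two"}},
    {"text": "Squad news #USMNT", "user": {"screen_name": "ussoccer"}},
    {"text": "Great goal #USMNT", "user": {"screen_name": "fan_three"}, "retweeted_status": {"id_str": "1"}},
]

for tweet in independent_originals(sample_tweets):
    print(tweet["user"]["screen_name"], "-", tweet["text"])
```

Whether retweets should be dropped or counted as endorsements of the original sentiment depends on the objective, so the filter is best kept as a separate, optional step.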

We assumed that the NLTK demo site is a valid tool for measuring the sentiment of this body of text. Is there a better way? Can we test and validate the measurements?
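
A simple starting point for validation is to compare the tool's labels against hand labels and count agreements. The sketch below rebuilds the counts reported above (14 ‘Neutral’, 4 ‘Positive’ and 2 ‘Negative’ from the tool, with one of the ‘Negative’ tweets read by hand as positive). The order of the labels in the lists is an assumption, since only the totals are known, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of items where the tool's label matches the hand label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def confusion_counts(predicted, actual):
    """Count (hand label, tool label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(actual, predicted))


# Tool output: 14 Neutral, 4 Positive, 2 Negative (order assumed)
tool_labels = ["Neutral"] * 14 + ["Positive"] * 4 + ["Negative"] * 2
# Hand labels: identical except that one 'Negative' tweet was read as positive
hand_labels = ["Neutral"] * 14 + ["Positive"] * 4 + ["Negative", "Positive"]

print(accuracy(tool_labels, hand_labels))  # 0.95, i.e. 19 of 20
for (hand, tool), count in sorted(confusion_counts(tool_labels, hand_labels).items()):
    print(f"hand={hand:<8} tool={tool:<8} count={count}")
```

With only 20 tweets and a single hand labeller, a figure like this is indicative at best. A larger sample and a second labeller would give a more trustworthy measure.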

More to follow.
