Measuring Social Sentiment: Assessing and Scoring Opinion in Social Media
By Jackie Kmetz, Director, Data Strategy & Product Training
Opinion Mining
Most information processing techniques, including search engines, assume that data is factual. This assumption has become less accurate, however, due to the explosive growth of social media. There is now a large body of data that includes opinion, yet the standard tools have no way to assess that opinion in a meaningful way. As a result, recent and emerging research focuses on techniques to evaluate opinion in user-generated content. Opinion mining is a term frequently used to describe these efforts to find valuable information in the vast quantity of user-generated content. The mining metaphor is apt: opinion mining techniques need to be able to locate both trends in social sentiment and specific examples—just as miners need to identify both a seam and specific pieces of ore to extract.
What is Sentiment Analysis?
Opinion mining is a broad topic; this paper focuses on sentiment analysis, which is the aspect of opinion mining that has received the most attention. The goal of sentiment analysis is to determine the attitude, opinion, emotional state, or intended emotional communication of a speaker or writer. Sentiment analysis has broad application and encompasses work in classifying subjectivity, polarity, tonality, emotion mining, review mining, appraisal extraction, affective computing, etc. We focus here on two aspects of sentiment analysis:
- Quantifying sentiment for an entity or population over time.
- Retrieving examples or summaries of sentiment along those same dimensions.
For a broader introduction and survey of sentiment analysis, see [Liu09] and [PangLee08].
Evaluating Sentiment Analysis
Measuring the performance of a sentiment analysis solution is more complicated than one might expect, for several reasons:
- Sentiment is subjective. In a substantial minority of cases, human evaluators are uncertain or disagree about the sentiment contained in text.
- There are degrees of sentiment. One could argue that almost all text contains some amount of sentiment; it’s not simply the presence or absence of sentiment, but a matter of gradation. Even if several people agree on a qualitative label (say Very Positive), there is no quantitative measure of what that means. There is also no standard for qualitative labels.
- Sentiment evaluation relies heavily on context. Early work in review mining (movie reviews, book reviews) assumes subjectivity and measures only polarity (Positive vs. Negative). Other work in information retrieval includes relevance (is this text relevant to my query?) and subjectivity (does it contain an opinion?), but not polarity. Polarity also changes with context (“I love Coke but hate Pepsi” is Positive for Coke but Negative for Pepsi).
- Sentiment can also be measured at different granularities. For example, given the context of a query or subject, sentiment can be measured at the document, paragraph, sentence, phrase, or pattern level, or some combination of those.
In any analytics evaluation it’s crucial to measure what you care about from a business objective. Along those lines, we’ll discuss three different families of evaluations and see how they align to the ways people use sentiment in social media analytics.
Document-Level Evaluation
One of the most common ways to measure sentiment performance in social media analytics is accuracy at the document level. Some of this comes from early work on sentiment analysis in the social sciences (looking at kappa and other agreement statistics among annotators). It is also an obvious metric for those analysts building text classification models (minimizing document misclassification error is the same as document-level accuracy).
For people who are auditing a social media monitoring solution, it may also be the first idea: read a representative set of documents one by one and mark your agreement. This approach generally requires agreeing on a consistent set of labels or classes and assigning documents to those classes. It is popular to treat this as a three-class problem (for example, Positive, Negative, or Neutral), but there are many exceptions. There are many statistical measures of correctness given assignment of labels in a multi-class problem (error, accuracy, kappa, F-measure, etc.).
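As a minimal sketch of two of these measures, the following Python computes raw agreement (accuracy) and Cohen's kappa between two label sequences. The function names and the ten-document sample are invented for illustration; they are not data or tooling from this paper.

```python
from collections import Counter


def accuracy(labels_a, labels_b):
    """Fraction of documents on which the two label sequences agree."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    which happens only if both sequences use a single identical label.
    """
    n = len(labels_a)
    observed = accuracy(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: both raters pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten documents (illustration only).
human = ["Pos", "Neg", "Neu", "Neu", "Neu", "Neu", "Pos", "Neu", "Neg", "Neu"]
model = ["Pos", "Neu", "Neu", "Neu", "Neu", "Neu", "Neu", "Neu", "Neg", "Neu"]

print(f"accuracy = {accuracy(human, model):.2f}")
print(f"kappa    = {cohens_kappa(human, model):.2f}")
```

In this made-up sample, kappa comes out lower than raw accuracy because much of the agreement is on the dominant Neutral class, which two raters would often share by chance.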
To the layman, this family of metrics may seem sufficient and broadly applicable. However, if you think about the way sentiment is used in social media analytics, document-level accuracy itself rarely matters in terms of the business problem. Instead, the primary use of sentiment is quantifying it on a population (tracking sentiment over time, looking for spikes, comparing products, comparing to competitors, measuring impact of advertising campaigns, etc.).
But does document-level accuracy capture that? It is true that if you could classify documents perfectly, or had a completely unbiased model, then you could sum up a perfect quantification for any population. Unfortunately, in practice, all models have some bias. The consequence is that a model that does better at document classification may do worse at quantifying sentiment for a population (and vice versa).
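A small, invented example makes this concrete. The sketch below compares two hypothetical classifiers on 1,000 documents using made-up confusion matrices (rows are true classes, columns are predicted classes). Neither matrix comes from a real system; they are chosen only to show that the ranking by accuracy and the ranking by quantification error can differ.

```python
LABELS = ["Positive", "Negative", "Neutral"]


def accuracy(confusion):
    """Share of documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def true_distribution(confusion):
    """Class shares according to the reference labels (row sums)."""
    total = sum(sum(row) for row in confusion)
    return [sum(row) / total for row in confusion]


def predicted_distribution(confusion):
    """Class shares according to the classifier (column sums)."""
    total = sum(sum(row) for row in confusion)
    n = len(confusion)
    return [sum(row[j] for row in confusion) / total for j in range(n)]


def l1_error(confusion):
    """Sum of absolute differences between true and predicted shares."""
    pairs = zip(true_distribution(confusion), predicted_distribution(confusion))
    return sum(abs(t - p) for t, p in pairs)


# Hypothetical model A: more documents correct, but errors lean toward Neutral.
model_a = [[60, 0, 40], [0, 25, 25], [5, 5, 840]]
# Hypothetical model B: fewer documents correct, but errors cancel out.
model_b = [[70, 5, 25], [5, 30, 15], [25, 15, 810]]

for name, matrix in (("A", model_a), ("B", model_b)):
    shares = dict(zip(LABELS, predicted_distribution(matrix)))
    print(name, f"accuracy={accuracy(matrix):.3f}",
          f"L1 error={l1_error(matrix):.3f}", shares)
```

Here model A classifies more documents correctly, yet model B reproduces the population shares exactly because its errors offset one another.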
When evaluating a sentiment solution according to its ability to classify documents, it’s important to keep a few other things in mind.
- The number of classes matters. If all classes are equally likely, then a random guess will get you 1/n accuracy (where n is the number of classes). It isn’t meaningful to compare a two-class polarity classifier (where random guessing will perform at 50% accuracy) to a three-class sentiment classifier (where random guessing will perform at 33% accuracy) to a four-class sentiment classifier (like the Visible® solution, where random guessing will perform at 25% accuracy).
- The distribution of the class labels matters. In a situation where the true distribution is 10% Positive, 5% Negative, and 85% Neutral, random guessing with equal class probabilities gets you 33% accuracy; guessing using the true distributions gets you 73.5% accuracy (0.10^2 + 0.05^2 + 0.85^2 = 0.735); and always guessing Neutral gets you 85% accuracy (which sounds great but is completely useless).
- The population of documents matters. For example, classifying tweets is much easier than classifying blog posts: the difficulty of informal and idiosyncratic language is vastly outweighed by the constraint of a single, short thought or comment. In our experience, accuracy on Twitter tends to be about ten percentage points higher than on social media content overall.
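The baseline figures in the list above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for this example.

```python
def uniform_guess_accuracy(n_classes):
    """Expected accuracy when every class is guessed with equal probability."""
    return 1 / n_classes


def proportional_guess_accuracy(class_shares):
    """Expected accuracy when guesses follow the true class distribution."""
    return sum(p * p for p in class_shares)


def majority_guess_accuracy(class_shares):
    """Accuracy of always guessing the most common class."""
    return max(class_shares)


# Positive, Negative, Neutral shares from the example in the text.
class_shares = [0.10, 0.05, 0.85]

print(f"uniform guess, 2 classes: {uniform_guess_accuracy(2):.1%}")
print(f"uniform guess, 3 classes: {uniform_guess_accuracy(3):.1%}")
print(f"uniform guess, 4 classes: {uniform_guess_accuracy(4):.1%}")
print(f"proportional guess:       {proportional_guess_accuracy(class_shares):.1%}")
print(f"always Neutral:           {majority_guess_accuracy(class_shares):.1%}")
```

Any reported accuracy is easier to interpret next to these baselines for the same number of classes and the same class distribution.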
Aggregate-Level Evaluation
The primary use of sentiment analysis in social media is to track the attitudes about a topic or opinions of a population. Since this is what we care about most, it makes sense to measure how well sentiment solutions do with this problem in particular.
Let’s look at an example where the true distribution for a population is 11% Positive, 8% Negative, and 81% Neutral, and our sentiment solution estimates the distribution to be 13% Positive, 10% Negative, and 77% Neutral. The most obvious metric to look at for quantification or estimation is the error, or distance from the true solution. The L1 distance is 2% + 2% + 4% = 8% from the true distribution. The L2, or Euclidean, distance is sqrt((2%)^2 + (2%)^2 + (4%)^2) = 4.9% from the true distribution.
While it may be tempting to turn these numbers into an “accuracy” measure, such attempts obscure the real performance of the solution. For example, you might be tempted to use 1 minus the estimation error as an accuracy measure; however, the estimation error is not constrained to the range 0 to 1, so you can end up with negative accuracy (for example, 100%/0%/0% vs. 0%/100%/0% gives an L1 distance of 200% and an L2 distance of 141%).
As an example of some of the confusing accuracy claims out there, one social media monitoring vendor reports accuracy as the percentage of the time their estimation is closer to the true distribution than a random guess (distribution) would be. While this might sound reasonable at first, a little analysis shows that this metric isn’t very useful. Due to the well-known statistical properties of high-dimensional geometry (the “curse of dimensionality”), random points tend to be far from each other. Our three-class example above (13%/10%/77%) is closer to the true distribution (11%/8%/81%) than 99% of random guesses. Is this 99% accuracy?
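The two distance measures are simple to compute. The following sketch, with illustrative function names, reproduces the figures from the example distributions and from the extreme case mentioned above.

```python
import math


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance between two distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Positive, Negative, Neutral shares from the example in the text.
true_dist = [0.11, 0.08, 0.81]
estimated = [0.13, 0.10, 0.77]

print(f"L1 = {l1_distance(true_dist, estimated):.1%}")
print(f"L2 = {l2_distance(true_dist, estimated):.1%}")

# Extreme case: all mass on different classes, so 1 - error goes negative.
worst_l1 = l1_distance([1, 0, 0], [0, 1, 0])
worst_l2 = l2_distance([1, 0, 0], [0, 1, 0])
print(f"worst-case L1 = {worst_l1:.0%}, 1 - L1 = {1 - worst_l1:.0%}")
print(f"worst-case L2 = {worst_l2:.0%}, 1 - L2 = {1 - worst_l2:.0%}")
```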
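A quick simulation illustrates how little it takes to beat random guesses. The vendor's exact procedure is not described here, so this sketch makes two assumptions: a "random guess" is a distribution drawn uniformly from all possible three-class distributions, and closeness is measured with L2 distance. The function names are hypothetical.

```python
import math
import random


def random_distribution(n_classes, rng):
    """Draw uniformly from all n-class distributions (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def l2_distance(p, q):
    """Euclidean distance between two distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def share_of_random_guesses_beaten(true_dist, estimate, trials=20000, seed=0):
    """Fraction of random distributions that lie farther from the truth
    than the estimate does (assumed reading of the vendor's metric)."""
    rng = random.Random(seed)
    target = l2_distance(true_dist, estimate)
    beaten = sum(
        l2_distance(true_dist, random_distribution(len(true_dist), rng)) > target
        for _ in range(trials)
    )
    return beaten / trials


share = share_of_random_guesses_beaten([0.11, 0.08, 0.81], [0.13, 0.10, 0.77])
print(f"estimate beats {share:.1%} of random guesses")
```

Under these assumptions the share comes out at roughly 99%, in line with the figure quoted above, even though the estimate overstates both Positive and Negative by two points each.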
Here are a few recommendations for evaluating a sentiment solution according to its ability to quantify an aggregate-level distribution:
- Directly assess real versus estimated distribution to make conclusions. If you must summarize to a single metric, concentrate on transparent metrics like error or distance, or use statistically grounded measures like a chi-squared test if you are comfortable with them.
- Deconstruct the problem to get a better idea of performance. Rather than looking at all classes at once, measure how well the solution does on relevance, subjectivity (Neutral vs. Non-Neutral), and polarity (Positive, Negative, Mixed), assessing each one separately.
- When comparing claims, make sure they are on the same number of class labels, the same topics (true distributions), and the same type of data (Twitter vs. blogs vs. forums, for example).
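One way to deconstruct the problem, as suggested in the list above, is sketched below. It scores subjectivity and polarity separately on a small set of invented labels. Relevance is left out because it requires query-level judgments, and scoring polarity only on documents that both sides treat as Non-Neutral is an assumption of this sketch rather than a standard definition.

```python
def subjectivity_accuracy(true_labels, predicted_labels):
    """Accuracy on the two-class question Neutral vs. Non-Neutral."""
    hits = sum(
        (t == "Neutral") == (p == "Neutral")
        for t, p in zip(true_labels, predicted_labels)
    )
    return hits / len(true_labels)


def polarity_accuracy(true_labels, predicted_labels):
    """Accuracy on Positive/Negative/Mixed, restricted to documents that
    both the reference and the solution mark as Non-Neutral."""
    pairs = [
        (t, p)
        for t, p in zip(true_labels, predicted_labels)
        if t != "Neutral" and p != "Neutral"
    ]
    if not pairs:
        return None  # no document is Non-Neutral on both sides
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical reference and predicted labels for ten documents.
truth = ["Positive", "Negative", "Mixed", "Neutral", "Neutral",
         "Neutral", "Positive", "Negative", "Neutral", "Neutral"]
predicted = ["Positive", "Positive", "Mixed", "Neutral", "Positive",
             "Neutral", "Neutral", "Negative", "Neutral", "Neutral"]

print(f"subjectivity accuracy = {subjectivity_accuracy(truth, predicted):.2f}")
print(f"polarity accuracy     = {polarity_accuracy(truth, predicted):.2f}")
```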
Information Retrieval Evaluation
Another important use case in social media is retrieving representative documents that express a given sentiment. For example, you might want to read examples of negative posts.
The nice thing about searching for examples is that posts don’t have to be labeled with discrete labels like Positive, Negative, or Neutral. Instead, a degree of belonging can be used. Results can be rank ordered and presented from most to least Positive. An individual post doesn’t have to be labeled with a single class but can be a mixture of various sentiment dimensions (for example, Very Positive and Somewhat Negative). There are many rank-ordering metrics that can be applied to this problem, but precision is a simple and useful metric.
Precision is a measure of the solution’s success in retrieving posts that are relevant to the label you specify. For example, if you asked for Negative posts and the solution presents 25 posts, 20 of which you agree are negative, the precision of the solution is 20/25 = 80%. For information retrieval tasks, it is often useful to trade off recall for precision (if there were 50 other Negative posts that were not presented, the recall of the solution is only 20/70 = 29%). This is because there are typically many results available, and the user experience is best when the first few pages of the results are highly precise.
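The precision and recall figures above follow directly from the counts. The sketch below reproduces them and adds precision at a cutoff k, a common way to check how precise the first results are. The ranked list of judgments is invented for illustration, chosen so that 20 of the 25 retrieved posts are judged Negative.

```python
def precision(relevant_retrieved, retrieved):
    """Share of retrieved posts that match the requested label."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_total):
    """Share of all matching posts that were actually retrieved."""
    return relevant_retrieved / relevant_total


def precision_at_k(ranked_judgments, k):
    """Precision over the top k results of a ranked list of True/False judgments."""
    return sum(ranked_judgments[:k]) / k


# Counts from the example: 25 posts shown, 20 agreed Negative, 50 Negative missed.
print(f"precision = {precision(20, 25):.0%}")
print(f"recall    = {recall(20, 20 + 50):.0%}")

# Hypothetical ranked judgments for the 25 retrieved posts (True = agreed Negative).
ranked = [True] * 10 + [True, False, True, True, False, True, True, False,
                        True, True, False, True, True, False, True]
print(f"precision@10 = {precision_at_k(ranked, 10):.0%}")
print(f"precision@25 = {precision_at_k(ranked, 25):.0%}")
```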
If you are evaluating a sentiment solution according to its ability to retrieve appropriate examples, here are a few recommendations.
- An intelligent solution will present you the best results on the first few pages. This does not necessarily generalize to document-level accuracy overall.
- Look at each of the class precisions directly (Positive precision, Negative precision, etc.). Avoid micro- or macro-averaging precision numbers, which can be misleading.
- As with other measures, the number of classes matters (precision on a two-class problem is not comparable to precision on a four-class problem). The distribution of the classes also matters: a skewed distribution like 10%/5%/85% will give lower precisions for smaller classes.
- When comparing claims, make sure they are on the same number of class labels, the same topics (true distributions), and the same type of data (Twitter vs. blogs vs. forums, for example).
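To see why averaged precision can mislead, consider the sketch below. It uses a made-up confusion matrix for 1,000 documents with a 10%/5%/85% class split (rows are true classes, columns are predicted classes) and compares per-class precision with the micro- and macro-averaged figures. The numbers are hypothetical and chosen only to show the effect.

```python
LABELS = ["Positive", "Negative", "Neutral"]


def per_class_precision(confusion, labels):
    """Precision per class: correct predictions divided by all predictions of that class."""
    result = {}
    for j, label in enumerate(labels):
        predicted = sum(row[j] for row in confusion)
        result[label] = confusion[j][j] / predicted if predicted else None
    return result


def micro_precision(confusion):
    """Pool all predictions; for single-label data this equals accuracy."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def macro_precision(confusion, labels):
    """Unweighted mean of per-class precisions (classes never predicted are skipped)."""
    values = [v for v in per_class_precision(confusion, labels).values() if v is not None]
    return sum(values) / len(values)


# Hypothetical confusion matrix with a 10%/5%/85% true class split.
confusion = [[60, 5, 35], [5, 25, 20], [35, 20, 795]]

for label, value in per_class_precision(confusion, LABELS).items():
    print(f"{label:8s} precision = {value:.1%}")
print(f"micro-averaged     = {micro_precision(confusion):.1%}")
print(f"macro-averaged     = {macro_precision(confusion, LABELS):.1%}")
```

In this invented case the micro average is dominated by the large Neutral class, and neither average reveals that only half of the posts returned as Negative are actually Negative.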
Comparing Automation to Humans
For most data, automated sentiment technology cannot yet reach the quality of a smart, well-trained, and careful human annotator. Automated solutions can, however, give comparable results to humans in certain real-world scenarios. Annotating documents for sentiment is a thankless, tedious, and usually low-paying job where it’s easy to produce low-quality work. Automated techniques are tireless, fast (can score long posts in milliseconds), consistent (they don’t make random errors), and can be improved over time.
One interesting observation is that automated techniques sometimes make errors in an inhuman way. Even when humans and automated techniques make the same number of mistakes, customers sometimes have a hard time with the types of mistakes the automation makes. Mistakes made by humans tend (to another human) to be less blatant or more understandable. Mistakes made by automation can tend (to a human) to seem “obviously” wrong. If the degree or magnitude of the mistake is important to your application, that can be built into the evaluation metrics.
Finally, the tradeoff between human and automated sentiment analysis is, in practice, a false one. Humans can’t compete with the speed and consistency of an automated solution, but are needed for judgment, insight, and interpretation of those results. Both are necessary and work together in partnership.
Business Applications of Sentiment
If you have looked at social media monitoring platforms to help you better understand what consumers are saying about your brand on the social web, sentiment has probably come up on more than one occasion. In this section, we look at what sentiment means from a business perspective, how it can be used by a business looking at social media, and factors to keep in mind when evaluating it.
A sentiment score can be extremely useful in evaluating a large data set of social brand mentions. Sentiment scores give users a straightforward way to segment and filter content based on positive or negative commentary, allowing them to isolate the themes or issues driving that sentiment. They also allow for dynamic and illustrative reporting of trends and market reactions, or situations like product recalls.
Each of these uses can help provide great insight into social data and can help propel a brand forward. It is important to remember, however, that while powerful and accurate scientific methods can be applied to analyzing what people are expressing online, understanding the context, sarcasm, intention, and wit used in human communications can require as much art as science. Keeping this in mind will prepare you to understand both the power and limitations of sentiment analysis.
The Nuances of Sentiment
One of the challenges of understanding and applying a sentiment analysis solution in a business setting is that sentiment is not a one-dimensional result with a universally agreed upon set of criteria. This is particularly true of social media content, and it means that evaluating the performance and accuracy of any solution is complex.
Removing the automated functionality and relying on human scorers wouldn’t solve the problem: according to Lexalytics, human analysts agree on sentiment scoring on average only 80 percent of the time. This means that a sentiment score will be wrong 1 out of every 5 times, even with strict judging by individuals who do their best to take context, subject matter, relevance, humor, and sarcasm into consideration. Subjectivity, variances in interpretation, and the context of what is being expressed create challenges for both humans and machines.
So where does that leave businesses that depend on the accuracy of analytics for critical business decisions, or early detection of negative incidents? Significantly better off than they were before. Sentiment analysis is still valuable, particularly when used alongside other indicators such as volume change and frequency analysis. It is important, however, to have realistic expectations of what is possible and an understanding of the limitations that exist.
Evaluating Sentiment Accuracy across Solutions
Let’s talk about practical ways of evaluating solutions you may be considering. As demonstrated above, accuracy is a measure that can be calculated, but it is also a figure that can be misrepresented or tailored to support a given scenario. So it is important to look behind the numbers and make sure the total solution is suited to your organization’s needs.
Start with the obvious and compare accuracy rates, but be sure to keep an apples-to-apples perspective when doing so. For example, most solutions tend to divide sentiment into three classes: Positive, Negative, or Neutral (no sentiment expressed). This is a perfectly acceptable approach to classifying sentiment; however, keep in mind that the accuracy figures of a three-class solution are inherently different from those of a four-class solution that includes a Mixed sentiment score. The first has a one in three chance of randomly getting it right, while the other has a one in four chance simply because there are more choices.
In addition to the accuracy figures, consider the vendor’s explanations about how accuracy is calculated, how they address and improve the system’s accuracy and learning over time, and what types of quality checks and improvements they are making. Ask about the type and amount of data they used to create, train, and refine their system.
Given that the highest benchmark of accuracy, human scoring, yields only about 80 percent agreement, it is important to look beyond the percentages and random sampling and evaluate how the sentiment measure fits into the overall analytics solution. Does the platform allow you to sort, filter, and search for content based on the sentiment? Can you drill down into results, change and manipulate your criteria on the fly, and create an unlimited number of queries to answer all of your questions?
Consider the number of sentiment classes. Does the solution provide enough for your business purposes? For example, is it adequate to have Positive, Negative, and Neutral categories, or would identifying Mixed posts be beneficial? (Mixed indicates both positive and negative sentiment within a single post.) A Mixed sentiment score would enable you to identify when consumers are on the fence about making a purchase decision: perhaps the price is right, but a concern about replacement-part availability is what is keeping someone from buying.
Conclusions
In summary, keep the following questions in mind when evaluating social sentiment solutions and choosing the platform that is right for your business:
- What are my social media goals and how does sentiment fit into the equation?
- What type of reporting do I want to do and how does sentiment help me do that?
- Do I need that reporting at the industry level, at the brand level, or for very specific issues?
- Can I segment, filter, search, and sort by sentiment in the platform?
- How refined do I want my sentiment to be? Are Positive and Negative enough, or do Mixed and Neutral sentiment scores matter?
- Is my subject matter more contextual, or does a simple keyword/phrase match identify relevant content? The more contextual the subject matter is, the more time you or an analyst will be spending reading content for relevance to the subject matter, and sentiment will be potentially less meaningful.
- How does the vendor define accuracy for sentiment and based on what criteria? Is this the same for all vendors I’m considering?
- How is sentiment applied? At the phrase, sentence, paragraph, or document level? Does it depend on the context of the query? Does this make sense for the type of subject matter and depth we want to understand?
- Was the sentiment system built for social media? Is it based on the unique communication styles and forms of social media data?
- Does the platform allow me to override or otherwise earmark a sentiment score I disagree with or want to report differently?
To learn more about Visible’s approach to social media sentiment assessment, read our white paper, The Visible Approach to Assessing Social Media Sentiment.
References
[Liu09] Liu, B. 2009. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, Second Edition.
[PangLee08] Pang, B. and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.