Data Curation for Social Media Listening

Every self-respecting Market Research agency boasts of delivering robust insights and high-quality data-driven results. But when it comes to social media listening, what is this data, who is qualified to provide it, and who determines whether it is relevant?
 

This is where Data Curation comes in. According to Wikipedia, curation entails a ‘range of activities and processes done to create, manage, maintain and validate a component’. While the term has commonly been associated with artworks and library professionals, Data Curation is becoming ever more important in Online Market Research and Active Web Listening, where it ensures data quality and reliability; after all, it is often quality that makes data meaningful, not volume.

Data curation is a process that requires a seamless combination of technology and human knowledge. This dual nature, and its importance, became clear while working on a project for a leading brewery company: the task was to measure the online presence and performance of their most recent campaign.
 

1. Harvesting the Data

The point of departure was to use web crawlers and data aggregators set to search ‘historical’ social media data (March-May 2013) for the campaign’s tag line, a 5-word phrase. This process accessed millions of websites; in fact, it combed through practically every Google-indexed website and returned a whopping 22,448 mentions. Yet, on going through this data, it was hard to find a comment or mention that actually referred to the campaign. So we adjusted the monitor, adding the brand’s name and more specific keywords to the search. This time the result was 3,558 mentions. Again, however, these mostly referred to the brand in general, to previous campaigns, or to unrelated contexts. Tackling this ambiguity required further human intervention: checking each mention for the brand name together with keywords specific to the campaign in question. After deduplication, the total number of mentions specific to the latest campaign was 177, excluding YouTube and Twitter mentions!
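The filtering funnel described above can be sketched roughly as follows. The tag line, brand name and campaign keywords here are placeholders, since the article does not disclose the actual terms; the deduplication rule (collapsing mentions with identical normalised text) is likewise an illustrative assumption, not the agency’s actual method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    url: str
    text: str

# Placeholder terms -- the real tag line, brand name and
# campaign-specific keywords are not disclosed in the article.
TAG_LINE = "example five word tag line"
BRAND = "examplebrew"
CAMPAIGN_TERMS = {"examplecampaign2013", "examplehashtag"}

def contains_tag_line(m: Mention) -> bool:
    """First, broad pass: does the mention contain the tag line?"""
    return TAG_LINE in m.text.lower()

def mentions_brand_and_campaign(m: Mention) -> bool:
    """Narrower pass: brand name plus a campaign-specific keyword."""
    text = m.text.lower()
    return BRAND in text and any(t in text for t in CAMPAIGN_TERMS)

def deduplicate(mentions):
    """Collapse mentions whose normalised text is identical,
    keeping the first occurrence of each."""
    seen, unique = set(), []
    for m in mentions:
        key = " ".join(m.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique
```

Each pass shrinks the candidate pool, mirroring the 22,448 → 3,558 → 177 funnel described above; the final manual check for campaign context is precisely the step that resists full automation.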

This is indicative of how much ‘noise’ is out there, and how much of the data is ultimately unnecessary or irrelevant. Had the brewery group turned to a social media monitoring technology company that does not utilise curation, they might have been left with the impression that thousands of people had talked about their latest campaign online.
 

2. Sentiment Analysis

Having narrowed the data down to 177 instances from all the accessible websites on the planet, we then had to find out what was being said about the brand’s campaign across the obvious social media platforms, such as Twitter, Facebook and YouTube. The algorithm automatically classified the data into positive, negative and neutral comments, but we then had to curate these classifications to ensure the highest possible level of accuracy. More specifically, an analyst would go through the data to verify that the classification algorithm had annotated it correctly, taking context and tone into consideration where appropriate. Sentiment Coding is not to be confused with Data Curation; rather, it falls under it, since Data Curation both eliminates unnecessary data and improves automated sentiment analysis. Before moving further, it should be acknowledged that algorithms in such cases cannot be expected to deliver 100% accuracy.
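The analyst-over-algorithm workflow can be sketched as a simple review loop. The function name and the idea of collecting corrections for retraining are illustrative assumptions; the article does not describe the agency’s actual tooling.

```python
def curate(comments, auto_labels, analyst_review):
    """Apply analyst review on top of automatic sentiment labels.

    analyst_review(comment, auto_label) returns the label the analyst
    settles on, taking context and tone into account. Corrections are
    collected separately so they can later be fed back to retrain the
    classifier.
    """
    final, corrections = [], []
    for comment, auto in zip(comments, auto_labels):
        verified = analyst_review(comment, auto)
        final.append(verified)
        if verified != auto:
            corrections.append((comment, auto, verified))
    return final, corrections
```

The corrections list is what makes curation cumulative: each reviewed batch both fixes the current report and sharpens the classifier for the next one.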

This is only natural, given the complexity of language and the ways we use it. According to research on research, humans agree on the sentiment of a mention only 76% of the time, so we would place the maximum sentiment accuracy achievable by an algorithm at around 80%. The output of a text classification algorithm can be gradually improved through curation; consequently, human intelligence is not only preferable, but necessary. Having said that, once the algorithm is adequately trained for a specific language and product category, and its sentiment accuracy approaches 80%, the effort required for data curation drops dramatically. What is more, the analyst will also be able to keep an eye open for a good verbatim, or that genius comment that might lead to connecting another dot and generating another nugget or business insight for our client.
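The human-agreement ceiling mentioned above is usually measured as simple percent agreement between two annotators labelling the same mentions; a minimal sketch, with illustrative labels:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of mentions on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same set of mentions")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

If two trained humans only reach roughly 0.76 on this measure, an algorithm scored against either of them cannot meaningfully exceed that ceiling by much, which is why 80% is a realistic target rather than a shortfall.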

 



