Are you willing to risk your life for Social Listening and Analytics? The 6 Steps of Taxonomy Creation
During 2 of the 5 days that Brussels was on lockdown, specifically on November 23rd and 24th, the LT-Accelerate conference took place near the Brussels city centre. LT stands for ‘language technology’, which is mainly about using automated methods to understand and extract value from large sets of unstructured data.
Out of the 90 registered delegates, only around 30 showed up. Some of the speakers chose to present from the comfort of their homes or offices using Skype. For those of us who were actually in Brussels at the conference, it did not really feel like we were putting our lives at risk, but obviously the ~60 people who did not make it thought that it was unsafe to visit the city during a Level 4 state of emergency. Restaurants, schools, the Metro, and shopping malls were all closed during these days. The night before the conference the city was eerily empty, since residents and visitors were encouraged to remain at home and away from crowds. The soldiers carrying machine guns and the very obvious police presence added to the overall impression that something bad could happen at any moment.
The conference participants
There were only 5 companies from the market research sector – including DigitalMR – and the rest were mainly technology companies and end-users such as Xerox, Wolters Kluwer, and AOL. The likes of Google and Twitter also had speaker slots.
It would be an omission not to praise the organisers, Philippe Wacker from LT-Innovate and Seth Grimes from Alta Plana, for using technology to run a successful conference despite the difficulties.
Rich Social Media Analytics
The DigitalMR presentation was about Rich Social Media Analytics making use of taxonomies. Most of the work that goes into creating a hierarchical taxonomy about a product category, say beer, is done manually at this point. Some automated processes exist to aid taxonomy development, but it still takes 3 weeks to create a 3-level taxonomy with over 75% coverage of the data set to be analysed. The taxonomy is developed using a combination of top-down and bottom-up methods, in any language, by native speakers. At least one of them needs to understand the subject sufficiently well (the subject matter expert); the rest only need to be native speakers of the relevant language and have common sense, as far as “common” sense can exist.
Here are the 6 steps of taxonomy creation for social listening analytics, the DigitalMR way:
- Define the product category or subject (such as beer or parliamentary elections)
- Pick the main hierarchy 1 topics or themes as a starting point
- Harvest social media posts based on a query that significantly reduces noise (i.e. the number of irrelevant posts)
- Identify the frequency of keywords and phrases in the harvested text corpus and rank them accordingly
- Pick the topics/themes that are relevant from the ranked lists and find their place in the taxonomy, initially under a hierarchy 1 topic
- Check the taxonomy coverage and precision, at least at hierarchy 1 level
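The keyword ranking step in the list above can be pictured with a minimal Python sketch. It is an illustration only, not the DigitalMR tooling: the sample posts, the tiny stop-word list, and the function name `rank_keywords` are all assumptions made for the example, and it counts single words rather than phrases.

```python
import re
from collections import Counter

# Hypothetical sample posts standing in for a harvested corpus.
posts = [
    "This lager has a crisp taste and a fair price",
    "Great taste, but the price is too high",
    "Love the new bottle design of this lager",
]

# A tiny illustrative stop-word list; a real project would need a fuller one.
STOP_WORDS = {"this", "has", "a", "and", "but", "the", "is", "too", "of"}


def rank_keywords(texts, top_n=5):
    """Count non-stop-word tokens across all texts and return the most frequent."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Ranked list of (keyword, frequency) pairs for a human to review.
print(rank_keywords(posts, top_n=3))
```

A native speaker would then go through the ranked list, keep the relevant keywords, and decide where each one belongs in the taxonomy.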
Coverage (topic annotation) of over 75% of all the posts in the data set provides good enough recall for the subjects people discuss the most on social media. The scientific literature describes a number of ways to automate the creation of a taxonomy, but none of these methods is good enough for market research purposes, especially when it comes to FMCG (or CPG for our readers in the US). The next evolutionary phase in taxonomy creation for market research will be a more automated approach that reduces the time required to create one from 3 weeks to perhaps a few days.
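To make the coverage figure concrete, here is a rough sketch of how it could be computed, assuming coverage means the share of posts annotated with at least one hierarchy 1 topic. The taxonomy, the keywords, the sample posts, and the names `annotate` and `coverage` are hypothetical; real annotation would be far less naive than exact keyword matching.

```python
import re

# Hypothetical hierarchy 1 topics mapped to illustrative keywords.
taxonomy = {
    "taste": {"taste", "flavour", "crisp"},
    "price": {"price", "expensive", "cheap"},
    "packaging": {"bottle", "label", "packaging"},
}

# Hypothetical harvested posts; the fourth one is noise.
sample_posts = [
    "Crisp taste, would buy again",
    "Way too expensive for what you get",
    "The new label looks great",
    "Watching the match with friends tonight",
    "Cheap and cheerful",
]


def annotate(post, taxonomy):
    """Return the set of hierarchy 1 topics whose keywords appear in the post."""
    tokens = set(re.findall(r"[a-z']+", post.lower()))
    return {topic for topic, keywords in taxonomy.items() if tokens & keywords}


def coverage(posts, taxonomy):
    """Share of posts annotated with at least one topic."""
    if not posts:
        return 0.0
    covered = sum(1 for post in posts if annotate(post, taxonomy))
    return covered / len(posts)


print(coverage(sample_posts, taxonomy))
```

Precision is not shown here, because checking it needs a human to confirm whether each annotation is actually correct.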
The rest of the presentations
What impressed me the most during this conference was the large number of use cases for sentiment and semantic analysis of unstructured data beyond market research. I also found it a bit odd that most presenters involved in sentiment analysis use dictionaries that define positive and negative words, as opposed to a supervised machine learning approach. I got the impression that there is a general aversion to using humans to create a training data set each time a new product category or subject is analysed. It is understandable that technology companies want to address a use case with software alone, and thus develop a fully scalable approach that minimises the involvement of human curators and analysts.
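For readers unfamiliar with the dictionary approach, the following Python sketch shows the general idea in its simplest form. It does not reproduce any presenter's system: the word lists and the function name `lexicon_sentiment` are made up for illustration.

```python
import re

# Hypothetical sentiment dictionaries; real ones hold thousands of entries.
POSITIVE = {"great", "love", "crisp", "refreshing"}
NEGATIVE = {"bad", "flat", "awful", "expensive"}


def lexicon_sentiment(text):
    """Label a text by counting positive words minus negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("Love the crisp taste"))
print(lexicon_sentiment("Flat and awful"))
# Negation is ignored by this naive scoring, so "Not great" comes out positive.
print(lexicon_sentiment("Not great"))
```

The last call hints at why a fixed word list can struggle: it has no notion of context, which is what a training data set curated by humans for each category is meant to capture.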
One of the presenters, Jean-Francois Damais from Ipsos, said: “Do not get rid of the analyst just yet”. We do agree with this thesis, but then again we come from the same sector as Ipsos, i.e. market research. This is a recognition that there is not yet a good enough, fully automated approach to sentiment and semantic analysis for objectives related to customer insights and marketing research.