ESOMAR Revises its Stance on Social Intelligence Accuracy
The precondition to discovering any insights (or foresights), be they unique, actionable or otherwise, is to base them on data of known and acceptable accuracy. When it comes to social intelligence, some users of machine learning based solutions do not actually know how to measure the accuracy of those solutions objectively.
Buyers of social intelligence solutions are expected to know a myriad of things, short of being data scientists themselves, in order to make educated purchasing decisions; to that end, thankfully, ESOMAR has published a guide. This is great progress, and there could not be a more credible source for a buyers’ guide on social intelligence than ESOMAR.
In this article we are sharing DigitalMR’s answers to the 26 Questions that made it into the ESOMAR guide. The 26 questions are those that buyers of unstructured data analysis solutions should be asking vendors who offer them. All of the questions are important and for some there are no right or wrong answers, but the vendors must have an answer at their fingertips for all 26 of them. If you are the buyer and a vendor responds to any of the questions with “what do you mean?”… run!
The 26 questions and the DigitalMR answers are structured in five sections below.
A. Company Profile and Capabilities
1. What is the company’s core business – the services offered, and verticals served?
DigitalMR is a technology company in the market research sector offering platform access as well as end-to-end market research services to Agencies, FMCG, Retail, Financial Services, Telecoms, Tourism & Hospitality, Healthcare, Automotive & NGOs.
2. What are the typical deliverables?
- Online and offline dashboards
- Annotated data in CSV files
- Excel tables with aggregated data
- Periodic Executive Summaries
- PPT reports with conclusions and recommendations
- Presentations and action plan meetings
3. How is pricing determined?
The pricing for social intelligence is based on product category, language (not country) and period covered. A rule of thumb is that an average product category is defined by up to 12 competitive brands. These 12 brands are used as keywords for harvesting from the web. The frequency of reporting and the delivery mechanism also have an impact on cost.
The pricing for any text or image analytics processing and annotation through an API, regardless of data source, is charged per annotated post or image.
4. Are there case studies that can be shared?
Yes, for many different product categories and languages and in different formats e.g. PDF decks, infographics, one pagers and demo dashboards.
B. Data Sources and Types
5. What data sources does the company rely on?
For Social Intelligence DigitalMR harvests data from social media and any public website such as Twitter, Blogs, Forums, Reviews, Videos, News and also Facebook and Instagram with some limitations that apply to all data providers.
The DigitalMR text and image analytics technology is source agnostic and can therefore ingest client data from open ended questions in surveys, transcripts of qualitative research, call centre conversations or any other source of unstructured data.
6. How does the company gather the data?
For social intelligence DigitalMR uses all the available methods to harvest data from public sources i.e. direct APIs, Aggregator APIs, Custom crawlers and scrapers, RSS feeds etc. When doing so DigitalMR abides by the ESOMAR code of conduct, the law and the Terms & Conditions of the sources.
For client data (see answer to Q5), the client can share its own data by email, via FTP, on cloud drives or through APIs.
7. Does the company provide historical data from its sources?
For social intelligence yes - as long as the posts still exist online at the time of harvest.
C. Software Design and Capabilities
8. What types of unstructured data analysis is the software capable of producing?
Text, images, audio and video can be harvested from the web or taken from other sources (see answer to Q5). listening247 - the DigitalMR software - offers the capability of data harvesting from online sources. It provides buzz (word counts), sentiment, 7 pairs of opposite emotions such as ‘Love Vs Hate’, and semantic (topic) analysis. The topic analysis provided is both inductive (bottom-up) and top-down. Topics can be broken down into sub-topics, sub-topics into attributes, and so on. listening247 can also analyse images for objects, brand logos, text (extraction) and image theme (caption). It uses 3rd party technology to turn audio to text, followed by its own text analytics capability to analyse for sentiment, emotions and topics.
9. Does the software use machine learning or an engineered approach to produce the analyses?
The listening247 software represents the implementation of years of R&D funded by the UK government and the EU. It includes supervised, semi-supervised and unsupervised machine learning as well as deep learning for data “cleaning”, sentiment, emotions, topics and image annotations. For data “cleaning” and topic annotations DigitalMR uses a combination of engineered approaches and machine learning. All listening247 custom models and set-ups continuously improve their accuracy. The user can also improve the supervised machine learning models by adding training data at any time.
10. What is the resolution of automated text analysis?
The text analysis is done at document, paragraph, sentence, phrase, or keyword mention level. This is the choice of the client. The analysis extracts named entities, pattern-defined expressions, topics and themes, aspects (of an entity or topic), or relationships and attributes – and it offers feature resolution, that is, identifying multiple features that are essentially the same thing as the example in the guidance (Winston Churchill, Mr. Churchill, the Prime Minister are a single individual.)
The sentiment or emotions analysis is ascribed to each of the resolved features or at some other level; the user may choose the resolution of e.g. sentiment/emotion and semantic annotation.
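To make feature resolution concrete, here is a minimal Python sketch. It assumes a hand-built alias table; the names `ALIASES` and `resolve_features` are hypothetical and are not part of listening247, which relies on custom models rather than a fixed lookup.

```python
import re

# Hypothetical alias table: every surface form maps to one canonical entity.
# A production system would need context to decide who "the Prime Minister" is.
ALIASES = {
    "winston churchill": "Winston Churchill",
    "mr. churchill": "Winston Churchill",
    "the prime minister": "Winston Churchill",
}

def resolve_features(text: str) -> list[str]:
    """Return the canonical entity for each alias mention, in order of appearance."""
    lowered = text.lower()
    hits = []
    for alias, canonical in ALIASES.items():
        for match in re.finditer(re.escape(alias), lowered):
            hits.append((match.start(), canonical))
    return [canonical for _, canonical in sorted(hits)]

mentions = resolve_features("Mr. Churchill spoke. Later the Prime Minister left.")
print(mentions)
```

Once the mentions are resolved to one entity, sentiment or emotion can be ascribed to that entity rather than to each surface form separately.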
11. Does the software provide document level data (e.g. individual posts to social media or specific survey open end) or only analytics based on document aggregation (i.e. quantitative analysis on a dashboard without the capability to drill through to the verbatims)?
listening247 provides document level data with the capability to drill through to the posts/verbatims, making it possible for users to verify the accuracy of all the annotations made by the models.
12. In which languages can each of the automated analyses mentioned in questions 8-10 be carried out at the advertised accuracy?
In literally all languages, including the likes of Arabish (Arabic expressed in Latin characters) and Greenglish (Greek expressed in Latin characters), since the automated analyses are done using custom models specifically created for the particular product category and language. The only trade-off is that it takes 1-3 weeks to create the set-up that guarantees the accuracy as advertised.
13. Does the company use third party software or Web services (APIs) to produce the analyses or has it developed its own capability for market research purposes?
DigitalMR uses its own proprietary software and models to produce all the analyses. It provides fully configured customised models; the end user is not responsible for that training but has the option to participate or improve if they wish to do so.
14. Can the system extract or infer a data subject’s demographic characteristics such as age, gender, income, education, and geography, and, if so, how (e.g. via metadata extraction, text analysis, or record linkage to external systems)? What validation processes are applied?
When it comes to social intelligence, limited demographics are available in the meta-data of normally harvested posts - see Q6. Any and all demographics can be inferred/predicted using a custom machine learning model which is trained to classify authors based on the way they write. The accuracy of prediction can be validated by testing it on new annotated data that was not used to train the model.
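The validation step described above (testing on annotated data that was not used for training) can be sketched in Python as follows. This is an illustration under assumptions: `holdout_split` and `accuracy` are hypothetical helper names, and `predict` stands in for whatever trained model is being validated.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle annotated (text, label) records and set aside a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def accuracy(predict, test_records):
    """Share of held-out records for which the model predicts the annotated label."""
    correct = sum(1 for text, label in test_records if predict(text) == label)
    return correct / len(test_records)

# Toy annotated data: author texts labelled with an age band.
records = [(f"post {i}", "18-34") for i in range(10)]
train, test = holdout_split(records)
print(len(train), len(test))
```

The key point is that the test records never appear in the training set, so the measured accuracy reflects how the model behaves on new authors.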
15. Is there any data sampling involved or needed, and if sampling is required or offered, what methods are applied?
For social intelligence DigitalMR typically harvests and reports all the posts from all the keywords and sources included. This is called census data as opposed to sample data. Data sampling is only done at the training data generation part of the process when the approach used is supervised machine learning. A random sample of 10% of the posts or 20,000 posts, whichever is smaller, is annotated by humans and used as training data.
When it comes to sources other than the web, smaller samples are needed to train the machine learning algorithms in order to reach the minimum accuracy.
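The sampling rule for training data can be written out as a short Python sketch. The function names are hypothetical, and simple random sampling without replacement is an assumption about how the sample is drawn.

```python
import random

def training_sample_size(total_posts: int, percent: int = 10, cap: int = 20_000) -> int:
    """10% of the harvested posts or 20,000 posts, whichever is smaller."""
    return min(total_posts * percent // 100, cap)

def draw_training_sample(posts, seed=0):
    """Draw the posts to be annotated by humans (simple random sample, assumed)."""
    k = training_sample_size(len(posts))
    return random.Random(seed).sample(posts, k)

print(training_sample_size(50_000))     # 10% applies: 5,000 posts
print(training_sample_size(1_000_000))  # cap applies: 20,000 posts
```

The cap takes over once a harvest exceeds 200,000 posts, since 10% of that figure is exactly 20,000.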
16. What is the intended, target function of the system or service?
listening247 was originally designed for market research purposes (in any language) thus the focus is on data accuracy and data integration with other sources such as surveys and transactional/behavioral data for insights. A few years down the line, it is now also being used for sales lead generation and identification of micro/nano influencers.
D. Data Quality and Validation
17. How is the data cleaned to ensure that only relevant documents are used for the intended analysis?
For social intelligence, listening247 uses a combination of boolean logic and machine learning models to eliminate irrelevant posts due to homonyms. The priority and focus during the set-up period of a social listening tracker is to include all the synonyms (also misspellings, plurals etc) and exclude all the homonyms. Typically the data processed is over 90% relevant i.e. only a maximum of 10% is noise.
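The Boolean part of this cleaning step can be illustrated with a small Python sketch. The keyword lists and the `is_relevant` function are hypothetical examples, not the actual listening247 queries, and the machine learning models that complement them are not shown.

```python
import re

def is_relevant(post: str, include_terms, exclude_terms) -> bool:
    """Keep a post if it mentions an included term and no homonym marker."""
    text = post.lower()

    def has_any(terms):
        return any(re.search(r"\b" + re.escape(t.lower()) + r"\b", text) for t in terms)

    return has_any(include_terms) and not has_any(exclude_terms)

# Hypothetical set-up for a car brand whose name is also an animal.
INCLUDE = ["jaguar", "jaguars", "jagaur"]    # brand name, plural, common misspelling
EXCLUDE = ["zoo", "rainforest", "big cat"]   # context words that signal the homonym

posts = [
    "Test drove the new Jaguar today",
    "Saw a jaguar at the zoo",
    "My jagaur needs a service",
]
kept = [p for p in posts if is_relevant(p, INCLUDE, EXCLUDE)]
print(kept)
```

Exclusion lists like this catch only the obvious homonyms, which is why the answer above pairs Boolean logic with trained models.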
18. At the resolution mentioned in Q10 what is the minimum guaranteed accuracy of the analysis carried out by the software?
DigitalMR offers a money back guarantee for the following precisions in any language:
- Sentiment ≥75%
- Topics ≥80%
- Brands or Keywords ≥90%
Recall is usually at similar levels, but it is not deemed as important as precision for market research purposes: even if we end up with, say, 50% of all the data (50% recall), the sample is still hundreds if not thousands of times larger than the samples we use to represent populations in surveys.
For image captioning the committed BLEU-1 score is ≥75%.
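For readers who want to check these figures themselves, precision and recall can be computed from a set of human-verified posts as in the Python sketch below. It assumes a single label treated as yes/no (for example "negative sentiment"); the function name and the toy data are illustrative only.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision and recall for one label, given model output and human annotation."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # correctly flagged
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # wrongly flagged
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: the model flags 4 posts, 3 of them correctly, and misses 3 others.
predicted = [True, True, True, True, False, False, False, False]
actual    = [True, True, True, False, True, True, True, False]
print(precision_recall(predicted, actual))  # (0.75, 0.5)
```

In this toy case three of the four flagged posts are correct (75% precision), while only three of the six truly matching posts were found (50% recall).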
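BLEU-1 measures how many words of a generated caption also appear in a reference caption, with a penalty for captions that are too short. The Python sketch below is a simplified, sentence-level version with a single reference and whitespace tokenisation; these simplifications are assumptions, and it is not necessarily the exact scoring set-up used by DigitalMR.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in Counter(cand).items())
    precision = clipped / len(cand)
    # Captions shorter than the reference are penalised; longer ones are not.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * precision

score = bleu1("a dog on the beach", "a dog on the sand")
print(round(score, 2))  # 0.8
```

Here four of the five caption words match the reference and the lengths are equal, so the score is 0.8, which would clear a 75% threshold.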
19. Is the user able to check the accuracy by themselves without any support from the software vendor?
20. What is the method for identifying spam in social media?
Different users have different definitions of spam. These are identified at the beginning of the project and eliminated during the set-up process described under Q17 by using a combination of boolean logic queries and custom machine learning models. Clients are also enabled to flag and remove spam themselves should they find any.
E. Ethical and Legal Compliance
21. Does the company comply with the relevant legal data protection requirements in the jurisdictions in which it sources, processes, and shares data?
Yes, absolutely. DigitalMR goes further, since it also complies with the ESOMAR code of conduct, which is stricter than the local laws.
22. What specific processes are in place to ensure the above described compliance?
DigitalMR abides by the ESOMAR code of conduct. It not only stays informed about changes to the laws and to the terms & conditions of specific sources, it also gets actively involved in making sure the clients/users of these services stay well informed (e.g. the initiative to create this document under the auspices of ESOMAR). DigitalMR uses the highest standards of security in storing and transmitting data.
23. What codes of conduct and industry standards does the company abide by?
DigitalMR abides by codes of conduct and industry standards including the ICC/ESOMAR International Code on Market, Opinion and Social Research and Data Analytics, and the code of conduct of the Market Research Society (MRS) in the UK.
24. How does the company ensure that data subjects are not harmed as a direct result of their data being collected, processed, and shared?
By abiding by the codes of conduct mentioned in Q23. On the occasions when DigitalMR contacts the author of a post, it strictly follows the etiquette of the medium where the post was found, and it does so only where the medium/platform allows such contact and the authors of such posts usually expect it. No offers are made unless the author indicates acceptance in the process of following the contact etiquette.
25. How does the company safeguard the privacy of data subjects in what it shares with users?
Only data from public sources are shared with users without masking. If the data is not from a public source then it is only offered in aggregated form or masked.
26. What information security practices are in place to ensure the security of data? Does the company allow clients to audit said processes?
Most of the data in social intelligence is public, but on the occasions when the data is owned by the client or is sourced from a non-public source, cutting edge security measures are used. DigitalMR uses secure sites and encrypted transmissions to protect the data in its custody.
All communication to and from listening247 happens over Secure Sockets Layer (SSL) to ensure that client-server communication is encrypted. In addition our hosting partner has successfully completed multiple SAS70 Type II audits, and now publishes a Service Organization Controls 1 (SOC 1), Type 2 report, published under both the SSAE 16 and the ISAE 3402 professional standards, as well as a Service Organization Controls 2 (SOC 2) report. A PCI (Payment Card Industry) DSS (Data Security Standard) Level 1 certificate has also been received. The users are welcome to carry out their own audits.
This is the end of the guide which can also be found on the ESOMAR website.
We realise that we may be giving away our competitive advantage by helping create the guide and by publishing our answers to the questions; however, this is one of those cases where the aphorism “a rising tide lifts all boats” applies.
Nothing will please me more than to hear from you with comments or questions on this article. Please tweet to @DigitaMR_CEO or send me an email.