Removing bad data improves your estimates by up to 20% and shortens the time it takes to write an insight debrief.
Removing speeders and straight-lining respondents is the first step toward ensuring you provide quality research data. Outliers are extreme answers that don't represent your target market. While they can represent genuinely unusual individuals, they are overwhelmingly comprised of unengaged respondents and AI 'bot' responders.
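For readers who want to reproduce this kind of pre-screen themselves, the following is a minimal sketch in Python. It assumes a pandas DataFrame with a completion-time column and a set of Likert-grid items; the column names and the two-minute threshold are illustrative, not the tool's actual rules.

```python
# Sketch of a speeder / straight-liner pre-screen.
# 'duration_sec' and 'q1'..'q10' are hypothetical column names;
# the 120-second cutoff is an assumed threshold, not a standard.
import pandas as pd

def flag_speeders_and_straightliners(df: pd.DataFrame, item_cols, min_duration=120):
    """Return boolean masks for speeders and straight-lining respondents."""
    # Speeders: completed the survey implausibly fast.
    speeders = df["duration_sec"] < min_duration
    # Straight-liners: gave the identical answer to every grid item.
    straightliners = df[item_cols].nunique(axis=1) == 1
    return speeders, straightliners

# Usage:
# item_cols = [f"q{i}" for i in range(1, 11)]
# speeders, straightliners = flag_speeders_and_straightliners(df, item_cols)
# clean = df[~(speeders | straightliners)]
```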
Anomalous responses flatten your data. Removing these cases makes it easier to build a story and to deliver more accurate, defensible market measures.
This online tool is free to use. Your SPSS file is processed in the EU and subject to GDPR.
NB: Your SPSS file is not stored, saved, or retained.
This report is generated on-the-fly and cannot be retrieved after you leave this webpage.
anomalies were found. Of these, are notably suspicious and should be removed.
Please note that the algorithms cannot distinguish between fraudulent and honest responses with 100% certainty, so not all of the outliers necessarily represent 'bad' data. The algorithms can only detect unlikely or odd cases: those that sit outside the broad space where the majority of respondents are found. Fraudulent respondents, unengaged respondents, and 'bots' may well be identified, but anomalies can also be genuine. For example, 'a pensioner who commutes on a bike and owns a PlayStation 5' is an unlikely case, yet might be your very real but eccentric aunt.
Regardless, most analysts tend to remove as many anomalies as possible before attempting to build explanatory models. Retaining these responses, even when authentic, turns up the noise in the data and drowns out the signal. That said, overzealous cleaning harms the representativeness of your data.
As such, it is advised that you do not follow the algorithms blindly, particularly if many outliers are flagged. Case-by-case consideration is always advisable.
The algorithm expects to find some anomalies in the data, and will only infrequently return no recommendations. If your data has already been cleaned, there is a large risk that the algorithm will latch onto the most divergent segment of respondents and regard them as the outliers! Consider the diagnostics below.
To assist your due diligence, each respondent has received a score reflecting their level of suspicion.
To assign these scores, the algorithms have considered your data in multidimensional space, i.e. the relationships amongst all the answers in your questionnaire. The plot below shows only three of these dimensions (20 to 30 questions are required before an algorithm can begin to suspect anomalies). This plot is not intended to be insightful; rather, it offers an intuitive, simplified depiction of how the algorithms viewed your dataset.
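The report does not name the underlying algorithm, but for readers who want to reproduce this kind of multidimensional scoring, one widely used stand-in is scikit-learn's IsolationForest. The sketch below assigns each respondent a suspicion-style score from their numeric-coded answers; the column handling, imputation, and parameters are assumptions for illustration, not the tool's pipeline.

```python
# Sketch of multidimensional suspicion scoring with an Isolation Forest
# (an assumed stand-in; the tool's actual algorithm is not disclosed).
import pandas as pd
from sklearn.ensemble import IsolationForest

def suspicion_scores(df: pd.DataFrame, question_cols) -> pd.Series:
    """Score each respondent; higher = more anomalous."""
    # Simple median imputation so the model can handle missing answers.
    X = df[question_cols].fillna(df[question_cols].median())
    model = IsolationForest(n_estimators=200, random_state=0).fit(X)
    # score_samples() returns higher values for *normal* points,
    # so negate it to obtain a score where higher means more suspicious.
    return pd.Series(-model.score_samples(X), index=df.index, name="suspicion")
```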
It is important to appreciate that, by definition, each respondent will be anomalous for different reasons. The following plots are courtesy of Lundberg and Lee (2017), and are designed to show why each respondent is considered anomalous. A waterfall plot starts at the expected value at the bottom, and each row then shows how a given answer leads us to suspect the respondent is anomalous (orange) or real (blue). You can see how certain combinations of answers in your survey begin to raise suspicion.
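Waterfall plots of this kind can be produced with the open-source shap library, which implements Lundberg and Lee's method. The sketch below explains illustrative suspicion scores by fitting a surrogate model to them; the synthetic data, the surrogate, and all parameters are assumptions, not the tool's own pipeline.

```python
# Sketch: explaining suspicion scores with SHAP waterfall plots.
# All data and models here are illustrative stand-ins.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import IsolationForest, RandomForestRegressor

# Synthetic data: 200 respondents, 25 numeric-coded (1-5) questions.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(1, 6, size=(200, 25)),
                 columns=[f"q{i}" for i in range(1, 26)])

# Suspicion scores from an (assumed) anomaly detector, as sketched earlier.
scores = -IsolationForest(random_state=0).fit(X).score_samples(X)

# Fit a surrogate model to the scores so SHAP can attribute them to answers.
surrogate = RandomForestRegressor(random_state=0).fit(X, scores)
explanation = shap.TreeExplainer(surrogate)(X)

# Waterfall for the most suspicious respondent: the plot starts at the
# expected value and shows how each answer pushes the score toward
# anomalous or back toward typical.
shap.plots.waterfall(explanation[int(scores.argmax())])
```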