Living with outliers: How to detect extreme observations in data analysis

Dalson; Lucas Silva; Antônio Pires; Caio Malaquias

Authors

Dalson, Universidade Federal de Pernambuco, https://orcid.org/0000-0001-6982-2262
Lucas Silva, Universidade Estadual de Ciências de Saúde de Alagoas, https://orcid.org/0000-0002-5013-6278
Antônio Pires, Universidade Federal de Pernambuco, https://orcid.org/0000-0001-5468-3407
Caio Malaquias, Universidade Federal de Pernambuco, https://orcid.org/0000-0003-3189-2024

Keywords:

outliers, extreme cases, atypical observations, outlier detection, outlier treatment

Abstract

This paper provides a practical guide to identifying outliers. We outline five statistical methods for detecting extreme observations: (1) standardized scores, (2) the interquartile range, (3) standardized residuals, (4) Cook's distance, and (5) the Mahalanobis distance. We share both the raw data and the R scripts so that researchers can apply these techniques to their own data. Data analysts often view outliers with suspicion because they can violate model assumptions, distort visualizations, bias estimates, and even reverse the sign of coefficients. By following the procedures outlined here, scholars in a variety of fields can improve the quality of their data analysis.
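To make the five methods concrete, the following is a minimal sketch in Python using only the standard library. It is not the authors' R code. The toy data, the function names (`z_scores`, `iqr_flags`, `simple_ols_diagnostics`, `mahalanobis_sq_2d`), and the fence multiplier `k=1.5` are assumptions made for illustration, and the regression diagnostics are restricted to a single predictor so that leverage has a closed form.

```python
import math
import statistics as st

def z_scores(x):
    """Standardized scores: (value - mean) / sample standard deviation."""
    m, s = st.mean(x), st.stdev(x)
    return [(v - m) / s for v in x]

def iqr_flags(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = st.quantiles(x, n=4, method="inclusive")
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < lo or v > hi for v in x]

def simple_ols_diagnostics(x, y):
    """Standardized residuals and Cook's distance for the fit y = a + b*x."""
    n, p = len(x), 2  # p = number of estimated coefficients (intercept, slope)
    mx, my = st.mean(x), st.mean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage (hat value) of each point in a one-predictor regression.
    lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    std_res = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]
    cooks = [r * r * h / (p * (1 - h)) for r, h in zip(std_res, lev)]
    return std_res, cooks

def mahalanobis_sq_2d(x, y):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    mx, my = st.mean(x), st.mean(y)
    sxx, syy = st.variance(x), st.variance(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 sample covariance matrix
    return [
        (syy * (a - mx) ** 2 - 2 * sxy * (a - mx) * (b - my) + sxx * (b - my) ** 2) / det
        for a, b in zip(x, y)
    ]

# Invented toy data: the last observation is deliberately extreme.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 40.0]

z = z_scores(y)
flags = iqr_flags(y)
std_res, cooks = simple_ols_diagnostics(x, y)
d2 = mahalanobis_sq_2d(x, y)

for i in range(len(x)):
    print(i, round(z[i], 2), flags[i], round(std_res[i], 2),
          round(cooks[i], 2), round(d2[i], 2))
```

In this toy sample every method singles out the last observation, but the cutoffs an analyst applies to each statistic (for example, how large a standardized score or Cook's distance must be) are a separate judgment that this sketch leaves open.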


