Living with outliers
How to detect extreme observations in data analysis
Keywords:
outliers, extreme cases, atypical observations, outlier detection, outlier treatment
Abstract
This paper provides a practical guide to identifying outliers. We outline five statistical methods designed to spot extreme observations: (1) standardized scores, (2) the interquartile range, (3) standardized residuals, (4) Cook's distance, and (5) the Mahalanobis distance. To support learning, we share both the raw data and the R scripts so that researchers can apply these techniques to their own data. Data analysts often view outliers with skepticism because of their potential adverse effects, such as violating model assumptions, distorting visualizations, biasing estimates, and changing the direction of coefficients. By following the procedures outlined in this paper, scholars in a variety of fields can substantially improve the quality of their data analysis.
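As a rough illustration of the first two methods in the list above, the following minimal sketch flags values by standardized score and by the interquartile-range rule. It is not the paper's R script. The function names, the sample data, and the cutoffs (|z| > 3 and 1.5 times the IQR) are assumptions chosen for illustration, and the quartile convention used here may differ from the one in the authors' code.

```python
from statistics import mean, stdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute standardized score exceeds `threshold`.

    The threshold of 3 is a common convention, assumed here.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs((v - m) / s) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the usual boxplot convention, assumed here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical data: nine ordinary values and one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print(zscore_outliers(data))  # []
print(iqr_outliers(data))     # [95]
```

In this small sample the extreme value inflates the mean and standard deviation, so its own standardized score stays below 3 and the first rule misses it, while the quartile-based rule flags it. This is one reason the two checks are often run side by side.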