Living with outliers:

How to detect extreme observations in data analysis

Authors

Keywords:

Outliers, Extreme cases, Atypical observations, Anomaly detection

Abstract

Data analysts often view outliers with skepticism because of their potential adverse effects: they can violate model assumptions, distort visualizations, and bias estimates. In this paper, we present a practical guide for identifying outliers, with a step-by-step description of five statistical methods for detecting extreme observations: (1) standardized scores, (2) interquartile range, (3) standardized residuals, (4) Cook’s distance, and (5) Mahalanobis distance. To support learning, we provide both raw data and R scripts so that researchers can apply these techniques to their own data. By following the procedures outlined in this paper, scholars in a variety of fields should be able to improve the quality of their data analysis.
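
As a rough illustration of the five methods named in the abstract, the sketch below computes each quantity on a small made-up dataset. It is not the paper's R code: the function names, the toy data, the choice of Python, and the restriction to one predictor and two variables are all assumptions made for brevity. "Standardized residuals" are taken here to mean internally studentized residuals, which is one common definition. Cutoffs for flagging a case vary across textbooks, so the sketch computes the diagnostics and leaves the threshold to the analyst, apart from the conventional 1.5 multiplier for the interquartile range fences.

```python
import statistics as st

def z_scores(values):
    """Standardized scores: (x - mean) / sample standard deviation."""
    mean, sd = st.mean(values), st.stdev(values)
    return [(v - mean) / sd for v in values]

def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def regression_diagnostics(x, y):
    """Fit y = a + b*x by least squares.

    Returns (standardized residuals, Cook's distances), where the residuals
    are internally studentized: e_i / sqrt(s^2 * (1 - h_i)).
    """
    n, p = len(x), 2  # p = number of estimated coefficients (intercept, slope)
    mx, my = st.mean(x), st.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)             # residual variance
    lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]     # leverage h_i
    std_res = [e / (s2 * (1 - h)) ** 0.5 for e, h in zip(resid, lev)]
    cooks = [r * r * h / (p * (1 - h)) for r, h in zip(std_res, lev)]
    return std_res, cooks

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    mx, my = st.mean(x), st.mean(y)
    sxx, syy = st.variance(x), st.variance(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (a - mx) ** 2 - 2 * sxy * (a - mx) * (b - my)
         + sxx * (b - my) ** 2) / det
        for a, b in zip(x, y)
    ]

# Hypothetical data: the last case is far from the rest on x and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 15.0]

z = z_scores(x)
low, high = iqr_fences(x)
iqr_flags = [v for v in x if v < low or v > high]
std_res, cooks = regression_diagnostics(x, y)
d2 = mahalanobis_sq(x, y)

print("largest |z| at case", max(range(len(z)), key=lambda i: abs(z[i])))
print("IQR fences:", (low, high), "flagged:", iqr_flags)
print("largest Cook's distance at case", cooks.index(max(cooks)))
print("largest Mahalanobis distance at case", d2.index(max(d2)))
```

One point the toy data makes visible: in a sample of size n, a sample-based z-score cannot exceed (n - 1) / sqrt(n), so in very small samples a fixed cutoff such as 3 may never be reached, while the interquartile range fences still flag the extreme case.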

Published

2023-11-28

How to Cite

Figueiredo Filho, D., Silva, L., Pires, A., & Malaquias, C. (2023). Living with outliers: How to detect extreme observations in data analysis. BIB - Revista Brasileira De Informação Bibliográfica Em Ciências Sociais, (99). Retrieved from https://bibanpocs.emnuvens.com.br/revista/article/view/619