The text as data
challenges and opportunities for Social Sciences
Keywords:
Automated Content Analysis, Similarity between Texts, Classification Methods, Scaling Methods, Big Data
Abstract
Communication is a fundamental tool of human relations. It is through communication that values are constructed, social symbols are established, traditions are passed on, debates are held, politics is materialized, and political conflict is expressed. Although the content transmitted in communication has long been a focus of social scientists, its analysis has always been constrained by the large amount of research funding needed to assess large collections manually. Recent technological, computational, and scientific developments have changed this scenario, allowing social scientists to analyze larger collections of documents at low cost. Through the development of new methods, it is now possible to identify behaviors that could not previously be observed, to measure quantities that could not be quantified, and to test hypotheses that could not be tested. The main objective of this study is therefore to keep Brazilian Social Sciences at the frontier of this process by presenting the latest methodologies for automated content analysis. Without exhausting the many possibilities, this article is a guide to the innovative field of research on text as data.