Text as data
challenges and opportunities for the social sciences
Keywords:
Automated Content Analysis, Text Similarity, Classification Methods, Scaling Methods, Big Data
Abstract
Communication is a fundamental instrument of human relations. It is through communication, for example, that values are built, social symbols are established, traditions are passed on, debates take place, politics materializes, and political conflict is expressed. Although social scientists have studied it for centuries, the analysis of the content conveyed in communication has always been constrained by the substantial resources required to evaluate large collections manually. Recent technological, computational, and scientific developments reverse this limitation: they allow the social sciences to expand their research while drastically reducing the cost of analyzing large collections. With the new methods now available, it is possible to observe behaviors that were previously unobservable, measure quantities that were previously unmeasurable, and test hypotheses that until now could not be tested. In this context, the main goal of this article is to keep Brazilian social sciences at the frontier of this process and to present the reader with an up-to-date range of the main methodologies for automated content analysis. Without exhausting their countless possibilities, this article is a guide to the innovative and stimulating research area of text as data.