Text as data: challenges and opportunities for the social sciences

Maurício Izumi; Davi Moreira

Authors

Maurício Izumi, Fundação Getúlio Vargas
Davi Moreira, UFPE

Keywords:

Automated Content Analysis, Text Similarity, Classification Methods, Scaling Methods, Big Data

Abstract

Communication is a fundamental instrument of human relations. It is through communication, for example, that values are built, social symbols are established, traditions are passed on, debates take place, politics materializes, and political conflict is expressed. Although social scientists have studied communication for centuries, analysis of its content has always been constrained by the substantial resources needed to evaluate large collections manually. Recent technological, computational, and scientific developments have reversed this limitation, allowing the social sciences to expand their research while drastically reducing the cost of analyzing large collections. With the new methods now available, it is possible to observe behaviors that were previously unobservable, measure quantities that were previously unmeasurable, and test hypotheses that until now could not be tested. In this context, the main goal of this article is to keep Brazilian social sciences at the frontier of this process and to present the reader with an up-to-date range of the main methodologies for automated content analysis. Without exhausting its countless possibilities, this article is a guide to the innovative and stimulating research area of text as data.

Author Biographies

Maurício Izumi, Fundação Getúlio Vargas

Holds a PhD in Political Science from the Universidade de São Paulo (DCP/USP) and is a researcher at the Centro de Política e Economia do Setor Público of the Fundação Getúlio Vargas (Cepesp/FGV). The author received support from FAPESP, grant number 2018/08118-4.

Davi Moreira, UFPE

Holds a PhD in Political Science from USP and is a postdoctoral researcher at UFPE. Winner of the 2017 Prêmio Capes de Tese in the area of Political Science and International Relations. Specialist in automated content analysis, political speeches, and quantitative methods for the social sciences. Creator of the Retórica Parlamentar project, implemented by the Laboratório Hacker of the Câmara dos Deputados do Brasil.
