Text as data
challenges and opportunities for the social sciences
Keywords:
Automated Content Analysis, Text Similarity, Classification Methods, Planning Methods, Big Data
Abstract
Communication is a fundamental tool of human relations. It is through communication that values are constructed, social symbols are established, traditions are passed on, debates are held, policies are given concrete form, and political conflicts are expressed. Although the content conveyed in communication has long been a focus of sociological research, its analysis has always been limited by the large amount of research funding needed to evaluate large collections by hand. Recent technological, computational, and scientific developments have changed this restrictive scenario, allowing sociologists to analyze much larger collections of documents at low cost. Today, thanks to the development of new methods, it is possible to identify behaviors that were previously unobservable, to measure quantities that were previously unmeasurable, and to test hypotheses that were previously untestable. Accordingly, this article aims to keep the Brazilian social sciences at the frontier of this process and to introduce the reader to the most recent methodologies for automated content analysis. Without exhausting its many possibilities, the article serves as a guide to the innovative field of research on text as data.