Behavioral Assessments Based on Automated Text Analyses

Investigators:

Project Details

Abstract:

This project sought to extend current English-language computerized text analysis tools to languages relevant to understanding global terrorist threats. Working with native language and translation experts, the research team (i) developed versions of its Linguistic Inquiry and Word Count software (LIWC; Pennebaker, Booth, & Francis, 2007) for Farsi, Korean, Mandarin, and Russian, and (ii) explored the feasibility of adding a suite of other computational linguistics modules that automatically process content words, syntax, referential cohesion, coherence of mental models, discourse structure, and world knowledge (described below). These modules had already been developed for English in a tool called Coh-Metrix (Graesser, McNamara, Louwerse, & Cai, 2004). The next step was to explore extending them to the other languages.

These tools were then used to analyze texts produced by approximately a dozen major contemporary and historical political leaders (e.g., Mahmoud Ahmadinejad, Osama bin Laden, Fidel Castro, Winston Churchill, Saddam Hussein, Adolf Hitler, Mao Zedong, Kim Jong Il, Vladimir Putin, and one or more American political leaders). Using data mining techniques, the research team concurrently built a corpus of original-language texts for each target leader. By tracking natural language over time, the team assessed personality, changes in intent and deception, and other psychological responses to major events across formal and informal situations. The results of these analyses can inform understanding of how leaders are affected by catastrophic events involving their citizenry, and can improve understanding of the linguistic indicators evident among leaders who employ violence against their citizens as a governance tool.

The project also involved parallel laboratory experiments to determine the degree to which the language patterns observed in "real world" texts can be recreated in the laboratory. The laboratory component was critical in determining which findings are reliable and linked to the hypothesized psychological causes.

Deliverables for this project included:
1) Copies of the LIWC software and translation dictionaries;
2) A technical report of the procedures and findings;
3) Copies of reports and papers on the analyses of the texts of political leaders.

Primary Findings:

Computerized text analysis approaches were used to assess status, deception, and intent in the speeches of political and authoritarian leaders. The research team found reliable language markers of status, particularly pronouns, that accurately distinguish the lower-status from the higher-status interactant across several laboratory studies, in emails among members of a university and of a real-world company, and in the memoranda of Saddam Hussein's military and administration. These findings can be used to detect the status of group members quickly and remotely when self-reports are inaccessible.

Another promising approach for the remote detection of leadership and group dynamics was to assess cohesion. Cohesion decreased over time in speeches by Mao Zedong and Hosni Mubarak, both in their original languages (Chinese and Arabic) and in English translation, possibly indicating increased common ground about topics among citizens of a culture. Higher cohesion in messages, then, may signal that a leader is sharing new information with group members.

The team analyzed the speeches of various leaders, including President George W. Bush, Mao Zedong, Che Guevara, and Adolf Hitler, to assess language markers preceding war. It found that first-person singular pronoun use drops reliably and very strongly before a leader goes to war, suggesting that the analysis of function words can indicate intent, deception, and potential for violence. In the analysis of American presidential administrations, researchers found that language markers varied reliably according to whether a lie was associated with war, personal matters, or matters of state. These findings indicate that identifying the kind of lie can help determine which language patterns to look for when detecting deception.

A major component of the project was the development of computerized text analysis approaches, especially for the study of languages other than English. These approaches are described in the Methodology section below.

Methodology:

Although a variety of text analysis tools, algorithms, and approaches were used, the two primary tools used throughout the project were Linguistic Inquiry and Word Count (LIWC) and Coh-Metrix. LIWC is a word counting tool that efficiently counts and categorizes words into grammatical, psychological, and content categories. In this project, LIWC was used to assess leaders' language for markers of status, deception, and language style matching (to assess synchrony and engagement). Coh-Metrix is a web application that computes the narrativity, cohesion, complexity, syntactic simplicity, and concreteness of any given text. In this project, cohesion metrics in leaders' speeches were found to be associated with major economic, historical, and social events and conditions, such as growth rates of GDP per capita, social unrest, status, and duration of leadership.

A major component of the project was focused on the development of computerized text analysis approaches. During the funding period, LIWC was translated and validated in Arabic, Chinese, French, Russian, and Turkish. In addition, an automated Speech Act Classifier that assesses text for the types of statements uttered (e.g., declaration, question, command) was developed using part-of-speech taggers, lexicons, and extensive manual classification of corpora in English and in Arabic.
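
The dictionary-based word counting that LIWC performs can be illustrated with a minimal sketch. This is not the LIWC software or its dictionaries: the category names, the word lists, and the functions `tokenize` and `category_rates` are hypothetical stand-ins chosen for illustration, and the simple regular-expression tokenizer is an assumption suited only to English text. The sketch reports each category as a percentage of all words in a text, which is the kind of measure one would track when looking at, for example, first-person singular pronoun use over time.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary for illustration only; the actual LIWC
# dictionaries are far larger and are not reproduced here.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "first_person_plural": {"we", "us", "our", "ours", "ourselves"},
    "negation": {"no", "not", "never"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_rates(text, categories=CATEGORIES):
    """Return each category's share of all tokens, as a percentage."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    return {name: 100.0 * counts[name] / total for name in categories}


if __name__ == "__main__":
    sample = "We will not stop. I know we can."
    for name, rate in category_rates(sample).items():
        print(f"{name}: {rate:.1f}%")
```

Comparing such rates across a leader's texts ordered by date is one simple way to look for shifts in function-word use; a real analysis would need validated dictionaries and language-appropriate tokenization.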