The Discourse Attributes Analysis Program (DAAP)


Bernard Maskit and Sean Murphy

The Discourse Attributes Analysis Program (DAAP) is a computer-based text analysis system designed by Bernard Maskit. Its features include the use of both weighted and unweighted dictionaries, and its output includes measures based on an exponential smoothing operator. The current version is DAAP10.1.4.py, a command-line program written in Python. The DAAP software was originally designed to provide computerized measurements related to Bucci's theory of the referential process; it continues to be used for this purpose and has also been applied in a wider range of psychological and linguistic studies.

DAAP Features: The main features of DAAP10.1.4.py include the following. DAAP reads a text and compares each word with a set of dictionaries, some of which may be weighted, and then follows two tracks of data analysis. The Word or RAW track is fairly standard for a quantitative text analysis system; it is based on proportions of dictionary matches. The Smooth track is based on the production of a function that tracks the local relative usage of words from each dictionary for each turn of speech; the graph of this function is visually smooth so that, for each dictionary, one can see how the psychological variable underlying the dictionary is changing. These smooth dictionary functions are then used to compute several special measures, such as the Mean High WRAD and the covariations, which are measures of the extent to which two dictionary measures are simultaneously high and low.

Other DAAP10 features include (a) an automatic disambiguator (for the words 'kind', 'know', 'like', 'mean' and 'well', which are sometimes disfluencies and sometimes content words); (b) automatic treatment of one- and two-word turns of speech (separating substantive from non-substantive speech); (c) special classification indicators that identify and aggregate related data, such as responses to a particular question in a set of interviews; (d) classification indicators that define a new measure, for example by inserting markers in the text for usage of past vs. present tense, from which DAAP10 then produces a smooth curve showing this variation.

DAAP Operating Instructions:  

The operating instructions for DAAP10.1.4.py, the word-count-based DAAP program currently in use, along with transcribing and formatting instructions for transcribers, a guide for making new dictionaries, and an overview of the DAAP10 outputs, can be found in:


A similar set of instructions for the previous version of DAAP (DAAP09.6) can be found in:

Maskit, B. (2014). The Discourse Attributes Analysis Program (DAAP) Operating Instructions. figshare. http://dx.doi.org/10.6084/m9.figshare.947740

DAAP Math: A mathematical description of the Discourse Attributes Analysis Program (DAAP), including DAAP's smoothing operator, Mean High WRAD (MHW), High WRAD Proportion (HWP) and covariations between measures, is available in the document:

Maskit, B. (2014). DAAPMath. figshare. http://dx.doi.org/10.6084/m9.figshare.928469

DAAP using Time as an Independent Variable: Technical Aspects:  A mathematical description of the workings of the Variable Time DAAP program that is used to align text with vocal characteristics is available in the document:

Maskit, B. (2014). DAAP using Time as Independent Variable: Technical Aspects. figshare. http://dx.doi.org/10.6084/m9.figshare.947741

DAAP Time Marker Instructions: Instructions for putting time markers into existing transcripts that have been transcribed for the Discourse Attributes Analysis Program (DAAP), for use with Variable Time DAAP, are available in the document:

Maskit, B. & Jenness, K. (2014). DAAP Time Marker Instructions. figshare. http://dx.doi.org/10.6084/m9.figshare.961799

History:

The first DAAP programs were produced in 2002; these were HDAAP, written in C++ by Bernard Maskit, and BADAAP, written in Visual Basic by Andres Roussos. BADAAP continues to be used in Argentina with the Spanish version of WRAD, developed by Roussos. The HDAAP series gave way in 2005 to the DAAP06 - DAAP09 series, written in Perl by Maskit. Starting in 2016, these are being replaced by the DAAP10 series, written in Python. The versions of DAAP currently in use or under development are as follows.

DAAP10.1.4.py, the basic version of DAAP currently in use, is based on word count.

The TDAAP series uses time rather than word count as the independent variable. This is accomplished by having time markers inserted in the transcript at fixed intervals, such as every 2 or 3 seconds. The current working version is TDAAP.03.1.pl; the Python version, TDAAP10.1.py, is under development.

The VTDAAP series uses variable time markers inserted in the transcript using the CLAN software. It also integrates vocal analysis data obtained from PRAAT with the linguistic DAAP output. The current working version is VTDAAP02.26.pl; the Python version, VTDAAP10.1.py, is under development.

The IDAAP series (current version: IDAAP03.2) is similar to DAAP03.4; it is designed for use with languages such as Italian, whose alphabets include accented letters. There are currently Italian versions of the WRAD, Reflection and Disfluency dictionaries; see http://sites.google.com/sites/italiandaap for further details.

Future Developments: 

We are in the process of developing a system that uses automatic transcription. In broad outline, we expect to develop a reasonably non-intrusive recording system for therapy, so that at the end of the session the therapist can push a button on the computer and have the session automatically transcribed and run through a DAAP program. This would produce statistical language data for the session, as well as smooth curves showing the rise and fall of the dictionaries associated with the referential process.

DAAP Basic Functions:

In common with most text analysis systems, such as the Linguistic Inquiry and Word Count program (LIWC), the General Inquirer, PCAD, and others, DAAP starts by reading each word of a text and comparing it with one or more dictionaries, some of which might be weighted, and (for DAAP09.6 only) some of which might use word stems or lemmas. Beyond this, several features are unique to DAAP. It constructs a smoothly varying function for each dictionary. This smooth function is used to create new measures that can be used to divide text into segments for which the function is relatively high or low; it can also be used to calculate the covariation between pairs of functions and to measure the degree of variation within each function. The basic DAAP program constructs these functions using word count as the independent variable; there are also programs in the DAAP family that use time as the independent variable. Each of these unique features is explained more fully below. More detailed information concerning the dictionary measures can be found in the DAAP Dictionary Definitions.

Weighted Dictionaries: The first difference between DAAP and other text analysis programs is its ability to deal with weighted dictionaries. Most text analysis programs match words in a sample to a dictionary, count the number of occurrences and divide by the total word count to calculate a proportion. While DAAP does perform these calculations for unweighted dictionaries, it also has the capacity to calculate a score based on a weight assigned to each word in a dictionary. In many circumstances, the use of a weighted dictionary can greatly increase its predictive capacity, since values and degrees can be assigned to a given word, as opposed to mere membership in a category where all matching words contribute equally to the score generated. We currently use only one such dictionary, the Weighted Referential Activity Dictionary (WRAD). However, a Weighted Reflection-Reorganizing Dictionary (WRR) is currently being tested, and a Weighted Arousal Dictionary (WAD) is being planned.
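The difference between the two kinds of score can be illustrated with a minimal Python sketch. It is not taken from the DAAP source: the word lists and weights are hypothetical toy values, and the choice to score unmatched words at a neutral value of .5 is an assumption made for illustration.

```python
import re

# Hypothetical toy dictionaries for illustration; these are not actual DAAP entries.
DISFLUENCY = {"uh", "um"}                               # unweighted: membership only
WEIGHTED = {"the": 0.75, "and": 0.625, "think": 0.25}   # weighted: word -> weight
NEUTRAL = 0.5  # assumption: words not in the weighted dictionary score the neutral value


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def proportion(words, dictionary):
    """Unweighted score: number of dictionary matches divided by total word count."""
    return sum(w in dictionary for w in words) / len(words) if words else 0.0


def weighted_score(words, weights, neutral=NEUTRAL):
    """Weighted score: mean per-word weight, with unmatched words scored as neutral."""
    if not words:
        return neutral
    return sum(weights.get(w, neutral) for w in words) / len(words)


words = tokenize("Um, I think the dog and the cat, uh, ran")
df_proportion = proportion(words, DISFLUENCY)
wrad_like_score = weighted_score(words, WEIGHTED)
print(len(words), df_proportion, wrad_like_score)
```

In the unweighted case every match contributes equally, whereas in the weighted case a word such as "the" (weight .75 here) pulls the score up and a word such as "think" (weight .25 here) pulls it down.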

Smoothing: The basic idea behind the smoothing process is the use of a moving weighted average. For example, the value of the WRAD graph shown below at word number 2400, which is approximately .42, is obtained by taking an average of the WRAD scores at words 2300 through 2500. This average is weighted by an exponential function that looks something like a normal curve. As a result, the weight given to the WRAD value at word 2400 is much greater than 1/200, the weight it would have in an ordinary, unweighted average, while the weight given to the WRAD value at word 2310 is very small, far smaller than 1/200. This smoothing is computed separately for each turn of speech, so that, at each word (point on the x-axis), the value of the smooth curve depends entirely on the speaker of that word.
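A moving weighted average of this kind can be sketched in Python as follows. The weighting function used here is a hypothetical bell-shaped "bump" that vanishes 100 words from the center; it is an assumption chosen for illustration, and the function DAAP actually uses is the one given in DAAPMath. The names bump_weights, smoothed_value and the sharpness parameter are likewise illustrative only.

```python
import math


def bump_weights(m=100, sharpness=4.0):
    """Normalized weights for offsets -(m-1)..(m-1); offsets of m or more get no weight.

    Hypothetical kernel: bell-shaped near 0 and falling to zero at distance m.
    """
    raw = {d: math.exp(-sharpness * d * d / (m * m - d * d))
           for d in range(-m + 1, m)}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}


def smoothed_value(scores, center, weights):
    """Weighted average around `center`; assumes the whole window lies inside one turn."""
    return sum(w * scores[center + d] for d, w in weights.items())


weights = bump_weights()
uniform = 1 / 200  # weight each word would get in an ordinary, unweighted average

# The central word counts for more than 1/200; a word 90 positions away for far less.
print(weights[0] > uniform, weights[90] < uniform)

# Toy per-word scores: neutral (.5) except for a stretch of higher values.
scores = [0.5] * 200 + [0.8] * 100 + [0.5] * 200
value = smoothed_value(scores, 250, weights)
print(round(value, 3))
```

Because the weights sum to 1, smoothing a constant stretch of text returns that same constant; only local changes in dictionary usage move the curve.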

There is also a wrap-around procedure used near the beginning and end of each turn of speech so that the mean value is preserved. For example, the first turn of speech on the graph below runs from word number 2000 to approximately word number 2880. If we compute the average value of the smooth WRAD curve (shown in green) over this turn, we obtain the same value as the actual average of the WRAD dictionary values of these approximately 880 words.
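One way to obtain this mean-preserving behavior is sketched below in Python. It is an assumption, not the documented DAAP procedure: the part of the weighting window that would fall outside the turn is folded back into the turn by mirroring at its boundaries, and the kernel is a hypothetical symmetric bump. With a symmetric, normalized kernel, this folding leaves the turn's average unchanged. The exact procedure DAAP uses is described in DAAPMath.

```python
import math


def bump_weights(m, sharpness=4.0):
    """Normalized, symmetric weights for offsets -(m-1)..(m-1) (hypothetical kernel)."""
    raw = {d: math.exp(-sharpness * d * d / (m * m - d * d))
           for d in range(-m + 1, m)}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}


def reflect(i, n):
    """Fold index i back into 0..n-1 by mirroring at the turn boundaries."""
    period = 2 * n
    i %= period
    return i if i < n else period - 1 - i


def smooth_turn(scores, m=100):
    """Smooth one turn of speech; weight falling outside the turn is folded back in."""
    n = len(scores)
    weights = bump_weights(m)
    return [sum(w * scores[reflect(i + d, n)] for d, w in weights.items())
            for i in range(n)]


# Toy turn of 90 words; each turn of speech would be smoothed separately like this.
turn = [0.5] * 40 + [0.9] * 20 + [0.3] * 30
smooth = smooth_turn(turn, m=25)

raw_mean = sum(turn) / len(turn)
smooth_mean = sum(smooth) / len(smooth)
print(abs(raw_mean - smooth_mean) < 1e-9)
```

The repeated mirroring in reflect also covers turns that are shorter than the smoothing window.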

The graph below gives an example of the DAAP smoothing. It is from a session of psychoanalysis. The graph shows curves for Disfluency (DF) in grey, Reflection (REF) in blue, and WRAD in green, for the patient (S 1), and WRAD in the heavier green line for the analyst (S 2). DF and REF are unweighted dictionaries, so a value of .1 means that, near that word, about 10% of the spoken words are matches; for DF, about 10% of the spoken words constitute disfluencies. For the WRAD, which is a weighted dictionary with a neutral value of .5, the extent to which the green curve lies above .5 is an indication of the extent to which the speaker is engaged in referential activity; the extent to which the curve lies below .5 is an indication of the extent to which the speaker is doing something else. For more information about smoothing, please see Maskit, B. (2014). DAAPMath. figshare. http://dx.doi.org/10.6084/m9.figshare.928469.


Covariations Between Pairs of Measures: As a function of smoothing measures over the course of a narrative, DAAP is able to calculate indexes of the extent to which any pair of measures tends to move together or move in opposite directions. This adds an entirely new dimension to text analysis and allows us to measure, for example, the extent to which negative affect word use (sad, cry, etc.) goes together with disfluent language (uh, um, well). The results of studying these covariations across a number of studies have shown that WRAD, or the extent to which a speaker is telling a story, tends to move in the opposite direction from Reflection (Ref). Reflection (Ref) words are those that refer directly to cognitive processes, e.g., remember, know, think. These two measures are the main markers of the symbolizing phase (WRAD) and reorganizing phase (Ref) of Bucci's referential process. The consistency of this pattern across studies is striking and is likely illustrative of different types of language use associated with different underlying psychological processes (for more information on this please read the Weighted Referential Activity Dictionary Page). DAAP automatically calculates these covariations as well as covariations for every pair of measures that are run on a given text sample.
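As a rough illustration, an index of this kind can be sketched in Python as a correlation between two smooth curves sampled at each word. Using the Pearson correlation here is an assumption made for illustration; DAAP's own definition of covariation is given in DAAPMath and may differ in how the curves are centered and normalized. The toy curves are invented and are not real session data.

```python
import math


def covariation(f, g):
    """Pearson-style index in [-1, 1] for two equal-length smooth curves.

    Positive when the curves tend to be high and low together,
    negative when one tends to be high while the other is low.
    """
    if len(f) != len(g) or not f:
        raise ValueError("curves must be non-empty and of equal length")
    mean_f = sum(f) / len(f)
    mean_g = sum(g) / len(g)
    num = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g))
    den = math.sqrt(sum((a - mean_f) ** 2 for a in f) *
                    sum((b - mean_g) ** 2 for b in g))
    return num / den if den else 0.0


# Invented toy curves: a WRAD-like curve around .5 and a Reflection-like
# proportion that falls whenever the WRAD-like curve rises.
n = 200
wrad_curve = [0.5 + 0.1 * math.sin(2 * math.pi * i / 50) for i in range(n)]
ref_curve = [0.05 - 0.03 * math.sin(2 * math.pi * i / 50) for i in range(n)]

print(round(covariation(wrad_curve, ref_curve), 3))
```

A value near -1 corresponds to the pattern described above, in which two measures move in opposite directions, and a value near +1 to two measures that rise and fall together.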

Variation of word use within a text: DAAP produces an index of the extent to which the proportion of dictionary matches varies within a text segment. For example, one speaker might use a high proportion of positive affect words in the middle of a segment or turn of speech, but not during the beginning and end of that segment, whereas another speaker might use a moderate proportion of positive affect words throughout the segment. The overall proportion of positive affect word use would come out the same for both speakers, but a glance at the variance measures would show a high degree of variance for the first speaker and very little for the second. Though these variance measures have not yet been widely studied, they may be promising linguistic variables in their own right. Such variables may aid in the automatic identification of different types of passages or even psychological aspects of different speakers.
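The two-speaker example can be made concrete with a short Python sketch. The curves are invented, and the use of the population variance of the per-word curve as the index is an assumption for illustration, not necessarily the exact index DAAP reports.

```python
import statistics


def variation_index(curve):
    """Population variance of a per-word proportion curve within one segment."""
    return statistics.pvariance(curve)


# Invented local proportions of positive affect words over a 90-word segment.
peaked = [0.0] * 30 + [0.3] * 30 + [0.0] * 30   # high in the middle only
steady = [0.1] * 90                              # moderate throughout

# Both speakers have the same overall proportion...
print(round(statistics.mean(peaked), 3), round(statistics.mean(steady), 3))
# ...but the variation index separates them.
print(variation_index(peaked) > variation_index(steady))
```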