The Discourse Attributes Analysis Program (DAAP)

Bernard Maskit and Sean Murphy

The Discourse Attributes Analysis Program (DAAP) is a computer-based text analysis system designed by Bernard Maskit. Its features include the use of both weighted and unweighted dictionaries, and its output includes measures based on an exponential smoothing operator. The software was originally designed to provide computerized measurements related to Bucci's theory of the referential process; it has also been applied in a wider range of psychological and linguistic studies.

DAAP Operating Instructions: An overview of operating instructions, including a guide for transcribers, a guide to making new dictionaries, and an overview of DAAP outputs, is available in the document:

Maskit, B. (2014). The Discourse Attributes Analysis Program (DAAP) Operating Instructions. figshare.

DAAP Math: A mathematical description of the Discourse Attributes Analysis Program (DAAP), including DAAP's smoothing operator, Mean High WRAD (MHW), High WRAD Proportion (HWP), and covariations between measures, is available in the document:

Maskit, B. (2014). DAAPMath. figshare.

DAAP using Time as an Independent Variable: Technical Aspects: A mathematical description of the workings of the Variable Time DAAP program, which is used to align text with vocal characteristics, is available in the document:

Maskit, B. (2014). DAAP using Time as Independent Variable: Technical Aspects. figshare.

DAAP Time Marker Instructions: Instructions for putting time markers into existing transcripts that were transcribed for the Discourse Attributes Analysis Program (DAAP), for use with Variable Time DAAP, are available in the document:

Maskit, B. & Jenness, K. (2014). DAAP Time Marker Instructions. figshare.


The first DAAP programs were produced in 2002; these were HDAAP, written in C++ by Bernard Maskit, and BADAAP, written in Visual Basic by Andres Roussos. BADAAP continues to be used in Buenos Aires with the Spanish version of WRAD, developed by Roussos. The HDAAP series gave way in 2005 to the DAAP series, written in Perl by Maskit. The versions of DAAP currently in use or under development are as follows.

The DAAP09 series (current versions: DAAP09.6 for multiple speakers and SimDAAP09 for just one speaker) comprises the direct descendants of the original DAAP. Their basic features, including the computer-based disambiguator, transcription rules, and a summary of the DAAP09 output, are described below.

The TDAAP series (current version: TDAAP.03.1) uses time rather than word count as the independent variable. This is accomplished by having time markers inserted in the transcript every 2 seconds.

The VTDAAP series (current version: VTDAAP02.26) uses variable time markers inserted in the transcript using the CLAN software. It also integrates vocal analysis data obtained from PRAAT with the linguistic DAAP output.

The IDAAP series (current version: IDAAP03.2) is similar to DAAP03.4; it is designed for use with languages such as Italian, whose alphabets include accented letters. There are currently Italian versions of the WRAD, Reflection and Disfluency dictionaries; See for further details.

The DAAP04 series (currently under development) is for interactive dialog involving many short turns of speech. It produces simultaneous linguistic variable curves for both speakers.

The DAAP05 series (currently under development) has new versions of the covariation function.

DAAP Basic Functions:

In common with most text analysis systems, such as the Linguistic Inquiry and Word Count program (LIWC), the General Inquirer, PCAD, and others, DAAP starts by reading each word of a text and comparing it with one or more dictionaries, some of which might be weighted, and some of which might use word stems or lemmas. Beyond this, several features are unique to DAAP. The central one is the construction of a smoothly varying function for each dictionary. This smooth function is used to create new measures that divide a text into segments in which the function is relatively high or low; it is also used to calculate the covariation between pairs of functions and to measure the degree of variation within each function. The basic DAAP program constructs these functions using word count as the independent variable; there are also programs in the DAAP family that use time as the independent variable. Each of these features is explained more fully below. For more detailed information concerning the DAAP measures, please read the DAAP Dictionary Definitions.

Weighted Dictionaries: The first, and perhaps most important, difference between DAAP and other text analysis programs is its ability to deal with weighted dictionaries. Most text analysis programs match words in a sample to a dictionary, count the number of occurrences, and divide by the total word count to calculate a proportion. DAAP performs this calculation for unweighted dictionaries, but it can also calculate a score based on a weight assigned to each word in a dictionary. In many circumstances, weighting can greatly increase a dictionary's predictive capacity, since values and degrees can be assigned to each word, as opposed to mere membership in a category where all matching words contribute equally to the score. There is currently only one such dictionary in existence, the Weighted Referential Activity Dictionary (WRAD); however, several other dictionaries of this type are being developed. These include the dictionaries of Warriner et al. using word stems for valence and intensity.
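The difference between the two kinds of scoring can be sketched in a few lines of Python. This is an illustration only, not DAAP's code: the function names, the toy word lists, and the weights are hypothetical, and scoring unmatched words at the neutral value of .5 is an assumption made for the example.

```python
def unweighted_proportion(words, dictionary):
    """Fraction of words in the sample that appear in an unweighted dictionary."""
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() in dictionary)
    return matches / len(words)


def weighted_score(words, weights, neutral=0.5):
    """Mean per-word weight; unmatched words get the neutral value (assumption)."""
    if not words:
        return neutral
    total = sum(weights.get(w.lower(), neutral) for w in words)
    return total / len(words)


# Hypothetical toy dictionaries, invented for this example only.
DISFLUENCY = {"um", "uh"}
TOY_WEIGHTS = {"saw": 0.9, "dog": 0.8, "think": 0.2}

sample = "I um saw the um dog".split()
print(unweighted_proportion(sample, DISFLUENCY))  # 2 of the 6 words match
print(weighted_score(sample, TOY_WEIGHTS))        # mean of 0.5, 0.5, 0.9, 0.5, 0.5, 0.8
```

In the unweighted case every match counts the same; in the weighted case "saw" and "dog" pull the score above the neutral value by different amounts.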

Smoothing: The basic idea behind the smoothing process is the use of a moving weighted average. For example, the value of the WRAD graph shown below at word number 2400, which is approximately .42, is obtained by taking an average of the WRAD scores at words 2300 through 2500. This average is weighted by an exponential function that looks something like a normal curve. As a result, the weight given to the WRAD value at word 2400 is much greater than 1/200, which is the weight it would have in an ordinary, unweighted average, while the weight given to the WRAD value at word 2310 is very small, far smaller than 1/200. This smoothing is computed separately for each turn of speech, so that, at each word (point on the x-axis), the value of the smooth curve depends entirely on the speaker of that word.
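A minimal Python sketch of a moving weighted average of this kind follows. It is not DAAP's implementation: the exact weighting function is given in the DAAPMath document, and the bell-shaped bump function used here is only a stand-in that reproduces the qualitative pattern (large weight at the center, near-zero weight toward the edge of the window). The function names, the `half_width` parameter, and the simple renormalization at the ends of a turn are all assumptions of the example.

```python
import math


def bump_weights(half_width=100):
    """Bell-shaped weights for offsets -(half_width-1)..(half_width-1), summing to 1.

    Stand-in kernel (assumption): exp(-1 / (1 - t^2)) with t = offset / half_width.
    """
    raw = []
    for offset in range(-half_width + 1, half_width):
        t = offset / half_width
        raw.append(math.exp(-1.0 / (1.0 - t * t)))
    total = sum(raw)
    return [r / total for r in raw]


def smooth_turn(values, half_width=100):
    """Moving weighted average over one turn of speech.

    Near the ends of the turn the window is cut off and the remaining
    weights are renormalized (a simplification made for this sketch).
    """
    weights = bump_weights(half_width)
    n = len(values)
    smoothed = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for k, wk in enumerate(weights):
            j = i + k - (half_width - 1)  # index of the word being averaged in
            if 0 <= j < n:
                num += wk * values[j]
                den += wk
        smoothed.append(num / den)
    return smoothed


# A single high-scoring word in an otherwise neutral turn becomes a smooth bump.
turn = [0.5] * 41
turn[20] = 1.0
curve = smooth_turn(turn, half_width=10)
print(round(max(curve), 3), curve.index(max(curve)))
```

With `half_width=100`, the center weight of this stand-in kernel is larger than 1/200 and the weight at an offset of 90 words is far smaller, in line with the description above. To smooth a whole transcript, `smooth_turn` would be called once per turn of speech.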

There is also a technical procedure used near the beginning and end of each turn of speech so that the mean value is preserved. That is, if we look at the first turn of speech on the graph below, which runs from the beginning, at word number 2000, to approximately word number 2880, and compute the average value of the smooth WRAD curve (shown in green), we will obtain the same average as the actual average of the dictionary values of these approximately 880 words.
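One way to obtain a mean-preserving smoother is sketched below in Python. It is offered as an illustration of the property, not as the procedure DAAP uses (that procedure is described in DAAPMath). Here each word's value is spread over its neighbors with a normalized kernel, and any weight that would fall outside the turn is reflected back inside, so the total, and hence the mean, is unchanged. The kernel, the function names, and the reflection scheme are assumptions of the example.

```python
import math


def kernel(half_width):
    """Symmetric stand-in weights (assumption), normalized to sum to 1."""
    raw = [math.exp(-1.0 / (1.0 - (d / half_width) ** 2))
           for d in range(-half_width + 1, half_width)]
    total = sum(raw)
    return [r / total for r in raw]


def reflect(p, n):
    """Fold any integer position p back into the range 0..n-1 by mirroring."""
    p %= 2 * n
    return p if p < n else 2 * n - 1 - p


def smooth_turn_mean_preserving(values, half_width=100):
    """Spread each word's value over its neighbors; reflect overflow at the turn ends."""
    weights = kernel(half_width)
    n = len(values)
    smoothed = [0.0] * n
    for j, v in enumerate(values):
        for k, wk in enumerate(weights):
            i = reflect(j + k - (half_width - 1), n)
            smoothed[i] += wk * v
    return smoothed


# Hypothetical per-word scores for one short turn.
turn = [0.5, 0.9, 0.5, 0.2, 0.5, 0.5, 0.8, 0.5, 0.5, 0.1, 0.5, 0.5]
curve = smooth_turn_mean_preserving(turn, half_width=4)
print(sum(turn) / len(turn), sum(curve) / len(curve))  # the two means agree up to rounding
```

Because every word distributes a total weight of exactly 1, the sum of the smooth curve over the turn equals the sum of the raw scores, whatever the length of the turn.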

The graph below gives an example of the DAAP smoothing. It is from a session of psychoanalysis. The graph shows curves for Disfluency (DF) in grey, Reflection (REF) in blue, and WRAD in green for the patient (S1), and WRAD in the heavier green line for the analyst (S2). DF and REF are unweighted dictionaries, so a DF value of .1 means that, near that word, about 10% of the spoken words constitute disfluencies. For the WRAD, which is a weighted dictionary with a neutral value of .5, the extent to which the green curve lies above .5 is an indication of the extent to which the speaker is engaged in referential activity; the extent to which the curve lies below .5 is an indication of the extent to which the speaker is doing something else. For more information about smoothing, please see Maskit, B. (2014). DAAPMath. figshare.

Covariations Between Pairs of Measures: Because measures are smoothed over the course of a narrative, DAAP is able to calculate indexes of the extent to which any pair of measures tends to move together or in opposite directions. This adds an entirely new dimension to text analysis and allows us to measure, for example, the extent to which negative affect word use (sad, cry, etc.) goes together with disfluent language (uh, um, well). Results from a number of studies have shown that WRAD, or the extent to which a speaker is telling a story, tends to move in the opposite direction from Reflection (Ref). Reflection (Ref) words are those that refer directly to cognitive processes, e.g., remember, know, think. These two measures are the main markers of the symbolizing phase (WRAD) and reorganizing phase (Ref) of Bucci's referential process. The consistency of this pattern across studies is striking and is likely illustrative of different types of language use associated with different underlying psychological processes (for more information on this, please read the Weighted Referential Activity Dictionary Page). DAAP automatically calculates these covariations, as well as covariations for every other pair of measures that are run on a given text sample.
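As a rough illustration of what such an index captures, the Python sketch below computes a Pearson-style coefficient between two smoothed curves sampled at the same words. DAAP's actual covariation formula is defined in DAAPMath and may differ; the function name and the curve values here are hypothetical.

```python
import math


def covariation(f, g):
    """Pearson-style coefficient in [-1, 1] between two equally long curves.

    Positive values: the curves tend to rise and fall together.
    Negative values: one tends to rise where the other falls.
    """
    if len(f) != len(g) or len(f) < 2:
        raise ValueError("curves must have the same length (at least 2 points)")
    n = len(f)
    mean_f = sum(f) / n
    mean_g = sum(g) / n
    s_fg = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g))
    s_ff = sum((a - mean_f) ** 2 for a in f)
    s_gg = sum((b - mean_g) ** 2 for b in g)
    if s_ff == 0 or s_gg == 0:
        raise ValueError("a completely flat curve has no defined covariation")
    return s_fg / math.sqrt(s_ff * s_gg)


# Hypothetical smoothed values: WRAD rises where Reflection falls.
wrad = [0.40, 0.45, 0.55, 0.60, 0.50]
ref = [0.08, 0.07, 0.03, 0.02, 0.05]
print(covariation(wrad, ref))  # negative: the two curves move in opposite directions
```

Since neighboring points of a smoothed curve are not independent, a coefficient of this kind is best read as a descriptive index rather than as a correlation with an attached significance test.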

Variation of word use within a text: DAAP produces an index of the extent to which the proportion of dictionary matches varies within a text segment. For example, one speaker might use a high proportion of positive affect words in the middle of a segment or turn of speech, but not during the beginning and end of that segment, whereas another speaker might use a moderate proportion of positive affect words throughout the segment. The overall proportion of positive affect word use would come out the same for both speakers, but a glance at the variance measures would show a high degree of variance for the first speaker and very little for the second. Though these variance measures have not yet been widely studied, they may be promising linguistic variables in their own right. Such variables may aid in the automatic identification of different types of passages or even psychological aspects of different speakers.
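The two-speaker example can be made concrete with a short Python sketch. The curves are hypothetical smoothed proportions, and the plain population variance is used here as a stand-in for DAAP's variation index, whose exact definition is not given in this overview.

```python
def mean_and_variance(curve):
    """Mean and population variance of a curve of smoothed proportions."""
    n = len(curve)
    mean = sum(curve) / n
    variance = sum((x - mean) ** 2 for x in curve) / n
    return mean, variance


# Hypothetical curves with the same overall proportion (.1) of positive affect words.
peaked = [0.0, 0.0, 0.3, 0.3, 0.0, 0.0]  # concentrated in the middle of the segment
steady = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]  # moderate use throughout

peaked_mean, peaked_var = mean_and_variance(peaked)
steady_mean, steady_var = mean_and_variance(steady)
print(peaked_mean, peaked_var)
print(steady_mean, steady_var)
```

Both curves have essentially the same mean, but the first has a clearly positive variance while the second has a variance of essentially zero, which is the contrast described above.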