The DAAP Technical Manual


This document contains an overview of the current version of DAAP using Word Count as the independent variable, including transcribing rules, an outline of the DAAP procedures and the DAAP output.


There are special rules for putting time markers in transcripts for VTDAAP (Variable Time DAAP). These can be found in the attached file: TimeMarker Instructions.pdf. If you are going to use time markers, you should use those instructions (and suggestions) for transcribing first, then, as you proof-listen, apply the transcription rules given below.


Transcribing


DAAP08 has two different tracks: one is for single speaker; the other for multiple speakers. There are sometimes different instructions for the different tracks; these are in italics.



Word Processing Format. If at all possible use a fairly low level word processor, such as TextEdit or WordPad. Many word processors automatically change single and double quotation marks into “smart” quotes; if at all possible, turn these off, so that you get "unsmart" quotes. (In MS Word, uncheck the appropriate boxes in Tools, AutoCorrect.) Do not use any of the typography or formatting tools of your word processor, such as bold, italics, justification, etc. Make sure you write or save (Save As) your final version as a text (.txt) file. Some word processors break words and insert hyphens. Make sure your word processor does not do this.

    If this transcription is part of a larger project, such as several sessions or interviews or subject responses in an experiment, the transcript for each such session or interview or subject must be placed in a separate file, where the file name reflects the session or interview or subject number. These files are then all to be placed in a single folder.

Top Matter. The transcript begins with a confidentiality and/or ownership statement provided by the project director. Place this confidentiality statement at the top of the transcript.

    Follow this with a statement telling: the transcriber’s identifier (i.e., name or initials, as directed by the project director), date of transcription, and name of study.
Write the name or number of the session or interview or subject on the next line. The file name should also identify the session or interview or subject.

    For transcripts with just one speaker place the top matter inside [square brackets]. This is not necessary for transcripts with more than one speaker.

Optional function - Classification indicators. You can choose any number of classification terms depending on the organization of your study. For example, if your transcript is an interview with different overall categories of subject matter, and distinct questions within each category, your classification terms might be: 'time', 'person' and 'question'. If, for example, your questions are concerned with incidents at different times (far past, near past and future), different people (father, mother, spouse), and you have a set of questions for each of these times and people, and you start with questions about mother in the far past, you might place the following before the transcription of the first speaker:

\t time:FarPast
\t person:Mother
\t question:1

    Note that each of the above lines contains a space after \t, but no other spaces. 

    If your transcript contains no other classification indicator, then you can proceed to the first question. If you will have other classification indicators in the transcript, they must all be identified as above before the first speaker. 

    After you have completed the first question, if you move on to question 2, then you type on a line by itself:
\t question:2

    However, if the next question asks question 1 about father, you next type on a line by itself:
\t person:Father

    There is no need to repeat that we are still in the far past or that we are still concerned with question 1.

    The point of these classification indicators is that they can be used to aggregate data. For example, if you want to see data for all responses concerning the mother, tell the operator to aggregate the data for time and question. If you have a batch of files in a folder to be simultaneously run, they must all use the same classification indicators, and, for each run, they will all be aggregated the same way.

DAAP treats each change of classification as a new turn of speech, so, if your transcript contains more than one speaker, you must indicate who is speaking after each change of classification, even if there is no change of speaker. That is, you must have a line beginning with

\s

after each change of classification (this is not necessary for transcripts with only one speaker).

    All the classifications to be used in your study must be included in the first set of labels. You cannot add a new classification after the beginning of the first turn of speech; and you must spell the classification term exactly the same way each time you use it (the program is case sensitive, spelling includes capital letters).

    The classification terms, and the names of the instances of the categories (e.g., FarPast, Mother, 1) must be made up of ordinary letters, either upper or lower case, and numbers; no spaces or other keyboard characters are allowed; also no accented letters are allowed. Again, the program is case sensitive, 'T' and 't' are different letters.
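As a rough illustration of the classification rules above, the following Python sketch reads lines of the form \t term:Instance and tracks the current instance of each term. It is not part of DAAP: the function name track_classifications and the error messages are hypothetical, and the sketch checks only the rules stated in this section (one space after \t, letters and digits only, case-sensitive terms, no new terms after the first set of labels).

```python
import re

# One classification line: backslash, 't', one space, term, colon, instance.
# Terms and instances may contain only unaccented letters and digits.
CLASS_LINE = re.compile(r"\\t ([A-Za-z0-9]+):([A-Za-z0-9]+)")


def _parse(line):
    match = CLASS_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a valid classification line: {line!r}")
    return match.groups()


def track_classifications(header_lines, later_lines):
    """Return the full classification state after each later change.

    header_lines: the first set of labels, placed before the first speaker.
    later_lines: classification lines that occur further down the transcript.
    """
    state = {}
    for line in header_lines:
        term, instance = _parse(line)
        state[term] = instance

    snapshots = []
    for line in later_lines:
        term, instance = _parse(line)
        # Terms are case sensitive and must already be in the first set.
        if term not in state:
            raise ValueError(f"unknown classification term: {term!r}")
        state[term] = instance
        snapshots.append(dict(state))
    return snapshots
```

Only the term that changes needs to be restated; the other terms keep their previous instances, which is why each snapshot carries the whole state.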

Confidentiality. The transcriber must consult with the research project director concerning the use of disguises. All proper names of persons, places and animals must be changed; they are usually changed into other names of persons, places and animals, of the same form, but some project directors prefer other codes. This replacement must be done by the transcriber; the software will not do it. These changes must be listed in a code book, and must not appear in any form in the transcript. The form and safety of the code book is the responsibility of the project director. NOTE: the "words": 'a', 'd', 'i', 'm', 's' and 't', are actual words according to DAAP, and must not be used as disguises.

Speaker designators. If there is only one speaker, this section can be skipped. If the sound track contains the voices of several speakers, name these as speaker 1, speaker 2, etc. There is a special rule for psychotherapy and psychoanalytic sessions, and similar dialogues, such as interviews; the patient or interviewee is always speaker 1, and the therapist or analyst or interviewer is always speaker 2. Note that this means that the first speaker on the transcript could be speaker 2; this is not a problem.

    There is no a priori limit to the number of speakers. However, you cannot skip a number in the ordering of the speakers: there must be speakers 1 and 2; if there are more than two speakers, there must be a speaker 3, etc.

    Each time there is a change of speaker, start a new line after a blank line. If the new speaker is speaker 1, start this new line with the speaker designator:

\s 1

If the next speaker is speaker 2, start this new line with

\s 2

etc. Note that there is no space before the backslash, and that there is a space after the number; this is important. 
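The numbering rule for speakers can be checked before a transcript is submitted. The Python sketch below is a hypothetical helper, not a DAAP program: it collects the numbers used in \s designators and reports any number that has been skipped. It assumes that each designator starts its line and is followed by a space or the end of the line.

```python
import re

# A speaker designator: backslash, 's', one space, the speaker number,
# then either a space or the end of the line.
SPEAKER_LINE = re.compile(r"\\s (\d+)(?: |$)")


def speaker_numbers(lines):
    """Return the speaker numbers in the order the turns occur."""
    numbers = []
    for line in lines:
        match = SPEAKER_LINE.match(line)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def missing_speakers(lines):
    """Return speaker numbers that were skipped (should be empty)."""
    seen = set(speaker_numbers(lines))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

For a transcript that uses speakers 1, 2 and 4, missing_speakers returns [3], which signals that speaker 4 should be renumbered as speaker 3.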


Non-interruptions. This section can be skipped if there is only one speaker. There is a special rule for turns of speech with no content, such as the case of a speaker who is interrupted by someone else coughing, or laughing, or saying just some non-word, such as “hm” or “um hmm”. In these cases, and in these cases only, start a new line and, if for example, the interruption consists of speaker 2 saying “hm”, type:

\st 2 mm.

(Again, note that the backslash occurs at the beginning of a line, and that there are spaces both before and after the speaker number.)


    There is another special rule for interruptions that may contain words, but that do not constitute a redirection of the speaker. Examples are that speaker 2 says: "yes", or "of course", or "but …", and is then interrupted by speaker 1 continuing. In this case, start a new line and type:

\sc 2 but

End of transcript. At the end of the transcription of the last speaker, even in the case that there is only one speaker, start a new line and type:

\c9

Any material in the transcript after this will not be processed by DAAP. 

Sounds other than spoken words. Events, or sounds other than words, should be noted in square brackets, as in [laughs], [coughs], or [telephone rings], etc.

Brackets: For texts with only one speaker, items enclosed in square brackets are not processed by DAAP. However, items within round brackets are treated as if the round brackets were not there, and are processed by DAAP.


    For texts with multiple speakers, DAAP does not process items within either round or square brackets.

Backslashes. The backslash may only be used as indicated above. It must always be the first character of the line it is on.

Rules for words. The following are intended to standardize the decisions that the transcriber will need to make. NOTE: These rules do not apply to items within square brackets, as DAAP ignores all such items. However, for texts with only one speaker, DAAP processes items included within round brackets (parentheses).

    a) A word or phrase that is in the dictionary should be transcribed as written in the dictionary.

    b) If there is a choice between a hyphenated or unhyphenated form, such as “goodbye” or “good-bye”, choose either one; DAAP automatically changes it into the unhyphenated form.
    c) If the dictionary offers one word or two, such as “chickenpox” or “chicken pox”, choose one word; i.e., “chickenpox”.
    d) If the item sounds as if it ought to be one word, such as “bookturner”, but there is no such item in the dictionary, and both parts are words in the dictionary, write it as two words; i.e., “book turner”.
    e) If the item as spoken appears as one word, such as “nonbelief” and there is no such item in the dictionary, and the two separate parts are not words in the dictionary (“non” is not a word in the dictionary; it is listed as a prefix), then write this as one word, that is, “nonbelief.”
    f) There are a few doubly hyphenated words, such as "mother-in-law". Please type such words as unhyphenated; that is, "motherinlaw".
    g) Incomplete words. Denote an incomplete word by ending it with exactly one hyphen. For example, if the speaker stutters and says: "f f fail"; this should be transcribed as "f- f- fail"; note the spaces after the hyphens. If the speaker starts a word, hesitates, and then either completes the word or says another word, type the first partial word with a hyphen at the end, followed by a space. For example if the speaker says "some", then hesitates, then says "somewhat", transcribe it as "some- somewhat". The reason for this rule is that DAAP counts incomplete words as disfluencies. However, if the speaker says “I I I don’t know what to say”; these are not incomplete words, and this sentence should be transcribed as shown.
    If at all possible, do not use hyphens for purposes other than incomplete words, and binding parts of words, as in non-judgmental. Some word processors, including MS Word, sometimes change hyphens used in other contexts into em-dashes or en-dashes, which can cause difficulties.
    h) Unclear words. These are noted in square brackets; if the speaker says "the" followed by one or more unclear words, type "the [unclear]". It is not necessary to try to preserve the number of unclear words.
    i) Misspoken words. If the speaker misspeaks, or if you hear the speaker as misspeaking, and there is no doubt as to the correct meaning, type the correct word. For example, if the speaker says something that sounds like, “I want to Philadelphia yesterday, and walked on Market Street”, this is clearly a misspeaking, and the correct word is “went”, rather than “want”, so the transcription should read, “I went [want] to …” (one might need to code either "Philadelphia" or "Market Street").
    j) Apostrophes. Use apostrophes as usual for contractions, such as "don't", "can't", "I'd", and for possessives. However, type “oclock” rather than “o’clock”, as this is really one word.
    k) Filled pauses. Sounds that have no meaning, such as “um”, 'hm', 'ah', etc., should be written as they sound, with the following exception. An elongated "hm" should be typed as "mmm" or "hmm", but not more than 3 successive letters 'm' should be used.
    l) Numbers. Numbers should be written as numbers, but without commas or periods; that is, if the speaker says "two hundred thousand", you can type 200000. If the speaker says "four and a half", type, "4 and a half". For time, type, for example, "8:45".
    m) Amounts. Amounts, such as dollars and percents, should be transcribed as spoken; i.e., 100 dollars, or 10 percent.
    n) Disfluencies. The usual disfluencies are: 'well', 'like', 'you know', 'I mean' and 'kind of'. Filled pauses, such as 'mm' or 'uhm', incomplete words, repeated words and repeated pairs of words are also marked by DAAP as disfluencies. You can mark a word as a disfluency by typing DX as the last two letters of the word. For example, the word 'oh' is sometimes used as a disfluency; in that case, it should be transcribed as 'ohDX'.

Punctuation Marks. Use punctuation marks, such as commas (stops), periods (full stops), semicolons and question marks as in customary usage. DAAP uses these to decide if, for example, the word 'like' is being used as a disfluency. Do not use exclamation marks (!).

Pauses. You may use slashes (/) to indicate pauses of up to five seconds (one slash for each second). Do not use longer dashes (em-dashes or en-dashes) or three dots (…) for this purpose. You should leave a space both before and after each of these slashes. (Some word processors, such as MS Word, on occasion, change hyphens to em-dashes, and/or use special symbols for ellipses (...). These cause difficulties for DAAP.)
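Several of the transcribing rules concern characters that a word processor may insert without the transcriber noticing. A short Python sketch such as the following can be used to proofread a finished .txt file for them. It is a hypothetical helper, not part of DAAP, and the list of characters is only the set named in this manual (smart quotes, en-dashes, em-dashes, the ellipsis symbol and exclamation marks).

```python
# Characters the transcribing rules ask the transcriber to avoid.
PROBLEM_CHARACTERS = {
    "\u201c": "left smart double quote",
    "\u201d": "right smart double quote",
    "\u2018": "left smart single quote",
    "\u2019": "right smart single quote",
    "\u2013": "en-dash",
    "\u2014": "em-dash",
    "\u2026": "ellipsis symbol",
    "!": "exclamation mark",
}


def find_problem_characters(text):
    """Return (line number, column, description) for each problem character."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char in PROBLEM_CHARACTERS:
                findings.append(
                    (line_number, column, PROBLEM_CHARACTERS[char])
                )
    return findings
```

Line and column numbers start at 1, so the findings can be located directly in a plain text editor.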


Constructing your own unweighted dictionary.

An unweighted dictionary is simply a list of words, in alphabetical order, with each word on its own line, written in text format, where the file containing the dictionary has no suffix; for example, the name of the file containing the Disfluency dictionary is DF. If you are constructing your own dictionary, please do not use any of the names already in use: AffSen, AN, AND, ANF, ANH, ANP, AP, AZ, DF, future, Logic, LOL, Neg, past, R, SAS, SenS. (Some of the DAAP programs require that the names of the weighted dictionaries come alphabetically after the names of the unweighted dictionaries. Due to this, we require that the name of any unweighted dictionary start with one of the letters: A ... S.)
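The requirements for a new unweighted dictionary can be collected into a simple check, sketched here in Python. The function name check_unweighted_dictionary is hypothetical. Two points are assumptions rather than statements from this manual: the first-letter test ignores case (the existing names 'future' and 'past' begin with lower-case letters), and alphabetical order is tested by plain string comparison.

```python
# Names already in use for unweighted dictionaries, as listed above.
RESERVED_NAMES = {
    "AffSen", "AN", "AND", "ANF", "ANH", "ANP", "AP", "AZ", "DF",
    "future", "Logic", "LOL", "Neg", "past", "R", "SAS", "SenS",
}


def check_unweighted_dictionary(file_name, words):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if "." in file_name:
        problems.append("file name must have no suffix")
    if file_name in RESERVED_NAMES:
        problems.append("file name is already in use")
    # Assumption: the A ... S rule is applied without regard to case.
    if not file_name or not ("A" <= file_name[0].upper() <= "S"):
        problems.append("file name must start with a letter from A to S")
    # Assumption: alphabetical order means plain string order.
    if list(words) != sorted(words):
        problems.append("words are not in alphabetical order")
    return problems
```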

Constructing your own weighted dictionary.

This is a list of words, in alphabetical order where each word is followed by a space, and then a number between -1 and +1 (-1 and +1 are also allowed). Again, each word, with its number is on its own line. The weights in the dictionary are assumed to have meaning referring to a linguistic or psychological construct, where zero is the neutral value, positive weights are assigned to words connoting positive quality, and negative weights are assigned to words with a negative connotation. A weighted dictionary must be in text format in a file whose name has the suffix .Wt. For example, the WRAD file is named WRAD.Wt. (Some of the DAAP programs require that the names of the weighted dictionaries come alphabetically after the names of the unweighted dictionaries. Due to this, we require that the name of any weighted dictionary start with one of the letters: T, U, V, W, X, Y, Z.)
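The line format of a weighted dictionary (a word, a space, and a weight from -1 to +1) can be illustrated with a small Python reader. This is a sketch for checking a dictionary under construction, not the reader that DAAP itself uses; the function name and the example words are hypothetical.

```python
def parse_weighted_dictionary(lines):
    """Read 'word weight' lines into a dict, checking the allowed range."""
    entries = {}
    for line_number, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected 'word weight'")
        word, weight = parts[0], float(parts[1])
        # The endpoints -1 and +1 are allowed.
        if not -1.0 <= weight <= 1.0:
            raise ValueError(f"line {line_number}: weight outside -1 to +1")
        entries[word] = weight
    return entries
```

For example, the lines "calm 0.5", "gloomy -0.75" and "table 0" would be read as one positive, one negative and one neutral entry.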

The Smoothing Operator.

For each dictionary, regardless of whether it is weighted or not, there is no easy visual device for seeing the dictionary matches in a relatively long text. Many of the DAAP outputs are based on its smoothing operator which defines a curve that is both mathematically and visually smooth. This curve shows the places in the text where the dictionary matches are relatively high, vs. relatively low. The smoothing operator is basically a moving weighted average of dictionary values, but it also has a special wrap-around feature. To make the weighted average, DAAP uses an exponential function, very similar to the normal curve. At each word, the dictionary values of the previous 99 words, together with the next 99 words, are weighted and averaged; the weighting is highest at the word itself and diminishes rapidly, so that the weights at words relatively far away are very small. This weighting has the effect that, at each word, 95% of the weight is carried by the closest 84 words; the last 5% is carried by the remaining 114 more distant words.

One of the main advantages of the smoothing operator is that it can be used for texts with any number of words and any number of speakers. In particular, the smoothing operator can be applied separately to each turn of speech in a dialog, so that the words of one speaker do not affect the smoothed dictionary values of the other speaker.

There is a mathematical problem with the above computation of the moving average for words that occur within the first 99 words of a segment, or within the last 99 words. The wrap-around feature solves this problem with two incidental effects. The first is that the mean of the smooth dictionary values is equal to the mean of the actual dictionary values. The second is that the smooth curve tends to be essentially constant for short text segments (under about 25 words); this can cause a bias in some of the outputs, as described below.
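The general idea of a wrap-around weighted moving average can be sketched in Python as follows. This is only an illustration under stated assumptions, not the DAAP smoothing operator: the manual does not give the exact weighting function, so a normal (Gaussian) curve is used in its place, and the spread sigma = 21 is a hypothetical value chosen so that roughly 95% of the weight falls on the nearest 42 words on each side. The wrap-around is implemented by treating the segment as circular.

```python
import math


def smooth(values, half_width=99, sigma=21.0):
    """Circular weighted moving average of a list of dictionary values.

    half_width: number of words used on each side (99 in the description).
    sigma: spread of the assumed Gaussian weights (hypothetical value).
    """
    n = len(values)
    if n == 0:
        return []
    offsets = range(-half_width, half_width + 1)
    raw = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(raw)
    weights = [w / total for w in raw]  # weights sum to 1
    # Indices wrap around the ends of the segment (modulo n).
    return [
        sum(w * values[(i + k) % n] for k, w in zip(offsets, weights))
        for i in range(n)
    ]
```

Because the weights sum to 1 and every word is visited equally often under the circular indexing, the mean of the smoothed values equals the mean of the original values. For a very short segment the window wraps around many times, so every word receives almost the same average, which is the nearly constant curve described above.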

Smoothing for Unweighted Dictionaries.

An unweighted dictionary is simply a list of words. A word in the text that matches a word in the dictionary receives the value +1; other words receive the value 0. The corresponding smoothed curve also lies between 0 and 1. One should think of the curve as giving the approximate value of the underlying linguistic or psychological variable at each word of the text. That is, the curve at each point or word approximates the proportion of words from the dictionary near that point.

Smoothing for Weighted Dictionaries.

Each word of each weighted dictionary has a value between -1 and +1. Words that are not in the dictionary are assigned the weight 0. (IMPORTANT NOTE: The above description is for technical purposes in building new weighted dictionaries only. In fact, DAAP rescales the scores of weighted dictionaries so that all outputs are presented as if the weights lay between 0 and 1, with a neutral value of .5. Also, all theoretical discussions of the smoothed weighted values are presented as if the dictionary scores were all between 0 and 1.) The value of the smooth curve for a weighted dictionary at a given point or word should be thought of as an approximation to the value of the underlying linguistic or psychological variable at that point, so that the smooth WRAD curve gives, for each word, an approximation to the RA at that point.
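The rescaling mentioned in the note can be pictured with a one-line Python function. The manual does not state the formula; the linear map below is an assumption, chosen because it is the simplest one that sends -1, 0 and +1 to 0, .5 and 1. The function name rescale_weight is hypothetical.

```python
def rescale_weight(weight):
    """Map a dictionary weight from the range -1..+1 onto 0..1.

    Assumed linear rescaling: the neutral weight 0 becomes .5.
    """
    return (weight + 1.0) / 2.0
```

Under this assumption a word that is not in the weighted dictionary (weight 0) contributes the neutral value .5 to the outputs.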

DAAP08 Output Files

In order to explain the structure of the output files, we assume that there are some number of distinct texts, with file names Text1.txt, Text2.txt, etc., in the folder labeled "Texts" that DAAP08 is processing. This folder contains a subfolder, Dics, that contains the weighted and unweighted dictionaries to be used (again, in alphabetical order, the names of the unweighted dictionaries precede the names of the weighted dictionaries; the names of the unweighted dictionaries consist only of alphanumeric characters and no dots; the names of the weighted dictionaries end with .Wt). There is also another subfolder, DATA, that contains the output data, as given below.

There are separate Marked Text, Type-Token Ratio and Smooth files for each text file. The DATA folder also contains several overall files containing data for all the text files. These are the Coverage File, the Turn File, the First Aggregation File, the Second Aggregation File, the Main Aggregation File and the Variations File. DAAP also produces two files, the Log File and the Word File, that are primarily for the use of the operator.

There is a separate Marked Text for each file (Text1MTT.txt, Text2MTT.txt, etc.). Each file reproduces the material in the original text, and adds some markers. The turns of speech are counted, and there is a marker in square brackets at the beginning of each turn of speech showing its number. There is a marker of the form [a] every 10 counted words (words in parentheses or square brackets are not counted), where 'a' is the current word count starting at the beginning of the text. There is also some information at the end of the marked text concerning the total number of words, and the total number of turns of speech.

There is a separate Type-Token Ratio file for each text (Text1TTR.txt, Text2TTR.txt, etc.). This file contains, for each speaker, a list of the distinct words (types) spoken by that speaker, together with the number of occurrences of that type (tokens). The file also contains, for each speaker, the number of distinct types, the total number of tokens, and their ratio, the type-token ratio.
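The quantities in the Type-Token Ratio file can be illustrated with a short Python sketch. It shows only the arithmetic for one speaker's list of counted words; the function name is hypothetical, and the layout of the actual file is not reproduced.

```python
from collections import Counter


def type_token_ratio(words):
    """Return (counts per type, number of types, number of tokens, ratio)."""
    counts = Counter(words)
    types = len(counts)            # distinct words
    tokens = sum(counts.values())  # all occurrences
    ratio = types / tokens if tokens else 0.0
    return counts, types, tokens, ratio
```

For the words "I I I dont know" there are 3 types and 5 tokens, so the ratio is 3/5.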

There is a separate Smooth File for each text (Text1SMT.csv, Text2SMT.csv, etc.). This file is a comma-separated-value (csv) file that can be read by any spreadsheet, such as Excel, Lotus 123, Numbers (Mac), Calc (Open Office), etc. The file contains the data needed to make a chart of the smoothed curves. The top line contains the names of the variables. After that, there is one line for each word of the text. The columns are ordered first by speaker and then by dictionary. For example, if there are two speakers, and three dictionaries (DF, R and WRAD.Wt), the columns will be labeled as: DF(S1), R(S1), WRAD(S1), DF(S2), R(S2), WRAD(S2).
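The column ordering described above (first by speaker, then by dictionary) can be reproduced with a few lines of Python. This sketch only generates the labels from the example; the function name is hypothetical, and it assumes that the .Wt suffix is dropped from the label, as in WRAD(S1).

```python
def smooth_file_columns(speaker_count, dictionary_files):
    """Build column labels ordered by speaker, then by dictionary."""
    names = [
        name[:-3] if name.endswith(".Wt") else name
        for name in dictionary_files
    ]
    return [
        f"{name}(S{speaker})"
        for speaker in range(1, speaker_count + 1)
        for name in names
    ]
```

Calling smooth_file_columns(2, ["DF", "R", "WRAD.Wt"]) gives the six labels listed in the example.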

There is a separate Word File for each text (Text1WRD.csv, Text2WRD.csv, etc.). This file contains a list of all the (counted) words in the file; for each word, the file shows the number of the speaker and the dictionary values assigned to the word. This file is primarily for use by the DAAP operator.

There is an overall LogFile containing a list of the files processed, and, for each file, a list of errors. This file is primarily for use by the DAAP operator.

There is an overall Coverage File (TextsGLB.csv). This file contains information for each file in the folder that is independent of speaker: the number of words, the number of turns of speech, the number of Non-Turn-Vocalizations (marked by \st, but not by \sc); and then, for each dictionary, the coverage, which is the percent of words in the text that match the dictionary, and the number of dictionary matches. (Some of this is redundant: the coverage equals the number of dictionary matches divided by the total number of words.)
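The relation between dictionary matches and coverage can be shown with a minimal Python sketch. The function name is hypothetical, exact string matching is an assumption, and the coverage is returned as a fraction of the words (multiply by 100 for a percent).

```python
def coverage(words, dictionary):
    """Return (number of matches, matches divided by number of words)."""
    # Each word scores 1 if it is in the dictionary and 0 otherwise.
    matches = sum(1 for word in words if word in dictionary)
    fraction = matches / len(words) if words else 0.0
    return matches, fraction
```

For the four words "I um went um" and a dictionary containing "um", there are 2 matches and the coverage is 2/4.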

There is an overall Turn File (TextsTRN.csv). This file contains a separate row for each turn of speech. The columns give the name of the particular text file; the turn number (this starts again at 1 for each file); the speaker number; the instance of each category (there is a separate column for each category); the number of words in this turn of speech; the number of the word at which this turn of speech starts; the mean score for each unweighted dictionary; the mean score, the mean high score, and the high score proportion for each weighted dictionary; and (if requested) the covariations between each pair of dictionaries.

The High Score Proportion for a weighted dictionary is the proportion of words for which the smooth curve is above the neutral value of .5.

The Mean High Score for a weighted dictionary is the average value of the amount by which the smooth score for that dictionary is above the neutral value of .5. That is, we look only at the words for which the smooth score is above .5, and we take the average value, for those words only, of the difference between the smooth score and .5.
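The two definitions above can be written out as a short Python sketch that takes the smoothed scores of one weighted dictionary over a stretch of text. The function name is hypothetical; returning 0 when no score is above the neutral value is an assumption, since the manual does not say what is reported in that case.

```python
def high_score_measures(smooth_scores, neutral=0.5):
    """Return (High Score Proportion, Mean High Score) for smoothed scores."""
    # Amount above the neutral value, for words above neutral only.
    excess = [score - neutral for score in smooth_scores if score > neutral]
    proportion = len(excess) / len(smooth_scores) if smooth_scores else 0.0
    mean_high = sum(excess) / len(excess) if excess else 0.0
    return proportion, mean_high
```

For the smoothed scores .4, .6, .7 and .5, two of the four words are above .5, so the High Score Proportion is .5; the amounts above neutral are .1 and .2, so the Mean High Score is .15.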

The Covariation between a pair of dictionaries is a measure that is computationally related to the (Pearson) correlation coefficient, but does not have the same statistical meaning. The weighted dictionaries all have a natural neutral value of .5; the unweighted dictionaries are each given the mean over the entire text as a neutral value. The covariation is a measure of the extent to which the smoothed dictionary curves are simultaneously above, and simultaneously below, their neutral values.
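One way to picture a measure of this kind is the Python sketch below. The manual does not give the formula, so this is an assumption and should not be read as the DAAP computation: it takes the usual correlation formula and replaces each curve's mean with its neutral value. The function name is hypothetical.

```python
import math


def covariation(curve_a, curve_b, neutral_a, neutral_b):
    """Correlation-like measure of deviations from the neutral values.

    Assumed form: sum of products of deviations, divided by the square
    root of the product of the sums of squared deviations.
    """
    dev_a = [a - neutral_a for a in curve_a]
    dev_b = [b - neutral_b for b in curve_b]
    denominator = math.sqrt(
        sum(x * x for x in dev_a) * sum(y * y for y in dev_b)
    )
    if denominator == 0:
        return 0.0  # a curve that never leaves its neutral value
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denominator
```

With this form the value is positive when the two curves tend to be on the same side of their neutral values at the same words, and negative when they tend to be on opposite sides.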

CAUTION: The covariations are unreliable for texts with relatively few words, or for texts consisting in large part of segments, such as turns of speech, consisting of relatively few words. They are generally reliable for texts with more than 25 words. For this reason, covariations for therapist speech for psychotherapy sessions are generally not reported.

There is an overall First Aggregation File (TextsAG1.csv). This file contains a separate set of rows, and possibly a separate set of columns, for each text. For each text, there is a separate row for each speaker and for each instance of each category. For example, if there are two speakers, two categories labeled as date and section, where the instances of date are Jan10, Feb10, Mar10, and the instances of section are parents and children, there will be a total of 12 rows for this file: speaker1,Jan10,parents; speaker2,Jan10,parents; speaker1,Jan10,children; speaker2,Jan10,children; etc. The columns give the File name; speaker, instance of first category, instance of second category, etc.; number of words; number of turns of speech; mean score for each unweighted dictionary; mean score, mean high score, and mean high proportion for each weighted dictionary; and (if requested) the overall covariations between each pair of dictionaries.

At the bottom of the first aggregation file, there is a separate set of rows and columns. There is one row for each speaker. The columns give speaker, number of words, overall mean score for each unweighted dictionary, and overall mean score, mean high score, and high score proportion for each weighted dictionary. CAUTION: These have meaning only if all texts in the folder have the same speaker.

There is an overall Variations File (TextsAG1SD.csv). This file contains the same rows as the First Aggregation File. For each row, the columns list the text file; the speaker, the number of words, the number of turns, the name of the instance for each category, and the average variation of each of the smoothed dictionary curves. This average variation is computationally the same as the standard deviation, but does not carry the same statistical meaning.

There is an overall Second Aggregation File (TextsAG2.csv). As above, this file contains a separate set of rows and columns for each text file. However, the data from some of the categories are aggregated. If, in the above example, the data from section are aggregated, then, for each file, there is a row for each speaker and each date, for a total of six rows. The columns are the same as the main part of the first aggregation file, except that there are no entries in the column labeled section.
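The row counts in the two aggregation examples (12 rows, and 6 rows after aggregating over section) can be checked with a small Python sketch. It only enumerates the combinations of speaker and category instances; the function name is hypothetical, and the order in which DAAP writes the rows is not reproduced.

```python
from itertools import product


def aggregation_rows(speakers, categories, aggregate_over=()):
    """List one row key per speaker and per kept category instance.

    categories: dict mapping each category name to its list of instances.
    aggregate_over: names of the categories whose data are combined.
    """
    kept = [
        instances
        for name, instances in categories.items()
        if name not in aggregate_over
    ]
    return list(product(speakers, *kept))
```

With two speakers, three dates and two sections, the function lists 12 row keys; with section aggregated, it lists 6.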

There is an overall Main Aggregation File (TextsAG0.csv). There is a separate set of rows for each text file, one row for each speaker, and a final row for all speakers combined. The columns give the name of the text file; the speaker, labeled as 1, 2, ..., All; the number of words; the number of turns of speech; the mean score for each unweighted dictionary; the mean score, mean high score, and mean high proportion for each weighted dictionary; and (if requested) the overall covariations between each pair of dictionaries.