This document contains an overview of the current version of DAAP using Word Count as the independent variable, including transcribing rules, an outline of the DAAP procedures and the DAAP output. There is a separate DAAP Math Page containing the mathematical formulae for the DAAP smoothing and covariation functions. There is also a separate VTDAAP Technical manual. There are special rules for putting time markers in transcripts for VTDAAP (Variable Time DAAP). These can be found in the attached file: TimeMarker Instructions.pdf. If you are going to use time markers, you should use those instructions (and suggestions) for transcribing first, then, as you proof-listen, apply the transcription rules given below.
Transcribing
DAAP08 has two different tracks: one is for single speaker; the other for multiple speakers. There are sometimes different instructions for the different tracks; these are in italics.
Word Processing Format. If at all possible use a fairly low level word processor, such as TextEdit or WordPad. Many word processors automatically change single and double quotation marks into “smart” quotes; if at all possible, turn these off, so that you get "unsmart" quotes. (In MS Word, uncheck the appropriate boxes in Tools, AutoCorrect.) Do not use any of the typography or formatting tools of your word processor, such as bold, italics, justification, etc. Make sure you write or save (Save As) your final version as a text (.txt) file. Some word processors break words and insert hyphens. Make sure your word processor does not do this.
Top Matter. The transcript begins with a confidentiality and/or ownership statement provided by the project director. Place this confidentiality statement at the top of the transcript. For transcripts with just one speaker place the top matter inside [square brackets]. This is not necessary for transcripts with more than one speaker.
Optional function - Classification indicators. You can choose any number of classification terms depending on the organization of your study. For example, if your transcript is an interview with different overall categories of subject matter, and distinct questions within each category, your classification terms might be: 'time', 'person' and 'question'. If, for example, your questions are concerned with incidents at different times (far past, near past and future), different people (father, mother, spouse), and you have a set of questions for each of these times and people, and you start with questions about mother in the far past, you might place the following before the transcription of the first speaker:
\t time:FarPast
\t person:Mother
\t question:1
Note that each of the above lines contains a space after \t, but no other spaces. If your transcript contains no other classification indicator, then you can proceed to the first question. If you will have other classification indicators in the transcript, they must all be identified as above before the first speaker.
After you have completed the first question, if you move on to question 2, then you type on a line by itself:
\t question:2
However, if the next question asks question 1 about father, you next type on a line by itself:
\t person:Father
There is no need to repeat that we are still in the far past or that we are still concerned with question 1.
The point of these classification indicators is that they can be used to aggregate data. For example, if you want to see data for all responses concerning the mother, tell the operator to aggregate the data for time and question. If you have a batch of files in a folder to be simultaneously run, they must all use the same classification indicators, and, for each run, they will all be aggregated the same way.
DAAP treats each change of classification as a new turn of speech, so, if your transcript contains more than one speaker, you must indicate who is speaking after each change of classification, even if there is no change of speaker. That is, you must have a line beginning with \s after each change of classification (this is not necessary for transcripts with only one speaker). All the classifications to be used in your study must be included in the first set of labels. You cannot add a new classification after the beginning of the first turn of speech, and you must spell the classification term exactly the same way each time you use it (the program is case sensitive; spelling includes capital letters). The classification terms, and the names of the instances of the categories (e.g., FarPast, Mother, 1), must be made up of ordinary letters, either upper or lower case, and numbers; no spaces or other keyboard characters are allowed, and no accented letters are allowed. Again, the program is case sensitive: 'T' and 't' are different letters.
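The classification rules above can be illustrated with a short Python sketch. This is not DAAP's own code; the function name track_classifications and the regular expression are hypothetical, written only from the rules stated above (a backslash-t, one space, then term:instance using plain letters and digits, and no new terms once the first turn of speech has begun).

```python
import re

# Assumed pattern: "\t", one space, then term:instance with only ASCII
# letters and digits (no spaces, no accented letters), as the rules state.
INDICATOR = re.compile(r"^\\t ([A-Za-z0-9]+):([A-Za-z0-9]+)\s*$")

def track_classifications(lines):
    """Yield (line, state) for each content line.

    state maps each classification term to its current instance.
    Hypothetical helper for illustration; not part of DAAP.
    """
    state = {}
    declared = None  # terms frozen once content follows the first labels
    for line in lines:
        match = INDICATOR.match(line)
        if match:
            term, instance = match.groups()
            if declared is not None and term not in declared:
                raise ValueError("new classification term after first turn: " + term)
            state[term] = instance  # only the changed term is updated
        elif line.strip():
            if declared is None and state:
                declared = set(state)
            yield line, dict(state)
```

Because only the changed term is updated, a line such as `\t person:Father` leaves the current time and question in place, which matches the remark above that unchanged classifications need not be repeated.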
Confidentiality. The transcriber must consult with the research project director concerning the use of disguises. All proper names of persons, places and animals must be changed; they are usually changed into other names of persons, places and animals, of the same form, but some project directors prefer other codes. This replacement must be done by the transcriber; the software will not do it. These changes must be listed in a code book, and must not appear in any form in the transcript. The form and safety of the code book is the responsibility of the project director. NOTE: the "words": 'a', 'd', 'i', 'm', 's' and 't', are actual words according to DAAP, and must not be used as disguises.
Speaker designators. If there is only one speaker, this section can be skipped. If the sound track contains the voices of several speakers, name these as speaker 1, speaker 2, etc. There is a special rule for psychotherapy and psychoanalytic sessions, and similar dialogues, such as interviews; the patient or interviewee is always speaker 1, and the therapist or analyst or interviewer is always speaker 2. Note that this means that the first speaker on the transcript could be speaker 2; this is not a problem. There is no a priori limit to the number of speakers. However, you cannot skip a number in the ordering of the speakers: there must be speakers 1 and 2; if there are more than two speakers, there must be a speaker 3, etc. Each time there is a change of speaker, start a new line after a blank line. If the new speaker is speaker 1, start this new line with the speaker designator:
\s 1
If the next speaker is speaker 2, start this new line with
\s 2
etc. Note that there is no space before the backslash, and that there is a space after the number; this is important.
Non-interruptions. This section can be skipped if there is only one speaker. There is a special rule for turns of speech with no content, such as the case of a speaker who is interrupted by someone else coughing, or laughing, or saying just some non-word, such as “hm” or “um hmm”. In these cases, and in these cases only, start a new line and, if for example, the interruption consists of speaker 2 saying “hm”, type:
\st 2 mm
(Again, note that the backslash occurs at the beginning of a line, and that there are spaces both before and after the speaker number.)
\sc 2 but
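As an illustration of the speaker designator rules above, the following Python sketch splits a transcript into turns and checks that no speaker number is skipped. It is a minimal sketch, not DAAP's parser: the names parse_turns and check_speaker_numbers are hypothetical, and the \sc marker is only recognized and recorded, since its handling is not spelled out in this section.

```python
import re

# A marker is a backslash at the very start of the line, then s, st or sc,
# one space, the speaker number, and optionally a space plus text.
MARKER = re.compile(r"^\\(st|sc|s) (\d+)(?: (.*))?$")

def parse_turns(lines):
    """Return a list of dicts with keys 'kind', 'speaker' and 'text'."""
    turns = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = MARKER.match(line)
        if match:
            kind = match.group(1)
            speaker = int(match.group(2))
            text = (match.group(3) or "").strip()
            turns.append({"kind": kind, "speaker": speaker, "text": text})
        elif line.strip() and turns:
            # Unmarked lines continue the current turn.
            joined = turns[-1]["text"] + " " + line.strip()
            turns[-1]["text"] = joined.strip()
    return turns

def check_speaker_numbers(turns):
    """True if the speakers used are exactly 1, 2, ..., n with none skipped."""
    speakers = {turn["speaker"] for turn in turns}
    return speakers == set(range(1, len(speakers) + 1))
```

A transcript that uses speakers 1 and 3 but never 2 fails the second check, reflecting the rule that speaker numbers may not be skipped.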
End of transcript. At the end of the transcription of the last speaker, even in the case that there is only one speaker, start a new line and type: \c9 Any material in the transcript after this will not be processed by DAAP.
Sounds other than spoken words. Events, or sounds other than words, should be noted in square brackets, as in [laughs], [coughs], or [telephone rings], etc.
Brackets: For texts with only one speaker, items enclosed in square brackets are not processed by DAAP. However, items within round brackets are treated as if the round brackets were not there, and are processed by DAAP.
Backslashes. The backslash may only be used as indicated above. It must always be the first character of the line it is on.
Rules for words. The following are intended to standardize the decisions that the transcriber will need to make. NOTE: These rules do not apply to items within square brackets, as DAAP ignores all such items. However, for texts with only one speaker, DAAP processes items included within round brackets (parentheses).
a) A word or phrase that is in the dictionary should be transcribed as written in the dictionary.
b) If there is a choice between a hyphenated or unhyphenated form, such as “goodbye” or “good-bye”, choose either one; DAAP automatically changes it into the unhyphenated form.
c) If the dictionary offers one word or two, such as “chickenpox” or “chicken pox”, choose one word; i.e., “chickenpox”.
d) If the item sounds as if it ought to be one word, such as “bookturner”, but there is no such item in the dictionary, and both parts are words in the dictionary, write it as two words; i.e., “book turner”.
e) If the item as spoken appears as one word, such as “nonbelief”, and there is no such item in the dictionary, and the two separate parts are not words in the dictionary (“non” is not a word in the dictionary; it is listed as a prefix), then write this as one word, that is, “nonbelief.”
f) There are a few doubly hyphenated words, such as "mother-in-law". Please type such words as unhyphenated; that is, "motherinlaw".
g) Incomplete words. Denote an incomplete word by ending it with exactly one hyphen. For example, if the speaker stutters and says: "f f fail", this should be transcribed as "f- f- fail"; note the spaces after the hyphens. If the speaker starts a word, hesitates, and then either completes the word or says another word, type the first partial word with a hyphen at the end, followed by a space. For example, if the speaker says "some", then hesitates, then says "somewhat", transcribe it as "some- somewhat". The reason for this rule is that DAAP counts incomplete words as disfluencies. However, if the speaker says “I I I don’t know what to say”, these are not incomplete words, and this sentence should be transcribed as shown. If at all possible, do not use hyphens for purposes other than incomplete words, and binding parts of words, as in non-judgmental. Some word processors, including MS Word, sometimes change hyphens used in other contexts into em-dashes or en-dashes, which can cause difficulties.
h) Unclear words. These are noted in square brackets; if the speaker says "the" followed by one or more unclear words, type "the [unclear]". It is not necessary to try to preserve the number of unclear words.
i) Misspoken words. If the speaker misspeaks, or if you hear the speaker as misspeaking, and there is no doubt as to the correct meaning, type the correct word. For example, if the speaker says something that sounds like, “I want to Philadelphia yesterday, and walked on Market Street”, this is clearly a misspeaking, and the correct word is “went”, rather than “want”, so the transcription should read, “I went [want] to …” (one might need to code either "Philadelphia" or "Market Street").
j) Apostrophes. Use apostrophes as usual for contractions, such as "don't", "can't", "I'd", and for possessives. However, type “oclock” rather than “o’clock”, as this is really one word.
k) Filled pauses. Sounds that have no meaning, such as 'um', 'hm', 'ah', etc., should be written as they sound, with the following exception. An elongated "hm" should be typed as "mmm" or "hmm", but not more than 3 successive letters 'm' should be used.
l) Numbers. Numbers should be written as numbers, but without commas or periods; that is, if the speaker says "two hundred thousand", you can type 200000. If the speaker says "four and a half", type "4 and a half". For time, type, for example, "8:45".
m) Amounts. Amounts, such as dollars and percents, should be transcribed as spoken; i.e., 100 dollars, or 10 percent.
n) Disfluencies. The usual disfluencies are: 'well', 'like', 'you know', 'I mean' and 'kind of'. Filled pauses, such as 'mm' or 'uhm', incomplete words, repeated words and repeated pairs of words are also marked by DAAP as disfluencies. You can mark a word as a disfluency by typing DX as the last two letters of the word. For example, the word 'oh' is sometimes used as a disfluency; in that case, it should be transcribed as 'ohDX'.
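Several of the word rules above are mechanical enough to sketch in Python: square-bracket items are dropped, a token ending in exactly one hyphen is an incomplete word, a trailing DX marks a disfluency, and other hyphens are removed. The function classify_tokens and its tag names are hypothetical and written only from the rules in this section; DAAP's actual tokenizer is not described here and may differ.

```python
import re

def classify_tokens(text):
    """Return (word, tag) pairs for one line of transcript text.

    tag is 'incomplete', 'marked_disfluency' or 'word'.
    Illustrative helper only; not DAAP's own tokenizer.
    """
    # Items in square brackets, such as [laughs] or [unclear], are ignored.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    result = []
    for token in text.split():
        # Drop surrounding punctuation; round brackets are treated as absent.
        token = token.strip('.,;:?"()')
        if not any(ch.isalnum() for ch in token):
            continue  # nothing word-like is left
        if token.endswith("-") and not token.endswith("--"):
            result.append((token, "incomplete"))          # e.g. "f-" or "some-"
        elif token.endswith("DX") and len(token) > 2:
            result.append((token[:-2], "marked_disfluency"))  # "ohDX" -> "oh"
        else:
            result.append((token.replace("-", ""), "word"))   # "good-bye" -> "goodbye"
    return result
```

For instance, `classify_tokens("some- somewhat")` tags the first token as incomplete and the second as an ordinary word, which is the distinction rule g) is designed to preserve.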
Punctuation Marks. Use punctuation marks, such as commas (stops), periods (full stops), semicolons and question marks as in customary usage. DAAP uses these to decide if, for example, the word 'like' is being used as a disfluency. Do not use exclamation marks (!).
Pauses. You may use slashes (/) to indicate pauses of up to five seconds (one slash for each second). Do not use longer dashes (em-dashes or en-dashes) or three dots (…) for this purpose. You should leave a space both before and after each of these slashes. (Some word processors, such as MS Word, on occasion, change hyphens to em-dashes, and/or use special symbols for ellipses (...). These cause difficulties for DAAP.)
Constructing your own unweighted dictionary. An unweighted dictionary is simply a list of words, in alphabetical order, with each word on its own line, written in text format, where the file containing the dictionary has no suffix; for example, the name of the file containing the Disfluency dictionary is DF. If you are constructing your own dictionary, please do not use any of the names already in use: AffSen, AN, AND, ANF, ANH, ANP, AP, AZ, DF, future, Logic, LOL, Neg, past, R, SAS, SenS. (Some of the DAAP programs require that the names of the weighted dictionaries come alphabetically after the names of the unweighted dictionaries. Due to this, we require that the name of any unweighted dictionary start with one of the letters: A ... S.)
Constructing your own weighted dictionary. This is a list of words, in alphabetical order, where each word is followed by a space, and then a number between -1 and +1 (-1 and +1 are also allowed). Again, each word, with its number, is on its own line. The weights in the dictionary are assumed to have meaning referring to a linguistic or psychological construct, where zero is the neutral value, positive weights are assigned to words connoting positive quality, and negative weights are assigned to words with a negative connotation. A weighted dictionary must be in text format in a file whose name has the suffix .Wt. For example, the WRAD file is named WRAD.Wt. (Some of the DAAP programs require that the names of the weighted dictionaries come alphabetically after the names of the unweighted dictionaries. Due to this, we require that the name of any weighted dictionary start with one of the letters: T, U, V, W, X, Y, Z.)
The Smoothing Operator. For each dictionary, regardless of whether it is weighted or not, there is no easy visual device for seeing the dictionary matches in a relatively long text.
Many of the DAAP outputs are based on its smoothing operator which defines a curve that is both mathematically and visually smooth. This curve shows the places in the text where the dictionary matches are relatively high, vs. relatively low. The smoothing operator is basically a moving weighted average of dictionary values, but it also has a special wrap-around feature. To make the weighted average, DAAP uses an exponential function, very similar to the normal curve. At each word, the dictionary values of the previous 99 words, together with the next 99 words, are weighted and averaged; the weighting is highest at the word itself and diminishes rapidly, so that the weights at words relatively far away are very small. This weighting has the effect that, at each word, 95% of the weight is carried by the closest 84 words; the last 5% is carried by the remaining 114 more distant words.
One of the main advantages of the smoothing operator is that it can be used for texts with any number of words and any number of speakers. In particular, the smoothing operator can be applied separately to each turn of speech in a dialog, so that the words of one speaker do not affect the smoothed dictionary values of the other speaker.
There is a mathematical problem with the above computation of the moving average for words that occur within the first 99 words of a segment, or within the last 99 words. The wrap-around feature solves this problem with two incidental effects. The first is that the mean of the smooth dictionary values is equal to the mean of the actual dictionary values. The second is that the smooth curve tends to be essentially constant for short text segments (under about 25 words); this can cause a bias in some of the outputs, as described below.
Smoothing for Unweighted Dictionaries. An unweighted dictionary is simply a list of words. A word in the text that matches a word in the dictionary receives the value +1; other words receive the value 0.
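The idea of a moving weighted average with wrap-around can be sketched in Python as follows. This is an illustration under stated assumptions, not the DAAP weighting function, which is given on the DAAP Math Page: a plain Gaussian stands in for the exponential weighting, the value sigma=21.4 is a hypothetical choice that puts roughly 95% of the weight within about 42 words on either side, and the wrap-around is assumed to treat each segment as a loop. The names kernel, smooth and score_unweighted are likewise hypothetical.

```python
import math

def kernel(half_width=99, sigma=21.4):
    """Normalized weights for offsets -half_width..half_width (Gaussian stand-in)."""
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma))
           for k in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth(values, half_width=99, sigma=21.4):
    """Moving weighted average of one segment, wrapping around its ends."""
    n = len(values)
    if n == 0:
        return []
    weights = kernel(half_width, sigma)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(weights):
            k = j - half_width               # offset from the current word
            acc += w * values[(i + k) % n]   # assumed circular wrap-around
        out.append(acc)
    return out

def score_unweighted(words, dictionary):
    """1.0 for a word found in the unweighted dictionary, else 0.0."""
    return [1.0 if word in dictionary else 0.0 for word in words]
```

With a circular wrap-around and weights that sum to 1, the mean of the smoothed values equals the mean of the raw values, and a very short segment is averaged over itself many times so that its curve is nearly flat; both effects agree with the two incidental effects noted above. To smooth each speaker separately, one would call smooth once per turn of speech.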
The corresponding smoothed curve also lies between 0 and 1. One should think of the curve as giving the approximate value of the underlying linguistic or psychological variable at each word of the text. That is, the curve at each point or word approximates the proportion of words from the dictionary near that point.
Smoothing for Weighted Dictionaries. Each word of each weighted dictionary has a value between -1 and +1. Words that are not in the dictionary are assigned the weight 0. (IMPORTANT NOTE: The above description is for technical purposes in building new weighted dictionaries only. In fact, DAAP rescales the scores of weighted dictionaries so that all outputs are presented as if the weights lay between 0 and 1, with a neutral value of .5. Also, all theoretical discussions of the smoothed weighted values are presented as if the dictionary scores were all between 0 and 1.) The value of the smooth curve for a weighted dictionary at a given point or word should be thought of as an approximation to the value of the underlying linguistic or psychological variable at that point, so that the smooth WRAD curve gives, for each word, an approximation to the RA at that point.
DAAP08 Output Files
In order to explain the structure of the output files, we assume that there are some number of distinct texts, with file names Text1.txt, Text2.txt, etc., in the folder labeled "Texts" that DAAP08 is processing. This folder contains a subfolder, Dics, that contains the weighted and unweighted dictionaries to be used (again, in alphabetical order, the names of the unweighted dictionaries precede the names of the weighted dictionaries; the names of the unweighted dictionaries consist only of alphanumeric characters and no dots; the names of the weighted dictionaries end with .Wt). There is also another subfolder, DATA, that contains the output data, as given below. There are separate Marked Text, Type-Token Ratio and Smooth files for each text file. 
The DATA folder also contains several overall files containing data for all the text files. These are the Coverage File, the Turn File, the First Aggregation File, the Second Aggregation File, Main Aggregation File and Variations File. DAAP also produces two files, the Log File, and the Word File, that are primarily for the use of the operator. There is a separate Marked Text for each file (Text1MTT.txt, Text2MTT.txt, etc.). Each file reproduces the material in the original text, and adds some markers. The turns of speech are counted, and there is a marker in square brackets at the beginning of each turn of speech showing its number. There is a marker of the form [a] every 10 counted words (words in parentheses or square brackets are not counted), where 'a' is the current word count starting at the beginning of the text. There is also some information at the end of the marked text concerning the total number of words, and the total number of turns of speech. There is a separate Type-Token Ratio file for each text (Text1TTR.txt, Text2TTR.txt, etc.). This file contains, for each speaker, a list of the distinct words (types) spoken by that speaker, together with the number of occurrences of that type (tokens). The file also contains, for each speaker, the number of distinct types, the total number of tokens, and their ratio, the type-token ratio. There is a separate Smooth File for each text (Text1SMT.csv, Text2SMT.csv, etc.). This file is a comma-separated-value (csv) file that can be read by any spreadsheet, such as Excel, Lotus 123, Numbers (Mac), Calc (Open Office), etc. The file contains the data needed to make a chart of the smoothed curves. The top line contains the names of the variables. After that, there is one line for each word of the text. The columns are ordered first by speaker and then by dictionary. 
For example, if there are two speakers, and three dictionaries (DF, R and WRAD.Wt), the columns will be labeled as: DF(S1), R(S1), WRAD(S1), DF(S2), R(S2), WRAD(S2). There is a separate Word File for each text (Text1WRD.csv, Text2WRD.csv, etc.). This file contains a list of all the (counted) words in the file; for each word, the file shows the number of the speaker and the dictionary values assigned to the word. This file is primarily for use by the DAAP operator. There is an overall LogFile containing a list of the files processed, and, for each file, a list of errors. This file is primarily for use by the DAAP operator. There is an overall Coverage File (TextsGLB.csv). This file contains information for each file in the folder that is independent of speaker: the number of words, the number of turns of speech, the number of Non-Turn-Vocalizations (marked by \st, but not by \sc); and then, for each dictionary, the coverage, which is the percent of words in the text that match the dictionary, and the number of dictionary matches. (Some of this is redundant: the coverage equals the number of dictionary matches divided by the total number of words.) There is an overall Turn File (TextsTRN.csv). This file contains a separate row for each turn of speech. The columns give the name of the particular text file; the turn number (this starts again at 1 for each file); the speaker number; the instance of each category (there is a separate column for each category); the number of words in this turn of speech; the number of the word at which this turn of speech starts; the mean score for each unweighted dictionary; the mean score, the mean high score, and the high score proportion for each weighted dictionary; and (if requested) the covariations between each pair of dictionaries. The High Score Proportion for a weighted dictionary is the proportion of words for which the smooth curve is above the neutral value of .5. 
The Mean High Score for a weighted dictionary is the average value of the amount by which the smooth score for that dictionary is above the neutral value of .5. That is, we look only at the words for which the smooth score is above .5, and we take the average value, for those words only, of the difference between the smooth score and .5.
The Covariation between a pair of dictionaries is a measure that is computationally related to the (Pearson) correlation coefficient, but does not have the same statistical meaning. The weighted dictionaries all have a natural neutral value of .5; the unweighted dictionaries are each given the mean over the entire text as a neutral value. The covariation is a measure of the extent to which the smoothed dictionary curves are simultaneously above, and simultaneously below, their neutral values.
CAUTION: The covariations are unreliable for texts with relatively few words, or for texts consisting in large part of segments, such as turns of speech, consisting of relatively few words. They are generally reliable for texts with more than 25 words. For this reason, covariations for therapist speech for psychotherapy sessions are generally not reported.
There is an overall First Aggregation File (TextsAG1.csv). This file contains a separate set of rows, and possibly a separate set of columns, for each text. For each text, there is a separate row for each speaker and for each instance of each category. For example, if there are two speakers, two categories labeled as date and section, where the instances of date are Jan10, Feb10, Mar10, and the instances of section are parents and children, there will be a total of 12 rows for this file: speaker 1,Jan10,parents; speaker 2,Jan10,parents; speaker 1,Jan10,children; speaker 2,Jan10,children; etc. 
The columns give the file name; speaker, instance of first category, instance of second category, etc.; number of words; number of turns of speech; mean score for each unweighted dictionary; mean score, mean high score, and high score proportion for each weighted dictionary; and (if requested) the overall covariations between each pair of dictionaries. At the bottom of the first aggregation file, there is a separate set of rows and columns. There is one row for each speaker. The columns give speaker, number of words, overall mean score for each unweighted dictionary, and overall mean score, mean high score, and high score proportion for each weighted dictionary. CAUTION: These have meaning only if all texts in the folder have the same speaker.
There is an overall Variations File (TextsAG1SD.csv). This file contains the same rows as the First Aggregation File. For each row, the columns list the text file; the speaker, the number of words, the number of turns, the name of the instance for each category, and the average variation of each of the smoothed dictionary curves. This average variation is computationally the same as the standard deviation, but does not carry the same statistical meaning.
There is an overall Second Aggregation File (TextsAG2.csv). As above, this file contains a separate set of rows and columns for each text file. However, the data from some of the categories are aggregated. If, in the above example, the data from section are aggregated, then, for each file, there is a row for each speaker and each date, for a total of six rows. The columns are the same as the main part of the first aggregation file, except that there are no entries in the column labeled section.
There is an overall Main Aggregation File (TextsAG0.csv). There is a separate set of rows for each text file, one row for each speaker, and a final row for all speakers combined. 
The columns give the name of the text file; the speaker, labeled as 1, 2, ..., All; the number of words; the number of turns of speech; the mean score for each unweighted dictionary; the mean score, mean high score, and high score proportion for each weighted dictionary; and (if requested) the overall covariations between each pair of dictionaries.
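The measures reported for weighted dictionaries in the Turn and Aggregation files can be sketched in Python from the verbal definitions given earlier. The High Score Proportion and Mean High Score follow those definitions directly. Two parts are assumptions rather than documented formulas: the linear rescaling of weights from the -1 to +1 range onto 0 to 1, and the covariation, written here as a Pearson-style ratio with the neutral values in place of the means (the exact formula is on the DAAP Math Page). All function names are hypothetical.

```python
import math

NEUTRAL = 0.5  # neutral value of a rescaled weighted dictionary

def rescale_weight(weight):
    """Assumed linear map from [-1, 1] to [0, 1], sending 0 to the neutral .5."""
    return (weight + 1.0) / 2.0

def high_score_proportion(smooth_scores):
    """Proportion of words whose smooth score is above the neutral value."""
    above = sum(1 for s in smooth_scores if s > NEUTRAL)
    return above / len(smooth_scores)

def mean_high_score(smooth_scores):
    """Average excess over the neutral value, taken over the high words only."""
    highs = [s - NEUTRAL for s in smooth_scores if s > NEUTRAL]
    return sum(highs) / len(highs) if highs else 0.0

def covariation(curve_a, curve_b, neutral_a, neutral_b):
    """Assumed form: correlation-like ratio using neutral values, not means."""
    dev_a = [a - neutral_a for a in curve_a]
    dev_b = [b - neutral_b for b in curve_b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(sum(y * y for y in dev_b))
    return numerator / denominator if denominator else 0.0
```

Under this assumed form the covariation is positive when the two curves tend to be on the same side of their neutral values at the same words and negative when they tend to be on opposite sides. For an unweighted dictionary, the neutral value passed in would be the mean of its curve over the entire text.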