Developing and Using Questionnaires (Guidance, 10/93, GAO/PEMD-10.1.7). GAO presented a guide to its evaluators with information on the use of questionnaires in their evaluations, focusing on the rationales for determining when questionnaires should be used to accomplish assignment objectives, and planning and designing questionnaires for population surveys. The guide will also help evaluators to use questionnaires more effectively, design appropriate questions, develop testing tasks, communicate questionnaire requirements to statistical analysis experts, and ensure the quality of final questionnaire and data results. --------------------------- Indexing Terms ----------------------------- REPORTNUM: PEMD-10.1.7 TITLE: Developing and Using Questionnaires DATE: 10/01/93 SUBJECT: Surveys Demographic data Data collection operations Statistical methods Evaluation methods Information analysis operations Information gathering operations Government publications Agency missions Cover ================================================================ COVER Program Evaluation and Methodology Division October 1993 DEVELOPING AND USING QUESTIONNAIRES GAO/PEMD-10.1.7 Questionnaires Abbreviations =============================================================== ABBREV GAO - U.S. General Accounting Office PEMD - Program Evaluation and Methodology Division PREFACE ============================================================ Chapter 0 GAO assists congressional decisionmakers by furnishing analytical information on issues and options under consideration. Many diverse methodologies are needed to develop sound and timely answers to the questions that are posed by the Congress. To provide GAO evaluators with basic information about the more commonly used methodologies, GAO's policy guidance includes documents such as methodology transfer papers and technical guidelines. The purpose of this methodology transfer paper is to provide evaluators with a background of sufficient depth to use questionnaires in their evaluations. Specifically, this paper provides rationales for determining when questionnaires should be used to accomplish assignment objectives. It also describes how to plan, design, and use a questionnaire in conducting a population survey.
We do not expect GAO evaluators to become experts after reading this paper. But we do hope that they will become familiar enough with questionnaire design guidelines to plan and use a questionnaire; to make preliminary designs and assist in many development and testing tasks; to communicate the questionnaire requirements to the measurement, sampling, and statistical analysis experts; and to ensure the quality of the final questionnaire and the resulting data collection. The present document is a revision. An earlier version was authored by Brian Keenan and Marilyn Mauch in 1986. This revision, authored by Brian Keenan, includes new material on cognition as well as on a number of developments in pretesting that have occurred since then. As such, the present document supersedes the 1986 version. Developing and Using Questionnaires is one of a series of papers prepared and issued by the Program Evaluation and Methodology Division (PEMD). The purpose of the series is to provide GAO evaluators with guides to various aspects of audit and evaluation methodology, to illustrate applications, and to indicate where more detailed information is available. We look forward to receiving comments from the readers of this paper. They should be addressed to Eleanor Chelimsky at 202-512-2900. Werner Grosshans Assistant Comptroller General Office of Policy Eleanor Chelimsky Assistant Comptroller General for Program Evaluation and Methodology USING QUESTIONNAIRES ============================================================ Chapter 1 This paper describes how to design and use questionnaires. Such information is important for GAO evaluators for two reasons. First, GAO frequently uses questionnaires to collect data. Second, the questionnaire is a method with a high potential for error if not designed and used properly. GAO employs questionnaires to ask people for figures, statistics, amounts, and other facts. We ask them to describe conditions and procedures that affect the work, organizations, and systems with which they are involved, and we ask for their judgments and views about processes, performance, adequacy, efficiency, and effectiveness. We ask people to report past events and to make forecasts, to tell us about their attitudes and opinions, and to describe their behavior and the behavior of others. Questionnaires are popular because they can be a relatively inexpensive way of getting people to provide information. But because they rely on people to provide answers, a benefit-risk consideration is associated with their use. People with the ability to observe, select, acquire, process, evaluate, interpret, store, retrieve, and report can be a valuable and versatile source of information under the right circumstances. However, the human mind is a very complex and vulnerable observation instrument. And if we do not ask the right people the right questions in the right way, we will not get high-quality answers. This holds true for even the simplest of questions. An easy way to demonstrate this is to do a simple straw poll, like asking co-workers how they came to work. One may answer "By way of New York Avenue" or give some other route description. Another answer to the same question may be "by car pool." If you continued this straw poll, many of the answers would be unusable if your intent was to learn modes of transportation to work. Asking good questions in the right way--the focus of this paper--is both a science and an art. 
It is a science in that it uses many scientific principles developed from various fields of applied psychology, sociology, cognitive research, and evaluation research. It is an art because it requires clear and interesting writing and the ability to trade off or accommodate many competing requirements. For example, a precisely worded, well-qualified, unambiguous question may be stilted and hard to read. Questions must be clear, interesting, and easy to understand and answer. In addition to asking the right questions, evaluators need to be aware of other principles dealing with questionnaire design and administration that are also covered in this paper. OVERVIEW OF TASKS IN USING QUESTIONNAIRES ---------------------------------------------------------- Chapter 1:1 Using even a simple questionnaire is not always simple. Numerous major tasks must be completed in a logical sequence to develop and use a questionnaire. After deciding to use a questionnaire, evaluators must plan the questionnaire, develop measures, design the sample, develop and test the questionnaire, produce the questionnaire, prepare and distribute the mailout or interview packages, collect the data and follow up with nonrespondents, perform checks to ensure the quality of responses, and reduce and analyze the data. Figure 1.1 reviews these major tasks. Except for the data collection, these processes are very similar regardless of whether the questionnaire is to be designed for the mail or a telephone or face-to-face interview. When interviewers are used, however, they must also be trained, which adds another major task. Figure 1.1: Typical Completion Times for Major Questionnaire Tasks (See figure in printed edition.) After describing important factors to consider when deciding to use a questionnaire, we briefly cover, in the remaining sections of this chapter, the major tasks listed in figure 1.1 and refer to subsequent chapters that provide detailed instructions. We do this to give an overview of the scope of work required to plan, develop, and implement a questionnaire and to show what the reader can expect to find in each of the subsequent chapters. Overall, the organization of this paper parallels the logical sequence of tasks undertaken when developing and using questionnaires. DECIDING TO USE STRUCTURED QUESTIONNAIRES ---------------------------------------------------------- Chapter 1:2 One of the first decisions evaluators have to make is whether to use a questionnaire or some other method to collect the data for the job. In many situations, other data collection techniques may be superior. In fact, in recent years other techniques were recommended by technical design teams for about one of every three proposed GAO questionnaires. The decision to use questionnaires should be made only after carefully considering the comparative advantages and disadvantages of questionnaires, and of the various ways of administering them, relative to other data collection techniques. DATA CONSIDERATIONS -------------------------------------------------------- Chapter 1:2.1 Data can be collected in a variety of ways, such as field observations, reviews of records or published reports, interviews, and standardized mail, face-to-face, or telephone questionnaires. The selection of one technique over another involves trade-offs between staff requirements, costs, time constraints, and--most importantly--the depth and type of information needed.
For example, if the objective of the assignment is to determine the average per acre charge and the income derived from public grazing-land permit fees, the evaluator might consider using structured data collection forms or pro forma work papers to manually retrieve data from the case files in record storage. However, if the objective is to determine how much land the ranchers are willing to lease and how much per acre they are willing to pay, a mail, telephone, or face-to-face survey of ranchers would be necessary. Questionnaires are frequently used with sample survey strategies to answer descriptive and normative audit or evaluation questions. They are often less central in studies answering impact, or cause-and-effect, questions. While operational audits and impact, or cause-and-effect, studies are often not large-scale efforts, questionnaires can be used to confirm or expand their scope. Questionnaires can be useful when the evaluator needs a cost-effective way to collect a large amount of standardized information, when the information to be collected varies in complexity, when a large number of respondents are needed, when different populations are involved, and when the people in those populations are in widely separated locations. Furthermore, questionnaires are usually more versatile than other methods. They can be used to collect more types of information from a wider variety of sources than other methods because they use people, who can report facts, figures, amounts, statistics, dates, attitudes, opinions, experiences, events, assessments, and judgments during a single contact. People can answer for a specific type of source, such as members of a health maintenance organization, or for a variety of types of sources, such as local, state, and federal government officials. Questionnaires are difficult to use if the respondent population cannot be readily identified or if the information being sought is not widely distributed among the population of those who hold the knowledge. Furthermore, questionnaires should not be used if the respondents are likely to be unable or unwilling to answer or to provide accurate and unbiased answers or if the questions are inappropriate or compromising. In general, questionnaires should not be used to gather information that taxes the limitations of the respondent. Sometimes people are not knowledgeable or accurate reporters of certain kinds of information. They remember recent events much better than long-past events. They remember salient events and meaningful facts but do not remember details, dates, and routine or incidental events very well. For example, veterans might accurately report that doctors made medical examinations for Agent Orange effects on their eyes, ears, nose, throat, genitals, and pelvis but might substantially underreport skin examinations. If information on skin examinations were needed, other sources, such as medical records, might be more useful. However, there are exceptions, particularly when the respondents are highly motivated. Structured questionnaires are also not particularly well suited for broad, global, or exploratory questions. Because respondents have many different frames of reference, levels of knowledge, and question interpretations, the structured methodology limits the evaluators' ability to vary the focus, scope, depth, and direction of the line of inquiry. Such flexibility is necessary to accommodate variations in the respondents' perceptions and understanding that result from such questions.
Most of the people from whom GAO evaluators seek information are members of special populations, such as federal and state government employees, welfare recipients, or company executives. Unlike pollsters and market researchers, GAO evaluators rarely do a national population survey. Consequently, some of the mass survey techniques like random-digit dialing seldom apply to GAO work.\1 Also, GAO evaluators very rarely go back to the same population, and when they do, the time periods between surveys are so long that they usually have to redocument the population. -------------------- \1 Random-digit dialing refers to a telephone interview method that contacts people by dialing numbers at random. In some situations, usually when the population is hidden or not easily identified (for example, heads of households older than 65), this method may provide better access than other methods. ADMINISTRATION CONSIDERATIONS -------------------------------------------------------- Chapter 1:2.2 If after considering the pros and cons of using questionnaires, a questionnaire is still the method of choice for data collection, the evaluators need to consider the most appropriate method of administration. The appropriateness of the method of administration--whether it be mail, face-to-face interview, or telephone--varies with the resources and constraints of the job, the abilities and motivation of the respondent population, and the requirements of the evaluation. All three methods have comparative advantages and disadvantages, depending on the time and cost constraints of the job, the characteristics of the respondent population, and the nature of the inquiry. Mail questionnaires are usually more cost effective but require longer time periods than personal or telephone interviews. While mail questionnaires usually have higher development costs than telephone or face-to-face interviews, this is generally offset by the relatively inexpensive data collection costs. Mail questionnaires are the least labor intensive of the alternatives, with the labor costs limited to the effort needed to mail the questionnaire and track, follow up on, and edit the returns. Generally staff can mail hundreds of letters or edit scores of returns in a given day. Workers are not so productive with telephone and face-to-face interviews. Because of the difficulty in establishing telephone or personal interview contacts and the one-on-one nature of interviews, these alternatives require more staff time. Interviewers usually do not complete more than 10 or 12 telephone interviews or two or three face-to-face interviews in a day. Furthermore, the travel requirements for personal interviews can be very expensive when compared to postage or telephone charges. But mail questionnaires take longer to design and require longer periods for collecting and editing data than other choices. Extra care must be taken with the mail questionnaires because, unlike the other choices, there is no interviewer to help the respondent. Also, mail is a slow means of transmission, and mail questionnaires take two or three follow-ups. In summary, if money is tight and the subject matter can be phrased intelligibly for the respondent population, use the mail; if time is tight and staff time is not, use the face-to-face or telephone interview methods. In addition to subject matter, respondent characteristics play a key role in the method of choice. 
For example, if the respondents are motivated and literate and have normal vision, the mail is often the best option; otherwise, use the telephone or an interviewer. If respondents cannot be readily located by address or telephone number but gather at particular places (such as restaurants, parks, or hospitals), then a face-to-face interview is the only option. If the contact people are likely to conceal the identity of the intended respondent, and this is likely to make a difference, or if the evaluator is not sure that the intended respondent will get the questionnaire, then personal contact is better than telephone and telephone is better than mail. Also, if the respondent has a vested interest in giving biased reports that can readily be verified by inspection, then the face-to-face interview is the obvious choice. However, if the contact is likely to temporarily inconvenience the respondent or the respondent has privacy concerns, then a mail survey has the advantage over the remaining choices. Questionnaire characteristics also determine choice. Long, complex questionnaires designed to be answered by simple checks or short fill-in-the-blanks are better suited to self-administration than to the interview method. However, the converse is often true if the questions require respondents to compose more than very short answers (most people would rather speak than write). Also, if the questionnaire has many complex and confusing skips that frequently require respondents to answer some questions but not others, then one of the interview methods is preferable to a mail or self-administered questionnaire. In summary, evaluators should review the conditions and requirements of the data collection before deciding to use questionnaires and again before deciding the methods for administering the questionnaire. Mail questionnaires are a versatile, low-cost method of collecting detailed data. They are particularly adaptable to survey methods when the population is large, difficult to contact, likely to be inconvenienced, concerned about privacy, and widely dispersed. But mail questionnaires usually have a long turnaround time. The evaluators must be willing to invest the time required to carefully craft and test these questions. And the respondents must be willing, able, sufficiently literate, and unbiased enough to answer the queries accurately. Interview methods, while much more expensive and more prone to bias, help guard against respondent error, have a shorter turnaround time if sufficient staff is provided, and can be used to provide some interviewer verifications. PLANNING THE QUESTIONNAIRE ---------------------------------------------------------- Chapter 1:3 Once evaluators decide to use a questionnaire, planning starts with this paper, which provides information on the procedures necessary to do each of the major tasks to design and use questionnaires. The next step is to review the evaluation design and audit plan and then mentally walk the job through each procedure necessary to design and implement a questionnaire: developing the measures, designing the sample, developing and testing the questionnaire, producing the questionnaire, preparing the mailout or interview materials, and conducting the data collection, reduction, and analysis. A write-up of this mental walk-through, evaluated for comprehensiveness and feasibility, can serve as a basis for writing the implementation plan.
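One part of this walk-through is gauging what the chosen administration method implies for staffing. The sketch below is a minimal, illustrative calculation only, not a GAO procedure: it borrows the rough productivity figures cited earlier in this chapter (10 to 12 telephone interviews or 2 to 3 face-to-face interviews per interviewer per day, and mail handling measured in hundreds of pieces per staff day), while the survey size and the exact mail-handling rate are assumptions chosen for the example.

    # Rough staffing comparison across administration methods (Python).
    # The interview productivity figures paraphrase this chapter; the survey
    # size and the mail-handling rate of 200 pieces per staff day are
    # illustrative assumptions, not GAO standards.

    def staff_days(respondents, completions_per_staff_day):
        """Approximate staff days needed to handle all respondents once."""
        return respondents / completions_per_staff_day

    survey_size = 1500  # hypothetical number of respondents

    estimates = {
        "mail (assume 200 pieces handled per staff day)": staff_days(survey_size, 200),
        "telephone (10-12 interviews per day; use 11)": staff_days(survey_size, 11),
        "face-to-face (2-3 interviews per day; use 2.5)": staff_days(survey_size, 2.5),
    }

    for method, days in estimates.items():
        print(f"{method}: about {days:.0f} staff days")

Such a calculation covers only the contact labor; development time, travel, postage, follow-ups, and editing would still have to be added before total costs could be compared.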
DEVELOPING THE MEASURES ---------------------------------------------------------- Chapter 1:4 As evaluators do their planning, they will find that the scope of the effort is greatly influenced by information developed in the next two tasks--developing the measures and the sample design to ensure that the right questions are being asked of the right people. Remember that the questionnaire is an instrument used to take measures. To be sure it can do this, evaluators must first identify all the variables or conditions, criteria, causes, and effects that they want to measure. Next, evaluators analyze these variables and describe them so scientifically and precisely that they can be qualified, quantified, manipulated, and related. As explained in chapter 2, "Developing the Measures to Get the Questions," these measures define the requirements for the questionnaire. Questionnaires are designed by establishing a framework and sets of related questions that provide these measures. DESIGNING THE SAMPLE ---------------------------------------------------------- Chapter 1:5 Questionnaires are a way of asking the right people to take the measures needed to complete an evaluation. Before evaluators begin to write a question, it makes good sense to be sure they can find the people. The right people are representatives of a population who share the experiences the evaluators are interested in and who have or can get, will get, and will give them the information they need. Furthermore, evaluators must select these people scientifically, so the population these people represent can be talked about rather than just the individuals contacted. This is called a population survey, and how to do a population survey with questionnaires is explained in chapter 3, "Designing the Sample or Population for Data Collection." DEVELOPING AND TESTING THE QUESTIONNAIRE ---------------------------------------------------------- Chapter 1:6 Once the evaluators have established what to measure and who to ask to take the measures, they are ready to ask people to take these measures. Asking questions in the right way requires the evaluators to write sets of questions so that the answerer can easily understand precisely what information must be provided and, with little or no error, can easily provide this information. This means writing questions in a way that facilitates rather than interferes with the respondents' ability to understand the question and report the answer to the best of their ability. This simply stated task is deceptively complicated. To write good questions, evaluators must first understand something about the very complicated mental or cognitive process people use to answer questions. If evaluators access this cognitive process properly, the questionnaire can become a highly versatile and powerful instrument for observation and recall. If not, it can become a source of confusion and error. The sets of inquiries or questions must then be organized into a draft instrument. This questionnaire is then tested, reviewed, and revised until it is proven that as an instrument it takes the required measures. Since completing these tasks is perhaps the most difficult part of the job and consumes the most resources, we devote nine chapters (chapters 4-12) to explaining some of the many known and tested ways to do this work. Chapters 4-7 show how to facilitate the perception, acceptance, and understanding of the questions and how to help respondents recall their mentally stored information. 
In chapter 4, "Formatting the Questions," we show how to present the question in the precise format best suited to get the specific type of information requested. We demonstrate what respondents are likely to consider fair and unfair questions in chapter 5, "Avoiding Inappropriate Questions." In chapter 6, "Writing Clear Questions," we explain how to write a question that can be quickly, easily, and precisely understood by all respondents in the same way. And in chapter 7, "Developing Unscaled Response Lists," we explain how to write in a way that aids respondents as they cognitively search their minds to select the answers to questions. Chapters 8 and 9 deal with the problem of bias and error. This problem has two sources: the question writer and the question answerer. Chapter 8, "Minimizing Question Bias and Memory Error," illustrates many of the typical mistakes question writers make and how to avoid them. Chapter 9, "Minimizing Respondent Bias," explains the ranges of capacities and limitations that respondents have in answering questions and how to make the most of the respondents' abilities and minimize the risks and compromises posed by their shortcomings. Chapter 10, "Measurement Error and Measurement Scales in Brief," explains how to translate the question answers into qualitative and quantitative measures for use in GAO reports. Through chapter 10, we deal with how to write individual questions. However, when we put these individual questions together into a single questionnaire, they often interact with one another in a context that affects the meaning of the questions. Chapter 11, "Organizing the Line of Inquiry," shows how to organize these questions into a line of inquiry that can enhance the quality of the answers and minimize unintended and interfering effects. After finishing the first 11 chapters of this paper, evaluators should be able to help write the first draft of a questionnaire. But there is still much more to be done before evaluators can use this draft as a survey instrument. They should go through a quality-assurance procedure, which requires that the draft questionnaire be tested and validated. The methods for this task, and other quality assurance tasks carried out during data collection and analysis, are described in chapter 12, "Following Quality Assurance Procedures." PRODUCING THE QUESTIONNAIRE ---------------------------------------------------------- Chapter 1:7 Once the questionnaire has been tested and validated, and probably revised, the evaluators can put it in final form and use it to collect and analyze data to answer the assignment questions. Good questionnaires can be seriously compromised if they are not presented in a format that is easy to read and administer. Chapter 13, "Designing the Form and Layout," addresses this issue and shows the evaluator how to design the questionnaire type, format, and layout in a manner that greatly facilitates the user's ability to perceive and respond. PREPARING FOR AND COLLECTING DATA ---------------------------------------------------------- Chapter 1:8 Several administrative procedures, such as preparing the transmittal or contact letters, the mailout package, or the interviewers' kits, must precede data collection. Data collection methods then involve such activities as mailing, contacting, interviewing, tracking, and following up on nonresponses. Poor quality in the execution of these fundamental and very important activities can cut the response rate by as much as 50 percent.
To avoid this problem, we have documented procedures shown to be highly effective for mail surveys in chapter 14, "Preparing the Mail-out Package and Collecting and Reducing the Data." Activities needed to check, edit, and prepare the data for computer processing are also covered in this chapter. ANALYZING DATA ---------------------------------------------------------- Chapter 1:9 Chapter 15, "Analyzing Questionnaire Results," discusses some of the initial thinking and conceptualization that are important to the data analysis, including the development of a strategy and a plan for the data analysis. We do not describe data analysis methods since they are covered in Quantitative Data Analysis: An Introduction.\2 Chapter 15 concludes the discussion on using mail and self-administered questionnaires. -------------------- \2 U.S. General Accounting Office, Quantitative Data Analysis: An Introduction, GAO/PEMD-10.1.11 (Washington, D.C.: June 1992). TELEPHONE SURVEYS --------------------------------------------------------- Chapter 1:10 Personal or telephone interviews are also important and useful methods for collecting structured data for GAO assignments. While the methodology for asking good questions developed in this paper applies regardless of whether the questions are asked in a self-administered mode, such as by mail, or in some other mode, such as a face-to-face or telephone interview, certain limitations are specific to each administration method. Those that apply to conducting telephone surveys are discussed in the concluding chapter 16, "Adaptations for the Design and Use of Telephone Surveys." Further details on personal interviews are presented in Using Structured Interviewing Techniques.\3 -------------------- \3 U.S. General Accounting Office, Using Structured Interviewing Techniques, GAO/PEMD-10.1.5 (Washington, D.C.: July 1991). Some information relevant to conducting face-to-face interviews is presented in chapter 12 of this paper, in a section dealing with pretesting techniques. DEVELOPING THE MEASURES TO GET THE QUESTIONS ============================================================ Chapter 2 Deciding what and whom to ask appears to be a straightforward task. But appearances can be deceiving. And as we shall see in the next two chapters, this initial step must be carefully thought through and structured to an elemental level of detail. The decision about what and whom to ask lays the foundation for the focus and scope of the job; its level of difficulty, complexity, and risk; its completion times; and the data collection, analysis, and processing requirements and resources it will need. Hence, all the job plans are based on this decision. Furthermore, the three major sources of error--misspecification of variables, measurement error, and sampling error--are often introduced at this stage. In this chapter, we discuss methods for documenting what a questionnaire should ask. This documentation will be used to develop a framework for writing the questions, describing the variables in scientific terms necessary for measurement, developing the measures, and specifying the variable relationships in order to check for variable misspecification and measurement errors. In the next chapter, we discuss protocols for selecting the target population in ways that maintain the integrity of the design and minimize sampling error. Because deciding what to ask and deciding whom to ask it of are complex, we have described them in two chapters.
However, in actual practice, deciding what and whom to ask go hand in hand and are among the few tasks in survey research that must be done interactively and iteratively. This is because the questions we ask are determined by both the need for information and the respondent's ability to provide this information. To document the questionnaire framework, variable operationalizations, measures, and variable relationships, it is best to start with what we know about the requirements of the job and mentally work in two directions: first thinking in the abstract to integrate and conceptualize, and then shifting to more concrete logic to define and analyze. At the start, evaluators usually find that some of the information they will need is very global, general, and abstract and other information is highly specific. However, most of the information they have gathered is at a middle level of detail, and they can begin by working with what they have. Information should be available from the job design, audit plan, evaluation framework, and previously gathered background material. Evaluators should conceptualize and organize this information into a framework of inquiry or types of questions that can be developed to yield answers to the evaluation questions. Often they may have to do additional research or additional thinking to fill knowledge gaps. Next, they must go in the other direction and think more concretely and analytically. They must specifically describe or operationalize these information requirements and develop measures that will satisfy these requirements. Finally, they should integrate these conceptualizations and analyses into a format that presents the key relationships of the measurement variables. The process needed to develop each document product is described in the following sections. THE QUESTIONNAIRE FRAMEWORK ---------------------------------------------------------- Chapter 2:1 Initially, the evaluators decide what constructs, traits, conditions, or variables are to be measured and how to measure them. The documentation for this task is sometimes referred to as a questionnaire framework. The framework is usually depicted as a taxonomical classification. It is a scheme that lays out the evaluation questions and all the information required to answer each question with ordered and specified relationships. In essence, the framework provides a roadmap to identify and track the kind of data needed to answer the evaluation questions. A relatively uncomplicated example might be structured in response to the evaluation question "Is the size of the 4-year college associated with student performance?" The constructs (or the things evaluators want to measure) for college size and student performance and their relationships are identified for measurement development. The identification of these constructs and their relationships influences the choice of data collection sources, methods, and measures. For instance, in the example above, we can readily see that there are alternatives: the use of extant data from various national graduate record achievement score data bases, surveys of administrative and academic deans, and so on. And just as the choice of methods and sources will force a choice of measures, so will the choice of measures determine the methods and sources. Hence, these choices must be made interactively and iteratively. The relationship of college size to student performance was a simple example.
In this case, evaluators might have been able to proceed without committing measurement considerations to paper, but it is nearly impossible to plan complex questionnaires without documentation. For example, consider the following evaluation question: "What are the needs of earth-orbiting satellite image users?" The answer to this question requires many complex considerations, constructs, and measures, such as the identification of the different types of users (national and international scientists, political administrators, disaster managers, and earth resource managers); the identification of the national and international, geopolitical, and socioeconomic considerations that determine the type of use; measures of the quality of the satellite's information displays; and the relationships among the variables and constructs. This is a level of complexity that requires documentation. As we can see by the example, the framework identifies, specifies, and justifies the need for the information, constructs, variables, measures, and variable relationships that the evaluator wishes to collect data on. It is a scheme for documenting the information requirements. It is not a questionnaire but rather the basis for the questionnaire. OPERATIONALIZING THE CONSTRUCTS ---------------------------------------------------------- Chapter 2:2 So far we have talked in broad terms about ideas or concepts, traits or properties, and characteristics evaluators often like to measure--usually referred to as constructs. These constructs are not measures until the terms are specific enough to standardize. By "standardize," we mean that questions are designed and asked so that each recipient will understand and answer the same question in the same way. Different people reading the same questions need to have a common understanding. For example, one survey asked congresspersons about the "timeliness" of reports. Some respondents interpreted the construct "timeliness" as turnaround time, while others interpreted it as getting the report information in time to use it for legislative decisions. As we can see, standardizing is very important because it enhances the objectivity of the resulting measure. The first step toward standardization is to operationalize or to define the construct in concrete, specific, unambiguous, and contextual terms that reduce the measure to a single trait or characteristic. Failure to do this in the example citing the size of the college resulted in a misspecification of this variable. The respondents variously interpreted size of college as spring enrollment, fall enrollment, total spring and fall enrollment, total full-time plus part-time spring enrollment, total full-time and part-time fall enrollment, full-time equivalent enrollment, and so on. The construct should have been operationalized as the enumeration of both the total full-time enrollment and the total part-time enrollment as of the close of the spring 1992 semester or quarter. DEVELOPING MEASURES FROM OPERATIONALIZED CONSTRUCTS ---------------------------------------------------------- Chapter 2:3 Measures are developed by giving operationalized constructs a dimension. Measures qualify and sometimes quantify the trait in a single dimension, such as presence or absence; amount, intensity, or value; frequency of occurrence; or a ranking, rating, or some other form of comparative valuation or quantification.
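To make the path from construct to measure concrete, the following sketch records the college-enrollment operationalization discussed above as a small, structured specification. It is only an illustration of the kind of documentation an evaluator might keep; the field names and the Python form are assumptions made for this example, not part of GAO practice.

    # Illustrative record of an operationalized construct. The field names and
    # this Python form are assumptions for the sketch, not a GAO convention.
    from dataclasses import dataclass, field

    @dataclass
    class MeasureSpec:
        construct: str               # the idea or trait to be measured
        operational_definition: str  # concrete, specific, unambiguous terms
        dimension: str               # e.g., count, presence/absence, rating
        unit_of_analysis: str        # who or what the measure is taken on
        reference_period: str        # the time period the measure refers to
        anchors: list = field(default_factory=list)  # labeled scale points, if any

    # The college-size example from this chapter, written as a specification.
    college_size = MeasureSpec(
        construct="size of the 4-year college",
        operational_definition=(
            "total full-time enrollment plus total part-time enrollment "
            "as of the close of the spring 1992 semester or quarter"
        ),
        dimension="count of enrolled students",
        unit_of_analysis="4-year college",
        reference_period="close of the spring 1992 term",
    )

    print(college_size.operational_definition)

Writing the specification down this explicitly makes it easy to check that every construct in the framework has a single trait, a dimension, a unit of analysis, and a reference period before any question is drafted.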
The next few paragraphs will help familiarize the reader with some of the requirements of a measure. Although this familiarization will proceed in other chapters of this paper through discussion and example, evaluators should consult a text specifically devoted to measurement or consult a specialist when complex measures are required. Measures must be accurate, precise, valid, reliable, relevant, realistic, meaningful, comprehensive, and in some cases complementary, sensitive, and properly anchored. While evaluators may readily understand the meaning of precision and accuracy, some of the other terms may need to be defined, because in measurement they are used in a very special way. For instance, measures are considered valid if they are logical and they measure what they say they are measuring. They must adequately represent the trait in question. They must consistently predict outcomes, vary as expected in a variety of situations, and hold up against rigorous attempts to prove them invalid. We have all seen valid and questionable measures. Positive examples might be found in well-executed polls that predict voter outcome to a reasonable degree of accuracy. A negative example might be found in the logic of using complaints as a measure of discrimination, because the cost, time to resolve a case, difficulty in proving discrimination, difficulty in filing, fear of retaliation, and other reasons discourage the aggrieved from filing a complaint. Next, consider reliability, which is different from and independent of validity. To be reliable, a measure must give consistent results when repeated under similar conditions. For example, IQ tests and employee attitude surveys usually give consistent results when repeated under similar circumstances with the same people. Measures should be relevant, meaningful, and realistic. For example, some very valid measures like IQ and grade-point average are used to hire employees. These are not relevant measures if the new employee is expected to be creative and inventive and generate new ideas, because the traits of IQ, grade-point average, and creativity are not correlated. Also, the labels given to the measure should correctly describe and communicate its meaning. For example, managers frequently measure things like costs, staff time, and number of reports under the term "quality measures." These measures may index effectiveness or productivity but not quality. The measure should be realistic or practical. For example, if a reader's pupils are dilated, this might be a good measure of his or her interest, but these observations are very hard to make. Therefore, under certain circumstances, respondents' information recall and self-reports, while not as accurate, are more useful because the answers are easy to obtain. Ideally, measures ought to be comprehensive and, in some cases, complementary. Comprehensive measures span the entire range of values that are of interest with equal precision. A single measure usually refers to a single trait, but sometimes, if the construct is multidimensional or has several traits, we develop a measure that captures these multitrait effects, for reasons of economy or because of the need to capture two or more traits as they work together. For example, asking the respondent if the text was easy to read and readily understandable might be considered a comprehensive measure. In contrast, complementary measures are measures that are distinct and must be taken together to reflect the construct.
For example, the number of contrast shades and the sharpness of the contour lines are needed to measure photographic image quality. Other features of measures that are also important are sensitivity and anchoring. Sensitivity refers to a measure's ability to detect (1) the presence or absence of the trait, (2) levels of intensity, or (3) changes in the level of intensity with sufficient precision at sufficiently low levels to meet the needs of the evaluation. Anchoring refers to the establishment of clear, concrete points on the measurement scale that are meaningful to the respondent. That is, the scale should have meaningful starting, interim, mid, and end points. For example, we might anchor estimations of lighting quality as too dim (not bright enough to read a newspaper), appropriate (could comfortably read a newspaper), or too bright (too much glare to comfortably read a newspaper). An example of a complex measure taken from one of the cases cited in the preceding part of this chapter is presented in figure 2.1. The measure was developed from a construct identified with a questionnaire framework: the user's perception of the quality of an earth-orbiting satellite image. The construct was operationalized and developed into a measure of image quality. During this process, particular attention was given to accuracy, precision, validity, reliability, realism of application, meaningfulness of concept, the comprehensiveness and complementarity requirements, measure sensitivity, and anchoring of the measure. Figure 2.1: Operationalized Variable in Question Response Format (See figure in printed edition.) SPECIFYING THE KEY VARIABLE RELATIONSHIPS ---------------------------------------------------------- Chapter 2:4 We conclude this chapter with a brief but important discussion on specifying the variable relationships to be evaluated. (The two remaining sets of documentation needed to initiate the planning--the identification of and the selection of the target population--are discussed in the next chapter.) This task is important because, as we shall see, errors or omissions in specifying the variable relationships can either invalidate or weaken the evaluation. In this task, evaluators document and review all variables to ensure that all key variable relationships are included and are specified with common units of analysis, appropriate functional relationships, and appropriate measurement stratifications and time periods, so as to permit statistical, temporal, and cross-sectional observations and comparisons. These variable relationships should be documented down to the level of measurement specification. Then the evaluation design, the evaluation framework, and the questionnaire framework should be checked against this documentation to make sure nothing important is left out and that nothing unnecessary is included. A review should ensure that the sample or population measurements are to be taken on--and generalized to--common units of analysis. For example, in one case we found that one measure was to be taken on contractors, while its comparison measures applied to contracts. A review should be made for changes that would facilitate statistical comparability.
For instance, the evaluators may find that one of the measures to be related is unnecessarily categorized while the other is continuous, or that some measures are inappropriately categorized for the intended cross sectional comparisons, thus weakening the statistical power of the analysis or, worse yet, rendering the analysis invalid. Further, review should make sure the specified categories in the comparison variables are not likely to confound cross-sectional comparisons. For instance, suppose we know from past studies that the effects of training are not likely to be noticed until 9 months later, there is less bias against the mentally disabled in the city than in the suburbs, or treatment for violence exposure is most effective soon after the incident. If evaluators test for training effect soon after the training, they may not see an influence because the trainees did not have enough development time to assimilate their experience. If the test is for bias against the mentally disabled only in the inner city rather than in both the inner city and the suburbs, the evaluators may not find the effect because this bias is less noticeable in the inner city. If they test for the effects of treatment for exposure to violence on only those who waited a year before receiving treatment, they may not see the effect because the treatment was given too late to do much good. Hence, evaluators must make sure that the cross-sectional comparison categories are structured to capture, not hide, the effects under study. Another point is to make sure the temporal comparisons are appropriate. For example, it is not unusual to find that the data for the different variables in the relationship are to be collected during different years. Finally, it is important to be sure important categories were not left out. This is because the sampling specialists will use this documentation to design the sample. For instance, in one case the evaluator was disappointed to find that the sample did not have enough power to compare important city, race, and educational stratifications because the sampling specialist had not been aware of these stratifications. DESIGNING THE SAMPLE OR POPULATION FOR DATA COLLECTION ============================================================ Chapter 3 Along with deciding what to ask, evaluators must decide who to ask. The people questioned must have the information the evaluators seek, they must be readily identifiable and accessible, they must be willing and able to answer, and they must be representative of the population being measured. They can be migrant workers, prisoners, police, scientists, medical doctors, commanders or soldiers, inner city African American youths, or government officials. Ideally, everyone in the population should be questioned, and sometimes this is done if the population is very small. But usually the best that can be done is to take a sample of these people and generalize the findings to the population they come from. In theory, to generalize findings, evaluators must first define the population. Then they should enumerate every unit in the population in a way such that every unit has an equal chance of being selected for the sample. In practice, it may be unrealistic to expect to enumerate every unit in a real population (for example, all persons who participated in a government program such as Head Start), but the enumeration must be reasonably complete and accurate and be reasonably representative of the actual population. 
The evaluators must then draw a representative sample from this population. SURVEY POPULATION ---------------------------------------------------------- Chapter 3:1 However, the sample cannot be determined or drawn until the evaluators have studied the size and characteristics of the population they want to know about. All too often, this step in questionnaire development is overlooked or assumed to be routine. Then, when the questionnaire is complete and ready to be mailed, the team is faced with weeks of hard research, or a major redesign, because the sample was not well founded. The first step in defining the survey population is to learn about the population distribution--the major categories of units and the numbers in each category. For example, if the evaluators want to sample banks, they should learn the differences between county, regional, statewide, branch, and unit banks; they should know geographic location factors and understand the basis for classifying banks as very large, large, medium, and small. If they are studying unit commanders in the armed services, they should know the unit sizes and types and the variations among the services. This research will help in designing sampling factors, such as stratification and stratification size, and will ensure a representative sample. Once the evaluators are familiar with the characteristics of the population, they can look for sources that enumerate each unit in the population or develop a reasonable theory for selecting the sampling units. The enumeration should be accurate, up-to-date, and organized to reflect the distribution characteristics. Sometimes this task is relatively easy. For example, in one project we needed to assess the effect that the Foreign Corrupt Practices Act had on U.S. business. The act prohibits payments to foreign officials if the purpose is to influence business. The population was the U.S. companies that conduct most of the foreign business. These companies were readily identified because they were among the Fortune 1,000 companies, which account for most of that business. All we had to do was buy this list from Fortune magazine. The list ordered the companies by sales volume and provided information on each company's activities and the name and address of both the chief executive officer and the chairman of the board. However, for many other projects, considerable effort is needed to document the survey population. In practice, evaluators rarely have a list of the real population; at best, they have a list that was accurate only at the time the source material was current. By the time the questionnaire is administered, some units will have left the population and others will have joined it. For example, in the Fortune 1,000 evaluation, 6 percent of the firms left the population, and we do not know how many may have joined it. The sample analysis must evaluate and make statistical adjustments for the losses. Whenever possible, the effect of the additions should also be considered. The best way to start enumerating a population is to talk to experts in the field and search out likely organizations, archives, directories, libraries, and management information systems until a reliable source has been discovered. Then the sampling units or population elements are organized, reorganized, or indexed into groups or frames, so they can be reached by a random, systematic, or prescribed process. For example, in one evaluation, we had to locate retired military users of military medical facilities.
From a Department of Defense archival data base we were able to get the names and addresses of all the retired military personnel, but we had no way of knowing whether they were users of a particular medical facility. Our field work showed that retired military personnel were likely to travel up to 40 miles to use hospital services; if they lived farther away, they usually made other arrangements. So we developed a computer program, based on zip codes, that matched persons to the hospitals that were within 40 miles of their homes. In a study of zoning problems encountered by group homes for the mentally disabled, we discovered that there was no national register of group homes. Since this was a study to see whether this restrictive zoning practice was geographically widespread, we sampled catchment areas. We then called the catchment area directors, got the names and addresses of every group home in each catchment area, and sent the group home directors a questionnaire asking about their zoning problems. Sometimes, no matter how hard the search, archival data or records cannot be found from which to develop a population. When this happens, the best thing to do is to look for groups, sections, or clusters of files or lists that contain the information. Or the evaluators may want to look at existing data to surmise some ratio or relationship associated with the population. For example, if they want to define the population of general aviation flight-service airport specialists, they may be able to use previous work or pilot or survey studies. From previous experience, they may estimate that the average number of specialists per airport is 16, multiply 16 specialists by the 316 airports, and estimate the population at about 5,000. Unfortunately, in a great many cases, there is neither a population enumeration nor a way to get cluster, unit, or ratio figures. In these cases, the evaluators must try to document the largest possible portion of the most important and most representative cases, or they must develop some reasonable theory for selecting the sampling units. For example, to get a representative list of internal auditors, the evaluators might use the membership list for the Institute of Internal Auditors plus a list of the internal audit departments for the Fortune 1,000 companies. The latter would be included because most of them have internal audit departments. In one situation, we had to sample major importers and exporters. The available list had over 10,000 entries, almost all of which were too small to be considered major. So we used a combination of a "small world network" and a "snowball" approach. We found an association on the eastern coast to which most major mid-Atlantic shippers belonged. We contacted the association and obtained a list of the major shippers and their business volume. This association identified two other shippers' associations, which provided their lists and the names of six more associations. We continued until we had identified all associations and had a list of most of the major shippers. The shippers' associations reviewed our list and estimated that it accounted for 82 percent of the import-export business. Many other sources of specialized lists are available, but their reliability varies considerably.
For example, major organizations such as the American Medical Association, the National Education Association, and the National Association for Home Builders can provide detailed address lists and population descriptions of their members. However, their cooperation varies with their interest in what the job is about. The cost for lists can be anything from nothing to a few hundred to several thousand dollars. Although the Bureau of the Census sometimes has useful lists, such as the census of manufacturers and the census of governments, these sources may be out of date. Many commercial sources, such as Ruben and Donnelly, Polk, and Thomas, sell population lists. Also, some commercial firms such as Dunn and Bradstreet sell specialized lists for various users, such as mail order companies. Care must be taken in using these lists because their quality varies considerably and very little may be known about the bias built into them, how they were developed, or what they include and, more importantly, exclude. Before using a list, it is a good idea to review and perhaps test it. For example, in a sample survey of farmers, the address list was developed from a list of subscribers to the Farm Home Journal. The list turned out to be several years old, and many of the subscribers were not farmers in the technical sense but people who sold or bought agricultural equipment or products or who were interested in rural living. SELECTING THE SAMPLE ---------------------------------------------------------- Chapter 3:2 Once the population has been enumerated and the evaluators are sure that it represents the population to which they want to generalize, they are ready to draw the sample. The sample must be drawn in accordance with a procedure that ensures a random selection. The sample size must be large enough to provide the degree of measurement precision and accuracy generally accepted by the scientific community. This must be done very efficiently and cost effectively. In many instances, accomplishing this will require the assistance of a sampling statistician who has the appropriate technical skills and practical experience.\1 -------------------- \1 See U.S. General Accounting Office, Using Statistical Sampling, GAO/PEMD-10.1.6 (Washington, D.C.: May 1992). This paper provides a thorough treatment of this topic. NONSTATISTICAL SAMPLING ---------------------------------------------------------- Chapter 3:3 Questionnaires may be used on projects in which statistical sampling is not used, so we need to consider briefly other ways in which evaluators select cases (Deming, 1960). Either all the cases can be studied--that is, a census can be taken--or part of the population can be selected in a nonstatistical manner. When evaluators take part of the population, they usually do so for a reason. It may be that they are doing a case study, so they select one or more cases that provide the best opportunity to observe the phenomena or relationships of interest, and they do not need to generalize their findings to the population. In other situations, the evaluators know very little about the population and cannot draw a statistical sample, so they arbitrarily select as many cases as they can and report the findings. However, in many situations, evaluators want to generalize and they know something about the population but it is just not feasible to draw statistical samples. 
So they pick a sample that they hope will correspond, in its features, to the population, even though they know they will not be able to use the powerful reasoning associated with statistical samples. An important category of nonstatistical sampling is "judgment sampling." A judgment sample draws its name from the fact that in the judgment of the evaluator, the cases chosen correspond to certain aspects of the population. The cases may be selected because they are judged most typical, because they represent the extreme ranges, because they represent a known part of the population, or because they simulate or act as a proxy for a representative sample from the population. For example, we could interview all the Fortune 500 chief executive officers in New York and Chicago because we believe that this sample is typical of chief executive officers in large companies. We could study selected group homes for the mentally disabled in California, Mississippi, New York, and Texas, because these states represent the extremes of the laws and practices. We could study 50 prime contractors with the Department of Defense in California and New York, because these contractors account for 82 percent of all defense contracts. We might pick 15 airports in 11 states, such that the sample would be similar to the population of airports with respect to size, geographic coverage, and weather conditions. As a rule, the use of judgment sampling in a project in which the intent is to generalize is ill advised, because arguments to support generalization cannot be nearly as persuasive as with statistical samples. However, occasions may arise (as with a very homogeneous population) in which the situation is not altogether bleak. When the validity of the findings depends on the extent to which they can be generalized to the population, and when there is no statistical sample, it might help to have some rule of thumb that might compare judgment samples to statistical samples. One way to picture the relationship between statistical samples and judgment samples with respect to representativeness might be to imagine a credibility scale from 1 to 10. Assume that a score of 1 is the value given to a single case study designed without intent whatsoever to generalize, and 10 is the credibility associated with studying the whole population. A very large, statistically valid random sample might yield a value of 9. A large, medium, and very small but statistically valid random sample might yield respective scores of 8, 7, and 6. If we made many case studies but did not take a random sample, we might get a value of 4. We might extend this value to 5 if the groups were large enough to provide statistical certainty within their limited area of selection or if the population was very homogeneous. We might get the same score of 5 if we selected a number of cases that represented the range of conditions and circumstances that apply to the population. (Incidentally, this is how pretest candidates are selected, because there is neither time nor resources to draw a statistically valid sample.) However, the score would drop to 3 or even 2 if we selected many or fewer cases without giving consideration to representing the expected range of conditions. A few years ago, we did a review of the elderly in which we selected thousands of cases at random from the same city. 
This might have been acceptable, from a generalization viewpoint, if we were measuring the conditions associated with cholesterol levels; these levels could be presumed similar for most U.S. city-dwellers. However, in this review, we were concerned about programs and their effects, which may have varied from city to city. Thus, limiting the sample to one city prohibited generalizations beyond the city that was studied. Another example involved a population of 132 health maintenance organizations. We arbitrarily picked 16 of these organizations and collected data from hundreds of people in each one. In the end, what we came up with was a set of 16 case studies. Although the sample for each case study was representative of the population of people in one of the 132 health maintenance organizations, the 16 case studies together permitted only very careful and limited findings. We might have had a much more powerful evaluation at a fraction of the cost if we had taken a random sample of organizations and looked at fewer cases within each organization.

FORMATTING THE QUESTIONS
============================================================ Chapter 4

Before writing the questionnaire, the evaluators need to choose the format for each question. Each format presented in this chapter serves a specific purpose that should coincide with the available information and data analysis needs.

OPEN-ENDED QUESTIONS
---------------------------------------------------------- Chapter 4:1

Open-ended questions are easy to write and require very little knowledge of the subject. All the evaluators have to do is ask a question, such as "What factors do you consider when you pick a carrier?" But this type of question provides a very unstandardized, often incomplete, and ambiguous answer, and it is very difficult to use such answers in a quantitative analysis. Respondents will write some salient factors that they happen to think of (for example, lower rates and faster transit time) but will leave out some important factors because at that moment they did not think of them. Open-ended questions do not help respondents consider a range of factors; rather, they depend on the respondents' unaided recall. There is no way of knowing what was important but not recalled, and because not all respondents consider the same set of factors, it may be extremely difficult or impossible to aggregate the responses. Also, the evaluators may not know how to interpret the answers. For example, people might say they choose a carrier because it is more convenient or less trouble. There is no way of knowing what this means. It may mean anything from faster transit time to easier documentation. Another problem is that open-ended questions cannot easily be tabulated. Rather, a complicated process called "content analysis" must be used, in which someone reads and rereads a substantial number of the written responses, identifies the major categories of themes, and develops rules for assigning responses to these categories. Then the entire sample has to be gone through to categorize each answer. Because people interpret differently, three or four people have to categorize the answers independently. Furthermore, rules must be developed to handle disagreements and to measure interrater reliability.\1 Even then, at the conclusion of the data reduction phase, only very low levels of quantitative analysis can be performed. Still another problem is that open-ended questions substantially increase response burden.
They usually take several minutes to answer, rather than a few seconds. Because respondents must compose and organize their thoughts and then try to express them in concise English, they are much less likely to answer. However, open-ended questions do sometimes have advantages. It may happen that they are unavoidable when, for example, we are uncertain about criteria or we are engaged in exploratory work. If we ask enough people an open-ended question, we can develop a list of alternatives for closed-ended questions. We can also use open-ended questions to make sure our list of structured alternatives did not omit an important item or qualification. We can also ask open-ended questions to obtain responses that might further clarify the meaning of answers to closed-ended questions or to gather respondent examples that can be used to illustrate points. The rest of this chapter details closed-ended questions, because they are the meat and potatoes of our work. -------------------- \1 Interrater reliability is a measure of the consistency among the people categorizing the answers. FILL-IN-THE-BLANK QUESTIONS ---------------------------------------------------------- Chapter 4:2 Each questionnaire usually has some fill-in-the-blank questions. They are not open-ended because the blanks are accompanied by parenthetical directions that specify the units in which the respondent is to answer. Some examples are shown in figure 4.1. Figure 4.1: Fill-in-the-Blank Questions (See figure in printed edition.) Fill-in-the-blank questions should be reserved for very specific requests. The instructions should be explicit and should specify the answer units. Sometimes, several fill-in-the-blank questions are asked at once in a row, column, or matrix format, as shown in the examples presented in figure 4.2. Figure 4.2: Fill-in-the-Blank Row, Column, and Matrix Formats (See figure in printed edition.) YES-NO QUESTIONS ---------------------------------------------------------- Chapter 4:3 Unfortunately, yes-no questions are very popular. Although they have some advantages, they have many problems and few uses. Yes-no questions are ideal for dichotomous variables, such as black and white, because they measure whether the condition or trait is present or absent. They are therefore very good for filters in the line of questioning and can be used to move respondents to the questions that apply to them, as in figure 4.3. Figure 4.3: Yes-No Filter Question (See figure in printed edition.) However, most of the questions GAO asks deal with measures that are not absolute or measures that span a range of values and conditions. Consider the question: "Were the terms of the contracts clear?" Most people would have trouble with this question because it involves several different considerations. First, some contracts may have been clear and others may not have been. Second, some contracts may have been neither clear nor unclear or of marginal clarity. Third, parts of some contracts may have been clear and others not clear. Because so little information is obtained from each yes-no question, several rounds of questions individually have to be administered to get the information needed. "Did you have a plan?" "Was the plan in writing?" "Was it a formal plan?" "Was it approved?" This method of inquiry is usually so boring as to discourage respondents. Sometimes, question writers try to compress their line of inquiry and cause serious item-construction flaws. They ask for two things at once--a double-barreled question. 
For instance, a yes-no answer to "Did you get mission and site support training?" is imprecise. How do respondents answer if they got mission but not site support training? A related question-writing mistake is mixing yes-no and multiple choice. See figure 4.4. Figure 4.4: Mixed Yes-No and Multiple Choice Question (See figure in printed edition.) The example in figure 4.4 has several problems. The question and the response space do not agree. This slows up the cognitive processing because the question prepares the reader for a simple yes-no answer. But in reality the reader gets not a yes-no answer space but, rather, a list of qualified alternatives. The response alternatives are biased toward "yes" because most of the choices have "yes" in them. Furthermore, "no" in the last item cannot be used with the correlative conjunction "neither nor," because this is an unintended double negative. Such questions make a simple inquiry difficult because they are counter to the cognitive process, burdensome, and cause errors. Yes-no questions are prone to bias and misinterpretation for several reasons. First, many people like to say "yes." Some have the opposite bias and like to say "no." Second, questions such as "Do you submit reports?" have what is called an "inferred bias" toward the "yes" response. The most common way to counter this bias is to add the negative alternative--for example, "Do you submit reports or not?" However, if this is done, the use of yes-no choices in the answer must be qualified or avoided. Without this precaution, a simple "yes" answer may be read as applying to both parts of the question, "Yes, I submit" and "Yes, I do not submit." A simple "No" might also be read as "No, I do not submit"--a double negative. To prevent confusion, qualify the answer choices or avoid yes-no answers. See figure 4.5. Figure 4.5: Balanced and Unambiguous Yes-No Question (See figure in printed edition.) "IMPLIED NO" CHOICES ---------------------------------------------------------- Chapter 4:4 In figure 4.6, failure to check an item implies "no." The implied-no choice format is used because it is easy to read and quick to answer. Figure 4.6: "Implied No" Question (See figure in printed edition.) When evaluators want to emphasize the "no" alternative, they can expand the implied-no format to include one column for "yes" answers and one for "no." "No" is listed as an option when the respondent might not answer or might overlook part of the question, as when the choices are difficult, the list of items is long, or the respondent's recollection is taxed. If "no" is not included as an alternative, no's will be overreported, because the analysts will not be able to differentiate real no's from omissions and nonresponses. An example appears in figure 4.7. Figure 4.7: Emphasized-No Question (See figure in printed edition.) SINGLE-ITEM CHOICES ---------------------------------------------------------- Chapter 4:5 In single-item choices, respondents choose not "yes" or "no" but one of two or more alternatives. See figure 4.8 for an example. Since yes-no and single-item choices are similar, they have the same types of problems, but the difficulties are less pronounced in some respects and accentuated in others. Figure 4.8: Single-Item Choice Question (See figure in printed edition.) On the positive side, the differences between the choices are usually clear, and the writer can set up a truly dichotomous question. If used carefully, the single-item choice can be efficient. 
It often serves to filter people out or to skip them through parts of the questionnaire. It is not likely to be overused and cause excessive cycles of repetition. Furthermore, the question writer is not likely to compress the question into a double-barreled item. The single-choice format is also not subject to bias from yea-sayers or nay-sayers. And eliminating the negative alternative reduces misinterpretation. But there are problems. In the single-choice format, the writer is more apt to bias one of the choices by understating or overstating it. Some writers may not properly emphasize the second alternative; others, aware of this tendency, overcompensate.

EXPANDED YES-NO QUESTIONS
---------------------------------------------------------- Chapter 4:6

One way around the yes-no constraints is to use an expanded yes-no format like that shown in figure 4.9. The expanded yes-no format gives a measure of intensity, avoids some of the biases common to yes-no, implied-no, and single-choice questions, and resolves the problem of quibbling. Consider the question, "Could you have gotten through college without a loan or not?" In the expanded format, students who might otherwise quibble over an unqualified "yes" or "no" can qualify their answers, and more of them will answer in the negative than they otherwise would. Figure 4.9: Expanded Yes-No Format (See figure in printed edition.) The expanded alternatives can have qualifiers other than "probably yes" and "probably no." Qualifiers can be changed to meet the situation--"generally yes" and "generally no" or "for the most part yes" and "for the most part no."

FREE CHOICES
---------------------------------------------------------- Chapter 4:7

Yes-no, implied-no, single-choice, and expanded formats are forced choices in that respondents must answer one way or the other. Forced-choice items generally simplify measurement and analysis because they divide the population clearly into those who do and those who do not or those who have and those who have not. Unfortunately, putting the population into just two camps may also oversimplify the picture and yield error, bias, and unreliable answers. To avoid this problem and to reduce the respondent's burden, a middle category can be added, as in the question in figure 4.10. Figure 4.10: Expanded Yes-No Format With Middle Category (See figure in printed edition.) Even though the proportion of yes's to no's will not change, the evaluators will have a better measure of the yes-no polarization, because the middle category absorbs those who are uncertain. A good rule of thumb is that if we are not certain that nearly everyone can make a clear choice, we include a middle category. Usually, the question asker will also put in an "escape choice" to filter out those for whom the question is not relevant. Examples are "not applicable," "no basis to judge," "have not considered the issue," and "can't recall." See figure 4.11. Figure 4.11: Expanded Yes-No Format With Escape Choice (See figure in printed edition.)

MULTIPLE-CHOICE QUESTIONS
---------------------------------------------------------- Chapter 4:8

The most efficient format--and the most difficult to design--is the multiple-choice question. The respondent is exposed to a range of choices and must pick one or more, as in the example in figure 4.12. Figure 4.12: Multiple-Choice Question (See figure in printed edition.) Multiple-choice questions are difficult to write because the writer must provide a comprehensive range of nonoverlapping choices. They must be a logical and reasonable grouping of the types of experience the respondents are likely to have encountered.
The example in figure 4.12 turned out to be flawed in practice. We learned during the pretest that we had left out some important choices. We detected this error because many respondents wrote answers in the "other" category. Because this format is very important and requires the most research, field work, and testing, and because the analysis and interpretation can be complex, we discuss multiple-choice question design in chapter 7 in considerably more detail. RANKING AND RATING QUESTIONS ---------------------------------------------------------- Chapter 4:9 Ranking questions are used to make very difficult distinctions between things that are of nearly equal value. The question forces the respondent to value one alternative over another no matter how close they are. The value that is assigned is a relative value. Rating questions are used when the alternatives are likely to vary somewhat in value and when evaluators want to know how valuable the alternative is rather than if it is a little more or less valuable than the next alternative. First consider ranking. In ranking, the respondents are asked to tell which alternative has the highest value, which has the second highest, and so on. They rank the choices with respect to one another, but their answers tell little about the intrinsic value of their choices. For example, suppose we asked respondents to rank the importance of the following services for institutionalized children: education, health care, lawn care, telephones, and choir practice. They would be hard put to choose between education and health care, because both are essential to the children's development. But they would have to rank one first and one second. Telephones would probably be ranked third. Compared to health care and education, telephones are much less important, yet they are ranked third just behind two services that are so important that it is difficult to choose between them. Ranking starts to get hard for people when there are more than seven categories. This is because they can usually pick the first and second and third and then the last and next to the last and the next to the next to last, so that what is left is the middle. But for more than seven items, respondents begin to lose track of where they are with respect to the first, last, and middle positions. When this happens, they make mistakes. For more than seven items, respondents can be given special task-taking procedures to counter this problem. But this procedure is rather burdensome. Also, ranking questions have to be written very carefully. The slightest lapse in clarity in the question or the instruction given will cause some people to rank in the reverse order or to assign two alternatives the same rank or to forget to rank every alternative. Nonetheless, ranking must sometimes be used. The example in figure 4.13 is one that has worked reasonably well. Respondents will make a few errors, but statistical procedures are available to handle them. Figure 4.13: Ranking Question (See figure in printed edition.) Rating questions are perhaps our most useful format because we usually want to know the actual or absolute value of the trait we are measuring. Ratings are assigned solely on the basis of the score's absolute position within a range of possible values. For example, a rating scale might be assigned the following categories: of little importance, somewhat important, moderately important, and so on. 
In writing rating questions, we should try to categorize the scales in equal intervals and anchor the scale positions whenever possible. Aside from the scaling, rating questions are easier to write properly and cause less error than ranking questions. We can see from the two examples of the rating format shown in figure 4.14 that ratings provide an adequate level of quantification for most purposes. We can also see by comparing the examples in figures 4.13 and 4.14 that rating formats are far less cumbersome than ranking formats. Figure 4.14: Rating Questions (See figure in printed edition.) GUTTMAN FORMAT --------------------------------------------------------- Chapter 4:10 In questions written in the Guttman format, the alternatives increase in comprehensiveness; that is, the higher-valued alternatives include the lower-valued alternatives. Applying this principle in one job, we asked state resource officials how they benefited from an earth-orbiting satellite. The question is given in figure 4.15. Here we assumed that if respondents had measured the benefit, they had identified it, and if they had determined the cost-benefit ratio, they had measured the primary and secondary benefits and lack of benefits as well as the worth or dollar value of these benefits and lack of benefits. Figure 4.15: Guttman Question (See figure in printed edition.) INTENSITY SCALE QUESTIONS --------------------------------------------------------- Chapter 4:11 The intensity scale format is usually used to measure the strength of an attitude or an opinion. Two popular versions, the extent and expanded yes-no scales, are presented in figures 4.16. Figure 4.16: Extent Scale and the Expanded Yes-No Scale Questions (See figure in printed edition.) LIKERT SCALE ------------------------------------------------------- Chapter 4:11.1 Another frequently used intensity scale format is the Likert or agree-or-disagree scale. The Likert scale is easy to construct. Consider the extent-scale example of figure 4.16. As shown in figure 4.17, all the question writer has to do is convert the question into a statement and follow it with agree-or-disagree choices. Figure 4.17: Extent Scale Converted to Likert Scale Question (See figure in printed edition.) However, if the writer is not careful, the simplicity and adaptability of the Likert scale format are often paid for by greater error and threats to validity. First, there is bias. The Likert scale presents only one side of an argument, and some people have a natural tendency to agree with the "status quo" or the argument presented. Writers of Likert scale questions could attempt to counter this bias error by presenting the converse statement also. For example, they would first ask for a response to "My boss does not let me participate in decisions (agree or disagree)." Then in a subsequent part of the questionnaire, they have to ask their questions in reverse: "My boss lets me participate in decisions (agree or disagree)." But now the line of inquiry is no longer concise or simple. The questions are doubled in number with a serial repetitive format that interferes with the cognitive recall process, aside from inhibiting motivation because these formats quickly become boring. Furthermore, developing precise converse statements of counterbalancing intensity can be difficult and complex. For example, "not satisfied" is not necessarily the opposite of "satisfied." 
And in the example above, the phrase "My boss does not let me participate" is much more negative than the phrase "My boss lets me participate" is positive. Another problem is that the extent of the respondent's agreement or disagreement with a statement may not correspond directly to the strength of the respondent's attitude about the Likert statement posed in the question. The respondent may consider the statement either true or false and respond as if the question were in an "either or" format rather than a graduated scale measuring the intensity of a belief. The Likert question uses the statement as a reference point or anchor. Hence, what is measured may be not the strength of the respondent's attitude over the complete range of intensities but, rather, the range of intensities bounded or referenced by the position of the anchoring statement at one end of the range and unbounded at the other end of the range. To complicate things even more, the single-bounding anchor may not be at the extreme end of the range; this makes comparisons among items very difficult. The point is that the indirect approach in the Likert scale may produce misleading results for a variety of reasons. It is usually better to use a direct approach that measures the strength of the respondent's actual attitude over a complete range of intensities. For example, it is better to reformulate the item from "My boss never lets me participate" to "To what extent, if at all, do you participate?" However, one situation in which the Likert scale is very useful is when extent of agreement or disagreement is closely and directly related to the statement. For instance, the respondent may be asked about the extent to which he or she agrees or disagrees with a policy, as in figure 4.18. Figure 4.18: Likert Question Used to Evaluate Policy (See figure in printed edition.)

AMOUNT AND FREQUENCY INTENSITY SCALES
------------------------------------------------------- Chapter 4:11.2

Many questions ask the respondent to "quantify" either amounts or frequencies. These are relatively simple. They use certain descriptive words to characterize the amount, frequency, or number of items being measured. For example, traits like "help," "hindrance," "effect," "increase," or "decrease" can be quantified by adding "little," "some," "moderate," "great," or "very great." Certain adjectives like "some" and "great" have a stable and relatively precise level of quantification. For instance, "some" is usually considered to be about 25 percent of the amount shown on the scale, and a "great" amount is usually considered to be about 75 percent. Sometimes such adverbs as "very" and "extremely" are used. Quantities can also be implied by the sequence of numbered alternatives ordered with respect to increasing or decreasing intensity. See figure 4.19, which, as is the common practice, uses both methods together. Figure 4.19: Amount Intensity Scale (See figure in printed edition.) Frequencies or occurrences of events are treated the same way. Question writers know that, to most people, a word like "sometimes" means about one fourth of the time (25 percent), while "a great many" or "very often" means about three fourths of the time (75 percent). Similarly, words like "about half" and "moderate" anchor the midpoints. As with amount intensity scales, it is important to use both numbered, ordered scalar presentations and words to quantify the scale intervals. See figure 4.20. Figure 4.20: Frequency Intensity Scale (See figure in printed edition.)
In many amount and frequency measures, where ambiguities are likely to occur, it is also important to use proportional anchors such as fractions and percents or verbal descriptive anchors such as once a day or once a month in addition to the adjective and scale number anchors. Examples are shown in figure 4.21. Figure 4.21: Frequency and Amount Intensity Scales With Proportional and Verbal Descriptive Anchors in Addition to the Conventional Adjective and Scale Number Anchors (See figure in printed edition.)

BRANCHING INTENSITY SCALE FORMATS
------------------------------------------------------- Chapter 4:11.3

So far, all the examples have illustrated nonbranching formats. However, even more precise measures can be obtained with branching formats. An example is shown in figure 4.22. Figure 4.22: Branching Intensity Scale Format (See figure in printed edition.)

FILL-IN-THE-BLANK FREQUENCY FORMATS
------------------------------------------------------- Chapter 4:11.4

Sometimes when evaluators have to be really precise and the range of frequency choices is very wide, such as in the study of repetitive behaviors, they can use a fill-in-the-blank format. What is asked for is the number of occurrences in a given time period or the interval between events to be counted. Examples are shown in figure 4.23. Figure 4.23: Number-of-Occurrences and Time Interval Formats (See figure in printed edition.)

Here are some guidelines for using intensity scales.

1. Pick a dimension and a dimension reference point; then decide whether the scale should increase in a negative direction from that reference point, increase in a positive direction, or both. For instance, consider the question, "To what extent, if at all, did the law affect your business?" Here, the scale might go from the reference point "no effect" to "a severe hardship" or, if you believe the law can only help, from "no effect" to "a very great help." But if the law could help some and hinder others, the scale would span the range from "a severe hardship" through the "no effect" reference point to "a very great help."

2. Use an odd number of categories, preferably five or seven.

3. If there is a possibility of bias from the category ordering, order the scale in a way that favors the hypothesis you want to disconfirm and that disadvantages the hypothesis you want to confirm. This way, you confirm the hypothesis with the bias against you.

4. If there is no bias, start the scale with the most undesirable or negative effect and end the scale with the most positive categories.

5. Present the scale categories in the sequence that people are used to seeing them in.

6. Pick scale-range anchors or poles (that is, specify the ends of the range) with concrete and unambiguous measures.

7. Use the item sequence and numbering to help define the range of categories.

8. Use words that are natural anchors or that will divide the scale at equal intervals, particularly over the middle two thirds or three fourths of the scale. For example, to most people, "some" or "somewhat" is usually perceived as about one fourth of the time, intensity, or amount, whereas "great" has a face value of about three fourths. (A rough check of anchor spacing is sketched after this list.)

9. Anchor the intervals with numbers, fractions, or proportions and descriptions, when feasible.

10. Use a branching format when feasible, as it is precise.
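Guidelines 8 and 9 lend themselves to a mechanical check when several intensity scales must be drafted and reviewed quickly. The short sketch below, written in Python, is not part of the GAO guidance: it encodes the rule-of-thumb face values quoted in guideline 8 ("some" as roughly one fourth, "great" as roughly three fourths) and flags anchor sets that are out of order or that divide the scale into noticeably unequal intervals. The anchor wordings other than "some" and "great," the numeric values assigned to them, the tolerance, and the function names are illustrative assumptions.

    # Rough check of verbal anchor spacing for an intensity scale.
    # "Some" and "great" use the rule-of-thumb values quoted in guideline 8;
    # the remaining values and the tolerance are illustrative assumptions.

    ANCHOR_VALUE = {
        "little or no": 0.00,
        "some": 0.25,
        "moderate": 0.50,
        "great": 0.75,
        "very great": 1.00,
    }

    def check_scale(anchors, tolerance=0.15):
        """Report whether the anchors are ordered and roughly evenly spaced."""
        values = [ANCHOR_VALUE[a] for a in anchors]
        if values != sorted(values):
            return "anchors are not in increasing order"
        ideal_step = 1.0 / (len(values) - 1)
        pairs = list(zip(anchors, anchors[1:]))
        gaps = [b - a for a, b in zip(values, values[1:])]
        uneven = [f"{lo} -> {hi}" for (lo, hi), gap in zip(pairs, gaps)
                  if abs(gap - ideal_step) > tolerance]
        if uneven:
            return "unevenly spaced intervals: " + ", ".join(uneven)
        return "anchors divide the scale into roughly equal intervals"

    if __name__ == "__main__":
        print(check_scale(["little or no", "some", "moderate", "great", "very great"]))

A check of this kind is only a screening aid; the pretest remains the real test of whether respondents read the anchors the way the writer intended.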
SEMANTIC DIFFERENTIAL INTENSITY SCALES --------------------------------------------------------- Chapter 4:12 In a semantic differential question, frequencies or values that span the range of possible choices are not completely identified; only the extreme value or frequency categories are labeled. An example is shown in figure 4.24. The respondent must infer that the range is divided into equal intervals. The range seems to work much better with seven categories than five. The reasons for this are complicated, but seven categories provide a closer approximation to the normal distribution. Figure 4.24: Semantic Differential Question (See figure in printed edition.) Semantic differentials are very useful when the evaluators do not have enough information to anchor the intervals between the poles. However, three major problems detract from this format. First, if the questions are not written with great care, many respondents will not answer or will answer with errors. Second, respondents may flounder and make judgment errors because the semantic differential has no midrange or intermediate anchors. Third, the results lack a certain amount of credibility because they are not tied to a factual observation. For example, compare a factually anchored scale point with a simple enumerated scale point. We find there is a big difference between saying that 70 percent of the respondents said their streams were polluted to the point at which most aquatic life was declining and saying that 70 percent checked 5 on a scale of 1 to 7. INTENSITY PAIRED-COMPARISON SCALES --------------------------------------------------------- Chapter 4:13 Intensity scales are very versatile and are sometimes combined with other types of scales. One such combination of scales is sometimes used in establishing priorities. Here an intensity scale is combined with a paired comparison scale. As its name implies, a paired comparison scale compares all the question options by pairs by asking the respondent to rank one item of the pair over the other. An intensity paired comparison scale asks the respondents to scale the amount of the difference between the two pair items. See figure 4.25. Figure 4.25: Intensity Paired Comparison Scale (See figure in printed edition.) AVOIDING INAPPROPRIATE QUESTIONS ============================================================ Chapter 5 To make sure questions are appropriate, the evaluators must become familiar with respondent groups--their knowledge of certain areas, the terms they use, and their perceptions and sensitivities. What may be an excessive burden for one group may not be for another. And what may be a fair question for some may not be for others. For example, in a survey of the handicapped, those who were not obviously handicapped were very sensitive about answering questions. This chapter discusses nine types of inappropriate questions and ways to avoid them. Questions are inappropriate if they are not relevant to the evaluation goals; are perceived as an effort to obtain biased or one-sided results; cannot or will not be answered accurately; are not geared to the respondent's depth and range of information, knowledge, and perceptions; are not perceived by respondents as logical and necessary; require an unreasonable effort to answer; are threatening or embarrassing; are vague or ambiguous; or are unfair. The best way to avoid inappropriate questions is to learn about the respondent group, design and field test for this group, and not rely on preconceptions or stereotypes. 
An anecdote may bring this point home. A researcher was pretesting a questionnaire on people who used mental health services. During the test, the researchers expressed surprise that the respondents could handle certain difficult concepts. Annoyed, one of the respondents rejoined, "I may be crazy, but I'm not stupid." QUESTIONS THAT ARE NOT RELEVANT TO THE EVALUATION GOALS ---------------------------------------------------------- Chapter 5:1 A questionnaire should contain no more questions than necessary. Questions that are not related to the goals of the evaluation or that are not likely to be used in the final report should be avoided. They require unnecessary time and effort from respondents. And questions that they view as irrelevant to the evaluation are less likely to be answered. This is the single biggest cause of nonparticipation. However, there are occasions when questions that are indeed very important appear to be irrelevant. If this is expected, the author should be very careful to explain why it was included. Occasionally, however, someone asks the evaluators to include what is called a "rider"--an unrelated question for use in another evaluation. Including riders creates three problems. First, the evaluation now has a dual purpose that has to be explained to readers. Second, the riders have to be woven into the questionnaire so that they do not seem irrelevant. Third, the use of the rider changes the context and hence the meaning of the questions. Aside from riders, there are three other ways in which irrelevant questions typically find their way into evaluations: 1. The evaluation design was inadequate. The evaluators did not formulate the overall project questions and the technical approach in a systematic way but decided to measure "everything" and see what they could come up with. 2. The evaluators had a hidden agenda. The evaluation was just a pretext for measuring other things. 3. The evaluators used the questionnaire to cover their bets. They already had the information they needed. They just wanted to be sure not to miss anything. Not one of these reasons is acceptable because the use of evaluations for such purposes wastes the agency's and the respondents' time and money. UNBALANCED LINE OF INQUIRY ---------------------------------------------------------- Chapter 5:2 Evaluators should not write questions that could be seen as developing a line of inquiry to support a particular position or preconceived idea, possibly at the expense of evidence to the contrary. The purpose of questionnaires is to develop information for an objective evaluation. To seem to do otherwise threatens a study's reputation for objectivity, commitment to balance, and integrity. QUESTIONS THAT CANNOT OR WILL NOT BE ANSWERED ACCURATELY ---------------------------------------------------------- Chapter 5:3 Perhaps the most frequent source of error is asking questions that cannot or will not be answered correctly. For example, we asked companies for 4 years of data, when they kept records for only 3 years. A more difficult problem occurs when respondents either purposely or unconsciously give biased answers. For example, unit commanders had a favorable bias when reporting on the performance of their units, whereas enlisted personnel were more likely to "tell it like it is." Similarly, physicians in certain hospitals rated the quality of their own medical practice very high but were objective in their judgment of peers. 
In these instances, it was inappropriate to ask unit commanders and physicians to rate themselves, because they were understandably biased in their answers. We obtained much more accurate observations from other sources (enlisted members and physician peer and nurse reports). Sometimes respondents provide misinformation because they make a random guess or they do not like to admit that they do not know something or they like to please the question asker by responding "yes." But it is better to have no information than false information. So it is important to skip out those not qualified to answer by using socially acceptable skip questions (see figure 5.1) or to direct the questionnaire only to those the evaluators know are knowledgeable. For example, in one project we evaluated the usefulness of a congressional report that analyzed federal funding by program and geographic location. We did not know which congressional staff used this report. So we analyzed staffing patterns and sent the questionnaire to the right people. Figure 5.1: Skip Question (See figure in printed edition.) Another means of selection is to ask people to rate their expertise. For example, in a study of the feasibility of a national health plan, we asked people to rate their expertise in the various knowledge areas such as the health care industry, insurance, education, manufacturing, and preventive medicine. QUESTIONS THAT ARE NOT GEARED TO RESPONDENT'S DEPTH AND RANGE OF INFORMATION, KNOWLEDGE, AND PERCEPTIONS ---------------------------------------------------------- Chapter 5:4 To avoid questions not properly geared to the respondents, it is important not to use words or terms they do not understand. It is very easy to assume that respondents know the same words we do. Some terms and abbreviations that have caused problems in past surveys are "detoxification," "EEO," "DCASR," "peer group," "net sales," and "adjusted gross income." We could have saved time and money had we provided a few words of explanation, such as "detoxification, or drying out"; "peer group, or the people you work with who have similar rank or status"; and "net sales, or the profit on sales after all expenses have been deducted." Evaluators must also use terms in the same context and sense that people are used to seeing them in. To students at a state college, the student union was a place where people hang out, watch television, and buy coffee and doughnuts; however, to military academy cadets, it was a subversive organization. In another survey, the term "margin" had different meanings to different respondents. It meant barely adequate to consumers, the amount of collateral required for stock purchases to bankers and brokers, the benefits of building or buying additional units to businessmen, and a cross-tabulation calculation to statisticians. Question writers must be familiar with their population, and they cannot assume too much or too little. For instance, we were worried about using two technical terms in surveying ranchers: "actual grazing capacity" and "forage productive capacity." However, our pretests showed the ranchers uniformly understood the terms. In another survey, we asked users to rate the quality of the computer image tapes from the LANDSAT earth-orbiting satellite. (The tapes provide data used to make computer maps of the earth's surface.) In general, the users could not answer this question because it was too broad. 
They wanted us to be much more specific and ask about the quality of the calibration, striping, formatting, wave length bands, pixel resolution, number of original amplitude steps used in digital conversion, corrections for geometric errors and distortions, and threshold settings. In yet another evaluation, we asked state child development and welfare service officials to rate the usefulness of information provided by major federal and state demonstration programs. We found that while the officials could answer for federal programs and for their own state programs, they could not answer for other state programs. As the preceding examples demonstrate, it is just as easy to assume too much as it is to assume too little. Evaluators usually have to test to be sure. In a survey of welfare recipients, we asked about the difference in quality of service provided by federal government personnel as opposed to state and local personnel. However, the respondents saw them all as "government men." In another evaluation, we asked mathematics and science teachers to add up a few numbers and calculate some percentages. We assumed this population would have little trouble with simple arithmetic. This was a big mistake. It is also important to make sure that the question writer's perceptions match those of the respondent. People from rural areas, when asked about a very large company, may envision a firm with 50 people and $1 million in sales. Hence, the question writer may want to specify "a very large firm (a firm the size of General Motors, which does several billion dollars in sales and employs more than half a million people)."

QUESTIONS THAT RESPONDENTS PERCEIVE AS ILLOGICAL OR UNNECESSARY
---------------------------------------------------------- Chapter 5:5

A line of questioning that does not appear to be logical or necessary may tend to confuse or disturb respondents. Questions should proceed in the logical order set up by the instructions and clearly denoted by headings and lead questions. (This is discussed further in chapter 11.) The questions should go from a general topic to the specific item or from the integration of specific details to a logical summary question. Like things should be grouped together, and parts should be structured in a logical progression of function, process, and chronology. For example, a survey of training programs might naturally start with questions on training objectives and then proceed to training plans, curriculums, course programming, lesson plans, instructor selection and training, course material, student selection, student progress assessments, and evaluation. It would, for example, be unnatural to start with evaluations. Items should not only be logical and relevant but should also appear so. For example, in a survey of postmilitary employment, we were interested only in the major economic sectors likely to do business with the Department of Defense. However, we had to include all major sectors and group these sectors in accordance with Bureau of Labor Statistics classifications, because many respondents were used to seeing the information this way.

QUESTIONS THAT REQUIRE UNREASONABLE EFFORT TO ANSWER
---------------------------------------------------------- Chapter 5:6

Evaluators should avoid asking questions that require unreasonable amounts of time or work to answer.
In general, it is a good idea to refrain from questions that require extensive and difficult calculations, excessive documentation, difficult to follow and burdensome response formats, extensive analysis and record searches, and a great deal of additional help. "Unreasonable" is a relative term that takes into account what respondents are willing to do, what is fair to ask of them, what the question writer is willing to do to help them, and what benefits they will get from participation. In general, form-completion time should be kept to under a half hour. This can be exceeded by a considerable margin if the issue is very salient to respondents; the form is logical, easy to read, and well designed; the approach is right; and respondents both see the need for and value of the information and can reasonably conclude that the evaluators have done all they can to keep the burden down. For example, we had to divide a very lengthy survey on housing grants into several parts and administer each part to separate individuals so that no respondent had to spend more than 1 hour on the questionnaire. However, in a survey of area agencies on aging, respondents were not the least bit reluctant to devote an entire day to the survey because they felt it was important to their jobs to participate. Regardless of how long it will take to fill out the form, the writer must be candid about it and tell respondents at the outset how long it is likely to take. Pretesting is the only sure way to find out the completion time, the task burden, the difficulty of the questionnaire, and the respondent's willingness to accept the burden. The price is very high for a miscalculation. Underestimating the burden may increase the nonresponse rate, yield inadequate answers, and lose credibility. If evaluators overestimate the burden, they may unnecessarily compromise the design to gain the acceptance of its users. Complicated response formats can also be very burdensome. Evaluators should avoid spreadsheet layouts that extend across the page and require respondents to make cross-sectional visual locations. Layouts that make respondents go back and forth through several pages, learn and remember several difficult codes, and make complicated interpolations should also be avoided. THREATENING OR EMBARRASSING QUESTIONS ---------------------------------------------------------- Chapter 5:7 Questions that are embarrassing, threatening, personal, or sensitive should be avoided. Respondents should not be asked to disclose legal actions or sensitive medical or financial information. Questions should not ask about behavior that makes them look less than ideal or about personal problems. If it is necessary to ask questions of this nature, it should be done in a way that makes the respondents at least minimally comfortable. For example, in a child-care needs assessment survey, we wanted information on marital status. This question was sensitive because some of the parents had never been married. We collected the information anonymously and explained how it would be handled and used. We expanded the range of the sensitive response category as far as possible without compromising the use of the data. Hence, the marital status choices were (1) married and (2) separated, divorced, widowed, never married. (Approaches for dealing with sensitive questions are presented in more detail in chapter 9.) 
VAGUE OR AMBIGUOUS QUESTIONS ---------------------------------------------------------- Chapter 5:8 Vague or ambiguous questions tend to leave respondents frustrated and uncertain how to answer. Vagueness and ambiguity may result from a number of causes, chief (and most remediable) among which are the following four: (1) the writing is unclear, (2) the response choices are unclear or overlapping, (3) the request is not properly qualified, or (4) the question refers to concepts that are too abstract. Unclear writing is covered in chapter 6 and overlapping response choices are covered in chapter 7. This section focuses on qualification and abstraction. IMPROPER QUALIFICATION -------------------------------------------------------- Chapter 5:8.1 Improperly qualified requests do not adequately specify the conditions or the observations evaluators want respondents to report on. If evaluators ask a report user if a report was "timely," the user may not know if they are asking whether getting it took too long or it arrived after it was needed or both. Improperly qualified items are a major source of frustration. Question answerers are frustrated because they do not know how to respond, and question askers are frustrated because they get either no answers or answers they may not be able to use. Some guidelines for correcting this type of flaw are presented below. First, get to know how the respondent population talks, thinks, and does things. Second, make sure that all terms are well qualified. Third, certain subjects cause problems if they are not part of a person's routine or if their meaning varies with the respondent's perspectives. Some of these subjects are processes, sequences, sources, times, goods and services, organizations, classifications, functions, disciplines, regions, programs, systems, space, business, government, and infrastructures. Fourth, question writers should substitute concrete terms or examples for abstract concepts. Fifth, make as few assumptions as possible. In a wage and salary survey, we asked business managers to report on their own establishments. We took for granted that everyone would know what their establishments were. However, in these days of chains, branches, decentralized and consolidated offices, holding companies and subsidiaries, this assumption was false. After a few weeks of testing, we finally came up with the following qualification: "While most of the terms in this questionnaire will be clearly understood, the term 'establishment' may be ambiguous to some and should be further qualified. For this questionnaire, an establishment should be considered as follows: "A single physical location where one or predominantly one type of business or activity is conducted in your metropolitan area (for example, a factory, store, hotel, airline terminal, sales office, warehouse, or central administrative office). "Exclude activities that are conducted at other locations, even though they may be part of the business. "If the establishment engages in more than one distinctly different line of activities or businesses at the same location, consider only the activity that involves the largest number of white collar workers. "If the personnel office is separate from the business location or serves more than one business, consider only the single separate location in the metropolitan area employing the largest number of white collar workers." In another survey of personnel, people had trouble answering "Would you relocate?" 
because they did not know whether we were asking about relocation within the city, within the state, out of state, to the West Coast, or to Washington, D.C. Or again, shippers could not answer "How many tons of goods did you ship during your last fiscal year?" Goods have different shipping measures: short tons, long tons, tonnage (a measure based on the displacement of water), hundredweights, cords, board feet, cubic feet, cubic yards, and gallons. Finally, while testing a questionnaire in inspector general offices, we were surprised to find that much of the staff lacked audit experience. This was because some of the inspectors general did not consider investigations and inspections as audits. The question should have read, "How many years of experience have you had with the government doing audits, investigations, or inspections?"

ABSTRACT CONCEPTS
-------------------------------------------------------- Chapter 5:8.2

Abstract concepts, like inadequately qualified terms, can be inappropriate because the respondent will have trouble giving a precise answer. Examples are "Does the child-care staff show affection and love toward the children?" "How good was the presentation?" "Do you have sufficient autonomy?" "Assess the neighborhood stability." Respondents cannot readily describe or quantify their observations of love, goodness, autonomy, or stability. In general, there are four ways to make abstract concepts easier to address:

1. present the concept as behavior,

2. provide definitions that are more concrete,

3. analyze or break out the concept into more elemental and concrete factors, or

4. define the various factors that govern the concept.

The question "Does the child-care staff show affection and love toward the children?" can be broken down into a series of behavior-oriented questions that measure the number and length of times the average child sat on an adult's lap or was picked up, cuddled, or held. Another example of using this behavioral technique is taken from a study of role ambiguity at the U.S. Naval Academy, where the lower-class midshipmen receive much of their training from upperclassmen. See figure 5.2. Figure 5.2: Behavior-Oriented Question (See figure in printed edition.) Sometimes concepts can be handled more easily by providing concrete definitions. In a survey of program managers, we simplified the abstract question "How much autonomy do you have?" by asking, "How much influence do you have over the project management decisions?" It may take a lot of work to reduce the abstraction in what appears to be a very simple request. The answers to "How good was the presentation?" may be a composite of many factors. In one evaluation, we had to enumerate these factors and then ask respondents to rate each one. In this case, respondents rated relevance, focus and scope, educational contribution, delivery, planning and organization, and technical merit. Furthermore, the abstractions in these terms had to be reduced by giving concrete definitions. For example, "relevance" was defined as timeliness, importance, and utility of information, and "focus and scope" were defined as appropriateness of the coverage and the emphasis and detail given to high- and low-priority information. "Neighborhood stability" was another seemingly simple concept that required substantial explanation.
We provided an operational definition of the various factors that governed "neighborhood stability" and asked respondents to rate the extent to which the neighborhood changed with respect to these factors. The factors were new people coming in, residents leaving, new commercial construction, housing construction, housing renovation, number of blighted houses, and proportion of families with children, among others. UNFAIR QUESTIONS ---------------------------------------------------------- Chapter 5:9 While irrelevant, unreasonable, embarrassing, threatening, and improperly qualified questions are also unfair, this section focuses on four other kinds of questions that give problems to respondents. These are questions that expose respondents to risk, unnecessarily ask for proprietary information, excessively test a respondent's competence or capability, or entrap the respondents. We should try to avoid lines of inquiry that put respondents at risk. Examples include asking user groups to report on their regulators, asking employees to report on their management, and asking job candidates to report on merit system abuses. However, sometimes these types of questions must be asked because they are the best or only source of information. When this occurs, the evaluators should be careful to safeguard the respondents' identities and try to prevent any administrative or other uses that would have repercussions on the informants. For example, we found that certain group homes might be at risk if the information they provided were cross-referenced with that from zoning officials, so we corroborated their reports using other methods. Evaluators should not ask for proprietary or restricted information unless it is essential to the evaluation. By "proprietary," we mean information on new products, advanced designs, marketing strategies, and so on. Also, restricted information should not be requested, such as data on compliance hearings, equal employment opportunity cases, finances, and national security. Evaluators who need this information should initiate safeguards and maintain a resolve not to disclose it. Questionnaires that seek to make an audit point by discrediting respondents' capabilities should be avoided. Questionnaires that are the equivalent of an intelligence test or a comprehensive examination of respondent qualifications are unfair. If a competency assessment is necessary for the evaluation, questions can ask about background, achievement, and behavioral information without asking respondents to "take a test." Evaluators should also avoid using questionnaires for administrative or entrapment purposes--that is, getting respondents to disclose self-incriminating information that may be used against them. Evaluators who must gather this information should be candid and tell the respondents the information they provide might be used against them. WRITING CLEAR QUESTIONS ============================================================ Chapter 6 To help respondents understand a questionnaire, the question writer must write clearly and at the respondent's language level. The questions must be direct, orderly, precise, logical, concise, and grammatically correct. They must have unity, coherence, and emphasis. Although a detailed discussion of clear writing is beyond the scope of this paper, this chapter discusses some common writing problems and presents general guidelines for increasing the readability of questionnaires. 
SIMPLIFY THE WORD STRUCTURE ---------------------------------------------------------- Chapter 6:1 One of the most effective ways to increase readability is to simplify the word structure. Four word structure factors affect readability: the length of a word, the number of syllables in a word, the ratio of root words to words with prefixes and suffixes, and the frequency of a word's use. Word length should average about 6 letters for the fifth-grade reading level. Sentences with words averaging 10 letters or more are difficult to read. Cutting back multisyllable words also increases readability. When no more than 8 percent of the words in a sentence have more than three syllables, the sentence is easy to read; when 20 percent of the words have more than three syllables, reading will generally be quite difficult for many respondents. For reading at the sixth-grade level, the average number of syllables per word should be kept under 1.3; for college-level reading, 1.7. A text is also difficult to read if the ratio of root words to words with prefixes and suffixes is only 2 to 1. Reading becomes easier as this ratio increases. Having four times as many roots as prefixes and suffixes makes for easy reading. Finally, words that are not in common use are not as likely to be known by people at lower reading levels. Lists and dictionaries that match words to reading levels can be used for assistance. If the evaluator suspects that readability may be a problem, it should be tested. Several readability indexes focus on word length, number of syllables, word prefixes and suffixes, and sentence length. Examples are the Flesch reading ease formula, the Flesch scale, the Fog index, the Dale-Chall formula, FORECAST, and the RIDE formula. BE CAREFUL ABOUT WORDS WITH SEVERAL SPECIFIC MEANINGS AND OTHER PROBLEM WORDS ---------------------------------------------------------- Chapter 6:2 Sometimes a question is misunderstood because a word in it has several meanings and its context is not clear. For example, evaluators may assume "How significant was that result?" means "How important was that result?" But methodologists may think the question deals with the statistical certainty of the result. Evaluators who try to improve the readability of questions by using more familiar words often use words with multiple meanings. Some examples are "case," "run," "feel," "fair," "direct," and "line." The question "How many cases do you carry in a month?" will have one meaning to a parole officer and another to a baggage handler. Other problem words include "like," "best," "believe," "all," "none," "any," and "could." For instance, "like" depends on its context for meaning. Respondents reading "manufacturers like items" may interpret it to mean "manufacturers prefer items" or "manufacturers' similar items." The word "best" can also cause confusion. There is only one best way, but how often do questionnaires state, "Check all the answers that best apply"? The word "believe" may mean "think" to some and "have a conviction" to others. Because "all" and "none" are absolute words, people who are quibblers may avoid these words, insisting there are no such absolutes. "Any" can mean every or some. And "could" is often confused with "would" or "should." DO NOT USE ABSTRACT WORDS ---------------------------------------------------------- Chapter 6:3 Abstract words, or words that convey general or broad meanings or relationships, should be changed to concrete words, or words with more specific meanings. 
Concrete words are more easily understood. Consider the following example: "Enumerate the mishaps attributable to personnel not cognizant of the regulation that could have been obviated." After we replace the abstract words with concrete words and reorganize the sentence, it becomes much easier to understand: "List the preventable errors caused by people unaware of the regulation." However, an undue emphasis on concrete words may cause an overly detailed, inefficient line of questioning. It is important to choose the appropriate level of abstraction. As a rule of thumb, the lower the expected level of a respondent's literacy, the more concrete the words must be. REDUCE THE COMPLEXITY OF IDEAS AND PRESENT THEM ONE AT A TIME IN LOGICAL ORDER ---------------------------------------------------------- Chapter 6:4 Question writers must be concise because they need to cover a lot of topics with as few questions and words as possible. However, they sometimes defeat their own purposes by presenting complex ideas too quickly and by failing to link the ideas in logical order. For instance, consider the following question, addressed to National Guardsmen and Reservists: "What percentage of your mission training and the occupational specialty training that you received during unit assemblies and annual active duty followed a published training schedule?" A less complex, more logical version of this question might read: "We need to know what proportion of your training followed a published schedule. "First consider the mission training you received during the unit assembly. What percentage of this training followed a published schedule? "Next consider the mission training received during annual active duty. What percentage of this training followed a published schedule? "Now forget mission training and concentrate on military occupational specialty training. Consider this training received during the unit assembly. What percentage of the occupational specialty training followed a published schedule? "Finally, consider the military occupational specialty training received during annual active duty. What percentage of this training followed a published schedule?" REDUCE THE SENTENCE LENGTH ---------------------------------------------------------- Chapter 6:5 Sentence length has a large effect on readability. Longer sentences usually contain more information, are grammatically more complex, and are harder for the reader to process. It is a rule of thumb that 10-word and 11-word sentences are suited to a sixth-grade reading level. Each addition of two or three words to a sentence, up to a 16-word sentence, increases the reading level by about one grade. After this, every word increases the reading level by one year. Hence, sentences of 25 words or longer may require college reading levels. SIMPLIFY THE SENTENCE STRUCTURE ---------------------------------------------------------- Chapter 6:6 One factor that makes question writing difficult is the need for very precise, well-qualified language. To satisfy this requirement, sentences grow in length and become more complex. Although the effects of syntax on readability are not well understood, complex syntax also appears to be associated with reading difficulty. However, as we explain in the next paragraphs, this may result more from a tendency to bury, or embed, a main idea in complex sentence structure than from a problem with complex sentence forms. The simple sentence, containing a clear subject-verb relationship, should be the writer's goal.
However, because of the need for modifiers, qualifiers, and variety, more complicated sentence forms will have to be used at times. Here are some rules of thumb. In a complex sentence, the main idea should be at the beginning. If this is not possible, it should be at the end. Embedding the main idea in the middle of the sentence should be avoided. The number of dependent clauses should be limited. Compound sentences should not be used unless the independent clauses are of equal value. Otherwise, the less important clause will take on undue importance. As for compound-complex sentences, they should be avoided, if possible. USE ACTIVE AND PASSIVE VOICE APPROPRIATELY ---------------------------------------------------------- Chapter 6:7 People read faster with more comprehension when the text is in the active voice than they do when it is in the passive. In active voice sentences, the emphasis between the subject and verb is clear and the action moves smoothly. Nevertheless, in question writing, certain thoughts should be emphasized more than others. The passive voice can be very useful in subordinating the subject or focusing attention on the object in the sentence. USE DIRECT, PERIODIC, AND BALANCED STYLES APPROPRIATELY ---------------------------------------------------------- Chapter 6:8 Most questions should be asked in a direct style with the main thought first and the details and qualifiers later. This form, sometimes called a "loose sentence," allows quick development of the main idea and the addition of details without the confusion caused by embedding. However, the question writer should be careful not to dilute the main idea by overloading the sentence. Sometimes the "periodic style," in which the main idea comes last, is more useful. For example, when a complex idea must be expressed in one sentence, the writer can build up or emphasize the thoughts the respondent must consider. On occasion, evaluators may present the reader with a balanced contrast of two equal ideas. When this occurs, the two ideas are presented in like construction. AVOID WRITING STYLES THAT INHIBIT COMPREHENSION ---------------------------------------------------------- Chapter 6:9 Question writers should avoid needless shifts in subject, person, voice, and tense. Wordy writing styles should also be avoided. Cutting down on the number of words and sentences allows the respondent to focus more on the information being presented. Concise writing can also add force and emphasis to a query. Prepositional decay is a serious problem in question writing. It often develops in the simple sentence, in which the writer adds so many qualifiers that the main idea is diluted, deemphasized, or forgotten. Although not as serious a problem as embedded syntax, it can compromise a question's effectiveness. Here is an illustration of prepositional decay and a simplifying revision. Prepositional decay: "The federal government, which has a number of programs to provide assistance to individuals and public and private organizations through the state and local governments for use in planning, implementing, and evaluating housing activities in community development areas, is consolidating these categorical grants under a single block grant." Simplifying revision: "The federal government is consolidating its categorical grant housing programs into a single grant. This grant, called a 'block grant,' can be given to a state or local government." Repetition and parallelism can aid comprehension. 
However, when overused, these techniques become monotonous and irritating. Because people generally have more trouble with an idea stated negatively, question writers should avoid negatives. Negatives take longer to read and lead to more mistakes. These problems are exacerbated when double negatives are used, even though they may be logically correct. Although researchers are not quite sure why, they have found that another readability problem develops when writers create a noun from a word that is normally a verb. For instance, the nouns "specification," "participation," and "implementation" were derived from the verbs "specify," "participate," and "implement." Rather than adding a level of abstraction that slows the reader down, question writers should go back to the original verb. Often, seemingly small mistakes can cause a lot of trouble. Misplaced modifiers, for example, confuse the reader. Pronouns are sometimes placed where their antecedents could be more than one word. On occasion, the reverse occurs, and the antecedent of the pronoun is made vague or indefinite or put in the wrong position. A similar problem arises when the word "which" is used to refer to a clause. The clause is perceived as indefinite and the reader is confused. If the clause cannot be reduced to one word, the sentence should be reworked to eliminate "which." The following question has a similar problem: "If you do not have children younger than 12 living with you now, is this likely within the next 2 years?" Because the antecedent of "this" was unclear, some people thought that a "yes" answer meant that they did not have children younger than 12 living with them now and did not expect children to be living with them in the future. However, others thought that a "yes" meant that they expected to have children living with them within the next 2 years. A better way to ask for this information is to ask two questions: "Do you have children younger than 12 living with you now?" and "Do you expect to have children younger than 12 living with you within the next 2 years?" DEVELOPING UNSCALED RESPONSE LISTS ============================================================ Chapter 7 A type of multiple-choice question known as an unscaled response list is frequently used in GAO questionnaires. We develop a list of entries and ask respondents to select one or all that apply. In some instances, we want respondents to rate each category for degree of importance or satisfaction. To prepare a good unscaled response list, the question writer must have a thorough grasp of the subject matter covered by the question and understand the subject from the respondent's perspective. Only then can unscaled response lists meet the following standards: The lists must contain all the categories perceived by respondents as significant to the question topics. The categories must not overlap. The categories must be relevant and appropriate from the respondent's perspective. The lists should not exceed five to nine categories, unless the categories are grouped into sets. The specificity of the response categories must be at the level of detail required to answer the evaluation question. Respondents must feel that the order in which the categories are presented is logical. A prior screening question should be used if the question does not apply to all respondents.
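One of the standards just listed--that categories must not overlap--can be checked mechanically when the categories are numeric ranges such as age or income brackets. The following minimal sketch is illustrative only and is not part of GAO guidance; the brackets shown are hypothetical, and the second set mirrors the kind of nonoverlapping age categories shown in figure 7.3 below.

def overlapping_pairs(categories):
    # Flag numeric response categories whose end points overlap.
    # Each category is an inclusive (low, high) range, as in an
    # age bracket of "25 to 34".
    ordered = sorted(categories)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a[1] >= b[0]]

# Brackets like "18 to 25" and "25 to 35" both contain 25, so they overlap.
print(overlapping_pairs([(18, 25), (25, 35), (35, 45)]))
# Nonoverlapping brackets of the kind shown in figure 7.3.
print(overlapping_pairs([(18, 24), (25, 34), (35, 44)]))

A check like this catches only arithmetic overlap; judging whether word categories such as "single" and "divorced or separated" overlap still requires the analysis described in the section on mutually exclusive categories below.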
DEVELOPING COMPREHENSIVE LISTS ---------------------------------------------------------- Chapter 7:1 To obtain useful data, response lists must contain all important categories that apply to the question area. Usually, the question writer includes an "other (specify)" category to cover omitted alternatives. However, because respondents are more likely to recognize than recall all the factors they want to report, they tend to underuse the "other" category, therefore omitting an important alternative. Do your research; write your list; then pretest. In most cases, pretesting is invaluable for ensuring the adequacy of the response list because the respondent population usually knows the area better than the evaluators do. Seemingly broad topics like the quality of medical care can be resolved into comprehensive lists through research and analysis. For example, consider the question in figure 7.1 used successfully in evaluating veterans' satisfaction with Agent Orange examinations provided by Veterans Administration medical centers. Figure 7.1: Question With Comprehensive List of Categories (See figure in printed edition.) PRESENTING MUTUALLY EXCLUSIVE CATEGORIES ---------------------------------------------------------- Chapter 7:2 To develop nonoverlapping categories, the question writer should use words that clearly define category membership. For example, to determine the marital status of respondents, the writer should avoid using the separate categories "single" and "divorced or separated." The word "single" can be read as applying to either divorced or separated as well as never married people. Another example of overlapping categories is given in figure 7.2. Figure 7.2: Question With Overlapping Categories (See figure in printed edition.) Because the categories in figure 7.2 are not sufficiently qualified, they are not mutually exclusive. In particular, managers, technicians, secretaries, and sales persons all consider themselves professionals. Several techniques can be used to develop number ranges that are mutually exclusive. For example, adding such text as "less than 6 months" and "from 6 months up to a year" helps respondents answer questions involving time. In a question about a respondent's age, the end points of one response category must not overlap the beginning of the next category. See figure 7.3. Figure 7.3: Question With Nonoverlapping Categories (See figure in printed edition.) Sometimes a question focuses on two or more information items rather than one, causing overlapping categories. For example, we wanted to know how former Department of Defense employees had learned about postemployment restrictions. The word "how" in this context has various meanings: from a co-worker, at a retirees' meeting, at the office, from magazines or newsletters, during an exit interview at the department, and so on. A response list with these options would be confusing, because it mixes sources of information and places of learning the information. Rather than asking "How?" we needed to ask either "From whom did you learn . . . ?" and "Where were you when you learned . . . ?" or, better still, "From what source did you learn . . . ?" USING RELEVANT AND APPROPRIATE CATEGORIES ---------------------------------------------------------- Chapter 7:3 The alternatives provided in a response list must be geared to the respondent group. 
For example, if we are surveying food stamp recipients, the response categories for a question on yearly income should be skewed toward the low end of the income range. If we provide response alternatives of $0 to $10,000, $10,001 to $20,000, and so on, most if not all of the respondents would probably select the $0 to $10,000 alternative, and the data would not be very useful. A more appropriate format would be $0 to $2,000, $2,001 to $4,000, and so on. To write relevant and appropriate items, the question writer should tailor the wording to the majority of respondents. An illustration is the use of medical terms. If we need to measure the receipt of health services, we might use simple terms and give examples if the respondents are not medically trained professionals. See figure 7.4. Figure 7.4: Tailored Question With Comprehensive Nonoverlapping Categories (See figure in printed edition.) KEEPING THE RESPONSE LIST REASONABLY SHORT ---------------------------------------------------------- Chapter 7:4 People can focus on lists of about five to nine categories. Longer lists should be grouped into sets with titles to help respondents grasp the range of information. When each of the response categories is to be rated (for example, by degree of importance), subgrouping also aids respondents in assessing each entry's relative value. Long response lists are more subject to primacy and recency effects. If respondents are asked to select entries from a long list, they tend to select the first and last entries. (Primacy and recency effects are discussed in chapter 8.) USING CATEGORIES OF APPROPRIATE SPECIFICITY ---------------------------------------------------------- Chapter 7:5 Response categories should be neither too broad or abstract nor too narrow or specific for the measurement purposes, and the specificity should be tailored to each respondent group. The level of response specificity also sets that of the question stem, which should be at one level more abstract than the response space. To measure the quality of a speech, for example, we might ask people to assess its educational value, focus and scope, clarity of delivery, interest value, and topic emphasis. Each of these categories is appropriate to the level of measurement needed for evaluation. More specific information on the clarity of the delivery through diction, accent, and syntax is more detailed than needed to answer the evaluation question. While it is important not to ask for details you do not need, it is just as important to ask for levels of detail that you do need. A survey on water pollution further illustrates this point. When the Environmental Protection Agency asked paper-manufacturing plants about the acidity and alkalinity (pH) of waste water released into rivers, the response categories were not precise enough. The agency asked whether the pH level was 4 to 5, 5 to 6, 6 to 7, or 7 to 8 and up, but it needed to know whether the pH level was 7 (6.5 to 7.4), which is neutral. A pH range of 6 to 7 includes measures that are acidic. A pH range of 7 to 8 includes measures that are alkaline. It is also important that the level of specificity meet the expectations for the target population. For example, in a national parks survey of people who burn wood, the question and responses varied in specificity according to the knowledge of each type of wood burner. For the usually less knowledgeable fireplace users, the survey asked "What types of woods do you usually burn?"
with response choices of hardwood, softwood, mixed wood, any type of wood, and don't know. But for the more knowledgeable wood stove users, it asked "What types of hard and soft woods do you usually burn?" with response choices of locust, oak, cherry, hickory, pine, poplar, and cottonwood. LISTING CATEGORIES IN THE LOGICAL ORDER EXPECTED BY RESPONDENTS ---------------------------------------------------------- Chapter 7:6 When respondents read a question, they begin to anticipate the response alternatives. If the alternatives are presented in a sequence that is not perceived as logical, the respondents may feel they have misunderstood the question and return to study it again. (Logical sequence is discussed in chapter 11.) USING A SCREENING QUESTION ---------------------------------------------------------- Chapter 7:7 Response lists may place an implicit demand on respondents to check an entry. For example, if doctors are asked to report the professional publications they read during a 2-week period and are presented with a list, they will probably check something regardless of whether they have read a journal or not. Using a screening question that asks whether or not they had been able to read any publications in the last 2 weeks would reduce this tendency. MINIMIZING QUESTION BIAS AND MEMORY ERROR ============================================================ Chapter 8 QUESTION BIAS ---------------------------------------------------------- Chapter 8:1 Sources of bias can occur in either the stem of the question or the structure of the response. Various types of biased questions, as well as some ways to avoid them, are discussed below. STATUS QUO BIAS -------------------------------------------------------- Chapter 8:1.1 Questions that state or imply prevailing conditions may produce inaccurate data. In the following examples, the use of "most" and "as it now stands" could influence answers: "Most child support enforcement offices confirm the employment of absent parents on a regular basis (such as monthly or every other week) rather than 'as needed' (such as when support payments are not made or when files are transferred). Does your office confirm the employment of absent parents regularly or on an 'as needed' basis?" "As it now stands, Department of Defense policy is to provide civilian employees with information on postemployment restrictions during exit interviews. Did you receive any information on employment restrictions when you left the department, or did you leave without getting this information?" Better presentations of these questions would delete status quo information, since some respondents would otherwise feel compelled to conform to what is seen as "normal." BIAS IN MORE THAN ONE DIRECTION -------------------------------------------------------- Chapter 8:1.2 Sometimes question writers add qualifying or identifying information that can bias respondents in different directions. For example, a question writer might ask, "Who would you vote for, Pat Green, the Republican incumbent, or Chris Lamb, the Democratic challenger?" If the question writer is interested in the choice between Pat Green and Chris Lamb, the question is biased. The respondent's choice will be influenced not only by the persons individually but also by political party and the difference between continuance and change in leadership.
An illustration of this type of bias in a GAO study might be the following: "Should program managers with responsibilities for major weapon systems be civilians with an engineering background or military personnel with an operational background?" If we want people to base a choice on whether the managers are military or civilian, we must take out the engineering and operational qualifications. If we want people to base a choice on operational and engineering qualifications, we must take out the military and civilian comparison. If, however, we want them to base a choice on several factors, all the factors must be presented. Consider the following example: "How important, if at all, is it for the project managers to be civilians or military or have engineering or operations backgrounds?" BIAS FROM SPECIFIC WORDS -------------------------------------------------------- Chapter 8:1.3 Certain words are "loaded" because they evoke strong emotional feelings. In our culture, such terms as "American," "freedom," and "equality" may tend to evoke positive feelings and "communist," "socialist," and "bureaucracy" may tend to evoke negative feelings. Other emotionally laden words, such as "abortion," "gun control," and "welfare," probably evoke a complex pattern of responses. Since it is difficult to control or predict the effect of these words, it is usually best to avoid them. We can illustrate phrasing that could bias responses. See figure 8.1. Figure 8.1: Biased Question (See figure in printed edition.) Phrases such as "people who don't work" do not contribute to an objective frame of reference. (See Warwick and Lininger, 1975.) An example from a GAO study involves a mail survey of private industry's views on competitive bidding practices for major weapon systems. An article by an expert had compared the bidding process to a game of "liar's dice," implying that bidding is like a game that favors a skilled deceiver. The use of the term "liar's dice" could elicit a negative or threatened feeling. Instead, we wrote the question as follows: "One approach to bidding might be to be conservative. That is, to overestimate cost and underestimate performance on the theory that a firm will look better when it delivers because it beat its original estimates. Another approach would be to make a realistic bid by specifying the actual costs and expected performance. Still a third approach would be to be optimistic by understating costs and exaggerating performance. You might do this on the theory that if you are not optimistic, you won't get the job. The question is, Which strategy gives the best probability of winning: making conservative and realistic estimates or optimistic estimates?" Interestingly, a single word can affect how people respond to a question. For example, people viewing a film that shows a car crash will probably report broken glass if we ask them what happened when the car was "smashed"--even if the film does not show any glass breaking--but they would not report broken glass if we ask them what happened when the car was "hit." UNBALANCED QUESTION BIAS -------------------------------------------------------- Chapter 8:1.4 Just as we can have unbalanced response categories (see chapter 4), we can have unbalanced questions. The wording of an item stem or question may imply or suggest how the respondent should answer. "Do you support the establishment of group homes for the mentally retarded in single-family zones?" or "You're the best trained soldiers in the world, aren't you?" 
might elicit positive answers, since no other possibilities are made explicit. Questions can frequently be balanced by adding "or not" ("Did you get training or not?") or word opposites ("Do you support or oppose?") It is important to balance word opposites well. For example, "forbid" and "not allow" have different meanings and cannot be used interchangeably as opposite terms for "allow." Depending on the context, "dissatisfied" is the appropriate opposite term for "satisfied," while "not satisfied" is inappropriate. For example, some studies of employee satisfaction indicate that those who are "not satisfied" with their work are basically content but would like improvements in some areas. In contrast, employees who are dissatisfied are basically unhappy with their work. QUESTIONS THAT OMIT IMPORTANT FACTORS -------------------------------------------------------- Chapter 8:1.5 The answers respondents give to a question vary according to their frame of reference. For example, some employees might judge their job satisfaction on their commuting time, some on promotion policies, and others on types of tasks and responsibilities. The question asker must ensure a common frame of reference by delineating each of the factors respondents should consider in reaching an answer. This is particularly important when the respondent has a vested interest in the subject and when complex questions containing several aspects are being asked. Even though a question may be formally balanced, one position may be favored over another because of the topic and the respondent's characteristics. For example, we asked farmers "Do you think the government should provide free agricultural weather reports or not?" Expecting a yes bias, we needed to get the respondents to consider the question from a variety of viewpoints. For example, "In reality there are no free services or subsidies since ultimately everyone pays taxes to provide them. The question is, Do you favor free weather reporting services even though all taxpayers must bear the cost?" In a survey question mentioned previously, program managers of major weapon systems were asked whether civilian or military personnel should be program managers. Most of the respondents were military. To obtain opinions based on balanced considerations, we presented the pros and cons: "A persistent issue is whether or not the program manager [PM] position should be held exclusively by military personnel. There are advantages and disadvantages attributed to the military PM. Pro-military arguments claim knowledge and appreciation of the system (conditions, personnel, organization, etc.) and advantages of service affiliation. However, the military PM system is sometimes criticized for short tenure, valuing performance over cost, constraints on independent action from the military rank hierarchy, and service-mission suboptimization. The question is, Should the federal PM work force be composed exclusively of military personnel or should it be composed of both qualified military and civilian personnel?" Broad questions contain many different aspects to be evaluated. People tend to be selective in remembering and consider only some arguments. The question writer should present all the significant factors and should balance the pro and con positions. If three arguments are given in support of a position and two arguments are given in opposition, endorsement percentages will tend to favor the former. 
PRIMACY AND RECENCY EFFECTS -------------------------------------------------------- Chapter 8:1.6 Structured response formats vary in length from two alternatives (such as "yes" and "no") to fairly lengthy lists. The evidence in survey research is mixed regarding the tendency of respondents to pick alternatives presented first (primacy effect) or last (recency effect), regardless of item content. Primacy effects may result because the first item in a series may receive additional attention or mental processing. Recency effects seem to be more likely when the reader is presented with lengthy or complex text, lists, or response alternatives. When presented with a questionnaire item, people try to process both the question, or the stem part of the item, and all the choices in the response part of the item before answering. Hence, respondents given long lists of response alternatives tend to be biased toward the last few items, because that is the material they have been exposed to just before they are ready to answer. However, primacy and recency effects often work in tandem. This results in higher reporting for the first and last few choices in a list and lower reporting for the middle items. The effect of these biases also depends on the medium used. In self-administered instruments, primacy effects dominate. But the opposite is true for personal interview and telephone interview surveys. For these, recency effects dominate. Some of the best ways to minimize these differential effects are to keep the list short, add subtitles, and use formats that present the list in shorter groups. (See figure 8.2.) Another way is to slow the reader down by turning the "check all that apply" format into a "check yes or no" format. (See figure 8.3.) Figure 8.2: List Divided Into Subgroups to Counter Primacy and Recency Biases (See figure in printed edition.) Figure 8.3: "Check All That Apply" Response Format Changed to "Check Yes or No" Format (See figure in printed edition.) Bias effects from prior processing of an item or having prior concern with the topic can sometimes be ameliorated by the placement of the item in the questionnaire. For example, community opposition is frequently cited as a problem in locating group homes in residential areas. In surveying people who operate group homes for the mentally retarded and emotionally ill, we asked them to respond on a five-choice scale. We expected a tendency on the part of respondents to focus on positions they had encountered. Therefore, the scale was constructed with support and opposition as opposite poles. Furthermore, we counteracted the inherent bias by presenting the support anchor as the first unit and the opposition anchor as the last unit. This example is presented in figure 8.4. Figure 8.4: Using Presentation Order to Counteract Expected Bias (See figure in printed edition.) PRESENTING CHOICES IN A LOGICAL SEQUENCE -------------------------------------------------------- Chapter 8:1.7 A list of unscaled response alternatives (reasons for going bankrupt, characteristics of grazing land, and the like) must be put in a logical order. That is, the options that are of primary significance to the topic being considered should be listed first. Otherwise, we will violate a rule of conversational English and perhaps confuse the respondent.
For example, a questionnaire asking people why they dropped their memberships in health maintenance organizations would present the ability to choose doctors and the quality of care at the beginning of the list and paperwork and hospital decor at the end. USE OF THE "OTHER" CATEGORY AND INCOMPLETE LISTS -------------------------------------------------------- Chapter 8:1.8 Question writers often include an "other" category in unscaled response lists as a check for the completeness of the lists. The "other" category offers the respondent the opportunity to add the additional salient responses that the writer missed in providing a comprehensive range of choices. Omitting viable options as well as the other category causes overreporting in the categories presented because the respondents will force the omitted choices into these categories. Similarly, they sometimes overreport in the "other" category for the same reason. It is essential that the evaluators analyze responses in the "other" category to (1) determine the adequacy of the choices listed and (2) make adjustments for underreporting in the major categories (for example, one respondent wrote "availability of housing" under "other" when availability of housing stock was listed as an entry). BIASED EXAMPLES -------------------------------------------------------- Chapter 8:1.9 Sometimes questionnaire writers provide examples to illustrate the kind and range of information needed. Single illustrations may cause a respondent to restrict a frame of reference. For example, were we to ask students how satisfied they are, if at all, with their teachers and mention the name of only one teacher, we might get their evaluation of only that teacher rather than of their teachers in general. MEMORY ERROR ---------------------------------------------------------- Chapter 8:2 Many factors affect memory: the time since an event occurred; its saliency; the respondent's motivation, ability, and experience; the type of material to be recalled or recognized; and, most importantly, the way in which the questions and reporting formats are crafted. Memory error can result in either underreporting or overreporting. Memory error is revealed in three ways: omissions (forgetting that an event occurred), intrusions (recalling an event that never occurred), and event displacement (miscalculating when an event occurred). Consider the dynamics of memory in answering questions. The respondent must comprehend and interpret the question; decide what information is needed; search his or her memory; select, analyze, and integrate the information; and make a judgment on what and how to report. To do all this, the respondent behaves in part as if his or her brain functioned with two types of memories: a short-term memory and a long-term memory. Respondents use the short-term memory to remember the question text long enough to understand and interpret the question and initiate a retrieval process from the long-term memory. People usually retain the short-term memory information only long enough to use it (18.7 seconds). For example, they usually forget the telephone number they have just dialed or the syntax of the question they have just answered. While the two memory functions work together complementarily, they appear to be quite different. The short-term memory processes information much more quickly than the long-term memory. It handles information in limited sets of about seven chunks or units and stores it as a representational image. 
The slower long-term memory retrieval system stores and accesses most of what the respondent knows. It stores this information as semantic or meaningful codes rather than as representational images. We need to understand this difference because, to facilitate recall, we need to write questions that satisfy the requirements of both of these processes. FACILITATING THE MEMORY PROCESS -------------------------------------------------------- Chapter 8:2.1 Some practices that have been shown to facilitate the memory process follow. 1. Use simple, direct, organized, and specific language. The memory process is facilitated by using an organized line of inquiry, by using the active voice, by avoiding lengthy qualifications, by using familiar words with a limited range of meanings, and by using simple syntax. (Complex syntactical constructions often embed the main point.) This language style facilitates the short-term memory process, because it allows the respondent to quickly identify the type of information needed for the answer without taxing the short-term processing capabilities. It also helps the long-term process by helping the respondent remember. 2. Be consistent with the way people have learned the information they are asked to remember. Present the question material in the same sequence, manner, terminology, level of conceptualization, detail, and abstraction in which the information was learned or is usually experienced. Sometimes even slight changes can interfere with retrieval. For example, it takes the average person 10 seconds to report the months of the year in calendar order but 2 or 3 minutes to report the months in alphabetical order. 3. Avoid reversals. For a variety of reasons, English uses negative subordinations, prepositions, and other language codes to reverse the meaning, order, or importance. For example, "not unlikely" means "likely," and "performance was worse under PFP than TQM" means that performance was better under TQM than PFP. Avoid these reverse constructions. The memory system sometimes forgets the reversal code, so that the information is recalled incorrectly. 4. Make sure the questions and the reasons for asking the questions are meaningful to the respondent. People are more likely to be able to recall information they believe is important. 5. Use the question to guide the answer search. People remember things better when the topic of the search is specified initially. They store information in related and hierarchical categories. Therefore, the question, or the stem part of the item, should ask the question in a simple, complete, and direct manner, and it should specify the category or type of material to be searched for at a level only somewhat more general than the details or specific choices presented in the response space. For example, consider the following question: "How satisfied or dissatisfied are you with the following components of your benefit package?" This stem is not as effective as an alternative stem that asks "How satisfied or not are you with the following benefit and pension components of your compensation package?" because people tend to see the compensation package as having quite different classes of components--benefits and a pension. Avoid stems that shorten the question by not identifying the search category. For example, "How satisfied or not are you with the following?" Worse yet are one- or two-word stems that imply a question and use a complete-the-sentence format. An example is "satisfaction with: health insurance?
life insurance? etc." 6. Do not overload the short-term mental processing system with too many alternatives, considerations, or qualifications. An example is "How satisfied or dissatisfied were you with the information you obtained, if you obtained information, on `Brassica cultivars' in current use, special genetic stocks, obsolete cultivars, traditional varieties or landraces, distant relatives of cultivated varieties that form fertile hybrids, varietals that can be crossed and varietals that can be crossed with advanced techniques on the amounts of resources existing in nature, and gene banks and the amounts of resources that are in decline?" Here the short-term mental processing system is obviously overloaded because the respondent is asked to consider too much information simultaneously; in such cases, the respondent often resorts to inefficient coding and long-term system access strategies. In the stress of inefficient strategies, the information retrieval, integration, and judgment functions of memory recall are usually the first to break down. This does not mean that evaluators cannot make complex inquiries. But they must limit the information to be kept in the respondent's head during the comprehension and retrieval tasks to a small number of units that can be immediately processed in discrete steps. The set size of information units that can be kept in one's head varies from a few units to about seven, depending on the similarity and complexity of the units. Complexity can be handled by increasing the number of steps and presenting them serially. For example, consider the preceding example when it is decomposed into the sequence of questions in figure 8.5. Figure 8.5: Complex Question Broken Into Sequence of Questions (See figure in printed edition.) While the proposed alternative may have more words and structure, overall this line of questioning will provide faster, less burdensome, and more accurate answers. 7. Make the judgment that the respondent has to make during the retrieval from long-term memory as easy as possible. Recall is much less accurate for information that is complex, multivaried, vague, or with conflicting elements. For example, consider the following question. "Please provide an overall assessment of all the GAO reports that you read last year with respect to the following considerations: timeliness, clarity, quality of reporting, responsiveness, comprehensiveness." It would have been better to ask about a specific report that the respondent had read. Then specific questions should have been asked about each of the attributes in which the evaluators were interested. Each attribute should have been carefully specified in concrete, operational, and meaningful terms. For example, timeliness should have been resolved into two components--turnaround time and the provision of information in time to use it. The quality of the reporting should have been given at least four properties: focus and scope of the reporting, the soundness of the evidence provided, appropriateness of the qualifications, and logic of the conclusions. In a case like this, overall assessment questions should be considered last after the respondents have refreshed their memory on all the properties to be evaluated. 8. Use cues to help respondents retrieve data from their memories. Some of the examples below show a wide variety of cuing methods. For example, rather than ask "what are the ages of your children?" 
ask "Starting with the oldest give me the ages of your children as of their last birthday." The second alternative uses explicit language and time and episode incident referents as well as a natural order as cues. The following example uses place, time, episode, routine memory, qualification, quantification, and people referent cues: "Think about where and whom you were with when you ate breakfast this morning. List the foods you ate. Did you have more than one serving of any of these foods? Did you eat less than half of any of the servings?" Questionnaire writers also use examples to trigger recall. Consider the following case in point: "So far this year the Suburban Trust Company reported that 10 percent of their windowed return envelopes were incorrectly posted because the mailer inadvertently put the mail insert over the address that was supposed to show though the window. The question is, when was the last time you did that?" In addition to these more familiar techniques, researchers have developed some rather ingenious way of cuing. Specifically, crime report surveyers found that the conditions and activities that the victims might have experienced reduced memory errors by 30 percent, as in "Think about the times you came home late at night last year. Were you ever robbed or assaulted?" In still another approach, authors cued recall by varying the respondents' orientation. Respondents were asked to recall the details of a house visit from the viewpoint of a prospective home buyer, then from the perspective of a burglar. Each successive recall produced new and accurate observations. There are other types of cues. Examples are a calendar of political, newsworthy, or administrative events; a list of names, topics, or events relevant to the material to be recalled; and a narrative description of the respondent's routine. Question writers also use the questionnaire text to cue memory. Respondents receive cues from the organization of the survey, the line of questioning, the direction given in the question stem, the instructions, the presentation of the response alternatives, and the emphasis given. There are, of course, problems with cuing. First, if inappropriate cues or miscues are presented, the respondent usually produces an inappropriate or erroneous answer. Second, respondents who have no memory of the event or truly cannot recall may feel pressured to answer with false reports based on the cues offered rather than saying, "I can't recall." Third, if the cues are leading, the respondents may follow the cue and bias their answers. Fourth, cues that use special terms or difficult words often confuse the respondent. 9. Consider using longer questions sometimes. Longer questions may set the scene by presenting significant aspects of an argument, defining how terms are to be measured in the question, or giving examples. Short questions sometimes achieve their brevity by means of complex words. To say the same thing more simply takes some effort but may reap rewards by increasing a respondent's memory and comprehension. 10. Always consider the respondent's ability, motivation, or viewpoint. Respondents who are fresh, interested, alert, confident, and smart answer with less memory error than those who are fatigued, bored, unobservant, anxious, under stress, or less capable. Respondents who have a strong concern or bias toward a particular issue will remember things that support that viewpoint and forget facts that do not. 
Material that is well learned, or that the respondents are familiar with or have thought about or been extensively associated with, will be recalled much better than material that has had more limited exposure. 11. Consider the limitation on memory. Unless the information is memorable or well learned or unless a conscious effort is made, people do not store details in their memory. They will forget about 75 percent of the detailed information they were exposed to in 1 to 3 days. They code and store the information as a summary, organized around the essential facts or salient features of the observation, experience, event, or material. After longer periods of time, months and years, they may forget even these summaries. Their memories will sometimes distort or selectively add or subtract or otherwise alter the information stored. If the stored information is inconsistent or not meaningful or rational, their memories will omit the inconsistencies and add material that was never originally stored in their minds to make the information consistent, meaningful, or rational. If they later find out that some aspects of the mentally stored information are inconsistent or not important, they will forget that information and again add new information and correct important old information that was never part of the original memory. In short, people are likely to remember the gist of an event better than its details. If we need highly detailed information, we should consider using other data-collection sources, such as observations, diaries, and records rather than self-administered questionnaires. If mail questionnaires do require detailed information, respondents should be asked to refer to their records; however, the burden of this may decrease response rates. 12. Maintain a similarity in style among like items and responses. A similarity or parallelism within and among questions, choices, text cues, and presentations should be maintained for common or similar attribute measures. For example, if we start out by having positive attributions to the right and negative to the left, it is usually a good idea to keep them that way throughout the questionnaire. Numeral indexes offer another example. Respondents who see the first few high numbers as "good" may make a mistake and check the wrong number in items where low numbers are "good" and high numbers are "bad." If the writer changes cues, the writer must make it very clear when the "signals have been switched" and be very consistent with the use of signal switching cues. There are other exceptions to the general rule of maintaining similarity in response formats. If we ask people to recall an extensive and detailed set of information and then follow this up with a second request involving another extensive and detailed consideration of information that is similar, performance on the follow-up question will be degraded. This is called "forward interference." "Backward interference" can also occur. For instance, in complex questions, if we asked a third question of a similar nature, as one might do in a bridging or overall assessment question, we would again have problems, because the third question requires the respondent to retrieve information requested by the first question as well as by the second. The similarity of material in the second question interferes with the respondent's memory of the considerations he or she used to answer the first question; since this material is needed to answer the third question, the third answer is compromised. 
Another exception deals with capacity. It seems that there is a limit to how much we can ask about a subject at any one time. People's performance starts to degrade after retrieving information on 20 or 30 similar items, even though hundreds or thousands of items are stored in their memory. If exhaustive recall is required, the question writer should break up the topic into several questions and space these questions out. The corollary to question and appropriate response similarity also holds. That is, if the appropriate response is dissimilar to how the respondent has learned to answer, performance will be degraded. To get around this, the style of question presentation should be changed or other cues included to let the respondent know that the appropriate response is different from that which was previously learned. However, regardless of the presentation switch, the question stem, text, and format should be in agreement and consistent within the question. REMEMBERING FREQUENCY AND TIME OF OCCURRENCE ---------------------------------------------------------- Chapter 8:3 To measure frequency or time of occurrence, question writers need to relate the information about an event or series of events to a date or a specific time period. Questions measure frequency in two ways. They ask how many times the event occurred in a referenced time period (such as March 1 to April 15). This is called the "frequency method." Or they can ask when the event occurred and how much time elapsed before the next event occurred. This is called the "interval method." Interval measures often provide higher estimates, particularly if the reference periods are short and the period over which the measure is generalized is long. Recall will be more accurate if the reference period (that is, the period for which data are requested) is short and the time gap between the reference period and that of recall is also short. However, as a practical matter, writers usually want the time periods to be as long as possible and often to extend back for a long time so as to efficiently capture as many events as possible. They also like to be able to deal with long and variable, different, or nonstandardized gaps between the reference period and that of recall. But these requirements are incompatible with memory performance. Hence, the time periods are usually chosen on the basis of a trade-off between data accuracy, data-gathering efficiency, and data representativeness. The extent to which we can stretch the reference period and reference period reporting gap depends on the extent to which the events are salient or repetitive. Salient events have been defined as events that are unusual or have significant economic and social costs or benefits to an individual. Events that have continuing consequences, such as President Kennedy's assassination, have been likened to snapshots by means of which exact details of the moment are remembered. Hospitalization, marriage, and car purchases are other significant events for which people have a high level of recall. Although highly salient topics are less likely to be forgotten, they tend to be remembered as having occurred more recently than they actually did (this is called "forward telescoping"). Conversely, events that are less salient will be thought of as having occurred less recently ("backward telescoping"). For questions about the frequency or timing of salient events, respondents should be asked to report on events that occurred during the last 3 months. 
Periods of up to 6 months or a year have also been used. These longer periods help minimize telescoping. If telescoping becomes a problem, there is still another approach that will help reduce it. First, we can ask that people recall the time period prior to the reference time period. This will capture the telescoped event. Then we can ask for a recall of the reference time period. This is called "bounded recall." For events of intermediate saliency, about 1 month is an acceptable compromise. These time periods seem to provide the best trade-off for balancing omissions caused by forgetting and errors caused by incorrectly remembering an event against an efficient and representative time period. Getting frequency data for repetitive events poses another type of problem. To get reasonably accurate data, question writers should use time periods of a few days or a week. However, while these periods may be more accurate, they are too short to be representative. If we want representative data, the periods should be from 2 to 4 weeks. These are less accurate but they will provide data that are more representative. For many purposes, the accuracy will be good enough to get a general idea of a pattern of events. Respondents appear to use a different recall strategy for longer reference periods. They use their generic memories. That is, they report what usually happens, not what actually happened. Paradoxically, this may be more representative of normal experience. Finally, if the event is neither salient nor repetitive, the time period should be very short, one day or at best a few days. Even at that, the recall accuracy may not be very good, and other methods should be used. In summary, questions need to be asked in ways that help the respondents access their memories most efficiently and accurately and in ways that reduce their memory error. Question writers must consider the short-term memory bias introduced by position, emphasis, and complexity or by the simplicity or similarity of the preceding text or succeeding answers. They must know how the choice of time references, the saliency and repetitive nature of events, and the level of detail requested affect the accuracy of reporting. And, finally, they should take into account the limitations of a respondent's memory--that is, the types of events and time periods for which recall is usually very poor. MINIMIZING RESPONDENT BIAS ============================================================ Chapter 9 The previous chapter discussed the response inaccuracies that can occur when evaluators inadvertently ask biased questions. Bias can also occur in the responses to questions because of a respondent's style in answering, such as the tendency to agree regardless of the issue, or because respondents perceive the questions as personally intrusive, objectionable, or threatening. This chapter discusses question writing techniques that help reduce or avoid these response distortions. RESPONSE STYLES ---------------------------------------------------------- Chapter 9:1 Response styles, or biases, have been defined as the tendency to respond in certain ways regardless of a question's content. Response styles vary considerably with the behavior in question and the conditions. For instance, respondents are more likely to answer questions about their education than their income. They are more likely to underreport problems about work while they are at work than while they are at home.
They are likely to underreport behavior that is socially undesirable, especially if the behavior is presented in the extreme. Hence, question writers must be aware of response-style distortions and the ways to account for or counterbalance them. Conversely, respondents may select socially desirable answers over other choices. Socially desirable responses represent culturally accepted norms for opinions and behavior. Many people give socially acceptable answers about library card ownership, reading habits, charitable giving, and voting behavior. Occupation questions frequently provide another opportunity for this kind of bias; occupational checklists with little or no explanatory detail invite overstatement. For example, shipping clerks may check the job category "traffic manager," a position that can imply substantially more responsibility. Here are two ways to reduce overreporting or overstatement of socially desirable responses. 1. The question writer should ask specific questions. For example, the shipping clerk will be reluctant to check "traffic manager" if answers to detailed questions about job responsibilities are required. 2. The question writer should split a single question containing a socially desirable response into two or more items. Respondents are more likely to answer truthfully about verifiable behavior. A series of questions can also provide a respondent a "face-saving" escape. Although the behavioral question may not permit the respondent to give the most socially desirable response, topic awareness, knowledge, and other items may. An example from a GAO audit illustrates these approaches. The Food and Drug Administration requires chemical testing and inspection. Simply asking chemists whether they can do chemical tests could lead them to overreport their ability to do so. We did ask this question but, to assess the extent of overreporting, another question measured how much preparation they would need to do the tests. See figure 9.1. Figure 9.1: Question to Reduce Overreporting (See figure in printed edition.) By taking the two questions together and interpreting the responses to both questions ("Can you do these tasks?" and "How much preparation do you need to do them?"), we could estimate overstatements of socially desirable alternatives. MAKING A GOOD IMPRESSION -------------------------------------------------------- Chapter 9:1.1 Respondents like to make a good impression. A study on personal bankruptcy illustrates the point. Individuals were asked to rate a list of factors on the extent to which each contributed to their financial problems. The response "took on too many debts at one time" was underreported, and "credit was too easy to get" was overreported. To help overcome this tendency, do not place the sensitive items in prominent positions but list them midway in a checklist of several other plausible choices in a matter-of-fact manner. This approach can help respondents place response options in an objective frame of reference. Also, analyze the under- and overreported categories together. For example, "took on too many debts at one time" and "credit was too easy to get" were actually two sides of the same coin. EXTREME POINTS OF VIEW -------------------------------------------------------- Chapter 9:1.2 Some people do not want to be categorized as holding an extreme point of view, even though they may feel strongly about an issue. When people are presented with three choices (for, neutral, and against, for example), they tend to select the middle category.
To counteract this tendency, question writers can extend the scale to include more category ranges (definitely pro, more pro than con, neutral, more con than pro, definitely con). However, some people select choices that represent extreme points of view regardless of the topic. Providing more category ranges (such as five or seven responses), organizing related topics so they are considered as a group, and providing adequate text to describe the categories (called "anchoring") help reduce a bias toward extremes. ACQUIESCENCE -------------------------------------------------------- Chapter 9:1.3 Because some respondents demonstrate the tendency to agree, writers should limit the use of agree or disagree questions. Besides offering the opportunity for a "yea saying" bias, they provide limited information. A more detailed discussion of those points and other problems associated with agree or disagree or Likert scales is presented in chapter 4. HIGHLY SENSITIVE ITEMS ---------------------------------------------------------- Chapter 9:2 As mentioned in chapter 5, highly sensitive questions should be written with care and should be used only when the information is vital to the evaluation and cannot be otherwise obtained. Personal questions, such as data on income, sex, marital status, education, and race, may be perceived by some respondents as intrusive and should be included only if necessary. Also, socially undesirable conditions, such as being unemployed or going bankrupt, may cause respondents discomfort. Other types of questions that can be perceived as threatening are usually highly specific to the topic under evaluation and the respondent's characteristics. Examples include surveying private industry officials about their bidding strategies, asking employees to assess the management of their agency or company, and asking self-evaluation questions such as "How would you rate your job performance compared with that of others?" Questions that could ask respondents to legally incriminate themselves should probably be reworded to remove this threat. Before using sensitive items, the questionnaire writer needs to consider several questions: Can I get the answer I need through an archival source? How many people might not respond? Is the occurrence rate for the particular behavior or condition so low that asking for the data is not worthwhile? And how will the sensitive question affect GAO's image among respondents and the public? Having decided that sensitive items are necessary, the question writer should use the following guidelines to reduce underreporting and answer bias. 1. Explain to the respondent the reason for asking the question. 2. Make the response categories as broad as possible. 3. Word the question in a nonjudgmental style that avoids the appearance of censure or, if possible, make the behavior in question appear to be socially acceptable. 4. Present the request as matter of factly as possible. 5. Guarantee confidentiality or anonymity, if possible. 6. Make sure the respondent knows the information will not be used in a threatening way. 7. Explain how the information will be handled. 8. Avoid cross classification that would pinpoint the answers. For example, when evaluators ask questions about income, respondents should be asked to choose from a list of income ranges rather than to enter specific dollar amounts. The income ranges should be appropriate for the target population and broad enough to afford the respondent a feeling of privacy. An example is in figure 9.2. 
Figure 9.2: Question With List of Ranges (See figure in printed edition.) A series of questions and an indirect approach can defuse the threat of asking about behavior that may be considered socially undesirable. For example, suppose that the evaluators need to find out about the job-hunting activities of the unemployed. The question series might be developed like the one in figure 9.3. Figure 9.3: Series of Indirect Questions (See figure in printed edition.) Notice that items 1 and 2 recognize that looking for work is often difficult. This puts the respondent at ease, reduces the threat of revealing possibly embarrassing information, and makes not looking for work socially acceptable by providing very good reasons. This minimizes overreporting of looking for work. Using a specified time reference can reduce a question's threat. For instance, if evaluators need to find out whether people are coming in late for work, the question writer can ask, "Were you more than a few minutes late for work this morning?" rather than "Are you usually late for work?" This is because people are more apt to admit to a single offense than to being habitual offenders. The threat of some topics can be reduced if the rationale for asking the question is provided. For example, GAO wanted to send questionnaires to its disabled employees who needed the services of a federal program for the handicapped. The questionnaire's purpose was to assess the employees' work conditions and opportunities. The only way to identify people who needed handicapped-program services was to contact all employees who reported a disability to the agency when they were hired. However, many people consider a disability a private matter and might hesitate to answer the questionnaire. To encourage responses, we explained exactly why GAO management needed the information and how it would be used. Although we always state a survey's purpose, we explained this one more completely. An example provided in chapter 8 illustrates another approach for potentially threatening questions. Private industry officials were asked to comment on competitive bidding strategies. To reduce the question's threat, we wrote the various bidding strategies (conservative, realistic, and optimistic) carefully in a way that eliminated biasing terms such as "liar's dice." In addition, the question gave equal attention to all strategies, even though only one strategy was critical to the survey. Still another way to reduce threat is to transfer or remove blame. For example, a questionnaire administered to a grief-stricken and guilt-ridden parent of a child with Reye's syndrome might ask, "Did your child take aspirin?" rather than "Did you give your child aspirin?" Another technique to minimize overreporting of desirable behavior, such as conducting compliance audits or voting, is to ask the respondent for concrete details associated with the desirable behavior. For example, we might ask, "If you conducted a compliance audit, please write the date, title of the audit, and name of principal auditor in the space provided. If not, check 'no audit conducted' and skip to 19." Similarly, we might ask voters who voted to "List the address of the polling place"; those who did not would check "Have not voted" and skip to 19. There are also some ingenious ways of minimizing underreports of undesirable behavior such as failing to report all taxable income to the Internal Revenue Service. Two examples are the "randomized response technique" and the "list technique."
Both methods use a known (or separately determined) probability to mask the respondents' answers in such a way that the respondents can readily see that they are protected. Both methods still allow for population estimates. Briefly, the methods work in the following ways. In applying the randomized response technique, we ask the respondent to report only whether the answers to the following two questions would be the same or different: "Were you born in this month?" and "Did you fail to report all your taxable income when you filed this year?" (That is, if your answer would be NO to both questions or YES to both questions, check "same"; otherwise, check "different.") Since we know the population probability of being born in a given month, we can calculate the population proportion that underreported income, but we cannot identify with certainty any individual who did so. As with the randomized response technique, the list technique uses a population probability that can easily be determined to mask the certainty of the respondent's admission. For instance, the list technique might ask whether any one of a number of infrequent events happened this month. For example, "Did you have a birthday, get a parking ticket, get a promotion, buy something that cost more than $300, underreport your taxable income, or take two or more plane trips this month? (Answer YES if any of these events happened this month.)" Half the sample would be asked the question with the illegal event included in the list and half without it. The proportion of cheaters is calculated from the difference between the two groups. However, a word of caution is in order. While these methods work and have been used to estimate behavior such as heroin usage, they have a downside. First, they are costly and more difficult to implement. Second, they need a larger sample size than other methods. Third, they should be undertaken only under the guidance of a skilled practitioner who is familiar with them.
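To make the arithmetic behind these two techniques concrete, the following sketch shows how the masked answers can be turned back into a population estimate. It is a minimal illustration only, not a prescribed GAO procedure: the function names and figures are hypothetical, the randomized response estimator assumes that the unrelated question (month of birth) is independent of the sensitive behavior, and the list technique is shown in its common item-count form, in which respondents report how many of the listed items apply to them and the difference between the two half-samples estimates the prevalence of the sensitive item.

  # Sketch of the estimation arithmetic (hypothetical names and figures).

  def randomized_response_estimate(prop_same, p_unrelated):
      """Estimate the prevalence of the sensitive behavior.

      prop_same   -- observed proportion of respondents answering "same"
      p_unrelated -- known probability of YES on the unrelated question
                     (about 1/12 for "Were you born in this month?")

      Assuming independence, P(same) = p*pi + (1 - p)*(1 - pi); the line
      below solves that relation for pi, the prevalence we want.
      """
      if abs(2.0 * p_unrelated - 1.0) < 1e-9:
          raise ValueError("p_unrelated must differ from 0.5")
      return (prop_same - (1.0 - p_unrelated)) / (2.0 * p_unrelated - 1.0)

  def list_technique_estimate(mean_with_item, mean_without_item):
      """Estimate prevalence from the item-count form of the list technique.

      mean_with_item    -- mean number of items endorsed by the half-sample
                           whose list included the sensitive item
      mean_without_item -- mean number endorsed by the other half-sample
      """
      return mean_with_item - mean_without_item

  # Hypothetical results: 80 percent answered "same"; the half-sample that
  # saw the sensitive item endorsed 1.95 items on average versus 1.80.
  print(round(randomized_response_estimate(0.80, 1.0 / 12.0), 2))  # about 0.14
  print(round(list_technique_estimate(1.95, 1.80), 2))             # about 0.15

Because each individual answer is masked, neither calculation can be traced back to a particular respondent; only the aggregate proportions carry information about the sensitive behavior.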
MEASUREMENT ERROR AND MEASUREMENT SCALES IN BRIEF =========================================================== Chapter 10 The questionnaire is an instrument used to take measures. Virtually all instruments cause measurement error. For most physical measures, this measurement error is combined with all other types of error (for example, sampling error) to determine the total error. But questionnaire measures differ from physical measures because the errors due to the instrument itself and to misspecification of variables are seldom determined. This is because such determinations require laborious analysis and extensive testing and retesting, so that in many cases making them is simply impractical. Hence, the convention is to report only sampling error and ignore the other sources of error, which in most cases are probably larger than the sampling error. The most practical way to address this problem is to use the guidelines presented in this transfer paper because they were specifically developed to minimize questionnaire measurement error. To make full disclosure, you should publish the questionnaire along with the sampling error so that report readers can get some idea of the quality of your measures. Broadly speaking, there are two kinds of measurement error: bias and random error. Bias, sometimes called "systematic error" or "inaccuracy," occurs when respondents consistently underreport or overreport by a constant amount or range of amounts. For example, the phrasing of a questionnaire item about income may cause respondents to fail to include a particular category of income and consistently underreport. However, some surveys consistently overestimate. For instance, in some surveys the real level of unemployment is overstated because of the way the questions categorize people who are in transition between jobs. The second kind of measurement error is called "random error" or, sometimes, "chance error," "unsystematic error," "noise," or "imprecision." Respondents may react to a vaguely worded question in many different ways, some providing an answer that gives less than the true value and others an answer that is greater. For example, we may want to know how many times a person visited a physician in the last year. If we asked, "How often have you sought health care?" our data would probably contain much random error. Some people might count visits to a podiatrist or a chiropractor and others might not. Some might count phone contacts, while others might count only office or hospital visits. Some might count a visit to a resort containing mineral springs. When a question is not precise about the information wanted, there is much opportunity for random error. For mail questionnaires, every respondent reading the form should interpret each of the questions the same way. Also, every question should be designed to minimize the biases that both the questionnaire and the questionnaire respondent place on the answers. This is why the preceding chapters emphasized the need for structure, the need for pretesting, and the need to consider the effects of format, appropriateness, qualifications, clarity, memory, and respondent bias. MEASUREMENT SCALES --------------------------------------------------------- Chapter 10:1 In chapter 4, we discussed how different formats permit different levels of measurement. In selecting a question format, evaluators should think ahead to the point at which they will have finished data collection and will be starting their analysis. They should try to use the level of measurement or scale that will let them use the preferred statistical techniques without prohibitively increasing costs or respondent burden.\1 -------------------- \1 In chapter 4, we talked about categorizing, ranking, rating, equal interval, and ratio scales. In the first edition of this paper, we discussed these scales--called nominal, ordinal, interval, and ratio scales--in detail. But since then, GAO has published Quantitative Data Analysis: An Introduction, GAO/PEMD 10.1.11 (Washington, D.C.: June 1992), which deals extensively with this topic. Readers not well grounded in the use of these scales are referred to that document. EQUAL-APPEARING INTERVALS --------------------------------------------------------- Chapter 10:2 Frequently, evaluators make observations on a variable for which the scale naturally has many small categories, but they choose to use a coarser scale. For instance, people might be reluctant to tell their income (a fine scale), but they will tell if their income falls into a certain broad category. When using this technique, evaluators should try to make all the categories the same size. For example, the category "from $15,000 up to $20,000" is the same size as the category "from $20,000 up to $25,000"; both measure money in $5,000 increments. Another example of the connection between the questionnaire format and the measurement scale can be seen in the Likert questions discussed in chapter 4.
The Likert format has five broad categories. Should it be considered an interval scale, so that analysts can use the statistical techniques for interval data, or should it be considered only ordinal? The categories (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree) do not necessarily have equal intervals. In the Likert format, we almost always treat the information as ordinal or ranking data. However, for some of the other intensity scales discussed in chapter 4, we can make a better case for an interval interpretation. For example, there may be some evidence that "generally satisfied" falls three quarters of the way between "very dissatisfied" and "very satisfied." However, even in such situations, it is usually best to show the proportion in each group and consider the category information as ranking data. Since rating categories treated this way do not give much information, we sometimes make an additional effort to qualify the rating as quasi-interval data. When we do this, we call these categories "equal-appearing intervals," because, as best we can tell, the intervals appear to be equal. The equal-appearing interval formats use words, numbers, proportions, and behavioral anchors to make intervals that appear to be equal. For example, we could assume that "somewhat difficult" falls one fourth of the way between "no difficulty" and "extremely difficult." However, such assumptions are very hard to justify. When making rating category scales, evaluators should be very careful to assign them on the basis of their knowledge of the variable in question, the literature, past experience, and pretest results. Sometimes it is a good idea to conduct a special study to verify assumptions. When uncertain about the assumption, evaluators usually treat the observations as ordinal data. If the assumptions are reasonable and the conditions are right, they sometimes treat attitude measures like "satisfaction" as interval data. ORGANIZING THE LINE OF INQUIRY =========================================================== Chapter 11 As respondents begin their questionnaires, they discover the special language and the rules of the game, such as "skip to," "check one box for each row," and "if dissatisfied, go to question . . . ." This chapter suggests techniques for organizing a collection of questions into a well-designed instrument structured to elicit valid answers and to make the respondents' task easier. For example, several specific questions preceding a broad one can help respondents understand the range of factors to consider in making an overall judgment, and hard questions can elicit better responses if they are placed about a quarter or three quarters of the way through a long survey rather than at the beginning or the middle. SETTING EXPECTATIONS --------------------------------------------------------- Chapter 11:1 A set of instructions precedes the questions themselves. The instructions prepare respondents for the question-answering task in several ways: 1. They set a framework by identifying the data-gatherer, stating the purpose of the questionnaire, and describing the range and type of information needed. 2. They motivate respondents to answer by explaining the questionnaire's importance and relevance and protections of confidentiality or anonymity. (The pledge of confidentiality is discussed in chapter 14.) 3. 
They provide respondents in advance with some basic information, such as whether to designate answers by check marks or narrative responses, how long it usually takes to complete the form, and whether estimated or exact amounts are necessary. SEQUENCING QUESTIONS --------------------------------------------------------- Chapter 11:2 The instructions cause respondents to expect certain types of questions, and the sequence of questions should fulfill these expectations. Items should be presented in a sequence that is logical to the respondents. Frequently, the sequence mimics the flow of the process or condition under investigation. For example, in a study of printing industries, we would ask managers of firms for a description of a plant before asking for cost figures and ask for a description of equipment before inquiring about production data. If the natural or chronological flow of a topic is followed, the evaluators stand a better chance of helping respondents recognize and recall the information they need. USING SUBTITLES AS CUES --------------------------------------------------------- Chapter 11:3 Related items that are grouped and accompanied by subtitles help the respondents quickly grasp the scope and nature of the inquiry. Organizational flow and cuing are also enhanced if the individual items within the group unfold in a meaningful order. For example, in a GAO evaluation on how personal bankruptcy cases were handled, we grouped the questions in accordance with the bankruptcy process and gave each grouping a subtitle. The first subheading was "bankruptcy proceedings." The questions under this subheading followed logically: "Under what name was the bankruptcy filed?" "Who filed the court papers?" and so on. CHOOSING AN OPENING QUESTION --------------------------------------------------------- Chapter 11:4 The opening question should be interesting and highly salient to the topic, in order to capture the respondents' attention and demonstrate that their opinions are needed in key areas. It should introduce the language and rules of the questionnaire. Potentially objectionable and threatening questions should be avoided as initial questions. If possible, the opening item should apply to all the respondents. Questions with such response options as "do not know" should be avoided. Respondents may feel uncomfortable about not being able to answer initial items or may question the relevance of the form to them. However, in some instances, initial questions are used to determine whether respondents fit certain criteria and should complete the entire form. Respondents who do not meet the criteria should be thanked for their cooperation, told why their answers are not needed, and reminded to return their forms so that the population can be counted accurately. The following example illustrates how ineligible respondents might be notified: "THIS SURVEY ASKS ONLY ABOUT CHILD CARE FOR CHILDREN UNDER 12. IF YOU DO NOT HAVE CHILDREN IN THIS AGE RANGE, DO NOT CONTINUE. THANK YOU VERY MUCH FOR YOUR HELP. PLEASE RETURN THIS QUESTIONNAIRE SO THAT WE CAN MAKE SURE WE ARE COUNTING YOUR RESPONSE IN OUR OVERALL POPULATION ESTIMATE." A questionnaire should not be started with a broad or difficult question that will require a narrative response. Such questions require considerable effort to answer adequately. Also, the respondents have not yet learned enough about the information needed and may not provide the range and depth of data wanted.
Sometimes trade-offs between question salience and ease of answering have to be made. In a survey of members of health maintenance organizations, a question asking individuals to rate their reasons for joining their plans would have been a natural starting point, but it could not be used as an opening question because of its complexity. Demographic questions are usually placed near the end of a questionnaire if they may be perceived as highly personal and as perhaps less important to the questionnaire's purpose. However, if this is not the case, they may make a good starting question because they are easy. Also, the placement of demographic items depends on the topic and the audience. For example, military personnel are accustomed to providing rank and grade first. Also, if the demographic items seem less relevant to the questionnaire topic, the questionnaire designer may want to explain why this information is needed and how it will be used. OBTAINING COMPLEX DATA --------------------------------------------------------- Chapter 11:5 Because a mail questionnaire is self-administered, it must be designed so that all or almost all respondents can faultlessly follow its instructions and feel that the form is easy to complete. For example, we surveyed congressional offices to measure their use of reports that show federal funding by geographic area. The reports provided information at various levels of detail (state, county, subcounty) and for a variety of data categories (individual programs, general functional areas, and so on). We needed to determine congressional use not only of geographical and funding categories but also of each particular combination (such as program data at the state level). In total, we needed 288 separate answers. Figure 11.1 shows how we broke down a complex question into individual items that would be easy to answer and that were sequenced logically. "Skip" and "continue" instructions accompanied each item and were set off in distinctive type to help respondents follow the item sequence. Figure 11.1: Sequence of Questions Obtaining Complex Data (See figure in printed edition.) USING TRANSITIONAL PHRASES --------------------------------------------------------- Chapter 11:6 Sometimes the respondent's task can be made easier by providing general information about the questions that will follow. Often, such text accompanies a subtitle and is used to alert the reader to a topic change. For example, in a survey of program managers of major weapon systems, a section of the questionnaire dealing with accountability was followed by a section dealing with the operating environment of acquisition personnel. Since this was a topic change, a few lines of explanatory text were included to distinguish this section from the previous one. Transitional phrases may be particularly necessary if a series of complex questions covers several pages. For example, in a survey of state coordinators for the mentally disabled, six pages were devoted to lengthy rating questions on the extent to which various federal programs encouraged or discouraged the deinstitutionalization of disabled populations. A few lines of text accompanied the section's subtitle, in order to explain the focus of the question series: "FEDERAL PROGRAMS Various federal programs provide institutional or community services to the mentally ill or mentally retarded. 
In the next series of questions, we ask you to tell us to what extent, if at all, various aspects of these programs currently encourage or discourage deinstitutionalization of the populations." Warning respondents about a lengthy series of questions increases the number of items that will be responded to, because the respondents know each item will address a different program aspect. Transitional phrases may also help respondents take a neutral point of view when making judgments. In a survey of an agency's employees in the field, respondents were asked to rate the benefits of rotation from a personal perspective and from the agency's perspective. To assist the respondents, transitional phrases were used. For example, after asking employees to rate rotation benefits from the agency's point of view, we wrote, "Now forget the office for a moment. How much do you think you would benefit personally from a rotational policy?" PUTTING SPECIFIC QUESTIONS BEFORE OVERALL JUDGMENT QUESTIONS --------------------------------------------------------- Chapter 11:7 Usually, overall judgment questions seek to obtain an opinion that considers and weighs many factors. To get these considerations, it is best to precede the overall question with specific questions and items that deal with the facts, considerations, opinions, and experiences on which the overall judgment is usually based. For example, a question on overall job satisfaction should be presented after the respondents have been asked to give separate consideration to the many factors that affect their job satisfaction: salary; benefits; job duties and responsibilities; supervisor, employee, peer, and subordinate relationships; career potential; and so on. A reversal of this order may prompt an unconsidered initial response that colors the responses to the following, more specific queries. Subjects do not like to be inconsistent and will bias their subsequent responses to be consistent with their initial "top of the head" response. PUT FILTERS BEFORE SPECIFIC QUESTIONS ------------------------------------------------------- Chapter 11:7.1 About half of the respondents who do not have an attitude, observation, experience, or knowledge about the topic will answer detailed questions as if they did. Answers from uninformed respondents cause error and may lead to false conclusions. One way to guard against this is to filter them out. Before the line of questioning begins, a leading question might first ask whether the respondents have an opinion and then ask the extent to which they have considered the issue, giving operational definitions and anchors for what counts as a well-considered issue. Other alternatives are to ask whether the respondents have had relevant experience, along with the time and place of that experience, or to ask about the respondents' qualifications or role. These questions filter out and distinguish "no basis to judge" responses from uninformed answers. THE INFLUENCE OF CONTEXT AND ORDER ------------------------------------------------------- Chapter 11:7.2 In organizing a questionnaire, the designers must realize that its context and order can affect responses to individual items. Context and order can reduce or increase response error, depending on whether they facilitate or interfere with the cognitive process. And while it is sometimes possible to assess the potential for such effects, in practice it is more difficult to predict their likelihood and whether these effects will help or hinder.
Contextual effects are usually seen in three ways: they influence the way the respondents interpret the question, the way they consider the tasks to be performed, and the way they may erroneously manipulate their answers to be consistent or different. CONTEXT AFFECTS INTERPRETATION ----------------------------------------------------- Chapter 11:7.2.1 Context cues influence the way the respondents make sense of individual questions and the way their recollection of previous questions shapes what they believe is and is not to be included in answering the current question. For example, in a survey on germ plasm, the contextual presentation ensured that the term "landraces" was always interpreted as traditional varieties; likewise, in a survey of businesses, "marginal" was always interpreted as "borderline" rather than as referring to money placed on deposit for the purchase of stocks or to unit costs after production costs have been met. Also, as we have seen in the section above, putting specific questions before proceeding to the general helps the respondent recall and assemble the data needed for an informed judgment. However, the tendency to respond to contextual cues can also have negative effects on the responses. For example, in one study we had several different questions that were introduced with qualifications and conditions under which the questions were to be considered. Unfortunately, these qualifications were quite similar. The respondents, influenced by the similarity of the introductory qualifications, thought all the questions were also the same. They also failed to see the differences among the succeeding questions and did not answer them. There are also some situations in which the question writer wants the respondent's "gut" reaction, first impulse, or unconsidered or unguarded response. The writer must then try to isolate the respondent from contextual cues, such as a specific-to-general organization, that may interfere with initial and unconsidered answers. Item interaction can occur even though various aspects of a topic are given equal attention. Inquiries that ask people to evaluate a topic from both a personal perspective and someone else's perspective might be difficult for respondents to answer neutrally, but the order of the questions could help. As we mentioned earlier, when we surveyed an agency's field staff about rotation, we asked first about the benefits of rotation from the agency's point of view and only then about the benefits from the respondents' point of view, in order to obtain answers that were as objective as possible. In some cases, interaction is associated with judgmental questions in which normative values play a role. In other cases, interaction may stem from how the scope of a general question is defined. Examples from the survey research literature can illustrate these points. If we were to ask respondents to report their degree of support for the rights of workers to strike and the rights of management to lock workers out, we would get different endorsement proportions, depending on how we sequenced the two questions. Endorsement for lockouts will be slightly higher if we ask first about a worker's right to strike. It is suspected that people use a norm of equal treatment--if workers have a right to strike, business has a right to lock them out. The effect of context cues of this type can also be difficult to predict.
For instance, in two separate studies involving victims of crime, the factual questions about the crimes were asked before attitude questions because it was desirable to have the attitudes reflect consideration of the specific crimes experienced. Pretests showed that attitudes about crime did not change regardless of whether these attitude questions were asked before or after the crime-reporting incident questions. However, contrary to expectations, the crime reports were more accurate if attitude questions were asked first. Apparently, the attitude response helped cue the memory search needed to report the victimization experience. CONTEXT AFFECTS PROBLEM SOLVING ----------------------------------------------------- Chapter 11:7.2.2 A second type of contextual influence affects respondents as they work through the tasks a questionnaire sets for them. The respondents, perhaps subconsciously, consider the order in which things are presented, how things are related, and the extent to which items receive greater or lesser emphasis according to the rules of conversational English. For instance, if the questionnaire has a number of questions about shop safety and then reintroduces the topic of safety in another part of the questionnaire, the respondents may not consider the subsequent questions as pertaining to shop safety. They will think these questions relate to safety in some other environment because, they will reason, if they did pertain to shop safety, they would have been presented with the earlier questions. Respondents include or exclude according to their experience with conversational English. For instance, if two qualifications or considerations are presented in serial order, the respondents will consider the second to be the more important. Two items to be considered in contrast should be paired with a conjunction like "but," not with words like "even so," which may not be seen as denoting contrast because such words are not usually part of conversational English. Two paired statements will often be seen as mutually exclusive. An example is "How is the morale of your work unit?" followed by the question "How is the morale of your organization?" In answering the latter question, many respondents will exclude their work unit from consideration when answering about the organization. Their rationale seems to be that we would not ask about their work unit separately if we intended it to be grouped with the organization. People also group or differentiate things according to their experiences. They exclude items if they are not used to seeing them together, regardless of the logic of the question writer. For instance, they usually exclude the extremes of a classification. They also set boundaries according to their own narrow range of experience. For example, to the citizens of Cumberland, Maryland, "Kelly," the local automobile tire company, was a big tire manufacturer. But to the citizens of Akron, Ohio, the home of Firestone and Goodyear, Kelly was a little company because they had never heard of it. CONTEXT AFFECTS WHAT IS INCLUDED IN AND EXCLUDED FROM CONSIDERATION ----------------------------------------------------- Chapter 11:7.2.3 A third type of effect influenced by context and order is governed by the respondents' need to be consistent. To illustrate this, we can consider a previous example: "To what extent, if at all, are you satisfied or not with your job?"
The question was followed by a series of 14 questions that asked about job satisfaction, taking into consideration such factors as salary, benefits, supervisor relationships, collegial relationships, and physical work environment. The first question is a broad and general question that might be answered off the top of the head. If this happens, respondents will bias all their successive answers to the more detailed questions because they do not want to appear to be inconsistent with their first answer. Conversely, respondents will sometimes erroneously exclude certain groups from consideration if the comparison contrast is very high, regardless of their need to be consistent. For example, in an attitude survey on driving, adult drivers excluded teenaged drivers when giving favorable assessments. In another study, a question on abortion on demand was less favored when preceded by a question on abortion in cases of rape and incest. This was presumably because the rationale for demand seemed more trivial when compared to the rationale for rape or incest. In a study of harassment in the military academies, male cadets may answer questions about their own harassment experiences quite differently from how they would answer if these questions were preceded by inquiries concerning harassment of female cadets. Moreover, some may exclude the females from consideration because they see them as very different from the rest of the body of cadets. In designing a questionnaire, it is important to consider every part of the questionnaire where context and order can influence meaning, respondent problem solving, inclusion or exclusion, or contrast. After locating these areas of sensitivity, it is essential to conduct pretests. This is because the potential threats are not always realized and, when they are realized, the effect may not always go in the predicted direction. ANTICIPATING RESPONDENTS' REACTIONS --------------------------------------------------------- Chapter 11:8 Except with very short forms, the attention, interest level, and effort of respondents fluctuate throughout the completion of a questionnaire. As respondents begin, they may be somewhat wary and uncertain. Specific expectations have been raised by the transmittal letter and the instructions. Also, self-administered questionnaires resemble a test-taking situation in many respects. Respondents may wonder, "Can I follow the directions?" and "Where and how do I record my answers?" If the opening items are easy and nonthreatening, respondents become involved in the task and learn how to handle the format. About one fourth to one third of the way through a form of average length, the respondent's interest and motivation are at high points. Complex items or questions that are critical to the survey can be introduced. Midway through the form, the respondent's attention and interest may waver. Less-demanding and less-critical items should be given at this point. Approximately three fourths of the way through the form, the respondent's effort and attention probably rise again. This accompanies a feeling that an investment has been made and what has been started should be completed. At this point, additional demanding and critical questions can be asked. Although this pattern of reaction may not always occur, it is applicable to many GAO forms, which tend to be moderately to very long. 
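The attention pattern just described can be turned into a rough mechanical check when a draft questionnaire is being assembled. The sketch below is a hypothetical helper, not a GAO-prescribed tool: the item list, the flag marking demanding items, and the approximate high-attention zones (about one fourth to one third of the way through the form and about three fourths of the way through) are all assumptions made for illustration.

  # Hypothetical draft item list; True marks a demanding or critical item.
  draft_items = [
      ("Q1", False), ("Q2", False), ("Q3", True), ("Q4", False),
      ("Q5", False), ("Q6", True), ("Q7", False), ("Q8", False),
      ("Q9", True), ("Q10", False), ("Q11", False), ("Q12", False),
  ]

  # Approximate high-attention zones, as fractions of the way through the form.
  HIGH_ATTENTION_ZONES = [(0.25, 0.35), (0.70, 0.85)]

  def flag_poorly_placed(items):
      """Return demanding items that fall outside the high-attention zones."""
      flagged = []
      for index, (name, demanding) in enumerate(items):
          position = (index + 1) / len(items)  # fraction of the form completed
          in_zone = any(low <= position <= high for low, high in HIGH_ATTENTION_ZONES)
          if demanding and not in_zone:
              flagged.append((name, round(position, 2)))
      return flagged

  print(flag_poorly_placed(draft_items))  # flags Q6, the demanding item at the midpoint

A flag of this kind is only a prompt to reconsider an item's placement; as noted above, the pattern of rising and falling attention does not always hold, and pretest observations remain the better guide.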
FOLLOWING QUALITY ASSURANCE PROCEDURES =========================================================== Chapter 12 The quality of questionnaires can be checked by several methods, some of which are carried out during the design phase and others during the data collection or analysis phase. During the design phase, the questionnaire should be pretested on selected persons who represent the range of conditions likely to influence the evaluation's results. The questionnaire should also be sent out for review by experts who are familiar with both the issue area and the respondent group. Pretesting and expert review are some of the best ways to ensure that the instrument actually communicates what it was intended to communicate, that it is standardized and will be uniformly interpreted by the target population, and that it will be free of design flaws that could lead to inaccurate answers. Validating, verifying, or corroborating responses; conducting reliability studies; and analyzing nonresponses are also important aspects of GAO's quality assurance effort. These tasks, which are conducted during data collection and analysis, are described in detail following pretesting and expert review. PRETESTING --------------------------------------------------------- Chapter 12:1 By testing the questionnaire before it is distributed, evaluators can assess whether they are asking the right group of people the right questions in the right way and whether the respondents are willing and able to give the evaluators the information they need. Pretests are conducted with a small set of respondents from the population that will eventually be considered for the full-scale study. If respondents in a pretest have difficulty in responding or supplying information, it is likely that similar problems will arise in the full-scale study. If pretesting the questionnaire indicates that there is a low likelihood of obtaining accurate factual data sufficient for answering the assignment's objectives, troublesome questions should be dropped or other techniques for data collection should be pursued. Basically, pretests ask the following questions: 1. Is the content or subject matter of each question relevant to the respondent? Does the respondent have the experience and information to answer the question? 2. Are item-wording, phrasing, and other question construction components adequate to ensure that sound results will be obtained? Does the respondent understand the information request as it was intended? Are the response choices appropriate and comprehensive? Should the question be more specific? Is the time period suitable? Do filter questions and skip instructions work as planned? Are the instructions clear? Are transitions between sections smooth? How difficult is the questionnaire for the respondent? How long does it take the respondent to complete an item and to complete the entire questionnaire? 3. Are the questions asked in a way that will yield the needed information? Has a critical construct or variable been overlooked? Is the variable measured in sufficient detail? 4. Can and will the respondent give the evaluators the data they need? Can the respondent remember the type of information asked for in sufficient detail? If records must be consulted, how easily available are they? Is a question sensitive, objectionable, or threatening such that honest answers may be a major embarrassment or lead to possible punishment? Does the questionnaire adequately motivate the respondent to provide information? 
Mail questionnaires are pretested by means of personal interviews. During the interviews, a wealth of information can be obtained by observing respondents as they complete the form and by debriefing them about the question-answering experience. WHO SHOULD CONDUCT THE PRETEST? ------------------------------------------------------- Chapter 12:1.1 In principle, the pretest should be conducted by a single person knowledgeable about both the pretest procedures and the questionnaire's content, because respondents are more apt to confide in a single individual than in a group. When this is not desirable, both an evaluator and a measurement specialist should be present. The evaluator addresses problems related to question content, and the measurement specialist assesses the questionnaire's overall adequacy as a data collection tool. Usually, the measurement specialist conducts the initial pretest while training the evaluator in observational and debriefing techniques. Such training is essential. After participating in a few sessions, the evaluator may be able to conduct the remaining pretests alone. HOW ARE PRETEST INTERVIEWEES SELECTED AND CONTACTED? ------------------------------------------------------- Chapter 12:1.2 Pretest interviewees should be drawn from the universe being considered for the final study. The interviewees selected for pretesting should represent each of the major subgroups, conditions, and geographical or other units under investigation. The relevance and appropriateness of the questions may differ among these groups. For example, a national study of issues related to poverty should pretest the various groups of the poor in the universe--the elderly who are poor because of sickness, the elderly who are poor because they lack savings, the student poor, the disabled poor, and the welfare poor. Being poor in Maine may be quite different from being poor in Florida, so interviewees should be selected from both states. Pretest subjects need not always be selected randomly. A few people who are not typical of the universe should be interviewed in order to ensure the appropriateness of items for all potential respondents. For example, if the evaluators need to assess child-care arrangements made by employees, it is probably a good idea to test both extremes--a very large family and a family with only one child. Also, to test the questionnaire's readability, some interviewees should be selected whose language skills are somewhat less strong than those of the majority of potential respondents. In principle, enough people should be tested to obtain a statistically valid sample of participants. However, time and staff resources are usually the controlling factors. For the typical questionnaire, between 8 and 12 pretests should be planned. This is merely a guide; sometimes we have had to manage with as few as 6 and at other times we have needed as many as 50. Exploring the particular needs of the survey with a measurement specialist helps determine the number of pretests. The interviewees should be selected because they represent or have knowledge of the range of characteristics or conditions likely to be encountered--young and old, experienced and inexperienced, large and small companies, efficient and inefficient organizations, and so on. 
For example, in order to catch the range of conditions of the different streams of migrant workers as they moved northward, we pretested at the geographical beginning of the northward migration in Florida, Texas, and southern California and also at the middle and northernmost points. If possible, the pretest subjects should be contacted by phone or letter and asked to voluntarily participate. They should be told what the evaluation is about, why pretesting is necessary, what the process consists of, and how long the testing is going to take. However, since they must do the pretest from the respondent orientation of a cold reader, they should not be given the pretest questionnaire in advance. Arrangements should be made to meet with each interviewee at a location as free from distraction as possible and at a time and place convenient for the interviewee. Of course, it sometimes happens that the pretest subjects cannot be contacted by phone. This would probably be the same with migrant workers or people coming through a customs border. In situations like this, volunteers must be recruited on site. Care has to be taken in how a request for pretesting is communicated, because some people react with discomfort to the word "test." This kind of reaction can be allayed if the evaluators explain that the interviewee's comments and criticism are needed to test the questionnaire, not the interviewee. The lack of anonymity in a personal interview may also make the pretest candidate hesitant to participate. The candidate should be told that the information that will be provided will be treated confidentially and will not be included in actual data collection; evaluators are interested only in finding out how well the questionnaire works. HOW IS THE PRETEST CONDUCTED? ------------------------------------------------------- Chapter 12:1.3 Pretesting has three stages: introductory comments, actual completion of the form by the interviewee, and debriefing. INTRODUCTORY COMMENTS ----------------------------------------------------- Chapter 12:1.3.1 The following points should be mentioned in the telephone contact and covered briefly again at the beginning of the pretest session. 
The evaluators should (1) state the role of the data collector; (2) state the role of the person administering the pretest; (3) state the purpose of the evaluation and the questionnaire and discuss the population to whom it will be sent; (4) indicate the importance of the evaluation and the value of the interviewee's help in perfecting the questionnaire; (5) remind the interviewee that responses are confidential; (6) explain that pretesting involves the interviewee's completion of the form and will be followed by a short debriefing session to review the interviewee's comments, suggestions, and criticisms, and that the interviewee will be given the same materials that would be received by mail, including a transmittal letter and the questionnaire form; (7) state that the questionnaire should be completed as if it had been received by mail and no one else were present, mention that instructions on the form explain how to complete it, and tell the interviewee to stop and ask for assistance if he or she cannot proceed without further explanation (interviewees should also be encouraged to note on the form any problems or ideas that arise as the questionnaire is being completed); (8) provide some examples of the type of item flaws or other problems the evaluators want the interviewee to look for (for example, an item may ask for dollar amounts by calendar year when amounts are available only for the fiscal year; an item may ask for the number of patients deinstitutionalized during a specific year when the institution's figures count every discharge, including patients who entered and left more than once during the year; the list of options may fail to include a critical component; the interviewee may not be sure of a particular response but no category such as "Not sure" has been provided; or a skip in the instructions may be confusing); (9) tell the interviewee that the evaluators will be following the sequence of questions on their own copy of the form in order to monitor the flow of questions, thus addressing any potential concern the interviewee may have upon noticing that the evaluators are entering information on their form; (10) state that frank and honest answers are appreciated and thank the interviewee for assistance; and (11) conduct the pretest as a one-on-one session and not in a group. If the respondents suggest a group session, explain why this is not a good idea: in reality, the questionnaire would be read by a single individual working alone, the group interaction will influence everyone's understanding of the questions, some respondents are less likely to confide, and, finally, the pretester cannot handle more than one respondent at a time. There are exceptions, but we will discuss them in a subsequent part of this paper. COMPLETING THE QUESTIONNAIRE ----------------------------------------------------- Chapter 12:1.3.2 The pretest administrator should carry out six tasks while the interviewee completes the form. 1. Recording the time it takes to complete each item. At the beginning of the pretest, the evaluators should position themselves so they have a clear view of the interviewee's questionnaire and face and as much of the body as possible. The start time should be recorded at the top of the evaluators' form. As the interviewee works, the evaluators should count silently the number of seconds it takes the interviewee to read the instructions or complete a question, and this time should be recorded next to the relevant section on their copy of the form.
Evaluators should try to be unobtrusive. If the interviewee asks a question or the test is otherwise interrupted, the time taken out for the relevant item should be noted. Timing is obtained for two reasons: first, the average time it takes all interviewees to complete an item serves as an index to the difficulty of items and, second, the average time it takes to complete the entire questionnaire serves as an index of respondent effort or burden. 2. Talking through. Some respondents feel comfortable talking out loud while answering. When they do, they should be allowed to do so, and their verbalizations should be noted. They should be asked to say what is going through their minds while answering. However, many feel uncomfortable with talking through. Hence, this approach should not be used unless it feels natural. 3. Recording questions asked and clarifications made. When the interviewee asks a question, the evaluators should record key words or verbatim text as well as their own response next to the relevant item. These comments are used as an aid in debriefing and in item rewriting. Interviewees who are confused about what a question means should be provided a straightforward answer. Probing to see what the problem was should be done during debriefing rather than during the test. Evaluators should pay particular attention to how they answer any questions the interviewee raises, and they should be careful when providing explanations or alternative wording. In deviating from the prescribed text, evaluators may rephrase questions and bias the interviewee toward a particular response. However, if the interviewee is insistent and comfortable with discussing each question in turn and is giving good observations, the dialogue should be allowed to flow. Some people may recall with better insight if they verbalize their reactions as they go. 4. Noting nonverbal behavior. Evaluators should record any nonverbal behavior and body language that coincide with particular questions. Such behavior as hesitance in responding, facial expressions, rereading questions, turning pages, and nervous movements (foot-tapping, fidgeting, and the like) may indicate item-design faults, question difficulty, or lack of relevance. Nonverbal observations are very important because they can be used as signals for questions that should be asked during debriefing. Methods for taking these observations will be discussed more extensively in the section on debriefing. 5. Noting whether instructions and format were easy to follow. Question instructions and format vary from item to item. Evaluators should notice how smoothly and quickly the interviewee reads directions and moves from one item to another. Did the interviewee ask questions about the instructions or the directions for filter questions? Could the interviewee follow the "skip to" or "go to" instructions with ease? 6. Noting erasures, uncompleted items, errors, and inconsistencies. These types of responses may indicate questionnaire design flaws. Evaluators can pick these up as they review the interviewee's questionnaire before debriefing.
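The timing records described in task 1 lend themselves to a simple tabulation once several pretests have been completed. The sketch below is a minimal illustration, not a GAO-prescribed tool; the data layout and function names are hypothetical. It computes the average number of seconds per item across interviewees (the index of item difficulty mentioned above) and the average total completion time (the index of respondent effort or burden).

  # Hypothetical per-item timings in seconds, one record per pretest interviewee.
  pretest_times = [
      {"Q1": 20, "Q2": 95, "Q3": 40},
      {"Q1": 25, "Q2": 110, "Q3": 35},
      {"Q1": 18, "Q2": 80, "Q3": 50},
  ]

  def average_item_times(timings):
      """Return the mean completion time for each item across interviewees."""
      items = sorted({item for record in timings for item in record})
      return {item: sum(record.get(item, 0) for record in timings) / len(timings)
              for item in items}

  def average_total_time(timings):
      """Return the mean total completion time, an index of respondent burden."""
      return sum(sum(record.values()) for record in timings) / len(timings)

  for item, seconds in average_item_times(pretest_times).items():
      print(item, round(seconds, 1))  # unusually long averages flag difficult items
  print("total:", round(average_total_time(pretest_times), 1))

Items whose average time stands well above the rest, or a total that exceeds what the transmittal letter promised, are natural starting points for the debriefing probes described next.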
A debriefing should begin with a statement of its purpose, telling the interviewee that evaluators will be drawing on the interviewee's experiences and judgments to ensure that the intent of each item is clearly conveyed, evaluate the relevancy of items, and identify item-design deficiencies. The interviewee's questionnaires should be reviewed in detail, and feedback to the evaluators' probing should be obtained. The major problems to look for are improper question format, inappropriate questions, improperly qualified questions, inappropriate language, failure to present an inclusive range of mutually exclusive alternatives, complex questions, unclear questions, question bias, and improper scales. In discussing questionnaire items, GAO evaluators usually use the following sequence: (1) uncompleted items; (2) obvious errors and inconsistencies; (3) erasures; (4) items that took a long time to answer or appeared to cause difficulty; (5) items that took an unexpectedly short time to answer, possibly indicating that the interviewee missed certain key considerations; (6) questions the interviewee says caused uncertainty, undue deliberation, or difficulty; and (7) all other items not yet discussed. Alternatively, the sequence within the questionnaire may be followed. The evaluators' approach in debriefing should be nondirective. They should try to elicit the interviewee's comments, problems, and reactions to the questionnaire without leading. They should use general comments to get the interviewee to reconstruct the questionnaire experience. For example, the interviewee's answers or the evaluators' observations of behavior can be used as a take-off point: "You didn't answer . . . ," "You took a long time . . . ," "I noticed you seemed puzzled . . . ," or "Tell me what you had in mind when . . . ." Then the interviewee should be allowed to tell the reasons behind the behavior. Some areas may need a more direct approach. If "don't know" is the answer supplied, evaluators can probe to see whether the interviewee is being evasive. If evaluators believe the interviewee has an answer, they can push a little but not so much that a true "don't know" becomes a bad response. During the observation period and debriefing, the evaluators should be very observant of the interviewee's paralanguage--that is, the vocal and facial expression, gestures, and body language used to modify speech. The evaluators should be careful about their own paralanguage so as not to send out conflicting messages or to send a message by this medium that may reinforce, encourage, extinguish, or inhibit the interviewee's comments. Instead, the message sent through an open and attentive posture, interested and pleasant facial expressions, soft, encouraging, motivating, and responsive voice should be that of a responsive person very interested in what the interviewee has to say. The evaluators' posture should be open, facing the interviewee with a slight forward lean and attentive demeanor. They should sit to the side of the interviewee if possible. This signals team work and cooperation rather than competition while allowing the evaluators to see the interviewee's whole body. Before starting, the evaluators should assess the way the interviewee has arranged his or her space, and they should try to position themselves in accordance with the setting, avoiding an invasion of the interviewee's space. 
The evaluators should be seated at a comfortable conversational distance of 2-1/2 to 3 feet but close enough to observe the interviewee as he or she completes the questionnaire. Eye movement is very important because that is the primary way of controlling the debriefing. When people converse, they tell each other when they want to talk or not talk and when they want others to talk or not talk by eye movement. Evaluators must learn to use this language. For example, looking at an interviewee is a signal that you want him or her to talk or keep talking. Looking away, occasionally offering slight gestures, stutter-stop interruptions, and throat clearing are signals to stop. A trailing voice, long pauses, silence and a head nod, and increased eye contact tell the listener that the speaker is finished with the topic and does not want to talk anymore. An increased rate, louder voice, filled pauses, halting gestures, and reduced eye contact tell the listener that the speaker is not finished and wants to keep talking. Skillful use of these signs will allow the evaluators to manage the interview, to get the interviewee to talk, and to avoid pushing him or her beyond his or her own knowledge limit. Interviewees use paralanguage cues knowingly or unknowingly to tell interviewers what they think of the questionnaire, and these cues can often serve as a basis for conducting further probes. Some examples are speaking with variety in pitch and intensity, making pauses shorter than usual, speaking at an increased rate, and opening eyes wide to indicate involvement with the topic or certainty of the message. Hesitating speech, narrow eyes, longer pauses, many pauses, shrugs, slight side-to-side or up-and-down palm-down hand gestures, raised fingers or palms, furrowed brow, and a half smile or frown may indicate uncertainty about the information interviewees are providing or about their comprehension of a question. Slowly spoken, carefully enunciated, low-pitched speech with hesitations can signal caution. Carefully enunciated speech without hesitation but with increased rate and intensity, narrow eyes, unraised eyelids, and knitted eyebrows may be signs of annoyance, difficulty, or dissatisfaction. Narrow pupils, raised eyelids, raised eyebrows, and frowns may show unpleasant surprise, while wide pupils and a smile can denote pleasant surprise. Extended looks or gazes to the side or at the ceiling often indicate that interviewees are thinking through the information, but if the gazes turn to blank stares, their interest has been lost or they are bored. Vague answers, shrugs, don't know signs, reduced head nods, reduced eye contact, and nervous twitches in the hands or feet may indicate deception. Looking at the interviewer's forehead may indicate a question. An increase in intensity and rate interrupted by an unexpected pause often signals that the next thing the interviewee says is very important. It is very important for interviewers to be aware of and observant of this paralanguage because perhaps as much as half of the communication that takes place between an interviewer and an interviewee uses this medium. These observations generally apply to American culture and sometimes do not apply to other cultures or to individuals with certain disabilities. For example, in some non-American cultures, looking at an important speaker face-to-face is a sign of disrespect. There are also paralanguage variations particular to each of the ethnic American groups. 
Some have closer or more distant conversation spaces or look at their conversational partners somewhat more or less frequently than we described. However, these are differences in degree, and these guidelines will work for most situations for most American ethnic cultures. Observing the interviewee completing the questionnaire with no direct queries before the debriefing has the advantage of allowing for the assessment of contextual cues in a more realistic situation than other methods and for the use of aided recall without leading or biasing, but this method does have certain disadvantages. The interviewee sometimes loses spontaneity or forgets initial observations or first impressions. Also interviewees who cannot articulate a rational explanation for a feeling or perception may make one up. Therefore, if the interviewees really want to talk about each question as they read, evaluators should permit them to do so. It is also helpful to do at least one or two pretests using a "talking through" approach. When the debriefing has been completed, interviewees should be thanked for helping to perfect the questionnaire. As soon as possible, the evaluators' comments and observations about the pretest should be recorded. STANDARDIZED PRETESTS ------------------------------------------------------- Chapter 12:1.4 Except for the flexibility granted to the interviewer to probe the subject according to his or her pretest observations, the pretest protocol should be standardized as much as possible. The main reason for standardization is to promote a sufficient number of replications to evaluate the pretest findings. However, in certain circumstances, it may be more efficient to revise the pretest instrument in midstream. This usually happens in the following situations: (1) when the design errors are so obvious that there is little doubt about how to make the correction; (2) when the initial instrument is far off the mark (an example is when more than one third of the questions need major revision); (3) when the corrections are so difficult to make that the question writer is not certain as to whether he or she has fixed the problem. When these situations occur, it is better to revise the instrument and begin a second round of pretests. EXPERT REVIEW --------------------------------------------------------- Chapter 12:2 Because GAO's studies are wide ranging, we frequently need to seek outside comments on the questionnaire approach. The purpose of this expert review is twofold. First, we want to determine whether the questions and the manner in which we ask them are adequate for addressing the larger questions posed by the evaluation. Second, we want to find out whether the target population for the survey has the knowledge to answer the questions. In many instances, the agency officials whose program is under review can help provide this information. People who provide expert reviews do not act as pretest interviewees; they do not answer the questions but provide a critique. Only on rare occasions does a reviewer serve as a pretest subject, too. The expert must have a thorough knowledge of the target population. For example, in a study of the Foreign Corrupt Practices Act, a former head of the Securities and Exchange Commission served as an expert. In a survey on indirect costs of research grants, we sought the help of the president of the National Association of College Business Officers, because most research grants are administered by members of this society. 
VALIDATION AND VERIFICATION --------------------------------------------------------- Chapter 12:3 Validation is an effort to ensure that the questionnaire is actually measuring the variables it was designed to measure. Validation is important because if the questions are not valid measures of the constructs we are studying, even answers verified as accurate will not provide us with the quality data needed for our findings, conclusions, and recommendations. Verification is a way of checking or testing questionnaire answers with records or direct observation to reduce the risk of using data that are inaccurate. Verification is different from validation. For example, suppose we are interested in the quality of health care and propose the number of visits to a doctor as an indicator. To validate, we would have to show that the number of visits could be taken as a measure of the quality of health care. And in proving this, we are likely to find that this indicator is valid only under certain conditions. However, if we wished to check the accuracy of the patients' self-reports as an estimate of the number of doctor visits, we might compare this estimate with physicians' records. In doing this comparison, we are testing the soundness of self-reports only as a measure of visits (verification), not as a measure of the quality of service (validation). Verification tells us whether the subjects' self-reports can be trusted as an accurate measure but not necessarily as a valid one. Verification is ideally conducted by testing a population sample. Since this is not always practical, GAO often shows that other comparable studies had similar findings or cross-checks for internal consistency. Corroboration (referred to as validation in some circumstances) of questionnaire results against similar information from another, independent source can also provide supporting evidence to increase confidence in the relative accuracy of questionnaire data. The reliability of questionnaire results tests whether a question always gets the same results when repeated under similar conditions. Answers can be highly reliable without being either verified or valid. Why do evaluators have to validate, verify, corroborate, and make reliability checks? GAO has to do much of this work because most of the time it cannot use "standardized" instruments--those that have already been tested during their development. We are either measuring things that have not been measured before or measuring previously measured things under different circumstances. Since we most often do our own instrument development work, these essential attributes are discussed in more detail below. VALIDATION OF THE QUESTIONNAIRE ------------------------------------------------------- Chapter 12:3.1 To validate, we show that the observation measures what it is supposed to measure. The best way to demonstrate validity is to demonstrate the relationship between the measurement and the construct being measured in a setting as controlled as possible. This is called "construct validation." For example, we wanted to use the time it took to complete questionnaire items as a measure for the construct "item difficulty." To validate this, we deliberately constructed sets of items that varied in difficulty by changing the reading levels, the concepts, the memory requirements, the decisions, and the operations until we had developed a set of items that spanned the range from easy to extremely difficult.
Then we administered this test to a number of people under controlled conditions. We measured the time to complete the item, the number of mistakes (another possible measure of difficulty), and the respondent's ratings of the difficulty of the items. As the difficulty of the items increased, so did the mistakes, the respondent's ratings of difficulty, and the response times. We concluded that the time it took to complete an item could be taken as a valid measure of the item's difficulty. In another study, evaluators used supervisory ratings as a measure of employee performance. To validate this, the evaluators compared the supervisors' ratings of employees with employee performance test scores. These performance tests were conducted independently of the supervisory rating. Few measures are completely valid; the more rigorous and varied the validity tests are, the stronger is the case that can be made for a measure. There are a number of other ways to test validity. Although most of them are less convincing than construct validation, they are easier to apply. But no validity assessment is perfect, and no single method is best suited for all situations. A very practical method of assessing validity is to use "content validity." In this approach, evaluators might ask experts to make sure that the measure includes the content they want to measure. For example, in a study of the Financial Integrity Act, several measures of financial integrity were proposed: time since audit, number of audits, amount of cash, cash controls, ease of access to cash, number of people with access to cash, and so on. Financial accounting experts reviewed the measures and concluded that they would be valid indicators of financial integrity. Prediction is also used to assess validity. For example, in one study, we developed an instrument that would measure the restrictiveness of zoning laws and practices. We validated the measure, in part, by showing that the restrictiveness score was correlated with land-use patterns. Criterion comparisons are also used. For example, if a new test is supposed to measure intelligence, then the people who take it ought to get similar scores on the Stanford-Binet IQ test (a time-honored and extensively validated test). Validity can be tested by looking at the relationships between factors that should be positively correlated or negatively correlated. For example, measures of the quality of training ought to correlate positively with productivity. If they do, we have some confidence in the validity of the measures. The measure of a participative management style ought to correlate inversely with a measure of an authoritative management style. If it does, confidence in the validity of the measure is strengthened. Although the rigor and pluralism of methods that are used determine the credibility of a claim for validity, resources are often limited. We tend to validate most often when the measures are complicated and abstract, or unproven, or critical to the study findings and likely to be challenged. VERIFICATION ------------------------------------------------------- Chapter 12:3.2 Our measures must provide accurate data. We test for this precision by comparing the data against an accurate source, by putting in controls that reduce observation errors, or by repeating the measurement process. This practice is often called verification or corroboration. Determining how much verification should be done to ensure the quality of data obtained through questionnaires is a management judgment. 
The extent of verification should be based on the type of data, its use as evidence to address the assignment's objectives, the relative risk of it being erroneous, and alternatives available to verify data, including time and resource constraints. Opinions and attitudinal data, on the one hand, are testimonial evidence and could ideally be verified by checking the consistency of the answers with actual experiences and behavior. However, this is not often easily done and may not be necessary since the data are presented only as opinion. Factual data, on the other hand, can be verified through observation, cross checked with other witnesses, or checked against records. The most convincing method of verification is to compare on a test basis the respondent's answers with evidence developed from an "on-site inspection" that involves direct observation or a review of records. Such verifications are ideally conducted on a statistical sample of the respondent population. Practically, a judgment sample considered typical of the population is often used. In addition to on-site verification, or when such verification is not practical, the following types of steps can be taken to raise the evaluators' level of confidence in the reasonable accuracy of the data: Ask respondents to send a copy of specific records when they return the completed questionnaire. Telephone and obtain clarification from respondents who provided important data that seemed out of line when compared to the data provided by similar respondents. Telephone a random sample of respondents and attempt to ascertain the extent to which they consulted appropriate records to obtain the most significant factual data provided in their responses. Corroborate or verify through other data bases, records, or prior reports. Corroborate the questionnaire results by comparing them to the results of similar studies or having them reviewed by outside experts knowledgeable about the program or topic. Cross check aggregate statistics from the questionnaire against data reported by other organizations. Include consistency checks in the questionnaire by asking for the same or similar information in more than one question. Another aspect of verification is checking the accuracy of keyed data by comparing the keyed records with the original source records. Data entry operators verify by keying in the source document twice and check to see if they get identical answers each time. However, GAO sometimes also uses controls to verify the accuracy of the data entry such as checking for illegal codes or out-of-range values. Initial plans for verification should be part of the data collection and analysis plan that is completed during the design phase. The type and amount of verification should be appropriate for ensuring that the evaluators will have sound evidence to address the assignment's objectives. The initial plans may need to be modified when the questionnaire is pretested, when the questionnaires are returned and the responses are being analyzed, or whenever there is reason to doubt the accuracy of the questionnaire results. TESTING RELIABILITY ------------------------------------------------------- Chapter 12:3.3 "Reliability" refers to the consistency of measures. That is, a reliable measure is one that, used repeatedly in order to make observations, produces consistent results. Testing reliability is difficult, and expensive, because the evaluators have to either replicate the data collection or return to those who were questioned before. 
People do not like to be retested. Because of this, GAO often does not test reliability if we have good reason to believe our measures are stable. If we cannot make this assumption with a high degree of certainty or if we are likely to be challenged on this issue, we should test this assumption. Some situations in which the reliability testing of the questionnaire should be conducted follow. First, if the respondents as a group lack motivation or interest, they may not invest much care or thought in the questionnaire and their answers may vary randomly over time. Second, if respondents are expected to purposely exaggerate, retesting sometimes brings a more sober reconsideration. Third, for some topics, asking respondents to complete the questionnaire at home may produce different results from having them fill it out in another setting. For example, a questionnaire on military reserve training completed at home produced different answers than one completed while reservists were at summer training with their units. Fourth, there is a tendency when most respondents take an extreme position for extreme values to drift toward the norm when the measures are repeated at different times. It is important to note that the procedures for testing the reliability of answers are different from those for verifying answers. When information is verified, evaluators usually go to a different source for the same information or use a different technique on the same source, such as observations or in-depth interviews. To test reliability, evaluators have to administer the same test to the same source. ANALYSIS OF QUESTIONNAIRE NONRESPONSES --------------------------------------------------------- Chapter 12:4 Item and questionnaire nonresponses also must be analyzed because high or disproportionate nonresponse rates can threaten the credibility and generalizability of the findings. Suppose only half the people respond. Nothing is known about the other half. In particular, the reason they did not respond may be related to important differences between them and the responding group. Usually, if the nonresponse rate is small, we can make plausible assumptions that discount the potential effect of the nonrespondents. However, even in this case we should use whatever prior information we have to check for systematic differences between respondents and nonrespondents. In mail surveys, it is rare to get an answer from every questionnaire recipient. Some people in the sample may never have received the questionnaire. Some who did will choose not to answer. Thus, the original sample of recipients can be expected to shrink somewhat. The real problem is not so much the decreased sample but whether those who chose not to answer had disproportionately different views from those who did. For example, most of those who did not respond might have been opposed to something favored by those who did. We would then mistakenly believe in the generalizability of our sample responses, unless we investigated the reasons for nonresponse. This would threaten the representativeness of the sample and the ability to generalize from the sample to the population. In GAO, we account for all questionnaires mailed or interviews attempted in our workpapers and in our products. This includes the number of questionnaires returned or interviews completed, the number of intended respondents who refused, the number of questionnaires that were undelivered or interviews that could not be conducted, and so on. 
We calculate a response rate that is the percentage of eligible study cases drawn from the sample or population list that provide usable data. We also obtain and analyze information about all nonresponse groups to determine how they differ from those who did respond. (The current policy guidance on accounting for survey responses is included in chapter 10.5 of GAO's General Policies/Procedures Manual.) In order to make plausible generalizations, the effective response rate should usually be at least 75 percent for each variable measure--a goal used by most practitioners. By effective response rate, we mean the percentage of people who return the questionnaire minus the percentage of people who failed to answer for the variable in question. Small to moderate differences between the respondent and nonrespondent populations will then usually have little or no bias effect on the results. Transmittal letters that convey the relevance and importance of the questionnaire and systematic follow-ups help bring high response rates. Nevertheless, the nonrespondent population should be analyzed unless the response rate is over 95 percent. A comparison of respondents and nonrespondents with regard to demographic and other important characteristics can reveal whether or not nonresponse occurred systematically (for example, in a particular region or other segment of the questionnaire group). In a survey of employees who were subject to an agency's reduction in force, we found a high nonresponse rate in the Atlanta region. In another survey on block grants, all respondents whose last names began with "U" were missing. In both surveys, the mailgram contractor had neglected to send out follow-up notices. This could have resulted in misrepresentation of the respondents' views, insofar as the groups that were excluded differed from those that were included. Aside from reflecting mailing mistakes, the nonresponse rate may reflect certain conditions or respondent attributes. In a study of zoning and group homes, we analyzed responses to see whether people from states with unfavorable zoning laws did not respond. We also compared response rates for the types of population that facilities served (for example, the mentally retarded or emotionally ill). The workpapers should document the analysis of the composition of the nonrespondents, indicate the number and type of categories excluded from the expected population or sample, and document attempts to verify or trace the correct addresses of those who could not be reached by mail. If a nonresponse bias is detected, and we can make assumptions about the nonrespondent population, the survey results should be adjusted. For example, if a disproportionate number of nonrespondents are from California and we can assume that they are no different from the California respondents but we find that the people from California respond very differently from people in the rest of the nation, we should weight the California responses to account for this underreporting bias. If the response rate is lower than 75 percent and the standard follow-up procedures have been followed, it may be necessary to telephone or interview a random sample of nonrespondents to obtain answers to key questions or to find out why they did not complete the form. This information is important for two reasons: it brings more confidence to the evaluator about the meaningfulness and systematic nature of the nonresponses, and it helps assess the data that were returned. 
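The effective response rate check described above is simple arithmetic. The following minimal sketch in Python computes it for each item and flags items below the 75-percent goal; the counts are invented for illustration and are not from any GAO survey.

    # Effective response rate per item: the percentage of eligible cases that
    # returned the questionnaire minus the percentage that skipped the item.
    eligible = 400                    # eligible cases in the sample
    returned = 340                    # usable questionnaires returned

    item_nonresponse = {              # returned questionnaires with the item blank
        "Question 1": 5,
        "Question 2": 60,
        "Question 3": 12,
    }

    return_rate = 100 * returned / eligible
    for item, blanks in item_nonresponse.items():
        effective = return_rate - 100 * blanks / eligible
        note = "  <-- below the 75-percent goal" if effective < 75 else ""
        print(f"{item}: effective response rate {effective:.1f} percent{note}")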
A discussion of the nonresponses should be included in the workpapers and in the discussion of methodology. In addition to the people who do not return the questionnaire, some proportion of the people who do respond will not complete some items. Thus, the average nonresponse rate should also be calculated for each item in order to determine whether the data from an item can be included in the analyses. Item nonresponse rates average about 3 percent. If the rate is more than about 7 percent, it should be analyzed to determine if the item presented a threat to respondents, was not perceived as relevant to the questionnaire focus, or contained design flaws or other factors that caused the low response rate. If the nonresponse rate is uncharacteristically large and, consequently, the item is excluded from our analysis, the final report should disclose this. Again, the item nonresponse analyses should be included in the workpapers and the discussion of methodology. DESIGNING THE QUESTIONNAIRE GRAPHICS AND LAYOUT =========================================================== Chapter 13 A questionnaire should be easy to read, attractive, and interesting. Good graphics design and layout can catch the respondent's attention, counteract negative impressions, cut the respondent's time in half, and reduce completion errors. If the design format works, respondents will feel they have received an important document outlining a reasonable request on which they should act. The front page of a GAO questionnaire has a title, instructions, and logo or seal. The text of the instructions should have two columns to promote ease of reading. At the normal reading distance, the eye cannot span much more than 4 inches without refocusing, and most people cannot immediately take in more than seven to nine words in a single glance. A string of seven to nine words with the type size GAO usually uses (10-point type) usually takes up 3-1/2 inches. Furthermore, the two-column format gives the page a formal and patterned look. To reduce bulk, both sides of a page are often printed. Usually, the pages are stapled in the upper left corner to look more like a letter and better suit the mail-out package. Booklets are used when a sturdier construction is needed or when the respondent has to refer back and forth to related questions. The questionnaire may or may not have a cover. INSTRUCTIONS --------------------------------------------------------- Chapter 13:1 The first part of the questionnaire should present the introduction and instructions. Because the transmittal letter is frequently separate from the questionnaire, instructions should repeat some of the material in the transmittal letter. The instructions should state the purpose of the survey; explain who the data collector is, the basis of its authority, and why it is conducting the survey; tell how and why the respondents were selected; explain why their answers are important; tell how to complete the form; provide mail-back instructions; list the person to call if help is needed to complete the form; provide assurances of confidentiality and anonymity when appropriate; tell how long it will typically take to complete the form; explain how the data will be used; explain who will have access to the information; disclose uses that may affect the respondents; and present the response efforts as a favor and thank the respondents for their cooperation. The instructions should be concise, courteous, and businesslike. 
QUESTIONNAIRE FORMAT PREPARATION --------------------------------------------------------- Chapter 13:2 Most GAO questionnaires and most pretests are reproduced from texts prepared on word processors. Computer programs such as QUEST, WordPerfect, and other desk-top publishing packages convert this to typographic text suitable for publishing. Sometimes the text is typed directly in publishable form. These texts are almost as attractive and readable as texts prepared by commercial printers, and they are quicker and cheaper to produce. However, an attractive, readable, and business-like style and type should be used. Documents that look official, professional, and inviting are likely to be answered. Good layout and composition can cut reading time in half and can reduce the respondent's burden. This is particularly important when the respondent group has low literacy, the questionnaire is very long and complex, a large population is being surveyed, a prestigious group is being addressed, or the data collector's professional image is very important. TYPOGRAPHIC STYLE --------------------------------------------------------- Chapter 13:3 The size, style, and density of type are signposts to guide the respondent's eye and to signal the kind of information being presented. An example is in figure 13.1. Figure 13.1: Partial Questionnaire (See figure in printed edition.) The title is the most noticeable feature on the questionnaire's front page. It should be a short statement (12 words or less) that identifies the population from which information is sought and gives a clear idea of what the questionnaire is about. Because of its importance, it should be printed in large type (for example, 14 point in bold). GAO uses Universal, or a similar typeface, because it is official-looking and easy to read in bold capital letters. (Usually, capital letters are much more difficult to read than lowercase letters.) Another feature of the title page in GAO questionnaires is GAO's logo or seal and its name. Here, we use 12-point Universal demi-bold because it looks official and businesslike without being pretentious. The headings and subheadings, which attract the respondent's eye next, are short phrases that tell what each part of the questionnaire is about. They stand out in 12-point Universal bold and 11-point Universal demi-bold or similar typefaces. Most of the questionnaire is text containing the instructions, questions, and answer spaces. Here, GAO usually uses 9-point or 10-point Times Roman, Baskerville, Press Roman, or similar type. These are clear, simple, easy-to-read, official-looking typefaces, with good height-to-width ratios, and the 9-point or 10-point size is large enough to read easily yet small enough to keep the questionnaire from getting too bulky. Once respondents begin to answer the questions, they see the response instructions. These are short texts, usually in parentheses, that tell how to answer--for example, "(Check one.)." Response instructions are usually in an italicized version of the typeface used for the text and are the same size. Like the response instructions, fill-in-the-blank instructions are in italics and parentheses. After answering a question, the respondent is frequently directed to another part of the questionnaire by instructions to "skip" or "go to question . . . ." These are usually in 9-point or 10-point bold type.
The bold type emphasizes the skip instructions and this helps reduce errors. Occasionally, bold type is used to emphasize a key point in a question or text, such as an important qualifier that might be overlooked. GAO prefers bold rather than underlining because underlining stops eye movement and slows the respondent down. Next comes the response space--little boxes to check; a row, column, or matrix box to fill in; or sometimes a line for the respondent to write in information. All little boxes for single-response alternatives are left-justified or aligned to the left of the response. The use of square boxes yields fewer errors than circles or other shapes, as does the left justification over right justification, unless a row, column, or matrix format is used. Rows or columns or column-row matrixes are justified to the right, so that they line up with the row and column headings. Boxlines are used instead of leaders because they guide the eye better. All line work should be a half point or 1 point in width. The page looks too dense if the lines are much thicker. The row headings are in the same type as the text. Sometimes the column headings are in Gothic or similar type. Such typeface can be squeezed more than most others without destroying the letter symmetry of the word and without running the letters together, therefore not interfering with readability. Gothic typeface also reads well for very short passages, but it does not work as well for long passages. All questions and response alternatives are numbered rather than lettered. These numbers double as codes for information field identifiers for use in data reduction. Tiny numbers in parentheses to the right of the questions tell the keypunch operator what column to punch in tabulating responses. These column codes are in 5-point or 6-point Gothic italics or similar type. They are not big enough to distract the respondent nor are they too small for the keypunch operator to read. Shading is used to fill in space that the respondent might confuse with response space. The shading prevents respondents from writing in the space. A row of light shading can also be used to separate rows of text on a long horizontal layout or to guide the respondent across the page. The form design also makes use of white space. Leaving good margins, top and bottom space, and space between the text columns reduces the clutter, separates key parts of the questionnaire, and makes it look more inviting. Questionnaire designers should try to give the respondent as much white space as possible without expanding the number of pages. PREPARING THE MAIL-OUT PACKAGE AND COLLECTING AND REDUCING THE DATA =========================================================== Chapter 14 In addition to developing the questionnaire itself, GAO evaluators generally complete several other tasks, as summarized below: develop a computerized mailing list, a cover letter, and other mail-out materials and assemble the mail-out package; monitor and edit the returns and conduct follow-ups; key in the responses, verify the computer file, and develop the data base. PREPARATION OF THE MAIL-OUT PACKAGE --------------------------------------------------------- Chapter 14:1 Before the questionnaire is mailed to potential respondents, a computerized address file has to be developed, a cover letter has to be prepared, and other materials (such as return envelopes) have to be assembled for the mail-out package. 
ADDRESS FILES ------------------------------------------------------- Chapter 14:1.1 Concurrent with designing and testing the questionnaire, evaluators should select the population sample cases for the survey. (See chapter 3.) It is usually a good idea to send or distribute the packages directly to an individual rather than to rely on intermediaries. Transmittals that rely on intermediaries usually do not work well, and when they go wrong, the survey loses credibility because control of the sample has been lost. In one instance, we gave questionnaires to Veterans Administration hospital administrators to distribute to the staff, and in another we gave them to union leaders to give to their members. Both distributions were incomplete, and both surveys had to be discounted because of poor response rates and uncontrolled sample selection. It is normal to begin with a hard-copy list of addresses. This list should be reviewed, and careful attention should be paid to the following matters to ensure that it is current, complete, and accurate: spelling and capitalization, titles (Dr., Ms., Mr.), job titles (as appropriate), street addresses with room numbers and apartment numbers (as appropriate), and city, state, and zip code. The revised hard-copy list must now be put into a computerized file. This can be done in several ways. For example, the list can be keyed on tape or disk and entered into the appropriate computer system. The list can also be typed on a word-processing system disk and then transferred to the system, or it can be typed directly into a system file from a remote terminal. Once the file is in the system, a hard-copy list can be prepared, reviewed, corrected, and case numbered. The address file is in this format:

Mr. John Doe
226 Main St.
Middletown, NY 00000

At this point, a hard-copy log with case numbers should be printed for use in controlling mailed and returned questionnaires. TRANSMITTAL LETTER ------------------------------------------------------- Chapter 14:1.2 Because respondents see the cover letter first, their decision to participate in the survey is often made on the basis of the letter's strength. Therefore, the letter should follow these guidelines, which have been found to increase the likelihood of a reply:

1. Design the mail-out package so the letter is seen first.
2. Have the letter neatly typed to look like a personal, individualized communication rather than printed or xeroxed.
3. Use an official-looking format and style of writing but avoid being impersonal, ambiguous, or unclear.
4. Address the letters to each individual.
5. Explain what GAO is and why it has a legitimate and purposeful role in collecting these data.
6. Without being pretentious, explain that GAO is an important agency working for the Congress.
7. State the purpose of the project.
8. Stress the importance of the project.
9. Relate the project to the respondent.
10. Stress the importance of the answers and the study to the respondent and the nation. If possible, make references to possible benefits to respondents.
11. Tell how and why the respondent was selected.
12. State that the questionnaire can be answered easily and in a short time. Tell truthfully how long it should take to complete the questionnaire.
13. Emphasize the importance of replies from everyone sampled.
14. Ask a favor.
15. When necessary, ensure anonymity or confidentiality and no uses other than those stated.
16. Ask for honest and frank answers.
17. Urge prompt responses.
18. Alert the respondent that there will be a follow-up for those who do not reply.
19. Mention the possibility of a verifying personal interview when appropriate.
20. Provide a name and a phone number in case the respondent needs assistance in completing the form.
21. Express appreciation for the respondent's assistance.
22. Have the letter signed by hand in blue ink by the person with the highest appropriate responsibility. If many letters are to be sent out, have several clerks sign them.
23. Send the package by first-class mail. (The return envelope should also be for first class.)

The pledge of confidentiality is worthy of further discussion. In GAO, we use pledges of confidentiality only when it is essential for meeting the assignment objectives and the data cannot be obtained in another way. We use pledges that individual and organizational names will not be released and that responses will generally be reported in an aggregate form to help increase the response rate and the truthfulness and candor of the respondents. (Before a pledge of confidentiality is used, a written justification is prepared and approved by the assistant comptroller general of the division.) For work being done for the Congress, GAO's pledge is approved in writing by each requester. When GAO's pledges are given, the link between individuals and their responses may be destroyed after all analysis, referencing, and supervisory reviews have been completed. (If a follow-on review is anticipated, it may be necessary to retain the linkage.) The current policy guidance on pledges is included in chapter 6 of GAO's General Policies/Procedures Manual. Once the transmittal letter has been written, edited, reviewed, and revised, it is ready to be typed into the computer system as a separate file. GAO can run a computer program to produce the transmittal letters by merging each address in the address file with the transmittal letter file. At this point, the letters are ready for signature. OTHER MAIL-OUT MATERIALS ------------------------------------------------------- Chapter 14:1.3 The following materials should be prepared and printed (by printing services) to complete the mail-out package:

1. Preaddressed, postage-paid return envelopes are used to return the questionnaires and are usually addressed to an individual on the project team.
2. Preaddressed, postage-paid postcards for respondents to indicate that they have returned the questionnaire separately are used when the respondents are to remain anonymous to GAO. They tell GAO that the respondents have sent in their questionnaire so we do not follow up on them.
3. Business letter envelopes can be used if the questionnaire is six pages or less. Window envelopes are sometimes used to avoid labels. Large questionnaires and booklets require large envelopes and mailing labels.
4. Occasionally, letters of endorsement from influential people are included in the mail-out package if it is believed they will increase response rates or result in more complete and honest answers. For example, a survey of Navy contractors might be enhanced by including a letter of endorsement from the admiral in charge of contracts or another senior Navy official.

Once all the materials have been gathered together, an assembly line is formed to fold, stuff, seal, and control the mail-out package, using the address list as a control log.
These activities are normally done in-house; however, they can also be done by an outside firm when a long lead time is available, the sample is large, and the benefits outweigh the costs. DATA COLLECTION --------------------------------------------------------- Chapter 14:2 Essential to a good data collection phase is the monitoring of responses (and nonresponses) and a continuing effort to get the responses. Generally, GAO attempts to attain a response rate of 75 to 95 percent, which is the generally accepted standard of the survey research community. MONITORING RETURNS ------------------------------------------------------- Chapter 14:2.1 The address list developed for the mail-out package is an excellent tool for monitoring returns and ensuring that an outcome--a return or a reason for no return--is recorded for each sample unit. This same list will serve as the basis for mailing follow-up materials to nonrespondents. Maintaining this log is very important because it also serves as a control to document the cases that were entered into the computer. The earliest returns may be undeliverable packages. For each undeliverable, a note should be made on the control list of why the package could not be delivered. Incorrect addresses should be recorded and new mailings should be prepared when feasible. Other early returns may come from those who were erroneously included in the sample and therefore should not complete the questionnaire. It is important to separate inappropriately sampled units so that both the sample size and the population size can be adjusted. The return of questionnaires should be noted in the control log (usually with the date of return). When anonymity was assured, the returned post cards serve this purpose. FOLLOW-UP PROCEDURES ------------------------------------------------------- Chapter 14:2.2 Follow-ups can take several forms and can be conducted with varying frequency. For example, a project might begin with an initial mailing and then be followed by one or two follow-ups, using the normal postal system. Final follow-ups might then be conducted, using telephone contacts, mailgrams, or telegrams. Each technique has its advantages in certain situations. About 3 weeks after the initial mailing, responses will probably drop off each day. They are likely to trail off to a response total of about 30 percent to 50 percent. At this point, a follow-up is needed. Over the years, GAO has found that a single follow-up will bring in about one third to half of the outstanding questionnaires. Thus, we expect that about 3 weeks after mailing the first follow-up, we will have about 50 to 75 percent of our responses. A second mailed follow-up may be helpful at 8 to 9 weeks. At about the 11-week point, the response rate should be reevaluated in light of project goals. It may be possible to stop or perhaps to try one last follow-up by telegram or telephone. This decision should be based on such factors as (1) the number of outstanding responses (it is practical to call 75, but not 750, nonrespondents), (2) the availability of staff to make calls, and (3) the availability of resources (telegrams can be costly). Follow-up letters are prepared and produced in a manner similar to the preparation of the initial transmittal letter. The names of those who responded are subtracted from the mailing list, and a new file is created with the new letter. 
In the manner described previously, these two files are then merged, a new set of cover letters is produced, and new mail-out packages are assembled and mailed. See figures 14.1 and 14.2 for examples of initial transmittal and follow-up letters. Figure 14.1: Initial Questionnaire Transmittal Letter (See figure in printed edition.) Figure 14.2: Questionnaire Follow-Up Letter (See figure in printed edition.) EDITING ------------------------------------------------------- Chapter 14:2.3 As questionnaires are returned, they must be edited before they can be keypunched and entered into the computer system as a file. The editing is done in accordance with a preestablished protocol designed to identify obvious respondent errors and missing data points, make corrections and missing data estimates systematically and appropriately, and make sure the data entry operators can follow and accurately key the responses. The editing process can take weeks to complete, but a team can begin editing as soon as responses are received. Editing should not have to continue more than a short time after the last questionnaire has been received. To determine whether the responses are adequate, evaluators should look for the following kinds of items: 1. Is the response complete? 2. Did the respondent follow instructions? Skip appropriate questions? Answer appropriate questions? Check the correct number of responses to each question--one or all that apply? Place responses correctly in the response space provided? 3. Is the response sufficiently clear for data entry? 4. Do the open-ended responses provide useful data? 5. Did the respondent just check any response or make wild guesses without consideration by reviewing the consistency of the response pattern? 6. Did the respondent leave a space blank to indicate no or answer a question that should not have been answered? 7. Did the respondent answer clearly? For example, did he or she write "5K" when the correct notation was 5,000? 8. Are there missing values, misplaced answers, unclear responses, and inappropriate answers? 9. Are the missing values clearly distinguishable from the not-applicable answers, skips, or zero values? 10. Are all numbers right justified or positioned to the right to allow for either leading zeros or blank spaces? 11. Were extreme values checked that look inconsistent? 12. Are there responses that are logically inconsistent? 13. Are there mathematical errors in the responses? After the editors have reviewed perhaps 50 or 100 questionnaires according to these guidelines, they should prepare a written edit protocol that specifies the procedure for making edit changes. For example, this procedure should specify what to do if the respondent checks two alternatives of a set when he or she should have checked only one, and it should specify what items to look at to check for inconsistencies. Furthermore, some of the edit checks and corrections may be done by the computer after the data have been keyed and loaded on the computer; these include respondent math errors and coding blanks as missing values, no's, or "not applicable." These computer edit protocols should also be specified in the manual edit protocol before the editing starts, to minimize overlooking any edit procedure. Both the manual and computer edit protocols are used to develop the data entry protocols. 
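A minimal sketch in Python of the kind of computer edit check described above follows; the item names, legal codes, value limit, and consistency rule are hypothetical stand-ins for what the written edit protocol would actually specify.

    MISSING = -9                               # assumed code for a blank answer

    def edit_record(case_id, record):
        """Flag illegal codes, out-of-range values, and inconsistent answers."""
        problems = []
        if record.get("q1", MISSING) not in (1, 2, MISSING):
            problems.append("q1: illegal code for a yes-no item")
        cost = record.get("q5", MISSING)
        if cost != MISSING and not 0 <= cost <= 10000:
            problems.append("q5: value out of range")
        # Consistency check: staff with access to cash (q3) should not
        # exceed total staff (q2).
        total, with_access = record.get("q2", MISSING), record.get("q3", MISSING)
        if MISSING not in (total, with_access) and with_access > total:
            problems.append("q2/q3: inconsistent answers")
        for problem in problems:
            print(f"case {case_id}: {problem}")

    edit_record(101, {"q1": 3, "q2": 12, "q3": 20, "q5": 250000})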
Once edit protocols have been tested, inadequate or obviously incorrect responses must be assigned as missing values or sometimes corrected according to an established protocol for identifying the logically correct answer or adjusted according to further contact with the respondents (usually by telephone). Once the evaluators are satisfied that the responses meet project standards, the data reduction phase of the survey can begin. DATA REDUCTION --------------------------------------------------------- Chapter 14:3 Before the data can be analyzed, they must be moved from hard-copy form (the questionnaire) into a computerized data file that accurately reflects the hard-copy data. This process begins with keying the data onto a medium and in a format that the computer can read. KEYING ------------------------------------------------------- Chapter 14:3.1 Keying for GAO questionnaires is normally done by an outside contractor. Nearly always, the contractor keys from one of two sources--the questionnaires themselves or a coding sheet usually laid out in an 80-column card image format and prepared by the project team. Many GAO questionnaires are coded on an 80-column card format for ease of editing, not for ease of keying. The keying is generally done onto a tape or disk (not cards) that can readily be entered into the computer system as an unedited raw data file. Every data entry key stroke is first verified by the contractor. Keying instructions unique to the individual job are provided to the keyers for guidance. These should be written in conjunction with the manual and computer edit protocols so none of the edit or keying considerations are left to chance. Two of the evaluators' primary tasks are to ensure that the questionnaires given to the keyers are keyed and that all original questionnaires are returned--a control function. KEYED DATA VERIFICATION ------------------------------------------------------- Chapter 14:3.2 In a first, short, but necessary step, the tape or disk containing the unedited raw data file is loaded in the computer system. Once loaded in the computer system, the unedited raw data file can be converted to hard copy, in order to verify for a second time that the computer file accurately reflects the contents of the questionnaires. For GAO projects, at least 99 percent of the keyed strokes must be correct to be considered accurate. When unacceptable error rates are found, the data are keyed in again. Rather than verify the entire file, every question in every questionnaire of a sample of questionnaires can be verified. This is a cluster sample. How large should the sample be? It should be large enough to statistically ensure, at the 95-percent confidence level, that the data entry error rate is not more than 1 percent (1 plus or minus 0.4 percent). This often amounts to a 10-percent sample of cases--for a typical job of about 400 questionnaires and a typical questionnaire of about 250 characters. However, if the number of questionnaires or the number of characters per questionnaire is smaller or larger than the typical case, it is necessary to sample more than or less than the 10 percent, respectively. Also, if a greater precision is needed (error rate less than 1 percent) or if no error rate is permitted, then much larger samples or all the data must be verified by the evaluation team. Table 14.1 shows the percentage of questionnaires that might be sampled for a 1-percent error rate. However, the table is only a rule of thumb. 
A sampling expert should be consulted to determine the appropriate sample for keyed data verification. The verification process works best when two evaluators work together; one reads from the questionnaire while the other views the printed computer file.

Table 14.1
The Percentage of Questionnaires That Should Be Randomly Sampled to Determine the Keypunch Error Rate

                                Number of characters per questionnaire
Number of questionnaires        ---------------------------------------
per keypunch job                   1-99      100-300    More than 300
-----------------------------   --------   ---------   --------------
1-29                               100.0       100.0        100.0
30-99                               50.0        40.0         25.0
100-499                             10.0        10.0          7.5
500-999                              7.5         5.0          5.0
More than 1,000                      5.0\a       5.0\a        5.0\a
------------------------------------------------------------------------

\a The maximum number of questionnaires in the sample should be 384.

Even when an acceptable error rate is found, errors noted during the review should be corrected for the sampled cases. In addition, noted error patterns should be investigated. For example, assume the reviewers note (frequently a judgment call) that the keyer misinterpreted the responses to a question. Then all the responses to that question should be verified and corrections made. An additional edit should be made on all questions that can take on only a limited number of values. For example, a yes-no question may have values limited to 1 or 2, and a question asking about an item's cost may be known to have an upper limit of $10,000. A computer program that checks for out-of-range values should be run and corrections made. After this process has been completed, an edited raw data file is available that can be used in the initial steps of the analysis phase, as discussed in the next chapter. It is also important to note that the data verification noted above is a minimum protocol and that each GAO division may have additional, more specific and rigorous requirements. ANALYZING QUESTIONNAIRE RESULTS =========================================================== Chapter 15 ANALYSIS PLAN --------------------------------------------------------- Chapter 15:1 As noted earlier, a data analysis plan should have been developed as part of the evaluation design, after the questionnaire and sample have been developed but before any data are collected. Thinking through the data analysis may cause evaluators to reconsider their data-collection plan or even the evaluation questions themselves. In planning the data analysis, they might realize, for example, that they need additional data that they had not thought of before. An analysis plan also forces evaluators to decide what kind of findings they do and do not need to complete the evaluation. This process is important, because it is very easy to overburden the study with unnecessary analyses. Since most standard analysis packages can provide millions of analyses that would take many years to interpret, evaluators have to run the analysis; otherwise, it will run them. Also, unplanned analysis can result in fishing or data dredging--that is, the running of analyses without regard to a design or preconceived reason, just to see what will turn up. However, while a plan helps, evaluators cannot always predict relations that might emerge in exploratory analysis. The selection of analysis techniques and the variables to be analyzed will be determined to a large extent by the evaluation questions and the design requirements. Evaluators also need to make sure that their statistical analysis software routines can satisfy these requirements.
For example, can they handle the size, number, specification, and measurement of the variables? And can they do the analyses required? Furthermore, the choice to do certain kinds of analysis often requires a respecification of the variables, measures, and variable relationships. Later, when the analysis begins, the evaluators will know how adequate their planning and data collection have been. If the measures were properly defined, relevant, and sound, and if the data relationships turn out as hypothesized, then the analysis will proceed as planned. However, projects are rarely perfect--there usually are some gaps in the planning and problems in the data collection. Measures are not always properly specified. Some important data may not be collected, and some of the data that are collected may be irrelevant or unsound. Evaluators then need to modify the analysis plan, perhaps by scaling back the effort, expanding it to cope with unexpected developments, selecting methods to handle missing data, or exploring different ways of answering the evaluation questions. Regardless of departures from the original plan, however, the analysis must still proceed logically and step by step from very simple analyses to a limited number of more complex analyses.\1 -------------------- \1 The point of this chapter is to provide guidelines for developing a data analysis strategy. For a more detailed discussion of the quantitative techniques to implement this strategy, see U.S. General Accounting Office, Quantitative Data Analysis: An Introduction, GAO/PEMD-10.1.11 (Washington, D.C.: June 1992). ITEM RESPONSES AND UNIVARIATE ANALYSIS --------------------------------------------------------- Chapter 15:2 The first step is to go just a short way beyond the raw data on questionnaires by producing column, row, group, and subgroup tabulations and percentages, often called a "code book." The code book tells how people answered each item on the questionnaire by giving frequencies and percentages for each possible response category. Going one step further in the data analysis, evaluators can compute descriptive statistics and other indicators that help describe the frequency distributions. BIVARIATE ANALYSIS AND COMPARISON OF TWO GROUPS --------------------------------------------------------- Chapter 15:3 Comparisons between groups of respondents can be made. If evaluators want to study the relationship between two variables, they use correlational techniques, which show whether a change in one variable is associated with a change in another. For example, we might want to determine whether the performance of the Federal Aviation Administration's flight service station specialists decreases appreciably with age. We would plot the performance scores of specialists of various ages and see whether performance is related to age. We might use an analytic technique such as correlational analysis, which shows the degree to which two variables are related. Or we might compare the differences between two groups rather than the association between variables. For example, we might compare the performance of younger specialists with that of older specialists. Other primary analysis techniques would include cross tabulations, chi-square comparisons, "t" tests, and analyses of variance.
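To illustrate these first steps, here is a minimal sketch in Python of a code-book tabulation, simple descriptive statistics, a correlation, and a two-group comparison, using the hypothetical age-and-performance example above. The data, variable names, and age cutoff are invented, and an actual GAO analysis would normally be run in a statistical package.

# A minimal sketch of the first analysis steps: code-book frequencies,
# univariate descriptive statistics, a bivariate correlation, and a simple
# two-group comparison.  All values are invented for illustration.
from collections import Counter
from statistics import mean, stdev

# One record per respondent: (age, performance score, response to item 1).
records = [(34, 82, 1), (41, 75, 2), (52, 70, 1), (29, 88, 1),
           (47, 73, 2), (58, 65, 2), (38, 80, 1), (61, 62, 2)]

# "Code book" step: frequencies and percentages for each response category.
item1 = Counter(rec[2] for rec in records)
for category, count in sorted(item1.items()):
    print(f"item 1, category {category}: {count} ({count / len(records):.0%})")

# Univariate step: descriptive statistics for a single variable.
scores = [rec[1] for rec in records]
print(f"performance: mean {mean(scores):.1f}, standard deviation {stdev(scores):.1f}")

# Bivariate step: Pearson correlation between age and performance.
ages = [rec[0] for rec in records]
n = len(records)
r = (sum((a - mean(ages)) * (s - mean(scores)) for a, s in zip(ages, scores))
     / ((n - 1) * stdev(ages) * stdev(scores)))
print(f"correlation of age with performance: {r:.2f}")

# Two-group comparison: younger versus older specialists (a t test or an
# analysis of variance would ordinarily be used to judge the difference).
younger = [s for a, s, _ in records if a < 45]
older = [s for a, s, _ in records if a >= 45]
print(f"mean score: younger {mean(younger):.1f}, older {mean(older):.1f}")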
MULTIVARIATE ANALYSIS AND COMPARISON OF MULTIPLE GROUPS --------------------------------------------------------- Chapter 15:4 This level of analysis is used to examine associations among more than two variables or differences among more than two groups. For example, we might want to study the effect of age and experience on Federal Aviation Administration specialists' performance or the effect of age, experience, training and education, and recency of training and education all together. Here, we could use such multivariate techniques as partial correlations, multiple regression analysis, and factor analysis. We could also compare performance by looking at the differences between groups that have varying levels of each trait (older and experienced, younger and experienced, older with limited experience, younger with limited experience, and so on). We might use such techniques as multivariate analysis of variance, discriminant analysis, linear structural relations, or log-linear analysis. CHOICE OF ANALYSIS METHODS --------------------------------------------------------- Chapter 15:5 The choice of data analysis methods depends largely on the evaluation questions and subject matter under study and on the type of variables and what levels of measurement they satisfy. For example, if we had a question about whether the performance of Federal Aviation Administration specialists is different at different ages, and if we had reason to believe that performance was related to age and little else, a simple correlational analysis would reveal the degree of the relationship. But the matters GAO studies are usually more complicated than this, so we would expect other variables such as experience, education, training, and recency of education and training to be related to performance. We would then need to perform multivariate analysis in order to determine the relationships among the variables. Likewise, it might be important to compare performance across several groups rather than to confine the analysis to simple contrasts between pairs. The more complex analyses should usually be undertaken only after the results of simpler analyses have been examined. Sometimes evaluators have a choice between using associations and using group differences, and sometimes they do not. The shape of the data distribution, the measurement scales, and the plots of the functional relationship between the variables may rule out the use of correlation techniques. For example, sometimes we have to study group differences because the distribution of the observations is not normal; we could not then use certain correlational statistics. Correlational techniques are also inappropriate when the variables are measured on ordinal scales or when the relationships under study are not linear--that is, when the plot between the variables cannot be transformed into a straight line. It is important to realize that correlational techniques cannot by themselves be used to show causality. Because questions about cause and effect are sometimes posed, we must note that special designs such as nonequivalent comparison groups, regression discontinuity, and interrupted time-series are usually necessary for establishing causality. The logic of the evaluation design, not the analytic technique, is crucial in drawing inferences about causality.
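Before turning to telephone surveys, the sketch below illustrates the multivariate step described above: an ordinary least-squares regression of the hypothetical performance score on age and experience together. The data are invented, and numpy stands in here only for the statistical package a real analysis would use.

# A minimal sketch of a multivariate step: regressing an invented performance
# score on age and experience together with ordinary least squares.
import numpy as np

age        = np.array([34, 41, 52, 29, 47, 58, 38, 61], dtype=float)
experience = np.array([ 8, 15, 20,  4, 18, 30, 10, 33], dtype=float)
score      = np.array([82, 75, 70, 88, 73, 65, 80, 62], dtype=float)

# Design matrix with an intercept column; fit by least squares.
X = np.column_stack([np.ones_like(age), age, experience])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
intercept, b_age, b_exp = coef
print(f"fitted: score = {intercept:.1f} + {b_age:.2f}*age + {b_exp:.2f}*experience")

# Holding experience constant in this way is what distinguishes the
# multivariate estimate of the age effect from the simple correlation.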
ADAPTATIONS FOR THE DESIGN AND USE OF TELEPHONE SURVEYS =========================================================== Chapter 16 Telephone surveys are occasionally used in GAO assignments as another method for collecting structured data. However, there are differences between mail and telephone surveys, and they cannot be used interchangeably. To help evaluators appropriately use telephone surveys, we discuss the principal advantages and disadvantages of this methodology and the design requirements, adaptations, and administrative considerations in this concluding chapter. ADVANTAGES AND DISADVANTAGES OF TELEPHONE SURVEYS --------------------------------------------------------- Chapter 16:1 We use telephone interviews at GAO when time is essential. With sufficient staffing, a telephone survey can be completed in days as opposed to weeks for a face-to-face interview and months for comparable mail surveys. For some assignments, they are the only feasible approach. For example, in one audit we were required to estimate the number of homeless children in shelters nationwide on a given day. With prior arrangements and scores of callers, we called a national sample of shelters to get the count. Given the nature of the shelter environment and the prohibitive costs of face-to-face interviews, no other method would have been possible. While not as cheap as mail surveys, telephone interviews cost much less than face-to-face interviews. Telephone surveys may cost between $40 and $75 per case as opposed to hundreds of dollars for personal interviews and a few dollars per case for mail questionnaires. Telephone surveys also have certain disadvantages. For one thing, people seem to prefer mail surveys and personal interviews. For another, while telephone surveys are less sensitive to certain design problems, they are more sensitive to others. For instance, telephone instruments may be less sensitive to design flaws likely to cause primacy bias errors but more likely to be affected by recency bias errors. In addition, telephone responses are less complete and less accurate than mail or face-to-face responses. Many people do not like to talk on the telephone for long periods of time, because they do not like tying it up. They feel pressured to answer. They are more likely to answer in extremes. They answer from the top of their heads and truncate their memory search earlier than they do in other modes of data collection. They are also more likely to acquiesce, guess, or give any answer, an easy answer, or the same answer to all questions than in some of the other methods of interviewing. However, telephone interviews are an important and valid means of collecting data. In fact, the private sector, partly because it does not enjoy GAO's high mail response rates, relies very heavily on telephone surveys. Many government agencies and private sector businesses that must deal with the public also depend on the telephone because they sometimes have difficulty obtaining current and accurate address lists. Hence, these comments are not meant to discourage the use of telephone surveys. Telephone surveys are an important data collection method available for the evaluators' use. They do mean, however, that telephone surveys have to be very carefully crafted and adapted to the telephone medium. DESIGN GUIDELINES --------------------------------------------------------- Chapter 16:2 1. Minimize instrumentation errors.
As a medium, the telephone magnifies the effect of certain design problems and minimizes the effect of others. In this section, we consider design problems that have a much greater effect on telephone surveys than on mail questionnaires or face-to-face interviews. This sensitivity arises partly because the telephone medium lacks the added cues inherent in mail and face-to-face interviews. In the mail survey, all the information is presented simultaneously. The respondent can easily skip back and forth and use the context for help. In face-to-face interviews, both the interviewees and the interviewers use paralanguage (gestures, facial expressions, and so on) to help understand. Thus, the interviewee can ask for help without actually asking, and the interviewers can often tell whether their messages are understood. Also, in face-to-face interviews, this advantage can be complemented by the use of visual cues such as show cards. The telephone medium worsens the effects of design problems in clarity, construct development, language level selection, qualifications, question format selection, response categorization, question bias, facilitating memory recall, and minimizing undesired ordering and recency effects. However, on the positive side, it may be that design flaws in accounting for primacy and undesirable context effects will have less effect when the question is asked on the telephone. For guidelines to resolve these types of flaws, the reader is referred to chapters 2, 4, and 6-12 of this publication and to Using Structured Interviewing Techniques.\1 These materials provide sufficient guidelines to correct most design flaws. However, in the rest of this section, we add some design notes on clarity, facilitating short-term mental processing and long-term retrieval, guiding the line of questioning, minimizing the cognitive tasks, minimizing recency effects, setting up reasonable interview time lengths, and pretesting. 2. Stress clarity. Follow the guidelines on clarity specified in chapter 6 to the letter. Pay particular attention to the following suggestions. Use conversational English. Write with short and simple syntax. Limit the sentences or syntactical structure to 20 or 25 words. When possible, use familiar words. Use concrete rather than abstract words whenever possible. Use words that are easy to picture or imagine. Make sure all the words you use, and the way you use them, have a single meaning or a very limited number of meanings. Make sure all important qualifications are stated in such a way that they will be noted and understood. 3. Write to facilitate both short-term processing and long-term memory retrieval. Follow the guidelines specified in chapter 8 and carefully consider the following. Limit to 25 words or fewer each idea or unit of information that is to be kept in the listener's head long enough for the higher cognitive processes to work. This is because respondents have trouble comprehending speech that is spoken faster than 100 words a minute (25 words per 15 seconds), and the immediate memory span or processing capability--the amount of time most people can keep information in their heads without losing it or subjecting it to additional mental processing--is 15 seconds. Limit the higher-level cognitive tasks to steps or responses that require no more than 15 to 30 seconds to answer. If you cannot do this, alter the question or the script so that there is interviewer or interviewee feedback or interaction at least every 45 seconds.
Respondents feel pressured to answer quickly to alleviate the silence between question and answer during a telephone interview. Forty or 50 seconds of silence is often too much pressure. To relieve it, many will cut the quality of their mental processing in order to answer more quickly. 4. Guide the line of questioning. Respondents provide more complete and accurate answers if they can anticipate the line of questioning and the information they must retrieve. Telling the respondents where you are going, or providing transitions so that the next questions can easily be inferred or anticipated, helps them do some anticipatory cognitive processing that improves their ability to answer. In some of the alternative methodologies, the respondents can use the context of the instrument to warn them of what is coming next. But telephone interviews are devoid of such context cues. 5. Decompose the cognitive tasks. Make sure all complex and difficult comprehension tasks are broken down into small steps and that these steps form discrete, complementary, and logical operations. Unlike respondents in other media, telephone respondents do not have a visual representation of the problem. They must keep all the rules, conditions, and qualifications in their heads. Also, remember that many feel they must answer quickly. If the task appears difficult, the interviewee often resorts to inefficient and error-prone heuristics and strategies. This does not mean that we cannot use the telephone to audit complex issues. It just means that we must break down a complex inquiry into smaller, logically ordered operations. 6. Minimize recency effects. Perhaps the biggest difference between telephone surveys and alternative methods is a pronounced recency bias. That is, alternatives and conditions presented in the latter part of the question will be remembered best and, hence, are more likely to be chosen. One way to mitigate this effect is to limit the alternatives to seven choices or, if the choices are more complex, to five or fewer. 7. Minimize the tendency to extremes. Telephone respondents appear to be more likely to answer in extremes than their counterparts in mail surveys and face-to-face interviews. To minimize this tendency, use the techniques discussed in the latter part of chapter 4 on intensity scale formats. That is, use a branching format with a middle alternative whenever possible. Also, be careful to use well-anchored, equal-appearing intervals in the response scales. 8. Keep the interview short. For a variety of reasons, some people get uncomfortable during extended telephone interviews. If the interview goes over half an hour, they feel somewhat stressed. Repetitions become tedious. They are more likely to acquiesce, to guess, to cut their memory or cognitive tasks short, to answer from the top of their heads, to select extremes, or to use other forms of shortening their responses. 9. Pretest the interview under realistic conditions. Follow the procedure described in chapter 12 with the following exceptions. Administer the pretest over the telephone so that the interviewer is not in the presence of the respondent. Have an observer (a person different from the interviewer) be present to observe the respondent and to take notes and record response times as he or she would in a normal pretest procedure. The observer should be able to hear the interviewer over an extension. If this is not possible, the observer should at least have a copy of and be familiar with the script in order to follow the interview.
The pretest debriefing should be conducted as if it were a normal pretest. Here are some cautions concerning the dual administration of telephone and mail surveys. On occasion, evaluators may consider using mail, telephone, and face-to-face interview methods to administer the same instruments. For example, they might want to use a telephone survey to complete the last follow-up of a mail survey because the telephone methods require less calendar time. Another example is the use of a mail survey to contact part of a telephone survey population that could not be reached because of unlisted numbers, duty overseas, or other reasons. While this can sometimes be done without compromising the survey, it is usually not a good idea. As we can infer from the preceding chapters on mail surveys and the previous discussion on telephone surveys, these methods have different effects and can produce different results. Consider these differences. Mail surveys are more prone to primacy bias and contextual cues than telephone surveys. Telephone surveys show more recency bias, social desirability bias, and tendencies toward extremes than mail surveys. Furthermore, these differences become even greater as the cognitive requirements of the subject matter become more difficult. This does not mean that we can never use a mixed mode. We have in fact used it successfully in several studies. However, in each of these cases, we were careful to plan and design for a mixed mode administration. In addition, if mixed approaches are to be used, it is imperative that the survey responses be tested to rule out or account for mode differences. -------------------- \1 U.S. General Accounting Office, Using Structured Interviewing Techniques, GAO/PEMD-10.1.5 (Washington, D.C.: July 1991). ADMINISTRATION --------------------------------------------------------- Chapter 16:3 Telephone survey administration requires an advance letter, a contact log, a trained staff, a monitoring procedure, and, if possible, computer assistance. ADVANCE LETTER ------------------------------------------------------- Chapter 16:3.1 The purpose of the letter is to alert respondents as to who you are, why you are calling, and when you expect to call. This establishes your legitimacy prior to the call, thus breaking down some of the respondents' reluctance. It also facilitates the interview because the respondents can refresh their memory, consult records, and sometimes have the necessary information at hand. The letter minimizes the chances of contacting the wrong person and helps increase first-call contact rates because the respondent is aware of your interview schedule. While advance letters are not essential, and you can obviously conduct an interview without one, they have a very great effect on facilitating the data collection. TELEPHONE LOG ------------------------------------------------------- Chapter 16:3.2 All interviewers should keep a telephone log; it becomes part of the data collection record along with the completed interview. The major purposes of the log are to keep a nonrespondent record, to provide data to make sampling adjustments, to facilitate call backs, and to identify reluctant respondents. In GAO's experience, an interviewer can complete from 6 to 12 half-hour interviews a day. We use the log because it is rare that an interview is completed on the first call. It usually takes at least three calls to get a completed interview. The time and days of the calls are varied so as to increase the chances of getting a contact. 
If a contact for other than the respondent is made, then the caller should verify the respondent's identifying information and seek referral information. For example, ask for the best time, day, date, and number to reach the respondent or for other numbers or other people who may help locate the respondent. Referrals may even help in the search for proxies, if this option was part of the design. Finally, the log should state the status of the interview. That is, was the interview completed, partially completed, or refused? Are more calls planned? (Give a justification, if not.) Are there call-back appointments? (Give time and place, if yes.) TELEPHONE INTERVIEW LOG ENTRIES ----------------------------------------------------- Chapter 16:3.2.1
1. Case number
2. Sample strata number
3. Name
4. Title
5. Address
6. Work number
7. Home number
8. Other numbers
9. Call try, 1st, 2nd, 3rd, etc.
10. Date and time of call
11. Contact, yes or no (if "no," why not: busy, no answer, disconnected, wrong number, no other number, moved, or other)
12. Contact party (if yes): respondent, yes or no; other, yes or no; if other, name, title, and number
13. Referral information: name and number of referral; names, titles, and numbers of possible locators; and dates and times for best chances of contact
14. Interview status: eligible, yes or no; complete, yes or no; partially complete, yes or no; refusal, yes or no; call back, yes or no (if no, justification); call-back appointment, yes or no (date and time of appointment, if yes); other status (specify)
15. Proxy information, if relevant: name, title, address, and number of proxy; justification for proxy use
REFUSALS, NO CONTACTS, AND PROXIES ----------------------------------------------------- Chapter 16:3.2.2 As with all surveys, decisions have to be made about refusals, no contacts, and proxies. In telephone surveys, if refusals are numerous enough to be of concern, they are often referred to a more experienced or different interviewer. After a week or two, this interviewer again calls the persons who refused and attempts to persuade them to complete the interviews. Interviewers experienced with "conversions," as they are called, can usually convert from one-third to one-half of the refusals. If the refusals fail to convert, then the interviewer tries to see if an alternative method would be acceptable. If not, the interviewer tries to get a limited response or the reasons for refusing. Alternative methods are mail questionnaires and face-to-face interviews. However, if alternatives are used, they should be analyzed for media and reluctant-respondent effects before they are included in the data base. A limited response might be the answer to one or two questions if there are a few questions that are much more important than the rest. These questions should be reviewed before they are included in the data base because they were taken out of interview context. Such interviewees should be considered reluctant respondents. Also, the interviewer should attempt to find out the respondent's reasons for refusing. In addition to statements like "too busy," "not interested in the problem," "don't give interviews," and "don't know who you are," respondents sometimes give valid reasons why they should not have been part of the population. If the reasons are valid, then such cases can be dropped from the sampling group. The refusal group should also be analyzed, if possible, for characteristic differences from the respondent sample.
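Because the refusal group (and, as discussed next, the no-contact group) should be compared with the respondents for characteristic differences, here is a minimal sketch of such a check. The counts and the stratum characteristic are invented, and the familiar chi-square comparison (here via scipy) stands in for whatever test the analysis plan actually calls for.

# A minimal sketch of a nonresponse check: compare a known characteristic
# (here, sample stratum as recorded in the telephone log) for completed cases
# against refusals and no-contacts.  All counts are invented.
from scipy.stats import chi2_contingency

#                stratum 1  stratum 2  stratum 3
completed     = [      52,        61,        38]
not_completed = [      11,         9,        27]   # refusals plus no-contacts

chi2, p_value, dof, expected = chi2_contingency([completed, not_completed])
print(f"chi-square = {chi2:.1f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Nonrespondents differ from respondents on this characteristic;")
    print("consider the effect on external validity before generalizing.")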
A case is labeled "no contact" if the respondent could not be contacted after several tries (usually seven) and little or no referral information is available. As with the refusal group, the no-contact group should be analyzed to see if its members have left the population and if they are different from the respondents. Very often the no-contacts should not have been considered part of the sampling frame. Possible reasons for no-contacts are that they left the area, changed jobs, retired, died, discontinued or changed or reallocated operations or responsibilities, gone out of business, and so on. The analysis of the no-contact group characteristics is also important. The extent to which they are different from those of the respondents may affect the external validity. Sometimes it may be possible or even more appropriate to substitute a proxy for a selected respondent. For example, the interviewer may have found out that the respondent's responsibilities were transferred to another department or that his or her responsibilities were shared by another co-worker or supervisor. However, this substitution should be justified before it is implemented. TRAINING, MONITORING, AND COMPUTER ASSISTANCE ------------------------------------------------------- Chapter 16:3.3 All interviewers should be trained and rehearsed in the administration of the interview. Most important, they should be trained to speak at no more than 100 words per minute. If the interview is not standardized, then the interviewer should follow the probes denoted in the script as well as other problem areas that may have been signaled by the respondent's paralanguage. Examples are changes in pitch, enunciation, speech rate, and word trailing. Methods for this type of interview are included in chapter 12. Methods for training in standardized interviews are described in Using Structured Interviewing Techniques. These references stress the importance of interview standardization and interviewer training. Each contact should be interviewed according to the prepared script and asked the same questions in the same way. The telephone interviewing should be monitored. Most commercial telephone interviewing operations have a centralized system in which monitors can hear both parties as the interviews are being conducted. Other systems of monitoring allow for more limited monitoring such as on-line sampling of two-way conversations, supervisor monitoring of just the interviewer's conversation, or a recording of the two-way conversation. While the centralized system is superior to the other alternatives, monitoring is very important. It is very important that all interviewers maintain the same level of enthusiasm and professionalism necessary to keep the interview going and on track and avoid feedback that will have untoward or biasing effects. Often telephone interviews are programmed into a computer-assistance package that facilitates response recording and the administration of the interview. These methods are described in Using Structured Interviewing Techniques. BIBLIOGRAPHY ============================================================ Chapter 1 Baddeley, A. D. The Psychology of Memory. New York: Basic Books, 1976. Belson, W. The Design and Understanding of Survey Questions. London: Gower, 1981. Biderman, A. D. (ed.). An Inventory of Surveys of the Public on Crime, Justice and Related Topics. Washington, D.C.: U.S. Government Printing Office, 1972. Biemer, Paul, et al. (eds.). Measurement Errors in Surveys. 
New York: John Wiley and Sons, 1991.
Bradburn, N. M., and S. Sudman. Response Effects in Surveys. Chicago: Aldine, 1974.
Converse, Jean, and Stanley Presser. Survey Questions: Handcrafting the Standardized Questionnaire. Beverly Hills, Calif.: Sage Publications, 1986.
Deming, W. E. Sample Design in Business Research. New York: John Wiley and Sons, 1960.
Dillman, D. A. Mail and Telephone Surveys. New York: John Wiley and Sons, 1978.
Erdos, P. L. Professional Mail Surveys. New York: McGraw-Hill, 1970.
Flesch, R. Say What You Mean. New York: John Wiley and Sons, 1974.
Frey, J. H. Survey Research by Telephone. Beverly Hills, Calif.: Sage Publications, 1983.
Groves, R. M. Survey Errors and Survey Costs. New York: John Wiley and Sons, 1989.
Groves, R. M., et al. Telephone Survey Methodology. New York: John Wiley and Sons, 1988.
Krosnick, J., and L. Fabrigar. "Cognitive Perspectives on Survey Questionnaire Design." Manuscript, Ohio State University, Columbus, Ohio, 1991.
Lockhart, D. C. (ed.). Making Effective Use of Mailed Questionnaires. San Francisco: Jossey-Bass, 1984.
Moser, C. A., and Graham Kalton. Survey Methods in Social Investigation, 2nd ed. London: Heinemann, 1971.
Oppenheim, A. N. Questionnaire Design and Attitude Measurement. New York: Basic Books, 1966.
Payne, S. L. The Art of Asking Questions. Princeton: Princeton University Press, 1951.
Rosenberg, M. The Logic of Survey Analysis. New York: Basic Books, 1968.
Rossi, Peter, James Wright, and Andy Anderson (eds.). Handbook of Survey Research. New York: Academic Press, 1983.
Schuman, H., and S. Presser. Questions and Answers in Attitude Surveys. New York: Harcourt Brace Jovanovich, 1981.
Sudman, S. Applied Sampling. New York: Academic Press, 1976.
Sudman, S., and N. M. Bradburn. Asking Questions. San Francisco: Jossey-Bass, 1982.
Sudman, S., and N. M. Bradburn. Response Effects in Surveys. Chicago: Aldine, 1974.
Turner, C. F., and E. Martin (eds.). Surveying Subjective Phenomena, Vols. 1 and 2. New York: Russell Sage Foundation, 1984.
U.S. General Accounting Office, Pell Grant Validation Imposes Some Costs and Does Not Greatly Reduce Award Errors: New Strategies Are Needed, GAO/PEMD-85-10. Washington, D.C.: September 1985.
Warwick, D. P., and C. A. Lininger. The Sample Survey: Theory and Practice. New York: McGraw-Hill, 1975.
GLOSSARY ============================================================ Chapter 2 ANCHORS -------------------------------------------------------- Chapter 2:0.1 Anchors are items that serve as reference points from which other items in the series or other points in the scale are judged or compared. For example, the opposite ends or poles of a scale identify the extremes so that all values within the scale are either greater or less than one of these extremes. Also, the scale midpoint serves as an anchor in that it either divides the scale into categories or quantifies the half value. ATTRIBUTE -------------------------------------------------------- Chapter 2:0.2 A characteristic that describes a person, thing, or event. For example, being female and being male are attributes of persons. BIAS -------------------------------------------------------- Chapter 2:0.3 Words, sentence structure, attitudes, and mannerisms that unfairly influence a respondent's answer to a question. Bias in questionnaire data can stem from a variety of factors, including the choice of words, sentence structure, and the sequence of questions. Both interviewer and instrument bias can exist.
BIVARIATE ANALYSIS -------------------------------------------------------- Chapter 2:0.4 An analysis of the relationship between two variables. CONFIDENCE LEVEL -------------------------------------------------------- Chapter 2:0.5 The level of certainty to which an estimate can be trusted. The degree of certainty is expressed as the chance that a true value will be included within a specified range, called a confidence interval. CONSTRUCT -------------------------------------------------------- Chapter 2:0.6 A concept that describes and includes a number of characteristics or attributes. The concepts are often unobservable ideas or abstractions, such as "community," "well-being," "performance," or "democracy," that are represented by observable measures. ESTIMATION ERROR -------------------------------------------------------- Chapter 2:0.7 The amount by which an estimate differs from a true value. This error includes the error from all sources (for example, sampling error and measurement error). JUDGMENT SAMPLE -------------------------------------------------------- Chapter 2:0.8 A sample selected by using discretionary criteria rather than criteria based on the laws of probability. MEASURE -------------------------------------------------------- Chapter 2:0.9 A neutral concept that determines which data will be collected. The chief methodological concern in developing a useful measure is its validity. MEASUREMENT ------------------------------------------------------- Chapter 2:0.10 A procedure for assigning a number to an object or an event. MEASUREMENT ERROR ------------------------------------------------------- Chapter 2:0.11 The difference between a measured value and a true value. MULTIVARIATE ANALYSIS ------------------------------------------------------- Chapter 2:0.12 An analysis of the relationships among more than two variables. NONRESPONDENT ------------------------------------------------------- Chapter 2:0.13 A person who fails to answer either a questionnaire or a question. OPERATIONALIZATION ------------------------------------------------------- Chapter 2:0.14 A process of describing constructs or variables in concrete terms so that measurements can be made. PRECISION ------------------------------------------------------- Chapter 2:0.15 The exactness of a question's wording or the amount of random error in an estimate. RELIABILITY ASSESSMENT ------------------------------------------------------- Chapter 2:0.16 An effort to demonstrate the repeatability of a measurement--that is, how likely a question is to get consistently similar results. It is different from verification (checking accuracy) and validity (see Validity Assessment). RESPONSE STYLE ------------------------------------------------------- Chapter 2:0.17 The tendency of a respondent to answer in a specific way regardless of how a question is asked. SAMPLING ERROR ------------------------------------------------------- Chapter 2:0.18 The maximum expected difference between a probability sample value and the true value. SCALE ------------------------------------------------------- Chapter 2:0.19 A set of values with a specified minimum and maximum. STANDARDIZED QUESTION ------------------------------------------------------- Chapter 2:0.20 A question that is designed to be asked or read and interpreted in the same way regardless of the number and variety of interviewers and respondents.
UNIT OF ANALYSIS ------------------------------------------------------- Chapter 2:0.21 The class of elemental units that constitute the population and the units selected for measurement; also, the class of elemental units to which the measurements are generalized. UNIVARIATE ANALYSIS ------------------------------------------------------- Chapter 2:0.22 An analysis of a single variable. VALIDITY ASSESSMENT ------------------------------------------------------- Chapter 2:0.23 The procedures necessary to demonstrate that a question or questions are measuring the concepts that they were designed to measure. VARIABLE ------------------------------------------------------- Chapter 2:0.24 A logical collection of attributes. For example, each possible age of a person is an attribute, and the collection of all such attributes is the variable age. VERIFICATION ------------------------------------------------------- Chapter 2:0.25 An effort to test the accuracy of the questionnaire response data. The concern is uniquely with data accuracy and deals with neither the reliability nor the validity of measures. PAPERS IN THIS SERIES ============================================================ Chapter 3 This is a flexible series continually being added to and updated. The interested reader should inquire about the possibility of additional papers in the series.
The Evaluation Synthesis. GAO/PEMD-10.1.2.
Content Analysis: A Methodology for Structuring and Analyzing Written Material. GAO/PEMD-10.1.3, formerly methodology transfer paper 3.
Designing Evaluations. GAO/PEMD-10.1.4.
Using Structured Interviewing Techniques. GAO/PEMD-10.1.5.
Using Statistical Sampling. GAO/PEMD-10.1.6.
Developing and Using Questionnaires. GAO/PEMD-10.1.7.
Case Study Evaluations. GAO/PEMD-10.1.9.
Prospective Evaluation Methods: The Prospective Evaluation Synthesis. GAO/PEMD-10.1.10.
Quantitative Data Analysis: An Introduction. GAO/PEMD-10.1.11.