Warwick and Lininger (1975)
point out that there are two basic goals in questionnaire design.
1. To obtain information
relevant to the purposes of the survey.
2. To collect this information
with maximal reliability and validity.
How can a researcher be sure
that the data gathering instrument being used will measure what it is supposed
to measure and will do this in a consistent manner? This is a question that can
only be answered by examining the definitions for and methods of establishing
the validity and reliability of a research instrument. These two very important
aspects of research design will be discussed in this module.
Validity
Validity can be defined as the
degree to which a test measures what it is supposed to measure. There are three
basic approaches to the validity of tests and measures as shown by Mason and
Bramble (1989). These are content validity, construct validity, and criterion-related
validity.
Content Validity
This approach measures the
degree to which the test items represent the domain or universe of the trait or
property being measured. In order to establish the content validity of a
measuring instrument, the researcher must identify the overall content to be
represented. Items that accurately represent the information in all areas must then be randomly chosen from this content. By using this method the
researcher should obtain a group of items which is representative of the
content of the trait or property to be measured.
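As a rough illustration only (the content areas, item pools, and sampling fraction below are invented for the example), a proportional random sample of items can be drawn from each area of a hypothetical content domain so that every area is represented:

# Illustrative sketch: stratified random sampling of items from a hypothetical
# content domain. Area names, item labels, and the 20% fraction are assumptions.
import random

domain = {
    "planning":    [f"plan_{i}" for i in range(30)],
    "instruction": [f"instr_{i}" for i in range(50)],
    "assessment":  [f"assess_{i}" for i in range(20)],
}

test_items = []
for area, items in domain.items():
    k = max(1, round(0.2 * len(items)))   # sample about 20% of each area
    test_items.extend(random.sample(items, k))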
Identifying the universe of
content is not an easy task. It is, therefore, usually suggested that a panel
of experts in the field to be studied be used to identify a content area. For
example, in the case of researching the knowledge of teachers about a new
curriculum, a group of curriculum and teacher education experts might be asked
to identify the content of the test to be developed.
Construct Validity
Cronbach and Meehl (1955)
indicated that, "Construct validity must be investigated whenever no
criterion or universe of content is accepted as entirely adequate to define the
quality to be measured" as quoted by Carmines and Zeller (1979). The term
construct in this instance is defined as a property that is offered to explain
some aspect of human behavior, such as mechanical ability, intelligence, or
introversion (Van Dalen, 1979). The construct validity approach concerns the
degree to which the test measures the construct it was designed to measure.
There are two parts to the
evaluation of the construct validity of a test. First and most important, the
theory underlying the construct to be measured must be considered. Second, the
adequacy of the test in measuring the construct is evaluated (Mason and
Bramble, 1989). For example, suppose that a researcher is interested in
measuring the introverted nature of first year teachers. The researcher defines
introverted as the overall lack of social skills such as conversing, meeting
and greeting people, and attending faculty social functions. This definition is
based upon the researcher’s own observations. A panel of experts is then asked
to evaluate this construct of introversion. The panel cannot agree that the
qualities pointed out by the researcher adequately define the construct of
introversion. Furthermore, the researcher cannot find evidence in the research
literature supporting the introversion construct as defined here. Using this
information, the validity of the construct itself can be questioned. In this
case the researcher must reformulate the previous definition of the construct.
Once the researcher has
developed a meaningful, useable construct, the adequacy of the test used to
measure it must be evaluated. First, data concerning the trait being measured
should be gathered and compared with data from the test being assessed. The
data from other sources should be similar or convergent. If convergence exists,
construct validity is supported.
After establishing convergence,
the discriminant validity of the test must be determined. This involves
demonstrating that the construct can be differentiated from other constructs
that may be somewhat similar. In other words, the researcher must show that the
construct being measured is not the same as one that was measured under a
different name.
Criterion-Related
Validity
This approach is concerned with
detecting the presence or absence of one or more criteria considered to
represent traits or constructs of interest. One of the easiest ways to test for
criterion-related validity is to administer the instrument to a group that is
known to exhibit the trait to be measured. This group may be identified by a
panel of experts. A wide range of items should be developed for the test with
invalid questions culled after the control group has taken the test. Items for which responses are drastically inconsistent among individual members of the group should be omitted. If the researcher has
developed quality items for the instrument, the culling process should leave
only those items that will consistently measure the trait or construct being
studied. For example, suppose one wanted to develop an instrument that would
identify teachers who are good at dealing with abused children. First, a panel
of unbiased experts identifies 100 teachers out of a larger group that they
judge to be best at handling abused children. The researcher develops 400
yes/no items that will be administered to the whole group of teachers,
including those identified by the experts. The responses are analyzed, and the items to which the expert-identified teachers and the other teachers respond differently are taken to be the questions that will identify teachers who are good at dealing with abused children.
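As an illustration of the culling step (all responses and group labels below are hypothetical), a simple Python sketch can rank yes/no items by how differently the expert-identified teachers and the remaining teachers answer them; items whose difference is near zero are candidates for removal:

# Illustrative sketch: rank yes/no items by the difference in the proportion of
# "yes" answers between the expert-identified group and everyone else.
def item_discrimination(responses, expert_flags):
    """responses: one list of 0/1 answers per respondent; expert_flags: one bool per respondent."""
    n_items = len(responses[0])
    ranked = []
    for i in range(n_items):
        expert = [r[i] for r, e in zip(responses, expert_flags) if e]
        others = [r[i] for r, e in zip(responses, expert_flags) if not e]
        diff = sum(expert) / len(expert) - sum(others) / len(others)
        ranked.append((i, diff))
    # large absolute differences separate the groups; values near zero mark items to cull
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)

answers = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]]   # hypothetical yes(1)/no(0) answers
flags = [True, True, False, False]                        # expert-identified respondents
print(item_discrimination(answers, flags))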
Reliability
The reliability of a research
instrument concerns the extent to which the instrument yields the same results
on repeated trials. Although unreliability is always present to a certain
extent, there will generally be a good deal of consistency in the results of a
quality instrument gathered at different times. The tendency toward consistency
found in repeated measurements is referred to as reliability (Carmines &
Zeller, 1979).
In scientific research, accuracy
in measurement is of great importance. Research in the physical sciences normally measures
physical attributes, which can easily be assigned a precise value. Many times
numerical assessments of the mental attributes of human beings are accepted as
readily as numerical assessments of their physical attributes. Although we may
understand that the values assigned to mental attributes can never be
completely precise, the imprecision is often looked upon as being too small to
be of any practical concern. However, the magnitude of the imprecision is much
greater in the measurement of mental attributes than in that of physical
attributes. This fact makes it very important that the researcher in the social
sciences and humanities determine the reliability of the data gathering
instrument to be used (Willmott & Nuttall, 1975).
Retest Method
One of the easiest ways to
determine the reliability of empirical measurements is by the retest method in
which the same test is given to the same people after a period of time. The
reliability of the test (instrument) can be estimated by examining the
consistency of the responses between the two tests.
If the researcher obtains the
same results on the two administrations of the instrument, then the reliability
coefficient will be 1.00. Normally, the correlation of measurements across time
will be less than perfect due to different experiences and attitudes that
respondents have encountered from the time of the first test.
The test-retest method is a
simple, clear cut way to determine reliability, but it can be costly and
impractical. Researchers are often only able to obtain measurements at a single
point in time or do not have the resources for multiple administrations.
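A minimal sketch of the retest estimate, assuming hypothetical total scores from two administrations, is simply the Pearson correlation between the two sets of scores:

# Illustrative sketch: test-retest reliability as the Pearson correlation between
# two administrations of the same instrument. Scores below are hypothetical.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

first_administration  = [12, 15, 9, 20, 14, 18, 11, 16]   # hypothetical total scores
second_administration = [13, 14, 10, 19, 15, 17, 10, 18]
print(pearson_r(first_administration, second_administration))  # close to, but below, 1.00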
Alternative Form
Method
Like the retest method, this
method also requires two testings with the same people. However, the same test
is not given each time. Each of the two tests must be designed to measure the
same thing and should not differ in any systematic way. One way to help ensure
this is to use random procedures to select items for the different tests.
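One rough way to carry out such a random assignment (the item pool below is hypothetical) is sketched here:

# Illustrative sketch: randomly assigning items from a common pool to two
# alternative forms so that neither form differs in any systematic way.
import random

item_pool = [f"item_{i}" for i in range(1, 41)]   # hypothetical 40-item pool
random.shuffle(item_pool)
form_a = sorted(item_pool[:20])
form_b = sorted(item_pool[20:])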
The alternative form method is
viewed as superior to the retest method because a respondent’s memory of test
items is not as likely to play a role in the data received. One drawback of
this method is the practical difficulty in developing test items that are
consistent in the measurement of a specific phenomenon.
Split-Halves Method
This method is more practical in
that it does not require two administrations of the same or an alternative form
test. In the split-halves method, the total number of items is divided into
halves, and a correlation is taken between the two halves. This correlation only
estimates the reliability of each half of the test. It is necessary then to use
a statistical correction to estimate the reliability of the whole test. This
correction is known as the Spearman-Brown prophecy formula (Carmines & Zeller, 1979):
Pxx" = 2Pxx' / (1 + Pxx')
where Pxx" is the reliability coefficient for the whole test and Pxx' is the split-half correlation.
Example
If the correlation between the
halves is .75, the reliability for the total test is:
Pxx" = [(2) (.75)]/(1 +
.75) = 1.5/1.75 = .857
There are many ways to divide
the items in an instrument into halves. The most typical way is to assign the
odd numbered items to one half and the even numbered items to the other half of
the test. One drawback of the split-halves method is that the correlation
between the two halves is dependent upon the method used to divide the items.
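A minimal Python sketch of the split-halves procedure, using hypothetical item responses and the odd/even division described above (Python 3.10 or later is assumed for statistics.correlation), looks like this:

# Illustrative sketch: split-half correlation plus the Spearman-Brown correction.
from statistics import correlation

# one row per respondent, one column per item (hypothetical 6-item instrument)
scores = [
    [4, 3, 5, 4, 2, 3],
    [2, 2, 1, 3, 2, 2],
    [5, 4, 4, 5, 4, 5],
    [3, 3, 2, 2, 3, 3],
    [1, 2, 2, 1, 1, 2],
]

odd_half  = [sum(row[0::2]) for row in scores]   # odd-numbered items (1st, 3rd, 5th)
even_half = [sum(row[1::2]) for row in scores]   # even-numbered items (2nd, 4th, 6th)

r_half = correlation(odd_half, even_half)        # split-half correlation (Pxx')
whole_test = 2 * r_half / (1 + r_half)           # Spearman-Brown estimate (Pxx")
print(round(r_half, 3), round(whole_test, 3))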
Internal Consistency
Method
This method requires neither the
splitting of items into halves nor the multiple administration of instruments.
The internal consistency method provides a unique estimate of reliability for
the given test administration. The most popular internal consistency
reliability estimate is given by Cronbach’s alpha. It is expressed as follows:
alpha = [N/(N-1)] × [1 - (Σσᵢ² / σₓ²)]
where N equals the number of items, Σσᵢ² equals the sum of the item variances, and σₓ² equals the variance of the total composite.
If one is using the correlation
matrix rather than the variance-covariance matrix then alpha reduces to the
following:
alpha = Np/[1+p(N-1)]
where N equals the number of
items and p equals the mean interitem correlation.
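A minimal Python sketch of the variance form of alpha, using hypothetical item scores and sample variances (the same N − 1 divisor used for S² in the table below), might look like this:

# Illustrative sketch: Cronbach's alpha from raw item scores via the variance form.
from statistics import variance

def cronbach_alpha(scores):
    """scores: one list of item scores per respondent."""
    n_items = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in scores])   # variance of total composite
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

scores = [          # hypothetical 4-item instrument, 5 respondents
    [4, 3, 5, 4],
    [2, 2, 1, 3],
    [5, 4, 4, 5],
    [3, 3, 2, 2],
    [1, 2, 2, 1],
]
print(round(cronbach_alpha(scores), 3))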
Example
If the average intercorrelation of a six-item scale is .5, then the alpha for the scale would be:
alpha = 6(.5)/[1+.5(6-1)]
= 3/3.5 = .857
An example of how alpha can be calculated can be given using the 10-item self-esteem scale developed by
Rosenberg (1965). (See table) The 45 correlations in the table are first
summed: .185+.451+.048+ . . . + .233= 14.487. Then the mean interitem
correlation is found by dividing this sum by 45: 14.487/45= .32. Now use this
number to calculate alpha:
alpha = 10(.32)/[1+.32(10-1)]
= 3.20/3.88
= .825
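To double-check this arithmetic, the standardized formula can be evaluated directly; the short sketch below simply recomputes the two results above:

# Re-checking the worked arithmetic with the standardized alpha formula.
def standardized_alpha(n_items, mean_r):
    return n_items * mean_r / (1 + mean_r * (n_items - 1))

print(round(standardized_alpha(6, 0.5), 3))    # six-item example: 0.857
print(round(standardized_alpha(10, 0.32), 3))  # Rosenberg scale example: about 0.825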
The coefficient alpha is an
internal consistency index designed for use with tests containing items that
have no right answer. This is a very useful tool in educational and social
science research because instruments in these areas often ask respondents to
rate the degree to which they agree or disagree with a statement on a
particular scale.
Cronbach’s Alpha
Example
Respondent    Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
 1             2     2     2     3     4     5     2     1     2     4
 2             1     1     2     4     5     5     1     2     2     2
 3             1     2     2     5     5     4     1     2     2     1
 4             3     2     2     2     1     3     2     2     2     2
 5             5     5     5     4     4     3     3     2     3     4
 6             1     1     1     1     5     1     1     1     1     1
 7             2     2     2     2     2     2     2     2     2     2
 8             2     1     2     2     4     1     3     3     1     1
 9             5     5     1     1     1     2     1     2     5     4
10             4     3     3     3     1     2     1     1     3     4

N             10    10    10    10    10    10    10    10    10    10
ΣX            26    24    22    27    32    28    17    18    23    25
Mean          2.6   2.4   2.2   2.7   3.2   2.8   1.7   1.8   2.3   2.5
ΣX²           90    78    60    89   130    98    35    36    65    79
Σx²           22.4  20.4  11.6  16.1  27.6  19.6   6.1   3.6  12.1  16.5
S²            2.5   2.3   1.3   1.8   3.1   2.2   .68   .4    1.3   1.8

(Each row gives one respondent's answers to the ten questions. Σx² is the sum of squared deviations, ΣX² − (ΣX)²/n, and S² = Σx²/(n − 1) is the item variance, with n = 10 respondents per item.)
Interitem correlations for question 1 (with questions 2 through 10): .917, .467, .337, .455, .014, -.146, .512, -.06, .74
p = mean interitem correlation = .36
alpha = Np / [1 + p(N-1)]
      = (10)(.36) / [1 + .36(10-1)]
      = 3.6/4.24
      = .849
SELF ASSESSMENT
1. Name the three types of validity.
2. Name four ways to establish the reliability of an instrument.