Handout 1

 AN OVERVIEW OF PSYCHOLOGICAL RESEARCH METHODS

AKA Ms. Morrow's Phat and Righteous Guide

(You'll see. Trust me!) 

Citation format:

Morrow, J. (2002). An overview of psychological research methods. Unpublished manuscript. Vassar College.

 

Introducing the Introduction: Basic Goals of Psychology

The overarching goal of psychology is to understand behavior. To this end, psychologists use the scientific method to observe, to test, to predict, and to explain behavior. The scientific method has a rich tradition with roots in rationalism and empiricism. Rationalism is the principle that reason is the true source of knowledge, and empiricism is the belief that knowledge is acquired through direct observation and experimentation. Observations cannot be understood, integrated, or placed in a broader theoretical context without the use of reason and logic. Enter logical positivism, which holds that science is the unification of observation, logic, verification, and falsification. Psychology puts it all together and rests on the general principles that (a) questions and controversies about behavior are best addressed through empirical methods, (b) hypotheses must be logical, testable, and falsifiable, and (c) theory, integrated sets of principles that systematically explain and predict behavior, should guide and be guided by sound empirical research.

With these principles as its foundation, the science of psychology emerged in the late 19th century with the experiments of Mueller, Helmholtz, Wundt, and others. Even back in the "olden days," the guys with big, bushy beards tested their ideas the same way modern psychologists do. All research starts with a hypothesis, a proposed explanation for an event or series of events.

Before testing a hypothesis, researchers must concretely and precisely define the concepts we wish to test. We do so by operationally defining concepts. An operational definition represents concepts in terms of the specific procedures used to measure or to produce them. Operational definitions specify (a) which variables are observed, (b) how observations are made, (c) how levels of variables are measured or induced, and (d) the procedural steps followed. Exact operational definitions coupled with the scientific method allow psychologists to test, to extend, and to challenge ideas.

In our ultimate quest to understand behavior, psychologists move from generating hypotheses to conducting research, and guiding us along the way are four broadly envisioned, interconnected goals. One goal is to predict (P) behavior. For example, an airline may want to predict the safest hours to fly from New York to Los Angeles, the best types of pilots for certain kinds of planes, or the fastest routes during specific times. The answers to these questions usually won't provide information about why particular times are good for flying or certain pilots are skilled with certain planes, but they will help the airlines provide safer and better service by allowing them to predict outcomes with a high degree of certainty.

The next step toward understanding entails explaining how (H) behavior works. Following on the flying example, psychologists not only want to predict which pilots are better, they want to know why certain pilots are better. They want to be able to explain the relationship between pilot characteristics and flying skill. Why does this matter? Well, prediction is based on the degree to which two or more variables are associated, which leaves a lot of room for misinterpretation and error, as you will see when you learn more about natural groups and correlational designs. Causal explanations tell us why and how behaviors come about. That's where the real action is. Establishing causality helps illuminate the mechanisms involved in generating behavior. Also, it's just plain interesting to know how behavior works (well, for some of us anyway). People seem to want to know "why." (You'll learn about this tendency when we discuss theories of attribution and causality.) Explaining behavior means determining the underlying processes. Explanations focus on the mechanisms of behavior. To sum it all up, the "H" in "phat" research refers to pinpointing "how" behavior comes about.

Understanding the causes of behavior also can help us to evoke and influence behavior, for better and for worse. For many psychologists, the application (A) of our knowledge about behavior is another goal of psychology. Research helps bring us closer to understanding how to train pilots, to cut down on the spread of HIV, to treat anxiety disorders, to negotiate contracts, to help children and adults deal with bullies, and to make a client's testimony more believable. But only righteous research can lead to effective applications (I'll explain. Be patient).

Here are a few words of caution. Ethical issues arise in the application of research. Even ethical research can be misused, and this can make the application unethical or questionable. Just because behavior can be changed or influenced, doesn't mean it should be.

The final goal of psychology is theory (T) building. Building a theory requires thoughtful analysis and integration of research findings. Theories go beyond making distinct little explanations of compartmentalized behaviors to constructing more comprehensive explanations of how processes work together to produce behavior. Theories provide a general picture of the mechanisms underlying behavior and describe the functions of behavior. Theories also specify detailed relationships among processes and behaviors. For example, people are likely to obey authority figures, even when asked to commit hurtful or humiliating acts. We know this based on the studies done and inspired by Stanley Milgram. But levels of obedience sometimes vary depending on the circumstances, that is, when different factors are involved. For example, people are more likely to obey an authority figure when they have already done so, and they are more likely to disobey when they have witnessed someone else doing so. Based on numerous studies, Milgram has constructed a theory that predicts when and explains why people obey (Milgram, 1974). In general, theories specify and explain the "conditions under which" certain behavior occurs.

Theory and research are inextricably bound. Good research must be informed by theory. And good theories rest on good research. Finally, research and theory co-evolve.

Let's get back to our pilots. They are so lonely flying about without us. The ideas for how pilots make decisions and successfully fly their planes are likely to fit into the broader theoretical categories of decision making, attention deployment, and performance. There will be some aspects of flying that will be specific to the task, and a good theory allows for those differences and explains them.

Without theory one can describe and often predict behavior. Without theory and research, behavior can't be well understood or explained.

So, now you know that the four general goals of psychology are PHAT. The righteous part of research is more layered in meaning, but well worth the time and effort. Read on in anticipation, you eager and budding researchers. "The truth is out there" but you need to recognize when you see it and when you don't.

 

Just for Good Measure: Good Research Relies on Good Measurement

 

Operationalization: An Ugly Word but a Defining Feature of Good Research

We can't understand behavior if we don't measure it accurately. Righteous research relies on accurate measurement. Accurate measurement begins with good operationalization and rests squarely on validity.

Recall that an operational definition represents concepts in terms of the specific operations used to measure or produce them. For example, if you are interested in seeing whether social interaction causes people to feel aroused, you need to specify what constitutes social interaction and arousal. You might represent arousal as autonomic nervous system (ANS) activity and decide to operationalize ANS activity by measuring galvanic skin response (GSR), heart rate (HR), and blood pressure (BP). Of course, you also would need to operationalize social interaction. You could choose to have three people in a room or two or whatever number works for your purposes. You could ask your subjects to work on a task or discuss current events. You should choose the activity that functionally best represents the type of social interaction you want to understand.

Let's say you are interested in memory performance. You might decide to count the number of words recalled from the list of words you present. The words you use, how you measure recall, and how you present the list all are aspects of operationalization. If you are interested in the effects of cognitive load on memory performance, you might use three levels of cognitive load: no load, low load, and high load. Load would be operationalized by what you have subjects do. In the no load condition subjects would have no additional task and they would perform the memory task. In the low load condition they could rehearse a 3-digit number while learning the word list, and in the high load condition they could rehearse a 10-digit number.

In summary, an operational definition refers to what you measure and how you measure it (e.g., ANS activity, specifically, GSR, HR, BP; number of words recalled; the list and how it is presented; and level of load and how load is induced).

Let's say you are a brilliant researcher who is interested in devising a new and improved measure of psychological stress. You decide to measure some of the dimensions that have been proposed to be important components of psychological stress, like arousal, low control, and anxiety. You need to test your new measure, the BRISS (Brilliant Researcher's Index of Self-reported Stress), to see if it is reliable and valid.

 

Reliability

Reliability refers to the consistency of a measurement device or procedure. It is the extent to which a test, measure, or procedure produces the same observation or relative score each time it is applied to a specific case. An exception is when scores or responses are supposed to systematically change under specific conditions. That is, a measuring device should be sensitive to real change and it should produce scores that represent such change. Think of a thermometer measuring whether an individual has a fever. It should be able to register important or meaningful changes in temperature. If it doesn’t register changes from 99 to 102 degrees that could happen in the course of a few minutes or an hour, it wouldn't be a good way to measure temperature. If it keeps registering 99 degrees, it would be a reliable thermometer, the recordings would be consistent, but it certainly wouldn’t be a good or sensitive device, and the scores it produces would be misleading. It wouldn’t be a valid way to measure temperature. Indeed, it could be a very dangerous measurement device.

How reliability is tested depends on the type of research being conducted. However, it all comes down to the same idea--does the measurement device or procedure show a high degree of consistency and repeatability? Reliability can be enhanced by (a) standardizing tests, measurement devices, and procedures, (b) making a sufficient number of observations; that is, testing enough behaviors and subjects so that atypical responses do not distort the findings, (c) decreasing biases and random error, (d) increasing sensitivity, and (e) maintaining adequate control (Zimbardo, 1992).

Continuing with the stress measure example, let's say after pretesting, you decide on the final 22 items that make up the BRISS. For the BRISS to be a good measure, responses on the 22 items need to be consistent and related; that is, they need to be highly intercorrelated. This kind of reliability, when items in a scale are intercorrelated, is called internal reliability or internal consistency (Carver & Scheier, 1992).

There are a number of different ways to test for internal consistency. One is split-half reliability. If your scale has 22 items, you determine the split-half reliability by dividing the items in half and comparing your subjects' scores on each half of the test. If the scale is a reliable measure, there should be a very strong positive correlation between scores on each half. Thus, individuals who score low on one half should score low on the other half, and those who score high on one half should score high on the other half.
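Here is a minimal sketch of the split-half idea in Python. The response data are made up for illustration (5 subjects and 6 items rather than the 22 items of the BRISS), the odd/even split is just one of several reasonable ways to halve a scale, and the function names are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_r(responses):
    """Correlate each subject's total on items 1, 3, 5, ... with the total on items 2, 4, 6, ...

    `responses` holds one list of item scores per subject.
    """
    odd_totals = [sum(items[0::2]) for items in responses]
    even_totals = [sum(items[1::2]) for items in responses]
    return pearson_r(odd_totals, even_totals)


# Hypothetical data: one row per subject, one column per item.
hypothetical_responses = [
    [1, 2, 1, 1, 2, 1],
    [2, 2, 3, 2, 2, 3],
    [3, 3, 3, 4, 3, 3],
    [4, 4, 5, 4, 4, 4],
    [5, 5, 4, 5, 5, 5],
]

split_half = split_half_r(hypothetical_responses)
print(round(split_half, 2))
```

With data like these, subjects who score low on one half also score low on the other, so the correlation between halves comes out strongly positive.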

Test-retest reliability refers to obtaining the same results each time a type of measurement is taken on the same people. Let's say you are interested in measuring intelligence and you expect intelligence to be stable over time. If the test is reliable, your subjects should not have markedly different IQ scores when they take the test a second, third, or fourth time. If you were measuring a more changeable quality, like state anxiety, you would expect the scores to change depending on the circumstances. If state anxiety scores did not change when the circumstances actually were more or less anxiety provoking, you'd have a stable measure with high test-retest reliability, but it wouldn't be sensitive enough to be a good (valid) measure of fluctuations in state anxiety. This is a problem similar to the one we ran into with the thermometer.

Observer ratings consist of having judges rate aspects of someone else's behavior. When there is high agreement among different raters, that is, when their ratings positively correlate, the measure has high inter-rater reliability.

Validity

Reliability is only one aspect of sound measurement, and it is often the easiest to achieve. It's important for you to remember that a measure can be very reliable (you get consistent responses) but totally meaningless. Validity is the key to meaningful measurement. Thus, validity is the key to well-conducted research.

Given all that you have studied in class, you know that validity refers to the truthfulness, accuracy, and representativeness of a measure. In a measurement device (e.g., IQ test, GSR, BRISS) it is the degree to which the device measures what it is supposed to measure. There are numerous types of validity, some of which are closely related, and I'll touch on only a few. When assessing the meaning of research, keep in mind that if the design contains important lapses in validity, any definitive conclusions based on the research are suspect. So, believe the maxim "GARBAGE IN = GARBAGE OUT."

Face validity is when a measure appears to be a reasonable index of the construct of interest. As a representation of ANS arousal, heart rate and GSR have more face validity than creativity does. Construct validity is the degree to which a test, measure, or variable measures the theoretical concept it claims to measure. External validity is the extent to which research findings can be generalized to other settings and populations.

Criterion validity refers to the correspondence between the variable of interest and an independent and supposedly objective measure of that variable. Concurrent validity, a type of criterion validity, refers to correspondence between variables when they are measured at the same time. Predictive validity, also a form of criterion validity, is the extent to which the measure of interest predicts future performance or responses. Convergent validity and discriminant validity are forms of criterion validity that can be tested concurrently or at different times.

 

Convergent validity refers to the degree to which measures of conceptually related constructs correlate. Devices that measure similar constructs, though measuring something a little different, should be related, and the size and direction of the correlation should reflect the degree and type of conceptual similarity.

To test whether the BRISS is a good measure of stress, you could correlate it with conceptually related measures. Good measures of anxiety and agitation should all be positively correlated with the BRISS. So, the BRISS should be highly positively correlated (r > .70) with very similar scales, like established measures of stress. More moderate positive correlations (.30 ≤  r ≤ .60) would be expected with more distantly related constructs like helplessness and worry.

Because convergent validity refers to the relatedness of concepts, concepts that reflect opposite qualities should be negatively correlated. Thus, measures of qualities that are opposites of stress such as calmness, control, and relaxation should be negatively correlated with the BRISS. The size of the negative correlation will depend on how antithetical the concepts are. Depending on the proposed conceptual relationship, a moderate (-.60 ≤ r ≤ -.30) to high negative correlation (r ≤ -.70) between related, yet opposite constructs reflects good convergent validity.

Note that measures of related concepts, whether reflecting similar or opposite qualities, won't and shouldn't be perfectly correlated with the BRISS because they aren't testing the exact same concepts.

Not only is it important that a measurement device measures what it intends to measure, but it also is important that the device doesn't measure qualities it isn't supposed to measure. Basically, measures of unrelated concepts should not correlate. This sort of validity is called discriminant validity.

In your quest to develop a good measure of stress, you want to make sure that you aren't measuring assertiveness, friendliness, stinginess, flatulence, or other constructs that are conceptually unrelated to stress. You check the discriminant validity of the BRISS by making sure it doesn't correlate with valid measures of unrelated concepts. You will have good discriminant validity if the correlation between two conceptually unrelated measures approaches zero, r ≈ 0.00.
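The rough cutoffs for convergent and discriminant validity can be collected into one small function. This is only a sketch: the cutoffs for "high" and "moderate" come from the ranges given above, but the .10 cutoff for "approaches zero" is my own assumption, and the correlations with the BRISS are invented for illustration.

```python
def describe_correlation(r):
    """Label a correlation using the rough ranges from the text."""
    size = abs(r)
    direction = "positive" if r > 0 else "negative"
    if size >= 0.70:
        return f"high {direction}"
    if 0.30 <= size <= 0.60:
        return f"moderate {direction}"
    if size <= 0.10:  # assumed cutoff for "approaches zero"
        return "near zero"
    return "between the named ranges"


# Hypothetical correlations between the BRISS and other measures.
hypothetical_correlations = {
    "established stress scale": 0.78,  # very similar construct
    "worry": 0.45,                     # more distantly related construct
    "calmness": -0.72,                 # opposite construct
    "friendliness": 0.03,              # conceptually unrelated construct
}

labels = {name: describe_correlation(r) for name, r in hypothetical_correlations.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

In this made-up pattern, the first three rows would count as evidence of convergent validity and the last row as evidence of discriminant validity.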

Validity can be enhanced by (a) using established and tested measures that are reliable and sensitive, (b) decreasing biases, (c) establishing experimental realism, (d) establishing good control, (e) disguising measures, (f) striking the appropriate balance between impact and control, and (g) eliminating or ruling-out confounds. I'll discuss validity in terms of experimental designs later in this guide. Stay tuned.

 

Research Design and Model Building

Correlational Research

In class you learned about correlation, and you just read about some of the ways correlational methods are used to address validity and reliability issues. This section should help you learn how to depict and describe correlational findings.

Simple Correlations. Correlational methods are used to uncover systematic functional relationships between variables. Simple correlations measure whether changes in one variable are associated with changes in another variable. Specifically, a correlation measures the degree of consistency in the relationship between two variables, X and Y, where for each subject, each particular value of Xi is paired with a particular value of Yi. For example: The subject's performance on a task and the amount of time the subject spends imagining succeeding at the task; the subject's college GPA and the number of parties the subject attends; the subject's height and the subject's SAT scores. These sorts of relationships are represented by the correlation coefficient, r, which ranges in value from -1.0 to +1.0. Correlations also can be represented graphically.

A positive correlation indicates that the scores on each variable tend to move up together and down together. As one set of scores increases, so does the second set (the more of this variable, the more of that variable). Likewise, as one set of scores decreases, so does the second set (the less of this variable, the less of that variable). For example, the less time one spends imagining succeeding at the task, the lower one's task performance. This kind of correlation is depicted here:

 

 

When two variables are negatively correlated they move in opposite directions: as one set of scores increases, the other set decreases (the more of this variable, the less of that variable). For example, the more parties students attend, the lower their GPA. This sort of correlation is depicted here:

 

 

 

A correlation coefficient can also tell you when there is no systematic relationship between two variables. When the correlation coefficient equals or approaches zero (respectively, r = 0.0 or r ≈ 0.00), the two variables are not related. Knowing the value of one variable tells us nothing about (will not predict) the value of the other variable. This kind of lack of association is shown here:
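To make the three patterns concrete, here is a minimal sketch that computes Pearson's r for each of the examples above. All of the numbers are invented for illustration, and the variable names are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation for paired scores (X_i, Y_i)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired scores, one pair per subject.
minutes_imagining = [0, 5, 10, 15, 20]
task_performance = [50, 55, 65, 70, 80]

parties_attended = [0, 2, 4, 6, 8]
gpa = [3.9, 3.6, 3.2, 3.0, 2.5]

height_inches = [60, 62, 64, 66, 68, 70]
sat_score = [1100, 1300, 1000, 1000, 1300, 1100]

r_positive = pearson_r(minutes_imagining, task_performance)  # scores rise together
r_negative = pearson_r(parties_attended, gpa)                # one rises, the other falls
r_none = pearson_r(height_inches, sat_score)                 # no systematic relationship

print(round(r_positive, 2), round(r_negative, 2), round(r_none, 2))
```

The sign of r gives the direction of the relationship, and a value at or near zero means that knowing one score tells you essentially nothing about the other.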

 

 

Always keep in mind that correlational methods have limitations. One important limitation of correlational designs is that they never establish causal relationships. On pages 18-23 I'll go through an extended example that illustrates this fact. But for now, I want to address what the findings from correlational studies can tell us.

Correlational methods can address other goals of psychological research; namely, description and prediction. But doing so requires more than just seeing if two variables correlate. One must look at the significance, direction, and strength of the relationship, as well as rule out or consider the role of other variables. Basically, one needs to determine if the relationship is meaningful.

Of course, the first step in assessing the meaningfulness of a significant correlation is to determine whether the measures and procedures are valid. If they are not, the findings can’t help you understand the nature of the relationship between the variables. If the design is valid, consider the size and representativeness of the sample. Next, note the direction of the correlation, positive or negative. Then, look at the size of the correlation by calculating its absolute value. Is it relatively big (e.g., > .60) or small (e.g., < .20)? Once the validity of the measures, design, and basic correlation, r, is established, the strength of the relationship determines the practical importance of the correlation. The strength of the correlation is represented by the coefficient of determination, r². It can take on values ranging from 0.0 to +1.0. The stronger the association between the two variables, the closer r² gets to +1.0, and the more accurately one variable predicts the other. The coefficient of determination, r², measures the amount of variance in one variable that is explained by the other. That is why it is referred to as a measure of "variance explained." The more variance explained, the better one variable predicts another. The coefficient of determination, r², is the result to look for when you are reading about correlations! Think of it as the currency of meaning for correlations (coefficient of determination = C.o.D. = cash on delivery). Considering all of these features should help you determine the meaningfulness of a statistically significant correlation and help you understand and describe the relationship it measures.

When thinking about the meaning of a correlation, the value of r² is more helpful than the value of p alone. This is because with correlations the critical value of r is sensitive to the sample size, the number of subjects in a study. So, when n is large (n > 100), a relatively small correlation can reach significance, but, in more practical terms, it may not be a very meaningful relationship because the variables don't predict each other very well. For example, when n = 102, if r = .20, then p < .05. So, the finding is significant, it is unlikely to have occurred by chance alone, and reliable, but is it meaningful? With these results r² = .04, a weak relationship. In fact, 96% of the variance is left unexplained! So, the variables don’t predict each other very well at all. They really don’t say much about each other; that is, knowing the value of one won't tell you much about the value of the other variable.

Here’s a more jarring example, the sort you might find in some kinds of medical or sociological research because they like to use big n’s (You can guess what Freud would say about that). If n = 1000, and r = .08, then p < .05. With these results r² = .0064. So, the correlation is significant and reliable, but the variables aren't very good predictors of each other. In fact, 99.36% of the variance is left unexplained.
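The arithmetic behind these two examples is short enough to check directly. The sketch below squares each r and reports the percentage of variance left unexplained; the function name is my own.

```python
def variance_explained(r):
    """Coefficient of determination: the proportion of variance explained."""
    return r ** 2


for r in (0.20, 0.08):
    r_squared = variance_explained(r)
    unexplained_pct = 100 * (1 - r_squared)
    print(f"r = {r:.2f}: r^2 = {r_squared:.4f}, unexplained = {unexplained_pct:.2f}%")

# r = 0.20: r^2 = 0.0400, unexplained = 96.00%
# r = 0.08: r^2 = 0.0064, unexplained = 99.36%
```

Both correlations are statistically significant at their sample sizes, yet each leaves nearly all of the variance unexplained.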

Yield for, but don't stop at, variance explained because you can get into trouble if you solely rely on the coefficient of determination without considering sample size. For example, let's say your measures correlate nearly perfectly, r = .96, which would indicate a good deal of variance explained, ~ 92%. Sounds pretty good, right? Not necessarily. Even though the correlation is significant and the variance explained quite high, the sample is frighteningly small. If you only surveyed 4 people you probably wouldn't be comfortable making strong conclusions, or possibly any at all, because so few people were sampled (n = 4). The best approach is to understand and consider all three measures (p, r², & n) when evaluating correlational findings (r). As always, you’ll need to keep the validity of the measures and design in mind.

Table 1 depicts some of the critical values for determining the two-tailed significance (at p < .05) of a Pearson correlation (r) and presents their corresponding coefficients of determination (r²).

 

Table 1

df = n-2      r        r²

     1      .997     .9940
     2      .950     .9025
     3      .878     .7709
     4      .811     .6577
     5      .754     .5685
     6      .707     .4998
     7      .666     .4436
     8      .632     .3994
     9      .602     .3624
    10      .576     .3318
    20      .423     .1789
    21      .413     .1706
    22      .404     .1632
    23      .396     .1568
    24      .388     .1505
    25      .381     .1452
    30      .349     .1218
    40      .304     .0924
   100      .195     .0380
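One way to put p, r², and n together is to look up the critical value in Table 1 and report everything side by side. The sketch below does this for two of the examples discussed earlier (r = .20 with n = 102, and r = .96 with n = 4). The dictionary is copied from Table 1; the function name is my own, and treating a correlation that exactly equals the critical value as significant is an assumption.

```python
# Two-tailed critical values of r at p < .05, copied from Table 1 (df = n - 2).
CRITICAL_R = {
    1: .997, 2: .950, 3: .878, 4: .811, 5: .754,
    6: .707, 7: .666, 8: .632, 9: .602, 10: .576,
    20: .423, 21: .413, 22: .404, 23: .396, 24: .388,
    25: .381, 30: .349, 40: .304, 100: .195,
}


def summarize(r, n):
    """Report df, the critical r, significance, and variance explained."""
    df = n - 2
    if df not in CRITICAL_R:
        raise ValueError(f"Table 1 has no entry for df = {df}")
    critical = CRITICAL_R[df]
    return {
        "n": n,
        "df": df,
        "critical_r": critical,
        "significant": abs(r) >= critical,  # assumption: equality counts as significant
        "r_squared": round(r ** 2, 4),
    }


large_sample = summarize(0.20, 102)
tiny_sample = summarize(0.96, 4)
print(large_sample)
print(tiny_sample)
```

Both results pass the significance test, but for opposite reasons: the first has a large n and very little variance explained, and the second has lots of variance explained and only four subjects.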

Partial Correlation. Let's say you are interested in knowing the relationship between the number of negative events experienced and level of negative mood. You certainly could measure these two variables and examine their simple correlation. However, let’s say you are confident that social support is related to both variables, and you want to know the relationship only between negative events and negative mood. To do so you need to remove the common variance in the relationship between mood and events that is explained by social support. Partial correlation can be used to calculate the relationship between mood and events with the value of social support held constant or "controlled." Table 2 presents the simple correlations.

Table 2

 

 

The partial correlation between negative events and mood with social support held constant is .47. Comparing the original correlation between negative events and negative mood (.60) with the partial correlation (.47) suggests that the simple correlation between events and mood is inflated because both variables share variance with social support. Unless the shared variance is considered statistically, the correlation between negative events and negative mood gives an inaccurate representation of the relationship between the variables.
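For readers who want to see the calculation, here is a sketch of the standard first-order partial correlation formula. Only the .60 correlation between negative events and negative mood comes from the text. The two correlations with social support are placeholder values I chose for illustration, not the values from Table 2, and the function name is my own.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y with Z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


r_events_mood = 0.60      # from the text
r_events_support = -0.50  # placeholder value, assumed for illustration
r_mood_support = -0.50    # placeholder value, assumed for illustration

r_partial = partial_r(r_events_mood, r_events_support, r_mood_support)
print(round(r_partial, 2))
```

With these placeholder values the partial correlation is smaller than the simple correlation, which is the pattern described above: once the variance shared with social support is removed, the relationship between events and mood shrinks.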

Using partial correlation does not ensure that the relationship represented is completely accurate. In this example, it does tell us the relationship between negative events and negative mood when the value of social support is held constant, but it doesn't consider other important variables that may share variance with events and mood. Depending on how much shared variance they explain, other variables could be inflating or deflating the apparent relationship between mood and events, even though social support is held constant. Can you think of other potentially important variables? What about variables like extroversion, self-esteem, optimism, coping styles, or negative affectivity? One way to test how well these variables predict negative mood would be to use multiple regression.

Multiple regression. Multiple regression establishes the functional relationship among variables in the same way correlation and simple regression do, but it considers numerous (multiple) predictor variables (PVs). For example, multiple regression could be used to predict performance on a final exam (the response variable, RV) based on level of self-efficacy, number of lectures attended, amount of time spent reviewing the course material, and midterm exam score. Regression will provide a summary of the relationship between the RV (final exam performance) and the PVs (self-efficacy, number of lectures attended, amount of time spent reviewing the course material, and midterm exam score), and allow you to predict the value of the RV based on different values of the PVs. Multiple regression will tell you the amount of variance in the response variable that is explained by the predictor variables. Based on the model you choose and type of regression you use (e.g., hierarchical, simultaneous), you can see which predictor variable or combination of predictor variables best predicts final exam score.

We won't spend a great deal of time on regression, as most of the research you will read tends to use other statistical techniques. What you need to understand for this class is that regression tests the relationships among variables and suggests the best model (formally, the best fitting line) that describes the relationships among the variables of interest. As with correlations, the p-value will tell you whether a variable, set of variables, or model is a significant predictor of the response variable, but R² will indicate how much variance is explained (R and R² are akin to their lowercase correlation counterparts r & r²). Relatedly, the same issues arise concerning variance explained (R²), significance (p), and sample size (n) when trying to evaluate and understand the meaning of regression findings. In addition, the researcher usually determines which models to test, hence the models are only as good as the researcher's ideas. Moreover, the models are only as good as the variables in them. So, if the variables aren’t valid or important variables are overlooked, the meaning of the results will be limited. Again, the statistical results may be significant (p < .05) but because of lapses in validity they may not tell us anything useful.
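If you are curious what "variance explained" looks like in a regression, here is a bare-bones sketch that fits a least-squares model with two predictor variables and reports R². It uses plain Python rather than a statistics package, the data are invented (lectures attended and midterm score predicting final exam score), and all function names are my own. It illustrates the idea only; it does not compute p-values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit(predictors, response):
    """Least-squares fit. Returns (coefficients, R^2); coefficients[0] is the intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    coefs = solve(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(response) / len(response)
    ss_total = sum((y - mean_y) ** 2 for y in response)
    ss_residual = sum((y - p) ** 2 for y, p in zip(response, predicted))
    return coefs, 1 - ss_residual / ss_total


# Hypothetical data: (lectures attended, midterm score) for six students.
hypothetical_pvs = [(10, 60), (12, 75), (15, 65), (18, 80), (20, 70), (22, 90)]
hypothetical_final = [62, 74, 70, 83, 78, 92]

coefficients, r_squared = fit(hypothetical_pvs, hypothetical_final)
print([round(c, 2) for c in coefficients], round(r_squared, 3))
```

R² here plays the same role r² plays for a simple correlation: it is the proportion of variance in the response variable that the predictor variables explain together.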

 

Experimental Designs: Conducting Righteous Research

The best way to establish causal relationships is to conduct righteous experiments. In a true experiment the investigator systematically varies a factor, or factors, holds all other factors constant, and measures systematic variation in the outcome of interest. The essential features of a true experiment are: manipulation of an independent variable, control, measurement of a dependent variable, and random assignment. However, righteous research requires much more and rests on validity (truth). The basic requirements of a sound experiment are (VICAR):

    1. Validity

    2. Independent Variable (with a DV)

    3. Control

    4. Assignment: Random assignment

    5. Realism

    I placed the requirements in this order so you can easily remember each feature. Pardon my priestly pun, but I hope it will help you remember that good research, righteous research, rests on the features represented by the acronym "VICAR." I already have gone over some issues related to validity, but I will mention a few more here. I will discuss the features in an order that makes sense given the entirety of the information I want to convey.

     

    The independent variable (IV) is the factor that is manipulated by the investigator. It is "independent" because the investigator defines it before the experiment begins and dictates its form and amount. Independent variables must have at least two levels. So, an experiment compares at least two situations, conditions, or groups. For example, let's say an experimenter believes caffeine causes better memory performance (the hypothesis). All subjects drink one cup of coffee before completing a memory test. Members of "Group A" drink a cup of decaffeinated coffee (decaff condition) and members of "Group B" drink a cup of regular coffee (caffeinated condition). Thus, the independent variable (type of coffee consumed) has 2 levels: decaffeinated versus caffeinated. The "caffeinated condition" is the experimental condition, the group that is given the treatment being tested. The "decaff" condition is the control group; it does not receive the treatment and provides a comparison. This specific type of control group is called a placebo group or placebo control. The placebo group thinks it is getting the treatment (caffeinated cup o' Joe) but it does not. Its purpose is to be certain that treatment effects are due to changes in the IV instead of subjects' expectations.
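As a small illustration of how the two levels of the IV and random assignment fit together, here is a sketch that randomly assigns subjects to the decaffeinated and caffeinated conditions. The subject IDs, the group size of 20, and the function name are all assumptions made for the example.

```python
import random


def randomly_assign(subject_ids, conditions=("decaffeinated", "caffeinated"), seed=None):
    """Shuffle the subjects, then deal them out to the conditions in turn."""
    rng = random.Random(seed)  # a seed makes the shuffle repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for position, subject in enumerate(shuffled):
        condition = conditions[position % len(conditions)]
        groups[condition].append(subject)
    return groups


# Hypothetical sample: 20 subjects numbered 1 through 20.
groups = randomly_assign(range(1, 21), seed=42)
for condition, members in groups.items():
    print(condition, sorted(members))
```

Because chance alone decides who drinks which coffee, pre-existing differences among subjects should be spread roughly evenly across the two conditions.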

    A dependent variable (DV) measures the proposed outcome or effect of the independent variable. It is the response whose form or amount is expected to vary with changes in the independent variable. If the suspected causal relationship exists, its value will "depend on" or be caused by levels of the independent variable. In the coffee example, memory performance, operationalized as the number of words recalled, is the dependent variable. A DV is necessary to measure the proposed effects of an IV. Thus, a true experiment has a DV, and true DVs only occur in true experiments.

    An experiment is valid if it tests what it claims to test. So, in an experiment, validity refers to how well the IVs, DVs, and experimental procedures represent what they are supposed to represent. In an experiment, internal validity is the degree to which changes in the DV can be attributed to the manipulation of the independent variable.

    Control is essential in an experiment, and experimental results are interpretable only if the independent variable is the only important factor that distinguishes the groups. All other potentially influential factors must be held constant if causal conclusions are to be made. Control refers to intentionally manipulating factors that produce a phenomenon and holding other factors constant.

    Control can be taken too far. In class, we discussed inducing anxiety. We wanted to induce anxiety and no other mood state. Isolating anxiety in this way would represent good control. But, we also talked about what would happen to the validity of our induction if we isolated anxiety so much that we failed to induce worry. Worry and anxiety often co-occur in the real-world. Moreover, some theorists believe that worry is a defining feature of anxiety. If we are interested in anxiety that contains worry, controlling anxiety to the degree of failing to induce worry would mean we went too far. We failed to balance validity and control. If we did want to look at anxiety without worry, as some researchers might, we've done a good job. Yippee!

    Control easily can be lost if the researcher is not careful. One way control can be lost is when biases enter into an experiment. Any factor that unintentionally and systematically influences important factors in an experiment is a potential bias. Biases cloud the true relationships among variables. Experimenter effects and subject biases are types of bias that occur when the expectations of the experimenter or subject unintentionally influence the outcome of a research project. Demand characteristics are cues in an experiment or study that influence subjects' behavior so the IV is not the only factor exerting an effect on the DV. Demand characteristics often lead subjects to react as they believe the researcher wants them to react. "Nasty" subjects can sabotage an experiment if demand characteristics provide cues to what is expected and they decide to act in a different way. "Kind" subjects can cause problems by trying to help the experimenter prove the hypotheses.

    Potential biases can be reduced in a number of ways. Standardized procedures help avoid bias by keeping the important elements constant. Double-blind procedures and replication also minimize bias. An experiment follows double-blind procedures if the subjects and the experimenter do not know which subjects are in which groups. Also, the subjects are not told about the expected results before the experiment. In highly controlled studies, the person running subjects in the experiment doesn't even know the hypotheses.

    An experiment or study is replicated when the research is re-conducted using the same procedures and approximately the same results are obtained. Similar, but slightly different, procedures can be used to see if the findings generalize. This sort of replication can provide a test of convergent and external validity by demonstrating that the findings don't depend on a limited set of circumstances. It is particularly important for different experimenters to obtain basically the same results when they re-conduct the research because it is then less likely that experimenter bias is responsible for the results. Placebo groups may decrease biases by permitting the researcher to test whether subjects' expectations about the condition cause changes in the DV.

    In our java experiment, control would be undermined if subjects in one of the conditions differ from subjects in the other on a factor that could systematically bias the results. That is, Group A and Group B systematically differ on a variable other than the IV. Let's say the experimenter decided to assign the decaff condition to all subjects who signed up for morning sessions of the experiment. It could be that subjects who came in the morning had fewer hours of sleep the night before, which could possibly affect their memory performance. If subjects in the decaff condition had slept fewer hours the night before than did subjects in the caffeine condition, the differences in amount of sleep could account for any group differences in memory performance. Amount of sleep is a confounding variable. A confounding variable is a factor that unintentionally and systematically varies with the levels of the independent variable. Because it varies with the IV, there is no way to tell whether the IV or the confounding variable systematically influenced the dependent variable. Confounds may represent fatal flaws in control that undermine validity. Confounding variables are potential alternative explanations for the results, thus they threaten the validity of an experiment.

    One way to eliminate this sort of confound is to use random assignment. Random assignment means that every subject has an equal chance of being placed in any particular group in the experiment. The goal of random assignment is to establish groups that are equivalent in all important ways except for levels of the independent variable by randomly distributing differences in subject variables across conditions. This procedure increases experimental control and is intended to ensure that extraneous variables, known and unknown, will not systematically bias the results. In the coffee example, if all subjects are equally likely to be placed in either condition, it is unlikely that hours of sleep, or other confounding variables related to characteristics of the subjects, would systematically differ between groups.
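
    Here is a minimal sketch of random assignment for the coffee example, written in Python. The subject pool, the hours-of-sleep values, and the function name randomly_assign are all made up for illustration; they are not from any real study. The idea is simply to shuffle the subjects and deal them out to conditions, then check that a subject variable such as sleep ends up roughly balanced.

```python
import random
from statistics import mean


def randomly_assign(subject_ids, conditions, seed=None):
    """Shuffle subjects, then deal them out to the conditions in turn.

    Every subject is equally likely to land in any condition, and the
    group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for position, subject in enumerate(shuffled):
        groups[conditions[position % len(conditions)]].append(subject)
    return groups


# Hypothetical pool: 200 subjects with simulated hours of sleep.
sleep_rng = random.Random(1)
hours_of_sleep = {subject: sleep_rng.gauss(7.0, 1.0) for subject in range(200)}

groups = randomly_assign(hours_of_sleep, ["decaf", "caffeinated"], seed=2)

# Average sleep per condition; with random assignment these should be close.
mean_sleep = {
    condition: mean(hours_of_sleep[s] for s in members)
    for condition, members in groups.items()
}
print(mean_sleep)
```

    Contrast this with assigning all morning sign-ups to decaff: there, the assignment rule itself would be tied to sleep, and no amount of checking afterward would untangle the two.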

    Random assignment doesn't get rid of all confounds, just systematic biases related to the qualities subjects bring with them. Steps must be taken to ensure that confounds don't enter in other ways, for example through how the variables are operationalized. Let's go back to the coffee example. We have an independent variable, type of coffee consumed, with two levels, caffeinated or decaffeinated. We randomly assign subjects to one condition or the other and measure how many words participants recall (DV) from the word list we present to them. Lo and behold, those in the caffeinated condition recall more words. But what if the conditions varied in other ways? Ways related to levels of the IV? "Houston, we have a problem."

    Believe it or not, non-coffee-drinkers among you, caffeinated coffee and decaffeinated coffee don't always taste the same. If they did, decaffeinated coffee makers would not have to advertise about how good their coffee tastes.

    Let's say our caffeinated coffee does taste better. What would it mean if tastiness (good versus bad) varies directly with the levels of the independent variable (caffeinated versus decaffeinated)? If the conditions produce different levels of memory performance, we would have a difficult time knowing whether the amount of caffeine or the level of tastiness produced the differences in performance.

    It may not even be the tastiness but some factor related to it, for example, mood. How about this alternative explanation: The decaffeinated coffee is terrible, making subjects feel disgusted or somewhat displeased, and the caffeinated coffee is spectacular, inducing the warm glow of happiness? Performance could differ because taste influenced mood and mood influenced how well people did. Memory was better when subjects were happy than when they were disgusted by the thick, decaffeinated swill. Thus, taste and mood could be confounding variables.

    We can rebuild it, "we have the technology." To control for the taste confound, we would try to make the types of coffee taste the same or administer the caffeine in a different form (e.g., water Joe, a new product, really, it's caffeinated water). Then, we'd make sure ahead of time that judges rated the different drinks as tasting the same.

    Why might we want to measure mood rather than control for it? Think about it. Ask your peers. Ask me. Hmm....could it be that's how coffee works? Not just through arousal, but through mood too? Of course we could test this idea too, and we should if we think it's a plausible explanation.

    Keep in mind that confounding also can occur in non-experimental designs. In these cases, the levels of the confounding variable vary directly with the levels of a predictor variable.

    To summarize the issues relevant to experimental control: Control refers to intentionally manipulating factors that produce a phenomenon and holding other factors constant. Thus, control refers to (a) manipulating the independent variable, (b) keeping important factors other than the independent variable constant, (c) eliminating important extraneous factors (e.g., confounds, biases), (d) having a comparison or control group, and (e) using random assignment. Causal conclusions may be drawn only when control and validity are maintained.

    Realism is a necessary condition for validity and reliability. There are two types of realism: experimental realism and mundane realism. As you might guess from their names, experimental realism is considered more central to validity. In fact, without experimental realism, an experiment is not valid.

    Experimental realism is the degree to which research procedures absorb and involve the subjects. For subjects' responses to be true indicators of the behaviors and psychological processes of interest, subjects must be caught up and invested in what's going on in the experiment. Only then are subjects behaving in a natural way. For this to happen, the experimental set-up must have impact, it must be meaningful, credible, and involving to the subjects. All types of designs should have high experimental realism for the responses and results to be valid and reliable.

    Mundane realism is the degree to which research procedures, including variable operationalization, appear to be similar to the everyday situations and behaviors of interest. Mundane realism is less important for most types of validity. Many students wrongly think that mundane realism is central to validity, especially external validity. But, if subjects are not acting naturally (low experimental realism), their responses are not true, thus it does not matter how well the set up of the experiment resembles the real world. Obviously both factors are important, but validity hinges on experimental realism. Remember, the key is for the variables and experimental procedures to be functionally the same as the factors you want to measure and manipulate. Just because the set-up mimics the real world doesn’t mean the variables are functioning the way they do in the real world. Experimental realism along with well-operationalized variables are more likely to produce functionally representative variables than mundane realism is.

    The validity of experiments can be enhanced by (a) using random assignment, (b) using measures that are established, reliable, and sensitive, (c) decreasing biases, (d) establishing high experimental realism, (e) establishing good control, (f) disguising measures, (g) balancing impact and control, and (h) eliminating confounds.

     

    Keep in mind that the same basic principles apply to all types of sound research, with the exception that non-experimental research has an RV (response variable) and not a DV because at least some of the predictor variables are measured and not manipulated (i.e., they are not IVs).

     

    Natural Groups Designs

    Although an independent variable is a necessary component of a true experiment and required for establishing causality, it is not always possible to manipulate all variables of interest. Many factors can not or should not be manipulated. For example, a psychologist may be interested in comparing the size of the ventricles among older adults (older than 55 years). Certainly, one could not make some persons old and others very old. Under these sorts of conditions, the best a researcher can do is compare groups that naturally vary on this quality (see figure 2). For example, one could compare adults who are young-old (ages 50-60), middle-old (ages 61-80), and old-old (older than 80).

     

    Figure 2.

    The qualities subjects bring with them (e.g., age, gender, ethnicity, mental health status, personality traits, and other characteristics) that are measured and used to predict other variables are termed subject variables, natural groups variables, grouping variables, or predictor variables. I tend to refer to them as subject variables (SVs) or predictor variables (PVs). Causal conclusions are undermined when a design contains even one subject variable.

    Dividing subjects into groups according to qualities they bring with them and studying the systematic relationships between SVs and other variables creates a natural groups design. Levels of the subject variables (SVs) are not manipulated, but are selected by the experimenter (e.g., High IQ versus Low IQ). Thus, the response the experimenter is trying to predict is not truly dependent because not all of the variables used to predict it are manipulated. In natural groups designs, changes are observed not in a DV, but in a "response variable" (RV), "outcome variable" (OV), or "criterion variable" (CV). These terms all refer to the same type of variable.

    Please note that the literature gets a little sloppy when referring to variables. Most predictor variables will be called "IVs" and most response variables will be called "DVs." To understand the research you'll need to look at the design and assess the type of design. Don’t let the labels fool you into believing the design supports causal conclusions.

    Figure 3 presents the hypothetical relationship between schizophrenia status (SV), drinking history (SV), and ventricle size (RV), a 2 X 2 natural groups design. Notice that the pattern of results resembles those depicted in figure 2. Their similarities illustrate one of the drawbacks of natural groups designs. Because the SV is not manipulated there is very little control and other variables, possibly explanatory ones, covary with levels of the SV. The variables that covary could be responsible for the relationship that appears to exist among the other variables. Thus, natural groups designs, because of the third variable and directionality problems stemming from the lack of experimental control, can not establish causation. They do show predictive relationships among variables, but the predictive relationship may obscure the true causal variable(s).

     

    Figure 3.

    An easy way to understand the drawbacks of natural groups designs is to think of the designs as correlational. As you know, conclusions about causation are unwarranted when variables are not manipulated or control is undermined. If a variable is not manipulated, it is not an independent variable. Measuring variables rather than manipulating them undermines control and makes confounding likely, thus validity is threatened. For example, variables that covary with the subject variables actually may be the variables producing changes in the RV. So, the breakdown of control means that variables other than the SV may actually be causing changes in the RV. One still can describe the relationships, as one does with correlations, but they are predictive not causal.

    What follows are detailed examples illustrating these drawbacks and introducing ways of ruling out alternative explanations. You may recall some of these types of examples from class. I hope you do. We went over them in some detail.

     

    Let's say an investigator, George, is interested in the possible relationship between self-esteem (SE) and grade point average (GPA). Because George believes self-esteem is a core feature of people that is not easy to manipulate, he chooses to measure self-esteem (X) and selects the levels of interest. In this case, he is interested in comparing those who rank in the top 20% on self-esteem (High SE) with those who rank in the bottom 20% (Low SE).
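
    To make George's selection step concrete, here is a rough Python sketch of how the High SE and Low SE groups could be picked. The subject labels, the self-esteem scores, and the function name extreme_groups are invented for illustration. Notice that nothing is manipulated here: subjects are only ranked and sorted into groups by a quality they brought with them.

```python
def extreme_groups(scores_by_subject, fraction=0.20):
    """Return (low_group, high_group): the bottom and top `fraction` of
    subjects when ranked on a measured score."""
    ranked = sorted(scores_by_subject, key=scores_by_subject.get)
    k = int(len(ranked) * fraction)
    if k == 0:
        raise ValueError("too few subjects to form extreme groups")
    return ranked[:k], ranked[-k:]


# Hypothetical self-esteem scores for ten subjects.
self_esteem = {
    "s1": 31, "s2": 18, "s3": 25, "s4": 40, "s5": 22,
    "s6": 35, "s7": 28, "s8": 15, "s9": 38, "s10": 27,
}

low_se, high_se = extreme_groups(self_esteem)
print("Low SE:", low_se)
print("High SE:", high_se)
```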

    George calculates subjects' GPA (Y) using their college transcripts and compares the two groups. He finds that subjects with High SE have significantly higher GPAs than do those with Low SE. But problems arise when he tries to draw causal conclusions based on these findings. For example, there may be a problem determining the direction of the relationship. Doing well in school may make students feel better about themselves, or vice versa. This drawback is called the directionality problem:

    Another limitation of these sorts of designs relates to the third variable problem. In this situation, two variables are related not because one of them is influencing the other, but because each is being influenced by or merely is associated with a third variable. In the GPA and self-esteem example, psychologists sometimes find that anxiety (Z) could be responsible for the obtained relationship. Specifically, people with high SE may be less anxious about their ability to do well than those with low SE are, and level of anxiety may influence academic performance and GPA. So, level of anxiety (Z) covaries with SE (X) and GPA (Y), and the variance shared with anxiety may be responsible for the relationship seen between SE (X) and GPA (Y). So, in this example, the causal relationship is:

    Because of the third variable (or Z-variable) problem, you can’t really be sure of the precise causal relationship because other variables may covary along with the ones you have studied, and those variables, or ones associated with them, may be the causal factors.

    One way to address the third variable problem is to measure the third variable (Z) along with X and Y, and see if it explains the relationship. Depending on the design, one could use partial correlation, regression, or ANCOVA to test this idea. If you discover Z does not explain the relationship between X and Y, you can at least rule out that particular alternative explanation; however, it is difficult to rule out all possible explanations. Another variable or set of variables could be responsible for the relationship seen between X and Y. Thus, no causal conclusions can be made.
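
    The partial correlation idea can be sketched in a few lines of Python. Everything below is simulated: I built the fake data so that anxiety (Z) drives both self-esteem (X) and GPA (Y), and X and Y have no link of their own. The function names are mine, and the formula is the standard first-order partial correlation, r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation between X and Y with Z statistically held constant."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated (not real) data: anxiety lowers both self-esteem and GPA.
rng = random.Random(0)
anxiety = [rng.gauss(0, 1) for _ in range(2000)]
self_esteem = [-z + rng.gauss(0, 1) for z in anxiety]
gpa = [-z + rng.gauss(0, 1) for z in anxiety]

raw = pearson_r(self_esteem, gpa)
partial = partial_r(self_esteem, gpa, anxiety)
print(f"SE-GPA correlation: {raw:.2f}")
print(f"SE-GPA correlation controlling for anxiety: {partial:.2f}")
```

    In this made-up world, the raw correlation looks respectable, but it shrinks toward zero once anxiety is partialled out. Remember that this only rules one candidate Z in or out; it says nothing about the Zs you didn't measure.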

    Let's look more closely at this example (see figure 4a) and the possible relationships among the variables being examined. Let's say that research has been conducted that helps clarify the relationships among the variables of interest.

     

    FIGURE 4a.

     

    According to this model, GPA and SE aren't directly related through anxiety as proposed earlier. Sure, if we only looked at the original three variables, a relationship would emerge, but a fourth variable actually could be responsible for the relationship among those variables. According to figure 4a, the explanatory variable is performance, and anxiety isn't directly associated with self-esteem or GPA. Instead, anxiety predicts performance and performance predicts GPA. Moreover, the figure suggests the direction of the relationships (the arrows), and self-esteem appears to be a product of performance and GPA but does not influence those factors.

    Be aware that the figure suggests directionality based on statistical procedures (e.g., path analysis), but neither causality nor directionality can be established because the design is non-experimental.

    This example illustrates that if performance is not examined, the associations among the variables are obscured and we arrive at a false understanding of the relationships among the variables. Sure, we still could predict GPA based on SE or anxiety, but our understanding of how they are related would be inaccurate, and our predictions would be better if we used performance to predict GPA.

    Believe it or not, there could be additional explanatory variables we need to consider if we are going to derive the best predictive model, one that explains the most variance in the response variable and that best represents the associations among the predictor variables.

    Recall from our discussions that self-efficacy is defined as the belief in one's ability to execute a course of action or to perform a specific task (Bandura, 1997). People who are highly efficacious in regard to specific academic endeavors are likely to engage in behaviors that make them more successful (e.g., studying, not worrying, preparing, not procrastinating). So, high efficacy and the behaviors it engenders may lead to better performance. Our study and its findings won't tell us the causes, unless we conduct a true experiment, but we can see and model the associations (see figure 4b).

     

    Figure 4b.

     

     

    What do all of these models tell us? Well, at first, when measuring only GPA and self-esteem, we may have made the mistake of thinking GPA boosted self-esteem or vice versa. A closer examination of the important variables suggested that at least one causal explanation is unlikely, that self-esteem boosts GPA (notice the direction of the arrow between the two variables). Next, we moved toward examining anxiety and its possible role. The correlations suggested that anxiety, not self-esteem, was a better predictor of GPA, and anxiety was the reason that self-esteem and GPA seemed to be related. But again, the picture and our understanding remained incomplete and inaccurate. Once performance is considered, the direct association between anxiety and GPA disappears and, instead, anxiety is associated with GPA and esteem because anxiety is associated with performance. Enter efficacy into the equation and a new picture emerges, one that seems to better represent the associations among the variables. The associations can be specified further by the amount of variance explained in one variable by another, as well as by the amount of variance explained by the entire model. Let's examine an actual example based on math anxiety research.

     

    Hackett (1985) investigated how math-related educational choices, and thus career choices, may develop. Figure 5 presents some of the findings that emerged in that study:

     

    Figure 5.

    Wow. At first glance figure 5 appears confusing, but if you describe the associations in a step-by-step manner, the entire model will make sense. Take note of the "weights," or the values, of the correlations between variables. Also look at the proposed direction of the associations based on the findings of the path analysis.

    The figure shows that years of high school math predict college major selection (A). However, it also shows that math self-efficacy (MSE) may mediate this relationship, and that efficacy is a better predictor of major choice than high school preparation or math anxiety. Notice that years of math (B) and math achievement (E) both feed into MSE. That is, they predict choice of major through their relationship with self-efficacy. There's also a similar relationship between MSE and math anxiety (F). So, efficacy predicts level of anxiety (F) and then anxiety predicts college major choice (G). Notice there are direct links between MSE and major choice (C), as well as between MSE and anxiety (F), but you need to consider what feeds into or predicts MSE and math anxiety to get the full picture.

    Overall, these findings suggest that even after events have occurred, like previous math achievement and number of high school math courses taken, math-related choices are predicted by math efficacy. This suggests that improving math efficacy may open doors to math-related majors and math-related careers, even after a history of poor performance and anxious feelings. Still, the findings don't prove that influencing efficacy changes major choice, but they suggest that this possibility exists. The next step, of course, would be to experimentally test the model. In fact, experiments have shown that improving efficacy can change performance and ultimately change choices like college major (cf., Bandura, 1997).

    Model building and model testing are important parts of theory building, and vice versa. These examples are intended to suggest that without careful thought into the relationships among variables, inaccurate views of behavior will emerge. Here again, we are considering the issue of validity. A model is only as good as the variables considered and omitted.

     

    IVs and SVs Together

    What conclusions can be drawn based on a natural groups design that also contains an independent variable? There is no simple answer to that question. A few examples will help illustrate the relevant issues.

    Let's say you are studying an independent variable (Induced Mood: Neutral or Angry) and a subject variable (Gender: Male or Female) and you want to see how they relate to a response variable (memory performance). If the design is valid and the findings are significant, you are limited in the statements you can make about causality. You can make causal statements about the IV but not about the SV. But be careful how you phrase your conclusions! Don't let unsubstantiated causal conclusions creep into your description.

    Let’s say you found a significant interaction (note: not a true interaction because of the SV). So, what you found was that anger produces better performance (effect for mood) and there is a difference between men and women. Men perform better than women do when angry, and women perform better than men do when neutral.

    Why can't you say that being female or male causes anything? Because gender was not manipulated, so factors that covary with gender could explain or cause the association between gender and performance. Can you think of any of the possible alternative explanations? Could it be that men are more comfortable with anger and that comfort is the key factor? Could it be that women are socialized not to express anger, and their attempts to inhibit anger may increase their cognitive load, arousal, or level of distraction, thus undermining performance? Could response bias or demand be the causal factor? For example, angry women are able to perform just as well as angry men, but they didn’t because they thought that was what was expected. Or, men tried harder because they expected they should when they were angry. Figure 6 depicts the findings.

     

    Figure 6.

     

     

    There are many other plausible, alternative explanations which could explain the findings, and the main point is that the results could be due to any number of factors that covary with the SV or the SV and IV together. Even if you can't think of alternative explanations or confounding variables, they could still be there. So, you can't draw causal conclusions, no matter how plausible they may seem.

    What can you say? You can say, "In men, anger led to better memory performance than neutrality did." You can also say, "In women versus men, neutrality produces better performance." But remember, it's still just an association in reference to the SV and interaction.
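
    If it helps to see the mood-by-gender pattern as numbers, here is a small Python sketch. The recall scores are invented to match the pattern described above (they are not the data behind figure 6), and the function name is mine. It computes the four cell means, the effect of mood within each gender, and the difference between those two effects, which is what the "interaction" refers to.

```python
from statistics import mean

# Hypothetical recall scores for each cell of the 2 x 2 design.
recall = {
    ("men", "neutral"): [9, 10, 11],
    ("men", "angry"): [15, 16, 17],
    ("women", "neutral"): [11, 12, 13],
    ("women", "angry"): [12, 13, 14],
}

cell_means = {cell: mean(scores) for cell, scores in recall.items()}


def simple_effect_of_mood(gender):
    """Angry minus neutral mean recall within one gender group."""
    return cell_means[(gender, "angry")] - cell_means[(gender, "neutral")]


# How much larger the mood effect is for men than for women.
interaction_contrast = simple_effect_of_mood("men") - simple_effect_of_mood("women")

for cell, value in cell_means.items():
    print(cell, value)
print("Mood effect in men:", simple_effect_of_mood("men"))
print("Mood effect in women:", simple_effect_of_mood("women"))
print("Difference between the two mood effects:", interaction_contrast)
```

    The mood effect within each gender comes from a manipulated variable, so it can support a causal statement. The difference between the two mood effects involves gender, so it is an association, however tidy the arithmetic looks.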

    Here are some actual research examples that may help bring home the point that SVs do not afford causal conclusions and that demand or other explanations can explain SV-based differences. Figures 7 and 8 are adapted from Steele's (1997) discussion of how stereotypes may shape intellectual identity and performance. Students in my social psychology classes actually will read the paper for class.

    There is a stereotype that men are better at math than women are. Furthermore, it is often presumed that these supposed differences are biologically based. In fact, research typically shows consistent gender differences in math performance when the material is difficult. Figure 7 depicts the findings of a study in which subjects were told that the math test generally 1) showed gender differences in performance, or 2) did not show gender differences in performance. All subjects actually took the same test. The only thing that varied was what subjects were told about the test (IV). Women did significantly worse than men did when the test was characterized as showing gender differences (condition 1). But, the gender difference in performance disappeared when the test was introduced as gender-neutral (condition 2).

    Here's the catch: Most of the time, because of the pervasiveness of the stereotype, people taking such tests will assume there is a gender difference; consequently, gender differences in performance will emerge. This might tempt people conducting the research or reading about it to wrongly think something biological or inherent about being male or female caused the differences in performance. Only when showing the influence of stereotypes does it become clear that conclusion is wrong. But that conclusion always would be wrong, because it is a causal conclusion based on an SV. Gender didn't cause the differences.

     

     

    Figure 7.

     

     

    As figure 8 shows, similar findings emerge when considering race and racial stereotypes of intellectual ability. In this study, Black and White university students took a test that contained the most difficult items culled from the verbal section of the GRE. The test was presented as a test of intellectual ability (diagnostic condition) or as a laboratory problem-solving task unrelated to ability (non-diagnostic condition). Figure 8 clearly illustrates that for Black participants, performance depended on how the test was construed.

     

    Figure 8

     

    The figure suggests that if only the diagnostic condition were run, which is what basically happens when people take the GRE & SAT, a difference could emerge. An inaccurate SV-based conclusion of performance would be that being Black causes lower performance on standardized tests. Or, being White causes better performance. These conclusions could lead to all sorts of false race-based and racist conclusions. In class I will present these findings in more detail.

    I hope these studies send a message about the reasons why SVs and natural groups designs prevent causal conclusions.

     

    Ethics

    One final note about well-designed research: it must be conducted in an ethical manner. Please refer to your texts and your class notes for an explanation of the principles of ethical research. Remember what we learned from Dawes et al. (1977): "it can still hurt, even when it's benign."

     

    What Does it all Mean? Or, I'm Going Crazy Trying to Integrate all of this Information

     

    The take-home points from the section on design and model building are (a) the variables examined determine, for better and for worse, the models that can be tested and achieved; (b) designs must be guided by sound logic, theory, and prior research to obtain meaningful results; (c) correlational designs can describe and predict relationships, but their limitations can engender misunderstandings; (d) it's difficult to rule out all third variables, alternative explanations, and possible confounds with correlational designs; (e) significance is important but doesn't guarantee meaningful results; (f) replication and extension can help clarify findings and their meaning; (g) without validity, good control, and the right manipulation, research can not determine causality; and (h) if ethics and validity clash, ethics always must win.

     

    Let's Talk About Stats., Baby: A Brief Overview

    Statistical issues were covered in class and in a number of readings. Here I'll review a few key concepts. For more detailed explanations, refer to your course materials or Mr. Lowry’s wonderful site: http://faculty.vassar.edu/~lowry/webtext.html.

    An important idea to keep in mind is that when we do research, we propose research hypotheses (e.g., H1: caffeinated coffee will produce better recall than will decaffeinated coffee), but our statistical procedures work to test the antithesis of our hypotheses (e.g., H0: there is no difference in recall between the conditions). The antithesis is called the "null hypothesis." It’s easy to forget that statistical tests are designed to test the null hypothesis and not the research hypothesis because we don’t spend much time talking about the null hypothesis. When a finding is statistically significant, we reject the null hypothesis (in this example, we reject the idea that there is no difference in recall between the coffee conditions). For a discussion of the null hypothesis, go to this page on my favorite site for statistical guidance: http://faculty.vassar.edu/~lowry/ch7pt1.html.

     

    Measures of Central Tendency

    Measures of central tendency are used to describe a set of data (e.g., Mode, Median, Mean). The mode is the score that occurs most frequently in a set of data. The median is the halfway point in the data. That is, half of the scores fall above the median and half fall below it. If there is an even number of observations, the median falls right between the two middle numbers. The mean is the arithmetic average. To compute the mean (Mx) of a set of scores, sum the scores and divide by the number of scores, thus, Mx = ΣXi / N. Remember, the mean is easily influenced by outliers.
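
    Here is a minimal sketch of these three measures using Python's standard statistics module. The scores are made up for illustration (they don't come from any study), and describe() is just a helper name I chose. The second call adds one extreme score so you can watch the mean jump while the median and mode barely move.

```python
from statistics import mean, median, mode

# Hypothetical recall scores (made up for illustration).
scores = [3, 5, 5, 6, 7, 10]


def describe(data):
    """Return the mode, median, and mean of a list of scores."""
    return {"mode": mode(data), "median": median(data), "mean": mean(data)}


# Central tendency for the original scores.
summary = describe(scores)

# Same scores plus one outlier, to show how strongly the mean reacts.
summary_with_outlier = describe(scores + [41])

print(summary)
print(summary_with_outlier)
```

    With an even number of scores, median() returns the value halfway between the two middle scores, which matches the rule described above.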

     

    Measures of Variability

    Measures of variability describe the spread or dispersion in a data set (e.g., range, standard deviation). The range is the difference between the lowest and highest value in the data set. The standard deviation ("s" or "SD") measures the average difference between each score and the mean. Basically, it tells us how far on average a data point is from the mean. See Mr. Lowry’s guide for a more detailed discussion of variability: http://faculty.vassar.edu/~lowry/ch2pt2.html
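
    The sketch below computes the range and the standard deviation for a made-up set of scores. One assumption to flag: I use the sample formula (dividing by N - 1), which is what statistics.stdev() does; strictly speaking, the SD is the square root of the average squared deviation from the mean, which is why it only roughly equals the "average distance" from the mean.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores (made up for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# Standard deviation by hand: square each score's deviation from the mean,
# sum the squares, divide by N - 1 (sample formula), and take the square root.
m = mean(scores)
squared_deviations = [(x - m) ** 2 for x in scores]
sd_by_hand = sqrt(sum(squared_deviations) / (len(scores) - 1))

# The library function uses the same sample formula.
sd_library = stdev(scores)

print(score_range, round(sd_by_hand, 3), round(sd_library, 3))
```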

     

    The t-test.

    The two-sample t-test compares two sample means to determine whether they differ, and the results it generates allow us to make inferences concerning the population means. The test considers the means, SDs, and number of observations in each group.

    In an experiment, the results of a t-test tell us whether the difference between two sample means reflects the effects of the IV or chance alone. In a study, the test determines if levels of the response variable are associated with levels of the SV or there is no association except what would be expected by chance.

    The result of a t-test is statistically significant if the probability is below .05 (p < .05). When p < .05, the likelihood (probability) is less than 5 in 100 that a difference this large would occur by chance alone if the null hypothesis were true. If so, we reject the null hypothesis that the means do not differ. We consider the means significantly different from each other. Because p > 0 and because of how inferential statistics work, we can’t be absolutely sure that the difference between the means didn’t occur by chance; thus, we don’t ever prove the alternative hypothesis. Instead, we reject the null hypothesis that the difference between the means is zero. For more on t-tests, please see: http://faculty.vassar.edu/~lowry/ch11pt1.html.
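
    To make the logic concrete, here is a sketch of an independent-samples t-test computed by hand for the coffee example. Everything specific here is an assumption on my part: the recall scores are invented, pooled_t() is a helper name I made up, and I use the pooled-variance form of the test (which assumes the two groups have similar variances). Rather than computing an exact p-value, the sketch compares t with 2.306, the two-tailed critical value for df = 8 at the .05 level from a standard t table.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical recall scores for two independent groups (made up).
caffeinated = [12, 14, 11, 15, 13]
decaf = [10, 9, 11, 8, 12]


def pooled_t(group_a, group_b):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    # Standard error of the difference between the two means.
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / se
    return t, df


t, df = pooled_t(caffeinated, decaf)

# Two-tailed critical value for df = 8, alpha = .05 (standard t table).
CRITICAL_T = 2.306
significant = abs(t) > CRITICAL_T

print(f"t({df}) = {t:.2f}, reject H0: {significant}")
```

    Notice that the test uses exactly the ingredients listed above: the two means, the variability within each group, and the number of observations.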

     

    Analysis of variance

    Analysis of variance (ANOVA) is used to compare the means of more than two groups or conditions. Again, the point is to test for differences among groups. Thus, conceptually ANOVA is very similar to the t-test. It has the additional features of allowing us to look at the differences among more than two groups and to examine more than two variables. In the case of ANOVA, the null hypothesis is that none of the groups differ.

    ANOVA tests a number of questions. For example, in an experiment, one-way ANOVA tests whether the independent variable has an overall effect on the DV (omnibus test). In an experiment examining the effects of music on productivity, the question is whether the IV (music type) has some effect on productivity. Did the music (level of the IV) matter? The same basic issues are considered with a natural groups design, but associations, not causal effects, are tested.

    ANOVA considers means, SDs, and N. The results of an ANOVA (or F-test) are statistically significant if the probability is below .05 ("p < .05") that the difference among the groups is due to chance. When p < .05, the likelihood is less than 5 in 100 that differences this large would occur by chance alone if the null hypothesis were true.

    In an experiment, the results of the omnibus test tell us whether the IV has an effect on the DV, but it does not pinpoint the source of the effect. There are more advanced statistical techniques that help us locate the source of the effect.
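
    As a rough illustration of the omnibus test, the sketch below computes the F ratio for a one-way ANOVA by hand. The three music conditions, the productivity scores, and the helper name one_way_anova() are all hypothetical. The F ratio compares the variability between the group means with the variability within the groups. I compare it with 3.89, the critical value of F for 2 and 12 degrees of freedom at the .05 level from a standard F table.

```python
from statistics import mean

# Hypothetical productivity scores for three music conditions (made up).
groups = {
    "classical": [8, 9, 7, 10, 6],
    "pop": [6, 7, 5, 8, 4],
    "no music": [4, 5, 3, 6, 2],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2
                     for s in groups.values())
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((x - mean(s)) ** 2
                    for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_anova(groups)

# Critical value of F(2, 12) at alpha = .05 (standard F table).
CRITICAL_F = 3.89
omnibus_significant = f_ratio > CRITICAL_F

print(f"F({df_between}, {df_within}) = {f_ratio:.2f}, reject H0: {omnibus_significant}")
```

    A significant F here says only that music type mattered somewhere; it does not say which conditions differ from which.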

    Recall that true interaction effects only can occur when there are two or more independent variables. The simplest experimental design that permits the examination of interactions is a 2 X 2 factorial design. A 2 X 2 design is one in which there are two IVs and each IV has two levels.

    For example, let’s say an experimenter, Jane, wants to examine the effects of anxiety on performance. Jane thinks that the difficulty of the task also may be important. She decides to test the effects of each variable and their interaction on performance. She tests subjects’ verbal performance by examining their performance on a series of anagrams. Here, the DV is the number of anagrams solved. Subjects try to complete difficult or easy anagrams. Thus, anagram difficulty represents one IV with two levels (easy vs. difficult). Jane also manipulates anxiety by telling subjects that failure to reach a certain performance goal either will result in no penalty or a mild shock. So, the second IV is degree of anxiety, and it has 2 levels (shock vs. no shock). This design produces four groups (2 X 2 = 4) : 1) easy/no shock, 2) easy/shock, 3) difficult/no shock, and 4) difficult/shock.

    The ANOVA will test for (1) a main effect of task difficulty, (2) a main effect of anxiety (shock level), and (3) an interaction effect. Again, the p-value (p < .05) for each test indicates whether the result is significant; that is, whether the null hypothesis can be rejected. If each test of the ANOVA is significant, then (1) a main effect for task difficulty suggests that task difficulty influenced performance; (2) a main effect for anxiety suggests that different anxiety levels produced different effects on performance; and (3) an interaction suggests that the effect of one IV depended on the level of the second IV. For more on one-way ANOVA see: http://faculty.vassar.edu/~lowry/ch14pt1.html. For more on two-way ANOVA see: http://faculty.vassar.edu/~lowry/ch16pt1.html. For a conceptual overview of ANOVA go to: http://faculty.vassar.edu/~lowry/ch13pt1.html
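
    To see what main effects and an interaction look like in the numbers, here is a descriptive sketch of Jane's 2 X 2 design. The anagram scores are invented, and the variable names are mine. The code only computes cell means, marginal means, and the pattern of differences; it does not run the F tests, so it shows what each effect means rather than whether it is significant.

```python
from statistics import mean

# Hypothetical number of anagrams solved in each of the four cells (made up).
cells = {
    ("easy", "no shock"): [9, 10, 8, 9],
    ("easy", "shock"): [8, 9, 9, 10],
    ("difficult", "no shock"): [6, 7, 5, 6],
    ("difficult", "shock"): [3, 2, 4, 3],
}

cell_means = {condition: mean(scores) for condition, scores in cells.items()}


def marginal_mean(factor_index, level):
    """Average the cell means at one level of one IV (assumes equal cell sizes)."""
    return mean([m for key, m in cell_means.items() if key[factor_index] == level])


# Main effect of task difficulty: easy versus difficult, averaged over shock.
difficulty_effect = marginal_mean(0, "easy") - marginal_mean(0, "difficult")

# Main effect of anxiety: no shock versus shock, averaged over difficulty.
anxiety_effect = marginal_mean(1, "no shock") - marginal_mean(1, "shock")

# Interaction: does the effect of shock depend on task difficulty?
shock_effect_easy = cell_means[("easy", "no shock")] - cell_means[("easy", "shock")]
shock_effect_difficult = (cell_means[("difficult", "no shock")]
                          - cell_means[("difficult", "shock")])
interaction_contrast = shock_effect_difficult - shock_effect_easy

print(cell_means)
print(difficulty_effect, anxiety_effect, interaction_contrast)
```

    In these made-up data, the threat of shock makes no difference on easy anagrams but lowers performance on difficult ones. That is the pattern an interaction describes: the effect of one IV depends on the level of the other.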

     

    Some Fun Stuff.

    ANCOVA, analysis of covariance, can be understood as a form of multiple regression. It usually is used when the experimenter wants to partial out the influence of a variable or variables. For example, if you had an experiment where you looked at the effects of thought valence (positive versus negative) and task feedback (positive versus negative) on number of problems solved, you could use traditional ANOVA to test for the main effects of and interactions among these two IVs. But, let's say theory and previous research suggest that self-concept (SV) relates to how people react to feedback and how well they can focus on positive or negative thoughts. These variables covary. So, you may want to control for the shared effects of self-concept by partialling out its relationship with each IV. By controlling for self-concept, you are left predicting the RV (number of problems solved) based on each IV and their interaction. Note, I used the term "RV" because there is a SV in the study. For a discussion of ANCOVA see: http://faculty.vassar.edu/~lowry/ch17pt1.html

    Path analysis and SEM. Conceptually, path analysis builds on ANCOVA and it can handle very complex models. Path analysis is used to test multiple models of relationships among numerous variables. Usually path analysis is used in correlational designs. Thus, the drawbacks of correlational designs apply. The research I discussed on choice of math-related major used path analysis to determine the relationships among variables. Figure 5 presented some of the results of the path analysis used to analyze those data.

    The use of structural equation modeling (SEM) is becoming very common in psychology. Basically, very basically, SEM allows us to test a variety of models to see which ones best describe the relationships among the variables examined. The difference between this technique and ANCOVA, path analysis, or regression is that SEM allows us to test latent variables. Latent variables are unobserved constructs that are hypothesized to be represented by groups of measured (manifest) variables. For example, figures 3 and 4 presented a variable called "exam/paper performance." This variable would be a manifest variable if exam and paper performance were actually measured and averaged. Under different conditions it could represent a latent variable. For example, a number of separate measurements of scholastic performance could be taken and presumed to represent a general scholastic performance variable. The construct of general scholastic performance (GSP) would not be measured but inferred from the grouping of the manifest variables. In that case, GSP would be a latent variable.

    When reading about SEM or path analysis, look carefully at the variables and models being tested. Ask yourself: Do they make sense? Are the variables valid? You'll see the terms "good fit" or "best fit." They refer to how well the model explains the covariance among variables. Again, as with many statistical tests, we are looking for the amount of variance explained.

    Be careful, SEM and path analysis often are discussed in terms of causal modeling, and causes and effects are discussed. Of course, causal conclusions rest on the validity of the design and type of design. No matter how complex or impressive the statistics appear to be, causation is not established based on correlational designs. Path analysis and SEM are nearly always used with correlational designs.

     

    One final tip about drawing conclusions based on significant findings.

    Always look at the design and ask yourself these questions: 1. Is it a true experiment? 2. Is the design righteous? If the answer is "no" to either question, causal conclusions cannot be made. This is true even if the findings are significant and replicated many times.

     

    Concluding remarks

    Always think critically and constructively when reading the literature and when deciding what the research means. Always try to keep validity issues in mind. Do your best to be open to ideas. Remember that even if you (a) don't like what the research suggests, the research still could be meaningful, informative, and right, or (b) agree with the conclusions, or they seem to fit with what you've experienced, the conclusions could be misleading or wrong. Use these questions to guide your thinking about the research, theories, and conclusions you encounter:

    1. What am I being asked to believe or accept?

    2. What evidence is available to support these assumptions?

    3. Are there alternative ways of interpreting the evidence?

    4. What additional evidence would help me evaluate the alternatives?

    5. Given the answers to 1-4, what conclusions seem most reasonable?