Linear Correlation and Regression:
Concepts and Computational Nuts-and-Bolts


To keep things as uncluttered as possible, I will illustrate these points with a bivariate distribution that is very simple and streamlined. When you perform the computational mechanics for linear correlation and regression, what you are essentially doing is defining the straight line that best fits the bivariate distribution of data points, as shown below in the graph. This line is spoken of as the regression line, or line of regression, and the criterion for "best fit" is that the sum of the squared vertical distances between the data points and the regression line must be as small as possible. As indicated, the regression line in this particular example yields a sum of squared distances of 39.2. Any other straight line projected through this particular bivariate distribution would yield a larger sum of squared distances, hence would be a poorer fit to the data.


>>The slope of the regression line (upward or downward) indicates the direction of the correlation (+ or -).
>>The closer the data points lie to the regression line, the greater the strength of the correlation; the more they are scattered away from the regression line, the smaller the strength of the correlation.
>>Note that the regression line always passes through the point where the mean of X and the mean of Y intersect. (For the present example, the mean of X is 3.5 and the mean of Y is 7.0.)
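The "best fit" criterion and the point about the means can be checked numerically. The following is a minimal sketch rather than a definitive implementation: it uses the six data pairs from the table on this page and the standard closed-form least-squares solution, and the function name sum_sq_residuals is chosen here purely for illustration. Carried at full precision, the minimum comes out near 39.77 rather than the 39.2 quoted above, which appears to reflect rounded intermediate values.

```python
# Data pairs a-f from the table on this page.
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 2, 4, 10, 12, 8]


def sum_sq_residuals(intercept, slope):
    """Sum of squared vertical distances from the line Y = intercept + slope * X."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Standard closed-form least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

best = sum_sq_residuals(intercept, slope)
print(round(best, 2))  # about 39.77 at full precision

# Nudging the line up, down, steeper, or flatter makes the fit worse.
for d_int, d_slope in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sum_sq_residuals(intercept + d_int, slope + d_slope) > best

# The fitted line passes through the point (mean of X, mean of Y).
assert abs((intercept + slope * mean_x) - mean_y) < 1e-9
```

The four nudged lines are only a spot check; they illustrate the criterion but do not prove that no other line does better.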

As described in Chapter xxx, the measurement of linear correlation by way of the Pearson Product-Moment Correlation Coefficient comes down to a very simple ratio between (i) the amount of covariation between X and Y that is actually observed, and (ii) the amount of covariation that would exist if X and Y had a perfect (100%) positive correlation.



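Although in principle this relationship involves two variances and a covariance, in practice it comes down to something much simpler, involving the prior calculation of only three values of SS, namely:

SSX = ΣX² − (ΣX)²/N        (sum of squared deviates for X)
SSY = ΣY² − (ΣY)²/N        (sum of squared deviates for Y)
SSXY = ΣXY − (ΣX)(ΣY)/N    (sum of cross-products of the X and Y deviates)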

The following table shows all of the preliminary calculations that would be needed for the calculation of the correlation coefficient. You will see in a moment that these preliminary calculations also provide most of the groundwork for performing a subsequent regression analysis.

Pair       X      Y     X²     Y²    X × Y
a          1      6      1     36        6
b          2      2      4      4        4
c          3      4      9     16       12
d          4     10     16    100       40
e          5     12     25    144       60
f          6      8     36     64       48
sums      21     42     91    364      170
means    3.5    7.0
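The columns of the table can be reproduced mechanically. The short sketch below assumes only the six X,Y pairs listed above; the variable names are illustrative.

```python
# Data pairs a-f, copied from the table above.
pairs = {"a": (1, 6), "b": (2, 2), "c": (3, 4),
         "d": (4, 10), "e": (5, 12), "f": (6, 8)}

# One row per pair: X, Y, X squared, Y squared, X times Y.
rows = [(x, y, x * x, y * y, x * y) for x, y in pairs.values()]

# Column sums and the two means.
sums = [sum(column) for column in zip(*rows)]
n = len(rows)
mean_x = sums[0] / n
mean_y = sums[1] / n

print(sums)            # [21, 42, 91, 364, 170]
print(mean_x, mean_y)  # 3.5 7.0
```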

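Summary: calculation of relevant SS values (N = 6), where SSXY denotes the sum of cross-products of the X and Y deviates:

SSX = 91 − (21)²/6 = 91 − 73.5 = 17.5
SSY = 364 − (42)²/6 = 364 − 294 = 70.0
SSXY = 170 − (21)(42)/6 = 170 − 147 = 23.0

Once you have these preliminaries, you can then easily calculate the correlation coefficient as

r = SSXY / sqrt(SSX × SSY) = 23.0 / sqrt(17.5 × 70.0) = 23.0 / 35.0 = +0.66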
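The step from the column sums to the correlation coefficient can be sketched in a few lines. This assumes the standard computational formulas for the two sums of squared deviates and the sum of cross-products; the name ss_xy for the latter is a label chosen here, not necessarily the notation of the original page.

```python
import math

# Column sums from the table of preliminary calculations.
n = 6
sum_x, sum_y = 21, 42
sum_x2, sum_y2 = 91, 364
sum_xy = 170

ss_x = sum_x2 - sum_x ** 2 / n        # sum of squared deviates for X: 17.5
ss_y = sum_y2 - sum_y ** 2 / n        # sum of squared deviates for Y: 70.0
ss_xy = sum_xy - sum_x * sum_y / n    # sum of cross-products: 23.0

# Pearson correlation: observed covariation over maximum possible covariation.
r = ss_xy / math.sqrt(ss_x * ss_y)

print(ss_x, ss_y, ss_xy)  # 17.5 70.0 23.0
print(round(r, 3))        # 0.657
```

Because N cancels out of the ratio, the variances and covariance never need to be computed explicitly.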

Regression Analysis:

The regression line that has been implicitly generated by the preceding calculations can be precisely defined by just two numerical values. The first of these, known as the intercept, indicates where the line starts; and the second, known as the slope, indicates the rate at which the line angles either upward (+) or downward (-), once it gets started. The formulas and calculations for intercept (a) and slope (b) are shown below. (Note that the slope is shown first, because the value of the slope must be known before you can calculate the value of the intercept.)

slope:      b = (sum of cross-products of the X and Y deviates) / SSX = 23.0 / 17.5 = 1.31
intercept:  a = (mean of Y) − b × (mean of X) = 7.0 − (1.3143 × 3.5) = 2.4

In the following figure I show the same graph that appears at the top of this page, but now constructed in such a way as to emphasize the intercept and slope of the regression line. ~~ The intercept, shown on the left-hand side of the graph, is the point at which the dotted extension of the regression line crosses the vertical Y axis--providing that the Y axis is lined up with the point on the horizontal axis where X is equal to zero. (Be careful with this, because bivariate coordinate plots do not always begin the X axis at X=0.) ~~ The slope of the regression line is indicated by the pattern in the graph that looks like a flight of stairs. What this pattern shows is that for each increase of one unit in the value of X, the value of Y increases by 1.31 units. Thus, when X is equal to zero, Y is equal to the intercept, which is 2.4; when X=1.0, Y is equal to the intercept plus 1.31 (i.e., 2.4+1.31=3.71); when X=2.0, Y is equal to the intercept plus 2.62 (i.e., 2.4+2.62=5.02); and so on.
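The intercept, the slope, and the "flight of stairs" pattern can be reproduced with a small sketch. It assumes the standard least-squares formulas (slope = sum of cross-products divided by SSX, intercept = mean of Y minus slope times mean of X); the names ss_xy and predicted_y are illustrative. Because the slope is carried at full precision here (about 1.3143 rather than 1.31), the printed values can differ from those in the text in the second decimal place.

```python
# Values taken from the calculations on this page.
ss_x = 17.5                  # sum of squared deviates for X
ss_xy = 23.0                 # sum of cross-products (label chosen for this sketch)
mean_x, mean_y = 3.5, 7.0

slope = ss_xy / ss_x                   # about 1.31
intercept = mean_y - slope * mean_x    # about 2.4


def predicted_y(x):
    """Height of the regression line at a given value of X."""
    return intercept + slope * x


# Each one-unit step in X raises the line by exactly one slope's worth of Y.
for x in range(0, 4):
    print(x, round(predicted_y(x), 2))
```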


Standard Error of Estimate

The slope and intercept of the regression line are in fact already generated implicitly, behind the scenes, when you perform the calculations for the correlation coefficient. They need to be drawn out explicitly only for the practical purpose of making predictions based on the observed correlation. As discussed in class, the general form of such a prediction is

predicted Y = a + bX ± probable error

The measure of probable error in this situation is a quantity known as the standard error of estimate, which is essentially a standard deviation, a measure of the aggregate degree to which the observed bivariate data points deviate from the line of regression. As indicated in Chapter xxx, the standard error of estimate takes somewhat different forms, according to whether it is regarded as a descriptive measure or an inferential measure.

To illustrate, consider again the bivariate values and scatter plot first shown at the top of this page.


The sum of squared vertical distances from the regression line, calculated as 39.2, can be regarded as a sum of squared deviates; divide that quantity by N, which in this case is equal to 6, and you end up with a variance. This particular measure of variability is spoken of as the residual variance of Y, so named because it is the amount of variability in the variable Y that is not associated with variability in the variable X. ~~ In practice, you will not actually need to calculate the sum of the squared distances from the regression line, because its value can be much more easily calculated as SSY(1-r²). (With r² rounded to 0.44, this gives 70 × 0.56 = 39.2; carried at full precision, the value is about 39.77.) At any rate, take the square root of this (descriptive) residual variance and you end up with the (descriptive) standard error of estimate.

residual variance (descriptive) = 39.2 / 6 = 6.53
standard error (descriptive) = sqrt(6.53) = ±2.56

Replace N in the above expression with the appropriate number of degrees of freedom (within the context of correlation and regression, df=N-2), and you have the standard error of estimate that is usable for inferential purposes.

standard error of estimate (inferential) = sqrt(39.2 / (6 − 2)) = sqrt(9.8) = ±3.13
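The descriptive and inferential versions differ only in the divisor, which a short sketch makes plain. It assumes the shortcut SSY(1-r²) for the residual sum of squares; the function name standard_errors is hypothetical. It is run twice: once with r² rounded to 0.44, which reproduces the 39.2 and 3.13 used on this page, and once at full precision.

```python
import math


def standard_errors(ss_y, r_squared, n):
    """Return (residual SS, descriptive SE, inferential SE) for a bivariate sample."""
    ss_residual = ss_y * (1 - r_squared)            # variability in Y not tied to X
    descriptive = math.sqrt(ss_residual / n)        # divide by N
    inferential = math.sqrt(ss_residual / (n - 2))  # divide by df = N - 2
    return ss_residual, descriptive, inferential


# With r squared rounded to 0.44, as the figures on this page imply.
rounded = standard_errors(ss_y=70.0, r_squared=0.44, n=6)
print([round(v, 2) for v in rounded])  # [39.2, 2.56, 3.13]

# With r carried at full precision (r = 23/35).
exact = standard_errors(ss_y=70.0, r_squared=(23 / 35) ** 2, n=6)
print([round(v, 2) for v in exact])    # [39.77, 2.57, 3.15]
```

The rounding makes little practical difference here, but it explains why a direct sum of squared residuals will not land exactly on 39.2.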

As shown in the following graph, this calculated value for the inferential standard error of estimate (SE) corresponds to a pair of lines running parallel to the regression line, the first lying 3.13 units of Y above the regression line, and the other lying 3.13 units of Y below it.



Taking the standard error of estimate (SE = ±3.13) as our measure of probable error, we can now recast the prediction formula given earlier as predicted Y = a + bX ± SE. Suppose, for example, that you wanted to predict the value of Y that would probably be associated with a newly observed value of X=4. As shown in the above graph, what you are essentially doing when you apply the prediction formula is starting out on the X axis at the point where X=4, going from there straight up to the regression line, then turning left and going straight over to the Y axis where you arrive at a predicted value of Y=7.64. Add ±3.13 to this value as a measure of probable error, and you end up with

predicted Y = 7.64 ± 3.13

Although the underlying logic of the point will not be clear until we are well along in our consideration of concepts of probability, the basic meaning of the prediction is that we can have about 68% confidence that the value of Y actually associated with our newly observed value of X=4 will fall somewhere within the range bounded at the one extreme by 7.64-3.13=4.51 and at the other by 7.64+3.13=10.77.
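The prediction and its range can be written as a small helper. This is a sketch only: the function name predict_with_error is hypothetical, and the inputs are the rounded values used on this page (intercept 2.4, slope 1.31, standard error 3.13).

```python
def predict_with_error(x, intercept, slope, se):
    """Return (lower bound, predicted Y, upper bound) for a given X."""
    predicted = intercept + slope * x
    return predicted - se, predicted, predicted + se


low, predicted, high = predict_with_error(x=4, intercept=2.4, slope=1.31, se=3.13)
print(round(low, 2), round(predicted, 2), round(high, 2))  # 4.51 7.64 10.77
```

The band of plus or minus one standard error is the same width at every value of X in this simple form, which is why it appears as a pair of lines parallel to the regression line.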

