Perpendicular Regression Of A Line

When we perform a regression fit of a straight line to a set of (x,y) 
data points we typically minimize the sum of squares of the "vertical" 
distance between the data points and the line.  However, this isn't 
the only possible approach.  For example, we might instead minimize 
the horizontal distances from the points to the line, or the 
perpendicular distances to the line.

The reason we don't usually apply any of these alternative approaches 
is that, in general, the units of x and y may be different, and so 
the "angles" of lines in the xy plane do not have any absolute 
significance.  For example, if x is time, and y is intensity, we 
have no way of weighting x errors in relation to y errors, so there 
is no unique notion of "perpendicular" in the time-intensity plane. 
So, for most regression applications (where x and y have arbitrary 
units), we just use the vertical distance, with the idea being that 
we can sweep up the error in x as just one more contribution to the 
error in y at a specific value of x.

On the other hand, in cases where x and y have the same units, it's 
feasible to regress both x and y by minimizing the sum of squares of 
the perpendicular distances from the line.  In such cases the result 
has absolute significance, because the notion of "perpendicular" has 
an absolute meaning.  

One way of approaching this is to find the "principal directions" of 
the data points.  Let's say we have the (x,y) coordinates of n data 
points.  To make it simple, let's first compute the average of the 
x values, and the average of the y values, calling them X and Y 
respectively.  The point (X,Y) is the centroid of the set of points.
Then we can subtract X from each of the x values, and Y from each of 
the y values, so now we have a list of n data points whose centroid 
is (0,0).
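As a sketch of this centering step in Python (the function name 
"center_points" is my own, not from the text):

```python
def center_points(points):
    """Translate a list of (x, y) points so their centroid is (0, 0).

    Returns the centroid (X, Y) together with the centered points.
    """
    n = len(points)
    X = sum(x for x, y in points) / n   # average of the x values
    Y = sum(y for x, y in points) / n   # average of the y values
    return (X, Y), [(x - X, y - Y) for x, y in points]
```

For instance, the points (5,5), (6,6), (7,7) used in the first example 
below have centroid (6,6), and centering them gives (-1,-1), (0,0), 
and (1,1).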

To find the principal directions, imagine rotating the entire set of 
points about the origin through an angle q.  This sends the point 
(x,y) to the point (x',y') where

              x'  =   x cos(q) + y sin(q)
              y'  =  -x sin(q) + y cos(q)
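These formulas translate directly into code (a sketch; the name 
"rotate_clockwise" is mine, chosen because for positive q this map 
turns the points clockwise through the angle q):

```python
import math

def rotate_clockwise(x, y, q):
    """Rotate the point (x, y) clockwise through the angle q (radians),
    i.e. apply x' = x cos(q) + y sin(q), y' = -x sin(q) + y cos(q)."""
    return (x * math.cos(q) + y * math.sin(q),
            -x * math.sin(q) + y * math.cos(q))
```

For example, rotating the point (1,1) through q = 45 degrees sends it 
to (sqrt(2), 0), flat on the x axis.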

Now, for any fixed angle q, the sum of the squares of the vertical 
heights of the n transformed data points is S = SUM [y']^2, and we 
want to find the angle q that minimizes this.  (We can look at this 
as rotating the regression line so the perpendicular corresponds to 
the vertical.)  To do this, we take the derivative with respect to q 
and set it equal to zero.  The derivative of [y']^2 is  2y'(dy'/dq), 
so we have

  dS/dq = 2 SUM [-x sin(q)+y cos(q)][-x cos(q)-y sin(q)]

We set this to zero, so we can immediately divide out the factor 
of 2.  Then, expanding out the product and collecting terms into 
separate summations gives

  [SUM xy] sin(q)^2  + [SUM (x^2 - y^2)] sin(q)cos(q)

         - [SUM xy] cos(q)^2  =  0

Dividing through by cos(q)^2, we get a quadratic equation in tan(q):

    {xy}tan(q)^2 + {x^2 - y^2}tan(q) - {xy} = 0

where the "curly braces" indicate that we take the sum of the contents 
over all n data points (x,y).  Dividing through by the sum {xy} gives

          tan(q)^2  +  A tan(q) - 1  =  0

where A = {x^2-y^2}/{xy}.  Solving this quadratic for tan(q) gives 
two solutions, which correspond to the "principal directions", i.e., 
the directions in which the "scatter" is maximum and minimum.  We 
want the minimum.
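The whole procedure can be sketched in Python.  Everything here (the 
function name and variable names) is my own illustration of the method 
described above, not code from the text; to pick the minimizing root it 
simply evaluates the residual sum S for both roots and keeps the 
smaller.

```python
import math

def perpendicular_fit_angle(points):
    """Fit a line by minimizing squared perpendicular distances.

    Returns the angle q (radians) of the best-fit direction, measured
    from the x axis, together with the centroid the line passes
    through.  Solves {xy} t^2 + {x^2 - y^2} t - {xy} = 0 for
    t = tan(q).  (This sketch assumes {xy} != 0; when {xy} = 0 the
    coordinate axes themselves are the principal directions.)
    """
    n = len(points)
    X = sum(x for x, y in points) / n
    Y = sum(y for x, y in points) / n
    xs = [x - X for x, y in points]
    ys = [y - Y for x, y in points]
    Sxy = sum(x * y for x, y in zip(xs, ys))           # {xy}
    Sd = sum(x * x - y * y for x, y in zip(xs, ys))    # {x^2 - y^2}
    disc = math.sqrt(Sd * Sd + 4 * Sxy * Sxy)
    roots = [(-Sd + disc) / (2 * Sxy), (-Sd - disc) / (2 * Sxy)]
    def S(t):
        # Residual sum SUM [y']^2 after rotating by q = atan(t).
        q = math.atan(t)
        return sum((-x * math.sin(q) + y * math.cos(q)) ** 2
                   for x, y in zip(xs, ys))
    t = min(roots, key=S)      # the direction of minimum scatter
    return math.atan(t), (X, Y)
```

Applied to the examples below, this returns tan(q) = 1 for the three 
collinear points, and tan(q) = 1/2 for the four rectangle vertices.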

Just to illustrate on a trivial example, suppose we have three data 
points (5,5), (6,6), and (7,7).  First compute the centroid, which 
is (6,6), and then subtract this from each point to give the new set 
of points (-1,-1), (0,0), and (1,1).  Then we can tabulate the sums:

            x   y   x^2 - y^2   xy
           --- ---  ---------  ----
           -1  -1       0        1
            0   0       0        0
            1   1       0        1
                      -----    -----
                        0        2

In this simple example we have {x^2-y^2} = 0 and {xy} = 2, which
means that A = 0, so our equation for the principal directions is 
simply 
                   tan(q)^2 - 1 = 0

Thus the two roots are tan(q)=1 and tan(q)=-1, which correspond 
to the angles +45 degrees and -45 degrees.  This makes sense,
because our original data points make a 45 degree line, so if 
we rotate them 45 degrees clockwise they are flat, whereas if we 
rotate them 45 degrees the other way they are vertically arranged.
These are the two principal directions of this set of 3 points.  
The "best" fit through the original three points is a 45 degree 
line through the centroid - which is obvious in this trivial
example, but the method works in general with arbitrary sets 
of points.

For another example, suppose we have four data points (2,6), (4,2), 
(16,8), and (14,12).  The centroid of these points is (9,7), so we 
can subtract this from each point to give the new set of points 
(-7,-1), (-5,-5), (7,1), and (5,5).  Then we can tabulate the sums:

                x    y     xy    x^2 - y^2
               ---  ---   ----   ---------
               -7   -1      7       48
               -5   -5     25        0
                7    1      7       48
                5    5     25        0
                          ----     ----
                   sums:   64       96

In this case we have {xy} = 64 and {x^2-y^2} = 96, which gives A = 3/2, 
so our equation for the principal directions is

              tan(q)^2 + (3/2)tan(q) - 1 = 0

The two roots are tan(q) = 1/2 and -2, which correspond to the angles
+26.565 degrees and -63.435 degrees.  This is consistent with the fact
that our original four data points are the vertices of a rectangle
whose edges have the slopes 1/2 and -2.  The "best" fit through these 
four points is a line through the centroid with a slope of 1/2.
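As a check on the arithmetic of this second example, here is a minimal 
Python computation of the sums and roots (the variable names are my 
own):

```python
import math

# Centered points from the four data points (2,6), (4,2), (16,8), (14,12).
pts = [(-7, -1), (-5, -5), (7, 1), (5, 5)]
Sxy = sum(x * y for x, y in pts)            # {xy} = 64
Sd = sum(x * x - y * y for x, y in pts)     # {x^2 - y^2} = 96
A = Sd / Sxy                                # A = 3/2
# Roots of tan(q)^2 + A tan(q) - 1 = 0 by the quadratic formula:
r1 = (-A + math.sqrt(A * A + 4)) / 2        # 1/2
r2 = (-A - math.sqrt(A * A + 4)) / 2        # -2
```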

(It's interesting that the two quantities which characterize the 
points, namely xy and x^2 - y^2, are both hyperbolic conic forms, 
and they constitute the invariants of Minkowski spacetime when 
expressed in terms of null coordinates and spatio-temporal 
coordinates, respectively.)
