Perpendicular Regression Of A Line
When we perform a regression fit of a straight line to a set of (x,y)
data points we typically minimize the sum of squares of the "vertical"
distance between the data points and the line. However, this isn't
the only possible approach. For example, we might choose to minimize
the horizontal distances from the points to the line, or the
perpendicular distances to the line.
The reason we don't usually apply any of these alternative approaches
is that, in general, the units of x and y may be different, and so
the "angles" of lines in the xy plane do not have any absolute
significance. For example, if x is time, and y is intensity, we
have no way of weighting x errors in relation to y errors, so there
is no unique notion of "perpendicular" in the time-intensity plane.
So, for most regression applications (where x and y have arbitrary
units), we just use the vertical distance, with the idea being that
we can sweep up the error in x as just one more contribution to the
error in y at a specific value of x.
On the other hand, in cases where x and y have the same units, it's
feasible to regress both x and y by minimizing the sum of squares of
the perpendicular distances from the line. In such cases the result
has absolute significance, because the notion of "perpendicular" has
an absolute meaning.
One way of approaching this is to find the "principal directions" of
the data points. Let's say we have the (x,y) coordinates of n data
points. To make it simple, let's first compute the average of the
x values, and the average of the y values, calling them X and Y
respectively. The point (X,Y) is the centroid of the set of points.
Then we can subtract X from each of the x values, and Y from each of
the y values, so now we have a list of n data points whose centroid
is (0,0).
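As a minimal sketch of this centering step in Python (the function
name center_points and the sample points are illustrative choices,
not part of any particular library):

```python
def center_points(points):
    """Return (centroid, centered_points) for a list of (x, y) pairs."""
    n = len(points)
    cx = sum(x for x, _ in points) / n   # average of the x values (X)
    cy = sum(y for _, y in points) / n   # average of the y values (Y)
    # Subtract the centroid so the shifted points have centroid (0, 0).
    centered = [(x - cx, y - cy) for x, y in points]
    return (cx, cy), centered


centroid, centered = center_points([(2, 6), (4, 2), (16, 8), (14, 12)])
print(centroid)   # (9.0, 7.0)
print(centered)   # [(-7.0, -1.0), (-5.0, -5.0), (7.0, 1.0), (5.0, 5.0)]
```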
To find the principal directions, imagine rotating the entire set of
points about the origin through an angle q. This sends the point
(x,y) to the point (x',y') where
x' = x cos(q) + y sin(q)
y' = -x sin(q) + y cos(q)
Now, for any fixed angle q, the sum of the squares of the vertical
heights of the n transformed data points is S = SUM [y']^2, and we
want to find the angle q that minimizes this. (We can look at this
as rotating the regression line so the perpendicular corresponds to
the vertical.) To do this, we take the derivative with respect to q
and set it equal to zero. The derivative of [y']^2 is 2y'(dy'/dq),
so we have
dS/dq = 2 SUM [-x sin(q)+y cos(q)][-x cos(q)-y sin(q)]
We set this to zero, so we can immediately divide out the factor
of 2. Then, expanding out the product and collecting terms into
separate summations gives
[SUM xy] sin(q)^2 + [SUM (x^2 - y^2)] sin(q)cos(q)
- [SUM xy] cos(q)^2 = 0
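The expanded form can be checked numerically. The following Python
sketch compares it against a central finite difference of S. The
function names S and dS_dq, the test angle, and the (already centered)
sample points are arbitrary choices made for illustration.

```python
import math


def S(centered, q):
    """Sum of squared vertical heights after rotating through angle q."""
    s, c = math.sin(q), math.cos(q)
    return sum((-x * s + y * c) ** 2 for x, y in centered)


def dS_dq(centered, q):
    """Derivative of S from the expanded formula (factor of 2 restored)."""
    s, c = math.sin(q), math.cos(q)
    sxy = sum(x * y for x, y in centered)            # SUM xy
    sdiff = sum(x * x - y * y for x, y in centered)  # SUM (x^2 - y^2)
    return 2 * (sxy * s * s + sdiff * s * c - sxy * c * c)


centered = [(-7, -1), (-5, -5), (7, 1), (5, 5)]  # centroid is (0, 0)
q, h = 0.3, 1e-6
numeric = (S(centered, q + h) - S(centered, q - h)) / (2 * h)
print(abs(numeric - dS_dq(centered, q)) < 1e-4)  # True
```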
Dividing through by cos(q)^2, we get a quadratic equation in tan(q):
{xy}tan(q)^2 + {x^2 - y^2}tan(q) - {xy} = 0
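where the "curly braces" indicate that we take the sum of the contents
over all n data points (x,y). Provided the sum {xy} is not zero,
dividing through by it gives
tan(q)^2 + A tan(q) - 1 = 0
where A = {x^2-y^2}/{xy}. (If {xy} is zero, the condition reduces to
{x^2-y^2} sin(q)cos(q) = 0, so the coordinate axes themselves are the
principal directions.) Solving this quadratic for tan(q) gives two
solutions whose product is -1, so they represent perpendicular
directions. These are the "principal directions", i.e., the
directions in which the "scatter" is maximum and minimum. We want
the minimum, which is the root that gives the smaller value of S.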
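The whole procedure can be sketched in a few lines of Python. This is
only one possible arrangement: the names sum_sq_heights and best_angle
are illustrative, the sample points are made up, and the handling of
the case {xy} = 0 (where the quadratic cannot be formed and the
candidate angles are taken to be 0 and 90 degrees) is an assumption
added here for completeness.

```python
import math


def sum_sq_heights(centered, q):
    """S(q): sum of squared heights y' after rotating through angle q."""
    s, c = math.sin(q), math.cos(q)
    return sum((-x * s + y * c) ** 2 for x, y in centered)


def best_angle(points):
    """Angle q (radians) of the line minimizing perpendicular distances."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    sxy = sum(x * y for x, y in centered)            # {xy}
    sdiff = sum(x * x - y * y for x, y in centered)  # {x^2 - y^2}
    if sxy == 0:
        # Assumed fallback: the axes themselves are the candidates.
        candidates = [0.0, math.pi / 2]
    else:
        A = sdiff / sxy
        root = math.sqrt(A * A + 4)  # discriminant of t^2 + A t - 1 = 0
        candidates = [math.atan((-A + root) / 2),
                      math.atan((-A - root) / 2)]
    # Keep the direction with the smaller scatter.
    return min(candidates, key=lambda q: sum_sq_heights(centered, q))


# Made-up points lying exactly on the line y = 2x.
q = best_angle([(0, 0), (1, 2), (2, 4)])
print(round(math.tan(q), 6))  # 2.0
```

The fitted line passes through the centroid with slope tan(q), except
when q is 90 degrees, in which case the line is vertical.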
Just to illustrate on a trivial example, suppose we have three data
points (5,5), (6,6), and (7,7). First compute the centroid, which
is (6,6), and then subtract this from each point to give the new set
of points (-1,-1), (0,0), and (1,1). Then we can tabulate the sums:
   x     y     x^2 - y^2     xy
  ---   ---    ---------    ----
  -1    -1         0          1
   0     0         0          0
   1     1         0          1
               ---------    ----
  sums:            0          2
In this simple example we have {x^2-y^2} = 0 and {xy} = 2, which
means that A = 0, so our equation for the principal directions is
simply
tan(q)^2 - 1 = 0
Thus the two roots are tan(q)=1 and tan(q)=-1, which correspond
to the angles +45 degrees and -45 degrees. This makes sense,
because our original data points make a 45 degree line, so if
we rotate them 45 degrees clockwise they are flat, whereas if we
rotate them 45 degrees the other way they are vertically arranged.
These are the two principal directions of this set of 3 points.
The "best" fit through the original three points is a 45 degree
line through the centroid - which is obvious in this trivial
example, but the method works in general with arbitrary sets
of points.
For another example, suppose we have four data points (2,6), (4,2),
(16,8), and (14,12). The centroid of these points is (9,7), so we
can subtract this from each point to give the new set of points
(-7,-1), (-5,-5), (7,1), and (5,5). Then we can tabulate the sums:
   x     y      xy     x^2 - y^2
  ---   ---    ----    ---------
  -7    -1       7        48
  -5    -5      25         0
   7     1       7        48
   5     5      25         0
               ----    ---------
  sums:         64        96
In this case we have {xy} = 64 and {x^2-y^2} = 96, which gives A = 3/2,
so our equation for the principal directions is
tan(q)^2 + (3/2)tan(q) - 1 = 0
The two roots are tan(q) = 1/2 and -2, which correspond to the angles
+26.565 degrees and -63.435 degrees. This is consistent with the fact
that our original four data points are the vertices of a rectangle
whose edges have the slopes 1/2 and -2. The "best" fit through these
four points is a line through the centroid with a slope of 1/2.
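The arithmetic of this example can be reproduced with a short Python
sketch. The variable names are illustrative, and the comparison with
the ordinary "vertical" regression slope (the sum {xy} divided by the
sum {x^2}) is added here only to show how the two fits differ.

```python
import math

points = [(2, 6), (4, 2), (16, 8), (14, 12)]
n = len(points)
cx = sum(x for x, _ in points) / n
cy = sum(y for _, y in points) / n
centered = [(x - cx, y - cy) for x, y in points]

sxy = sum(x * y for x, y in centered)            # {xy}
sdiff = sum(x * x - y * y for x, y in centered)  # {x^2 - y^2}
A = sdiff / sxy

# Roots of tan(q)^2 + A tan(q) - 1 = 0.
root = math.sqrt(A * A + 4)
t1 = (-A + root) / 2
t2 = (-A - root) / 2


def perp_sum_sq(t):
    """Sum of squared perpendicular distances to the line y = t*x."""
    return sum((y - t * x) ** 2 for x, y in centered) / (1 + t * t)


slope = min((t1, t2), key=perp_sum_sq)  # direction of minimum scatter
intercept = cy - slope * cx             # line passes through the centroid

# Ordinary vertical-distance regression slope, for comparison (about 0.432).
ols_slope = sxy / sum(x * x for x, _ in centered)

print(sxy, sdiff, A)     # 64.0 96.0 1.5
print(t1, t2)            # 0.5 -2.0
print(slope, intercept)  # 0.5 2.5
```

In the original coordinates the perpendicular fit is therefore
y = x/2 + 5/2, somewhat steeper than the vertical-distance fit.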
(It's interesting that the two quantities which characterize the
points, namely xy and x^2 - y^2, are both hyperbolic conic forms,
and they constitute the invariants of Minkowski spacetime when
expressed in terms of null coordinates and spatio-temporal
coordinates, respectively.)