|
With a trendline we can quantify that observation. To add a trendline, select the plotted
series, and then Chart | Add Trendline… | Type tab |
select Linear and from the Options tab |
check ‘Display equation on chart’
|

Figure
2
|
In addition to the trendline, Figure 2 also contains horizontal line showing the average of
the sales data. From the trendline, we
can conclude that on average monthly sales increase by 650 units each month. But, that is about the limit of the
information. It is important to note
that extrapolating too far out beyond the existing data set is not a very smart
thing to do. It would be OK to estimate
sales in each of the missing months: 3 and 5.
Estimating for missing data within the overall data set is known as interpolation. By contrast, estimating sales for months
outside the data set is known as extrapolation. It might be OK to make an estimate for
“nearby” months such as month 9 or maybe even 10. However, it would be foolhardy to conclude
that a year from now sales will have risen by 650*12 or 7,800 units. The limit of what one can do with a
regression is further addressed in the section on Confidence Intervals.
While the trendline is of some value, if we were to use the
information in any analysis, it would help to have the data in worksheet cells
and not as text in a chart. In addition,
consider the case in which the dependent variable (sales in the above example)
was a function of not one independent variable (month in the above example) but
multiple variables? Known as a multiple regression, a graphical
analysis would require what cannot be done in Excel: an n‑dimensional
graph. The section on Linear regression with
multiple variables addresses how this can be done in an Excel worksheet.
Excel includes multiple functions for regression
analysis. We will look at LINEST in
detail. Once we understand the LINEST
function, LOGEST will not be too difficult to understand. In addition to using LINEST for linear
estimates, we will also see how its name is misleading. It can actually be used for a large variety
of different types of trends such as polynomial, logarithmic and exponential.
The previous chart showed a linear trend with one
independent variable. What other kinds
of trends are there? Figure 3 shows five different types: polynomial of order 3,
linear, logarithmic, natural exponential, and power.

Figure
3 – Note that what Excel calls an exponential trendline
is strictly speaking a natural
exponential trendline of the form
. Excel doesn’t have the capability of drawing the
more general exponential trendline of the form 
Other types of trendlines would include the trigonometric
functions
or
, as well as the exponential function,
. While Excel includes
a special function to analyze the exponential regression – the LOGEST function
– we will see why, strictly speaking, it is not necessary.
The rest of the case study is structured as follows:
- The basic linear regression
- Understanding the result
- Limits of regression analysis
- Over-specifying a regression – a
common mistake
- Confidence Intervals
- Regression analysis with other trend
functions
- Linear regression with multiple
variables
- Regression with polynomials
- Regression with indicator (dummy)
variables
- Comparing chart trendline values with
LINEST results
- Implications for different versions of
Excel
- LINEST and VBA
- Summary
A linear regression fits the line
, or as Excel prefers to call it
, to the existing data set.
It does so through a technique known as minimizing the sum of the
squares of the error terms. To get the
complete result of a regression analysis, select a range 5 rows by 2 columns
and array-enter the LINEST function as shown in Figure 4. The first row
contains the 2 coefficients a1
and a0 respectively.
Figure
4
Using regression analysis yields a lot more information than
the trendline’s coefficients. The rest
of the information is important in understanding how well the regression line
fits the data, how significant the individual coefficients are, as well as the
significance of the regression as a whole.
It also contains key elements needed to build confidence intervals for
interpolated or extrapolated estimates, a subject covered in the section titled
Confidence Intervals.
But, first, we start with some nomenclature. The number of data points is given by n.
The number of independent variables is given by k. If a constant is included
in the regression, it increases k by 1. Figure 5 is a zoomed-in version of Figure 2 focusing on the data point (4, 4400). Each of the recorded observations is denoted
by the pair of values
. For each
, the value predicted by the regression is given by
.
is pronounced “y
hat.” The average of all the observed y
values is known as
, pronounced “y bar.”

Figure
5 – Enlarged view of Figure 2
around the data point (4, 4400)
Lacking any information about the regression one could still
estimate the average of all the y’s, i.e.,
. That’s a “first cut”
estimate that we seek to improve with a regression. The value that one gets from the regression
is
. As Figure 5 shows, the total gap between the observed data point
and the “first cut” estimate is given by the deviation
-
. Out of this the
regression accounts for the amount
‑
. This leaves a
residual (also known as an error) of
‑
unexplained by the
regression.
With the nomenclature established, we look at the result of
the LINEST function. For reasons that
will soon be apparent, we start with the last row. There are two values
and
. These are aggregate
measures of something we have already looked at the level of an individual data
point. Recall that for each individual
data point, the measure of how much the regression explains is
and how much remains
unexplained is
.
and
are aggregate values of the same metrics. The first is the sum of the squared values of
how well the regression fits the data or
. The second is the
sum of the squared values of how much remains unexplained or
.
Row 4 contains two values: the F statistic and the degrees of freedom, df. The degrees of freedom
is given by the expression n-k, where n and k are explained earlier in this
section. The F statistic, or the observed F-value, is a measure of the
significance of the regression as a whole.
For the technically minded it tests the null hypothesis that all of the
coefficients are insignificant against the alternative hypothesis that at least
one of the coefficients is significant.
While Excel provides the value, it can also be computed as
. This, the observed
F-value, is then compared against a critical
F-value, F(a, v1, v2),
where a is 1 - the level of
significance we are interested in, and v1 and v2 are as calculated
below. The typical level of significance
one is interested in is 95%. Hence, a = 1-0.95 or 0.05. If the constant term in included in the
regression (the LINEST parameter const is either TRUE or omitted), v1
= n – df – 1 and v2 = df. If
the constant term is excluded from the regression (the LINEST parameter const =
FALSE), then v1 = n – df and v2 = df.). Excel’s FINV function, used as FINV(F, v1,
v2), provides the critical F-value. If
the observed F-value is greater than the critical F-value, it means the
regression as a whole is significant.
Row 3 contains the two metrics, R2 and the SEreg. The R2 is measure of how well the
regression fits the observed data. It
ranges from 0 to 1 and the closer to 1 the better the fit. Mathematically, it is calculated as
, where each term is explained above. Graphically, in terms of Figure 5, it is a measure of how close the regression line is
to all of the observations. Suppose the regression line were to pass
through every observation. Then, SSresidual
would be zero since each of its components
would be zero, and R2
would become SSregression/SSregression or 1. The second item in this row is the Standard
Error of the regression, or SEreg.
It can also be calculated from what we already know, i.e.,
. Keep in mind that
n-k is also the df value in row 4 of
the result. SSreg will play a
role in calculating the confidence intervals later in this chapter.
Row 2 provides the standard error of each coefficient, or
. The section Understanding the result addresses how these errors help determine if the
coefficients are significant.
LINEST can return more than just the coefficients of the regression. Used as an array formula in a 5 rows by X
columns range, LINEST returns not only the coefficients but also other
statistical information about the results.
Some might find it surprising but the Excel documentation for LINEST
does a very good functional job of explaining not only contents of all the rows
but also the statistical value of that information. Nonetheless, we will look at one key element
of the result that bears repeating – it is overlooked by way too many users of
LINEST. Figure 6 shows the result of some linear regression
using LINEST.

Figure
6
When array-entered in 5 rows (D2:E5), LINEST, as before,
returns the coefficients in row 1. In
row 2 it provides the standard error
of each of the coefficients. Taken
together, the two rows contain critical information in estimating whether each
of the coefficients is statistically different from zero. Dividing the absolute value of the coefficient
value by the standard error yields what is known as the observed t-result, or
Comparing the
absolute value of this t-result with the corresponding critical t-value lets one decide whether that coefficient should be
treated as zero. If the observed
t-result is less than the critical
t-value then statistically the coefficient is the same as zero. In the above example, the t-result for the a1 and the a0
(constant) terms are:
, respectively. The critical
t-value is calculated with the formula =TINV(0.05, df.), where 0.05 corresponds
to the 95% confidence level and d.f. is the degrees-of-freedom for the
regression. In the above example, d.f.
is given by E4. The result of the
formula, i.e., the critical t-value, is
. The absolute value of the observed t-result
for the coefficient (a1) is much greater than the critical
t-value. Hence, we can conclude that the
a1 term is statistically significant. On the other hand, the absolute value of the observed
t-value for the a0 (constant) term is 2.148. Since it is less than the critical t-value,
we conclude that, statistically speaking, the constant is zero.
Since the constant term is statistically indistinguishable
from zero, one should run the regression with the constant term forced to
zero. This yields the result in Figure 7.

Figure
7
Note that the R2 value has actually increased
from 99.65% to 99.88%! Hence, removing
the value that was statistically indistinguishable from zero has improved the R2
value. Of course, this may not always be
true, but removing a term that is statistically the same as zero is always a
good idea.
Limits of regression analysis
Regression analysis is not magic. It is a tool that is only as good as the
underlying data used in the analysis and as good as the person using it. Ultimately, there is no substitute for a
better understanding of the underlying cause-and-effect that the regression is
meant to analyze as well as a better understanding of the real world elements
being modeled in the regression.
A question that arises on occasion is “what kind of
trendline should I pick?” The answer is
that the analyst has to have some understanding of the process under
analysis. For example, in the example
used in the beginning of this chapter, we modeled the sales volume as a linear
function of the month. While it made
sense for the range of vaues we had, if we were working over a longer period,
it is quite possible that a different model, such as a logarithmic regression,
might have been more appropriate.
A few other examples that come to mind will demonstrate the
importance of paying attention to the underlying issues that affect the
analysis:
In one case, someone had 5 years of data for the number of
applications and the number of admissions to a college. Based on this data, her boss had asked her to
project the number of applications and admissions for the next 5 years. Extrapolating five years of data this far out
into the future is simply dangerous. Any
number of factors can affect application and admission rates to a college and
ignoring the various possible causes is a prescription to a disaster.
In a consulting assignment, I was asked by a regional
utility to help analyze the pattern of calls to its call center. As one can imagine any number of variables go
into such calls. Among others, there
were seasonal trends, annual trends, day-of-week patterns, storm-related surges
in call volume, etc. One of the
challenges was to better understand call patterns through the week. The obvious and intuitive thing to do was
establish a ‘day of week’ indicator.
However, that did not yield statistically significant results. After talking with the call center management
and the call center operators, I realized that the key was not the day-of-the-week
but ‘operating day of week’ – a term that accounted for holidays. What that meant was that if Monday was a
working day, it would be operating-day-1.
However, if it was a holiday, then for that week Tuesday would be
operating-day-1. Once we factored in
this kind of reasoning, together with a similar flag for ‘last-working-day-of-the-week’
the independent variables immediately became statistically significant!
The final example is a classic from an introductory MBA
class on statistics. Suppose one were to
do either of the following two regressions: (1) the number of men buying ice
cream each day as a function of the number of women wearing bikinis that day,
or (2) the reverse, i.e., the number of women buying ice cream each day as a
function of the number of men going shirtless that day. The result will show a significant
statistical correlation. But does it
make sense? No, not really. The events are correlated, but neither is the
cause of the other. In fact, the true
cause is missing from the analysis – the average daily temperature! In summer, when the outside temperature is
high, both men and women buy more ice cream and wear fewer clothes. When the temperature drops during the winter
months, people wear more clothes and buy less ice cream.
These examples should highlight the risks of carrying out a
regression analysis without truly understanding the underlying factors that
affect the analysis. If one is lucky the
results will be so absurd that they will be rejected out-of-hand. If one is unlucky, the result may lead to bad
organizational or individual decisions.
Over-specifying a regression
Given how easy it is to add a polynomial trendline to an
Excel chart, it is tempting to always ask for a high order polynomial as the
curve of best fit. The number of
requests one sees in the Excel newsgroups where someone asks for a cubic
polynomial or even a sixth-order polynomial is quite high. However, it is usually not a good practice to
over-specify a regression since it can lead to results that are numerically
unstable. This over-specification is
easy to identify if one uses LINEST to find the coefficients of a polynomial
best-fit and then tests if the coefficients are statistically significant.
Suppose one carried out a quadratic polynomial regression
for the data from the previous section.
The result of the LINEST function would be:

Figure 8
The absolute value of the observed t-value corresponding to the x2 term is
1.355. The t-critical value is 2.1788.
Hence, statistically speaking, the coefficient of the x2 term
is indistinguishable from zero. What
this tells us is that asking for a quadratic polynomial best-fit over-specifies
the regression. Removing the x2
term leads to the result of Figure
6, which, in turn, leads to the result of Figure 7.
Among the uses of a regression is the ability to estimate
missing information as well as information outside of the range of data
analyzed. In the example from the first
section, one might want to estimate the sales volume missing in months 3 and
5. Or, one might want to estimate sales
in months 9 or 10, or even further out into the future. Note the emphasis on “estimate.” Clearly, one cannot state with certainty that
the sales volume in month 3 must have been a particular amount. Similarly,
it is hard to imagine a certain prediction about sales volume a year from now. However, we can do two things. First, provide what is called a point
estimate. Second, provide a range of
within which we can state with some certainty the actual value will fall. This information is often shown graphically
as in Figure 9. The lower and
upper curved lines demarcate the range of the 95% interval whereas the
regression line itself represents the point estimate. Hence, for month three the best point
estimate would be 4,383 units. For most
practical purposes there is a 95% probability that the actual value was between
the upper and lower CIs. For month 3
those values would be 3,363 and 5,404 respectively.

Figure
9
The calculations required to plot the three lines are
described below. The predicted line is
the same as the regression line and each y value, yj, is calculated
as described in The
basic linear regression, i.e.,
. The distance of each
confidence interval point from the regression line is given by the equation
, where CIj is the value for the value of interest
xj and xi represents the known observations. SEreg is one of the values
returned by the LINEST function and explained in The basic linear regression section. Excel
provides a function, DEVSQ, to compute the sum of the squares in the equation.
Next, we compute these values in Excel. The original data set is shown in Figure 10. Only four of
the regression related values are shown – a1, a0, SEreg,
and the two-side 95% t(df) value.

Figure
10
Next, we calculate the predicted value and the confidence
interval values for the months 1 through 20.

Figure 11
H3:K3 contain the formulas shown below.
|
H3
|
=$E$3*G3+$E$4
|
|
I3
|
=$E$6*$E$5*SQRT(1/COUNT($A$3:$A$8)+(G3-AVERAGE($A$3:$A$8))^2/DEVSQ($A$3:$A$8))
|
|
J3
|
=H3-I3
|
|
K3
|
=H3+I3
|
Copy cells H3:K3 rows 4:22.
Graph H, J, K as the y values and G as the x-values in a XY Scatter
chart to get

Figure 12
Note how much additional uncertainty is introduced the
further we get from the original data set, which had values only for the first
9 months.
Regression analysis with other trend functions
As shown in Figure
3, there are many different types of trendlines possible. Each reflects a different relationship
between the independent and the dependent variables. Some trend functions of a single variable –
other than a linear or a polynomial trend – are listed in the table below.
|
Logarithmic
|

|
|
Power
|

|
|
Exponential function to base b
|

|
|
Natural Exponential function
|

|
|
Trigonometric function
|

|
Table 2
The key to understanding this section (as well as the one on
polynomial trends) is realizing something that apparently the Microsoft people
who documented LINEST overlooked: the linearity required in sum-of-square best
fit regression analysis applies not
to the variables but to the unknown coefficients! Hence, as long as we can write a function in
a form that is linear in the unknown coefficients, LINEST can be used to
analyze the data. Keep this in mind as
we look at how to transform the above functions into a form amenable to
analysis by LINEST. Also, the data for
the examples for the different trendlines are in Figure 13. This is the
same data set used for the trendlines in Figure
3. Hence, for
those functions for which Excel can create chart trendlines, the reader can
compare the results below with the chart trendline results.

Figure
13
We start with the logarithmic trendline function,
. It is already in a
linear form (
). Hence, given the x
and y values, =LINEST(y-range, LN(x-range)) will give the required a (slope) and b (intercept) values.

Figure 14
The power function
can be transformed
into
. As in the log trendline, given x and y
values, using LINEST with this transformed function means that =LINEST(LN(y-range),
LN(x-range)) yields ln(a) as the
intercept and b as the slope. Hence, for the original regression a will be EXP(intercept) and b will be the slope itself.

Figure 15
The exponential function to base b,
, can be transformed into
. Using LINEST with
this transformed function means that =LINEST(LN(y-range), x) yields ln(a) as the intercept and ln(b) as the slope. Hence, for the original regression a=EXP(intercept) and b=EXP(slope).

Figure 16
Finally, the natural exponential function
can be transformed
into
. Using LINEST with
this transformed function means that =LINEST(LN(y-range), x) yields ln(a) as the intercept and b as the slope. Hence, for the original regression a=EXP(intercept) and b=slope.

Figure 17
We can easily extend the above case with its single
independent variable to include multiple independent variables. When the dependent variable is a function of
multiple independent variables the problem is called multiple regression. In mathematical terms,
, where each x
represents an independent variable and the various m’s are the coefficients
from the regression (m0 is the regression constant). One example would be the assessed value of a
commercial building as a function of floor space, number of offices, number of
entrances, and age. This is from the
Excel help file and is duplicated here to provide a sense of continuity between
the help and this document. In addition,
it serves as a stepping stone for the next section, which is a type of
regression that is frequently requested.
In the commercial building example, the y would be the value
of the building and the x’s are the respective independent variables as shown
in Table 3.
|
Variable
|
Refers to the
|
|
y
|
Assessed value of the office building
|
|
x1
|
Floor space in square feet
|
|
x2
|
Number of offices
|
|
x3
|
Number of entrances
|
|
x4
|
Age of the office building in years
|
Table
3
Suppose the value of the commercial building and the corresponding
values of the independent variables are known (see Figure 18).

Figure
18
Hence, the regression equation would be
. In this case, the
regression result of LINEST is shown in Figure
19

Figure
19
The number of columns is one more than the number of
independent variables, the extra column for the regression constant. Left to right the coefficients are m4
to m0. Hence, -234.2372 is
the coefficient of x4, the age of the office building and 27.64139
is the coefficient of x1, the floor space.
For an explanation of the various rows see the earlier
section on ‘Understanding
the result.’
Returning to a recurring theme, does the result make
sense? What is the sign of each of the
coefficients? Do you agree with the
sign? The coefficients for the floor
space, the number of offices, and the number of entrances are all positive (and
based on the observed t-value statistically significant). That means that more entrances (or more floor
space or more offices) means a higher value for the building. The age of the
building, on the other hand, has a negative (and significant) coefficient. That means that the larger the age, i.e., the
older the building, the lower its value.
That makes sense, doesn’t it?
Regression with polynomials
In a regression analysis with polynomials such as
each term is like a different variable, i.e.,
. Written in this form
it becomes apparent that a polynomial regression is no different from a multiple
regression. One could create three
columns of data: one with the x values, the next with the x2 values,
and the last with the x3 values and then use the same LINEST syntax
as for a multiple regression. However, the
mechanics can be simplified thanks to Excel’s array formula capability. For example, the solution to a third order (cubic)
polynomial regression would be the array formula =LINEST(y-range,
x-range^{1,2,3}). The last expression
takes the x-range and converts it into three separate vectors,
,
, and
, that are then used for the multiple regression. Using the data of Figure 13, the result of the LINEST function is in Figure 20. Again, one
can compare the result of the LINEST function with the chart trendline of Figure 3.

Figure
20
Regression with indicator (dummy) variables
There are many instances when an independent variable does
not contain numeric values but rather categorical values. A simple example is the variable Gender, which can contain only one of
two values M or F.
In such a case, one represents each categorical value with an Indicator (or Dummy) variable that is either 1 or 0
depending on the category value. For
example, we could introduce a variable M that would be 1 if Gender equals
M. Otherwise, it would be zero.
While it might appear
that one should also have another variable that is 1 when Gender equals F, it
is not really required and should not be created. After all, M=0 implies F=1, just as M=1
implies F=0. As a rule, the number of
indicator variables will be one less than the number of values that the
category variable can take.
Another example: imagine an independent variable, Season,
that could equal Spring, Summer, Fall, or Winter. We
would need only three Indicator variables, say, Spring, Summer, and Fall. Winter would be implied by all three indicator
variables being zero.
Sometimes, what might appear to be a numeric variable is
actually a categorical variable. Most
businesses break up the year into four quarters, typically numbered 1, 2, 3,
and 4. While this might make it look
like the variable is numeric it is actually a category variable. After all, the quarters could just as easily
be labeled A, B, C, and D. Essentially,
one must be careful to identify and treat a categorical variable correctly.
This issue of correctly identifying a category variable
shows up in the next example.