This report presents the strengths and weaknesses of two methods, a global and a local method. It also illustrates how each method's prediction error is calculated; based on the calculated errors we then choose the better-fitting method.
1 Introduction
This report is an extension of the group report, whose main purpose was to find a suitable model for a given data set. Here we discuss the global and the local method in more detail. Sections two and three illustrate the strengths and weaknesses of the two methods, including cases in which the global or the local method runs into problems; choosing the wrong method may reduce the accuracy of the prediction outputs. In section four we introduce the error calculations that we consider when choosing a suitable method, and we then compare the two methods using the errors calculated on the same training data.
2 Global Method – Linear Regression
The global method mentioned in the group report is linear regression. It predicts the response variable $Y$ (the output throughout this project) using the regression

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_n X_n + \epsilon \qquad (1)$$

where $X_1, \dots, X_n$ are the explanatory variables (the inputs throughout this project).
2.1 Strengths of Linear Regression
Linear regression gives optimal results when the relationship (captured by the coefficients $\beta_i$) between the inputs (independent variables) and the output (dependent variable) is almost linear (where inputs may also be transformed, e.g. squared ($X_i^2$) or cubed ($X_i^3$)).
It also has the ability to identify outliers: observations which do not follow the pattern of the other observations, such as points which lie far from the regression line on the regression plot.
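To make the outlier idea concrete, here is a minimal sketch (in Python, though the report's own computations use R) that fits a simple least-squares line and flags points whose residual is unusually large. The data, function names, and the 2-standard-deviation threshold are all illustrative assumptions, not from the report.

```python
# Hypothetical sketch: fit a simple linear regression by least squares and
# flag outliers as points whose residual is far larger than the rest.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def flag_outliers(xs, ys, threshold=2.0):
    """Indices of points whose residual exceeds `threshold` residual SDs."""
    a, b = fit_simple_ols(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r ** 2 for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 4.0, 6.2, 7.9, 30.0, 12.1]   # the point at index 4 breaks the pattern
print(flag_outliers(xs, ys))            # flags index 4
```

Points lying far from the fitted line show up as large residuals, matching the description above.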
2.2 Weaknesses of Linear Regression
Linear regression only considers a linear relationship between the inputs and the output. That is, it assumes the training data follows a straight line when plotted. Hence it is often inappropriately used to model non-linear relationships; for instance, the relationship between income and age is curved.
It also only models the mean of the output, whereas sometimes we need to look at the extreme cases of the output. For example, when measuring the relationship between the birth weight of babies and the age of the mother, babies are at risk and need special care when their weight is too low, so in this example we would want to look at the extremes rather than the mean.
It is also very sensitive to outliers, which can greatly affect the slope of the regression line.
Furthermore, it is easy for the model to learn the training data too specifically and start to model the noise (random error) in the training data rather than just the relationship between the variables; it may then fail to fit additional data. This is referred to as over-fitting, and it most commonly arises when there are too many parameters compared to the number of observations.
2.3 Failure case when using Linear Regression
One failure case for linear regression is when the output is measured in categorical form (such as the gender or blood type of a person). If the output is the blood type of a person, it will typically be coded as 0, 1, 2 or 3 for blood types A, B, AB or O respectively (or in a different order). Even if the training data shows a linear relationship between the inputs and the output, the output can only take the values 0, 1, 2 or 3. Hence the method fails as soon as a prediction output is not one of 0, 1, 2 or 3. The red line in figure 1 below shows the actual data and the blue line is the regression line; the figure clearly shows that the prediction outputs are correct only in a few rare cases.
Figure 1: Output in categorical form
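This failure mode can be sketched in a few lines (Python here, with made-up data rather than anything from the report): coding blood types as 0–3 and fitting a least-squares line yields fractional predictions that are not valid codes.

```python
# Hypothetical illustration: blood-type codes regressed on a numeric input.
# The fitted line returns fractional values, which are not valid categories.

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

codes = {0: "A", 1: "B", 2: "AB", 3: "O"}   # categorical output, coded 0-3
xs = [10, 20, 30, 40, 50]                   # made-up numeric inputs
ys = [0, 1, 1, 2, 3]                        # made-up blood-type codes
a, b = fit_simple_ols(xs, ys)
preds = [a + b * x for x in xs]
print(preds)  # mostly fractional values, so not valid category codes
```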
Another common failure case is when the given training data is small, the relationship between the inputs and the output in the training data is linear, but the additional data lies far outside the range of the inputs in the training data and does not follow the same pattern. This can occur in the example mentioned above: the relationship between income and age is curved. In figure 2, the training data is the first half of the curved red line, and the second half is the test data. The figure clearly shows that the test data does not follow the pattern of the regression line (blue line).
Figure 2: relationship between income(X) and age(Y)
3 Local Method – k-Nearest Neighbours Method
The local method mentioned in the group report is the k-NN method. It predicts the output $Y$ at $x$ using the outputs of the $k$ training points nearest to $x$, as in the function below:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (2)$$

where $N_k(x)$ is the neighbourhood of $x$, defined as the $k$ nearest points $x_i$ in the training data.
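Equation (2) can be sketched directly (Python here, though the report works in R; the one-dimensional toy data is made up):

```python
# Minimal sketch of the k-NN regression in equation (2): predict Y at x as
# the average of the outputs of the k training points nearest to x.

def knn_predict(x, train, k):
    """train is a list of (x_i, y_i); average the y of the k nearest x_i."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

train = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]
print(knn_predict(2.6, train, k=3))   # averages the y-values at x = 3, 2, 4
```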
3.1 Choosing optimum k for k-NN
As shown in figure 3, for k = 1, …, 5, Z is classified correctly as class 2; for k greater than 5, Z is classified incorrectly (as class 1).
Figure 3: Choosing optimum k
In general, which value of k gives the optimal prediction output? The value of k is extremely dependent on the training data. It should be large enough that noise does not strongly affect the prediction, but small enough that only nearby neighbours are included. One way of choosing an optimal k is to minimise the Mean Squared Error (MSE), that is, to solve $\partial \mathrm{MSE}/\partial k = 0$ (see reference 3), where the MSE measures the mean error between $Y$ and $\hat{Y}$:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \frac{1}{k} \sum_{x_j \in N_k(x_i)} y_j \Big)^2 \qquad (3)$$

An alternative way of choosing k is $k = \sqrt{N}$, where N is the size of the training data; this is referred to as the general rule of thumb for choosing k. However, it is not useful in practice, as the example in a later section shows.
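A hedged sketch of the MSE-based selection in equation (3): compute the in-sample MSE for a range of candidate k and take the minimiser. The toy data is made up, and note that in-sample the minimum always lands at k = 1, since each training point is its own nearest neighbour; this is one reason training error alone is an unreliable guide.

```python
def knn_predict(x, train, k):
    """Average the outputs of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def training_mse(train, k):
    """Equation (3): mean squared error of k-NN over the training data."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in train) / len(train)

train = [(x, 2.0 * x + (-1) ** x) for x in range(1, 11)]   # made-up data
errors = {k: training_mse(train, k) for k in range(1, 6)}
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])   # in-sample the minimum is at k = 1 with MSE 0
```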
3.2 Strengths of k-NN
k-NN is an easy and straightforward method: complex concepts can be predicted using simple procedures. Unlike linear regression, it can be applied to training data from any distribution. It is also robust to noisy training data. Furthermore, it is a good classifier when the amount of training data is large.
3.3 Weaknesses of k-NN
One weakness is that the method needs a value for the parameter k, and as mentioned in section 3.1, choosing the right k may be difficult.
Also, the method does not learn from the training data; it only uses the training data itself for predicting. This is why it is sometimes called a lazy learner.
Furthermore, the method's computational cost is high when the amount of training data is large. Suppose we have n training points, each of dimension d. Computing the distance to one training point has complexity O(d); finding one nearest neighbour has complexity O(nd); finding the k nearest neighbours has complexity O(knd). Thus, as mentioned above, the computational cost is high as n → ∞, yet unfortunately we need a large amount of training data for k-NN to work well.
3.4 Failure case when using k-NN
One failure case of k-NN, mentioned in the group report, is in high dimensions. The method depends greatly on the distances between points, and as the number of dimensions increases, the distances become less representative; this is called the curse of dimensionality.
4 Model Selection
4.1 Bias, Variance and Model Complexity
The loss function measures the error between $Y$ and $\hat{f}(X)$; we denote it $L(Y, \hat{f}(X))$. The most common choices of loss function are the squared error and the absolute error, both of which measure the quality of the estimator:

$$\text{Squared Error} = (Y - \hat{f}(X))^2 \qquad (4)$$
$$\text{Absolute Error} = |Y - \hat{f}(X)| \qquad (5)$$

The test error is the average error that we incur on new data, showing how well the estimated model $\hat{f}$ will do on future data that is not in the training data:

$$\mathrm{Err}_\tau = E\big[L(Y, \hat{f}(X)) \mid \tau\big] \qquad (6)$$

where both X and Y are drawn randomly from their joint distribution. Here the data set $\tau$ is fixed, and the test error refers to the error for this specific data set. We want to know the test error of our estimated model $\hat{f}$, but it is impossible to calculate. A straightforward estimate of the test error is the training error, the average loss over the whole training data:

$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)) \qquad (7)$$

This is referred to as the Mean Squared Error (MSE) or the Mean Absolute Error (MAE) when taking the squared error or the absolute error as the loss function.

Unfortunately, the training error does not estimate the test error well, because it does not properly account for model complexity. Model complexity is a measure of how hard it is to learn from the data; when two models fit the existing data equally well, the model with lower complexity will give lower error on future data. The training error tends to decrease consistently as the model complexity increases. However, if the model is too complex it will over-fit: the model adapts itself too closely to the training data and will not generalize well (it has a large test error). In contrast, if the model is not complex enough, it will not even do well on the training data; it is said to under-fit, and it may also have large bias, again usually generalizing poorly (see reference 1).
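The training error (7) under the two losses (4) and (5) can be sketched directly (Python; the vectors here are made-up illustrations, not the report's data):

```python
# Sketch of the training error (7): the average loss over the training data,
# giving the MSE or MAE depending on which loss function is plugged in.

def training_error(y, f_hat, loss):
    """Average of loss(y_i, f_hat_i) over all training points."""
    return sum(loss(a, b) for a, b in zip(y, f_hat)) / len(y)

squared = lambda y, f: (y - f) ** 2    # loss (4)
absolute = lambda y, f: abs(y - f)     # loss (5)

y     = [3.0, 5.0, 7.0, 9.0]           # made-up observed outputs
f_hat = [2.5, 5.5, 6.0, 9.5]           # made-up fitted values
print(training_error(y, f_hat, squared))   # MSE = 0.4375
print(training_error(y, f_hat, absolute))  # MAE = 0.625
```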
If we assume that $Y = f(X) + \epsilon$ where $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma_\epsilon^2$, the expected test error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using the squared-error loss, can be derived as:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] \\
&= \sigma_\epsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2 \qquad (8) \\
&= \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))
\end{aligned}$$

The first term, unlike the other two, is uncontrollable: it is the variance of the target around its true mean $f(x_0)$ and cannot be avoided unless $\sigma_\epsilon^2 = 0$. The second term is the squared bias, the amount by which the expected value of the estimate differs from the true mean. The third term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.
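Decomposition (8) can be checked numerically. The sketch below (Python, with an arbitrary biased toy estimator; nothing here comes from the report) draws many (Y, f̂) pairs at a fixed x0 and compares the empirical expected squared error with σ_ε² + bias² + variance.

```python
import random

# Numerical check of decomposition (8) with a made-up biased estimator:
# Y = f(x0) + eps, and f_hat adds a fixed bias of 0.5 plus its own noise.
random.seed(0)
f_x0, sigma = 2.0, 1.0
trials = 200_000

ys, fhats = [], []
for _ in range(trials):
    ys.append(f_x0 + random.gauss(0.0, sigma))          # target draw
    fhats.append(f_x0 + 0.5 + random.gauss(0.0, 0.3))   # estimator draw

mean_fh = sum(fhats) / trials
bias_sq = (mean_fh - f_x0) ** 2                          # squared bias
var_fh = sum((fh - mean_fh) ** 2 for fh in fhats) / trials
lhs = sum((y - fh) ** 2 for y, fh in zip(ys, fhats)) / trials
rhs = sigma ** 2 + bias_sq + var_fh                      # sigma^2 + bias^2 + var
print(lhs, rhs)   # the two sides agree up to simulation noise
```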
For a linear regression fit $\hat{f}(x) = x^T\hat{\beta}$, the test error is

$$\mathrm{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] = \sigma_\epsilon^2 + \big[f(x_0) - E\hat{f}(x_0)\big]^2 + \|h(x_0)\|^2 \sigma_\epsilon^2 \qquad (9)$$

where $h(x_0) = X(X^TX)^{-1}x_0$. From the group report we know that $\hat{\beta} = (X^TX)^{-1}X^TY$, therefore $\hat{f}(x_0) = x_0^T(X^TX)^{-1}X^TY$, and hence $\mathrm{Var}(\hat{f}(x_0)) = \|h(x_0)\|^2\sigma_\epsilon^2$. While this variance changes with $x_0$, its average (when $x_0$ is taken to be each of the training values $x_i$) is $\frac{p}{N}\sigma_\epsilon^2$, thus

$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\epsilon^2 + \frac{1}{N}\sum_{i=1}^{N}\big[f(x_i) - E\hat{f}(x_i)\big]^2 + \frac{p}{N}\sigma_\epsilon^2 \qquad (10)$$

Here the model complexity is directly related to the number of parameters p.
For the k-NN regression fit, the test error has a simpler form:

$$\mathrm{Err}(x_0) = E\big[(Y - \hat{f}_k(x_0))^2 \mid X = x_0\big] = \sigma_\epsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Big]^2 + \frac{\sigma_\epsilon^2}{k} \qquad (11)$$

Here the model complexity is controlled by the number of neighbours k, which is inversely related to the model complexity. The bias term will likely increase with k: for small k, the values $f(x_{(l)})$ will be close to $f(x_0)$, so their average should be close to $f(x_0)$; as k increases, the neighbours move further away and the prediction may be far off. The variance term is the variance of an average, and it decreases as k increases.
More generally, for both models, as the model complexity increases the variance term tends to increase and the bias term tends to decrease.
4.2 Cross Validation
Perhaps the most widely used method for estimating the test error of a predictive model is cross-validation, and the most commonly used form is K-fold cross-validation.
It is ideal if the data is large enough, as K-fold cross-validation uses part of the available data to fit the model and another part to test it. We split the data into K roughly equal-sized parts; for example, when K = 5 (typically we use K = 5 or 10), we split the available data into 5 roughly equal-sized parts, take the kth part as the validation data for testing the model, and use the rest as the training data. We then fit the model to the other K−1 (in this case 4) parts of the data and calculate the prediction error of the fitted model when predicting the kth part. Doing this for k = 1, 2, …, K gives the cross-validation error below, where $\hat{f}^{-k(i)}(x_i)$ is the fitted function computed with the kth part removed:

$$\mathrm{CV} = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{-k(i)}(x_i)\big) \qquad (12)$$

In the case K = N, this is called leave-one-out cross-validation.
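A compact sketch of equation (12) in Python; the deterministic fold split and the mean-only "model" are simplifying assumptions to keep the example short, not choices from the report.

```python
# K-fold cross-validation, equation (12): hold out each fold in turn, fit on
# the remaining K-1 folds, and average the loss over all held-out points.

def k_fold_cv(data, K, fit, predict, loss):
    n = len(data)
    folds = [data[i::K] for i in range(K)]   # simple deterministic split
    total = 0.0
    for k in range(K):
        held_out = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        model = fit(train)
        total += sum(loss(y, predict(model, x)) for x, y in held_out)
    return total / n

# a mean-only "model" keeps the sketch tiny
fit = lambda train: sum(y for _, y in train) / len(train)
predict = lambda model, x: model
loss = lambda y, p: (y - p) ** 2

data = [(x, float(x)) for x in range(10)]
cv = k_fold_cv(data, K=5, fit=fit, predict=predict, loss=loss)
print(cv)
```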
4.3 Model Comparison
To compare the two models, we illustrate with an example. The example contains a training data set of size 30, and we want to show the relationship between age (X) and blood pressure (Y) in the training data (see reference 6 for the training data). All the training errors mentioned in this example take the squared error as the loss function, so the training error we use is the MSE.
Fitting the training data using linear regression in R, the example is fitted as a simple linear regression:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \epsilon \qquad (13)$$

The fitted values are $\hat{\beta}_0 = 98.7147$ and $\hat{\beta}_1 = 0.9709$, and the training error is 279.7815. From $\hat{\beta}_1$ we can see that age and blood pressure have a strong linear relationship.
Figure 4: Linear Regression
According to the general rule of thumb for choosing k seen earlier, the size of the training data is 30 ($\sqrt{30} \approx 5.4772$), hence we consider the two cases k = 5 and k = 6. Fitting the training data in R, the training error is 340.4787 when k = 5 and 284.6417 when k = 6. Both predictions are shown in the figures below, where the solid red circles are the observed outputs and the blue circles are the prediction outputs.
Figure 5: 5-NN Regression Figure 6: 6-NN Regression
Comparing the training errors, in this example linear regression could be said to be more accurate than both k-NN fits. One possible reason is that the training data is not large enough. However, as stated previously, the training error is not a good estimate of the test error, so it may not be true for additional data that the linear regression model is more accurate.
Now consider using 5-fold cross-validation in R. For the linear regression, the MSE for each fold lies between 63.1 and 1129, and the overall MSE (the cross-validation estimate of the test error) is 297. The cross-validation produced the plot below.
Figure 7: 5-Fold Cross-Validation for Linear Regression
For 5-NN, the cross-validation estimate of the test error is 373.1771; for 6-NN it is 346.0542. In this example all three training errors differ greatly from the cross-validation estimates of the test error, which verifies the statement in section 4.1 that the training error is not a good estimate of the test error.
Would the prediction error be smaller in this example if we chose another value of k instead of using the general rule of thumb? Figure 8 shows that the training error increases as k increases, so k = 5, 6 is not the optimum choice of k. This confirms what was said in the previous section: the general rule of thumb is not useful in practice.
Figure 8: MSE against k
5 Conclusion
The linear regression method relies on a linear relationship (the $\beta_i$) between the inputs and output of the data. The k-NN method does not restrict the relationship between the inputs and output, but the data set needs to be large for the method to work well. Choosing the value of k for k-NN is important, as it strongly affects the model's prediction accuracy.
In practice it is hard to say precisely which model is the better prediction model: we cannot measure the test error, and the most obvious estimate of it (the training error) does not account for model complexity and hence is not reliable. However, we can also estimate the test error using cross-validation, which gives more reliable values with which to compare the models and choose the right one.