The report shows the strengths and weaknesses of two methods, the global and the local method. It also illustrates the calculation of each method's prediction error; based on the calculated errors we then choose the best-fitting method.

1 Introduction

This report is an extension of the group report, whose main purpose is to find a suitable model for given data. In this report we discuss the global and local methods further. Sections two and three illustrate the strengths and weaknesses of the two methods, including some cases in which the global method or the local method will face problems; choosing the wrong method may affect the accuracy of the prediction outputs. In section four we introduce the error calculations that we consider when choosing a suitable method. We then compare the two methods using the errors calculated on the same training data.

2 Global Method – Linear Regression

The global method mentioned in the group report is Linear Regression. We predict the response variable Y (this will be the output throughout the project) using the regression model below:

Ŷ = β̂₀ + β̂₁X₁ + ⋯ + β̂ₙXₙ + ε   (1)

where X₁, …, Xₙ are the explanatory variables (these will be the inputs throughout the project).
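As an illustrative sketch (with synthetic data, not the report's data set), the coefficients in equation (1) can be estimated by ordinary least squares:

```python
import numpy as np

# Synthetic, nearly linear data: Y = 2 + 3*X1 + noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones(50), x1])   # design matrix with an intercept column
y = 2.0 + 3.0 * x1 + rng.normal(0, 0.5, 50)

# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                     # fitted values, equation (1) without the noise term
print(beta_hat)                          # close to the true coefficients [2, 3]
```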

2.1 Strengths of Linear Regression

Linear regression shows optimal results when the relationship (captured by the coefficients βᵢ) between the inputs (independent variables) and the output (dependent variable) is almost linear (where inputs may also appear as squared (Xᵢ²) or cubed (Xᵢ³) terms, etc.).

It also has the ability to identify outliers, which are observations that do not follow the pattern of the other observations, such as points which lie far from the regression line on the regression plot.

2.2 Weaknesses of Linear Regression

Linear Regression only considers a linear relationship between the inputs and output. That is, it assumes the training data follow a straight line when plotted. Hence it is often inappropriately used to model non-linear relationships, for instance the income and the age of people, where the relationship between them is curved.

It only considers the mean of the output. However, sometimes we need to look at the extreme cases of the output. For example, when measuring the relationship between the birth weight of babies and the ages of their mothers, babies are at risk and need special care when their weights are too low, so in this example we would want to look at the extreme cases.

It is also very sensitive to outliers; outliers greatly affect the slope of the regression line.

Furthermore, it is easy for the model to learn the training data too specifically and start to model the noise (random error) in the training data rather than just the relationship between the variables. It may then fail to fit additional data. This is referred to as over-fitting, and it most commonly arises when there are too many parameters compared to the amount of training data.

2.3 Failure cases when using Linear Regression

One of the failure cases for the Linear Regression method is when the output measurement is categorical (such as the gender or blood type of a person). If the output is the blood type of a particular person, the output will generally be given as 0, 1, 2 or 3 when the blood type is A, B, AB or O correspondingly (or in a different order). Even if we have a linear relationship between the inputs and the output in the given training data, the output can only be 0, 1, 2 or 3. Hence the method fails as soon as one of the prediction outputs is none of 0, 1, 2 or 3. The red line in figure 1 below shows the actual data and the blue line is the regression line; it clearly shows the prediction outputs are correct only in a few rare cases.
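This failure can be seen numerically: a least-squares line fitted to category codes returns fractional values that are not valid categories. A small sketch with hypothetical codes (the data below is invented for illustration):

```python
import numpy as np

# Hypothetical blood-type codes (0=A, 1=B, 2=AB, 3=O) against a numeric input.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 3.0])   # categorical outputs coded as numbers

slope, intercept = np.polyfit(x, y, 1)         # least-squares straight line
predictions = slope * x + intercept
print(predictions)                             # fractional values, not valid codes 0-3
```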

Figure 1: Output in categorical form

Another common failure case is when the given training data is small, the relationship between the inputs and the output in the training data is linear, but the additional data lies far outside the range of the inputs in the training data and does not follow the pattern. This may occur in the example mentioned above, the income and the age of people, where the relationship between them is curved. In figure 2, the training data is the first half of the curved red line, and the second half is the test data. The figure clearly shows the test data will not follow the pattern of the regression line (blue line).


Figure 2: relationship between income(X) and age(Y)

3 Local method – k-Nearest Neighbours method

The local method mentioned in the group report is the k-NN method. We predict the output Y(x) using the outputs of the k points nearest to x, as in the function below:

Ŷ(x) = (1/k) Σ_{xᵢ ∈ N_k(x)} yᵢ   (2)

where N_k(x) is the neighbourhood of x, defined by the k nearest points xᵢ in the training data.
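Equation (2) can be sketched directly; this is an illustrative one-dimensional implementation (not the R code used later in the report):

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k):
    """Average the outputs of the k training points nearest to x0 (equation 2)."""
    distances = np.abs(X_train - x0)     # distances in one dimension
    nearest = np.argsort(distances)[:k]  # indices of the k nearest neighbours
    return y_train[nearest].mean()

X_train = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y_train = np.array([1.0, 2.0, 3.0, 20.0, 22.0])
print(knn_predict(2.1, X_train, y_train, k=3))  # averages y at x = 1, 2, 3 -> 2.0
```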

3.1 Choosing optimum k for k-NN

As shown in figure 3, for k = 1, …, 5, Z gets classified correctly as class 2; for k greater than 5 the classification of Z is wrong (class 1).

Figure 3: Choosing optimum k

Generally, what value of k will give the optimal prediction output? The value of k is extremely dependent on the training data. It should be large enough that the noise does not affect the prediction strongly, but also small enough that only the nearby neighbours are included. One way of choosing the optimal k is by minimizing the Mean Squared Error (MSE), that is, solving ∂MSE/∂k = 0 (see reference 3), where the MSE measures the mean error between Y and Ŷ:

MSE = (1/n) Σ_{i=1}^n (Yᵢ − Ŷᵢ)² = (1/n) Σ_{i=1}^n (Yᵢ − (1/k) Σ_{xⱼ ∈ N_k(xᵢ)} yⱼ)²   (3)

An alternative way of choosing k is k = √N, where N is the size of the training data; this is referred to as the general rule of thumb for choosing k. However this is not useful in practice, as can be seen in the example in a later section.
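Since k is an integer, in practice the minimisation is done by a direct sweep over candidate values rather than by solving ∂MSE/∂k = 0 analytically. A sketch of one simple variant, evaluating the MSE on a held-out split of synthetic data (invented for illustration):

```python
import numpy as np

def knn_mse(k, X_tr, y_tr, X_val, y_val):
    """Mean squared validation error of k-NN regression on 1-D inputs."""
    preds = []
    for x0 in X_val:
        nearest = np.argsort(np.abs(X_tr - x0))[:k]
        preds.append(y_tr[nearest].mean())
    return float(np.mean((y_val - np.array(preds)) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 60)
y = 2.0 * X + rng.normal(0, 1.0, 60)           # nearly linear with noise
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

# Sweep candidate k values and keep the one with the smallest held-out MSE.
errors = {k: knn_mse(k, X_tr, y_tr, X_val, y_val) for k in range(1, 16)}
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```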

3.2 Strengths of k-NN

k-NN is an easy and straightforward method: complex concepts can be predicted using simple procedures. Unlike the Linear Regression method, it can be applied to training data from any distribution. It is also robust to noisy training data. Furthermore, it is a good classifier when the amount of training data is large.

3.3 Weaknesses of k-NN

One weakness is that the method needs to determine the value of the parameter k. As mentioned above in section 3.1, choosing the right k may be difficult.

Also, the method does not learn from the training data; it only uses the training data itself for prediction. This is why it is sometimes called a lazy learner.

Furthermore, the method's computation cost is quite high when the amount of training data is large. Suppose we have n training points, each of dimension d. Computing the distance to one training point has computational complexity O(d); finding the single nearest neighbour has complexity O(nd); finding the k nearest neighbours has complexity O(knd). Thus, as mentioned above, the computation cost is high as n → ∞. But unfortunately we need a large amount of training data for k-NN to work well.

3.4 Failure case when using k-NN

One of the failure cases of k-NN is in high dimensions, as mentioned in the group report. The method depends greatly on the distances between points. As the number of dimensions increases, the distances become less representative; this is called the curse of dimensionality.

4 Models Selection

4.1 Bias, Variance and Model Complexity

The loss function is for measuring the error between Y and f̂(X); we denote it L(Y, f̂(X)). The most common choices of loss function are the Squared Error and the Absolute Error, both of which measure the quality of the estimator.

Squared Error = (Y − f̂(X))²   (4)

Absolute Error = |Y − f̂(X)|   (5)

The test error is the average error that we incur on new data, showing how well the estimated model f̂ will do on future data that is not in the training data:

Err_τ = E[L(Y, f̂(X)) | τ]   (6)

where both X and Y are drawn randomly from their joint distribution. Here the data set τ is fixed, and the test error refers to the error for this specific data. We want to know the test error of our estimated model f̂, but it is impossible to calculate. A straightforward estimate of the test error is the training error, the average loss over the whole training data. This is referred to as the Mean Squared Error (MSE) or the Mean Absolute Error (MAE) when taking the squared error or the absolute error as the loss function:

err = (1/N) Σ_{i=1}^N L(yᵢ, f̂(xᵢ))   (7)

Unfortunately the training error does not estimate the test error well, because it does not properly account for model complexity. Model complexity is a measure of how hard it is to learn from the data; when two models fit the existing data equally well, the model with lower complexity will give lower error on future data. The training error tends to decrease consistently as the model complexity increases. However, if the model is too complex, it will over-fit: the model adapts itself too closely to the training data, and will not generalize well (having large test error). In contrast, if the model is not complex enough, it will not even do well on the training data; such a model is said to be under-fit, may have large bias, and again will usually generalize poorly (see reference 1).
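The training error with the squared-error loss, as in equation (7), is straightforward to compute; a minimal sketch:

```python
import numpy as np

def training_error(y, y_hat):
    """Average squared-error loss over the training data (equation 7 with L = squared error)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

print(training_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # (0.25 + 0 + 0.25) / 3
```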

If we assume that Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ²_ε, the expected test error of a regression fit f̂(X) at an input point X = x₀, using the squared-error loss, can be derived as:

Err(x₀) = E[(Y − f̂(x₀))² | X = x₀]
        = σ²_ε + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]²   (8)
        = σ²_ε + Bias²(f̂(x₀)) + Var(f̂(x₀))

Unlike the other two terms, the first term in the equation is uncontrollable: it is the variance of the target around its true mean f(x₀), and it cannot be avoided unless σ²_ε = 0. The second term is the squared bias, the amount by which the expected value of the estimate differs from the true mean. The third term is the variance, the expected squared deviation of f̂(x₀) around its mean.
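The decomposition in (8) can be checked numerically by Monte Carlo simulation: repeatedly drawing training data, refitting, and comparing the simulated expected squared error at x₀ against the sum of noise, squared bias, and variance. A sketch, assuming the simplest possible estimator (the sample mean at x₀; the parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
f_x0, sigma, n_train, n_sims = 5.0, 1.0, 10, 20000

# Each simulation: draw a training sample at x0, estimate f(x0) by its sample mean,
# then evaluate the squared error against a fresh test observation Y = f(x0) + eps.
estimates = np.array([rng.normal(f_x0, sigma, n_train).mean() for _ in range(n_sims)])
y_test = rng.normal(f_x0, sigma, n_sims)
err = np.mean((y_test - estimates) ** 2)          # simulated Err(x0)

noise = sigma ** 2                                # irreducible sigma_eps^2
bias_sq = (estimates.mean() - f_x0) ** 2          # squared bias (near 0 here)
variance = estimates.var()                        # Var f_hat(x0) (near sigma^2 / n_train)
print(err, noise + bias_sq + variance)            # the two quantities roughly agree
```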

For a linear regression fit f̂_p(x) = xᵀβ̂, the test error will be

Err(x₀) = E[(Y − f̂_p(x₀))² | X = x₀]
        = σ²_ε + [f(x₀) − E f̂_p(x₀)]² + ‖h(x₀)‖² σ²_ε   (9)

where h(x₀) = X(XᵀX)⁻¹x₀. From the group report, we know that β̂ = (XᵀX)⁻¹XᵀY. Therefore f̂_p(x₀) = x₀ᵀ(XᵀX)⁻¹XᵀY, and hence Var(f̂_p(x₀)) = ‖h(x₀)‖² σ²_ε. While this variance changes with x₀, its average (when x₀ is taken to be each of the training values xᵢ) is (p/N) σ²_ε, thus

(1/N) Σ_{i=1}^N Err(xᵢ) = σ²_ε + (1/N) Σ_{i=1}^N [f(xᵢ) − E f̂(xᵢ)]² + (p/N) σ²_ε   (10)

The model complexity is directly related to the number of parameters (p).

For the k-NN regression fit, the test error has a simpler form:

Err(x₀) = E[(Y − f̂_k(x₀))² | X = x₀]
        = σ²_ε + [f(x₀) − (1/k) Σ_{l=1}^k f(x₍l₎)]² + σ²_ε / k   (11)

The model complexity is controlled by the number of neighbours k, and it is inversely related to k. The bias term will likely increase with k: for small k, the values f(x₍l₎) will be close to f(x₀), hence their average should be close to f(x₀); as k increases, the neighbours become further away, and the prediction may be far off. The variance term is the variance of an average, and it decreases as k increases.

More generally, for both models, as the model complexity increases, the variance term tends to increase and the bias term tends to decrease.

4.2 Cross Validation

Perhaps the most widely used method in model selection for estimating the test error of a predictive model is Cross-Validation, and the most commonly used form is K-Fold Cross-Validation.

It is ideal when the data is large enough, as K-fold cross-validation uses part of the available data to fit the model, and another part to test it. We split the data into K roughly equal-sized parts; for example, when K = 5 (typically we use K = 5 or 10), we split the available data into 5 roughly equal-sized parts, take the kth part to be the validation data for testing the model, fit the model on the other K − 1 (in this case 4) parts of the data, and calculate the prediction error of the fitted model when predicting the kth part. Doing this for k = 1, 2, …, K, we get the cross-validation error below, where f̂^(−k(i))(xᵢ) is the fitted function computed with the kth part removed:

CV = (1/N) Σ_{i=1}^N L(yᵢ, f̂^(−k(i))(xᵢ))   (12)

In the case when K = N, this is called leave-one-out cross-validation.
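The K-fold procedure above can be sketched as follows (a generic implementation, not the R code used in the example below):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, K=5, seed=0):
    """K-fold cross-validation estimate of the test error with squared-error loss."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)                 # K roughly equal-sized parts
    sq_errors = np.empty(n)
    for fold in folds:
        train = np.setdiff1d(idx, fold)            # the other K-1 parts
        model = fit(X[train], y[train])
        sq_errors[fold] = (y[fold] - predict(model, X[fold])) ** 2
    return float(sq_errors.mean())

# Usage with a simple least-squares line fit on synthetic data.
fit = lambda X, y: np.polyfit(X, y, 1)
predict = lambda coef, X: np.polyval(coef, X)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 30)
y = 1.0 * X + 100.0 + rng.normal(0, 2.0, 30)
print(k_fold_cv_error(X, y, fit, predict, K=5))
```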

4.3 Models comparison


To compare the two models, we illustrate with an example. The example contains training data of size 30; we want to show the relationship between age (X) and blood pressure (Y) in the training data (see reference 6 for the training data). All the training errors mentioned in this example take the squared error as the loss function, so the training error we are using is the MSE.

We fit the training data using Linear Regression in R; the example is fitted as a simple linear regression:

Ŷ = β̂₀ + β̂₁X + ε   (13)

The estimated coefficients are β̂₀ = 98.7147 and β̂₁ = 0.9709, and the training error is 279.7815. From β̂₁ we can see that age and blood pressure have a strong linear relationship.


Figure 4: Linear Regression

According to the general rule of thumb for choosing k seen before, the size of the training data is 30 (√30 ≈ 5.4772), hence we consider the two cases k = 5 and k = 6. Fitting the training data in R, when k = 5 the training error is 340.4787; when k = 6 the training error is 284.6417. Both predictions are shown in the figures below, where the solid red circles are the observed outputs and the blue circles are the prediction outputs.

Figure 5: 5-NN Regression
Figure 6: 6-NN Regression

By comparing the training errors, we find that in this example Linear Regression could be said to be more accurate than both k-NN methods. One possible reason is that the training data is not large enough. However, as stated previously, the training error is not a good estimate of the test error; therefore it may not be true on additional data that the Linear Regression model is more accurate.
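The comparison can be reproduced in outline, with synthetic data standing in for the blood-pressure data of reference 6 (the numbers below will therefore differ from the report's):

```python
import numpy as np

rng = np.random.default_rng(4)
age = rng.uniform(20, 70, 30)
bp = 98.7 + 0.97 * age + rng.normal(0, 15, 30)   # synthetic stand-in for the training data

# Linear regression training error (MSE).
coef = np.polyfit(age, bp, 1)
mse_lr = np.mean((bp - np.polyval(coef, age)) ** 2)

# k-NN training error for k = 5 and k = 6; each point's own output is
# included among its neighbours, as when predicting the training data itself.
def knn_train_mse(k):
    preds = [bp[np.argsort(np.abs(age - a))[:k]].mean() for a in age]
    return np.mean((bp - np.array(preds)) ** 2)

print(mse_lr, knn_train_mse(5), knn_train_mse(6))
```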

Now we consider using 5-fold cross-validation in R. For the Linear Regression, the MSE for each fold lies in the range between 63.1 and 1129, and the overall MSE (the cross-validation estimate of the test error) is 297. The cross-validation produced the plot below.


Figure 7: 5-Fold Cross-Validation for Linear Regression

For 5-NN, the cross-validation estimate of the test error is 373.1771; for 6-NN, it is 346.0542. In this example all three training errors differ greatly from the cross-validation estimates of the test error, which verifies the statement in section 4.1 that the training error is not a good estimate of the test error.

Would the prediction error be smaller in this example if we chose another value of k instead of using the general rule of thumb? Figure 8 shows that the training error increased as k increased, so k = 5, 6 is not the optimal choice of k. This confirms the statement in the previous section that the general rule of thumb is not useful in practice.

Figure 8: MSE against k

5 Conclusion

The Linear Regression method relies on a linear relationship (through the coefficients βᵢ) between the inputs and output of the data. The k-NN method does not restrict the relationship between the inputs and output, but the data size needs to be large for the method to work well. Choosing the value of k for k-NN is important, as it strongly affects the model's prediction accuracy.

In practice it is hard to say precisely which model is the better prediction model, since we cannot measure the test error, and the most obvious estimate of the test error (the training error) does not account for model complexity and hence is not reliable. However, we can estimate the test error using cross-validation, which gives more reliable values that we can compare in order to choose the right model.