Supervised

Learning Coursework 2, GI01/M055, 2017/2018

Solution to

Question 1 :

(a)

·

A prior distribution – probability distribution over a function dp(f)

·

An additive noise model – inaccuracy in the output observations.

The Noise for each measure assumed to be independent.

(b)

According

to the Bayes theorem – Posterior probability of a parameters (f) given an input

data (X) is the product of the Likelihood function and Prior probability dP(f) (prior knowledge before performing the

experiment) divide by probability of input data.

Let

us assume the inputs data x = {Xi, where i = 1,2,3…..N} and its

corresponding random function of f = { fi, where i = 1,2,3…..N }

dPpost(f|X,

y) = (dP(f) P(y|X, f))/p(y|X)

We

can eliminate the denominator p(y|X) since there is no dependency on the

parameter ‘f’.

Therefore:

dPpost(f|X,

y) = (dP(f) P(y|X, f))/p(y|X) ? dP(f) P(y|X, f)

dPpost(f)

= dP(f) P(y|X, f))

(c)

We

know that in the case of Gaussian process regression the prior and additive

noise model are Gaussian.

Let

=

, where

P(y|X,

f)) = ,

where X Î and y Î

The

data corrupted by likelihood function with zero mean and some variance

Prior

is given by dP(f) for symmetrical Gaussian distribution – centred at the origin

over the weight vector w.

(d)

For

a finite collections of random variable stochastic Gaussian process has

multivariate normal distribution while for infinite random variable it has a

joint distribution. In our case, it is a

1-dimensional Gaussian(normal) distribution know as marginal distribution.

(e)

Marginal

distribution is the integral of prior multiplied by additive noise model.

Let

us assume the inputs data x = {Xi, where i = 1,2,3…..N}

its

corresponding random function of f = { fi, where i = 1,2,3…..N } and

output v alues yi …… yn

P(y|X,

f)) =

The

posterior distribution:

By

manipulating the above expression, we can get the mean and covariance matrix.

The

argmin of the weight vector( ) would maximise the posterior i.e it would

give us the mean of the distribution.

is the mean distribution

,

where is standard optimiser,

is the identity matrix which is

And the covariance can be expressed as:

(f)

The

full Bayesian inference adds up all the posterior distribution output and then

takes the average as its output. The maximum a posteriori probability (MAP) picks

a weight vector which maximises the posterior distribution. If the prior and additive noise model are

Gaussian then the Posterior is also a Gaussian. In linear Gaussian, the

convergence of the posterior mean and the MAP estimator coincide. The sample mean of random linear function is equivalent

to the MAP mean of Gaussian for any value of standard deviation.

Since

we have prior knowledge about pervious posterior distribution, we can use it to

enhance the MAP hence we can get the complete posterior distribution – as shown

above in part ‘e’ the Gaussian have a mean of in high dimension which is more accurate than sample mean

from the linear function.

Solution to

Question 2 :

(a)

When the value of a =b =1, the Beta distribution

is equivalent to uniform distribution. If we integrate B(a,b) and compute in interval of

0,1 we get 0.5 mean and a /a +b gives 0.5 mean as well.

(b)

Let X be the data X =

{1,….n}

H ={h:i?X ?0,1}=0,1X

Prior given as, P(h) =

and noise model P(y = 1|h,(i,y)) = h(i) and P(y = 0|h,(i,y)) = 1?

h(i)

Posterior distribution is

the proportional to the product of prior and noise model.

Prior of the beta

distribution is given by

P(h) =

Ppost =

,

This means

Beta is conjugate to the Bernoulli.

(c)

The mode is equivalent to the

maximum a posterior distribution, so

Mode of the posterior

The maximum

likelihood estimate is the empirical probability, i.e. ,

Mean of the

posterior E(p|X) =

For uniform distribution

where a =b =1

P(y = 1|h,(i,y)) = h(i) = 0.3

P(y = 0|h,(i,y)) = 1-h(i) = 0.7

Therefore,

Mean = 0.4333

Mode = = 0.3

Solution to

Question 3 :

(a)

Overfitting or High variance is when the learned hypothesis fits

training data well but fails to generalize for unseen dataset. The model might

be too wiggly which is not a good model to predict future(test) dataset. In

other words, the predictor is too complex and not consistent for unseen data.

(b)

Empirical risk minimization(ERM) – the learner algorithm gets X training

data input and model function H which has an output predictor H(x) from unknown

distribution D. The goal of the learner algorithm is to minimize hypothesis H*

for fixed function H for which E(L(H(x), y)) is minimum. For such unknown distribution, we minimize

the empirical risk over the given training dataset such technique in statistical

learning algorithm known as Empirical risk minimization(ERM.)

(c)

Structural Risk Minimization(SRM) is on finite dataset, the task

of selecting best model class function within a nested family of space . The model may be too complex at HQ and poor at generalizing it to a new

dataset which would consequent an overfitting problem. The SRM addresses

overfitting problem by solving such problem by balancing the model’s complexity

and training data error.

Let in Hq minimizer be

As we increase q, the gets reduced, the function increase in between we get a

balanced model.

(d)

Regularization addresses overfitting problem by penalizing complex

learning algorithm – it selects one complex hypothesis class and adds regularizer function to empirical error to prevent overfitting.

Our goal is to choose best learning algorithm among many possible

hypothesis space – the regularization parameter can be chosen best in case of ridge regration or best in .

If we have large dataset – we can split the dataset into three

different parts:

Training dataset – where we model our functions using different

learning algorithms.

Development or Validation dataset – on this we tune the parameters

on held out dataset (development or validation set.)

Test

dataset – to test best model we have developed on the training dataset.

(e)

The kernel trick enables us to work with linear function in

infinite dimensional feature space.

subject to yi?w,xi??1??i,?i ?0, ?i = 1,…,m.??

Hinge loss function is defined as:

Basically, we minimize the hinge loss while minimizing the norm of

the weight vector(w) that corresponds to maximising the margin. Maximising the

margin gives a good generalization learning algorithm even though we have a

high dimensional feature space. The nonnegative xi(?i) measures the hinge loss – computes

the amount by which fails to meet a margin of 1.

(f)

By changing the hinge loss function, we can create a linear

programming boosting. The primal

subject to yiHia?p??i,?i ?0, ?

= 1,…,m.?

and the dual would have the

following form:

;

subject to

where is the distribution,

is additional weak learner,

Optimal solution can

be achieved by doing the dual programming iteratively. We assume the set of weak

learners are finite. The boosting algorithm would give optimum error bound.

We might have more than one weak learner’s optimal solution which

are approximate to the cost function.

Solution to

Question 4 :

(a)

In statistical learning theory generalisation error is a measure how

well a predictive of a learning algorithm perform on unseen dataset. The expected error is approximately equivalent

to the empirical error and the error reduces when we increase the dataset. Since

the generalisation of the learning algorithm is done using randomly drawn from

a finite independent and identically distributed samples the model might be

sensitive. One way of overcoming this problem is by relating the

training(given) error to the generalisation error – which possibly would avoid

overfitting.

There are many theories for

generalisation error – each theory has strengths and weaknesses.

(b)

For a fixed size of ‘m’

training datasets

None linearity in the lost

function would give a small difference in the average of the distribution.

Analysing the mean using

traditional statistics

,

If we use consistency of

classification method for big dataset

where if p(x,1)>p(x,0), otherwise

This might not be a good solution when we have small sample size.

We are taking limit m tends to infinity so for small m the Bayes risk might

fail.

The 95th percentile is 0.15 which is above the mean value 0.08.

This means the probability we have been misled by 0.05 which is less than the

mean value.

(c)

Structural risk minimization(SRM) is a best learning algorithm selection

procedure – choosing the model that minimizes VC upper bound on risk.

,

For fixed datasets, as the complexity of the learning algorithm C(Hi)

increases, the training error (eˆrr(c)) will decrease but fails to generalize for unseen dataset. On the other hand, as the VC dimension

increases, will increase since this depends on the VC

dimension. SRC chooses a best model from the sequence of hypothesis ( which minimizes the right-hand side of the

above inequality.

As the complex models might over fit and at the same time the more

training set we have the lower empirical error will be. SRC resolves the trade-off

by providing a measure of characterization between complexity and empirical

error – by

balancing the model’s complexity and training set (empirical) error.

(d)

As discussed above(b) is the probability that the training data misled.

Assuming

Hk is finite and confidence at least 1-, where is small positive number, we can compute the likelihood

of being misled by evaluating training error equals to zero. In our case,

Since the functions are drawn

randomly from a distribution we must add prior weight. By applying Hoeffding’s inequality

to the randomly drawn probability we can get the generalization error is

bounded – for model k.

,

Reference:

I have used the lecture notes

for this assignment.

https://moodle.ucl.ac.uk/course/view.php?id=11543