DATA7703, Assignment 3 (Python代写,Algorithm代写,Machine Learning代写,澳洲程序代写,DATA7703代写,University of Queensland代写)

We consider linear combinations of threshold classifiers for classifying real numbers in this question

微信: biyeprodaixie 欢迎联系咨询

本次CS代写的主要涉及如下领域: Python代写,Algorithm代写,Machine Learning代写,澳洲程序代写,DATA7703代写,University of Queensland代写

DATA7703, Assignment 3

2020 Semester 2, due 23:59 6 Oct


Instructions. Please read the instructions carefully — not following them may result in a penalty. (a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment, Assignment 3 to submit. If you don’t know how to convert your file to a PDF, please search  for a guide online. You can submit as many times as you want before the deadline. The last submission will be graded. (b) Write down your name and student number on the first page of your solution report,  and write down the question numbers for your solutions. For programming questions, you are welcome to submit your code files or output files in a separate zip file, but you must include both your code and relevant output in your submitted PDF file. Excessive code output may be penalised. (c) Follow integrity rules, and  provide citation as needed. You can discuss with your classmates, but you are required to write your solutions independently,  and specify who you  have  discussed with in your solution.   If you    do not know how to solve a problem, you can get 15% of the mark by writing down “I don’t know”.

You are encouraged to keep your solutions concise — these questions require thoughts, not long answers.



𝐼(·) takes value +1 if its a{︃rgument is true, and -1 otherwise.  For example, 𝑐() = 𝐼( ≤ 5)

(15 marks) We consider linear combinations of  threshold  classifiers  for  classifying  real  numbers in this question. Specifically, a threshold classifier for classifying real numbers is one of the form 𝑐() = 𝐼( ) where ∘ can be ≤, , < or >, and the indicator function



can be written as 𝑐() =


+1,    � ≤ 5, .

−1,   � > 5.


Given threshold classifiers 𝑐1, . . . , 𝑐, we say that a linear combination 𝑓 = 1𝑐1 + . . . +



𝑐  of  them  represents  a  set  𝑆 R if  and  only  if  the  classifier  𝑐()

satisfies 𝑐() = 𝐼( 𝑆).


=  𝐼(𝑓 ()  ≥ 0)



    1. (5 marks) For any 𝑎 < 𝑏, consider the threshold classifiers 𝑐1() = 𝐼( > 𝑎), 𝑐2() =

𝐼( < 𝑏), and 𝑐3() = 𝐼( < +∞).  What is the set represented by 𝑓 = 𝑐1 +𝑐2 +0.1𝑐3?

    1. (10 marks) Find a linear combination of threshold classifiers to represent two intervals (−2, −1) ∪ (1, 2),

𝑓 with


= 1.  If 𝑌  is the prediction of 𝑓,  then Wagging predicts 𝑌¯



(20 marks) We consider a variant of the standard bagging algorithm,  which we  call Wag-  ging (Weighted Aggregating), in this question. We only consider regression problems here. In the standard bagging algorithm for regression, we train multiple models 𝑓1, . . . , 𝑓 and average their outputs during prediction. In Wagging, we compute a weighted average of the pred∑︀ictions of the individual models instead.  Formally, we assign a weight ∑︀  ≥ 0 to

There are many possible ways to choose 1, . . . , in Wagging, and we consider how

their values affect bias and variance below.


In the remainder of this question, assume that 𝑌1, . . . , 𝑌 are identically distributed with var(𝑌) = 𝜎2 for all , and cov(𝑌, 𝑌) = 𝜌𝜎2 for all 1 ≤ ̸= .

    1. (5 marks) Show that Wagging has the same bias as each individual model.
    2. (5 marks) Express var(𝑌¯ ) in terms of 1, . . . , , 𝜌 and 𝜎2.  In addition, for = 2,

use your formula to evaluate var(𝑌¯ ) when (1, 2) equals to (1, 0), (1 , 1 ) and (0, 1)

2 2






(10 m∑︀arks) For any ≥ 2,  find weights 1, . . . ,  that minimize the variance of



≤    ≤

(30 marks) We consider random forest in this question, and study the effect of , the size of the random feature subset used in choosing the splitting point in the decision trees. Recall that when constructing a decision tree in a random forest, at each node, instead of choosing the best split from all 𝑑 given features, we can first choose 1 𝑑 features, and then choose the best split among them.


    1. (5 marks) Load the California housing dataset provided in sklearn.datasets, and construct a random 70/30 train-test split. Set the random seed to a number of your choice to make the split reproducible. What is the value of 𝑑 here?
    2. (5 marks) Train a random forest of 100 decision trees using default hyperparameters. Report the training and test accuracies. What is the value of used?
    3. (5 marks) Compute all the pairwise correlations between the test set predictions of the 100 trees, and report their average. Correlation refers to the Pearson correlation in this question.
    4. (5  marks)  Repeat  (b)  and  (c)  for   =  1  to  𝑑,  and  tabulate  the  training  and  test accuracies, and the average correlations for all values. In addition, plot the training and test accuracies against in a single figure, and plot the average correlation against in another figure.
    5. (5 marks) Describe how the average correlation changes as increases. Explain the observed pattern.
    6. (5 marks) A data scientist claims that we should choose such that the average correlation is smallest, because it gives us maximum reduction in the variance, thus maximum reduction in the expected prediction error. True or false? Justify your answer.




(35 marks) We consider bagging using the OOB error to automatically determine the number of basis models to use in this question. Specifically, we keep adding one model trained on a bootstrap sample to the bagging ensemble at a time,  and stop only when  one of the following happens: (a) the number of models in the ensemble reaches a


pre-specified maximum number of models, or (b) is at least 10, and










Here the smoothed OOB error 𝜖¯𝑂𝑂𝐵 is defined as 𝜖¯𝑂𝑂𝐵 = 1 ∑︀5      𝜖𝑂𝑂𝐵 , with 𝜖𝑂𝑂𝐵 being

the OOB error for the ensemble consisting of the first models.

To help you implement the above enhanced bagging algorithm, a partial implementation with some comments has been provided below and also in the file

from sklearn.base import clone import numpy as np


class OOBaggingClassifier:

def init (self, base_estimator, n_estimators=200):




base_estimator: a probabilistic classifier that implements the predict_proba function, such as DecisionTreeClassifier

n_estimators: the maximum number of estimators allowed.


self.base_estimator_ = base_estimator self.n_estimators = n_estimators self.estimators_   =   [] self.oob_errors_ = []


def fit(self, X, y, random_state=None): if random_state:



self.best_n = 0


probs_oob = None

for i in  range(self.n_estimators): estimator = clone(self.base_estimator_)


# construct a bootstrap sample


# train on bootstrap sample


# compute OOB error

oob_error = ... # replace ... with your code


# save the OOB error and the new model self.oob_errors_.append(oob_error) self.estimators_.append(estimator)


# stop early if smoothed OOB error increases (for the purpose of #  this  problem,  we  don't  stop  training  when  the  criterion  is

# fulfilled, but simply set self.best_n to (i+1)).

if (self.best_n  ==  0)  and (OOB criterion);    # replace OOB criterion with your code

self.best_n = (i+1)


def errors(self, X, y):





X:  an  input  array  of  shape  (n_sample,  n_features)

y: an array of shape (n_sample,) containing the classes for the input examples




error_rates: an array of shape (n_estimators,), with the error_rates[i] being the error rate of the ensemble consisting of the  first  (i+1) models.


error_rates = []

# compute all the required error rates return error_rates


def predict(self, X):




X:  an  input  array  of  shape  (n_sample,  n_features)




y: an array of shape (n_samples,) containig the predicted classes


probs = None

for estimator  in  self.estimators_:  p = estimator.predict_proba(X) if probs is None:

probs = p else:

probs += p

return np.argmax(probs, axis=1)


Note that in this question, we only consider bagging of probabilistic classifiers, and we use a different prediction method from the majority vote method discussed in lecture: we first use each individual classifier to predict the class distributions, then we average over all the predicted distributions, and predict the most likely class. This is implemented in the predict function, and you should use the same prediction method when computing the OOB prediction. If an example is not an OOB example for any of the basis models, predict the first class.


    1. (5 marks) Load the digit data set provided in sklearn.datasets, and construct a random 70/30 train-test split. Set the random seed to a number of your choice to make the split reproducible.
    2. (5 marks) A naive way to compute the OOB error is to store all bootstrap samples, and then use the definition of OOB error discussed in lecture to compute it. This requires a lot of memory for large datasets. Describe a method that computes the


OOB error 𝜖𝑂𝑂

using the -th bootstrap sample, but not previous bootstrap sam-


ples — this allows you to throw away a bootstrap sample once the -th model has



been trained and 𝜖𝑂𝑂

has been computed. Introduce additional variables or data


structures if needed, but the amount of memory used should be at most within a constant factor of that for the original dataset size (the constant factor need to be independent of the number of basis models used).

    1. (10 marks) Implement the fit function to support the method for using the OOB error to determine the number of models as described above.
    2. (5 marks) Implement the errors function which takes in a labeled dataset, and returns the error rates for the ensemble consisting of the first basis models, for 1 ≤ .
    3. (5 marks) Train your OOBaggingClassifier on the digit dataset, using a maximum of 200 decision trees. Report the number of selected models. Plot the OOB error and the test error against the number of models (from 1 to 200). What is the number of basis models chosen by the OOB error method?
    4. (5 marks) Repeat (f) two more times using different random seeds. Comment on the performance of the OOB error method for selecting the number of models.