In this exercise you will put the regression techniques covered in the last few lectures
to use. An engine fuel efficiency data set can be downloaded here.
Use the data to create regression models and answer the following questions
based on what you observe:
When applying OLS, how do you determine the features?
How does the number of features affect your training and test R-square values?
Is it always better to include more features?
How does the distribution of the training sample affect the test R-square?
Do you find Latin hypercube sampling to have better test R-square than random sampling?
What is the difference between a feedforward neural network and Kriging?
Sample code
The sample code can be downloaded here.
In the following, we briefly go through the code.
Data visualization
After the data is loaded (double-click the downloaded .mat file within Matlab),
the sample code visualizes the engine efficiency map. Note that the map is given in
g/kWh (fuel mass per unit of energy delivered), so lower values mean higher efficiency.
Data preparation
The following script prepares the training and test data. The idea is to use
the training data to build the regression model, and test this model using the test data.
Note that we look at the test data here only to understand how the model
would behave once deployed. In practice, the test data is never seen in
advance and must not influence the regression model.
Here, we use two methods to create the training data: random sampling
(commented out in the following code) and Latin hypercube sampling.
Compare the two to see which one gives the better test R-square.
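To make the difference between the two sampling schemes concrete, here is a minimal Python sketch (not the course's Matlab script). The function names, the unit-square domain, and the sample size are assumptions chosen for illustration. It draws one random sample and one Latin hypercube sample, then counts how many of the n equal-width bins along each axis are occupied.

```python
import random

def random_sample(n, dims, rng):
    """Draw n points uniformly at random from the unit hypercube."""
    return [[rng.random() for _ in range(dims)] for _ in range(n)]

def latin_hypercube(n, dims, rng):
    """Draw n points so that each of the n equal-width bins along every
    axis contains exactly one point."""
    columns = []
    for _ in range(dims):
        # One point per bin: bin k covers [k/n, (k+1)/n).
        col = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(col)  # decouple the bin order across dimensions
        columns.append(col)
    return [[columns[d][i] for d in range(dims)] for i in range(n)]

def occupied_bins(points, dim, n):
    """Count how many of the n bins along one axis contain a point."""
    return len({min(int(p[dim] * n), n - 1) for p in points})

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20
lhs = latin_hypercube(n, 2, rng)
rnd = random_sample(n, 2, rng)
print("LHS bins covered per axis:   ", [occupied_bins(lhs, d, n) for d in range(2)])
print("Random bins covered per axis:", [occupied_bins(rnd, d, n) for d in range(2)])
```

By construction the Latin hypercube sample covers every bin along each axis, while a random sample usually leaves some bins empty. To use such a sample on real inputs, each column would be rescaled from [0, 1) to the range of the corresponding input variable.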
Ordinary Least Squares
The following script performs OLS estimation, reports the test R-square,
and outputs a visual comparison between the true efficiency map and the predicted one.
Try changing the features, defined as X, and see whether doing so improves
the test R-square.
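The effect of the feature choice can be sketched in a few lines of Python (again, not the course's Matlab script). Everything here is an assumption made for illustration: the synthetic `truth` function stands in for the efficiency map, the features are polynomial terms of two inputs, and the weights come from solving the normal equations with a small hand-written solver.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ols_fit(X, y):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    m = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(m)]
    return solve(XtX, Xty)

def r_square(X, y, w):
    """R^2 = 1 - SS_res / SS_tot for the predictions X w."""
    pred = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def features(a, b, degree):
    """Polynomial features of two inputs up to the given total degree
    (the constant term 1 is included)."""
    return [a ** i * b ** j for i in range(degree + 1)
            for j in range(degree + 1 - i)]

def truth(a, b):
    """Synthetic stand-in for the efficiency map (NOT the course data)."""
    return 250.0 + 80.0 * (a - 0.6) ** 2 + 60.0 * (b - 0.7) ** 2 + 20.0 * a * b

rng = random.Random(1)

def make(n):
    """Sample n random inputs and noisy outputs from the synthetic map."""
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    return pts, [truth(a, b) + rng.gauss(0.0, 1.0) for a, b in pts]

train_pts, y_train = make(30)
test_pts, y_test = make(200)
results = {}
for degree in (1, 2, 3):
    X_train = [features(a, b, degree) for a, b in train_pts]
    X_test = [features(a, b, degree) for a, b in test_pts]
    w = ols_fit(X_train, y_train)
    results[degree] = (len(w), r_square(X_train, y_train, w),
                       r_square(X_test, y_test, w))
    print("degree", degree, "features", results[degree][0],
          "train R2 %.3f" % results[degree][1],
          "test R2 %.3f" % results[degree][2])
```

Because each feature set contains the previous one, the training R-square cannot decrease as features are added. The test R-square carries no such guarantee, which is the point of comparing the two.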
Gaussian process
A Gaussian process lets you fit all training data points exactly
while using the least “complex” function that does so. As you will see, however, this may
still lead to large prediction errors away from the training points, so the Gaussian spread (lambda in the following code) needs fine-tuning.
A more even spread of samples, such as one obtained through Latin hypercube sampling, also helps.
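The trade-off can be illustrated with a minimal Python sketch of noise-free Gaussian kernel interpolation (not the course's Matlab script). The one-dimensional `truth` function, the training grid, and the two values of `lam` are assumptions for illustration. In particular, `lam` here multiplies the squared distance inside the exponential, which may differ from how lambda is defined in the course code.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def kernel(x1, x2, lam):
    """Gaussian kernel; larger lam means a narrower bump (assumed convention)."""
    return math.exp(-lam * (x1 - x2) ** 2)

def fit(xs, ys, lam):
    """Weights w solving K w = y, so the model passes through every sample."""
    K = [[kernel(a, b, lam) for b in xs] for a in xs]
    return solve(K, ys)

def predict(x, xs, w, lam):
    """Prediction as a weighted sum of kernels centered on the samples."""
    return sum(wi * kernel(x, xi, lam) for wi, xi in zip(w, xs))

def truth(x):
    """Synthetic 1-D function standing in for the real map (an assumption)."""
    return math.sin(2.0 * math.pi * x)

xs = [i / 7 for i in range(8)]                   # evenly spread training inputs
ys = [truth(x) for x in xs]
test_xs = [(i + 0.5) / 50 for i in range(50)]    # points between the samples

errors = {}
for lam in (20.0, 2000.0):
    w = fit(xs, ys, lam)
    train_err = max(abs(predict(x, xs, w, lam) - y) for x, y in zip(xs, ys))
    test_err = max(abs(predict(x, xs, w, lam) - truth(x)) for x in test_xs)
    errors[lam] = (train_err, test_err)
    print("lam", lam, "max train error %.2e" % train_err,
          "max test error %.3f" % test_err)
```

Both settings reproduce the training data to numerical precision, yet the very narrow kernel predicts values close to zero between samples and so misses the underlying function there. Exact training fit alone therefore says little about test performance.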
Feedforward Neural Net
Although a feedforward network can also be trained from a script, Matlab provides
a network fitting app that is easy to use, though with limited functionality.
The following code visualizes the outputs from the trained network net.