The danger of overfitting.

This exercise accompanies Learning From Data, the introductory machine learning course taught by Caltech Professor Yaser Abu-Mostafa.

The purpose is to set up and run an experiment to study overfitting when fitting data with polynomials. Overfitting occurs when a statistical model fits random errors or noise instead of the underlying relationship. It typically happens when a model is excessively complex, with too many parameters relative to the number of data points. A model that has been overfit will generally have poor predictive performance. In this experiment, the parameters affecting overfitting are the number N of data points, the degree Qf of the target polynomial, and the noise level sigma2. The fit is a least-squares linear combination of Legendre polynomials; the details of the experiment are described in [AMLbook].
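A single trial of such an experiment can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the code inside overfit.jar: it assumes the overfit measure used in [AMLbook] (out-of-sample error of a 10th-order fit minus that of a 2nd-order fit), inputs drawn uniformly from [-1, 1], and a random target normalized so that its mean squared value is 1. The function name overfit_once is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as L


def overfit_once(n_points, q_f, sigma2, rng):
    """One trial: E_out(10th-order fit) - E_out(2nd-order fit)."""
    # Random target as a Legendre expansion of degree q_f, rescaled so that
    # E_x[f(x)^2] = 1 for x uniform on [-1, 1], where E[L_q(x)^2] = 1/(2q+1).
    # (The normalization is an assumption of this sketch.)
    a = rng.standard_normal(q_f + 1)
    a /= np.sqrt(np.sum(a**2 / (2.0 * np.arange(q_f + 1) + 1.0)))

    # Noisy training set: y = f(x) + sqrt(sigma2) * eps
    x = rng.uniform(-1.0, 1.0, n_points)
    y = L.legval(x, a) + np.sqrt(sigma2) * rng.standard_normal(n_points)

    def e_out(degree):
        # Least-squares fit expressed in the Legendre basis.
        c = L.legfit(x, y, degree)
        # Pad both coefficient vectors to a common length and subtract.
        size = max(len(c), len(a))
        diff = np.zeros(size)
        diff[:len(c)] += c
        diff[:len(a)] -= a
        # Orthogonality gives the exact E_x[(g(x) - f(x))^2]; the constant
        # noise term sigma2 is omitted because it cancels in the difference.
        return np.sum(diff**2 / (2.0 * np.arange(size) + 1.0))

    return e_out(10) - e_out(2)


rng = np.random.default_rng(0)
values = [overfit_once(20, 20, 1.0, rng) for _ in range(200)]
print("average overfit over 200 trials:", np.mean(values))
```

A positive value means the more complex 10th-order model generalizes worse than the 2nd-order one on that trial.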

Step 1: Import the external model overfit.jar and prepare the experiment

To run this experiment, you need LIONoso version 2.1.42 or later.

The experiment requires software that is not part of the standard LIONoso package, which makes it a good test case for how easily external models can be created and connected to the LIONoso workbench. Here the external model has been compiled into a Java archive, overfit.jar (please download it from here; if you are curious, the source code and documentation are available here). Load it into the workbench (Models / External and user-defined / JAR function modules) and connect it to a Design of Experiment (DOE) node, which creates the input data for the experiments.

Step 2: Run the experiment and analyze results

Once the experiment is designed, the resulting table of input data is fed to overfit.jar, which produces the corresponding outputs; these can be analyzed with the standard LIONoso visualizations. The first example studies how the overfit measure (averaged over 1000 experiments per data point) depends on N and sigma2, for a fixed degree Qf = 20.

The second example studies how the overfit measure (averaged over 1000 experiments per data point) depends on N and Qf, for a fixed noise level sigma2 = 0.1.
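Outside the workbench, the two input tables can be pictured as full-factorial grids. The Python sketch below is illustrative only: the helper full_factorial, the parameter ranges, and the column names are assumptions made here, and they say nothing about the ranges used in the original experiments or the input format that overfit.jar expects.

```python
from itertools import product


def full_factorial(**levels):
    """Return one dict per combination of the given parameter levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


# Hypothetical ranges, chosen only for illustration.
n_values = range(20, 121, 20)
sigma2_values = [0.0, 0.5, 1.0, 1.5, 2.0]
qf_values = range(0, 101, 20)

# First example: vary N and sigma2, keep Qf fixed at 20.
doe_noise = full_factorial(N=n_values, Qf=[20], sigma2=sigma2_values)

# Second example: vary N and Qf, keep sigma2 fixed at 0.1.
doe_degree = full_factorial(N=n_values, Qf=qf_values, sigma2=[0.1])

for row in doe_noise[:3]:
    print(row)
print(len(doe_noise), "rows and", len(doe_degree), "rows")
```

Each row is one parameter combination; the model then averages its repeated tests for that row.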

The two experiments can be part of a single workflow, since both DOE tables can be evaluated by the same model.

Additional insight can be obtained by analyzing the variance of the overfit measure. For example, how does the picture change if only one test, instead of 1000, is executed for each parameter combination? Other experiments could study overfitting with different models, such as SVMs or neural networks. A word of caution: depending on your CPU speed, experiments with more than about 5,000 experimental points (each averaged over 1000 tests) can be very slow, typically taking at least ten minutes. Try fewer experimental points first to get quick feedback.
Note that the number of tests is hardcoded in the model and can only be modified by recompiling the package.
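The question about a single test versus 1000 tests can be explored with a small simulation. The Python sketch below is an assumption-laden stand-in for the packaged model, not its actual code: it takes the overfit measure to be the out-of-sample error of a 10th-order fit minus that of a 2nd-order fit, draws inputs uniformly from [-1, 1], and normalizes the target to unit mean square. The function name overfit_trials and the parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as L


def overfit_trials(n_points, q_f, sigma2, n_trials, rng):
    """Return n_trials independent values of E_out(g10) - E_out(g2)."""
    # E[L_q(x)^2] = 1/(2q+1) for x uniform on [-1, 1]
    w = 1.0 / (2.0 * np.arange(max(q_f, 10) + 1) + 1.0)
    out = np.empty(n_trials)
    for t in range(n_trials):
        # Random target of degree q_f, normalized so that E[f^2] = 1.
        a = np.zeros(len(w))
        a[:q_f + 1] = rng.standard_normal(q_f + 1)
        a /= np.sqrt(np.sum(a**2 * w))
        x = rng.uniform(-1.0, 1.0, n_points)
        y = L.legval(x, a) + np.sqrt(sigma2) * rng.standard_normal(n_points)
        errors = []
        for degree in (2, 10):
            c = np.zeros(len(w))
            c[:degree + 1] = L.legfit(x, y, degree)
            errors.append(np.sum((c - a)**2 * w))  # exact E_x[(g - f)^2]
        out[t] = errors[1] - errors[0]
    return out


rng = np.random.default_rng(0)
samples = overfit_trials(n_points=60, q_f=20, sigma2=0.1, n_trials=1000, rng=rng)
std = samples.std(ddof=1)
print("mean overfit:", samples.mean())
print("spread of a single test (std):", std)
print("spread of a 1000-test average (std error):", std / np.sqrt(len(samples)))
```

Because the tests are independent, averaging K of them shrinks the spread by a factor of sqrt(K), so a picture built from single tests would be expected to look far noisier than one built from 1000-test averages.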

[AMLbook] Learning From Data.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. 2012.

Available material: