LIONbook: Classification trees and forests: Handwritten digit recognition.

This is an exercise associated with the LIONbook (chapter 6).

Recognizing handwritten digits is a classic and very useful use case of learning from data. One starts from 16x16 pixel images of digits, taken from the US Postal Service Zip Code Database and labeled with the corresponding class (0, 1, 2, ..., 9), and learns a map that generalizes to new digits not present in the learning database. To measure this generalization, the original set is randomly split into two subsets, one for training and the other for testing the performance (zip-train.csv and zip-test.csv).

Using classification trees and forests in LIONoso is seamless if you are already familiar with the user interface.

Step 1: Load data files and classification tree (CART), connect and set parameters

Classification trees are available in LIONoso through the CART tree factory in the Models/CART Tree folder.

To load the data files and the CART tree factory (the creator of classification trees), drag the corresponding "CSV file" and "CART tree factory" nodes to the workbench on the right, and connect them by drawing an arrow. The files zip-train.csv and zip-test.csv contain the intensity features for all digits (in a 16x16 array indexed as 0.0, 0.1, ..., 1.0, 1.1, ..., 15.15, with all intensity levels scaled so that the values are between -1 and 1). These data were made available by the neural network group at AT&T research labs (thanks to Yann LeCun).
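As a small illustration of the indexing scheme, the following Python sketch generates the 256 feature names. It assumes that the first number is the row and the second the column of the pixel; the text gives only the name pattern, so this ordering, like the helper name `pixel_column_names`, is an assumption made for illustration.

```python
def pixel_column_names(size=16):
    """Return the feature names "0.0", "0.1", ..., "15.15".

    Assumption: names are "<row>.<column>" in row-major order.
    """
    return [f"{row}.{col}" for row in range(size) for col in range(size)]


columns = pixel_column_names()
print(len(columns))              # number of intensity features
print(columns[:3], columns[-1])  # first few names and the last one
```

A list like this can be handy when the input columns have to be selected programmatically instead of being typed by hand.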

The parameters for the "CART tree factory" are: inputs: 0.0 0.1 ... 15.15, output: Class. Click "Start training" to create the tree (shown with the same icon, but without the gear symbol).
Connect zip-test.csv to the newly created CART model to produce the table with the predicted classification results. The output classification will be in the "Class" column, while the target output (remember that we are dealing with supervised classification) will be saved in the "Class-target" column.

Step 4: Analyze and visualize the classification of the test set

Drag the data column Class-target to create a bar chart, then drag the variable Class onto the "subclass" destination area in the bar chart. Each subclass, corresponding to a different output classification of a given input class, will now appear as a separate bar.
The various input classes (0, 1, ..., 9) will now be split by the different output classifications. Most cases are classified correctly; some classification errors are present, as expected. It is interesting to see how the different classes are confused. For example, class "3" tends to be confused with "5" (in 20 cases), "7" is confused with "4" (in six cases), etc.
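The grouping behind this bar chart can be sketched in a few lines of Python. The column names Class and Class-target come from the output table described above; the function name and the handful of rows are made up for illustration and are not taken from the real test set.

```python
from collections import Counter, defaultdict


def split_by_prediction(rows):
    """Count, for every target class, how often each class was predicted."""
    breakdown = defaultdict(Counter)
    for row in rows:
        breakdown[row["Class-target"]][row["Class"]] += 1
    return breakdown


# Toy rows (invented), shaped like the classification output table.
rows = [
    {"Class-target": 3, "Class": 3},
    {"Class-target": 3, "Class": 5},
    {"Class-target": 3, "Class": 3},
    {"Class-target": 7, "Class": 4},
    {"Class-target": 7, "Class": 7},
]

breakdown = split_by_prediction(rows)
for target in sorted(breakdown):
    # One group of bars per target class, one bar per predicted class.
    print(target, dict(breakdown[target]))
```

Each inner counter plays the role of the subclass bars for one input class.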

A different visualization of the same results can be obtained by stacking the different subclasses. Each different classification will now appear as a stripe in a single vertical bar.

For a more detailed analysis, drag the Error analyzer node in the "table manipulation and creation" folder onto the output classification table, to obtain a detailed error analysis table and a confusion matrix table.

By right-clicking on the confusion matrix and picking new panel -> Heatmap, the confusion among the various classes can be color-coded. Experiment with different choices for the coloring and observe how most cases fall on the diagonal (correct classifications). Try to convince yourself that most confusions (cases falling in off-diagonal boxes) are similar to those that human readers also make.
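For readers who want to see what the confusion matrix contains, here is a minimal Python sketch. It assumes that rows index the target class and columns the predicted class (the opposite convention is equally common), and the toy labels are invented; this is not how LIONoso computes its table internally.

```python
def confusion_matrix(targets, predictions, n_classes=10):
    """matrix[t][p] = number of cases of target class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix


def diagonal_fraction(matrix):
    """Fraction of cases on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels (invented): one 3 read as 5 and one 7 read as 4.
targets = [3, 3, 3, 5, 7, 7]
predictions = [3, 5, 3, 5, 4, 7]

matrix = confusion_matrix(targets, predictions)
fraction = diagonal_fraction(matrix)
print(matrix[3][5], matrix[7][4])  # off-diagonal confusion counts
print(round(fraction, 3))          # share of correct classifications
```

The heatmap is a color-coded rendering of exactly this kind of table.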

Advanced users can now repeat the above experiment with Democratic forests [LIONbook], obtained by training many randomized trees and combining their output classifications in a democratic manner. The appropriate node is the Democratic forest factory. Are the results obtained with democratic forests better than the results obtained with single trees?
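One simple reading of "democratic" is a plain majority vote over the trees, which the Python sketch below illustrates. This is an assumption made for illustration: the actual combination rule of the Democratic forest factory is not described here, and the per-tree predictions are invented.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree predictions case by case with a majority vote.

    tree_predictions: one list of predicted classes per tree, all of the
    same length. On a tie, the class voted for by the earliest tree wins
    (a consequence of Counter.most_common preserving insertion order).
    """
    combined = []
    for votes in zip(*tree_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three hypothetical trees classifying the same three test cases.
tree_predictions = [
    [3, 5, 7],
    [3, 3, 4],
    [5, 3, 7],
]

forest_output = majority_vote(tree_predictions)
print(forest_output)
```

Individual trees disagree on every case here, yet each vote has a clear winner, which is the intuition behind combining many randomized trees.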

Visualize the tree

It is of interest to see how the tree "works". In particular, one can analyze how the purity of the set of cases ending up in a node of the tree increases from the root to the leaves.
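To make the notion of purity concrete, the Python sketch below computes the Gini impurity of the class labels reaching a node: 0 means a perfectly pure node, and larger values mean more mixing. Gini impurity is a measure commonly used with CART; whether the tree factory uses it or another measure (such as entropy) is not stated here, so treat the choice as an assumption. The label lists are invented.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Invented example: labels become purer moving from the root to a leaf.
root_labels = [3, 3, 5, 5]      # evenly mixed
internal_labels = [3, 3, 3, 5]  # mostly one class
leaf_labels = [3, 3, 3, 3]      # a single class

for name, labels in [("root", root_labels),
                     ("internal", internal_labels),
                     ("leaf", leaf_labels)]:
    print(name, gini_impurity(labels))
```

The impurity falls from 0.5 at the mixed root to 0 at the pure leaf, mirroring the increase in purity along a path of the tree.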

The above visualization can be done by clicking on the nodes of the tree produced in the lionmode web service. Be patient: a lot of data needs to be transferred, so the interaction can be slow.

Here is a faster visualization by lionmode on the iris data set considered in a previous exercise.

The above visualizations have been produced with the lionmode web service, an automated service running in the cloud to identify the best possible model for a given task and to deliver feedback about the relevance of the various input features.

Example data files.


[LIONbook] The LION way
Roberto Battiti and Mauro Brunato. LIONlab, University of Trento, Feb 2014.
Download the LIONoso-ready data file: zipcode-digits-all.lion