Random Forests

3

December 14, 2013 by earlbellinger

One goal of supervised machine learning is to capture the relationship between objects and their classification. We humans do this particularly well during the course of our daily lives: we’ve all seen enough examples to easily distinguish, say, tables from chairs, and we hardly if ever make a mistake (“don’t sit where you eat,” or something like that). After all, we are the ones who give names to the classes in the first place. Emulating such comprehension capabilities in an automated fashion, therefore, is an important task in artificial intelligence.

“Plato’s discovery went as follows. It is possible for something to be a certain way and for something else to be the same way. So there are universals. (Tumultuous applause, which lasts, despite occasional subsidences, 2,400 years.) … ‘Universals’ is simply the name philosophers give to the ways in which two or more things can be the same.” – David Stove

The mapping from the objects we observe to their agreed upon labels is often wholly unconscious. We intuit the proper labels without needing to run through a list of explicit rules. Legs? Check. Flat surface? Check. That would be far too slow. It’s not likely that we would have survived in the wild if we had to deliberate at length to determine whether the shade we were enjoying was from a tree or a hungry predator.

When pressed, however, it’s possible for us to make such a list of rules. Tables and chairs are generally similar in having legs and a flat surface, yes, but where do they differ? If we think of all of the tables and chairs we’ve seen, we can enumerate the characteristics of each one in our memory, such as color, location, mass, shape, and so on. Then we can set to task on figuring out which are relevant to distinguishing tables from chairs. So can we get a machine to do it?

Random forests are an ensemble learning technique for accomplishing tasks like this. The idea is to train a “forest” of decision tree classifiers, each one being capable of giving some influence as to what features are important to distinguishing the label, and then aggregate over the results. The buzz around town is that these ensemble methods generally perform better than any individual method on its own. Random forests look particularly effective when your data is high dimensional, noisy, or linearly inseparable.

A Worked Example

For this example we’ll use the python package scikit-learn, and as you should guess, its implementation of the random forest classifier. As you shall see, it is extremely easy to use.

Let’s start by creating our tables and chairs. We’ll work with the following characteristics:
1. Color
2. Length of Legs
3. Area of top surface

We can add as many other features if we would like, such as number of legs, age, location, and so on, but we’ll keep it simple. And to make things fair, we won’t include give-away characteristics like “has seat back” – that would make our contrived example a little TOO contrived.

We will represent all possibilities of each attribute as a number. The colors of our tables and chairs will be drawn from the same uniform distribution of, say, integers 0 through 5, where you can think of 0 as being brown, 1 as black, etc. We’ll pull the length of the legs from different uniform distributions of real numbers: between 2 and 5 for chairs, and between 4 and 10 for tables. Finally, the area of the top surface will draw from a normal distribution: for chairs, we’ll use mean 2 and standard deviation 0.25; for tables, mean 5 and standard deviation 1.

Let’s construct our data in python. The classifier expects two lists: one containing the data and the other containing the corresponding labels, so we shall build these separately:

num_chairs = num_tables = 5000

from random import randint, uniform, gauss
data = [[randint(0, 5), # Possible colors for chairs
uniform(2, 5), # Possible leg lengths for chairs
gauss(2,0.25)] # Possible top surface areas for chairs
for i in range(num_chairs)] \
+ [[randint(0, 5), # Six possible colors for tables
uniform(4,10), # Possible leg lengths for tables
gauss(5, 1)] # Possible top surface areas for tables
for i in range(num_tables)]

from numpy import asarray # The labels must be a numpy array
labels = asarray(['chair']*num_chairs + ['table']*num_tables)


Now that we have 10,000 labelled records of tables and chairs, we can initialize a random forest classifier and train it on this data. I have arbitrarily chosen to use 100 estimators (trees in our forest). Unless you’re seeking the most concise representation, as can often be the goal in the sciences, the more estimators the merrier as that tends to lead to better performance — but training and testing time can be a limiting factor.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(data, labels)


Now we can perform cross-validation to see how well this classifier performs when distinguishing tables from chairs. This technique involves segregating the data into training and testing samples, fitting the classifier with the training data, and evaluating it on the left out testing data. We can repeat this several times and then average over the results to get a sense of how well this classifier would perform on as yet unseen data. Below I have (again, arbitrarily) chosen to do 100 trials of cross-validation.

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(rfc, data, labels, cv=100)
print("Accuracy: %0.2f (+/- %0.2f)"
% (scores.mean(), scores.std()*2))


Accuracy: 1.00 (+/- 0.00)

OK – it’s a little too good: it gets them all right all the time! Let’s make our data noisy so that we can see how robust it is. Let’s change every 100th table to a chair and vice versa.

labels[:num_chairs][::100]='table'
labels[num_chairs:num_chairs+num_tables][::100]='chair'
rfc.fit(data, labels)
scores = cross_val_score(rfc, data, labels, cv=100)
print("Accuracy: %0.2f (+/- %0.2f)"
% (scores.mean(), scores.std()*2))


Accuracy: 0.99 (+/- 0.01)

And there we have it. The classifier shows its robust to the mislabeled data we introduced.

Theoretical Niceties

The most desirable quality of a machine learning model is generalization: that the model is able to perform well on as yet unseen data. This is in contrast to memorization, in which a model knows only the literal input/output mapping of the data that was used to train it.

For example, if a model is trained to distinguish male names from female names on the set {Bob, Sally, Jim, Barbra}, but can only classify names it has already seen verbatim, then it has memorized the input/output mapping and is said to be overfitted. In contrast, if after learning from the genders of the supplied names, it is able to correctly infer the gender of previously unseen names (e.g. “Jason” is likely to be male and “Rose” is likely to be female), then the model has done a good job of generalizing. This is why we perform cross-validation: to estimate a classifier’s capacity to generalize.

In the book Introduction to Data Mining by Tan, Steinbach, and Kumar, the authors explain that when the number of trees in a random forest is sufficiently large, the generalization error $\epsilon$ is bounded above by the average correlation among the trees $\bar p$ and the average performance of the constituent tree classifiers $s$ such that

$\epsilon \leq \frac{\bar p (1-s^2)}{s^2}$.

The authors define average performance of the constituent tree classifiers in terms of their margin $M(\textbf X,Y)$:

$M(\textbf X,Y) = P(\hat Y_\theta = Y) - {\text{max} \atop Z \neq Y} P(\hat Y_\theta = Z)$

where $\hat Y_\theta$ is the prediction of a tree classifier built from a random vector $\theta$ on some test record $\textbf X$.

This is quite a nice guarantee not known about many other machine learning methods: well-trained random forests are certain to generalize at least to some extent.

Conclusions

This tutorial demonstrated how to use random forest classifiers to distinguish noisy data of different types with attributes drawn from a couple of different but overlapping statistical distributions. These kinds of methods are increasingly useful in an age where data are plentiful and automated prediction processes are indispensable.

I hope you enjoyed it and found it helpful!

3 thoughts on “Random Forests”

1. Thank you for this post, well written and useful. I especially like the philosophical underpinning. I have a few nitpicks though:

• It doesn’t make sense to “fit” your model once, before using “cross_val_score”: it is done internally, at every iteration of the CV loop
• For such a small dataset, cv=100 is a huge number, 5 or 10 would be more appropriate
• Although it probably wouldn’t make a difference in this case, it would be safest anyway to shuffle your dataset before using it
• The task is way too easy, to really illustrate your point, your should be more extreme in the noise you artificially inject in your dataset