## An Interaction or Not? How a few ML Models Generalize to New Data

*Source code for this post is here.*

This post examines how a few statistical and machine learning models respond to a simple toy example where they're asked to make predictions on new regions of feature space. The key question the models will answer differently is whether there's an "interaction" between two features: does the influence of one feature differ depending on the value of another?

In this case, the data won't provide information about whether there's an interaction or not. Interactions are often real and important, but in many contexts we treat interaction effects as likely to be small (without evidence otherwise). I'll walk through why decision trees and bagged ensembles of decision trees (random forests) can make the opposite assumption: they can strongly prefer an interaction, even when the evidence is equally consistent with including or not including an interaction.

I'll look at point estimates from:

- a linear model
- decision trees and bagged decision trees (random forest), using R's `randomForest` package
- boosted decision trees, using R's `gbm` package

I'll also look at two models that capture **uncertainty** about whether there's an interaction:

- Bayesian linear model with an interaction term
- Bayesian Additive Regression Trees (BART)

BART has the advantage of expressing uncertainty while still being a "machine learning" type model that learns interactions, non-linearities, etc. without the user having to decide which terms to include or the particular functional form.

Whenever possible, I recommend using models like BART that explicitly allow for uncertainty.

# The Example

Suppose you're given this data and asked to make a prediction at $X_1 = 0$, $X_2 = 1$ (where there isn't any training data):

| X1 | X2 | Y | N Training Rows |
|---|---|---|---|
| 0 | 0 | Y = 5 + noise | 52 |
| 1 | 0 | Y = 15 + noise | 23 |
| 1 | 1 | Y = 19 + noise | 25 |
| 0 | 1 | ? | 0 |
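To make the setup concrete, here is a minimal Python sketch that simulates the three populated cells and computes what a purely additive linear model (no interaction term) would predict at the empty cell. The Gaussian noise with standard deviation 1, the seed, and all variable names are assumptions for illustration; the post itself only says "noise".

```python
import random
from statistics import mean

random.seed(0)

# Cell means and row counts come from the table above.
# (X1, X2) -> (true mean of Y, number of training rows)
cells = {(0, 0): (5, 52), (1, 0): (15, 23), (1, 1): (19, 25)}
NOISE_SD = 1.0  # assumed noise scale, not given in the post

data = {
    cell: [mu + random.gauss(0, NOISE_SD) for _ in range(n)]
    for cell, (mu, n) in cells.items()
}
cell_mean = {cell: mean(ys) for cell, ys in data.items()}

# Additive model: Y = b0 + b1*X1 + b2*X2.
# There are three populated cells and three coefficients, so the
# least-squares fit reproduces the three cell means exactly.
b0 = cell_mean[(0, 0)]
b1 = cell_mean[(1, 0)] - cell_mean[(0, 0)]
b2 = cell_mean[(1, 1)] - cell_mean[(1, 0)]

# Prediction at X1 = 0, X2 = 1, where there is no training data.
additive_prediction = b0 + b2
print(round(additive_prediction, 2))  # roughly 5 + 4 = 9
```

If an interaction term $b_3 X_1 X_2$ were added, the product $X_1 X_2$ would equal $X_2$ in every training row, so $b_2$ and $b_3$ could not be separated from this data. That is one way to see why the data carries no information about the interaction.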

(...click here for the rest of this post)

## Covariance As Signed Area Of Rectangles

A colleague at work recently pointed me to a wonderful stats.stackexchange answer with an intuitive explanation of covariance: For each pair of points, draw the rectangle with these points at opposite corners. Treat the rectangle's area as signed, with the same sign as the slope of the line between the two points. If you add up all of the areas, you have the (sample) covariance, up to a constant that depends only on the data set.
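The rectangle description can be checked numerically. The sketch below (function names and the four sample points are my own, chosen for illustration) sums the signed areas over all unordered pairs of points and compares the total with the usual sample covariance. For $n$ points the two differ by the factor $n(n-1)$, which depends only on the size of the data set.

```python
from itertools import combinations


def signed_rectangle_sum(xs, ys):
    """Sum of signed rectangle areas over all unordered pairs of points.

    (x1 - x2) * (y1 - y2) is the rectangle's area, with the same sign
    as the slope of the line through the two points.
    """
    points = list(zip(xs, ys))
    return sum(
        (x1 - x2) * (y1 - y2)
        for (x1, y1), (x2, y2) in combinations(points, 2)
    )


def sample_covariance(xs, ys):
    """Usual sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]
n = len(xs)

area_sum = signed_rectangle_sum(xs, ys)
cov = sample_covariance(xs, ys)

# The rectangle sum equals n * (n - 1) times the sample covariance.
print(area_sum, n * (n - 1) * cov)
```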

Here's an example with 4 points. Each spot on the plot is colored by the sum corresponding to that point. For example, the dark space in the lower left has three "positively" signed rectangles going through it, but for the white space in the middle, one positive and one negative rectangle cancel out.

In this next example, *x* and *y* are drawn from independent normals, so we have roughly an even amount of positive and negative:

## Formal Explanation

The formal way to speak about multiple draws from a distribution is with a set of independent and identically distributed (i.i.d.) random variables. If we have a random variable $X$, saying that $X_1, X_2, \ldots$ are i.i.d. means that they are all independent and all follow the same distribution as $X$.

(...click here for the rest of this post)

## Previous Posts

### Simulated Knitting (post)

I created a `KnittedGraph` class (a subclass of Python's `igraph` graph class) with methods corresponding to common operations performed while knitting:

```
g = KnittedGraph()
g.AddStitches(n)
g.ConnectToZero() # join with the first stitch for a circular shape
g.NewRow() # start a new row of stitches
g.Increase() # two stitches in new row connect to one stitch in old
#(etc.)
```

I then embed the graphs in 3D space. Here's a hat I made this way:

### 2D Embeddings from Unsupervised Random Forests (1, 2)

There are all sorts of ways to embed high-dimensional data in low dimensions for visualization. Here's one:

- Given some set of high dimensional examples, build a random forest to distinguish examples from non-examples.
- Assign similarities to pairs of examples based on how often they are in leaf nodes together.
- Map examples to 2D in such a way that similarity decreases with Euclidean 2D distance (I used multidimensional scaling for this).
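The three steps above can be sketched roughly as follows. This is not the code from the original posts. It assumes scikit-learn and NumPy are available, builds the "non-examples" by shuffling each column independently (a common choice, but an assumption here), and uses classical multidimensional scaling as the MDS variant. The function name `forest_embedding` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def forest_embedding(X, n_trees=100, seed=0):
    """Return (similarity matrix, 2D coordinates) for the rows of X."""
    rng = np.random.default_rng(seed)

    # Step 1: forest that separates real rows from synthetic "non-examples".
    # Assumption: non-examples come from shuffling each column independently.
    synthetic = np.column_stack([rng.permutation(col) for col in X.T])
    data = np.vstack([X, synthetic])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(synthetic))])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # Step 2: similarity = fraction of trees where two rows share a leaf.
    leaves = forest.apply(X)  # shape (n_samples, n_trees)
    similarity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

    # Step 3: classical MDS on the dissimilarities 1 - similarity.
    dissimilarity = 1.0 - similarity
    n = len(X)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:2]
    coords = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
    return similarity, coords


# Small illustrative run on random data (not the diamond shapes from the post).
demo_X = np.random.default_rng(1).normal(size=(40, 5))
demo_similarity, demo_coords = forest_embedding(demo_X, n_trees=50)
print(demo_coords.shape)
```

The pairwise leaf comparison uses memory proportional to the number of rows squared times the number of trees, so this version only suits small data sets.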

Here's the result of doing this on a set of diamond shapes I constructed. I like how it turned out:

### A Bayesian Model for a Function Increasing by Chi-Squared Jumps (in Stan) (post)

In this paper, Andrew Gelman mentions a neat example where there's a big problem with a naive approach to putting a Bayesian prior on functions that are constrained to be increasing. So I thought about what sort of prior would make sense for such functions, and fit the models in Stan.

I enjoyed Andrew's description of my attempt: *"... it has a charming DIY flavor that might make you feel that you too can patch together a model in Stan to do what you need."*

### Lissajous Curves JSFiddle

Some JavaScript I wrote (using d3) to mimic what an oscilloscope I saw at the Exploratorium was doing:

### Visualization of the Weierstrass Elliptic Function as a Sum of Terms

John Baez used this in his AMS blog Visual Insight.