5 Tree Models: Random Forest

The idea of trees in statistical learning has similarity with decision trees models. Trees and its extensions have been used widely for modeling the prediction of outcome variables. Let’s start with reviewing the advantages and disadvantages of this method:

Benefits:

Very good in capturing the interactions between the features of a models and non-linearity associations
performs very well in handling categorical variables, with nice interpretation of the results
relatively faster than other machine learning algorithms
does not require scaling x variables/features.

Disadvantage:

Simple tree models do not lead to the best out-of-sample predictive.

However, if we grow a good forest and bag/boost the models to combine the fits from different tree models, we can improve the prediction power.

5.1 Regression Trees

It is much easier to explain the tree models using an example. Let’s start with the Boston housing data that we used before in \(kNN\) section. Similarly, I want to predict median housing price in a neighborhood, \({\tt medv}\), using the ratio of low state households, \({\tt lstat}\).

library(MASS) 
attach(Boston)
n = nrow(Boston)

ddf = data.frame(lstat,medv)

# Sort the data here
oo=order(ddf$lstat)

ddf = ddf[oo,]


library(rpart)

m_tree=rpart(medv~lstat,data=ddf)

#plot the estimated tree

summary(m_tree)

Call:
rpart(formula = medv ~ lstat, data = ddf)
  n= 506 

          CP nsplit rel error    xerror       xstd
1 0.44236500      0 1.0000000 1.0039181 0.08305220
2 0.15283400      1 0.5576350 0.5973316 0.05238544
3 0.06275014      2 0.4048010 0.4421396 0.04395568
4 0.01374053      3 0.3420509 0.3947080 0.04137455
5 0.01200979      4 0.3283103 0.3810984 0.04077176
6 0.01159492      5 0.3163005 0.3763524 0.03962160
7 0.01000000      6 0.3047056 0.3719473 0.03959767

Variable importance
lstat 
  100 

Node number 1: 506 observations,    complexity param=0.442365
  mean=22.53281, MSE=84.41956 
  left son=2 (294 obs) right son=3 (212 obs)
  Primary splits:
      lstat < 9.725  to the right, improve=0.442365, (0 missing)

Node number 2: 294 observations,    complexity param=0.06275014
  mean=17.34354, MSE=23.83089 
  left son=4 (144 obs) right son=5 (150 obs)
  Primary splits:
      lstat < 16.085 to the right, improve=0.3825785, (0 missing)

Node number 3: 212 observations,    complexity param=0.152834
  mean=29.72925, MSE=79.31047 
  left son=6 (162 obs) right son=7 (50 obs)
  Primary splits:
      lstat < 4.65   to the right, improve=0.3882819, (0 missing)

Node number 4: 144 observations,    complexity param=0.01374053
  mean=14.26181, MSE=18.74458 
  left son=8 (75 obs) right son=9 (69 obs)
  Primary splits:
      lstat < 19.9   to the right, improve=0.2174498, (0 missing)

Node number 5: 150 observations
  mean=20.302, MSE=10.84406 

Node number 6: 162 observations,    complexity param=0.01159492
  mean=26.6463, MSE=42.74335 
  left son=12 (134 obs) right son=13 (28 obs)
  Primary splits:
      lstat < 5.495  to the right, improve=0.07152825, (0 missing)

Node number 7: 50 observations,    complexity param=0.01200979
  mean=39.718, MSE=67.21788 
  left son=14 (32 obs) right son=15 (18 obs)
  Primary splits:
      lstat < 3.325  to the right, improve=0.1526421, (0 missing)

Node number 8: 75 observations
  mean=12.32533, MSE=16.18776 

Node number 9: 69 observations
  mean=16.36667, MSE=13.01729 

Node number 12: 134 observations
  mean=25.84701, MSE=40.14324 

Node number 13: 28 observations
  mean=30.47143, MSE=37.49776 

Node number 14: 32 observations
  mean=37.31562, MSE=65.92382 

Node number 15: 18 observations
  mean=43.98889, MSE=41.01765

summary(ddf$medv)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   17.02   21.20   22.53   25.00   50.00

plot(m_tree)

text(m_tree)

But, this plot does not often look good. Indeed, it usually is a mess! To plot a nice tree, I suggest using \({\tt rpart.plot}\) package.

library(rpart.plot)

rpart.plot(m_tree)

At each interior node, there is decision rule, \(x>c\). If \(x>c\), you should follow the branch on the left. This continues until you reach a bottom/terminal node, which also is known as the leaf of the tree.

plot(ddf$lstat,ddf$medv, col='blue')

lines(ddf$lstat,predict(m_tree,ddf),col='maroon',lwd=3)


for (i in m_tree$splits[,4] ){
abline(v=i, col="orange",lty=2)
}

The set of bottom nodes gives us a partition of the predictor \(x\) space into disjoint regions. At right, the vertical lines display the partition. With just one x, this is just a set of intervals.

Within each region (interval) we compute the average of the \(y\) values for the subset of training data in the region. This gives us the step function which is our \(\hat{f}\). The \(\bar{y}\) values are also printed at the bottom nodes. To predict, we just use the above step function estimation of f(x).

5.2 A Tree with Two Explanatory Variables

Now, let’s estimate a tree model using two features: \(x=(lstat, dis)\) and \(y=medx\).

First, let’s take a look at scatter plot of these variables:

attach(Boston)

ddf = data.frame(medv,rm,nox,lstat, dis)

# Sort the data here
oo=order(ddf$lstat)

ddf = ddf[oo,]

library(ggplot2)

ggplot(ddf, aes(lstat, dis, colour = medv)) + geom_point()

m_tree2=rpart(medv~lstat+dis,data=ddf)


rpart.plot(m_tree2, box.palette = "Grays")

print(m_tree2)

n= 506 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 506 42716.3000 22.53281  
   2) lstat>=9.725 294  7006.2830 17.34354  
     4) lstat>=16.085 144  2699.2200 14.26181  
       8) dis< 2.0037 75  1114.6990 11.96933 *
       9) dis>=2.0037 69   761.9316 16.75362 *
     5) lstat< 16.085 150  1626.6090 20.30200 *
   3) lstat< 9.725 212 16813.8200 29.72925  
     6) lstat>=4.65 162  6924.4230 26.64630  
      12) dis>=2.4501 144  3493.3720 25.67986  
        24) lstat>=5.495 118  2296.9260 24.71695 *
        25) lstat< 5.495 26   590.4850 30.05000 *
      13) dis< 2.4501 18  2220.5910 34.37778 *
     7) lstat< 4.65 50  3360.8940 39.71800  
      14) dis>=3.20745 38  2087.6880 37.00789  
        28) lstat>=3.99 13   179.3908 32.06154 *
        29) lstat< 3.99 25  1424.8400 39.58000 *
      15) dis< 3.20745 12   110.3000 48.30000 *

summary(m_tree2)

Call:
rpart(formula = medv ~ lstat + dis, data = ddf)
  n= 506 

          CP nsplit rel error    xerror       xstd
1 0.44236500      0 1.0000000 1.0018711 0.08295707
2 0.15283400      1 0.5576350 0.6004766 0.05155493
3 0.06275014      2 0.4048010 0.4398185 0.04134508
4 0.02833720      3 0.3420509 0.4007954 0.04098180
5 0.02722395      4 0.3137137 0.4159664 0.04266595
6 0.01925703      5 0.2864897 0.3877346 0.03890024
7 0.01418570      6 0.2672327 0.3581315 0.03870040
8 0.01131786      7 0.2530470 0.3530301 0.03881346
9 0.01000000      8 0.2417291 0.3399729 0.03629141

Variable importance
lstat   dis 
   72    28 

Node number 1: 506 observations,    complexity param=0.442365
  mean=22.53281, MSE=84.41956 
  left son=2 (294 obs) right son=3 (212 obs)
  Primary splits:
      lstat < 9.725   to the right, improve=0.4423650, (0 missing)
      dis   < 2.5977  to the left,  improve=0.1169235, (0 missing)
  Surrogate splits:
      dis < 4.48025 to the left,  agree=0.737, adj=0.373, (0 split)

Node number 2: 294 observations,    complexity param=0.06275014
  mean=17.34354, MSE=23.83089 
  left son=4 (144 obs) right son=5 (150 obs)
  Primary splits:
      lstat < 16.085  to the right, improve=0.3825785, (0 missing)
      dis   < 2.0754  to the left,  improve=0.3369205, (0 missing)
  Surrogate splits:
      dis < 2.2166  to the left,  agree=0.724, adj=0.438, (0 split)

Node number 3: 212 observations,    complexity param=0.152834
  mean=29.72925, MSE=79.31047 
  left son=6 (162 obs) right son=7 (50 obs)
  Primary splits:
      lstat < 4.65    to the right, improve=0.3882819, (0 missing)
      dis   < 2.16475 to the right, improve=0.1528193, (0 missing)
  Surrogate splits:
      dis < 1.48495 to the right, agree=0.769, adj=0.02, (0 split)

Node number 4: 144 observations,    complexity param=0.01925703
  mean=14.26181, MSE=18.74458 
  left son=8 (75 obs) right son=9 (69 obs)
  Primary splits:
      dis   < 2.0037  to the left,  improve=0.3047506, (0 missing)
      lstat < 19.9    to the right, improve=0.2174498, (0 missing)
  Surrogate splits:
      lstat < 19.73   to the right, agree=0.771, adj=0.522, (0 split)

Node number 5: 150 observations
  mean=20.302, MSE=10.84406 

Node number 6: 162 observations,    complexity param=0.0283372
  mean=26.6463, MSE=42.74335 
  left son=12 (144 obs) right son=13 (18 obs)
  Primary splits:
      dis   < 2.4501  to the right, improve=0.17481020, (0 missing)
      lstat < 5.495   to the right, improve=0.07152825, (0 missing)

Node number 7: 50 observations,    complexity param=0.02722395
  mean=39.718, MSE=67.21788 
  left son=14 (38 obs) right son=15 (12 obs)
  Primary splits:
      dis   < 3.20745 to the right, improve=0.3460110, (0 missing)
      lstat < 3.325   to the right, improve=0.1526421, (0 missing)
  Surrogate splits:
      lstat < 1.95    to the right, agree=0.8, adj=0.167, (0 split)

Node number 8: 75 observations
  mean=11.96933, MSE=14.86266 

Node number 9: 69 observations
  mean=16.75362, MSE=11.04249 

Node number 12: 144 observations,    complexity param=0.0141857
  mean=25.67986, MSE=24.25952 
  left son=24 (118 obs) right son=25 (26 obs)
  Primary splits:
      lstat < 5.495   to the right, improve=0.17346010, (0 missing)
      dis   < 4.39775 to the right, improve=0.07588735, (0 missing)

Node number 13: 18 observations
  mean=34.37778, MSE=123.3662 

Node number 14: 38 observations,    complexity param=0.01131786
  mean=37.00789, MSE=54.93915 
  left son=28 (13 obs) right son=29 (25 obs)
  Primary splits:
      lstat < 3.99    to the right, improve=0.2315753, (0 missing)
      dis   < 6.40075 to the right, improve=0.1173026, (0 missing)

Node number 15: 12 observations
  mean=48.3, MSE=9.191667 

Node number 24: 118 observations
  mean=24.71695, MSE=19.46548 

Node number 25: 26 observations
  mean=30.05, MSE=22.71096 

Node number 28: 13 observations
  mean=32.06154, MSE=13.79929 

Node number 29: 25 observations
  mean=39.58, MSE=56.9936

library(tree)

m_tree2=tree(medv~lstat+dis,data=ddf)

summary(m_tree2)


Regression tree:
tree(formula = medv ~ lstat + dis, data = ddf)
Number of terminal nodes:  9 
Residual mean deviance:  20.78 = 10330 / 497 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-14.780  -2.713  -0.302   0.000   2.298  15.620

partition.tree(m_tree2)

attach(Boston)
mid<-mean(medv)
ggplot(ddf, aes(lstat, rm, colour = medv)) + geom_point()+scale_color_gradient2(midpoint=mid, low="blue", mid="cyan",
                     high="maroon", space ="Lab" )

m_tree31=rpart(medv~lstat+rm+dis,data=ddf)

summary(m_tree31)

Call:
rpart(formula = medv ~ lstat + rm + dis, data = ddf)
  n= 506 

          CP nsplit rel error    xerror       xstd
1 0.45274420      0 1.0000000 1.0025014 0.08304321
2 0.17117244      1 0.5472558 0.6187634 0.05731793
3 0.07165784      2 0.3760834 0.4288170 0.04605842
4 0.03616428      3 0.3044255 0.3454690 0.04186982
5 0.03336923      4 0.2682612 0.3334360 0.04212732
6 0.02311607      5 0.2348920 0.3150462 0.04196738
7 0.01585116      6 0.2117759 0.2930648 0.04175549
8 0.01000000      7 0.1959248 0.2811165 0.04106018

Variable importance
   rm lstat   dis 
   54    33    13 

Node number 1: 506 observations,    complexity param=0.4527442
  mean=22.53281, MSE=84.41956 
  left son=2 (430 obs) right son=3 (76 obs)
  Primary splits:
      rm    < 6.941   to the left,  improve=0.4527442, (0 missing)
      lstat < 9.725   to the right, improve=0.4423650, (0 missing)
      dis   < 2.5977  to the left,  improve=0.1169235, (0 missing)
  Surrogate splits:
      lstat < 4.83    to the right, agree=0.891, adj=0.276, (0 split)

Node number 2: 430 observations,    complexity param=0.1711724
  mean=19.93372, MSE=40.27284 
  left son=4 (175 obs) right son=5 (255 obs)
  Primary splits:
      lstat < 14.4    to the right, improve=0.4222277, (0 missing)
      dis   < 2.58835 to the left,  improve=0.2036255, (0 missing)
      rm    < 6.5455  to the left,  improve=0.1442877, (0 missing)
  Surrogate splits:
      dis < 2.23935 to the left,  agree=0.781, adj=0.463, (0 split)
      rm  < 5.858   to the left,  agree=0.688, adj=0.234, (0 split)

Node number 3: 76 observations,    complexity param=0.07165784
  mean=37.23816, MSE=79.7292 
  left son=6 (46 obs) right son=7 (30 obs)
  Primary splits:
      rm    < 7.437   to the left,  improve=0.50515690, (0 missing)
      lstat < 4.68    to the right, improve=0.33189140, (0 missing)
      dis   < 6.75885 to the right, improve=0.03479472, (0 missing)
  Surrogate splits:
      lstat < 3.99    to the right, agree=0.776, adj=0.433, (0 split)

Node number 4: 175 observations,    complexity param=0.02311607
  mean=14.956, MSE=19.27572 
  left son=8 (86 obs) right son=9 (89 obs)
  Primary splits:
      dis   < 2.0037  to the left,  improve=0.2927244, (0 missing)
      lstat < 19.83   to the right, improve=0.2696497, (0 missing)
      rm    < 5.567   to the left,  improve=0.0750970, (0 missing)
  Surrogate splits:
      lstat < 19.73   to the right, agree=0.749, adj=0.488, (0 split)
      rm    < 5.5505  to the left,  agree=0.634, adj=0.256, (0 split)

Node number 5: 255 observations,    complexity param=0.03616428
  mean=23.3498, MSE=26.0087 
  left son=10 (248 obs) right son=11 (7 obs)
  Primary splits:
      dis   < 1.5511  to the right, improve=0.2329242, (0 missing)
      lstat < 4.91    to the right, improve=0.2208409, (0 missing)
      rm    < 6.543   to the left,  improve=0.2172099, (0 missing)

Node number 6: 46 observations,    complexity param=0.01585116
  mean=32.11304, MSE=41.29592 
  left son=12 (7 obs) right son=13 (39 obs)
  Primary splits:
      lstat < 9.65    to the right, improve=0.35644260, (0 missing)
      dis   < 3.45845 to the left,  improve=0.08199333, (0 missing)
      rm    < 7.3     to the right, improve=0.05938584, (0 missing)
  Surrogate splits:
      dis < 1.6469  to the left,  agree=0.87, adj=0.143, (0 split)

Node number 7: 30 observations
  mean=45.09667, MSE=36.62832 

Node number 8: 86 observations
  mean=12.53953, MSE=16.13867 

Node number 9: 89 observations
  mean=17.29101, MSE=11.21228 

Node number 10: 248 observations,    complexity param=0.03336923
  mean=22.93629, MSE=14.75159 
  left son=20 (193 obs) right son=21 (55 obs)
  Primary splits:
      rm    < 6.543   to the left,  improve=0.38962730, (0 missing)
      lstat < 7.685   to the right, improve=0.33560120, (0 missing)
      dis   < 2.8405  to the left,  improve=0.03769213, (0 missing)
  Surrogate splits:
      lstat < 5.055   to the right, agree=0.839, adj=0.273, (0 split)
      dis   < 10.648  to the left,  agree=0.782, adj=0.018, (0 split)

Node number 11: 7 observations
  mean=38, MSE=204.1457 

Node number 12: 7 observations
  mean=23.05714, MSE=61.85673 

Node number 13: 39 observations
  mean=33.73846, MSE=20.24391 

Node number 20: 193 observations
  mean=21.65648, MSE=8.23738 

Node number 21: 55 observations
  mean=27.42727, MSE=11.69398

rpart.plot(m_tree31, box.palette = "Grays")

m_tree32=rpart(medv~lstat+rm+dis,data=ddf)


rpart.plot(m_tree32, box.palette = "Grays")

5.3 What is the loss problem and optimization question in tree algorithm?

As shown above, we want to split the feature space, i.e. \(X=\{x_1, x_2, \dots, x_k\}\), to smaller rectangular, or boxes. If we find a splitting pattern which has the lowest prediction error for the training data, then we can predict the associated outcome with the prediction feature set \(X_p\) by finding to which box this set belongs. Roughly, we can think about about tree algorithm like a \(kNN\) algorithm in which \(k\) is dynamic and can change according to the changes in the data.

\[\begin{align} min_{R_1,R_2,\dots,R_J} \sum_{j=1}^J \sum_{i \in R_j} (y_i-\hat{y}_{R_j})^2 \end{align}\]

This loss has a straightforward intuition: we want to find \(R_j\)s to minimize the sum of the squared deviation of observations from the mean response within each box. In theory, we can define infinite number of boxes, and we need to evaluate the squared errors of each set of boxes to see which one is minimized. This is computationally very intensive and almost impossible.

To solve this problem, a recursive partitioning top-down approach is used to find the optimal boxes:

Scan each feature/predictor in the data to find the one for which splitting gives the best performance/prediction power, i.e. lowest error.
Now, for each split repeat the previous step, and continue this until splitting is too costly (we will discuss later what “costly” means, but just as a tip: think about bias-variance tradeoff), or reach a pre-defined tree size.

This is a greedy approach since the splitting process only focus on each step and pick the split that is associated with the best fit at the current step, rather than looking ahead and picking the split which lead to a better prediction tree in the following steps.

5.4 How to fit the data to a tree model?

To execute the above algorithm, statistical algorithms re-write the above problem as follow:

\[\begin{align} C(T,y)=L(T,y)+\alpha |T| \end{align}\]

where L(T,y) is the loss of fitting outcome \(y\) with tree \(T\). Our goal always is minimizing the loss, that is small \(L\) is preferred.

However, we do not want to make the model/tree too complex. A complex model might be good in capturing all variances in the training data, but would lead to a biased prediction of in the test data. Thus, we add the number of nodes in tree T, i.e. \(|T|\), with a penalty parameter \(\alpha\) to the cost function \(C(T,y)\). For the continuous \(y\), we can use RMSE, and for categorical \(y\), we can use any of the classification measures that are discussed in the previous section.

What is the best \(\alpha\)? if we pick a big \(\alpha\), then the penalty for having a big and complex tree is large. Thus, the optimization problem return a small tree which can lead to a large \(L\) on the training data. On the other hand, for a small \(\alpha\), we allow the model pick a big tree. Here, \(\alpha\) is analogous to \(k\) in \(kNN\). How do we pick \(\alpha\)? The answer similar to most of the similar cases in this course is cross-validation.

5.5 Pruning trees

The idea here is first allow a tree grows as much as possible without being worried about its complexity size. Then, considering the complexity parameter, CP value, we prune/cut the tree with the optimal CP value.

attach(Boston)

df=Boston[,c(8,13,14)] # pick dis, lstat, medv

print(names(df))

[1] "dis"   "lstat" "medv"

# First grow a big tree

tree_ml=tree(medv~., df)

print(tree_ml)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 506 42720.0 22.53  
   2) lstat < 9.725 212 16810.0 29.73  
     4) lstat < 4.65 50  3361.0 39.72  
       8) dis < 3.20745 12   110.3 48.30 *
       9) dis > 3.20745 38  2088.0 37.01  
        18) lstat < 3.99 25  1425.0 39.58 *
        19) lstat > 3.99 13   179.4 32.06 *
     5) lstat > 4.65 162  6924.0 26.65  
      10) dis < 2.4501 18  2221.0 34.38 *
      11) dis > 2.4501 144  3493.0 25.68  
        22) lstat < 5.495 26   590.5 30.05 *
        23) lstat > 5.495 118  2297.0 24.72 *
   3) lstat > 9.725 294  7006.0 17.34  
     6) lstat < 16.085 150  1627.0 20.30 *
     7) lstat > 16.085 144  2699.0 14.26  
      14) dis < 2.0037 75  1115.0 11.97 *
      15) dis > 2.0037 69   761.9 16.75 *

plot(tree_ml)
text(tree_ml)

cat("Size of the big tree: \n")

Size of the big tree:

print(length(unique(tree_ml$where)))

[1] 9

Boston_fit=predict(tree_ml,df)

# Now, let's prune it down to a tree with 6 leaves/nodes

tree_ml_7=prune.tree(tree_ml,best=7)

cat("Size of the big tree: \n")

Size of the big tree:

print(length(unique(tree_ml_7$where)))

[1] 7

par(mfrow=c(1,2))

plot(tree_ml_7, type='u')
text(tree_ml_7, col='maroon', label=c('yval'), cex=.9)


Boston_fit7=predict(tree_ml_7,df)

##

ResultsMat=cbind(medv,Boston_fit,Boston_fit7)

colnames(ResultsMat)=c("medv","tree","tree7")

pairs(ResultsMat, col='maroon')

print(cor(ResultsMat))

           medv      tree     tree7
medv  1.0000000 0.8707875 0.8560183
tree  0.8707875 1.0000000 0.9830393
tree7 0.8560183 0.9830393 1.0000000

#Let's use the trained algorithm to predict the value for a specific point

Pred_point=data.frame(lstat=15,dis=2)

yhat=predict(tree_ml,Pred_point)
yhat_7=predict(tree_ml_7,Pred_point)


cat('prediction is: \n')

prediction is:

print(yhat)

     1 
20.302

print(yhat_7)

     1 
20.302

5.6 A cross-validated problem

attach(Boston)
set.seed(7)

df=Boston[,c(8,13,14)]

print(names(df))

[1] "dis"   "lstat" "medv"

# Let's fit a single tree and plot the importance of the variables

tree=rpart(medv~., method="anova",data=df,
           control=rpart.control(minsplit = 5,cp=.0005))


Ntree=length(unique(tree$where))

cat("Size of big tree is:",Ntree,"\n")

Size of big tree is: 90

# Let's check the CV results

plotcp(tree)

best_a=which.min(tree$cptable[,"xerror"])

bestCP=tree$cptable[best_a,"CP"]

bestSize=tree$cptable[best_a,"nsplit"]+1

best_tree=prune(tree,cp=bestCP)

plot(best_tree,uniform=TRUE)
text(best_tree,digits=2, use.n = TRUE)

5.7 Bagging and Random Forest

The idea behind Bagging is bootstrapping the data adequately enough to make sure that trees with good explanatory power are captured. This approach is computationally intensive and time-consuming.

Random Forests starts from Bagging and adds another kind of randomization. In this method, instead of going over all features when we do the greedy approach, this method randomly samples a subset of \(m\) variables to search over each time we make a split.

In this method, more types of trees will be evaluated. Since this method is Bootstraped, the important variables will be identified in average more than others; and thus can be used for prediction.

How to choose parameters for a Random Forest model? \(B\) is the number of Bootstrapped samples, and \(m\) is the number of variables to sample. A common choice for \(m\) is \(\sqrt p\), where \(p\) is the number of features in the model. When we set \(m=p\), then Random Forest is Bagging.

5.8 California Housing Data

library(randomForest)


rawData=read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/calhouse.csv")

# First divide the sample to train, validation, and test samples


set.seed(7)


n=nrow(rawData)

n1=floor(n/2)
n2=floor(n/4)
n3=n-n1-n2

#Shuffle the data
ii=sample(1:n,n)

CAtrain=rawData[ii[1:n1],]
CAval=rawData[ii[n1+1:n2],]
CAtest=rawData[ii[n1+n2+1:n3],]

## Fitting using the RF on train and evaluate on the tes, and predict on val

RF=randomForest(logMedVal~.,data=CAtrain,mtry=3,ntree=500)

RFpred=predict(RF,newdata=CAval)

RMSE=sqrt(mean(CAval$logMedVal-RFpred)^2)

cat('RMSE on the train data for Random Forest is',RMSE,"\n")

RMSE on the train data for Random Forest is 0.00363087

getTree(RF, 1, labelVar=TRUE)[1:30,]

   left daughter right daughter        split var split point status prediction
1              2              3     medianIncome    3.574250     -3  12.086736
2              4              5     AveOccupancy    2.294164     -3  11.775037
3              6              7     AveOccupancy    2.783601     -3  12.417998
4              8              9         latitude   37.975000     -3  12.053484
5             10             11         AveRooms    4.350556     -3  11.700731
6             12             13         AveRooms    6.438173     -3  12.556428
7             14             15     medianIncome    5.745500     -3  12.294984
8             16             17        longitude -117.135000     -3  12.163971
9             18             19       households   14.500000     -3  11.674856
10            20             21       households  446.500000     -3  11.850795
11            22             23        longitude -118.895000     -3  11.619754
12            24             25        longitude -117.175000     -3  12.500131
13            26             27        AveBedrms    1.809925     -3  12.724364
14            28             29     medianIncome    4.574500     -3  12.129722
15            30             31 housingMedianAge   27.500000     -3  12.696060
16            32             33 housingMedianAge   24.500000     -3  12.250378
17            34             35        longitude -116.180000     -3  11.493662
18             0              0             <NA>    0.000000     -1   9.615739
19            36             37        longitude -122.425000     -3  11.683028
20            38             39     medianIncome    1.854500     -3  11.730831
21            40             41        longitude -117.795000     -3  11.971906
22            42             43 housingMedianAge   48.500000     -3  11.531629
23            44             45         latitude   34.425000     -3  11.759766
24            46             47         latitude   37.995000     -3  12.516257
25            48             49     medianIncome    4.595600     -3  12.057731
26            50             51 housingMedianAge    9.500000     -3  12.797211
27            52             53        longitude -121.525000     -3  12.182172
28            54             55     AveOccupancy    2.784340     -3  12.027770
29            56             57        longitude -122.225000     -3  12.252785
30            58             59         AveRooms    7.908003     -3  12.648332

pairs(cbind(CAval$logMedVal,RFpred))

print(cor(cbind(CAval$logMedVal,RFpred)))

                    RFpred
       1.0000000 0.9056696
RFpred 0.9056696 1.0000000

varImpPlot(RF)