R Tutorial: Brief Introduction to Statistical Modeling in R, the reproducible way

Using the dplyr, mice, and caret packages

Learn how to apply your first four machine learning algorithms to the classic Titanic dataset.

While the statistics behind predictive models can be difficult, implementing your first models in R doesn’t have to be! There are many packages in R designed around user-friendliness and consistent frameworks to make your life easier. This tutorial works through the classic Titanic dataset and fits four different statistical models: random forest, k-nearest neighbors, a generalized linear model, and naive Bayes. The Titanic dataset used in this tutorial is freely available on Kaggle. By using the caret package, we can implement all four of these models in a single for loop to predict whether a passenger lived or died, and then visually compare the accuracy of the models. The tutorial goes through three primary stages: preparing the data with dplyr, imputing missing values with mice, and implementing the models with caret. Let’s get started!


Data preparation with dplyr

Did you know that many data scientists say that over 80% of their workload is munging data? That is quite a lot of time. Fortunately for us, the Titanic dataset comes almost perfectly clean and ready. We just need to add a few finishing touches.

Our goal in this section is to get our data into a caret-friendly format. A fantastic tool for manipulating data in R is the dplyr package, which describes itself as ‘a grammar of data manipulation’. Many of the mechanisms of this package are based on SQL, and the package focuses on readability and ease of use. It greatly simplifies the munging needed to turn messy data into model-ready data. Since a large share of most data science projects involves data cleaning and formatting, it is a good idea to get a little practice with dplyr.

Let’s load the tidyverse library, which puts dplyr and a few other R tools at our disposal.

library(tidyverse)

Now we can bring the data in and examine what we have.

df_pre <- read_csv('data/train.csv')
knitr::kable(head(df_pre))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NA S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NA S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NA S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 NA Q

Our data contains 891 rows, enough to make some reasonable predictions.
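If you want to verify that row count yourself, a quick check does the trick (output omitted):

# Confirm the number of rows in the raw training data
nrow(df_pre)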

As we can see in the table above, some columns are more useful than others, so our first step should be doing away with the columns that do more harm than good. Taking a look at the Kaggle data description gives a hint about which ones. The obvious problem columns are PassengerId, Name, Ticket, Cabin, and Parch: they mostly serve to drum up noise and slow down our modeling. Enter dplyr.

We can use the select function (similar to SQL’s SELECT) to drop unwanted columns. By putting a - before a specified column, we tell R to drop that column while keeping the rest. In this section we also use ‘pipes’ (i.e. %>%), another feature of the tidyverse, to simplify our code and make it more readable. It may help to read the pipe operator as saying ‘then’.

df <- df_pre %>%
  select(-Cabin,-PassengerId,-Name,-Ticket,-Parch)

Now, we just have to modify our data one more time before sending it down to imputation. Notice that columns such as Survived and SibSp contain categorical data but are coded numerically. This spells trouble for caret. Although caret can automatically convert strings into factors (i.e. categoricals), factor levels in caret must be valid R variable names. Numbers like ‘1’ and ‘0’ are not valid variable names in R, so we must switch the values inside Survived and SibSp to spelled-out counterparts.
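You can see why this matters with base R’s make.names(), which coerces strings into syntactically valid names: bare digits get an ‘X’ prefix, exactly the kind of silent renaming we want to avoid. This is just a quick illustration, not part of the pipeline:

# Digits are not valid R names, so make.names() prefixes them with 'X'
make.names(c('0', '1'))
## [1] "X0" "X1"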

To do this conversion efficiently, we first create a helper function that switches out the numbers for words.

# Map a single digit (given as a string) to its spelled-out word
to_letters <- function(input_x){
  switch(input_x,
    '0' = 'zero',
    '1' = 'one',
    '2' = 'two',
    '3' = 'three',
    '4' = 'four',
    '5' = 'five',
    '6' = 'six',
    '7' = 'seven',
    '8' = 'eight',
    '9' = 'nine',
    '10' = 'ten'
  )
}

Now, let’s use a tidyverse pipeline to apply this function to the desired columns. Notice mutate() and mutate_if(): as you may have guessed, these functions transform columns. mutate_if() transforms every column that satisfies a given condition, here whether the column is an integer.

df <- df %>%

  # Our function takes in strings, so let's convert our integer columns to strings
  # Note that our other numerical columns are doubles, so they remain unaffected by the code below
  mutate_if(is.integer, as.character) %>%

  # We perform the function row by row
  rowwise() %>%

  # We apply our custom function to each column that requires transformation
  mutate(Survived = to_letters(Survived),
         SibSp = to_letters(SibSp)
  )

# Now we can turn our strings into factors for imputation
df <- as.data.frame(df) %>%
  mutate_if(is.character, as.factor)
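Before moving on, it does not hurt to confirm the conversion did what we expect. A quick sanity check (output omitted) might look like this:

# Every column should now be a factor or a double,
# and Survived/SibSp should hold words rather than digits
sapply(df, class)
levels(df$Survived)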

Imputation with mice

Rows with missing values will bring our caret modeling to a screeching halt, but if we simply drop those rows, we lose valuable training data. How do we deal with this? Through imputation. While the caret package does contain some imputation methods, the mice (Multivariate Imputation by Chained Equations) package really brings down the imputational house with algorithms galore.

Before we get started imputing, let’s see which columns have missing values.

which(colSums(is.na(df)) > 0)
##      Age Embarked
##        4        7

Note that the numbers shown here are column positions, not counts: Age (the fourth column) and Embarked (the seventh) are the columns with missing values. Age is continuous and Embarked is a factor. Let’s impute them using the random forest method.
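Before running mice, you can also check the actual number of missing values per column (output omitted); this shows roughly 177 missing ages and 2 missing embarkation ports, matching the mice summary below:

# Count the missing values in the two affected columns
colSums(is.na(df))[c('Age', 'Embarked')]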

library(mice)
imputer_hold <- mice(df, m = 5, meth = 'rf')
##
##  iter imp variable
##   1   1  Age  Embarked
##   1   2  Age  Embarked
##   1   3  Age  Embarked
##   1   4  Age  Embarked
##   1   5  Age  Embarked
##   2   1  Age  Embarked
##   2   2  Age  Embarked
##   2   3  Age  Embarked
##   2   4  Age  Embarked
##   2   5  Age  Embarked
##   3   1  Age  Embarked
##   3   2  Age  Embarked
##   3   3  Age  Embarked
##   3   4  Age  Embarked
##   3   5  Age  Embarked
##   4   1  Age  Embarked
##   4   2  Age  Embarked
##   4   3  Age  Embarked
##   4   4  Age  Embarked
##   4   5  Age  Embarked
##   5   1  Age  Embarked
##   5   2  Age  Embarked
##   5   3  Age  Embarked
##   5   4  Age  Embarked
##   5   5  Age  Embarked
summary(imputer_hold)
## Multiply imputed data set
## Call:
## mice(data = df, m = 5, method = "rf")
## Number of multiple imputations:  5
## Missing cells per column:
## Survived   Pclass      Sex      Age    SibSp     Fare Embarked
##        0        0        0      177        0        0        2
## Imputation methods:
## Survived   Pclass      Sex      Age    SibSp     Fare Embarked
##     "rf"     "rf"     "rf"     "rf"     "rf"     "rf"     "rf"
## VisitSequence:
##      Age Embarked
##        4        7
## PredictorMatrix:
##          Survived Pclass Sex Age SibSp Fare Embarked
## Survived        0      0   0   0     0    0        0
## Pclass          0      0   0   0     0    0        0
## Sex             0      0   0   0     0    0        0
## Age             1      1   1   0     1    1        1
## SibSp           0      0   0   0     0    0        0
## Fare            0      0   0   0     0    0        0
## Embarked        1      1   1   1     1    1        0
## Random generator seed value:  NA
df <- complete(imputer_hold)
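As one last check before modeling, the completed data frame should contain no missing values at all (a quick aside; output omitted):

# The imputed data should have zero NAs remaining
sum(is.na(df))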

Now we are all set to build our modeling for loop.


Modeling with caret

Now that the data is formatted correctly and the missing values are imputed, it is time to do some predictive modeling. Through caret, we can use over 200 different machine learning models within a single, consistent framework. Let’s load the caret library and take a look at the four models we will be using.

library(caret)

models <- c('ranger','kknn', 'glmnet', 'nb')

The first model we will use is ‘ranger’, a random forest implementation that is a little easier on processors than ‘rf’. The next is ‘kknn’, an implementation of k-nearest neighbors. After that comes ‘glmnet’, a generalized linear model with regularization. Finally we have our simplest statistical model, naive Bayes, specified by ‘nb’.

To see more information about these algorithms and other caret-friendly algorithms, check out the available models section of the caret documentation. One important note is that much of the hyperparameter tuning is done automatically in caret: for each model, caret chooses the hyperparameter values that perform best on the chosen metric.
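If you are curious which hyperparameters caret tunes for a given method, the modelLookup() helper lists them; here it is shown for ‘ranger’ as an illustration (the exact parameter names can vary between caret versions):

# List the tunable hyperparameters for the ranger method
modelLookup('ranger')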

Now let’s build our control object for caret. This object specifies how we want our models to be trained and evaluated. We first specify the resampling method, cross-validation, and the number of folds we will use. Because we have a binary target variable, we use twoClassSummary as the summary function. With the last two parameters we ask caret to compute class probabilities and to save the hold-out predictions from each resample.

myControl <- trainControl(method = 'cv',
                          number = 10,
                          summaryFunction = twoClassSummary,
                          classProbs = TRUE,
                          savePredictions = TRUE)

Alright, now it is time to run some statistical models. We use a for loop to iterate over our vector called models, making a few more specifications as shown below.

for (model in models){

  # This funny-looking function creates a different variable for each fit by attaching our model string to 'model_'
  assign(paste('model', model, sep = '_'),

         # train() fits the model
         train(

           # We specify the model formula (Survived against everything else) and the data source
           Survived ~ ., df,

           # We center and scale our data for more accurate results
           preProcess = c('center', 'scale'),

           # ROC serves as a robust measure of accuracy in binary classification
           metric = 'ROC',

           # We use the vector we created earlier to specify the model
           method = model,

           # For control, we use the object we created in the previous code block
           trControl = myControl
         )
  )
}
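As an aside, if the assign()/paste() pattern feels awkward, the same loop can fill a named list directly, which also feeds straight into the model comparison later. This is just an alternative sketch under the same setup, not what the rest of the tutorial uses:

# Collect the fits in a named list instead of creating separate variables
model_fits <- list()
for (model in models){
  model_fits[[model]] <- train(Survived ~ ., df,
                               preProcess = c('center', 'scale'),
                               metric = 'ROC',
                               method = model,
                               trControl = myControl)
}
# The list of fits could then be passed straight to resamples(model_fits)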

Very nice! We have our models, but which one is the best?

We can use caret’s resamples() summary and box-and-whisker plot to find out. It appears that the random forest performed the best.

model_list <- list(Linear_model = model_glmnet, K_neighbors = model_kknn,
                   Naive_bayes = model_nb, Random_forest = model_ranger)

resamples <- resamples(model_list)

summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: Linear_model, K_neighbors, Naive_bayes, Random_forest
## Number of resamples: 10
##
## ROC
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Linear_model  0.8050802 0.8239801 0.8652406 0.8526209 0.8729253 0.8971429
## K_neighbors   0.7965241 0.8225974 0.8481283 0.8498767 0.8711898 0.9035948
## Naive_bayes   0.7187013 0.7864973 0.8206551 0.8121394 0.8416444 0.8737968
## Random_forest 0.8352941 0.8606455 0.8788770 0.8816050 0.8876679 0.9475936
##               NA's
## Linear_model     0
## K_neighbors      0
## Naive_bayes      0
## Random_forest    0
##
## Sens
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Linear_model  0.6764706 0.6838235 0.7100840 0.7280672 0.7592437 0.8235294
## K_neighbors   0.5882353 0.6544118 0.7000000 0.7076471 0.7647059 0.8235294
## Naive_bayes   0.7647059 0.8058824 0.8529412 0.8505882 0.8750000 1.0000000
## Random_forest 0.5882353 0.6838235 0.7100840 0.7310924 0.7962185 0.8529412
##               NA's
## Linear_model     0
## K_neighbors      0
## Naive_bayes      0
## Random_forest    0
##
## Spec
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Linear_model  0.7818182 0.8545455 0.8715488 0.8688552 0.8909091 0.9454545
## K_neighbors   0.7272727 0.8090909 0.8636364 0.8488889 0.8904040 0.9272727
## Naive_bayes   0.4181818 0.5318182 0.5454545 0.5556229 0.5898990 0.7454545
## Random_forest 0.8545455 0.8727273 0.8909091 0.8907407 0.9086700 0.9272727
##               NA's
## Linear_model     0
## K_neighbors      0
## Naive_bayes      0
## Random_forest    0
bwplot(resamples, metric = 'ROC')


Want to learn more?

For a deeper understanding of dplyr and other great data manipulation tools, I highly recommend the book R for Data Science.

The mice package has online documentation covering all of its functionality.

For an excellent, interactive introduction to caret, see DataCamp’s Machine Learning Toolbox course. For a deeper understanding of the latest version of caret, check out caret’s superb documentation. For more information about machine learning in general, two great free books on the subject are An Introduction to Statistical Learning with Applications in R and R and Data Mining: Examples and Case Studies.