Predicting How Subjects Did Exercises

Synopsis

In this report, we describe a model built to predict how subjects did various exercises. This report also explains the use of cross validation, the expected out of sample error, and the choices made in building this model.

Data Processing

The model uses activity monitoring data from http://groupware.les.inf.puc-rio.br/har. We begin by downloading the training data, which we use to build the model and to estimate its error. We also download the testing data, to which we later apply our model; the model predicted all twenty cases in the testing set correctly (according to feedback from the submission page on Coursera).

fileUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile="pml-training.csv", method="curl")
fileUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile="pml-testing.csv", method="curl")
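# stringsAsFactors=TRUE keeps classe and user_name as factors, which randomForest
# needs for classification; since R 4.0 read.csv no longer does this by default
testing<-read.csv("pml-testing.csv", stringsAsFactors=TRUE)
training<-read.csv("pml-training.csv", stringsAsFactors=TRUE)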
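
As an aside, missing values can also be flagged while the file is read, rather than counted afterwards. The sketch below uses a small inline CSV instead of the real files; the column values, the subject labels, and the presence of a spreadsheet-style "#DIV/0!" marker are assumptions made for illustration and are not checked anywhere in this report.

```r
# Toy CSV standing in for the real data (hypothetical values).
csv_text <- "subject,roll_belt,kurtosis_roll_belt,classe
s1,1.41,,A
s2,1.42,#DIV/0!,B
s3,1.48,NA,A"

# Treat "NA", empty fields, and "#DIV/0!" as missing while reading.
toy <- read.csv(text = csv_text,
                na.strings = c("NA", "", "#DIV/0!"),
                stringsAsFactors = TRUE)

# Number of missing values per column.
colSums(is.na(toy))
```

With this approach a single NA count per column would cover both the empty and the missing entries.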

Machine Learning Algorithm, Cross Validation, and Estimation of Error

In this section, we build a machine learning algorithm. Then we use cross validation and estimate the error to evaluate the effectiveness of our model.

First we load several packages that we will use in our analysis.

library(caret)
library(randomForest)
library(psych)

Selecting Predictors

We partition our training data into a test set and a train set. The train set contains 60 percent of the training data, and the test set contains the other 40 percent of the training data.

inTrain<-createDataPartition(y=training$classe, p=0.6, list=FALSE)
train<-training[inTrain,]
test<-training[-inTrain,]
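
The split is stratified by classe, so each class keeps roughly the same share in both pieces. The following is a rough base R sketch of that idea on made-up labels; it is not caret's implementation, and the seed and class sizes are arbitrary choices for illustration. Since the partition is random, setting a seed before the split is also what would make the figures in this report repeatable.

```r
# Toy stratified 60/40 split in base R (illustrative only).
set.seed(1)  # arbitrary seed so the sketch is repeatable
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Within each class, sample 60% of the row indices.
idx_by_class <- split(seq_along(labels), labels)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE)

train_labels <- labels[in_train]
test_labels  <- labels[-in_train]

# Class proportions are the same in both pieces.
prop.table(table(train_labels))
prop.table(table(test_labels))
```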

We next find the columns of train that consist primarily of empty strings or missing values (NA), and we eliminate them. The reason for eliminating them is that they are likely to be poor predictors (since they contain no information for most rows of the data set). As the output below shows, every column with at least one missing or empty entry is about 98% missing or empty.

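# describe() from psych reports n, the number of non-missing values in each column
traindescribe<-describe(train)
nrows<-nrow(train)
unique(traindescribe$n)/nrows
## [1] 1.00000 0.01987
# keep only the columns with no missing values
trainuse<-train[,traindescribe$n==nrows]
numcols<-ncol(trainuse)
# count the empty strings in each remaining column
blankcount<-rep(NA, numcols)
for(i in seq_len(numcols)){
  blankcount[i]<-sum(trainuse[,i]=="")
}
unique(blankcount/nrows)
## [1] 0.0000 0.9801
# keep only the columns with no empty strings
trainuse2<-trainuse[,blankcount==0]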
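
The same filtering can be done in one pass. Here is a minimal sketch on a made-up data frame (the column names and values are hypothetical) that computes, for each column, the fraction of entries that are either NA or an empty string, and then keeps only the fully populated columns.

```r
# Toy data frame: two complete columns, one mostly NA, one mostly blank.
toy <- data.frame(
  roll     = c(1.2, 3.4, 5.6, 7.8, 9.0),
  kurtosis = c(NA, NA, NA, NA, 0.5),
  skewness = c("", "", "", "", "0.1"),
  classe   = c("A", "B", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# Fraction of entries per column that are NA or an empty string.
frac_missing <- vapply(toy,
                       function(col) mean(is.na(col) | col == ""),
                       numeric(1))
frac_missing

# Keep only the columns with no missing or blank entries.
toy_complete <- toy[, frac_missing == 0, drop = FALSE]
names(toy_complete)
```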

We determine which columns remain.

names(trainuse2)
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

We now eliminate columns 1, 3, 4, 5, 6, and 7 (the row index X, the three timestamp columns, new_window, and num_window), since they do not describe the motions that the subjects performed. (We acknowledge the possibility that a subject’s movement is time dependent, since their body might adapt to the exercises over time. As we will see from our error estimate later in this report, though, we obtain a good model without the timestamp data.) While a subject’s name does not encode information about a specific exercise, we have left user_name in the data set to account for the possibility that differences in technique between users might cause differences in measurements. (Since the testing set covers the same subjects, this data is potentially relevant.)

trainuse3<-trainuse2[,c(2, 8:60)]
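
Selecting columns by position works here, but it breaks silently if the column order ever changes. A sketch of the same step done by name follows; the toy data frame is hypothetical, while the names in drop_cols are the six columns listed in the output above.

```r
# Toy data frame with a few of the columns named in this report.
toy <- data.frame(
  X = 1:3,
  user_name = c("u1", "u2", "u3"),
  raw_timestamp_part_1 = c(10, 20, 30),
  num_window = c(5, 5, 6),
  roll_belt = c(1.4, 1.5, 1.6),
  classe = c("A", "A", "B")
)

# Columns to drop, identified by name rather than by position.
drop_cols <- c("X", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window")

# setdiff() ignores names in drop_cols that are absent from the data.
keep <- setdiff(names(toy), drop_cols)
toy_use <- toy[, keep, drop = FALSE]
names(toy_use)
```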

Random Forest, Cross Validation, and Estimation of Error

We use the randomForest function in R to build a random forest model. By default, randomForest grows 500 trees. Each tree is grown on a bootstrap sample of the data (drawn with replacement), which contains about two thirds of the distinct rows; the remaining third, the out of bag rows, is used to test that particular tree. Thus the algorithm performs a form of cross validation internally. Aggregating over the 500 trees, the algorithm computes an out of bag (oob) error estimate (see note 1 at the end of this report).
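
The "about two thirds" figure can be checked directly. The sketch below assumes n = 11776 rows in train (the total of the confusion matrix reported in this section) and an arbitrary seed; it compares the theoretical share of distinct rows in a bootstrap sample with one simulated draw.

```r
n <- 11776  # assumed size of train, taken from the confusion matrix totals

# Probability that a given row appears at least once in a bootstrap sample.
p_in_bag <- 1 - (1 - 1 / n)^n
p_in_bag  # close to 1 - exp(-1), roughly 0.632

# One simulated bootstrap sample of row indices.
set.seed(42)  # arbitrary seed for repeatability
boot <- sample.int(n, size = n, replace = TRUE)
frac_in_bag <- length(unique(boot)) / n
frac_in_bag

# The rows never drawn form the out of bag set for that tree.
frac_oob <- 1 - frac_in_bag
frac_oob
```

So each tree is tested on a little over a third of the rows, none of which it saw during fitting.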

rforest<-randomForest(classe~., data=trainuse3)
rforest
## 
## Call:
##  randomForest(formula = classe ~ ., data = trainuse3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.63%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3345    3    0    0    0   0.0008961
## B   10 2263    6    0    0   0.0070206
## C    0   21 2030    3    0   0.0116845
## D    0    0   23 1906    1   0.0124352
## E    0    0    2    5 2158   0.0032333
plot(rforest)

From the output directly above, we see that the classification error is under 1.25% for each class. The confusion matrix also illustrates the high accuracy of this model. The out of bag (oob) estimate of the error rate is also low: 0.63%. The plot produced by plot(rforest) illustrates how rapidly the error rate drops as the number of trees in the model increases.
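
The error figures in the output can be recomputed from the confusion matrix alone. The sketch below retypes the matrix from the output (rows are the actual classes, columns the predicted ones) and derives the per-class and overall oob error rates.

```r
# Confusion matrix copied from the randomForest output in this report.
conf <- matrix(
  c(3345,    3,    0,    0,    0,
      10, 2263,    6,    0,    0,
       0,   21, 2030,    3,    0,
       0,    0,   23, 1906,    1,
       0,    0,    2,    5, 2158),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(conf) / rowSums(conf)
round(class_error, 7)

# Overall oob error: share of all rows that were misclassified.
oob_error <- 1 - sum(diag(conf)) / sum(conf)
round(100 * oob_error, 2)  # 0.63, matching the reported 0.63%
```

Only 74 of the 11776 rows are misclassified, and nearly all of those fall into a neighbouring class.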

Even though the randomForest function validates the model internally through the oob estimate, we still test our model, rforest, on the data set test (i.e. the 40% of the training set that we partitioned off near the beginning of this report). Since test played no part in selecting the predictors or fitting the model, this gives us an independent estimate of the out of sample error.

pred<-predict(rforest, test)
sum(pred==test$classe)/nrow(test)
## [1] 0.9945

The model predicts the correct classe 99.45% of the time on test; equivalently, the estimated out of sample error rate is 0.55%. This agrees closely with the oob estimate of 0.63%, so the model predicts well.
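
For readers who want more than a single accuracy figure, the same comparison can be tabulated per class. The sketch below uses made-up label vectors rather than the real predictions; it computes accuracy as the mean of the matches, which is equivalent to dividing the number of matches by the number of rows, and cross-tabulates predicted against actual classes.

```r
# Hypothetical labels standing in for test$classe and the predictions.
actual    <- factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                    levels = c("A", "B", "C", "D"))
predicted <- factor(c("A", "A", "B", "C", "C", "C", "C", "D"),
                    levels = levels(actual))

# Accuracy: share of cases where the prediction matches the truth.
accuracy <- mean(predicted == actual)
accuracy

# Estimated out of sample error rate.
error_rate <- 1 - accuracy
error_rate

# Cross-tabulation showing where the mistakes fall.
conf_table <- table(predicted, actual)
conf_table
```

The caret package, already loaded above, also provides confusionMatrix(pred, test$classe), which reports the accuracy together with a confidence interval and per-class statistics.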

Results

We used our model to predict the 20 cases in the testing data (using the predict function in R) and submitted the results to Coursera. All 20 cases were predicted correctly. (This is unsurprising, given the low estimated out of sample error rate of 0.55%: if the 20 cases were independent, the chance of predicting all of them correctly would be roughly 0.9945^20, or about 90%.)

predict(rforest, testing)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

  1. We found the following reference helpful for understanding randomForest: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm