Introduction

Classification is a common goal of supervised learning. In classification, the response variable is typically categorical, taking one of a finite number of possible outcomes. Relevant information from a set of predictor variables is used to classify an observation into one of the response variable’s categories. \(k\)-Nearest Neighbors (\(k\)-NN) is a popular machine learning technique for classification. The \(k\)-NN method is intuitive, non-parametric, and makes very few assumptions about the form of the relationship between the predictors and the response.
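
Concretely, to classify a new observation with predictor values \(x_0\), \(k\)-NN measures how far \(x_0\) is from every observation in the training data, usually with the Euclidean distance

\[
d(x_0, x_i) = \sqrt{\sum_{j=1}^{p} \left(x_{0j} - x_{ij}\right)^2},
\]

keeps the \(k\) training observations with the smallest distances, and assigns \(x_0\) to the category that is most common among those \(k\) neighbors (a majority vote). This is exactly the procedure we will carry out step by step below.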

The difficult part of this process is choosing an appropriate \(k\). If \(k\) is too small, predictions may be unstable; if \(k\) is too large, predictions may be persistently biased. Some form of cross-validation on the training data is usually used to select the best \(k\) so that we can confidently classify new observations.

Although our focus is on a binary response variable, \(k\)-NN generalizes easily to responses with more than two classes; unfortunately, \(k\)-NN is already computationally intensive and requires even more training data in that situation. The method also becomes less reliable and more difficult to apply as the number of input variables grows, especially when some of the input variables are irrelevant. Furthermore, since \(k\)-NN is sensitive to the scale of the input variables, all predictor variables should be standardized before computing distances.
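
As a minimal sketch of what standardization does (using a toy data frame rather than the Titanic data), the built-in scale() function converts each column to z-scores; Part 3 of this tutorial performs the same calculation by hand with mutate().

# Toy example (not part of the Titanic pipeline): two predictors measured on
# very different scales are placed on a common z-score scale
toy=data.frame(Family=c(0,1,4,2),Fare=c(7.25,71.28,512.33,26.00))
scale(toy)   # each column now has mean 0 and standard deviation 1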

The data contained in library(titanic) comes from an ongoing competition at Kaggle.com. In this tutorial, we will develop an understanding of \(k\)-NN by attempting to predict whether a passenger survived or died given information known prior to the fatal iceberg collision. The data previewed below displays what we know about 891 different passengers who boarded the ill-fated ship. For all of these passengers, we know whether or not they lived based on the variable Survived. If Survived==1, we know the passenger lived to see better days. In this dataset, 342 passengers survived and the rest perished. The variables SibSp (# of siblings or spouses aboard), Parch (# of parents or children aboard), and Fare (ticket price in $) represent the information we want to use to classify passengers into one of two categories: Survived=1 or Died=0.
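
The code throughout assumes the following packages are attached; if they were not loaded in an earlier setup chunk, run:

library(titanic)   # titanic_train and titanic_test data frames
library(dplyr)     # mutate(), select(), arrange(), filter(), and %>%
library(ggplot2)   # plotting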

# Keep only the response (Survived) and the predictors we plan to use
T1=titanic_train[,c("Survived","SibSp","Parch","Fare")]
head(T1)
##   Survived SibSp Parch    Fare
## 1        0     1     0  7.2500
## 2        1     1     0 71.2833
## 3        1     0     0  7.9250
## 4        1     1     0 53.1000
## 5        0     0     0  8.0500
## 6        0     0     0  8.4583
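
As a quick check of the class balance described above, tabulating the response confirms the split between survivors and non-survivors:

# 342 of the 891 passengers survived; the remaining 549 did not
table(T1$Survived)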

For more motivation, consider these killer quotes:

I will go down with this ship. I won’t put my hands up and surrender.

There will be no white flag above my door. I’m in love with k-NN and always will be.

Doctor Dido

Oh no, not I, I will survive.

Oh, as long as I know k-NN, I know I’ll stay alive.

Doctor Gloria Gaynor

#Part 1: Feature Engineering and Visualization

##Chunk 1: Creation of a New Variable Family

T2 = T1 %>%
      mutate(Family=SibSp+Parch) %>%
      select(-SibSp,-Parch)
head(T2)

##Chunk 2: Visualizing the Relationship

ggplot(T2) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  theme_minimal()+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

#Part 2: \(k\)-NN For Prediction

##Chunk 1: Information for Alice

# Alice: a new passenger with Family = 3 and Fare = $100
Alice=c(3,100)
ggplot(T2) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8,alpha=0.1)+
  theme_minimal()+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

##Chunk 2: Finding the \(k=5\) Most Similar Passengers

k=5

# Euclidean distance between two points (numeric vectors of equal length)
dist.func=function(point1,point2){
  dist=sqrt(sum((point1-point2)^2))
  return(dist)
}

# Compute each passenger's distance to Alice, then keep the k closest
T3=T2 %>% 
      mutate(d=apply(select(T2,Family,Fare),1,dist.func,point2=Alice)) %>%
      arrange(d) %>%
      filter(rank(d,ties.method="min")<=k)

print(T3)
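
As a quick aside, a sanity check of dist.func on the familiar 3-4-5 right triangle confirms that it returns the ordinary Euclidean distance:

# Sanity check: the distance from (0,0) to (3,4) should be 5
dist.func(c(0,0),c(3,4))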

##Chunk 3: Visualize Alice’s \(k=5\) Nearest Neighbors

ggplot(T3) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)))+
  geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(min(T2$Family),max(T2$Family))+
  ylim(min(T2$Fare),max(T2$Fare))+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

#Part 3: Transform and Revisit \(k\)-NN

##Chunk 1: Standardizing and Visualizing Data

# Means and standard deviations of the two predictors in the training data
mean.Family=mean(T2$Family)
sd.Family=sd(T2$Family)
mean.Fare=mean(T2$Fare)
sd.Fare=sd(T2$Fare)

# Convert both predictors to z-scores
ST3= T2 %>% 
        mutate(Family=(Family-mean.Family)/sd.Family,
               Fare=(Fare-mean.Fare)/sd.Fare)

# Standardize Alice using the same training means and standard deviations
Z.Alice=(Alice-c(mean.Family,mean.Fare))/c(sd.Family,sd.Fare)
ggplot(ST3) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  xlab("Standardized Family")+
  ylab("Standardized Fare")+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))
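
As a quick check (an aside, not part of the chunk above), the standardized predictors in ST3 should now have mean approximately 0 and standard deviation 1:

# Each standardized column should have mean ~0 and standard deviation ~1
sapply(select(ST3,Family,Fare),mean)
sapply(select(ST3,Family,Fare),sd)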

##Chunk 2: The \(k=5\) Most Similar Passengers Again

ST4=ST3 %>% 
      mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
      arrange(d) %>%
      filter(rank(d,ties.method="min")<=k)

print(ST4 %>%
        mutate(Family=sd.Family*Family+mean.Family,
               Fare=sd.Fare*Fare+mean.Fare)
      )

ggplot(ST4) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))
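
With Alice’s five nearest neighbors identified on the standardized scale, the \(k\)-NN prediction for Alice is simply a majority vote among them. A short aside shows how to read that vote off directly:

# Vote totals among Alice's k=5 standardized nearest neighbors;
# the most common class is the k-NN prediction for Alice
table(ST4$Survived)
names(which.max(table(ST4$Survived)))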

#Part 4: Tuning \(k\) for \(k\)-NN

##Chunk 1: Predicting Alice’s Survival for Large \(k\)

# Repeat the nearest-neighbor search for Alice with a very large k
k=500
ST5=ST3 %>% 
  mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
  arrange(d) %>%
  filter(rank(d,ties.method="min")<=k)

ggplot(ST5) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

# Vote totals among Alice's 500 nearest neighbors
KNN.PREDICT=table(ST5$Survived)
print(KNN.PREDICT)

##Chunk 2: Leave-One-Out CV for Different \(k \in \{1,2,\cdots, 250\}\)

library(class)   # provides knn() and knn.cv()
possible.k=1:250
accuracy.k=rep(NA,250)

# Leave-one-out CV: for each k, classify every passenger using its k nearest
# neighbors among the remaining passengers, then record the overall accuracy
for(k in 1:250){
  cv.out=knn.cv(train=select(ST3,Family,Fare),
                cl=factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")),
                k=k)
  correct=mean(cv.out==factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")))
  accuracy.k[k]=correct
}

ggplot(data=tibble(possible.k,accuracy.k)) +
  geom_line(aes(x=possible.k,y=accuracy.k),color="lightskyblue2",size=2) +
  theme_minimal() +
  xlab("Choice of k") +
  ylab("Percentage of Accurate Predictions") +
  theme(text=element_text(size=20))

##Chunk 3: Using Optimal \(k\)-NN for Prediction

# k with the highest leave-one-out CV accuracy
best.k=which.max(accuracy.k)

# Prepare the test set: engineer Family, then standardize with the
# training means and standard deviations
TEST = titanic_test[,c("SibSp","Parch","Fare")] %>%
        mutate(Family=SibSp+Parch) %>%
        select(-SibSp,-Parch) %>%
        mutate(Fare=(Fare-mean.Fare)/sd.Fare,
               Family=(Family-mean.Family)/sd.Family) %>%
        na.omit()

# Classify each test passenger with the tuned k-NN, then return the
# predictors to their original scale for plotting
TEST2 = TEST %>% 
          mutate(Predict=knn(train=select(ST3,Family,Fare),
                             test=select(TEST,Family,Fare),
                             cl=factor(ST3$Survived,levels=c(0,1),
                                       labels=c("Died","Survived")),
                             k=best.k)) %>%
          mutate(Family=sd.Family*Family+mean.Family,
                 Fare=sd.Fare*Fare+mean.Fare)

ggplot(TEST2) +
  geom_point(aes(x=Family,y=Fare,color=Predict),
             size=2,alpha=0.3) +
  theme_minimal() +
  theme(text=element_text(size=20))
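
As a final aside, a quick tabulation summarizes how many test-set passengers the tuned \(k\)-NN classifies into each category:

# Counts of predicted classes among the test-set passengers
table(TEST2$Predict)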