About the Experiment

For this experiment I will be using the iris data set, restricted to two species (setosa and versicolor). Below is a quick look at the data for the uninitiated.

data(iris)
# Keep only setosa and versicolor, then rebuild the factor to drop the unused "virginica" level
iris <- iris[iris$Species != "virginica", ]
iris$Species <- factor(iris$Species)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Typically, the iris data set in R is used to predict Species from all the other features.

However, before building a prediction model it is always good practice to explore the relationship between the dependent and independent variables. Below is what you can expect from this post.

  1. Convert the Petal.Width column to a categorical variable
  2. Drop the Petal.Width column
  3. Perform a chi-square test and interpret the results
  4. Perform a t-test and interpret the results
  5. Conclusion

So let’s get started.

1. Convert the Petal.Width column to a categorical variable

There are multiple ways to convert a continuous variable to a categorical variable. Before we do that let’s look at some descriptive statistics of this variable.

  1. There are a total of 100 observations
  2. The range of these observations is 0.1 to 1.8
  3. Below is the summary
summary(iris$Petal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.200   0.800   0.786   1.300   1.800

We will split the variable into two categories, Below and Above, at the median value.

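# Break points are the minimum, median and maximum; include.lowest = TRUE keeps the minimum in the first bin
iris$Petal.Width.Cat <- cut(iris$Petal.Width,
                            breaks = quantile(iris$Petal.Width, probs = seq(0, 1, 0.5)),
                            include.lowest = TRUE)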
levels(iris$Petal.Width.Cat) <- c("Below", "Above")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##   Petal.Width.Cat
## 1           Below
## 2           Below
## 3           Below
## 4           Below
## 5           Below
## 6           Below
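As a quick sanity check on the split, here is a minimal sketch that redoes it with a plain comparison against the median and confirms that it agrees with the cut() version. The names two_species, width_ifelse and width_cut are illustrative only, and the sketch works on a fresh copy of the data so it does not touch the iris object used in this post.

```r
# Work on a fresh two-species copy (illustrative name, not from the post)
two_species <- datasets::iris[datasets::iris$Species != "virginica", ]

# Median split with a plain comparison: values at or below the median are "Below"
med <- median(two_species$Petal.Width)
width_ifelse <- factor(ifelse(two_species$Petal.Width <= med, "Below", "Above"),
                       levels = c("Below", "Above"))

# The cut() approach used in the post, for comparison
width_cut <- cut(two_species$Petal.Width,
                 breaks = quantile(two_species$Petal.Width, probs = seq(0, 1, 0.5)),
                 include.lowest = TRUE)
levels(width_cut) <- c("Below", "Above")

# Both approaches should label every row the same way
same_labels <- identical(as.character(width_ifelse), as.character(width_cut))
same_labels

# Number of observations in each category
table(width_ifelse)
```

The comparison form is easier to read for a two-way split, while cut() generalises more naturally to quartiles or other break points.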

2. Drop Petal.Width column

Now we will drop the Petal.Width column using the code below.

iris <- iris[,!(names(iris) %in% "Petal.Width")]
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Species Petal.Width.Cat
## 1          5.1         3.5          1.4  setosa           Below
## 2          4.9         3.0          1.4  setosa           Below
## 3          4.7         3.2          1.3  setosa           Below
## 4          4.6         3.1          1.5  setosa           Below
## 5          5.0         3.6          1.4  setosa           Below
## 6          5.4         3.9          1.7  setosa           Below
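There is more than one way to drop a column in base R. The sketch below shows two common alternatives next to the logical-index form used above. It runs on a fresh copy of the data, and the names df, by_null, by_subset and by_index are illustrative only.

```r
# Fresh two-species copy so the post's iris object is left alone
df <- datasets::iris[datasets::iris$Species != "virginica", ]

# Option 1: assign NULL to the column
by_null <- df
by_null$Petal.Width <- NULL

# Option 2: negative selection with subset()
by_subset <- subset(df, select = -Petal.Width)

# Option 3: the logical-index form used in this post
by_index <- df[, !(names(df) %in% "Petal.Width")]

# All three leave the same remaining columns
names(by_null)
identical(names(by_null), names(by_subset))
identical(names(by_null), names(by_index))
```

One caveat with the index form: if only a single column remains, `[` returns a vector unless drop = FALSE is passed.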

3. Perform a few tests and inferential statistics

A. Species Vs Petal.Width.Cat

Let's check whether there is any association between Species and Petal.Width.Cat. Since both are categorical variables, we will use a chi-square test. The starting point is to take a look at the contingency table.

cont <- table(iris$Petal.Width.Cat, iris$Species)
cont
##        
##         setosa versicolor
##   Below     50          0
##   Above      0         50

To perform the chi-square test we will assume the null hypothesis as below:

H0 : Petal.Width.Cat and Species are independent, i.e. Petal.Width.Cat has no effect on Species

Consequently, the alternative hypothesis is defined as below:

Ha : Petal.Width.Cat and Species are not independent, i.e. Petal.Width.Cat has some effect on Species

We will perform the test using the chisq.test function in R.

Xsqt <- chisq.test(cont)
Xsqt
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont
## X-squared = 96.04, df = 1, p-value < 2.2e-16

As we can see from the results above, the p-value is 1.1258565 × 10^{-22}, which is far smaller than the threshold value of 5%. This enables us to safely reject the null hypothesis and accept the alternative hypothesis. In other words, Petal.Width.Cat is associated with Species, which allows us to conclude that Petal.Width.Cat is a good predictor for Species.
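The printed output only shows "p-value < 2.2e-16", so here is a minimal sketch of where the statistic and the exact p-value come from. It re-enters the contingency table by hand and applies Yates' continuity correction, which chisq.test uses by default for 2 x 2 tables. The names observed, res, x2 and p_value are illustrative.

```r
# Contingency table re-entered from the counts shown earlier
observed <- matrix(c(50, 0, 0, 50), nrow = 2)

# Let chisq.test supply the expected counts and act as the reference result
res <- chisq.test(observed)
expected <- res$expected

# Chi-square statistic with Yates' continuity correction:
# shrink each |observed - expected| by 0.5 before squaring
x2 <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-square distribution with
# (rows - 1) * (columns - 1) = 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

c(statistic = x2, p.value = p_value)

# The hand computation should agree with chisq.test
all.equal(unname(res$statistic), x2)
all.equal(res$p.value, p_value)
```

In practice, Xsqt$p.value returns the same exact p-value directly from the fitted test object.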

B. Species Vs Sepal.Length

Now we will explore the relationship between Species and Sepal.Length. One is a categorical variable and the other is a continuous variable. To get a quick insight into whether one affects the other, we can look at a box plot.

library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Species Vs Sepal.Length", x = "Species", y = "Sepal.Length") +
  theme_bw()
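The same comparison can also be read off numerically. The sketch below prints the five-number summary and the mean of Sepal.Length for each species, which are the quantities a box plot is drawn from. It uses a fresh two-species copy of the data, and the names two_species and mean_by_species are illustrative.

```r
# Fresh two-species copy; droplevels() removes the unused "virginica" level
two_species <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

# Minimum, quartiles, mean and maximum of Sepal.Length for each species
tapply(two_species$Sepal.Length, two_species$Species, summary)

# Group means on their own
mean_by_species <- tapply(two_species$Sepal.Length, two_species$Species, mean)
mean_by_species
```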

As the box plots are clearly separated with no overlap, we can infer that Sepal.Length can fairly estimate the Species. Let's corroborate this with a simple t-test.

The hypotheses for the test are as follows:

Null Hypothesis H0 : Sepal.Length has no effect on Species. That is to say, the mean Sepal.Length values observed for the two Species are not statistically different.

Alternative Hypothesis Ha : Sepal.Length has some effect on Species. That is to say, the mean Sepal.Length values observed for the two Species are in fact different from each other.

x <- iris[iris$Species == "setosa", ]$Sepal.Length
y <- iris[iris$Species == "versicolor", ]$Sepal.Length
tt <- t.test(x, y, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tt
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y 
##     5.006     5.936

As we can observe from the results above, the p-value, which is 3.7467426 × 10^{-17}, is significantly lower than our threshold of 5%. Hence, we can reject the null hypothesis and accept the alternative.

As per the alternative hypothesis, the Sepal.Length values for the two Species are statistically different from each other, i.e. the values are clearly separated across the classes of Species. Hence Sepal.Length can be a good predictor.
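For readers who want to see what t.test does under the hood, the figures above can be reproduced by hand. The sketch below computes the Welch t statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value from the group means and variances. It uses a fresh copy of the data, and the names vx, vy, t_stat, df_welch, p_value and ref are illustrative.

```r
# Fresh two-species copy of the data
two_species <- datasets::iris[datasets::iris$Species != "virginica", ]
x <- two_species$Sepal.Length[two_species$Species == "setosa"]
y <- two_species$Sepal.Length[two_species$Species == "versicolor"]

# Each group's contribution to the squared standard error of the difference
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p.value = p_value)

# The hand computation should agree with t.test
ref <- t.test(x, y, var.equal = FALSE)
all.equal(unname(ref$statistic), t_stat)
all.equal(unname(ref$parameter), df_welch)
all.equal(ref$p.value, p_value)
```

Because var.equal = FALSE, the two group variances are never pooled, which is why the degrees of freedom come out as a non-integer value rather than 98.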

Conclusion

In this article we have seen how to perform a chi-square test and a t-test in practice when exploring data. These tools are especially handy for understanding the relationship between the dependent and independent variables before building machine learning models. Depending on the results of these tests, we can choose which variables to include in or exclude from our prediction models to enhance their performance.

December 21, 2016

Aditya SV