About the Experiment
For this experiment I will be using the iris data set. Below is a quick look at the data set for the uninitiated.
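data(iris)
# Keep only two species so that the tests below compare two groups
iris <- iris[iris$Species != "virginica", ]
# Re-create the factor to drop the now-unused "virginica" level
iris$Species <- factor(iris$Species)
head(iris)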
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Typically, the iris data set in R is used to predict the Species based on all other features.
However, before building a prediction model it is always good practice to explore the relationship between the dependent and independent variables. Below is what you can expect from this post.
- Convert the Petal.Width column to a categorical variable
- Drop the Petal.Width column
- Perform a chi-square test and interpret the results
- Perform a t-test and interpret the results
- Conclusion
So let’s get started.
1. Convert the Petal.Width column to a categorical variable
There are multiple ways to convert a continuous variable to a categorical variable. Before we do that let’s look at some descriptive statistics of this variable.
- There are a total of 100 observations
- The range of these observations is 0.1 to 1.8
- Below is the summary
summary(iris$Petal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.200 0.800 0.786 1.300 1.800
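The figures in the list above can be reproduced directly. The snippet below is only a small sketch: it works on a copy called demo (a name made up for this illustration) so that the iris object used in the rest of the post is left untouched.

```r
# `demo` is a hypothetical name; it is a fresh two-species copy of the data
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

n_obs     <- length(demo$Petal.Width)  # number of observations
pw_range  <- range(demo$Petal.Width)   # smallest and largest value
pw_median <- median(demo$Petal.Width)  # the cut point used in the next step

n_obs
pw_range
pw_median
```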
We will break the variable into two categories, Below and Above the median value.
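# Split at the 0%, 50% and 100% quantiles; include.lowest = TRUE keeps the minimum value in the first bin
iris$Petal.Width.Cat <- cut(iris$Petal.Width, breaks = quantile(iris$Petal.Width, probs = seq(0, 1, 0.5)), include.lowest = TRUE)
# Replace the interval labels with readable names
levels(iris$Petal.Width.Cat) <- c("Below", "Above")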
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Petal.Width.Cat
## 1 Below
## 2 Below
## 3 Below
## 4 Below
## 5 Below
## 6 Below
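One way to sanity-check a median split is to count how many rows land in each category. The sketch below uses ifelse() instead of cut() purely as an alternative illustration, again on a hypothetical copy named demo. It assumes that values equal to the median should go to Below, which matches the right-closed intervals that cut() produces by default.

```r
# `demo` is a hypothetical copy, so the post's `iris` object is not modified
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

med <- median(demo$Petal.Width)

# Values less than or equal to the median are labelled "Below", the rest "Above"
demo$Petal.Width.Cat <- factor(
  ifelse(demo$Petal.Width <= med, "Below", "Above"),
  levels = c("Below", "Above")
)

# Count the rows per category
cat_counts <- table(demo$Petal.Width.Cat)
cat_counts
```

A median split usually gives two groups of roughly equal size, so very unequal counts would suggest many ties at the median.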
2. Drop the Petal.Width column
Now we will drop the Petal.Width column with the code below.
iris <- iris[,!(names(iris) %in% "Petal.Width")]
head(iris)
## Sepal.Length Sepal.Width Petal.Length Species Petal.Width.Cat
## 1 5.1 3.5 1.4 setosa Below
## 2 4.9 3.0 1.4 setosa Below
## 3 4.7 3.2 1.3 setosa Below
## 4 4.6 3.1 1.5 setosa Below
## 5 5.0 3.6 1.4 setosa Below
## 6 5.4 3.9 1.7 setosa Below
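For a single column, assigning NULL is a common base R alternative to the %in% filter. The sketch below shows it on a hypothetical copy named demo rather than on the post's own iris object.

```r
# `demo` is a hypothetical copy used only for this illustration
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

# Assigning NULL removes the column in place
demo$Petal.Width <- NULL

remaining <- names(demo)
remaining
```

The %in% form is still handy when several columns have to be dropped at once, since it accepts a vector of names.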
3. Perform a few tests and inferential statistics
A. Species vs Petal.Width.Cat
Let's check whether there is any association between Species and Petal.Width.Cat. Since both are categorical variables, we will use a chi-square test. The starting point is to take a look at the contingency table.
cont <- table(iris$Petal.Width.Cat, iris$Species)
cont
##
## setosa versicolor
## Below 50 0
## Above 0 50
To perform the chi-square test we will assume the null hypothesis as below:
H0 : Petal.Width.Cat has no effect on Species (the two variables are independent)
Consequently, the alternative hypothesis is defined as below:
Ha : Petal.Width.Cat has some effect on Species (the two variables are associated)
We will perform the test using the chisq.test function in R.
Xsqt <- chisq.test(cont)
Xsqt
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont
## X-squared = 96.04, df = 1, p-value < 2.2e-16
As we can see from the results above, the p-value is 1.1258565 × 10^-22, which is far smaller than the threshold value of 5%. This enables us to safely reject the null hypothesis and accept the alternative hypothesis. In other words, Petal.Width.Cat has an impact on Species, which allows us to conclude that Petal.Width.Cat is a good predictor for Species.
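To see where the reported statistic comes from, the Yates-corrected chi-square value can be recomputed by hand from the observed and expected counts. This is only a sketch: demo, cont_demo, res, stat_by_hand and p_by_hand are names made up for the illustration, and the table is rebuilt from scratch so the block runs on its own.

```r
# Rebuild the two-species data and the median split on a hypothetical copy
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])
demo$Petal.Width.Cat <- factor(
  ifelse(demo$Petal.Width <= median(demo$Petal.Width), "Below", "Above"),
  levels = c("Below", "Above")
)

cont_demo <- table(demo$Petal.Width.Cat, demo$Species)
res <- chisq.test(cont_demo)

# Expected counts under independence: row total * column total / grand total
expected <- res$expected

# Yates' correction subtracts 0.5 from each |observed - expected| before squaring
stat_by_hand <- sum((abs(cont_demo - expected) - 0.5)^2 / expected)

# Upper-tail probability of the chi-square distribution with 1 degree of freedom
p_by_hand <- pchisq(stat_by_hand, df = 1, lower.tail = FALSE)

stat_by_hand
p_by_hand
res$p.value  # the exact p-value stored in the test object
```

The printed summary only shows "p-value < 2.2e-16", so reading the p.value element is the way to get the exact figure.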
B. Species vs Sepal.Length
Now, we will explore the relationship between Species and Sepal.Length. One is a categorical variable and the other is a continuous variable. To get a quick insight into whether one affects the other, we can look at the box plot.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot() + labs(title = "Species Vs Sepal.Length", x = "Species", y = "Sepal.Length") + theme_bw()
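As a numeric complement to the box plot, the quartiles that define each box can be computed per species. The sketch below uses a hypothetical copy named demo. It separates two questions that are easy to conflate: whether the boxes (interquartile ranges) overlap, and whether the full ranges overlap.

```r
# `demo` is a hypothetical two-species copy of the data
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

by_species <- split(demo$Sepal.Length, demo$Species)

# Rows are the 25%, 50% and 75% quantiles; columns are the species
q <- sapply(by_species, quantile, probs = c(0.25, 0.5, 0.75))
q

# TRUE when the upper edge of the setosa box lies below the lower edge of the versicolor box
boxes_separated <- q["75%", "setosa"] < q["25%", "versicolor"]

# TRUE when the largest setosa value exceeds the smallest versicolor value
ranges_overlap <- max(by_species$setosa) > min(by_species$versicolor)

boxes_separated
ranges_overlap
```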
As the boxes are clearly separated with no overlap, we can infer that Sepal.Length can fairly estimate the Species. Let's corroborate this with a simple t-test.
For the test, below are the hypotheses:
Null Hypothesis H0 : Sepal.Length has no effect on Species. That is to say, the mean Sepal.Length values observed for the two Species are not statistically different.
Alternative Hypothesis Ha : Sepal.Length has some effect on Species. That is to say, the mean Sepal.Length values observed for the two Species are in fact different from each other.
x <- iris[iris$Species == "setosa", ]$Sepal.Length
y <- iris[iris$Species == "versicolor", ]$Sepal.Length
tt <- t.test(x, y, paired = FALSE, alternative = "two.sided", var.equal = FALSE)
tt
##
## Welch Two Sample t-test
##
## data: x and y
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1057074 -0.7542926
## sample estimates:
## mean of x mean of y
## 5.006 5.936
As we can observe from the results above, the p-value, which is 3.7467426 × 10^-17, is significantly lower than our threshold of 5%. Hence, we can reject the null hypothesis and accept the alternative.
As per the alternative hypothesis, the Sepal.Length values for the two Species are statistically different from each other, i.e. the values are clearly separated across the different classes of Species. Hence Sepal.Length can be a good predictor.
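The same Welch test can also be written with the formula interface of t.test, and the individual results can be read from the returned object instead of the printed summary. The sketch below uses hypothetical names (demo, tt_demo) and relies on the t.test defaults of a two-sided, unequal-variance test.

```r
# `demo` is a hypothetical two-species copy of the data
demo <- droplevels(datasets::iris[datasets::iris$Species != "virginica", ])

# Formula interface: response ~ grouping factor with exactly two levels
tt_demo <- t.test(Sepal.Length ~ Species, data = demo)

p_val <- tt_demo$p.value    # exact p-value
ci    <- tt_demo$conf.int   # 95% confidence interval for the difference in means
means <- tt_demo$estimate   # group means (setosa, then versicolor)

p_val
ci
means
```

The confidence interval excludes zero, which is another way of reading the same rejection of the null hypothesis.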
Conclusion
In this article we have seen how to practically perform a chi-square test and a t-test when exploring the data. These tools are especially handy for understanding the relationship between dependent and independent variables before building machine learning models. Depending on the results of the various tests, we can choose which variables to include in or exclude from our prediction models to enhance their performance.
December 21, 2016
Aditya SV