Introduction
It is very important to access SQL database from RStudio to carry on with the down stream analysis. Below is what you can expect from this post:
Accessing SQL server hosted on your local computer
Accessing SQL server hosted on a cloud, Azure Cloud in this case
We will be using R’s RODBC
package for doing this.
This post assumes that you already have a local SQL server running on your computer and know credentials to access Azure SQL server database.
So let’s get started.
1. Accessing SQL server hosted on your local computer
A. The first step to connect to a local SQL server is to create a data source. You can see the path to reach that point from the image below.
B. The click on “Add” to select a driver for the SQL Server. In this case we choose SQL Server and click on Finish.
C. Next we have to provide a name to the data source, a description and the server we want to connect to. Description is optional. After providing these details click Next.
D. In the next step we go ahead with Wndows NT authentication and click Next
E. Finally, change the default database to “AdventureWorksLT2012” and click Next till Finish
F. Thats it, we can see below that we have successfully created a new data source with the name “DemoData1”
G. Now we switch over to the RStudio and install and load the RODBC
package. Then create a odbcConnect
object with the data source as the input
library(RODBC)
# Create a connection object with data source as the input
con <- odbcConnect("DemoData1")
H. We can take a look at all the tables in the database or under a specific schema using sqlTables
function
tables <- sqlTables(con, schema = "SalesLT")
tables$TABLE_NAME
## [1] "Address" "Customer"
## [3] "CustomerAddress" "Product"
## [5] "ProductCategory" "ProductDescription"
## [7] "ProductModel" "ProductModelProductDescription"
## [9] "SalesOrderDetail" "SalesOrderHeader"
## [11] "vGetAllCategories" "vProductAndDescription"
## [13] "vProductModelCatalogDescription"
I. From here on, we can infact start accessing each of the tables and tranfering tables as data frame in to R environment. Lets try it out.
df <- sqlQuery(con,
"
SELECT *
FROM SalesLT.ProductCategory
")
J. We can take a look at all columns in a particular table
sqlColumns(con, "SalesLT.Product")$COLUMN_NAME
## [1] "ProductID" "Name"
## [3] "ProductNumber" "Color"
## [5] "StandardCost" "ListPrice"
## [7] "Size" "Weight"
## [9] "ProductCategoryID" "ProductModelID"
## [11] "SellStartDate" "SellEndDate"
## [13] "DiscontinuedDate" "ThumbNailPhoto"
## [15] "ThumbnailPhotoFileName" "rowguid"
## [17] "ModifiedDate"
K. We can write more complicated queries in RStudio to get the data in required format. For example in the the below example we can pull out aggregate product sales by joining 2 tables.
df <- sqlQuery(con,
"
SELECT p.Name, sum(sod.LineTotal) AS Revenue
FROM SalesLT.Product AS p
JOIN SalesLT.SalesOrderDetail AS sod
ON p.ProductID = sod.ProductID
GROUP BY p.Name
ORDER BY Revenue DESC
")
odbcClose(con) # Always remember to close connect
L. Once we have the data in R environment, we can perform various summary statistics and visualizations. Below is a bar plot using ggplot
package of top 10 products.Of course the possibilities are limitless!
library(ggplot2)
library(dplyr)
library(plotly)
df <- df[1:10,]
df$Name <- factor(df$Name, levels = df$Name[order(df$Revenue, decreasing = TRUE)])
ggplotly(df %>% ggplot(aes(x = Name, y = Revenue, fill = Name)) + geom_bar(stat = "identity") + labs(title = "Top 10 Products by Sales Revenue") + theme_bw() + theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()))
Now let’s move on to accessing cloud hosted SQL server from RStudio
2. Accessing SQL server hosted on a cloud
To access a SQL database which is hosted on a cloud we need the server name, database name, user id and password. The steps will be quite similar if you followed the experiment above.
A. First create a connection string which consists of Driver, Server, Database, User Id and Password. Then pass the connection string to `odbcDriverConnect
to create a connection to the server
connectionString <- "Driver=SQL Server;Server={Servern name here};Database={Database name here};Uid={User Id here};Pwd={Password here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
con <- odbcDriverConnect(connectionString)
B. Once the connection is established, we can view the tables in the database.
tables <- sqlTables(con, schema = "bi")
tables$TABLE_NAME
## [1] "date" "geo" "manufacturer" "product"
## [5] "salesFact" "sentiment"
C. Now we can explore various tables and pull out data using SQL queries for further analysis in R.
Let’s look at the columns in product
and salesFact
tables
sqlColumns(con, "bi.product")$COLUMN_NAME
## [1] "ProductID" "Product" "Category" "Segment"
## [5] "ManufacturerID" "Manufacturer"
sqlColumns(con, "bi.SalesFact")$COLUMN_NAME
## [1] "ProductID" "Date" "Zip" "Units" "Revenue"
Now, let’s say we want to look at the yearly sales trend of all time top 5 products by sales revenue.
First, lets try to identify the all time top 5 products by sales revenue.
top5products <- sqlQuery(con, "SELECT TOP 5 ProductID
FROM (
SELECT ProductID, ISNULL(Sum(Revenue),0) AS TotalRevenue
FROM bi.SalesFact
GROUP BY ProductID
) as t
ORDER BY TotalRevenue DESC")
top5products
## ProductID
## 1 487
## 2 826
## 3 438
## 4 506
## 5 475
We have got the top 5 Product IDs and Products names. To use the top 5 Product IDs as an argument inside the SQL query we need to convert it in to the format understood by the SQL syntax. The below code will help us do that.
do.call(paste, c(as.list(top5products$ProductID), sep=","))
## [1] "487,826,438,506,475"
Then, we will pull out the yearly sales revenue data for all these top 5 products. Please note from the below code that we replaced the ProductID
with Product
(name) using the join
statement in the query.
yoysales <- sqlQuery(con, paste("
SELECT p.Product, YEAR(sf.Date) as YR, SUM(sf.Revenue) AS YearRevenue
FROM bi.SalesFact as sf
JOIN bi.product as p
ON p.ProductID = sf.ProductID
WHERE sf.ProductID IN (", do.call(paste, c(as.list(top5products$ProductID), sep=",")), ")
GROUP BY sf.ProductID, p.Product, YEAR(sf.DATE)
ORDER BY sf.ProductID, p.Product, YEAR(sf.Date)
", sep = " "))
odbcClose(con) # Always remember to close connect
Finally we have the data frame we wanted and we can plot the yearly trend using plotly
.
library(plotly)
library(dplyr)
plot_ly(yoysales, x = ~YR, y = ~YearRevenue, color = ~Product) %>% add_lines(line = list(shape = "spline")) %>%
layout(title = "Yearly Sales Trend of All Time Top 5 Products",
xaxis = list(title = "Year"),
yaxis = list (title = "Revenue in USD"))
Conclusion
We are at the end of this long post. To summarize we have covered the following aspects:
- Access and explore the database hosted on a local SQL server
- Access and explore the database hosted on a cloud
- We also saw how we can do some exploratory data analysis and render in to beautiful charts
I hope you thoroughly enjoyed the post. Please reach out for any suggestions or query. I will be glad to help.
December 22, 2016
Aditya