Introduction

It is very important to access SQL database from RStudio to carry on with the down stream analysis. Below is what you can expect from this post:

  1. Accessing SQL server hosted on your local computer

  2. Accessing SQL server hosted on a cloud, Azure Cloud in this case

We will be using R’s RODBC package for doing this.

This post assumes that you already have a local SQL server running on your computer and know credentials to access Azure SQL server database.

So let’s get started.

1. Accessing SQL server hosted on your local computer

A. The first step to connect to a local SQL server is to create a data source. You can see the path to reach that point from the image below.

B. The click on “Add” to select a driver for the SQL Server. In this case we choose SQL Server and click on Finish.

C. Next we have to provide a name to the data source, a description and the server we want to connect to. Description is optional. After providing these details click Next.

D. In the next step we go ahead with Wndows NT authentication and click Next

E. Finally, change the default database to “AdventureWorksLT2012” and click Next till Finish

F. Thats it, we can see below that we have successfully created a new data source with the name “DemoData1”

G. Now we switch over to the RStudio and install and load the RODBC package. Then create a odbcConnect object with the data source as the input

library(RODBC)

# Create a connection object with data source as the input
con <- odbcConnect("DemoData1")

H. We can take a look at all the tables in the database or under a specific schema using sqlTables function

tables <- sqlTables(con, schema = "SalesLT")
tables$TABLE_NAME
##  [1] "Address"                         "Customer"                       
##  [3] "CustomerAddress"                 "Product"                        
##  [5] "ProductCategory"                 "ProductDescription"             
##  [7] "ProductModel"                    "ProductModelProductDescription" 
##  [9] "SalesOrderDetail"                "SalesOrderHeader"               
## [11] "vGetAllCategories"               "vProductAndDescription"         
## [13] "vProductModelCatalogDescription"

I. From here on, we can infact start accessing each of the tables and tranfering tables as data frame in to R environment. Lets try it out.

df <- sqlQuery(con, 
               "
               SELECT *
               FROM SalesLT.ProductCategory
               ")

J. We can take a look at all columns in a particular table

sqlColumns(con, "SalesLT.Product")$COLUMN_NAME
##  [1] "ProductID"              "Name"                  
##  [3] "ProductNumber"          "Color"                 
##  [5] "StandardCost"           "ListPrice"             
##  [7] "Size"                   "Weight"                
##  [9] "ProductCategoryID"      "ProductModelID"        
## [11] "SellStartDate"          "SellEndDate"           
## [13] "DiscontinuedDate"       "ThumbNailPhoto"        
## [15] "ThumbnailPhotoFileName" "rowguid"               
## [17] "ModifiedDate"

K. We can write more complicated queries in RStudio to get the data in required format. For example in the the below example we can pull out aggregate product sales by joining 2 tables.

df <- sqlQuery(con, 
               "
               SELECT p.Name, sum(sod.LineTotal) AS Revenue
               FROM SalesLT.Product AS p
               JOIN SalesLT.SalesOrderDetail AS sod
               ON p.ProductID = sod.ProductID
               GROUP BY p.Name
               ORDER BY Revenue DESC
               ")

odbcClose(con) # Always remember to close connect

L. Once we have the data in R environment, we can perform various summary statistics and visualizations. Below is a bar plot using ggplot package of top 10 products.Of course the possibilities are limitless!

library(ggplot2)
library(dplyr)
library(plotly)
df <- df[1:10,]
df$Name <-  factor(df$Name, levels = df$Name[order(df$Revenue, decreasing = TRUE)])
ggplotly(df %>% ggplot(aes(x = Name, y = Revenue, fill = Name)) + geom_bar(stat = "identity") + labs(title = "Top 10 Products by Sales Revenue") + theme_bw() + theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())) 

Now let’s move on to accessing cloud hosted SQL server from RStudio

2. Accessing SQL server hosted on a cloud

To access a SQL database which is hosted on a cloud we need the server name, database name, user id and password. The steps will be quite similar if you followed the experiment above.

A. First create a connection string which consists of Driver, Server, Database, User Id and Password. Then pass the connection string to `odbcDriverConnect to create a connection to the server

connectionString <- "Driver=SQL Server;Server={Servern name here};Database={Database name here};Uid={User Id here};Pwd={Password here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
con <- odbcDriverConnect(connectionString)

B. Once the connection is established, we can view the tables in the database.

tables <- sqlTables(con, schema = "bi")
tables$TABLE_NAME
## [1] "date"         "geo"          "manufacturer" "product"     
## [5] "salesFact"    "sentiment"

C. Now we can explore various tables and pull out data using SQL queries for further analysis in R.

Let’s look at the columns in product and salesFact tables

sqlColumns(con, "bi.product")$COLUMN_NAME
## [1] "ProductID"      "Product"        "Category"       "Segment"       
## [5] "ManufacturerID" "Manufacturer"
sqlColumns(con, "bi.SalesFact")$COLUMN_NAME
## [1] "ProductID" "Date"      "Zip"       "Units"     "Revenue"

Now, let’s say we want to look at the yearly sales trend of all time top 5 products by sales revenue.

First, lets try to identify the all time top 5 products by sales revenue.

top5products <- sqlQuery(con, "SELECT TOP 5 ProductID 
                                  FROM (
                                        SELECT ProductID, ISNULL(Sum(Revenue),0) AS TotalRevenue
                                        FROM bi.SalesFact
                                        GROUP BY ProductID
                                        ) as t
                                  ORDER BY TotalRevenue DESC")
top5products
##   ProductID
## 1       487
## 2       826
## 3       438
## 4       506
## 5       475

We have got the top 5 Product IDs and Products names. To use the top 5 Product IDs as an argument inside the SQL query we need to convert it in to the format understood by the SQL syntax. The below code will help us do that.

do.call(paste, c(as.list(top5products$ProductID), sep=","))
## [1] "487,826,438,506,475"

Then, we will pull out the yearly sales revenue data for all these top 5 products. Please note from the below code that we replaced the ProductID with Product (name) using the join statement in the query.

yoysales <- sqlQuery(con, paste("
                                SELECT p.Product, YEAR(sf.Date) as YR, SUM(sf.Revenue) AS YearRevenue
                                   FROM bi.SalesFact as sf
                                   JOIN bi.product as p
                                   ON p.ProductID = sf.ProductID
                                   WHERE sf.ProductID IN (", do.call(paste, c(as.list(top5products$ProductID), sep=",")), ") 
                                   GROUP BY sf.ProductID, p.Product, YEAR(sf.DATE)
                                   ORDER BY sf.ProductID, p.Product, YEAR(sf.Date)
                                   ", sep = " "))
odbcClose(con) # Always remember to close connect

Finally we have the data frame we wanted and we can plot the yearly trend using plotly.

library(plotly)
library(dplyr)
plot_ly(yoysales, x = ~YR, y = ~YearRevenue, color = ~Product) %>% add_lines(line = list(shape = "spline")) %>% 
        layout(title = "Yearly Sales Trend of All Time Top 5 Products",
         xaxis = list(title = "Year"),
         yaxis = list (title = "Revenue in USD"))

Conclusion

We are at the end of this long post. To summarize we have covered the following aspects:

  1. Access and explore the database hosted on a local SQL server
  2. Access and explore the database hosted on a cloud
  3. We also saw how we can do some exploratory data analysis and render in to beautiful charts

I hope you thoroughly enjoyed the post. Please reach out for any suggestions or query. I will be glad to help.

December 22, 2016

Aditya