The dataset was obtained from the UCI Machine Learning Repository. It contains 12,330 rows and 18 columns: each row represents an online shopping session, and the columns represent attributes of that session (e.g., pages visited, time spent).
Of all the attributes, 10 are numerical and 8 are categorical. The Revenue attribute represents whether a transaction was finalized (TRUE) or not (FALSE).
The attributes Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated and ProductRelated_Duration represent the number of pages of each type visited by the visitor in that session and the total time spent (in seconds) in each of these page categories.
BounceRates, ExitRates and PageValues are Google Analytics metrics. The SpecialDay column represents the closeness of the session date to a specific special day (e.g., Valentine's Day, New Year). Its value is computed separately for each special day but always ranges from 0 to 1, with 1 being the closest.
Month, Region, TrafficType and VisitorType are all categorical features indicating the month of the visit, the visitor's geographical region, the visitor's traffic source, and whether the visitor is a "returning", "new" or "other" visitor.
Classify the target variable Revenue for 30% of the dataset (the test data), applying Support Vector Machine (SVM) and Random Forest (RF) algorithms.
First, introduce NAs at random positions in the data using the prodNA function from the missForest package.
Standardize the numerical columns of the data via the Z-score method. Apply Factor Analysis of Mixed Data (FAMD) through the FAMD function from the FactoMineR package. FAMD performs Principal Component Analysis (PCA) on the numerical variables and Multiple Correspondence Analysis (MCA) on the categorical variables, which makes it well suited to mixed datasets. Analyze the FAMD results to select predictors that account for >85% of the variance in the data.
Using the svm function from the e1071 package, train a model on the training set and predict on the test set. Train randomForest (from the package of the same name) on the training data with unscaled features. Plot both models and evaluate the SVM and RF models with Cohen's kappa (κ).
Perform k-means clustering on the dataset to identify groups within the data.
This dataset has featured in prior research, notably in this paper, where the authors used the LIBSVM implementation with optimized hyperparameter values; a C4.5 decision tree and an LSTM-RNN were also employed.
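The code that follows relies on several packages; a setup chunk along these lines is assumed (the package-to-function mapping is inferred from the calls below):
# Load required packages
library(dplyr)        # glimpse
library(missForest)   # prodNA
library(Hmisc)        # hist.data.frame
library(mice)         # mice, complete
library(psych)        # pairs.panels, cohen.kappa
library(outliers)     # grubbs.test
library(FactoMineR)   # FAMD
library(factoextra)   # fviz_* and get_* helpers
library(superml)      # LabelEncoder
library(e1071)        # svm
library(randomForest) # randomForest
library(gmodels)      # CrossTable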
# Reading the csv file from url into a dataframe
shop <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv")
# Changing column names for convenience
colnames(shop) <- c("Admin", "Admin_time", "Info", "Info_time", "Prod", "Prod_time", "Bounce", "Exit",
"Page_value", "Special_day", "Month", "OS", "Browser", "Region", "Traffic_type",
"Visitor_type", "Weekend", "Revenue")
# Explore dataframe
glimpse(shop)
## Rows: 12,330
## Columns: 18
## $ Admin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ~
## $ Admin_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 53, 0, 0, 0,~
## $ Info <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Info_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Prod <int> 1, 2, 1, 2, 10, 19, 1, 0, 2, 3, 3, 16, 7, 6, 2, 23, 1, 13~
## $ Prod_time <dbl> 0.000000, 64.000000, 0.000000, 2.666667, 627.500000, 154.~
## $ Bounce <dbl> 0.200000000, 0.000000000, 0.200000000, 0.050000000, 0.020~
## $ Exit <dbl> 0.200000000, 0.100000000, 0.200000000, 0.140000000, 0.050~
## $ Page_value <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Special_day <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8, 0.4, 0.0, 0.~
## $ Month <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "~
## $ OS <int> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3, 1, 1, 1, 2, ~
## $ Browser <int> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2, 1, 1, 1, 2, ~
## $ Region <int> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3, 9, 4, 1, 1, ~
## $ Traffic_type <int> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3, 3, 3, 4, 3, ~
## $ Visitor_type <chr> "Returning_Visitor", "Returning_Visitor", "Returning_Visi~
## $ Weekend <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FAL~
## $ Revenue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F~
As we can see, the columns from Admin to Special_day are continuous in nature, and the others are categorical. Of the categorical columns, OS, Browser, Region and Traffic_type are label-encoded without any inherent order.
# Apply the prodNA function to introduce random missing values (~0.1% of cells) into the dataset
shop_na <- prodNA(shop, noNA = 0.001)
# Examine the data distribution for each column
summary(shop_na)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 1.000 Median : 7.312 Median : 0.000 Median : 0.00
## Mean : 2.316 Mean : 80.833 Mean : 0.504 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.169 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.750 Max. :24.000 Max. :2549.38
## NA's :11 NA's :10 NA's :10 NA's :9
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02518
## Mean : 31.72 Mean : 1193.3 Mean :0.022203 Mean :0.04307
## 3rd Qu.: 38.00 3rd Qu.: 1463.6 3rd Qu.:0.016813 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
## NA's :14 NA's :14 NA's :16 NA's :16
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.885 Mean :0.06146 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## NA's :10 NA's :14 NA's :13
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.146 Mean : 4.071
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## NA's :6 NA's :12 NA's :11
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9447 FALSE:10412
## TRUE :2862 TRUE :1907
## NA's :21 NA's :11
##
##
##
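With noNA = 0.001, roughly 0.1% of the 12,330 × 18 cells (about 222 values) should now be missing; a quick sanity check could be:
# Total number of NAs introduced across the data frame
sum(is.na(shop_na))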
hist.data.frame(shop_na[1:10], na.big = TRUE, rugs = TRUE)
# Imputing NAs through the Predictive Mean Matching method using the mice function
set.seed(1001)
imp <- mice(shop_na[1:10], method = "pmm", m=1)
##
## iter imp variable
## 1 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 2 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 3 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 4 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 5 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
shop_na[1:10] <- complete(imp)
# Check the distribution of data
hist.data.frame(shop_na, na.big = TRUE, rugs = TRUE)
summary(shop_na)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 1.000 Median : 7.50 Median : 0.0000 Median : 0.00
## Mean : 2.315 Mean : 80.83 Mean : 0.5037 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.38 3rd Qu.: 0.0000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.75 Max. :24.0000 Max. :2549.38
##
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02519
## Mean : 31.73 Mean : 1194.2 Mean :0.022196 Mean :0.04309
## 3rd Qu.: 38.00 3rd Qu.: 1464.2 3rd Qu.:0.016764 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
##
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.886 Mean :0.06139 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## NA's :13
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.146 Mean : 4.071
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## NA's :6 NA's :12 NA's :11
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9447 FALSE:10412
## TRUE :2862 TRUE :1907
## NA's :21 NA's :11
##
##
##
Predictive Mean Matching has substituted the NAs in the numerical columns. For the categorical variables, the mode of each variable will be used to substitute the NAs.
# Mode of a column
getmode <- function(x) {
uniq <- unique(x)
uniq[which.max(tabulate(match(x, uniq)))]
}
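A quick illustration of getmode on a toy vector (hypothetical values):
# Returns the most frequent value ("Feb" here)
getmode(c("Feb", "Mar", "Feb", "May"))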
# Applying getmode function for each categorical variable
mode.shop <- lapply(shop_na[11:18], getmode)
mode.shop <- as.data.frame(unclass(mode.shop))
# Imputing NAs from respective columns
shop_imp <- shop_na
for(i in 11:18){
num.row <- which(is.na(shop_na[i]))
shop_imp[num.row,i] <- mode.shop[i-10]
}
summary(shop_imp)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 1.000 Median : 7.50 Median : 0.0000 Median : 0.00
## Mean : 2.315 Mean : 80.83 Mean : 0.5037 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.38 3rd Qu.: 0.0000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.75 Max. :24.0000 Max. :2549.38
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02519
## Mean : 31.73 Mean : 1194.2 Mean :0.022196 Mean :0.04309
## 3rd Qu.: 38.00 3rd Qu.: 1464.2 3rd Qu.:0.016764 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.886 Mean :0.06139 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.144 Mean : 4.069
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9468 FALSE:10423
## TRUE :2862 TRUE :1907
##
##
##
# Collinearity of continuous variables with the target column
pairs.panels(cbind(shop_imp[1:10], shop_imp$Revenue))
# Outlier detection for continuous variables
hist.data.frame(shop_imp[1:10])
# Grubbs outlier test
grubbs.test(shop_imp$Exit, opposite = TRUE)
##
## Grubbs test for one outlier
##
## data: shop_imp$Exit
## G = 0.88667, U = 0.99994, p-value = 1
## alternative hypothesis: lowest value 0 is an outlier
Some columns, like Prod and Prod_time, show high correlation, suggesting collinearity between them. However, we will employ FAMD (PCA for numerical, MCA for categorical variables) for feature selection, which accounts for such redundancy.
The Exit column appears to have an outlier, but the Grubbs test (p-value = 1) does not provide enough confidence to confirm it. So we continue with the dataset as-is for further analysis.
# Apply z-score method to the numerical columns
num_var <- scale(shop_imp[1:10])
# Stitching the standardized numerical columns with the categorical ones
shop_norm <- cbind(num_var, shop_imp[11:18])
summary(shop_norm)
## Admin Admin_time Info Info_time
## Min. :-0.6971 Min. :-0.45722 Min. :-0.3966 Min. :-0.245
## 1st Qu.:-0.6971 1st Qu.:-0.45722 1st Qu.:-0.3966 1st Qu.:-0.245
## Median :-0.3960 Median :-0.41480 Median :-0.3966 Median :-0.245
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.: 0.5071 3rd Qu.: 0.07093 3rd Qu.:-0.3966 3rd Qu.:-0.245
## Max. : 7.4312 Max. :18.76694 Max. :18.4978 Max. :17.867
## Prod Prod_time Bounce Exit
## Min. :-0.7134 Min. :-0.6249 Min. :-0.4578 Min. :-0.8867
## 1st Qu.:-0.5560 1st Qu.:-0.5284 1st Qu.:-0.4578 1st Qu.:-0.5927
## Median :-0.3087 Median :-0.3115 Median :-0.3933 Median :-0.3683
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.1410 3rd Qu.: 0.1413 3rd Qu.:-0.1120 3rd Qu.: 0.1422
## Max. :15.1382 Max. :32.8519 Max. : 3.6670 Max. : 3.2289
## Page_value Special_day Month OS
## Min. :-0.3173 Min. :-0.3087 Length:12330 Min. :1.000
## 1st Qu.:-0.3173 1st Qu.:-0.3087 Class :character 1st Qu.:2.000
## Median :-0.3173 Median :-0.3087 Mode :character Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean :2.124
## 3rd Qu.:-0.3173 3rd Qu.:-0.3087 3rd Qu.:3.000
## Max. :19.1813 Max. : 4.7191 Max. :8.000
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.144 Mean : 4.069
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9468 FALSE:10423
## TRUE :2862 TRUE :1907
##
##
##
# Applying FAMD function to perform PCA and MCA of numerical and categorical columns, respectively
# On standardized dataset
res.famd_norm <- FAMD(shop_norm, ncp = 10, graph = FALSE)
print(res.famd_norm)
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues and inertia"
## 2 "$var" "Results for the variables"
## 3 "$ind" "results for the individuals"
## 4 "$quali.var" "Results for the qualitative variables"
## 5 "$quanti.var" "Results for the quantitative variables"
# Screeplot for the percentage of inertia explained by each dimension
fviz_screeplot(res.famd_norm)
The screeplot suggests using the first 6 dimensions; the remaining components do not account for a substantial increase in explained variance.
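Since the brief asks for predictors accounting for >85% of the variance, the eigenvalue table can also be inspected directly; a sketch using factoextra:
# Eigenvalues with the percentage and cumulative percentage of explained variance
get_eigenvalue(res.famd_norm)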
# Get the quality of representation and the contributions of each dimension
var <- get_famd_var(res.famd_norm)
ind <- get_famd_ind(res.famd_norm)
# Plot of variables
fviz_famd_var(res.famd_norm, repel = TRUE)
# Contribution of variables to the first 6 dimensions
fviz_contrib(res.famd_norm, "var", axes = 1)
fviz_contrib(res.famd_norm, "var", axes = 2)
fviz_contrib(res.famd_norm, "var", axes = 3)
fviz_contrib(res.famd_norm, "var", axes = 4)
fviz_contrib(res.famd_norm, "var", axes = 5)
fviz_contrib(res.famd_norm, "var", axes = 6)
# Plot individuals by label
fviz_ellipses(res.famd_norm, c("Bounce", "Visitor_type"), repel = TRUE)
Interpreting the contributions of variables to the first 6 dimensions, we will proceed with the following columns as predictors for the SVM analysis:
Prod, Prod_time, Admin, Info, Bounce, Exit, Page_value, Visitor_type, OS, Month and Special_day.
The last two dimensions may be driven largely by the individual variances of the variables depicted in them. Also, choosing both Prod and Prod_time as predictors may introduce some redundancy into the models, since these features have shown high correlation.
# Subset relevant columns for SVM
shop.svm <- shop_norm[ ,c("Prod", "Prod_time", "Admin", "Info", "Bounce", "Exit", "Page_value",
"Visitor_type", "OS", "Month", "Special_day", "Revenue")]
# Convert character columns to numeric codes
shop.svm$Visitor_type <- ifelse(shop.svm$Visitor_type == "Other", 0,
ifelse(shop.svm$Visitor_type == "New_Visitor", 1,2))
shop.svm$Revenue <- ifelse(shop.svm$Revenue == TRUE, 1,0)
# Encode month labels
lbl <- LabelEncoder$new()
shop.svm$Month <- lbl$fit_transform(shop.svm$Month)
# 70%/30% split into training and testing set
shop.svm_train <- shop.svm[1:8631,]
shop.svm_test <- shop.svm[8632:12330,]
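Note that the rows are ordered chronologically by month, so this sequential 70/30 split is not random. A sampled split is a common alternative (a sketch, not used for the results below):
# Random 70/30 split as an alternative to the sequential split above
set.seed(1001)
idx <- sample(nrow(shop.svm), size = floor(0.7 * nrow(shop.svm)))
shop.svm_train_rnd <- shop.svm[idx, ]
shop.svm_test_rnd <- shop.svm[-idx, ]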
# Apply svm function with linear kernel
model.svm <- svm(Revenue ~ ., shop.svm_train, kernel = "linear")
model.svm
##
## Call:
## svm(formula = Revenue ~ ., data = shop.svm_train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2192
# Apply svm function with radial basis kernel
model.svm_rbf <- svm(Revenue ~ ., shop.svm_train, kernel = "radial")
model.svm_rbf
##
## Call:
## svm(formula = Revenue ~ ., data = shop.svm_train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2219
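As both model summaries show, svm defaulted to eps-regression because Revenue is coded as numeric 0/1. An explicit classification fit is possible by passing a factor response (a sketch, not what the results below are based on):
# C-classification variant; predict() would then return class labels directly,
# removing the need for the 0.5 threshold used below
model.svm_cls <- svm(as.factor(Revenue) ~ ., shop.svm_train,
                     kernel = "radial", type = "C-classification")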
# Predicting for test set
# Linear kernel model
pred.svm <- predict(model.svm, shop.svm_test[-12])
pred.svm <- ifelse(pred.svm > 0.5, 1,0)
# Non-linear kernel model
pred.svm_rbf <- predict(model.svm_rbf, shop.svm_test[-12])
pred.svm_rbf <- ifelse(pred.svm_rbf > 0.5, 1,0)
# Cohen's kappa value and confusion matrix for linear SVM
cohen.kappa(cbind(shop.svm_test$Revenue, pred.svm))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.26 0.3 0.34
## weighted kappa 0.26 0.3 0.34
##
## Number of subjects = 3699
CrossTable(shop.svm_test$Revenue, pred.svm)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.svm
## shop.svm_test$Revenue | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 2899 | 55 | 2954 |
## | 6.284 | 93.059 | |
## | 0.981 | 0.019 | 0.799 |
## | 0.837 | 0.235 | |
## | 0.784 | 0.015 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 566 | 179 | 745 |
## | 24.919 | 368.987 | |
## | 0.760 | 0.240 | 0.201 |
## | 0.163 | 0.765 | |
## | 0.153 | 0.048 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 3465 | 234 | 3699 |
## | 0.937 | 0.063 | |
## ----------------------|-----------|-----------|-----------|
##
##
# Cohen's kappa value and confusion matrix for non-linear SVM
cohen.kappa(cbind(shop.svm_test$Revenue, pred.svm_rbf))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.34 0.38 0.41
## weighted kappa 0.34 0.38 0.41
##
## Number of subjects = 3699
CrossTable(shop.svm_test$Revenue, pred.svm_rbf)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.svm_rbf
## shop.svm_test$Revenue | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 2883 | 71 | 2954 |
## | 11.198 | 123.730 | |
## | 0.976 | 0.024 | 0.799 |
## | 0.850 | 0.231 | |
## | 0.779 | 0.019 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 509 | 236 | 745 |
## | 44.403 | 490.601 | |
## | 0.683 | 0.317 | 0.201 |
## | 0.150 | 0.769 | |
## | 0.138 | 0.064 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 3392 | 307 | 3699 |
## | 0.917 | 0.083 | |
## ----------------------|-----------|-----------|-----------|
##
##
The linear kernel gives an accuracy of 83%, with a precision and recall of 76% and 24%, respectively. The kappa value is 0.3.
The non-linear RBF kernel produces an accuracy of 84%, with precision and recall of 77% and 32%, respectively. The kappa value is 0.38.
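For reference, these metrics can be recomputed directly from the confusion matrix; a sketch for the RBF model:
# Accuracy, precision and recall from the 2x2 confusion matrix
cm <- table(actual = shop.svm_test$Revenue, predicted = pred.svm_rbf)
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2, 2] / sum(cm[, 2]) # TP / (TP + FP)
recall <- cm[2, 2] / sum(cm[2, ])    # TP / (TP + FN)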
# Splitting the non-standardized dataset into training and testing set
shop.rf_train <- shop_imp[1:8631,]
shop.rf_test <- shop_imp[8632:12330,]
# Train the RF
model.rf <- randomForest(Revenue ~ ., shop.rf_train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
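The warning appears because Revenue is logical, so randomForest fits a regression forest. Converting the response to a factor would give a proper classification forest (a sketch, not used for the results below):
# Classification forest; predict() would then return factor labels directly
model.rf_cls <- randomForest(as.factor(Revenue) ~ ., data = shop.rf_train)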
# Predict for test set
pred.rf <- predict(model.rf, shop.rf_test[-18], type = "response")
pred.rf <- ifelse(pred.rf>0.5, 1,0)
# Confusion matrix for RF
CrossTable(shop.rf_test$Revenue, pred.rf)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.rf
## shop.rf_test$Revenue | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## FALSE | 2884 | 70 | 2954 |
## | 12.759 | 134.724 | |
## | 0.976 | 0.024 | 0.799 |
## | 0.854 | 0.219 | |
## | 0.780 | 0.019 | |
## ---------------------|-----------|-----------|-----------|
## TRUE | 495 | 250 | 745 |
## | 50.590 | 534.196 | |
## | 0.664 | 0.336 | 0.201 |
## | 0.146 | 0.781 | |
## | 0.134 | 0.068 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 3379 | 320 | 3699 |
## | 0.913 | 0.087 | |
## ---------------------|-----------|-----------|-----------|
##
##
# Cohen's kappa value for RF
cohen.kappa(cbind(shop.rf_test$Revenue, pred.rf))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.36 0.4 0.43
## weighted kappa 0.36 0.4 0.43
##
## Number of subjects = 3699
The default Random Forest algorithm has produced an accuracy of 85%, with precision and recall of 78% and 34%, respectively. The kappa value is 0.40.
# Copying the SVM dataset since it is free of character columns and already standardized
shop.cl <- shop.svm
# Dissimilarity matrix for the standardized data
d <- dist(shop.cl, method = "manhattan")
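Note that kmeans works on the raw coordinates with Euclidean distance and cannot consume a precomputed dissimilarity matrix, so d is not actually used by the kmeans call below. If Manhattan-distance clustering is desired, PAM (k-medoids) is one option (a sketch assuming the cluster package; pam on ~12k rows is memory-intensive):
# PAM accepts a precomputed dissimilarity object such as d
library(cluster)
model.pam <- pam(d, k = 2)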
# Applying the kmeans function
model.cl <- kmeans(shop.cl, centers = 2, algorithm = c("Forgy"), iter.max = 20)
model.cl$centers
## Prod Prod_time Admin Info Bounce Exit
## 1 0.1627774 0.1468311 0.08204478 0.06857329 -0.05956048 -0.08423854
## 2 -0.1565108 -0.1411784 -0.07888620 -0.06593333 0.05726751 0.08099551
## Page_value Visitor_type OS Month Special_day Revenue
## 1 0.04765796 1.808570 2.153375 7.425381 -0.3086808 0.1988749
## 2 -0.04582322 1.887687 2.095768 1.580815 0.2967972 0.1121540
# Plotting clusters
fviz_cluster(model.cl, shop.cl[, -12], palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point", ellipse = TRUE, ellipse.type = "convex", ggtheme = theme_bw())
The k-means algorithm has differentiated two clusters. The first cluster represents visitors whose sessions concentrate around special days. The second cluster represents visitors who visit the pages throughout the year, irrespective of the date or any special day, tend to stay longer on the product pages, and spend around twice as much time as the cluster 1 representatives.
We can say that the cluster 2 visitors are the regular shoppers, while the cluster 1 visitors can be termed Special Day Shoppers.
Since, among the SVM models, the non-linear RBF kernel has the higher recall and kappa value, I have chosen the RBF SVM and RF algorithms for the ensemble method.
# SVM features
ft.svm <- c("Prod", "Prod_time", "Admin", "Info", "Bounce", "Exit", "Page_value",
            "Visitor_type", "OS", "Month", "Special_day")
# Function for the ensemble model
# x.svm: the standardized/encoded feature set (as in shop.svm_test)
# x.rf: the untransformed feature set (as in shop.rf_test)
ensmbl <- function(x.svm, x.rf){
  p1 <- predict(model.svm_rbf, x.svm[, ft.svm])
  p1 <- ifelse(p1 > 0.5, 1, 0)
  p2 <- predict(model.rf, x.rf, type = "response")
  p2 <- ifelse(p2 > 0.5, 1, 0)
  # Classify as TRUE (1) if either model predicts TRUE
  result <- ifelse(p2 == 1, 1,
                   ifelse(p1 == 1, 1, 0))
  return(result)
}
This ensemble model favors a TRUE (1) classification from the RF model over any other.
When the RF classifies a sample as FALSE (0), the ensemble checks the RBF SVM's result; if that is TRUE (1), the final classification is TRUE (1), and FALSE (0) otherwise.
The idea is that it is better to misclassify non-buyers as potential buyers than to misclassify potential buyers as non-buyers. Hence, a False Positive is more welcome than a False Negative.
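A sketch of evaluating the ensemble on the held-out 30%, using the two aligned test sets (both preserve the original row order):
# Ensemble predictions and evaluation
pred.ens <- ensmbl(shop.svm_test, shop.rf_test)
CrossTable(shop.svm_test$Revenue, pred.ens)
cohen.kappa(cbind(shop.svm_test$Revenue, pred.ens))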