The dataset was obtained from the UCI Machine Learning Repository. It contains 12,330 rows and 18 columns: each row represents an online shopping session, and the columns represent attributes of that session (e.g., pages visited, time spent).
Of all the attributes, 10 are numerical and 8 are categorical. The Revenue attribute represents whether a transaction was finalized (TRUE) or not (FALSE).
The attributes Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated and ProductRelated_Duration represent the number of pages of each type visited by the visitor in that session and the total time spent (in seconds) in each of these page categories.
BounceRates, ExitRates and PageValues are Google Analytics metrics. The SpecialDay column represents the closeness of the session date to a specific special day (e.g., Valentine's Day, New Year). Its value is computed separately for each special day but always ranges from 0 to 1, with 1 being the closest.
Month, Region, TrafficType and VisitorType are all categorical features indicating the month of the visit, the visitor's geographical region, the visitor's traffic source, and whether the visitor is a "returning", "new" or "other" visitor.
Classify the target variable Revenue for 30% of the dataset (the test data), applying Support Vector Machine (SVM) and Random Forest (RF) algorithms.
First, introduce NAs at random positions in the data using the prodNA function from the missForest package.
Standardize the numerical columns of the data via the Z-score method. Apply Factor Analysis of Mixed Data (FAMD) through the FAMD function from the FactoMineR package. FAMD performs Principal Component Analysis (PCA) on the numerical variables and Multiple Correspondence Analysis (MCA) on the categorical variables, which makes it well suited to mixed datasets. Analyze the FAMD results to select predictors that account for >85% of the variance in the data.
Using the svm function from the e1071 package, train a model on the training set and predict on the test set. Train randomForest (from the package of the same name) on the training data with unscaled features. Plot both models and evaluate the SVM and RF models with Cohen's kappa (κ).
Perform k-means clustering on the dataset to identify groups within the data.
This dataset has featured in prior research, notably in this paper, where the authors used the LIBSVM implementation with optimized hyperparameter values; a C4.5 decision tree and an LSTM-RNN were also employed.
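The code that follows relies on several packages; a setup chunk along these lines is assumed (the package-to-function mapping is inferred from the calls below):
# Load required packages
library(dplyr)        # glimpse
library(missForest)   # prodNA
library(Hmisc)        # hist.data.frame
library(mice)         # mice, complete
library(psych)        # pairs.panels, cohen.kappa
library(outliers)     # grubbs.test
library(FactoMineR)   # FAMD
library(factoextra)   # fviz_* and get_* helpers
library(superml)      # LabelEncoder
library(e1071)        # svm
library(randomForest) # randomForest
library(gmodels)      # CrossTable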
# Reading the csv file from url into a dataframe
shop <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv")
# Changing column names for convenience
colnames(shop) <- c("Admin", "Admin_time", "Info", "Info_time", "Prod", "Prod_time", "Bounce", "Exit",
"Page_value", "Special_day", "Month", "OS", "Browser", "Region", "Traffic_type",
"Visitor_type", "Weekend", "Revenue")
# Explore dataframe
glimpse(shop)
## Rows: 12,330
## Columns: 18
## $ Admin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ~
## $ Admin_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 53, 0, 0, 0,~
## $ Info <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Info_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Prod <int> 1, 2, 1, 2, 10, 19, 1, 0, 2, 3, 3, 16, 7, 6, 2, 23, 1, 13~
## $ Prod_time <dbl> 0.000000, 64.000000, 0.000000, 2.666667, 627.500000, 154.~
## $ Bounce <dbl> 0.200000000, 0.000000000, 0.200000000, 0.050000000, 0.020~
## $ Exit <dbl> 0.200000000, 0.100000000, 0.200000000, 0.140000000, 0.050~
## $ Page_value <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Special_day <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8, 0.4, 0.0, 0.~
## $ Month <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "~
## $ OS <int> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3, 1, 1, 1, 2, ~
## $ Browser <int> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2, 1, 1, 1, 2, ~
## $ Region <int> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3, 9, 4, 1, 1, ~
## $ Traffic_type <int> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3, 3, 3, 4, 3, ~
## $ Visitor_type <chr> "Returning_Visitor", "Returning_Visitor", "Returning_Visi~
## $ Weekend <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FAL~
## $ Revenue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F~
As we can see, the columns from Admin to Special_day are continuous in nature, and the others are categorical. Of the categorical columns, OS, Browser, Region and Traffic_type are label-encoded without any inherent order.
# Apply the prodNA function to introduce random missing values (~0.1% of cells) into the dataset
shop_na <- prodNA(shop, noNA = 0.001)
# Examine the data distribution for each column
summary(shop_na)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 1.000 Median : 7.312 Median : 0.000 Median : 0.00
## Mean : 2.316 Mean : 80.833 Mean : 0.504 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.169 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.750 Max. :24.000 Max. :2549.38
## NA's :11 NA's :10 NA's :10 NA's :9
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02518
## Mean : 31.72 Mean : 1193.3 Mean :0.022203 Mean :0.04307
## 3rd Qu.: 38.00 3rd Qu.: 1463.6 3rd Qu.:0.016813 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
## NA's :14 NA's :14 NA's :16 NA's :16
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.885 Mean :0.06146 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## NA's :10 NA's :14 NA's :13
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.146 Mean : 4.071
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## NA's :6 NA's :12 NA's :11
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9447 FALSE:10412
## TRUE :2862 TRUE :1907
## NA's :21 NA's :11
##
##
##
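With noNA = 0.001, roughly 0.1% of the 12,330 × 18 cells (about 222 values) should now be missing; a quick sanity check could be:
# Total number of NAs introduced across the data frame
sum(is.na(shop_na))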
hist.data.frame(shop_na[1:10], na.big = TRUE, rugs = TRUE)
# Imputing NAs through the Predictive Mean Matching method using the mice function
set.seed(1001)
imp <- mice(shop_na[1:10], method = "pmm", m=1)
##
## iter imp variable
## 1 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 2 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 3 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 4 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
## 5 1 Admin Admin_time Info Info_time Prod Prod_time Bounce Exit Page_value Special_day
shop_na[1:10] <- complete(imp)
# Check the distribution of data
hist.data.frame(shop_na, na.big = TRUE, rugs = TRUE)
summary(shop_na)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 1.000 Median : 7.50 Median : 0.0000 Median : 0.00
## Mean : 2.315 Mean : 80.83 Mean : 0.5037 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.38 3rd Qu.: 0.0000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.75 Max. :24.0000 Max. :2549.38
##
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02519
## Mean : 31.73 Mean : 1194.2 Mean :0.022196 Mean :0.04309
## 3rd Qu.: 38.00 3rd Qu.: 1464.2 3rd Qu.:0.016764 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
##
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.886 Mean :0.06139 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## NA's :13
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.146 Mean : 4.071
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## NA's :6 NA's :12 NA's :11
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9447 FALSE:10412
## TRUE :2862 TRUE :1907
## NA's :21 NA's :11
##
##
##
Predictive Mean Matching has substituted the NAs in the numerical columns. For the categorical variables, the mode of each variable will be used to substitute the NAs.
# Mode of a column
getmode <- function(x) {
uniq <- unique(x)
uniq[which.max(tabulate(match(x, uniq)))]
}
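A quick illustration of getmode on a toy vector (hypothetical values):
# Returns the most frequent value ("Feb" here)
getmode(c("Feb", "Mar", "Feb", "May"))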
# Applying getmode function for each categorical variable
mode.shop <- lapply(shop_na[11:18], getmode)
mode.shop <- as.data.frame(unclass(mode.shop))
# Imputing NAs from respective columns
shop_imp <- shop_na
for(i in 11:18){
num.row <- which(is.na(shop_na[i]))
shop_imp[num.row,i] <- mode.shop[i-10]
}
summary(shop_imp)
## Admin Admin_time Info Info_time
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 1.000 Median : 7.50 Median : 0.0000 Median : 0.00
## Mean : 2.315 Mean : 80.83 Mean : 0.5037 Mean : 34.49
## 3rd Qu.: 4.000 3rd Qu.: 93.38 3rd Qu.: 0.0000 3rd Qu.: 0.00
## Max. :27.000 Max. :3398.75 Max. :24.0000 Max. :2549.38
## Prod Prod_time Bounce Exit
## Min. : 0.00 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.: 7.00 1st Qu.: 184.3 1st Qu.:0.000000 1st Qu.:0.01429
## Median : 18.00 Median : 598.8 Median :0.003125 Median :0.02519
## Mean : 31.73 Mean : 1194.2 Mean :0.022196 Mean :0.04309
## 3rd Qu.: 38.00 3rd Qu.: 1464.2 3rd Qu.:0.016764 3rd Qu.:0.05000
## Max. :705.00 Max. :63973.5 Max. :0.200000 Max. :0.20000
## Page_value Special_day Month OS
## Min. : 0.000 Min. :0.00000 Length:12330 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:0.00000 Class :character 1st Qu.:2.000
## Median : 0.000 Median :0.00000 Mode :character Median :2.000
## Mean : 5.886 Mean :0.06139 Mean :2.124
## 3rd Qu.: 0.000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :361.764 Max. :1.00000 Max. :8.000
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.144 Mean : 4.069
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9468 FALSE:10423
## TRUE :2862 TRUE :1907
##
##
##
# Collinearity of continuous variables with the target column
pairs.panels(cbind(shop_imp[1:10], shop_imp$Revenue))
# Outlier detection for continuous variables
hist.data.frame(shop_imp[1:10])
# Grubbs outlier test
grubbs.test(shop_imp$Exit, opposite = TRUE)
##
## Grubbs test for one outlier
##
## data: shop_imp$Exit
## G = 0.88667, U = 0.99994, p-value = 1
## alternative hypothesis: lowest value 0 is an outlier
Some columns, like Prod and Prod_time, show high correlation, suggesting collinearity between them. However, we will employ FAMD (PCA for numerical, MCA for categorical variables) for feature selection, which accounts for such redundancy.
The Exit column appears to have an outlier, but the Grubbs test (p-value = 1) does not provide enough confidence to confirm it. So we continue with the dataset as-is for further analysis.
# Apply z-score method to the numerical columns
num_var <- scale(shop_imp[1:10])
# Stitching the standardized numerical columns with the categorical ones
shop_norm <- cbind(num_var, shop_imp[11:18])
summary(shop_norm)
## Admin Admin_time Info Info_time
## Min. :-0.6971 Min. :-0.45722 Min. :-0.3966 Min. :-0.245
## 1st Qu.:-0.6971 1st Qu.:-0.45722 1st Qu.:-0.3966 1st Qu.:-0.245
## Median :-0.3960 Median :-0.41480 Median :-0.3966 Median :-0.245
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.: 0.5071 3rd Qu.: 0.07093 3rd Qu.:-0.3966 3rd Qu.:-0.245
## Max. : 7.4312 Max. :18.76694 Max. :18.4978 Max. :17.867
## Prod Prod_time Bounce Exit
## Min. :-0.7134 Min. :-0.6249 Min. :-0.4578 Min. :-0.8867
## 1st Qu.:-0.5560 1st Qu.:-0.5284 1st Qu.:-0.4578 1st Qu.:-0.5927
## Median :-0.3087 Median :-0.3115 Median :-0.3933 Median :-0.3683
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.1410 3rd Qu.: 0.1413 3rd Qu.:-0.1120 3rd Qu.: 0.1422
## Max. :15.1382 Max. :32.8519 Max. : 3.6670 Max. : 3.2289
## Page_value Special_day Month OS
## Min. :-0.3173 Min. :-0.3087 Length:12330 Min. :1.000
## 1st Qu.:-0.3173 1st Qu.:-0.3087 Class :character 1st Qu.:2.000
## Median :-0.3173 Median :-0.3087 Mode :character Median :2.000
## Mean : 0.0000 Mean : 0.0000 Mean :2.124
## 3rd Qu.:-0.3173 3rd Qu.:-0.3087 3rd Qu.:3.000
## Max. :19.1813 Max. : 4.7191 Max. :8.000
## Browser Region Traffic_type Visitor_type
## Min. : 1.000 Min. :1.000 Min. : 1.000 Length:12330
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 2.000 Class :character
## Median : 2.000 Median :3.000 Median : 2.000 Mode :character
## Mean : 2.357 Mean :3.144 Mean : 4.069
## 3rd Qu.: 2.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :13.000 Max. :9.000 Max. :20.000
## Weekend Revenue
## Mode :logical Mode :logical
## FALSE:9468 FALSE:10423
## TRUE :2862 TRUE :1907
##
##
##
# Applying FAMD function to perform PCA and MCA of numerical and categorical columns, respectively
# On standardized dataset
res.famd_norm <- FAMD(shop_norm, ncp = 10, graph = FALSE)
print(res.famd_norm)
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues and inertia"
## 2 "$var" "Results for the variables"
## 3 "$ind" "results for the individuals"
## 4 "$quali.var" "Results for the qualitative variables"
## 5 "$quanti.var" "Results for the quantitative variables"
# Screeplot for the percentage of inertia explained by each dimension
fviz_screeplot(res.famd_norm)
The screeplot suggests using the first 6 dimensions; the remaining components do not account for a substantial increase in explained variance.
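Since the brief asks for predictors accounting for >85% of the variance, the eigenvalue table can also be inspected directly; a sketch using factoextra:
# Eigenvalues with the percentage and cumulative percentage of explained variance
get_eigenvalue(res.famd_norm)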
# Get the quality of representation and the contributions of each dimension
var <- get_famd_var(res.famd_norm)
ind <- get_famd_ind(res.famd_norm)
# Plot of variables
fviz_famd_var(res.famd_norm, repel = TRUE)
# Contribution of variables to the first 6 dimensions
fviz_contrib(res.famd_norm, "var", axes = 1)
fviz_contrib(res.famd_norm, "var", axes = 2)
fviz_contrib(res.famd_norm, "var", axes = 3)
fviz_contrib(res.famd_norm, "var", axes = 4)
fviz_contrib(res.famd_norm, "var", axes = 5)
fviz_contrib(res.famd_norm, "var", axes = 6)
# Plot individuals by label
fviz_ellipses(res.famd_norm, c("Bounce", "Visitor_type"), repel = TRUE)
Interpreting the contributions of variables to the first 6 dimensions, we will proceed with the following columns as predictors for the SVM analysis:
Prod, Prod_time, Admin, Info, Bounce, Exit, Page_value, Visitor_type, OS, Month and Special_day.
The last two dimensions may be driven largely by the individual variances of the variables depicted in them. Also, choosing both Prod and Prod_time as predictors may introduce some redundancy into the models, since these features have shown high correlation.
# Subset relevant columns for SVM
shop.svm <- shop_norm[ ,c("Prod", "Prod_time", "Admin", "Info", "Bounce", "Exit", "Page_value",
"Visitor_type", "OS", "Month", "Special_day", "Revenue")]
# Convert character columns to numeric codes
shop.svm$Visitor_type <- ifelse(shop.svm$Visitor_type == "Other", 0,
ifelse(shop.svm$Visitor_type == "New_Visitor", 1,2))
shop.svm$Revenue <- ifelse(shop.svm$Revenue == TRUE, 1,0)
# Encode month labels
lbl <- LabelEncoder$new()
shop.svm$Month <- lbl$fit_transform(shop.svm$Month)
# 70%/30% split into training and testing set
shop.svm_train <- shop.svm[1:8631,]
shop.svm_test <- shop.svm[8632:12330,]
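Note that the rows are ordered chronologically by month, so this sequential 70/30 split is not random. A sampled split is a common alternative (a sketch, not used for the results below):
# Random 70/30 split as an alternative to the sequential split above
set.seed(1001)
idx <- sample(nrow(shop.svm), size = floor(0.7 * nrow(shop.svm)))
shop.svm_train_rnd <- shop.svm[idx, ]
shop.svm_test_rnd <- shop.svm[-idx, ]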
# Apply svm function with linear kernel
model.svm <- svm(Revenue ~ ., shop.svm_train, kernel = "linear")
model.svm
##
## Call:
## svm(formula = Revenue ~ ., data = shop.svm_train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2192
# Apply svm function with radial basis kernel
model.svm_rbf <- svm(Revenue ~ ., shop.svm_train, kernel = "radial")
model.svm_rbf
##
## Call:
## svm(formula = Revenue ~ ., data = shop.svm_train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
## epsilon: 0.1
##
##
## Number of Support Vectors: 2219
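As both model summaries show, svm defaulted to eps-regression because Revenue is coded as numeric 0/1. An explicit classification fit is possible by passing a factor response (a sketch, not what the results below are based on):
# C-classification variant; predict() would then return class labels directly,
# removing the need for the 0.5 threshold used below
model.svm_cls <- svm(as.factor(Revenue) ~ ., shop.svm_train,
                     kernel = "radial", type = "C-classification")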
# Predicting for test set
# Linear kernel model
pred.svm <- predict(model.svm, shop.svm_test[-12])
pred.svm <- ifelse(pred.svm > 0.5, 1,0)
# Non-linear kernel model
pred.svm_rbf <- predict(model.svm_rbf, shop.svm_test[-12])
pred.svm_rbf <- ifelse(pred.svm_rbf > 0.5, 1,0)
# Cohen's kappa value and confusion matrix for linear SVM
cohen.kappa(cbind(shop.svm_test$Revenue, pred.svm))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.26 0.3 0.34
## weighted kappa 0.26 0.3 0.34
##
## Number of subjects = 3699
CrossTable(shop.svm_test$Revenue, pred.svm)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.svm
## shop.svm_test$Revenue | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 2899 | 55 | 2954 |
## | 6.284 | 93.059 | |
## | 0.981 | 0.019 | 0.799 |
## | 0.837 | 0.235 | |
## | 0.784 | 0.015 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 566 | 179 | 745 |
## | 24.919 | 368.987 | |
## | 0.760 | 0.240 | 0.201 |
## | 0.163 | 0.765 | |
## | 0.153 | 0.048 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 3465 | 234 | 3699 |
## | 0.937 | 0.063 | |
## ----------------------|-----------|-----------|-----------|
##
##
# Cohen's kappa value and confusion matrix for non-linear SVM
cohen.kappa(cbind(shop.svm_test$Revenue, pred.svm_rbf))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.34 0.38 0.41
## weighted kappa 0.34 0.38 0.41
##
## Number of subjects = 3699
CrossTable(shop.svm_test$Revenue, pred.svm_rbf)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.svm_rbf
## shop.svm_test$Revenue | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 2883 | 71 | 2954 |
## | 11.198 | 123.730 | |
## | 0.976 | 0.024 | 0.799 |
## | 0.850 | 0.231 | |
## | 0.779 | 0.019 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 509 | 236 | 745 |
## | 44.403 | 490.601 | |
## | 0.683 | 0.317 | 0.201 |
## | 0.150 | 0.769 | |
## | 0.138 | 0.064 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 3392 | 307 | 3699 |
## | 0.917 | 0.083 | |
## ----------------------|-----------|-----------|-----------|
##
##
The linear kernel gives an accuracy of 83%, with a precision and recall of 76% and 24%, respectively. The kappa value is 0.3.
The non-linear RBF kernel produces an accuracy of 84%, with precision and recall of 77% and 32%, respectively. The kappa value is 0.38.
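For reference, these metrics can be recomputed directly from the confusion matrix; a sketch for the RBF model:
# Accuracy, precision and recall from the 2x2 confusion matrix
cm <- table(actual = shop.svm_test$Revenue, predicted = pred.svm_rbf)
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2, 2] / sum(cm[, 2]) # TP / (TP + FP)
recall <- cm[2, 2] / sum(cm[2, ])    # TP / (TP + FN)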
# Splitting the non-standardized dataset into training and testing set
shop.rf_train <- shop_imp[1:8631,]
shop.rf_test <- shop_imp[8632:12330,]
# Train the RF
model.rf <- randomForest(Revenue ~ ., shop.rf_train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
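The warning appears because Revenue is logical, so randomForest fits a regression forest. Converting the response to a factor would give a proper classification forest (a sketch, not used for the results below):
# Classification forest; predict() would then return factor labels directly
model.rf_cls <- randomForest(as.factor(Revenue) ~ ., data = shop.rf_train)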
# Predict for test set
pred.rf <- predict(model.rf, shop.rf_test[-18], type = "response")
pred.rf <- ifelse(pred.rf>0.5, 1,0)
# Confusion matrix for RF
CrossTable(shop.rf_test$Revenue, pred.rf)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3699
##
##
## | pred.rf
## shop.rf_test$Revenue | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## FALSE | 2884 | 70 | 2954 |
## | 12.759 | 134.724 | |
## | 0.976 | 0.024 | 0.799 |
## | 0.854 | 0.219 | |
## | 0.780 | 0.019 | |
## ---------------------|-----------|-----------|-----------|
## TRUE | 495 | 250 | 745 |
## | 50.590 | 534.196 | |
## | 0.664 | 0.336 | 0.201 |
## | 0.146 | 0.781 | |
## | 0.134 | 0.068 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 3379 | 320 | 3699 |
## | 0.913 | 0.087 | |
## ---------------------|-----------|-----------|-----------|
##
##
# Cohen's kappa value for RF
cohen.kappa(cbind(shop.rf_test$Revenue, pred.rf))
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.36 0.4 0.43
## weighted kappa 0.36 0.4 0.43
##
## Number of subjects = 3699
The default Random Forest algorithm has produced an accuracy of 85%, with precision and recall of 78% and 34%, respectively. The kappa value is 0.40.
# Copying the SVM dataset since it is free of character columns and already standardized
shop.cl <- shop.svm
# Dissimilarity matrix for the standardized data
d <- dist(shop.cl, method = "manhattan")
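Note that kmeans works on the raw coordinates with Euclidean distance and cannot consume a precomputed dissimilarity matrix, so d is not actually used by the kmeans call below. If Manhattan-distance clustering is desired, PAM (k-medoids) is one option (a sketch assuming the cluster package; pam on ~12k rows is memory-intensive):
# PAM accepts a precomputed dissimilarity object such as d
library(cluster)
model.pam <- pam(d, k = 2)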
# Applying the kmeans function
model.cl <- kmeans(shop.cl, centers = 2, algorithm = c("Forgy"), iter.max = 20)
model.cl$centers
## Prod Prod_time Admin Info Bounce Exit
## 1 0.1627774 0.1468311 0.08204478 0.06857329 -0.05956048 -0.08423854
## 2 -0.1565108 -0.1411784 -0.07888620 -0.06593333 0.05726751 0.08099551
## Page_value Visitor_type OS Month Special_day Revenue
## 1 0.04765796 1.808570 2.153375 7.425381 -0.3086808 0.1988749
## 2 -0.04582322 1.887687 2.095768 1.580815 0.2967972 0.1121540
# Plotting clusters
fviz_cluster(model.cl, shop.cl[, -12], palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point", ellipse = TRUE, ellipse.type = "convex", ggtheme = theme_bw())
The k-means algorithm has differentiated two clusters. The first cluster represents visitors whose sessions concentrate around special days. The second cluster represents visitors who visit the pages throughout the year, irrespective of the date or any special day, tend to stay longer on the product pages, and spend around twice as much time as the cluster 1 representatives.
We can say that the cluster 2 visitors are the regular shoppers, while the cluster 1 visitors can be termed Special Day Shoppers.
Since, among the SVM models, the non-linear RBF kernel has the higher recall and kappa value, I have chosen the RBF SVM and RF algorithms for the ensemble method.
# SVM features
ft.svm <- c("Prod", "Prod_time", "Admin", "Info", "Bounce", "Exit", "Page_value",
            "Visitor_type", "OS", "Month", "Special_day")
# Function for the ensemble model
# x.svm: the standardized/encoded feature set (as in shop.svm_test)
# x.rf: the untransformed feature set (as in shop.rf_test)
ensmbl <- function(x.svm, x.rf){
  p1 <- predict(model.svm_rbf, x.svm[, ft.svm])
  p1 <- ifelse(p1 > 0.5, 1, 0)
  p2 <- predict(model.rf, x.rf, type = "response")
  p2 <- ifelse(p2 > 0.5, 1, 0)
  # Classify as TRUE (1) if either model predicts TRUE
  result <- ifelse(p2 == 1, 1,
                   ifelse(p1 == 1, 1, 0))
  return(result)
}
This ensemble model favors a TRUE (1) classification from the RF model over any other.
When the RF classifies a sample as FALSE (0), the ensemble checks the RBF SVM's result; if that is TRUE (1), the final classification is TRUE (1), and FALSE (0) otherwise.
The idea is that it is better to misclassify non-buyers as potential buyers than to misclassify potential buyers as non-buyers. Hence, a False Positive is more welcome than a False Negative.
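A sketch of evaluating the ensemble on the held-out 30%, using the two aligned test sets (both preserve the original row order):
# Ensemble predictions and evaluation
pred.ens <- ensmbl(shop.svm_test, shop.rf_test)
CrossTable(shop.svm_test$Revenue, pred.ens)
cohen.kappa(cbind(shop.svm_test$Revenue, pred.ens))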