To make appropriate use of your data, it is important to understand the underlying patterns and structure within it. You can gain insights by exploring the data, either by plotting it or by looking for prevalent patterns.
Unsupervised learning is often an important part of the exploratory data analysis step.
As defined by James, Witten, Hastie and Tibshirani in An Introduction to Statistical Learning: "With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data."
The analysis tends to be more subjective than in supervised learning. There is no standard way to evaluate the results using cross-validation or independent datasets – no true answer is known!
Paradigmatic applications of unsupervised learning include dimensionality reduction and clustering.
In certain types of data there is an imbalance between the number of features k and the number of observations or samples n. In domains like gene expression or image datasets (e.g. MNIST), the number of features is much larger than the number of samples, k >> n.
Most supervised machine learning algorithms do not cope well with this dimensionality problem. Collecting more samples may help to alleviate it, but is often impractical. It is therefore difficult to obtain robust statistical inference results.
Dimensionality Reduction (DR) is the process of reducing the number of input features under consideration, focusing on a set of principal features.
These methods are very useful for data visualization. Focusing on the 2 or 3 most important dimensions/components may allow us to visually detect patterns, such as clusters or groups.
DR algorithms rely on the assumption, often observed in practice in real-world high-dimensional datasets, that the data lie close to a much lower dimensional manifold. A manifold is a lower-dimensional shape that can be bent or twisted in a higher dimensional space; locally it resembles a k-dimensional hyperplane.
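To make the manifold idea concrete, here is a small illustrative sketch (synthetic data, not part of the iris analysis below): 200 three-dimensional points that in fact lie close to a one-dimensional curve, a noisy helix.
set.seed(1)
t <- seq(0, 4*pi, length.out = 200)
helix <- cbind(cos(t), sin(t), t/(4*pi)) + matrix(rnorm(600, sd = 0.05), ncol = 3)
pairs(helix, labels = c("x", "y", "z"))  # a 3-D cloud that hugs a 1-D manifold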
DR methods can generally be divided into two types: linear (such as PCA) and non-linear (such as t-SNE and UMAP).
In this tutorial, I will show how to apply four of the most widely used methods for DR: PCA, MDS, t-SNE and UMAP.
PCA is the most widely used algorithm for dimensionality reduction. It finds a set of orthogonal directions, the principal components, along which the variance of the data is largest, and projects the data onto the first few of these components.
It can make use of Singular Value Decomposition (SVD), a standard matrix factorization, to find the principal components.
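As a minimal sketch (using R's built-in copy of the iris measurements, since the CSV used in this tutorial is only loaded further below), the principal component scores can be obtained directly from the SVD of the centred data matrix:
X <- scale(as.matrix(datasets::iris[, 1:4]), center = TRUE, scale = FALSE)  # centre the data
s <- svd(X)                  # X = U D V'
scores <- s$u %*% diag(s$d)  # equivalent to X %*% s$v
head(scores[, 1:2])          # first two principal components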
MDS maps the distances between objects into k dimensions. The distances can be pre-calculated or derived from a correlation matrix. It takes the dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. MDS can be metric or non-metric: the metric version tries to reproduce the original distances, while the non-metric version uses only the ranks of the distances.
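A minimal sketch of the metric (classical) flavour, using base R's cmdscale() on Euclidean distances between the built-in iris measurements; the non-metric version is demonstrated later with isoMDS():
d <- dist(datasets::iris[, 1:4], method = "euclidean")
mds_metric <- cmdscale(d, k = 2)  # 2-dimensional configuration
head(mds_metric)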
t-SNE is a non-linear DR technique. The method tries to keep similar instances close to each other and dissimilar ones apart. It focuses on retaining the structure of neighboring points and is widely used for visualizing high-dimensional spaces.
UMAP (Uniform Manifold Approximation and Projection) is a non-linear DR technique. It is very effective for visualizing clusters but can also be applied as a preprocessing step for prediction. It scales well (e.g. compared with t-SNE), can be applied directly to sparse matrices, and tends to preserve both the local and global structure of the data.
To demonstrate the application of these four methods we will use the iris dataset, widely used in ML for demonstration purposes. It consists of measurements in centimetres of four variables and one class. The variables are sepal length, sepal width, petal length and petal width.
The class corresponds to the species: Iris setosa, versicolor and virginica. The dataset contains 150 observations (4 input attributes; 3 classes).
Loading and Preparing the data.
iris <- read.csv("/Users/test/Dropbox/Meetings_Workshops_Courses/EUGHLOH/code/tutorial_code/iris.csv") # adjust this path to your local copy of iris.csv
head(iris)
## sepal_length sepal_width petal_length petal_width class
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
dim(iris)
## [1] 150 5
colnames(iris)
## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "class"
Rows in the dataset are ordered by the class attribute. Let's permute the dataset row-wise to shuffle the data.
set.seed(1) # make the shuffle reproducible
iris2 <- iris[sample(nrow(iris)),]
Create a visualization for each pair of attributes.
The upper panel of the figure shows the correlation values. To produce it, define the panel.cor function as in the following code chunk (panel.smooth, used for the lower panel, is already provided by base R):
# panel function for pairs(): prints the absolute correlation between x and y,
# scaled so that stronger correlations appear larger
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
par(mar = c(4, 4, 0.1, 0.1))
pairs(iris[,-5], lower.panel = panel.smooth, upper.panel = panel.cor)
Create a distance matrix for all pairwise combinations of objects; as.matrix() forces the conversion of the resulting dist object to a regular data matrix. Two distance measures are used for demonstration purposes. Note that these distances are already pre-defined in the R language.
iris.euclidean.dist.full = as.matrix(dist(iris2[,c(1,2,3,4)], method = "euclidean", diag = TRUE, upper = TRUE))
iris.canberra.dist.full = as.matrix(dist(iris2[,c(1,2,3,4)], method = "canberra", diag = TRUE, upper = TRUE))
image(iris.euclidean.dist.full)
image(iris.canberra.dist.full)
At this point the two images above do not make much sense, as the samples are all mixed in the matrix. We can rearrange them according to their distances by using hierarchical clustering and plotting the result as a heatmap. The colors of the heatmap and the dendrogram on the rows start to show several groups of samples with similar profiles. In another post we will look at clustering samples and go through the clustering process in more detail.
For now, what we can expect is some sort of clustering of part of the samples. We will see that this is also confirmed by the dimensionality reduction methods.
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#heatmap.2(iris.euclidean.dist.full,col=redblue(30),trace="none", scale="none",key=TRUE, keysize=0.8, symkey=TRUE, labCol = NULL)
heatmap.2(iris.euclidean.dist.full,col=redblue(30),trace="none", scale="none", labCol = rep("",150), labRow =iris2[,"class"], cexRow = 0.2, key=FALSE)
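To make the row ordering of the heatmap explicit, here is a minimal sketch that clusters the samples directly with hclust() on the same Euclidean distance matrix and plots the dendrogram (complete linkage is an arbitrary choice here; heatmap.2 applies its own defaults):
hc <- hclust(as.dist(iris.euclidean.dist.full), method = "complete")
plot(hc, labels = iris2$class, cex = 0.3, main = "Hierarchical clustering of iris samples")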
Remove duplicate entries before running t-SNE, otherwise an error will be generated.
uiris = unique(iris)
colors = c("red","green","blue")
names(colors) = unique(uiris$class)
#PCA
#princomp performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp
pca = princomp(uiris[,1:4])$scores[,1:2]
plot(pca, t='n', main="PCA", "cex.main"=1, "cex.lab"=1)
points(pca, col=colors[uiris$class], pch=19, cex=0.6)
legend("bottomright", legend=names(colors), col=colors,pch=19, bty = "n")
#MDS
library(MASS)
d.s <- as.dist(1-cor(t(uiris[,1:4]), method="pearson"))
im <- isoMDS(d.s, k=2) # reduce to 2 dimensions
## initial value 7.508006
## iter 5 value 4.909392
## iter 10 value 3.186345
## iter 15 value 1.912468
## iter 20 value 1.487507
## iter 25 value 0.922168
## iter 30 value 0.826509
## iter 35 value 0.776087
## iter 40 value 0.591153
## iter 45 value 0.547403
## iter 50 value 0.370706
## final value 0.370706
## stopped after 50 iterations
plot(im$points, main ="MDS", type="n",xlab="dimension 1",ylab="dimension 2", "cex.main"=1, "cex.lab"=1)
points(im$points, col= colors[uiris$class], pch=19, cex=0.6)
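The quality of the non-metric fit can be judged from the stress value returned by isoMDS (lower values indicate a better fit):
im$stress  # stress of the final 2-D configuration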
#tSNE
library(Rtsne)
set.seed(1) # for reproducibility
tsne <- Rtsne(uiris[,1:4], dims = 2, perplexity=5, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 147 x 4 data matrix successfully!
## Using no_dims = 2, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.00 seconds (sparsity = 0.130224)!
## Learning embedding...
## Iteration 50: error is 65.840745 (50 iterations in 0.01 seconds)
## Iteration 100: error is 63.291579 (50 iterations in 0.01 seconds)
## Iteration 150: error is 59.059998 (50 iterations in 0.01 seconds)
## Iteration 200: error is 60.236560 (50 iterations in 0.01 seconds)
## Iteration 250: error is 61.012784 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.938745 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.841795 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.449615 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.419955 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.405385 (50 iterations in 0.01 seconds)
## Fitting performed in 0.10 seconds.
plot(tsne$Y, t='n', main="tSNE", xlab="dimension 1", ylab="dimension 2", "cex.main"=1, "cex.lab"=1)
points(tsne$Y, col=colors[uiris$class], pch=19, cex=0.6)
#legend("topleft", legend=names(colors), col=colors,pch=19, bty = "n")
library(umap)
iris_umap <- umap(iris[,1:4], init = "naive")
## Warning: failed creating initial embedding; using random embedding instead
plot(iris_umap$layout, t='n', main="UMAP", "cex.main"=1, "cex.lab"=1, xlab="dimension 1", ylab="dimension 2")
points(iris_umap$layout, col=colors[iris$class], pch=19, cex=0.6)
legend("bottomright", legend=names(colors), col=colors,pch=19, bty = "n")
Overall, these results show that the three classes separate clearly, with Iris-setosa well apart from the other two and some samples borderline between virginica and versicolor.
Document generated with RStudio, R Markdown and knitr