To make appropriate use of your data, it is important to understand the underlying patterns and structure within it. You can gain insights by exploring the data, either by plotting it or by looking for prevalent patterns.
Unsupervised learning is often an important part of the exploratory data analysis step.
As defined by James, Witten, Hastie and Tibshirani in An Introduction to Statistical Learning: "With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data."
The analysis tends to be more subjective than in supervised learning. There is no standard way to evaluate the results using cross-validation or independent datasets – no true answer is known!
Paradigmatic applications of unsupervised learning include dimensionality reduction and clustering.
In certain types of data there is an imbalance between the number of features k and the number of observations or samples n. In domains like gene expression or image datasets (e.g. MNIST), the number of features is much larger than the number of samples, k >> n.
Most supervised machine learning algorithms do not cope well with this dimensionality problem. Collecting more samples may help to alleviate it, but is often impractical. It is therefore difficult to obtain robust statistical inference results.
Dimensionality Reduction (DR) is the process of reducing the number of input features under consideration, focusing on a set of principal features.
These methods are very useful for data visualization. Focusing on the 2 or 3 most important dimensions/components may allow us to visually detect patterns, such as clusters or groups.
DR algorithms rely on the assumption, often observed in practice in real-world high-dimensional datasets, that the data lie close to a much lower dimensional manifold. A manifold is a lower-dimensional shape that can be bent or twisted in a higher dimensional space; locally it resembles a k-dimensional hyperplane.
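To make the manifold idea concrete, here is a small illustrative sketch (synthetic data, not part of the iris analysis below): 200 three-dimensional points that in fact lie close to a one-dimensional curve, a noisy helix.
set.seed(1)
t <- seq(0, 4*pi, length.out = 200)
helix <- cbind(cos(t), sin(t), t/(4*pi)) + matrix(rnorm(600, sd = 0.05), ncol = 3)
pairs(helix, labels = c("x", "y", "z"))  # a 3-D cloud that hugs a 1-D manifold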
DR methods can generally be divided into two types: linear (such as PCA) and non-linear (such as t-SNE and UMAP).
In this tutorial, I will show how to apply four of the most widely used methods for DR: PCA, MDS, t-SNE and UMAP.
PCA is the most widely used algorithm for dimensionality reduction. It finds a set of orthogonal directions, the principal components, along which the variance of the data is largest, and projects the data onto the first few of these components.
It can make use of Singular Value Decomposition (SVD), a standard matrix factorization, to find the principal components.
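As a minimal sketch (using R's built-in copy of the iris measurements, since the CSV used in this tutorial is only loaded further below), the principal component scores can be obtained directly from the SVD of the centred data matrix:
X <- scale(as.matrix(datasets::iris[, 1:4]), center = TRUE, scale = FALSE)  # centre the data
s <- svd(X)                  # X = U D V'
scores <- s$u %*% diag(s$d)  # equivalent to X %*% s$v
head(scores[, 1:2])          # first two principal components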
MDS maps the distances between objects into k dimensions. The distances can be pre-calculated or derived from a correlation matrix. It takes the dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities. MDS can be metric or non-metric: the metric version tries to reproduce the original distances, while the non-metric version uses only the ranks of the distances.
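A minimal sketch of the metric (classical) flavour, using base R's cmdscale() on Euclidean distances between the built-in iris measurements; the non-metric version is demonstrated later with isoMDS():
d <- dist(datasets::iris[, 1:4], method = "euclidean")
mds_metric <- cmdscale(d, k = 2)  # 2-dimensional configuration
head(mds_metric)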
t-SNE is a non-linear DR technique. The method tries to keep similar instances close to each other and dissimilar ones apart. It focuses on retaining the structure of neighboring points and is widely used for visualizing high-dimensional spaces.
UMAP (Uniform Manifold Approximation and Projection) is a non-linear DR technique. It is very effective for visualizing clusters but can also be applied as a preprocessing step for prediction. It scales well (e.g. compared with t-SNE), can be applied directly to sparse matrices, and tends to preserve both the local and global structure of the data.
To demonstrate the application of these four methods we will use the iris dataset, widely used in ML for demonstration purposes. It consists of measurements in centimetres of four variables and one class. The variables are sepal length, sepal width, petal length and petal width.
The class corresponds to the species: Iris setosa, versicolor and virginica. The dataset contains 150 observations (4 input attributes; 3 classes).
Loading and Preparing the data.
iris <- read.csv("/Users/test/Dropbox/Meetings_Workshops_Courses/EUGHLOH/code/tutorial_code/iris.csv") # adjust this path to your local copy of iris.csv
head(iris)
## sepal_length sepal_width petal_length petal_width class
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
dim(iris)
## [1] 150 5
colnames(iris)
## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "class"
Rows in the dataset are ordered by the class attribute. Let's permute the dataset row-wise to shuffle the data.
set.seed(1) # make the shuffle reproducible
iris2 <- iris[sample(nrow(iris)),]
Create a visualization for each pair of attributes.
The upper panel of the figure shows the correlation values. To produce it, define the panel.cor function as in the following code chunk (panel.smooth, used for the lower panel, is already provided by base R):
# panel function for pairs(): prints the absolute correlation between x and y,
# scaled so that stronger correlations appear larger
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
par(mar = c(4, 4, 0.1, 0.1))
pairs(iris[,-5], lower.panel = panel.smooth, upper.panel = panel.cor)
Create a distance matrix for all pairwise combinations of objects; as.matrix() forces the conversion of the resulting dist object to a regular data matrix. Two distance measures are used for demonstration purposes. Note that these distances are already pre-defined in the R language.
iris.euclidean.dist.full = as.matrix(dist(iris2[,c(1,2,3,4)], method = "euclidean", diag = TRUE, upper = TRUE))
iris.canberra.dist.full = as.matrix(dist(iris2[,c(1,2,3,4)], method = "canberra", diag = TRUE, upper = TRUE))
image(iris.euclidean.dist.full)
image(iris.canberra.dist.full)
At this point the two images above do not make much sense, as the samples are all mixed in the matrix. We can rearrange them according to their distances by using hierarchical clustering and plotting the result as a heatmap. The colors of the heatmap and the dendrogram on the rows start to show several groups of samples with similar profiles. In another post we will look at clustering samples and go through the clustering process in more detail.
For now, what we can expect is some sort of clustering of part of the samples. We will see that this is also confirmed by the dimensionality reduction methods.
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#heatmap.2(iris.euclidean.dist.full,col=redblue(30),trace="none", scale="none",key=TRUE, keysize=0.8, symkey=TRUE, labCol = NULL)
heatmap.2(iris.euclidean.dist.full,col=redblue(30),trace="none", scale="none", labCol = rep("",150), labRow =iris2[,"class"], cexRow = 0.2, key=FALSE)
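To make the row ordering of the heatmap explicit, here is a minimal sketch that clusters the samples directly with hclust() on the same Euclidean distance matrix and plots the dendrogram (complete linkage is an arbitrary choice here; heatmap.2 applies its own defaults):
hc <- hclust(as.dist(iris.euclidean.dist.full), method = "complete")
plot(hc, labels = iris2$class, cex = 0.3, main = "Hierarchical clustering of iris samples")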
Remove duplicate entries before running t-SNE, otherwise an error will be generated.
uiris = unique(iris)
colors = c("red","green","blue")
names(colors) = unique(uiris$class)
#PCA
#princomp performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp
pca = princomp(uiris[,1:4])$scores[,1:2]
plot(pca, t='n', main="PCA", "cex.main"=1, "cex.lab"=1)
points(pca, col=colors[uiris$class], pch=19, cex=0.6)
legend("bottomright", legend=names(colors), col=colors,pch=19, bty = "n")
#MDS
library(MASS)
d.s <- as.dist(1-cor(t(uiris[,1:4]), method="pearson"))
im <- isoMDS(d.s, k=2) # reduce to 2 dimensions
## initial value 7.508006
## iter 5 value 4.909392
## iter 10 value 3.186345
## iter 15 value 1.912468
## iter 20 value 1.487507
## iter 25 value 0.922168
## iter 30 value 0.826509
## iter 35 value 0.776087
## iter 40 value 0.591153
## iter 45 value 0.547403
## iter 50 value 0.370706
## final value 0.370706
## stopped after 50 iterations
plot(im$points, main ="MDS", type="n",xlab="dimension 1",ylab="dimension 2", "cex.main"=1, "cex.lab"=1)
points(im$points, col= colors[uiris$class], pch=19, cex=0.6)
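The quality of the non-metric fit can be judged from the stress value returned by isoMDS (lower values indicate a better fit):
im$stress  # stress of the final 2-D configuration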
#tSNE
library(Rtsne)
set.seed(1) # for reproducibility
tsne <- Rtsne(uiris[,1:4], dims = 2, perplexity=5, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 147 x 4 data matrix successfully!
## Using no_dims = 2, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.00 seconds (sparsity = 0.130224)!
## Learning embedding...
## Iteration 50: error is 65.840745 (50 iterations in 0.01 seconds)
## Iteration 100: error is 63.291579 (50 iterations in 0.01 seconds)
## Iteration 150: error is 59.059998 (50 iterations in 0.01 seconds)
## Iteration 200: error is 60.236560 (50 iterations in 0.01 seconds)
## Iteration 250: error is 61.012784 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.938745 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.841795 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.449615 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.419955 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.405385 (50 iterations in 0.01 seconds)
## Fitting performed in 0.10 seconds.
plot(tsne$Y, t='n', main="tSNE", xlab="dimension 1", ylab="dimension 2", "cex.main"=1, "cex.lab"=1)
points(tsne$Y, col=colors[uiris$class], pch=19, cex=0.6)
#legend("topleft", legend=names(colors), col=colors,pch=19, bty = "n")
library(umap)
iris_umap <- umap(iris[,1:4], init = "naive")
## Warning: failed creating initial embedding; using random embedding instead
plot(iris_umap$layout, t='n', main="UMAP", "cex.main"=1, "cex.lab"=1, xlab="dimension 1", ylab="dimension 2")
points(iris_umap$layout, col=colors[iris$class], pch=19, cex=0.6)
legend("bottomright", legend=names(colors), col=colors,pch=19, bty = "n")
Overall, these results show that the three classes separate clearly, with Iris-setosa well apart from the other two and some samples borderline between virginica and versicolor.
Document generated with RStudio, R Markdown and knitr