Exploring Dimensionality Reduction with Python
A Comparison of PCA, t-SNE and UMAP on the Palmer Penguins Dataset
Modern datasets typically have numerous columns, making them difficult to process and visualize. Machine learning models also tend to perform poorly on such datasets, due to the curse of dimensionality, a phenomenon that emerges in high-dimensional feature spaces [1]. Dimensionality reduction is a machine learning technique that aims to decrease the number of dataset columns, while retaining most of the valuable information. It can improve the accuracy of machine learning models, while reducing the computational resources and time required to train them. There are several techniques for dimensionality reduction, including Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). In this article, we are going to compare these techniques, by utilizing the scikit-learn and umap-learn Python libraries, as well as the Palmer Penguins dataset. Let’s begin!
The Palmer Penguins Dataset
The Palmer Penguins dataset is widely used to teach various statistical concepts, and it is freely available to everyone. It was introduced as an alternative to the established Iris dataset, which is considered obsolete and inadequate by some researchers [2]. The dataset classes are Chinstrap, Gentoo and Adelie, the penguin species living in the Palmer Archipelago of Antarctica. The features include numeric and categorical variables, such as bill length, body mass, sex and island. The data was collected by Dr. Kristen Gorman at Palmer Station, which is part of the United States Long-Term Ecological Research (US LTER) network. Real-world data is usually messier than Palmer Penguins, but working with toy datasets allows beginners to become acquainted with statistics and machine learning, and get a grasp of the associated techniques. In the rest of this article, we are going to explore the Palmer Penguins dataset, and apply various dimensionality reduction techniques to it.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
mpl.rcParams['figure.dpi'] = 300
plt.style.use('seaborn-v0_8-muted')
We begin by importing all the necessary Python libraries for our project, such as pandas, Matplotlib and Seaborn. We also import the PCA and t-SNE classes of scikit-learn, as well as the umap-learn library, which will be used in the dimensionality reduction examples of the following sections. Finally, we set the figure DPI to 300 for high-resolution plots and change the Matplotlib style sheet, but those settings are strictly optional.
df = sns.load_dataset('penguins').dropna(thresh = 6)
df.head()
As mentioned previously, this article focuses on the Palmer Penguins dataset, which is bundled with the Seaborn library, so we load it into a pandas dataframe with the load_dataset() function. We also drop the rows with several missing values, by passing thresh=6 to dropna(), which keeps only the rows with at least six non-missing values. Afterwards, we use the pandas head() function to display the first rows, thus getting a better understanding of the data. As we can see, there are four numeric features describing the physical characteristics of each penguin, such as bill length and bill depth. In addition, there are some categorical features as well, including species, sex and island.
sns.pairplot(df, hue = 'species')
plt.show()
Next, we are going to visualize the dataset, with the purpose of applying exploratory data analysis and extracting some useful insights. This is accomplished with pairplot(), a Seaborn function that creates a scatterplot matrix with KDE plots on the diagonal. We also use the hue parameter to highlight the differences between the penguin species. The plot above contains a wealth of information, so we will focus only on the most important aspects. First of all, there are numerous linear relationships between variable pairs, such as body mass and flipper length. Notably, there are some occurrences of Simpson's paradox [3], where two variables appear to be negatively correlated, but the relationship is reversed once a third, categorical variable is taken into account. For example, bill depth and flipper length have a Pearson correlation coefficient of r = −0.58, suggesting an inverse linear relationship between them. Nevertheless, a visual inspection of the plot shows that the two variables are positively related within each individual species, as the short sketch below confirms. Simpson's paradox highlights why analysts should examine their data carefully, instead of relying on aggregate statistics alone.
Principal Component Analysis
Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction in machine learning and data analysis [4]. PCA transforms the original high-dimensional data into a lower-dimensional space, while preserving most of the variance. This is accomplished by applying a linear orthogonal transformation to a new coordinate system, where the first principal components explain most of the data variance. PCA helps us effectively reduce dimensionality, resulting in a dataset that is easier to visualize and interpret. The short sketch below illustrates the underlying idea, and in the rest of this section we are going to apply PCA to the Penguins dataset, by utilizing the scikit-learn library.
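Before using the scikit-learn implementation, here is a minimal, self-contained sketch of the idea described above, on synthetic toy data rather than the penguins: center the data, compute the covariance matrix, and project the observations onto its leading eigenvectors, which are the principal components.
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(size = (100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated toy data

centered = toy - toy.mean(axis = 0)              # remove the mean
cov = np.cov(centered, rowvar = False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                # sort components by explained variance
projected = centered @ eigvecs[:, order]         # orthogonal change of coordinates
explained = eigvals[order] / eigvals.sum()
print(explained.round(3))                        # the first component carries most of the variance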
cols_num = ['bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g']
cols_cat = ['species', 'island', 'sex']
scaler = StandardScaler()
X = df[cols_num]
X = scaler.fit_transform(X)
y = df['species'].astype('category').cat.codes
After loading the dataset into a pandas dataframe, we also create the X and y variables, i.e. the features and the class label. We convert the class to the category data type, which provides both text and numeric representations of categorical variables. Furthermore, we use the StandardScaler() class, which standardizes the features by removing the mean and scaling to unit variance. Scaling is necessary because PCA is sensitive to the variance of each variable, so features measured on larger scales would otherwise dominate the principal components.
pca = PCA(n_components = 2)
components = pca.fit_transform(X)
explained_var = pca.explained_variance_ratio_
print(f'''The explained variance of PC1 is {explained_var[0]:.2%}
The explained variance of PC2 is {explained_var[1]:.2%}
The total explained variance is {explained_var.sum():.2%}''')
The explained variance of PC1 is 68.84%
The explained variance of PC2 is 19.31%
The total explained variance is 88.16%
We create a PCA model with two principal components, by utilizing the PCA() class of the scikit-learn library. Afterwards, we print the explained_variance_ratio_ attribute, to check the explained variance of each principal component. As we can see, the two components explain almost 90% of the total variance, resulting in an accurate representation of the original dataset.
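As an optional follow-up, not part of the original walkthrough, we can also inspect the loadings stored in the components_ attribute, to see how much each original feature contributes to the two principal components.
loadings = pd.DataFrame(pca.components_.T, index = cols_num,
                        columns = ['pc_1', 'pc_2'])
print(loadings.round(2))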
df_ = df.join(pd.DataFrame(components, index = df.index,
columns = ['pc_1', 'pc_2']))
fig, ax = plt.subplots(figsize = (8,5))
ax.grid(False)
ax.set_frame_on(False)
sns.scatterplot(data=df_, x='pc_1', y='pc_2', hue='species', ax=ax)
plt.show()
We add the principal components to the original pandas dataframe, by using the join() function, and then plot them with the scatterplot() Seaborn function. Evidently, the Gentoo class is clearly separated from the rest, while there is some overlap between Adelie and Chinstrap, indicating their similarities. Adding more principal components would retain even more of the dataset variance, and the resulting components could then be used as features for a classification model, as sketched below.
t-Distributed Stochastic Neighbor Embedding
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear machine learning algorithm that is commonly used for dimensionality reduction and data visualization [5]. This algorithm is particularly useful for embedding high-dimensional data into fewer dimensions, with the most common choice being a 2D space. t-SNE is an iterative algorithm that starts by computing the similarity between instances and then maps these similarities to a low-dimensional space, while preserving their structure. Two hyperparameters significantly affect t-SNE: the number of iterations and the perplexity, which roughly corresponds to the number of nearest neighbors considered for each point. Compared to PCA, t-SNE is better at preserving local relationships between data points and visualizing complex patterns, while PCA is more suitable for retaining the global dataset variance.
tsne = TSNE(n_components = 2, perplexity = 50)
components = tsne.fit_transform(X)
df_ = df.join(pd.DataFrame(components, index = df.index,
columns = ['pc_1', 'pc_2']))
fig, ax = plt.subplots(figsize = (8,5))
ax.grid(False)
ax.set_frame_on(False)
sns.scatterplot(data=df_, x='pc_1', y='pc_2', hue='species', ax=ax)
plt.show()
We create a t-SNE model with two components, by utilizing the scikit-learn TSNE() class, and plot the result with the scatterplot() Seaborn function. As we can see, the Palmer penguin species are separated into clearly defined clusters, with Gentoo being less similar to Adelie and Chinstrap. As mentioned previously, the perplexity and the number of iterations can affect the t-SNE algorithm drastically, so it is useful to experiment with different values and check the results, as in the sketch below.
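For instance, the following sketch, with arbitrarily chosen perplexity values, fits t-SNE several times and plots the resulting embeddings side by side, so that the effect of the hyperparameter can be compared visually.
perplexities = [5, 30, 50]  # arbitrary example values
fig, axes = plt.subplots(1, len(perplexities), figsize = (15, 5))
for ax, perp in zip(axes, perplexities):
    embedding = TSNE(n_components = 2, perplexity = perp).fit_transform(X)
    sns.scatterplot(x = embedding[:, 0], y = embedding[:, 1],
                    hue = df['species'].to_numpy(), ax = ax)
    ax.set_title(f'perplexity = {perp}')
plt.show()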
Uniform Manifold Approximation and Projection
Uniform Manifold Approximation and Projection (UMAP) is a relatively new dimensionality reduction technique that has gained popularity in recent years, and is based on Riemannian geometry and algebraic topology [6]. UMAP is a non-linear method that seeks to preserve the data structure in a lower-dimensional space, making it similar to t-SNE. However, according to its creators, UMAP is designed to be more efficient and scalable, making it suitable for larger datasets. UMAP constructs a high-dimensional graph representation of the data and then projects it onto a low-dimensional space, by applying a series of optimization steps. This results in a visual representation of the data that is both accurate and interpretable. In comparison to PCA and t-SNE, UMAP offers a good balance of accuracy, efficiency, and scalability, making it a popular choice for dimensionality reduction in machine learning and data analysis.
umap_model = umap.UMAP(n_components = 2, n_neighbors = 10,
random_state = 42)
components = umap_model.fit_transform(X)
df_ = df.join(pd.DataFrame(components, index = df.index,
columns = ['pc_1', 'pc_2']))
fig, ax = plt.subplots(figsize = (8,5))
ax.grid(False)
ax.set_frame_on(False)
sns.scatterplot(data=df_, x='pc_1', y='pc_2', hue='species', ax=ax)
plt.show()
The UMAP model is not included in scikit-learn, so we have to install the umap-learn library separately, by using pip or conda. The UMAP() class helps us create a UMAP model that is fitted on our dataset, and is configured with various hyperparameters. In this particular case, we set n_neighbors to 10, and specified the random_state parameter to ensure reproducibility. By examining the scatter plot, we can see that UMAP captured the local structure successfully. Similarly to t-SNE, it can be useful to experiment with different hyperparameter values and compare the results, as in the sketch below.
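As an example, the sketch below, with arbitrarily chosen values, sweeps the n_neighbors hyperparameter, which controls the balance between local and global structure in the embedding.
neighbor_values = [5, 15, 50]  # arbitrary example values
fig, axes = plt.subplots(1, len(neighbor_values), figsize = (15, 5))
for ax, n in zip(axes, neighbor_values):
    embedding = umap.UMAP(n_components = 2, n_neighbors = n,
                          random_state = 42).fit_transform(X)
    sns.scatterplot(x = embedding[:, 0], y = embedding[:, 1],
                    hue = df['species'].to_numpy(), ax = ax)
    ax.set_title(f'n_neighbors = {n}')
plt.show()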
Conclusion
In this article, we explored various dimensionality reduction techniques, by utilizing standard Python libraries, as well as the Palmer Penguins dataset. We observed that t-SNE and UMAP separated the penguin species more successfully, making them well suited for data visualization. In general, those techniques are effective at capturing local structure, whereas PCA is better at retaining the global dataset structure. Considering this is an introductory article, I encourage you to look into newer dimensionality reduction techniques, such as PaCMAP or TriMAP [7]. I also suggest that you follow me on LinkedIn, where I regularly post data science content. You can also visit my personal website or check my book, titled Simplifying Machine Learning with PyCaret.
References
[1] Verleysen, Michel, and Damien François. “The curse of dimensionality in data mining and time series prediction.” International work-conference on artificial neural networks. Springer Berlin Heidelberg, 2005.
[2] Horst, Allison M., Alison Presmanes Hill, and Kristen B. Gorman. “Palmer Archipelago Penguins Data in the palmerpenguins R Package-An Alternative to Anderson’s Irises.” R Journal 14.1 (2022).
[3] Norton, H. James, and George Divine. “Simpson’s paradox… and how to avoid it.” Significance 12.4 (2015): 40–43.
[4] Shlens, Jonathon. “A tutorial on principal component analysis.” arXiv preprint arXiv:1404.1100 (2014).
[5] Van der Maaten, Laurens, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of machine learning research 9.11 (2008).
[6] McInnes, Leland, John Healy, and James Melville. “Umap: Uniform manifold approximation and projection for dimension reduction.” arXiv preprint arXiv:1802.03426 (2018).
[7] Wang, Yingfan, et al. “Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization.” The Journal of Machine Learning Research 22.1 (2021): 9129–9201.