Dimensionality Reduction Techniques in Machine Learning: PCA, t-SNE, and UMAP Guide
Introduction
High-dimensional data is everywhere—from image pixels to gene expressions. Dimensionality reduction helps us visualize, understand, and process this data more efficiently. This guide covers the most important techniques: PCA for linear reduction, t-SNE for non-linear visualization, and UMAP for modern high-dimensional analysis.
Understanding these methods will help you tackle the curse of dimensionality and create meaningful visualizations of complex datasets.
Why Dimensionality Reduction?
Key Benefits:
- Visualization: Plot high-dimensional data in 2D/3D
- Faster computation: Fewer features = faster algorithms
- Storage efficiency: Compress data while preserving information
- Noise reduction: Focus on important patterns
- Feature extraction: Discover hidden structures
Common challenges with high-dimensional data:
- Curse of dimensionality
- Visualization difficulty
- Computational complexity
- Overfitting risk
- Memory constraints
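The curse of dimensionality has a concrete signature: as the number of dimensions grows, pairwise distances concentrate around their mean, so "near" and "far" neighbors become almost indistinguishable. A minimal NumPy sketch (the dimensions and sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_spread(d: int, n_pairs: int = 2000) -> float:
    """Std/mean of distances between random point pairs in the unit cube [0, 1]^d."""
    a = rng.random((n_pairs, d))
    b = rng.random((n_pairs, d))
    dists = np.linalg.norm(a - b, axis=1)
    return dists.std() / dists.mean()

# Distances concentrate: the spread relative to the mean shrinks with d,
# so nearest and farthest neighbors become nearly equidistant.
for d in [2, 10, 100, 1000]:
    print(f"d={d:4d}  relative spread = {relative_spread(d):.3f}")
```

Distance-based methods (k-NN, clustering, t-SNE's neighbor graphs) all degrade as this ratio shrinks, which is a core motivation for reducing dimensionality first.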
Principal Component Analysis (PCA)
Complete PCA Implementation
# @filename: analysis.py
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits, load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from typing import List, Dict, Optional

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
class PCAAnalyzer:
"""Comprehensive PCA analysis and visualization"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
self.pca_model: Optional[PCA] = None
self.scaler: Optional[StandardScaler] = None
def fit_pca(self, X: np.ndarray, n_components: Optional[int] = None) -> 'PCAAnalyzer':
"""Fit PCA model with optional standardization"""
# Standardize data
self.scaler = StandardScaler()
X_scaled = self.scaler.fit_transform(X)
# Fit PCA
if n_components is None:
n_components = min(X.shape[0], X.shape[1])
self.pca_model = PCA(n_components=n_components, random_state=self.random_state)
self.pca_model.fit(X_scaled)
return self
def transform(self, X: np.ndarray) -> np.ndarray:
"""Transform data using fitted PCA"""
if self.pca_model is None or self.scaler is None:
raise ValueError("Must fit PCA first")
X_scaled = self.scaler.transform(X)
return self.pca_model.transform(X_scaled)
def explained_variance_analysis(self) -> Dict:
"""Analyze explained variance ratios"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
explained_variance_ratio = self.pca_model.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Find components for different thresholds
thresholds = [0.8, 0.9, 0.95, 0.99]
components_needed = {}
        for threshold in thresholds:
            # Guard: only report thresholds the retained components actually reach
            if cumulative_variance[-1] >= threshold:
                n_components = int(np.argmax(cumulative_variance >= threshold)) + 1
                components_needed[threshold] = n_components
return {
'explained_variance_ratio': explained_variance_ratio,
'cumulative_variance': cumulative_variance,
'components_needed': components_needed
}
def plot_explained_variance(self, max_components: int = 20):
"""Plot explained variance analysis"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
variance_data = self.explained_variance_analysis()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Individual explained variance
n_components = min(max_components, len(variance_data['explained_variance_ratio']))
components = range(1, n_components + 1)
ax1.bar(components, variance_data['explained_variance_ratio'][:n_components],
alpha=0.7, color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Individual Component Variance', fontweight='bold')
ax1.grid(True, alpha=0.3)
# Cumulative explained variance
ax2.plot(components, variance_data['cumulative_variance'][:n_components],
'o-', linewidth=3, markersize=6, color='red')
# Add threshold lines
thresholds = [0.8, 0.9, 0.95]
colors = ['orange', 'green', 'purple']
for threshold, color in zip(thresholds, colors):
ax2.axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
label=f'{threshold*100:.0f}% variance')
if threshold in variance_data['components_needed']:
n_comp = variance_data['components_needed'][threshold]
if n_comp <= n_components:
ax2.axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
ax2.text(n_comp + 0.5, threshold + 0.02, f'{n_comp} comp.',
color=color, fontweight='bold')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
ax2.set_title('Cumulative Variance Explained', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print summary
print("PCA Variance Analysis:")
print("-" * 30)
for threshold, n_comp in variance_data['components_needed'].items():
print(f"{threshold*100:.0f}% variance: {n_comp} components")
def plot_2d_projection(self, X: np.ndarray, y: Optional[np.ndarray] = None,
feature_names: Optional[List[str]] = None):
"""Plot first two principal components"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
X_pca = self.transform(X)
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Scatter plot
if y is not None:
scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y,
cmap='viridis', alpha=0.7, s=50)
plt.colorbar(scatter, ax=axes[0])
else:
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, s=50)
explained_var = self.pca_model.explained_variance_ratio_
axes[0].set_xlabel(f'PC1 ({explained_var[0]:.3f})', fontsize=12)
axes[0].set_ylabel(f'PC2 ({explained_var[1]:.3f})', fontsize=12)
axes[0].set_title('First Two Principal Components', fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Component loadings
if feature_names is not None and len(feature_names) <= 20:
loadings = self.pca_model.components_[:2].T
for i, feature in enumerate(feature_names):
axes[1].arrow(0, 0, loadings[i, 0], loadings[i, 1],
head_width=0.02, head_length=0.02, fc='red', ec='red')
axes[1].text(loadings[i, 0]*1.1, loadings[i, 1]*1.1, feature,
fontsize=10, ha='center', va='center')
axes[1].set_xlabel(f'PC1 Loadings ({explained_var[0]:.3f})', fontsize=12)
axes[1].set_ylabel(f'PC2 Loadings ({explained_var[1]:.3f})', fontsize=12)
axes[1].set_title('Component Loadings', fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(-1.1, 1.1)
axes[1].set_ylim(-1.1, 1.1)
else:
axes[1].text(0.5, 0.5, 'Too many features\nfor loading plot',
ha='center', va='center', transform=axes[1].transAxes,
fontsize=14)
axes[1].set_title('Component Loadings (Skipped)', fontweight='bold')
plt.tight_layout()
plt.show()
# Load and analyze digits dataset
print("=== PCA Analysis on Digits Dataset ===")
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
print(f"Original data shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")
# Fit PCA
pca_analyzer = PCAAnalyzer()
pca_analyzer.fit_pca(X_digits, n_components=50)
# Analyze explained variance
pca_analyzer.plot_explained_variance(max_components=20)
# Visualize 2D projection
pca_analyzer.plot_2d_projection(X_digits, y_digits)
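Beyond projection, PCA is also a lossy compressor: `inverse_transform` maps the reduced representation back to the original 64-dimensional space, and the reconstruction error shrinks as more variance is retained. A short sketch on the digits data (keeping 20 components is an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # shape (1797, 64)

# Compress 64-dim digit images down to 20 components, then reconstruct.
pca = PCA(n_components=20, random_state=42)
X_reduced = pca.fit_transform(X)
X_restored = pca.inverse_transform(X_reduced)

# Mean squared reconstruction error vs. the variance we kept
mse = np.mean((X - X_restored) ** 2)
print(f"Kept variance: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Reconstruction MSE: {mse:.3f}")
```

Sweeping `n_components` and plotting MSE gives the same elbow as the cumulative-variance curve above, just in reconstruction-error units.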
PCA for Classification Performance
# @filename: analysis.py
def pca_classification_analysis(X: np.ndarray, y: np.ndarray,
                                max_components: int = 50) -> List[Dict]:
"""Analyze how PCA affects classification performance"""
component_range = range(2, min(max_components + 1, X.shape[1]), 2)
results = []
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
for n_components in component_range:
# Apply PCA
pca = PCA(n_components=n_components, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# Train classifier
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_pca, y_train)
# Evaluate
train_acc = clf.score(X_train_pca, y_train)
test_acc = clf.score(X_test_pca, y_test)
# Store results
results.append({
'n_components': n_components,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'explained_variance': pca.explained_variance_ratio_.sum()
})
print(f"Components: {n_components:2d}, "
f"Test Acc: {test_acc:.4f}, "
f"Variance: {pca.explained_variance_ratio_.sum():.3f}")
return results
def plot_pca_performance_analysis(results: List[Dict]):
"""Plot PCA performance analysis"""
n_components = [r['n_components'] for r in results]
train_accs = [r['train_accuracy'] for r in results]
test_accs = [r['test_accuracy'] for r in results]
explained_vars = [r['explained_variance'] for r in results]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Accuracy vs components
ax1.plot(n_components, train_accs, 'o-', linewidth=2,
markersize=6, label='Training Accuracy', color='blue')
ax1.plot(n_components, test_accs, 's-', linewidth=2,
markersize=6, label='Test Accuracy', color='red')
ax1.set_xlabel('Number of PCA Components', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Classification Performance vs PCA Components', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Explained variance vs components
ax2.plot(n_components, explained_vars, 'o-', linewidth=2,
markersize=6, color='green')
ax2.set_xlabel('Number of PCA Components', fontsize=12)
ax2.set_ylabel('Explained Variance Ratio', fontsize=12)
ax2.set_title('Explained Variance vs Components', fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Find optimal number of components
best_test_idx = np.argmax([r['test_accuracy'] for r in results])
best_result = results[best_test_idx]
print(f"\nOptimal Configuration:")
print(f"Components: {best_result['n_components']}")
print(f"Test Accuracy: {best_result['test_accuracy']:.4f}")
print(f"Explained Variance: {best_result['explained_variance']:.3f}")
# Run classification analysis
print("\n=== PCA Classification Performance Analysis ===")
pca_results = pca_classification_analysis(X_digits, y_digits, max_components=30)
plot_pca_performance_analysis(pca_results)
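The manual loop above can also be expressed as a scikit-learn `Pipeline`, which refits scaling, PCA, and the classifier together on each fold so the test data never leaks into the scaler or PCA fit, and lets `GridSearchCV` choose `n_components` by cross-validation. A sketch (the component grid and `cv=3` are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Each fold refits all three steps, avoiding preprocessing leakage.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

grid = GridSearchCV(pipe, {"pca__n_components": [10, 20, 30, 40]}, cv=3)
grid.fit(X, y)
print(f"Best n_components: {grid.best_params_['pca__n_components']}")
print(f"CV accuracy: {grid.best_score_:.3f}")
```

Cross-validated selection is more robust than the single train/test split used above, at the cost of extra fits.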
t-SNE for Non-linear Visualization
# @filename: analysis.py
from sklearn.manifold import TSNE
class TSNEAnalyzer:
"""t-SNE analysis and visualization"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def parameter_analysis(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 1000) -> Dict:
"""Analyze different t-SNE parameters"""
# Sample data for faster analysis
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
# Different parameter combinations
parameter_combinations = [
{'perplexity': 30, 'learning_rate': 200, 'n_iter': 1000},
{'perplexity': 50, 'learning_rate': 200, 'n_iter': 1000},
{'perplexity': 30, 'learning_rate': 'auto', 'n_iter': 1000},
{'perplexity': 30, 'learning_rate': 200, 'n_iter': 2000}
]
results = {}
for i, params in enumerate(parameter_combinations):
print(f"Running t-SNE with parameters: {params}")
start_time = time.time()
tsne = TSNE(
n_components=2,
random_state=self.random_state,
**params
)
X_tsne = tsne.fit_transform(X_scaled)
runtime = time.time() - start_time
results[f"Config_{i+1}"] = {
'params': params,
'embedding': X_tsne,
'labels': y_sample,
'runtime': runtime,
'kl_divergence': tsne.kl_divergence_
}
print(f" Runtime: {runtime:.2f}s, KL divergence: {tsne.kl_divergence_:.2f}")
return results
def plot_tsne_comparison(self, results: Dict):
"""Plot t-SNE results comparison"""
n_configs = len(results)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for i, (config_name, data) in enumerate(results.items()):
X_tsne = data['embedding']
y_sample = data['labels']
params = data['params']
runtime = data['runtime']
kl_div = data['kl_divergence']
scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_sample, cmap='tab10',
alpha=0.7, s=30)
# Title with parameters
title = f"Perplexity: {params['perplexity']}, LR: {params['learning_rate']}\n"
title += f"Runtime: {runtime:.1f}s, KL: {kl_div:.2f}"
axes[i].set_title(title, fontsize=10, fontweight='bold')
axes[i].set_xlabel('t-SNE 1', fontsize=10)
axes[i].set_ylabel('t-SNE 2', fontsize=10)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
def perplexity_analysis(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 500) -> Dict:
"""Analyze impact of perplexity parameter"""
# Sample data
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
perplexity_values = [5, 15, 30, 50, 100]
results = {}
for perplexity in perplexity_values:
print(f"Testing perplexity: {perplexity}")
# Adjust perplexity if too large for dataset
actual_perplexity = min(perplexity, (X_sample.shape[0] - 1) // 3)
tsne = TSNE(
n_components=2,
perplexity=actual_perplexity,
random_state=self.random_state,
n_iter=1000
)
X_tsne = tsne.fit_transform(X_scaled)
results[actual_perplexity] = {
'embedding': X_tsne,
'labels': y_sample,
'kl_divergence': tsne.kl_divergence_
}
return results
def plot_perplexity_analysis(self, results: Dict):
"""Plot perplexity analysis results"""
n_perplexities = len(results)
fig, axes = plt.subplots(1, n_perplexities, figsize=(4*n_perplexities, 5))
if n_perplexities == 1:
axes = [axes]
for i, (perplexity, data) in enumerate(results.items()):
X_tsne = data['embedding']
y_sample = data['labels']
kl_div = data['kl_divergence']
scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_sample, cmap='tab10',
alpha=0.7, s=30)
axes[i].set_title(f'Perplexity: {perplexity}\nKL: {kl_div:.2f}',
fontweight='bold')
axes[i].set_xlabel('t-SNE 1', fontsize=10)
axes[i].set_ylabel('t-SNE 2', fontsize=10)
plt.tight_layout()
plt.show()
# Analyze t-SNE on digits dataset
print("\n=== t-SNE Analysis ===")
tsne_analyzer = TSNEAnalyzer()
# Parameter analysis
print("Comparing different t-SNE parameters...")
tsne_param_results = tsne_analyzer.parameter_analysis(X_digits, y_digits, sample_size=800)
tsne_analyzer.plot_tsne_comparison(tsne_param_results)
# Perplexity analysis
print("\nAnalyzing perplexity impact...")
perplexity_results = tsne_analyzer.perplexity_analysis(X_digits, y_digits, sample_size=600)
tsne_analyzer.plot_perplexity_analysis(perplexity_results)
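In practice, t-SNE is usually run on a PCA-reduced matrix rather than on raw features: PCA cheaply strips low-variance noise directions and shrinks the distance computations t-SNE repeats at every iteration. A minimal sketch of that two-stage recipe (the 500-sample subset and the 50-component intermediate are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subsample keeps the demo fast
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]
X_scaled = StandardScaler().fit_transform(X)

# Step 1: PCA removes noise directions cheaply.
X_pca = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

# Step 2: t-SNE embeds the compact 50-dim representation in 2D.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X_pca)
print(X_embedded.shape)  # (500, 2)
```

This is the "combine methods" pattern recommended later in this guide: linear reduction for speed and denoising, non-linear embedding for the final visualization.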
UMAP for Modern Dimensionality Reduction
# @filename: analysis.py
try:
import umap
class UMAPAnalyzer:
"""UMAP analysis and comparison"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def parameter_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze different UMAP parameters"""
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
parameter_combinations = [
{'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'euclidean'},
{'n_neighbors': 30, 'min_dist': 0.1, 'metric': 'euclidean'},
{'n_neighbors': 15, 'min_dist': 0.5, 'metric': 'euclidean'},
{'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'cosine'}
]
results = {}
for i, params in enumerate(parameter_combinations):
print(f"Running UMAP with parameters: {params}")
start_time = time.time()
umap_model = umap.UMAP(
n_components=2,
random_state=self.random_state,
**params
)
X_umap = umap_model.fit_transform(X_scaled)
runtime = time.time() - start_time
results[f"Config_{i+1}"] = {
'params': params,
'embedding': X_umap,
'labels': y,
'runtime': runtime
}
print(f" Runtime: {runtime:.2f}s")
return results
def plot_umap_comparison(self, results: Dict):
"""Plot UMAP results comparison"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for i, (config_name, data) in enumerate(results.items()):
X_umap = data['embedding']
y_labels = data['labels']
params = data['params']
runtime = data['runtime']
scatter = axes[i].scatter(X_umap[:, 0], X_umap[:, 1],
c=y_labels, cmap='tab10',
alpha=0.7, s=30)
# Title with parameters
title = f"n_neighbors: {params['n_neighbors']}, min_dist: {params['min_dist']}\n"
title += f"metric: {params['metric']}, Runtime: {runtime:.1f}s"
axes[i].set_title(title, fontsize=10, fontweight='bold')
axes[i].set_xlabel('UMAP 1', fontsize=10)
axes[i].set_ylabel('UMAP 2', fontsize=10)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
def compare_with_other_methods(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 1000) -> Dict:
"""Compare UMAP with PCA and t-SNE"""
# Sample data for fair comparison
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
methods = {}
# PCA
print("Running PCA...")
start_time = time.time()
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
methods['PCA'] = {
'embedding': X_pca,
'runtime': pca_time,
'explained_variance': pca.explained_variance_ratio_.sum()
}
# t-SNE
print("Running t-SNE...")
start_time = time.time()
tsne = TSNE(n_components=2, random_state=self.random_state,
perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
methods['t-SNE'] = {
'embedding': X_tsne,
'runtime': tsne_time,
'kl_divergence': tsne.kl_divergence_
}
# UMAP
print("Running UMAP...")
start_time = time.time()
umap_model = umap.UMAP(n_components=2, random_state=self.random_state,
n_neighbors=15, min_dist=0.1)
X_umap = umap_model.fit_transform(X_scaled)
umap_time = time.time() - start_time
methods['UMAP'] = {
'embedding': X_umap,
'runtime': umap_time
}
# Add labels to all methods
for method_data in methods.values():
method_data['labels'] = y_sample
return methods
def plot_method_comparison(self, methods: Dict):
"""Plot comparison of different methods"""
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, (method_name, data) in enumerate(methods.items()):
embedding = data['embedding']
labels = data['labels']
runtime = data['runtime']
scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1],
c=labels, cmap='tab10', alpha=0.7, s=30)
title = f"{method_name}\nRuntime: {runtime:.2f}s"
if method_name == 'PCA':
title += f"\nExplained Var: {data['explained_variance']:.3f}"
elif method_name == 't-SNE':
title += f"\nKL Divergence: {data['kl_divergence']:.2f}"
axes[i].set_title(title, fontweight='bold')
axes[i].set_xlabel(f'{method_name} 1', fontsize=12)
axes[i].set_ylabel(f'{method_name} 2', fontsize=12)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
# Print runtime comparison
print("\nRuntime Comparison:")
print("-" * 20)
for method, data in methods.items():
print(f"{method:6}: {data['runtime']:.2f}s")
# Run UMAP analysis
print("\n=== UMAP Analysis ===")
umap_analyzer = UMAPAnalyzer()
# Parameter analysis
print("Comparing different UMAP parameters...")
umap_param_results = umap_analyzer.parameter_analysis(X_digits[:800], y_digits[:800])
umap_analyzer.plot_umap_comparison(umap_param_results)
# Method comparison
print("\nComparing PCA, t-SNE, and UMAP...")
method_comparison = umap_analyzer.compare_with_other_methods(X_digits, y_digits, sample_size=800)
umap_analyzer.plot_method_comparison(method_comparison)
except ImportError:
print("UMAP not installed. Install with: pip install umap-learn")
print("Skipping UMAP analysis...")
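Eyeballing scatter plots is subjective. scikit-learn's `trustworthiness` score quantifies how well an embedding preserves each point's local neighborhood (1.0 = perfectly preserved), which gives the method comparisons above a number to argue over. A sketch comparing PCA and t-SNE on a digits subsample (`n_neighbors=5` and the 500-sample subset are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X[:500])

emb_pca = PCA(n_components=2, random_state=42).fit_transform(X)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Fraction of each point's nearest neighbors preserved by the embedding
scores = {}
for name, emb in [("PCA", emb_pca), ("t-SNE", emb_tsne)]:
    scores[name] = trustworthiness(X, emb, n_neighbors=5)
    print(f"{name}: trustworthiness = {scores[name]:.3f}")
```

The same score works for UMAP embeddings, so it slots directly into the comparison functions above.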
Comprehensive Comparison
# @filename: analysis.py
def comprehensive_dimensionality_reduction_analysis():
"""Complete comparison of dimensionality reduction techniques"""
# Load different datasets
datasets = {
'Digits': load_digits(),
'Breast Cancer': load_breast_cancer()
}
results_summary = []
for dataset_name, dataset in datasets.items():
print(f"\n=== Analysis on {dataset_name} Dataset ===")
X, y = dataset.data, dataset.target
print(f"Original shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
# Sample for consistent comparison
if X.shape[0] > 500:
indices = np.random.choice(X.shape[0], 500, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
# Apply methods
methods_data = {}
# PCA
pca = PCA(n_components=2, random_state=42)
start_time = time.time()
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
methods_data['PCA'] = {
'embedding': X_pca,
'runtime': pca_time,
'variance_explained': pca.explained_variance_ratio_.sum()
}
# t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=500)
start_time = time.time()
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
methods_data['t-SNE'] = {
'embedding': X_tsne,
'runtime': tsne_time,
'kl_divergence': tsne.kl_divergence_
}
# Store results
for method, data in methods_data.items():
results_summary.append({
'dataset': dataset_name,
'method': method,
'original_dims': X.shape[1],
'n_samples': X_sample.shape[0],
'runtime': data['runtime']
})
# Plot comparison for this dataset
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for i, (method, data) in enumerate(methods_data.items()):
embedding = data['embedding']
scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1],
c=y_sample, cmap='tab10', alpha=0.7, s=30)
title = f"{method} - {dataset_name}\nRuntime: {data['runtime']:.2f}s"
if method == 'PCA':
title += f"\nVar Explained: {data['variance_explained']:.3f}"
elif method == 't-SNE':
title += f"\nKL Div: {data['kl_divergence']:.2f}"
axes[i].set_title(title, fontweight='bold')
axes[i].set_xlabel(f'{method} 1')
axes[i].set_ylabel(f'{method} 2')
if i == 0:
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
return results_summary
# Run comprehensive analysis
print("\n=== Comprehensive Dimensionality Reduction Analysis ===")
summary_results = comprehensive_dimensionality_reduction_analysis()
# Create summary table
summary_df = pd.DataFrame(summary_results)
print("\nSummary Results:")
print(summary_df.to_string(index=False))
When to Use Each Method
Method Selection Guide
| Method | Best For | Pros | Cons |
|---|---|---|---|
| PCA | Linear relationships, feature reduction | Fast, interpretable, preserves variance | Linear only, may miss non-linear patterns |
| t-SNE | Visualization, cluster discovery | Captures non-linear structure, well-separated clusters | Slow, non-deterministic across runs, preserves mainly local structure |
| UMAP | General purpose, large datasets | Fast, preserves local & global structure | Newer, fewer theoretical guarantees |
Key Recommendations
- Start with PCA for initial exploration and linear relationships
- Use t-SNE for beautiful visualizations and cluster discovery
- Choose UMAP for large datasets and balanced local/global structure
- Combine methods - use PCA for preprocessing, then t-SNE/UMAP
- Consider computational cost - PCA is fastest, t-SNE is slowest
Performance Guidelines
- PCA: Use when you need interpretable components and fast computation
- t-SNE: Perplexity = 5-50, higher for larger datasets
- UMAP: n_neighbors = 10-50, min_dist = 0.1-0.5
- Preprocessing: Always standardize features first
- Sample size: Use sampling for t-SNE on large datasets (>10k samples)
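The sampling advice above fits in a small helper: a stratified subsample keeps class proportions intact while capping the work t-SNE has to do. A sketch (the `subsample` name and the 5,000-point cap are hypothetical choices; the demo data is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def subsample(X, y, n_max=5000, random_state=42):
    """Stratified subsample so t-SNE stays tractable on large datasets."""
    if X.shape[0] <= n_max:
        return X, y
    # train_test_split with stratify preserves per-class proportions
    X_small, _, y_small, _ = train_test_split(
        X, y, train_size=n_max, stratify=y, random_state=random_state
    )
    return X_small, y_small

# Synthetic 20k-sample dataset: the helper hands t-SNE only 5k points.
rng = np.random.default_rng(0)
X_big = rng.random((20_000, 10))
y_big = rng.integers(0, 10, 20_000)
X_small, y_small = subsample(X_big, y_big)
print(X_small.shape)  # (5000, 10)
```

Stratification matters here: a plain random sample can underrepresent rare classes, which then vanish from the embedding entirely.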
Conclusion
Dimensionality reduction is essential for high-dimensional data analysis. Key takeaways:
- PCA for linear reduction and fast feature extraction
- t-SNE for non-linear visualization and cluster discovery
- UMAP for modern, balanced dimensionality reduction
- Parameter tuning significantly impacts results
- Preprocessing and standardization are crucial
- Method combination often works better than single approaches
Choose your method based on data characteristics, computational constraints, and analysis goals.
References
- Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
- Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Connect with me on LinkedIn or X to discuss dimensionality reduction techniques!
