As the title suggests

    Lecture 10, on dimensionality reduction, was given by a guest speaker invited from outside the university, so there are no slides on the course website. The timeline and speech transcript are accordingly less useful, so I took the following notes while listening to the lecture.

    Contents

    • High Dimensional Data
    • Dimensionality Reduction Methods
    • Clustering (a discrete form of reduction)

    High Dimensional Data

    Data matrix X: n*m.

    n samples as rows, m observables as columns.

    How to visualize?

    • m=1: list
    • m=2: x-y plot

    Aim of dimension reduction: reduce the number of observables, while keeping relevant information.
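    As a minimal sketch of this setup in R (names and data here are illustrative, not from the lecture), a data matrix with n samples as rows and m observables as columns can be visualized directly when m = 2:

    ```r
    # Illustrative data matrix: n samples (rows), m observables (columns)
    set.seed(1)
    n <- 200
    m <- 2
    X <- matrix(rnorm(n * m), nrow = n, ncol = m)
    colnames(X) <- paste0("obs", 1:m)

    # m = 2: a simple x-y scatter plot is enough to visualize the data
    plot(X[, 1], X[, 2], xlab = "obs1", ylab = "obs2")
    ```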

    Examples

    Ferromagnetic Ising model of a 28*28 lattice, 50,000 samples. The 1st principal component scales with the magnetization.

    500,000 DNA sites of 3,000 humans. Projection onto the first 2 principal components gives a rough map of Europe.

    Fashion MNIST: 60,000 pictures of 784 (28*28) pixels. UMAP projections onto 2 dimensions

    Linear dimension reduction

    • Principal component analysis
    • Non-negative matrix factorization
    • Independent component analysis
    • Factor analysis

    Principal component analysis

    change of basis

    Aim: to find a transformation such that

    • the components are uncorrelated
    • ordered by the explained variability

    Math:

    $X^T X$ is an $m \times m$ matrix. Its eigen-decomposition gives $(\lambda, W)$, where the columns of $W$ are eigenvectors sorted by the eigenvalues $\lambda$.

    $T = X \cdot W$, where $T$ contains the scores on the principal components, and $T^T T$ is diagonal.

    $T_r = X \cdot W_r$, where $W_r$ is the $m \times r$ matrix of the first $r$ components, so $T_r$ is $n \times r$.
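    A minimal R sketch of these steps (random data and illustrative names, not the lecture's code): eigen-decompose $X^T X$ on centred data, project onto all components, and keep the first r columns.

    ```r
    # PCA via the eigen-decomposition of X^T X (X column-centred first)
    set.seed(1)
    X <- scale(matrix(rnorm(100 * 5), nrow = 100), center = TRUE, scale = FALSE)

    eig <- eigen(t(X) %*% X)      # eigenvalues come out in decreasing order
    W   <- eig$vectors            # columns = eigenvectors (loadings)
    T_  <- X %*% W                # scores on all principal components

    round(t(T_) %*% T_, 8)        # diagonal matrix: components are uncorrelated

    r   <- 2
    W_r <- W[, 1:r, drop = FALSE] # m x r matrix of the first r components
    T_r <- X %*% W_r              # n x r reduced representation
    ```

    In practice the same result (up to signs) is obtained with prcomp(), which uses an SVD of X instead of forming X^T X explicitly.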

    Practice in R

    • Iris dataset (see the R sketch after this list)
    • Pictures of objects at different angles
    • Swiss roll: the transformed data retain almost the same structure as the original data, which shows the limits of PCA due to its linearity
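    A minimal sketch of the iris example, using the built-in dataset and prcomp() (the parameter choices are mine, not necessarily the lecture's):

    ```r
    # PCA on the built-in iris dataset; prcomp() centres the data and uses SVD
    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

    summary(pca)                      # proportion of variance explained per PC
    plot(pca$x[, 1], pca$x[, 2],
         col = iris$Species,
         xlab = "PC1", ylab = "PC2")  # the three species separate fairly well
    ```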

    wikipedia - Nonlinear_dimensionality_reduction

    UMAP: Uniform Manifold Approximation and Projection

    Assumptions:

    • the data are uniformly distributed on a Riemannian manifold
    • the Riemannian metric is locally constant (or can be approximated as such)
    • the manifold is locally connected

    Results:

    • models the manifold with a fuzzy topological structure
    • finds a low-dimensional projection of the data that has the closest possible equivalent fuzzy topological structure

    Simplices:

    • 0-simplex: dot
    • 1-simplex: line
    • 2-simplex: triangle
    • 3-simplex: regular tetrahedron

    Problem:

    When the uniform-distribution assumption is not met, the result may not be what we want. To circumvent this, the local metric is defined using the nearest neighbours.

    Finding a low dimensional representation

    • derive a similar fuzzy topological representation for a candidate low-dimensional embedding
    • interpret edge weights as probabilities
    • minimize the cross entropy between the two representations (the objective is sketched after this list)
    • pair-code.github.io/understand_umap
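    As I understood the objective (notation mine): with the edge weights of the high-dimensional fuzzy representation read as probabilities $p_{ij}$ and those of the low-dimensional one as $q_{ij}$, UMAP minimizes the fuzzy-set cross entropy

    $$ C = \sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right] $$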

    Practice in R

    • Swiss roll becomes a 2-d ring
    • Pictures at different angles are well separated (a minimal R example follows)
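    The lecture's own data are not reproduced here; as a hedged stand-in, this is how such a run looks with the CRAN umap package on iris (an assumed choice of package and data):

    ```r
    # 2-d UMAP embedding with the CRAN 'umap' package (default settings)
    library(umap)

    emb <- umap(iris[, 1:4])            # returns a 2-d layout by default

    plot(emb$layout[, 1], emb$layout[, 2],
         col = iris$Species,
         xlab = "UMAP 1", ylab = "UMAP 2")
    ```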

    Clustering

    hierarchical clustering

    • metric
    • linkage criterion / distance between clusters (an R sketch follows this list)
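    A minimal sketch with base R's hclust() (the metric and linkage here are illustrative choices, not the lecture's):

    ```r
    # Hierarchical clustering: Euclidean metric, average linkage
    d  <- dist(iris[, 1:4], method = "euclidean")   # metric
    hc <- hclust(d, method = "average")             # linkage criterion

    plot(hc)            # dendrogram
    cutree(hc, k = 3)   # cut the tree into 3 clusters
    ```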

    centroid-based clustering (k-means)

    • pre-specify number of clusters, k
    • partition the data such that the total squared distance to the cluster means is minimal
    • Lloyd's algorithm (sketched in R after this list)
      • start with random centers
      • assign each data point to the nearest center
      • compute new centers
      • repeat the last two steps until the assignments no longer change
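    A minimal sketch with the built-in kmeans(), which offers a Lloyd-style algorithm (k, the data, and the number of restarts are illustrative):

    ```r
    # k-means with Lloyd iterations; nstart restarts from different random centers
    set.seed(1)
    km <- kmeans(iris[, 1:4], centers = 3,
                 algorithm = "Lloyd", nstart = 10, iter.max = 100)

    km$cluster        # cluster assignment for each sample
    km$tot.withinss   # total within-cluster sum of squared distances
    ```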

    graph based-clustering

    • graph representation of data
    • partition the data such that the modularity Q is optimized (an igraph sketch follows this list)
    • $Q = \sum_{i=1}^{c} \left( e_{ii} - \gamma a_i^2 \right)$, with $a_i = \sum_{j=1}^{c} e_{ij}$
      • $e_{ij}$: fraction of links between clusters i and j
      • $\gamma$: resolution
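    A minimal sketch with igraph (the k-nearest-neighbour graph construction and all parameters are my illustrative choices; the resolution argument of cluster_louvain corresponds to gamma and needs a reasonably recent igraph version):

    ```r
    # Graph-based clustering: kNN graph + Louvain modularity optimization
    library(igraph)

    X <- as.matrix(iris[, 1:4])
    k <- 10
    # for each sample, indices of its k nearest neighbours (excluding itself)
    nn <- apply(as.matrix(dist(X)), 1, function(d) order(d)[2:(k + 1)])

    edges <- cbind(rep(1:nrow(X), each = k), as.vector(nn))
    g     <- simplify(graph_from_edgelist(edges, directed = FALSE))

    cl <- cluster_louvain(g, resolution = 1)
    membership(cl)   # cluster label per sample
    modularity(cl)   # value of Q for the partition found
    ```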

    Summary

    Question: How to identify order in high-dimensional data?

    • dimensionality reduction to visualize high-dimensional data and to identify macroscopic observables
    • clustering to identify discrete order in high-dimensional data
    • which methods to use depends on the problem at hand

    See also: