Random Matrix Theory and its applications in biology

1. Random matrix examples and the Wigner’s semicircle law

Example 1: Normal random symmetric matrix

  • Generate a 5, 000 × 5, 000 random symmetric matrix with entries Aij ∼ N(0,1)
  • Compute eigenvalues
  • Draw the histogram of all eigenvalues.
n=5000
m = array(rnorm(n^2),c(n,n));
m2 = (m+t(m))/sqrt(2*n); # Make m symmetric
lambda = eigen(m2, symmetric=T, only.values = T);
ev = lambda$values;
hist(ev, breaks=seq(-2.01,2.01,0.02),main=NA, xlab="Eigenvalues",freq=F)

## Example 2: Uniform random symmetric matrix - Generate a 5,000 x 5,000 random symmetric matrix with entries Aij ∼ Uniform(0, 1) - Compute eigenvalues - Draw the histogram of all eigenvalues

n = 5000;
mu = array(runif(n^2),c(n,n))
mu2 = sqrt(12)*(mu+t(mu)-1)/sqrt(2*n)
lambdau = eigen(mu2, symmetric=T, only.values = T)
ev = lambdau$values;
hist(ev, breaks=seq(-2.01,2.01,0.02), main=NA, xlab="Eigenvalues",freq=F)

Pattern: the density of eigenvalues is a semicircle!

Wigner’s semicircle law

semicircle.law

2. Random correlation matrices and the Marc̆henko-Pastur distribution

Describe the eigenvalue distribution of a correlation matrix

https://journals.aps.org/pre/abstract/10.1103/PhysRevE.65.066126

MP1 MP2

Example: Independent and identically distributed (IID) random normal returns

t = 5000; #5000 time points
m = 1000; #1000 stocks
h = array(rnorm(m*t),c(m,t)); # Time series in rows, obs in row, feature in column
e = h %*% t(h)/t; # Form the correlation matrix
lambdae = eigen(e, symmetric=T, only.values = T);
ee = lambdae$values;
hist(ee, breaks =seq(0.01,3.01,.02), 
     #main='Empirical eigenvalue density',
     main='Empirical eigenvalue density with \nsuperimposed Marc ̆henko-Pastur density',
     xlab='Eigenvalues', freq=F)

# ratio of the matrix dimensions
Q = t/m
s=sd(as.numeric(h))
# boundaries
a=s^2*(1-sqrt(1/Q))^2
b=s^2*(1+sqrt(1/Q))^2
# theoretical MP distribution
MP<-function(x){
  Q/2/pi/s^2 * ( ((a-x)*(x-b))^0.5 / x )
}
#pos=seq(min(ee),max(ee),length=100);
pos=seq(0,3,length=200);
lines(pos,MP(pos),type='l',col='red',lwd=2)

Compare the empirical eigenvalue density with theoretical Marc̆henko-Pastur density with different matrix sizes.

MP_size_mt<-function(m,t){
  #t = 5000; #5000 time points
  #m = 1000; #1000 stocks
  h = array(rnorm(m*t),c(m,t)); # Time series in rows, obs in row, feature in column
  e = h %*% t(h)/t; # Form the correlation matrix
  lambdae = eigen(e, symmetric=T, only.values = T);
  ee = lambdae$values;
  hist(ee, breaks =seq(0.01,3.01,.02), 
       #main='Empirical eigenvalue density',
       main=paste0(m,'x',t,'\nEmpirical vs MP'),
       xlab='Eigenvalues', freq=F)
  
  # ratio of the matrix dimensions
  Q = t/m
  s=sd(as.numeric(h))
  # boundaries
  a=s^2*(1-sqrt(1/Q))^2
  b=s^2*(1+sqrt(1/Q))^2
  # theoretical MP distribution
  MP<-function(x){
    Q/2/pi/s^2 * ( ((a-x)*(x-b))^0.5 / x )
  }
  #pos=seq(min(ee),max(ee),length=100);
  pos=seq(0,3,length=200);
  lines(pos,MP(pos),type='l',col='red',lwd=2)
}
MP_size_mt(m=1000,t=5000)

MP_size_mt(m=100,t=500)

MP_size_mt(m=10,t=50)

MP densities depends on Q = t/m

out=list()
s=1
for(Q in c(1,2,5)){
  # boundaries
  a=s^2*(1-sqrt(1/Q))^2
  b=s^2*(1+sqrt(1/Q))^2
  # theoretical MP distribution
  MP<-function(x){
    Q/2/pi/s^2 * ( ((a-x)*(x-b))^0.5 / x )
  }
  pos=seq(0,5,length=1000);
  out[[as.character(Q)]]=MP(pos)
}
plot(0,0,xlim=c(0,5),ylim=c(0,1.2),type='n');
cl=rainbow(3)
for(i in 1:length(out)){
  lines(pos,out[[i]],col=cl[i],type='l',lwd=2)
}
legend("topright", legend = paste('Q',names(out)), col=cl, pch=1) # optional legend

Applications

  • denoise single cell data
  • selecting the number of components in ICA
    • Teschendorff, Andrew E., Joanna Zhuang, and Martin Widschwendter. “Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies.” Bioinformatics 27.11 (2011): 1496-1505.
    • Wang, Weixu, et al. “Independent component analysis based gene co-expression network inference (ICAnet) to decipher functional modules for better single-cell clustering and batch integration.” Nucleic acids research 49.9 (2021): e54-e54.
  • identify cell clusters with non-random substructure
    • Mircea, Maria, et al. “Phiclust: a clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations.” Genome biology 23.1 (2022): 1-24.

3. The Tracy-Widom law

The distribution of the largest eigenvalue of a random correlation matrix

TW

Applications

  • detect structure in genetic data
    • Patterson, Nick, Alkes L. Price, and David Reich. “Population structure and eigenanalysis.” PLoS genetics 2.12 (2006): e190.
  • selecting the number of PC components in a PCA analysis
    • Wang, Lin, Wei Zhang, and Qizhai Li. “AssocTests: an R package for genetic association studies.” Journal of Statistical Software 94.5 (2020): 1-26.
AssocTests::tw(eigenvalues = c(5, 3, 1, 0), eigenL = 4, criticalpoint = 2.0234)
## 
## 	Tracy-Widom test
## 
## data:  c(5, 3, 1, 0)
## TW1 = -0.82427, TW2 = -0.60186, TW3 = -0.55525, TW4 = NaN
## alternative hypothesis: the eigenvalue is significant
#eigenvalues: a numeric vector whose elements are the eigenvalues of a matrix. The values should be sorted in the descending order.
#eigenL: the number of eigenvalues.
#criticalpoint: a numeric value corresponding to the significance level. If the significance level is 0.05, 0.01, 0.005, or 0.001, the criticalpoint should be set to be 0.9793, 2.0234, 2.4224, or 3.2724, accordingly. The default is 2.0234.

4. NNSD, Distribution of Nearest-Neighbor eigenvalue Spacings.

the distribution of the spacing between the eigenvalues

R package RMThreshold: https://github.com/cran/RMThreshold https://github.com/cran/RMThreshold/blob/master/inst/doc/RMThreshold_Intro.pdf

nnsd

Application: Use NNSD to threshold a correlation matrix

  • Luo, Feng, et al. “Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory.” BMC bioinformatics 8.1 (2007): 1-17.
  • Gibson, Scott M., et al. “Massive-scale gene co-expression network construction and robustness testing using random matrix theory.” PloS one 8.2 (2013): e55871.
  • RMThreshold: Signal-Noise Separation in Random Matrices by using Eigenvalue Spectrum Analysis, https://cran.r-project.org/web/packages/RMThreshold/index.html
  • Wang, Haiying, et al. “Integrated metagenomic analysis of the rumen microbiome of cattle reveals key biological mechanisms associated with methane traits.” Methods 124 (2017): 108-119.

References: - https://www.stats.ox.ac.uk/~cucuring/ Lecture 4: PCA in high dimensions, random matrix theory and financial applications Foundations of Data Science: Algorithms and Mathematical Foundations

  • Plerou, Vasiliki, et al. “Random matrix approach to cross correlations in financial data.” Physical Review E 65.6 (2002): 066126.
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.13     knitr_1.37          cluster_2.1.2      
##  [4] magrittr_2.0.2      R6_2.5.1            rlang_1.0.6        
##  [7] fastmap_1.1.0       stringr_1.4.0       highr_0.9          
## [10] timeSeries_3062.100 tools_4.1.3         timeDate_3043.102  
## [13] xfun_0.39           cli_3.5.0           jquerylib_0.1.4    
## [16] htmltools_0.5.2     yaml_2.3.4          digest_0.6.29      
## [19] bookdown_0.25       sass_0.4.0          evaluate_0.15      
## [22] rmarkdown_2.13      blogdown_1.17.2     fGarch_3042.83.2   
## [25] stringi_1.7.8       compiler_4.1.3      bslib_0.3.1        
## [28] fBasics_3042.89.1   fExtremes_3042.82   AssocTests_1.0-1   
## [31] spatial_7.3-15      combinat_0.0-8      mvtnorm_1.1-3      
## [34] jsonlite_1.8.0