Rank Aggregation Analysis

Overview

Figure source: Li, Xue, Xinlei Wang, and Guanghua Xiao. “A comparative study of rank aggregation methods for partial and top ranked lists in genomic applications.” Briefings in bioinformatics 20.1 (2019): 178-189.

Intro

I first came across rank aggregation in a 2003 paper from Stuart Kim's lab, A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules.

This is one of those early papers proposing new analytical tools to extract biological information from omics data.

This paper caught my attention because it used gene correlations to infer gene functions. Knowledge of gene function usually comes from three sources: genetic, evolutionary, and biochemical evidence (ref: Defining functional DNA elements in the human genome).

Assembling microarray datasets across four species (human, fly, worm, yeast), this study used gene pairs that were recurrently highly correlated and conserved across species to infer gene functions, on the principle that evolutionary conservation implies biological function.

Specifically, it first calculated gene correlations within each species and assigned a rank to each gene pair. Repeating this across species (which requires identifying orthologs) gave each gene pair a ranking vector composed of its ranks across datasets.
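
The per-dataset ranking step can be sketched in a few lines of base R. This is only an illustration on simulated values: the matrix `expr`, the gene names, and the choice of Pearson correlation are my assumptions, and the paper's actual correlation measure and ortholog mapping are not reproduced here.

```r
set.seed(1)
# Hypothetical expression matrix: 20 samples (rows) x 5 genes (columns)
expr <- matrix(rnorm(100), nrow = 20, ncol = 5,
               dimnames = list(NULL, paste0("gene", 1:5)))

# Pairwise correlations between genes
cmat <- cor(expr)

# Keep each unordered gene pair once (upper triangle of the matrix)
idx <- which(upper.tri(cmat), arr.ind = TRUE)
pair_ranks <- data.frame(
  gene_a = colnames(cmat)[idx[, "row"]],
  gene_b = colnames(cmat)[idx[, "col"]],
  cor    = cmat[idx]
)

# Rank 1 = most highly correlated pair in this dataset
pair_ranks$rank <- rank(-pair_ranks$cor, ties.method = "first")
pair_ranks <- pair_ranks[order(pair_ranks$rank), ]
pair_ranks
```

Running the same steps on each dataset and collecting the ranks of one gene pair gives that pair's ranking vector.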

How can we determine whether a gene pair is significantly highly correlated across datasets?

In the supplementary material, this paper proposed a probabilistic model.

It first transformed a ranking vector into a ranking ratio vector \((r_1, r_2, ..., r_n)\) by dividing each rank by the total number of gene pairs in the corresponding dataset.

If each \(r_s\) were drawn independently and uniformly from [0, 1], the P-value of the observed ranking ratio vector (sorted so that \(r_1 \le r_2 \le ... \le r_n\)) could be computed from the joint cumulative distribution of an n-dimensional order statistic: \(P(r_1,r_2,...,r_n) = n!\int_{0}^{r_1}\int_{s_1}^{r_2}...\int_{s_{n-1}}^{r_n}ds_n...ds_2ds_1\)

With this equation, the authors used a recursive formula to efficiently compute the P-value.
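
To get a feel for the formula, here is the smallest non-trivial case worked by hand (my own check, not taken from the paper). For n = 2 with sorted ratios \(r_1 \le r_2\):

\[
P(r_1, r_2) = 2!\int_{0}^{r_1}\int_{s_1}^{r_2} ds_2\,ds_1 = 2\int_{0}^{r_1}(r_2 - s_1)\,ds_1 = 2r_1r_2 - r_1^2
\]

With \(r_1 = 0.1\) and \(r_2 = 0.2\) this gives \(2(0.1)(0.2) - 0.1^2 = 0.03\). When \(r_1 = r_2 = r\) it reduces to \(r^2\), the probability that two independent uniform draws both fall below \(r\), as expected.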

Although this approach looks intuitive and direct, the rank aggregation analysis they employed turned out to have broader implications.

Distribution-based Rank Aggregation

Then in 2006, a methods paper in Nature Biotechnology generalized this rank aggregation idea to much broader contexts.

The problem was formulated as follows: if you have information about gene importance from multiple data sources, how can you robustly and efficiently integrate this information into a final importance score for each gene?

This gene prioritization problem is nicely visualized by their Figure 1.

Figure source: Aerts, Stein, et al. “Gene prioritization through genomic data fusion.” Nature biotechnology 24.5 (2006): 537-544.

The formula for calculating probabilities from an N-dimensional order statistic is exactly the same as in the 2003 paper, but they designed a more efficient and accurate algorithm to compute it.
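
The order-statistic integral can be evaluated without numerical integration. The sketch below uses a recursion of the form V_0 = 1, V_k = sum over i = 1..k of (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i, with the result N! * V_N. This is how I recall the 2006 paper's formula, so treat the exact form as my assumption and check it against the paper. The function name `order_stat_q` is made up for this post.

```r
# Joint CDF of N uniform order statistics, evaluated at the sorted ratios r.
# Recursion (assumed form): V_0 = 1,
#   V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * r_{N-k+1}^i,
# result = N! * V_N. Cost is O(N^2).
order_stat_q <- function(r) {
  r <- sort(r)              # the formula assumes r_1 <= r_2 <= ... <= r_N
  n <- length(r)
  v <- numeric(n + 1)       # v[k + 1] stores V_k
  v[1] <- 1
  for (k in seq_len(n)) {
    i <- seq_len(k)
    v[k + 1] <- sum((-1)^(i - 1) * v[k - i + 1] / factorial(i) * r[n - k + 1]^i)
  }
  factorial(n) * v[n + 1]
}

# For N = 2 the integral has the closed form 2 * r1 * r2 - r1^2
order_stat_q(c(0.1, 0.2))   # 0.03, matching 2 * 0.1 * 0.2 - 0.1^2

# Monte Carlo check for N = 3: fraction of sorted uniform triples that
# fall componentwise below the sorted ratios
set.seed(1)
r <- c(0.2, 0.5, 0.9)
u <- t(apply(matrix(runif(3e5), ncol = 3), 1, sort))
mc <- mean(u[, 1] <= r[1] & u[, 2] <= r[2] & u[, 3] <= r[3])
c(recursion = order_stat_q(r), monte_carlo = mc)
```

The two numbers in the last line should agree to about two decimal places. Because the sum alternates in sign, I would expect this plain version to lose precision when N gets large, so it is meant for understanding rather than production use.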

Along with this paper, they released a software package, Endeavour, implementing the algorithm, which was a user-friendly tool in that “early” bioinformatics world. But now, it’s an R or Python universe~

R package implementing rank aggregation

Actually, there are several R packages that perform rank aggregation analysis, as reviewed in this paper. A ‘walk-through’ map can be found at the beginning of this post.

Here I use the R package RobustRankAggreg as a showcase.

This package implements several different algorithms to calculate P values through the function aggregateRanks.

Suppose you have a set of items, and all or a subset of them were ranked separately by different data sources. After ranking, you have an input list of ranking vectors.
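
To make the input format concrete, here is a toy list of three ranked vectors and a base-R sketch that turns it into a matrix of normalized ranks. The helper `toy_rank_matrix` is hypothetical and only approximates what I understand `rankMatrix` to do by default (items missing from a list get the value 1); use the package function for real analyses.

```r
# Three hypothetical data sources, each ranking a subset of items (best first)
glist <- list(
  src1 = c("a", "b", "c", "d"),
  src2 = c("b", "a", "d"),
  src3 = c("c", "a")
)

toy_rank_matrix <- function(glist, N = length(unique(unlist(glist)))) {
  items <- unique(unlist(glist))
  # Start with 1 everywhere: an unranked item is treated as ranked last
  rmat <- matrix(1, nrow = length(items), ncol = length(glist),
                 dimnames = list(items, names(glist)))
  for (j in seq_along(glist)) {
    # Position in the list divided by the total number of rankable items
    rmat[glist[[j]], j] <- seq_along(glist[[j]]) / N
  }
  rmat
}

r_toy <- toy_rank_matrix(glist)
r_toy   # items in rows, data sources in columns
```

Each column is one data source, and small values mean the item sits near the top of that source's list.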

For the distribution-based method, aggregateRanks(..., method='RRA', ...) gives each item a Score, which is close to a P value but not an exact one.

Refer to the original publication: RRA

In practice, multiply the Score by the number of input lists (the number of data sources) to obtain an upper bound on the P value for each item, then use ‘BH’ to perform multiple-testing correction.
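
Why the Score is not an exact P value is easier to see in code. As I understand the RRA paper, the normalized ranks of each item are sorted, the k-th smallest is compared with its null Beta(k, n - k + 1) distribution, and the smallest of the resulting n probabilities is taken as the score. A minimum of n P values is too small to be a P value itself, so multiplying by n (Bonferroni) gives an upper bound. The sketch below is base R on made-up ranks, the function names are mine, and it may differ in details from what the package does internally (for instance, whether some correction is already applied to the returned Score).

```r
# P(k-th smallest of n uniforms <= x) for k = 1..n, evaluated at the sorted ranks
beta_scores <- function(r) {
  r <- sort(r)
  n <- length(r)
  pbeta(r, shape1 = seq_len(n), shape2 = n - seq_len(n) + 1)
}

# Score = smallest of the n order-statistic probabilities
rho_raw <- function(r) min(beta_scores(r))

# Hypothetical normalized ranks: 3 items (rows) across 4 data sources (columns)
rmat <- rbind(
  top    = c(0.01, 0.02, 0.05, 0.90),  # near the top in 3 of 4 sources
  middle = c(0.40, 0.55, 0.60, 0.45),
  bottom = c(0.95, 0.80, 0.99, 0.85)
)

score   <- apply(rmat, 1, rho_raw)
p_upper <- pmin(score * ncol(rmat), 1)       # Bonferroni over the number of lists
p_adj   <- p.adjust(p_upper, method = "BH")  # multiple testing across items
cbind(score, p_upper, p_adj)
```

Note that the item called `top` still scores well even though one source ranks it near the bottom, which is the robustness the method's name refers to.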

R code demo:

library(RobustRankAggreg)
data(cellCycleKO)
r = rankMatrix(cellCycleKO$gl, N = cellCycleKO$N)
dim(r) #item in row, data source in column

## [1] 2372   12

r[1:3,1:4]

##                 ACE2 CAC2 FKH1         HST3
## YLR286C 0.0001611344    1    1 0.0141798260
## YHR143W 0.0003222688    1    1 0.0225588141
## YFL026W 0.0004834032    1    1 0.0001611344

ar = aggregateRanks(rmat = r, method = 'RRA', full = TRUE)
head(ar)

##            Name        Score
## YJR148W YJR148W 2.326157e-12
## YMR034C YMR034C 5.704013e-10
## YPL016W YPL016W 1.626147e-09
## YKR093W YKR093W 8.819815e-09
## YOR043W YOR043W 2.520095e-07
## YFL026W YFL026W 2.972454e-07

dim(ar) #2372 features ranked by Score or P values

## [1] 2372    2

# multiply by the number of input lists (columns of r, 12 here), capped at 1
ar$derive.p.value = pmin(ar$Score * ncol(r), 1)
ar$adjust.p = p.adjust(ar$derive.p.value, method = 'BH')
par(mfrow=c(1,3))
hist(ar$Score)
hist(ar$derive.p.value)
hist(ar$adjust.p)

RandomNote

Stuart Kim, the corresponding author of the 2003 Science paper, is also the corresponding author of an aging research paper, Aging Mice Show a Decreasing Correlation of Gene Expression within Genetic Modules, which leverages microarray data and a network analysis approach to study aging from a systems biology perspective.

Stein Aerts, the first author of the 2006 Nature Biotechnology paper, is the corresponding author of the first Single-cell Aging Fly Brain Atlas paper. With machine learning algorithms and single-cell techniques, his lab is generating new knowledge about fly brains.

I think their trajectories are pretty illuminating, as simple ideas in early careers gradually mature and develop into something huge.

References

Stuart, Joshua M., et al. “A gene-coexpression network for global discovery of conserved genetic modules.” Science 302.5643 (2003): 249-255.
Aerts, Stein, et al. “Gene prioritization through genomic data fusion.” Nature biotechnology 24.5 (2006): 537-544.
Kolde, Raivo, et al. “Robust rank aggregation for gene list integration and meta-analysis.” Bioinformatics 28.4 (2012): 573-580.

sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RobustRankAggreg_1.1
## 
## loaded via a namespace (and not attached):
##  [1] bookdown_0.25   digest_0.6.29   R6_2.5.1        jsonlite_1.8.0 
##  [5] magrittr_2.0.2  evaluate_0.15   highr_0.9       blogdown_1.17.2
##  [9] stringi_1.7.8   rlang_1.0.6     cli_3.5.0       rstudioapi_0.13
## [13] jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.13  tools_4.1.3    
## [17] stringr_1.4.0   xfun_0.39       yaml_2.3.4      fastmap_1.1.0  
## [21] compiler_4.1.3  htmltools_0.5.2 knitr_1.37      sass_0.4.0