Using voomByGroup with RNA-seq data

0.1 Overview

Group heteroscedasticity is commonly observed in pseudo-bulk single cell RNA-seq datasets and when not modelled appropriately, its presence can hamper the detection of differentially expressed genes. Here, we introduce two methods that account for heteroscedastic groups, namely voomByGroup and voomWithQualityWeights using a blocked design (voomQWB). Since they are both voom derived methods, to illustrate how to use them and provide you an idea of the difference between results they generated, we will follow the standard RNA-seq analysis workflow using limma from RNASeq123. Whilst we use pseudo-bulk samples in this help page, the idea of modelling group variances more closely can in theory be extended to DE analysis on other data types, such as bulk RNA-seq data, pseudo-bulk of spatial scRNA-seq data, and surface protein data from CITE-seq.

0.2 Loading and processing the data

The data we use here is the COVID-19 PBMC data used in our preprint. CD56dim CD16+ NK cells are extracted to create pseudo-bulk samples. rds files are provided in our github, under the folder example/data.

library(limma)
library(edgeR)

readRDS("data/counts.rds") -> counts
readRDS("data/group_id.rds") -> group_id

Filter out lowly expressed genes and create a DGElist object.

counts[rowSums(counts)>50,]-> counts_k
#create a DGElist
DGEList(counts_k) -> y
#normalization
y <- calcNormFactors(y)
y$samples$group <- group_id

Make the design matrix based on group information we have.

cd <- y$samples$group
model.matrix(~0+cd) -> design
design

##    cdAsymptomatic cdHC cdModerate
## 1               1    0          0
## 2               1    0          0
## 3               0    0          1
## 4               0    0          1
## 5               0    0          1
## 6               0    0          1
## 7               0    1          0
## 8               0    1          0
## 9               0    1          0
## 10              1    0          0
## 11              1    0          0
## 12              0    0          1
## 13              1    0          0
## 14              1    0          0
## 15              1    0          0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$cd
## [1] "contr.treatment"

Make contrasts for pairwise comparisons between cell populations based on our design matrix.

contr.matrix <- makeContrasts(
  AvsH = cdAsymptomatic - cdHC,
  MvsH = cdModerate - cdHC,
  MvsA = cdModerate- cdAsymptomatic,
  levels = colnames(design))
contr.matrix

##                 Contrasts
## Levels           AvsH MvsH MvsA
##   cdAsymptomatic    1    0   -1
##   cdHC             -1   -1    0
##   cdModerate        0    1    1

Have a check on the MDS plot.

cols=y$samples$group
table(cols)

## cols
## Asymptomatic           HC     Moderate 
##            7            3            5

cols[cols=="Asymptomatic"]<- "red"
cols[cols=="HC"]<- "blue"
cols[cols=="Moderate"]<- "orange"
par(mfrow=c(1,1))
limma::plotMDS(y,col=cols)

As we can see here, based on the MDS plot, samples from HC exhibit relatively lower group variation with all three samples sitting close to each other, while samples from moderate and asymptomatic patients exhibit relatively more variation.

0.3 Removing heteroscedascity from count data

Before we fit a linear model to find the DE genes, we need to deal with the dependency between mean and variance from RNA-seq data. voom was originally designed to model the mean-variance relationship with a LOWESS trends fitted, and precision weights were then calculated for all observations based on estimated variance. Besides voom, there are 3 voomderived strategies, including voomWithQualityWeights , which takes sample variability into account, voomWithQualityWeights with block used, and voomByGroup, which take group-wise variability into account. Details of the difference between their strategies are covered in our preprint.

In this help page, we would focus on how to use these functions and diagnostic plots generated by them.

0.3.1 voom

To run voom, y (count matrix), and design (design matrix with cell population information) are required as input. Input for other parameters are optional. Typically, the “voom-plot” shows a decreasing trend between the means and variances resulting from a combination of technical variation in the sequencing experiment and biological variation.

voom(y, design = design, plot = TRUE) -> y_voom

0.3.2 voomWithQualityWeights

To run voomWithQualityWeights with sample variability estimated, similarly, we need to provide y (count matrix) and design (design matrix). Besides the ‘voom-plot’, this function provides a bar plot listing sample-specific weights, which are used to adjust the observational mean-variance relationship obtained by voom. If one sample weight is above 1, then this sample is less precise, and it will be taken less care of, and vice versa.

voomWithQualityWeights(y, design = design, plot = TRUE) -> y_vqw

0.3.3 voomWithQualityWeights (block)

To run voomWithQualityWeights with block variability estimated, besides, y (count matrix) and design (design matrix), we need to specify var.group. Next to the ‘voom-plot’, the value of sample-specific weights in the bar plot changes from that each sample has its own estimated weight to that samples from the same block share the same weight.

voomWithQualityWeights(y, design = design, var.group= cd, plot = TRUE) -> y_qwb

0.3.4 voomByGroup

To run voomByGroup with block variability estimated, besides, y (count matrix) and design (design matrix), we need to specify group. Here we set the plot=“combine”, so that we have all group-wise trends summarized in one plot which makes relative differences between the curves easier to spot. Besides running voomByGroup to do the group-wise mean-variance modeling, we also recommend the user to use this group-specific mean-variance plots as a useful ‘first glance’ of the data before any formal testing was carried out

source("/stornext/HPCScratch/home/you.y/simulation/functions/voomByGroup.R")
voomByGroup(y,design = design, group = cd, plot = "combine") -> y_vbg

## Group:
## 1 Asymptomatic 
## 2 HC 
## 3 Moderate

As we can see here, the shapes between groups are slightly different, and HC exhibit relatively lower biological variation.

0.4 Differential expression analysis

Then we fit the linear model. Here, to simply the code, we generate a function find_de to generate differentially expressed genes with same cutoff used at once.

find_de <- function(y, design ,contr.matrix) {
 fit <- lmFit(y, design)
 vfit <- contrasts.fit(fit, contrasts=contr.matrix)
 tfit <- treat(vfit, lfc=0)
 decideTests(tfit) -> de
 return(de)
}

find_de(y_voom, design, contr.matrix) -> de_voom
summary(de_voom)

##        AvsH MvsH MvsA
## Down    211   55    7
## NotSig 8781 9134 9226
## Up      242   45    1

find_de(y_vqw, design, contr.matrix) -> de_vqw
summary(de_vqw)

##        AvsH MvsH MvsA
## Down    368  128   11
## NotSig 8456 8978 9222
## Up      410  128    1

find_de(y_qwb, design, contr.matrix) -> de_qwb
summary(de_qwb)

##        AvsH MvsH MvsA
## Down    327   73    7
## NotSig 8515 9080 9226
## Up      392   81    1

find_de(y_vbg, design, contr.matrix) -> de_vbg
summary(de_vbg)

##        AvsH MvsH MvsA
## Down    402  154    2
## NotSig 8354 8915 9232
## Up      478  165    0

0.5 Comparing between the DE genes

To visualize the differences between number of DE genes generated by DE methods with different weighting strategies, we make a venn plot.

cbind(de_voom[,1],de_vqw[,1],de_qwb[,1],de_vbg[,1]) -> df
colnames(df) <- c("voom","voomQW","voomQWB","voomByGroup")
vennDiagram(df,
            circle.col=c("turquoise","salmon","green","purple"))

From voom to voomQW and voomQWB, then voomByGroup, the methods increase in their level of group-specific variance modelling. The overlaps between these methods and the extra DE genes reflect the hierarchy in variance modelling for these methods, and demonstrate the potential gain in statistical power when capturing group variance more accurately.

0.6 Conclusion

Above example shows that when there are unequal group variances, voom detected fewest genes, because it overestimate the variance from HC by using an overall trend obtained by all observations. voomQW and voomQWB detected relatively more DE genes compare to typical voom on account of their weighting strategy that take sample-level variability in to account. voomByGroup detected more DE genes as well, which results from its flexibility in capturing technical and biological variation based on mean-variance curves.