Group heteroscedasticity is commonly observed in pseudo-bulk single
cell RNA-seq datasets and when not modelled appropriately, its presence
can hamper the detection of differentially expressed genes. Here, we
introduce two methods that account for heteroscedastic groups, namely
voomByGroup
and voomWithQualityWeights
using a
blocked design (voomQWB). Since they are both voom
derived
methods, to illustrate how to use them and provide you an idea of the
difference between results they generated, we will follow the standard
RNA-seq analysis workflow using limma from RNASeq123. Whilst
we use pseudo-bulk samples in this help page, the idea of modelling
group variances more closely can in theory be extended to DE analysis on
other data types, such as bulk RNA-seq data, pseudo-bulk of spatial
scRNA-seq data, and surface protein data from CITE-seq.
The data we use here is the COVID-19 PBMC data used in our preprint. CD56dim CD16+ NK cells are extracted to create pseudo-bulk samples. rds files are provided in our github, under the folder example/data.
library(limma)
library(edgeR)
readRDS("data/counts.rds") -> counts
readRDS("data/group_id.rds") -> group_id
Filter out lowly expressed genes and create a DGElist object.
counts[rowSums(counts)>50,]-> counts_k
#create a DGElist
DGEList(counts_k) -> y
#normalization
y <- calcNormFactors(y)
y$samples$group <- group_id
Make the design matrix based on group information we have.
cd <- y$samples$group
model.matrix(~0+cd) -> design
design
## cdAsymptomatic cdHC cdModerate
## 1 1 0 0
## 2 1 0 0
## 3 0 0 1
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## 7 0 1 0
## 8 0 1 0
## 9 0 1 0
## 10 1 0 0
## 11 1 0 0
## 12 0 0 1
## 13 1 0 0
## 14 1 0 0
## 15 1 0 0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$cd
## [1] "contr.treatment"
Make contrasts for pairwise comparisons between cell populations based on our design matrix.
contr.matrix <- makeContrasts(
AvsH = cdAsymptomatic - cdHC,
MvsH = cdModerate - cdHC,
MvsA = cdModerate- cdAsymptomatic,
levels = colnames(design))
contr.matrix
## Contrasts
## Levels AvsH MvsH MvsA
## cdAsymptomatic 1 0 -1
## cdHC -1 -1 0
## cdModerate 0 1 1
Have a check on the MDS plot.
cols=y$samples$group
table(cols)
## cols
## Asymptomatic HC Moderate
## 7 3 5
cols[cols=="Asymptomatic"]<- "red"
cols[cols=="HC"]<- "blue"
cols[cols=="Moderate"]<- "orange"
par(mfrow=c(1,1))
limma::plotMDS(y,col=cols)
As we can see here, based on the MDS plot, samples from HC exhibit relatively lower group variation with all three samples sitting close to each other, while samples from moderate and asymptomatic patients exhibit relatively more variation.
Before we fit a linear model to find the DE genes, we need to deal
with the dependency between mean and variance from RNA-seq data.
voom
was originally designed to model the mean-variance
relationship with a LOWESS trends fitted, and precision weights were
then calculated for all observations based on estimated variance.
Besides voom
, there are 3 voom
derived
strategies, including voomWithQualityWeights
, which takes
sample variability into account, voomWithQualityWeights
with block used, and voomByGroup
, which take group-wise
variability into account. Details of the difference between their
strategies are covered in our preprint.
In this help page, we would focus on how to use these functions and diagnostic plots generated by them.
To run voom
, y (count matrix), and
design (design matrix with cell population information)
are required as input. Input for other parameters are optional.
Typically, the “voom-plot” shows a decreasing trend between the means
and variances resulting from a combination of technical variation in the
sequencing experiment and biological variation.
voom(y, design = design, plot = TRUE) -> y_voom
To run voomWithQualityWeights
with sample variability
estimated, similarly, we need to provide y (count
matrix) and design (design matrix). Besides the
‘voom-plot’, this function provides a bar plot listing sample-specific
weights, which are used to adjust the observational mean-variance
relationship obtained by voom
. If one sample weight is
above 1, then this sample is less precise, and it will be taken less
care of, and vice versa.
voomWithQualityWeights(y, design = design, plot = TRUE) -> y_vqw
To run voomWithQualityWeights
with block variability
estimated, besides, y (count matrix) and
design (design matrix), we need to specify
var.group. Next to the ‘voom-plot’, the value of
sample-specific weights in the bar plot changes from that each sample
has its own estimated weight to that samples from the same block share
the same weight.
voomWithQualityWeights(y, design = design, var.group= cd, plot = TRUE) -> y_qwb
To run voomByGroup
with block variability estimated,
besides, y (count matrix) and design
(design matrix), we need to specify group. Here we set
the plot=“combine”, so that we have all group-wise
trends summarized in one plot which makes relative differences between
the curves easier to spot. Besides running voomByGroup
to
do the group-wise mean-variance modeling, we also recommend the user to
use this group-specific mean-variance plots as a useful ‘first glance’
of the data before any formal testing was carried out
source("/stornext/HPCScratch/home/you.y/simulation/functions/voomByGroup.R")
voomByGroup(y,design = design, group = cd, plot = "combine") -> y_vbg
## Group:
## 1 Asymptomatic
## 2 HC
## 3 Moderate
As we can see here, the shapes between groups are slightly different, and HC exhibit relatively lower biological variation.
Then we fit the linear model. Here, to simply the code, we generate a function find_de to generate differentially expressed genes with same cutoff used at once.
find_de <- function(y, design ,contr.matrix) {
fit <- lmFit(y, design)
vfit <- contrasts.fit(fit, contrasts=contr.matrix)
tfit <- treat(vfit, lfc=0)
decideTests(tfit) -> de
return(de)
}
find_de(y_voom, design, contr.matrix) -> de_voom
summary(de_voom)
## AvsH MvsH MvsA
## Down 211 55 7
## NotSig 8781 9134 9226
## Up 242 45 1
find_de(y_vqw, design, contr.matrix) -> de_vqw
summary(de_vqw)
## AvsH MvsH MvsA
## Down 368 128 11
## NotSig 8456 8978 9222
## Up 410 128 1
find_de(y_qwb, design, contr.matrix) -> de_qwb
summary(de_qwb)
## AvsH MvsH MvsA
## Down 327 73 7
## NotSig 8515 9080 9226
## Up 392 81 1
find_de(y_vbg, design, contr.matrix) -> de_vbg
summary(de_vbg)
## AvsH MvsH MvsA
## Down 402 154 2
## NotSig 8354 8915 9232
## Up 478 165 0
To visualize the differences between number of DE genes generated by DE methods with different weighting strategies, we make a venn plot.
cbind(de_voom[,1],de_vqw[,1],de_qwb[,1],de_vbg[,1]) -> df
colnames(df) <- c("voom","voomQW","voomQWB","voomByGroup")
vennDiagram(df,
circle.col=c("turquoise","salmon","green","purple"))
From voom
to voomQW
and
voomQWB
, then voomByGroup
, the methods
increase in their level of group-specific variance modelling. The
overlaps between these methods and the extra DE genes reflect the
hierarchy in variance modelling for these methods, and demonstrate the
potential gain in statistical power when capturing group variance more
accurately.
Above example shows that when there are unequal group variances,
voom
detected fewest genes, because it overestimate the
variance from HC by using an overall trend obtained by all observations.
voomQW
and voomQWB
detected relatively more DE
genes compare to typical voom
on account of their weighting
strategy that take sample-level variability in to account.
voomByGroup
detected more DE genes as well, which results
from its flexibility in capturing technical and biological variation
based on mean-variance curves.