Researchers may wish to superimpose a subset of the full dataset onto the full dataset. If a researcher is using the package to visualize RNA-seq data, then this subset of data is often differentially expressed genes (DEGs) returned from a model. In this case, the user may wish to use the dataMetrics
input parameter, which contains at least one quantitative variable returned from a model such as FDR, p-value, and log fold change.
As was shown in the article Data object, the data
object called soybean_ir_sub
contained 5,604 genes and two treatment groups, N and P (Lauter and Graham 2016). We can examine the structure of its corresponding dataMetrics
object called soybean_ir_sub_metrics
as follows:
## List of 1
## $ N_P:'data.frame': 5604 obs. of 6 variables:
## ..$ ID : chr [1:5604] "Glyma.19G168700.Wm82.a2.v1" "Glyma.13G293500.Wm82.a2.v1"
## "Glyma.05G188700.Wm82.a2.v1" "Glyma.13G173100.Wm82.a2.v1" ...
## ..$ logFC : num [1:5604] -5.92 2.99 -3.51 -3.91 -3.51 ...
## ..$ logCPM: num [1:5604] 7.52 8.08 8.83 8.27 10.19 ...
## ..$ LR : num [1:5604] 266 171 167 157 154 ...
## ..$ PValue: num [1:5604] 9.18e-60 3.65e-39 2.73e-38 6.04e-36 2.58e-35 ...
## ..$ FDR : num [1:5604] 5.14e-56 1.02e-35 5.09e-35 8.46e-33 2.89e-32 ...
Similarly, as was shown in the data page, the data
object called soybean_cn_sub
contained 7,332 genes and three treatment groups, S1, S2, and S3 (Brown and Hudson 2015). We can examine the structure of its corresponding dataMetrics
object called soybean_cn_sub_metrics
as follows:
## List of 3
## $ S1_S2:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma18g00690.1" "Glyma08g22380.1" "Glyma20g30460.1"
## "Glyma07g09700.1" ...
## ..$ logFC : num [1:7332] 3.14 2.62 2.5 2.37 2.74 ...
## ..$ logCPM: num [1:7332] 8.04 8.05 8.14 8.19 7.93 ...
## ..$ LR : num [1:7332] 29.3 24.8 24 22.7 21 ...
## ..$ PValue: num [1:7332] 6.08e-08 6.50e-07 9.82e-07 1.94e-06 4.57e-06 ...
## ..$ FDR : num [1:7332] 0.000446 0.002384 0.0024 0.003557 0.006697 ...
## $ S1_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g22380.1" "Glyma08g19290.1" "Glyma20g30460.1"
## "Glyma18g00690.1" ...
## ..$ logFC : num [1:7332] 3.18 3.32 2.44 2.46 3.03 ...
## ..$ logCPM: num [1:7332] 8.05 8.14 8.14 8.04 7.85 ...
## ..$ LR : num [1:7332] 30.4 24.7 23.3 22.5 22.3 ...
## ..$ PValue: num [1:7332] 3.60e-08 6.53e-07 1.42e-06 2.09e-06 2.30e-06 ...
## ..$ FDR : num [1:7332] 0.000264 0.002393 0.003378 0.003378 0.003378 ...
## $ S2_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g14670.3" "Glyma08g14670.2" "Glyma08g19290.1"
## "Glyma06g46701.1" ...
## ..$ logFC : num [1:7332] 2.66 2.72 2.78 2.19 2.29 ...
## ..$ logCPM: num [1:7332] 7.86 7.82 8.14 7.63 7.84 ...
## ..$ LR : num [1:7332] 16.1 16 14.3 11.8 10.5 ...
## ..$ PValue: num [1:7332] 6.01e-05 6.30e-05 1.57e-04 5.81e-04 1.21e-03 ...
## ..$ FDR : num [1:7332] 0.231 0.231 0.385 1 1 ...
As demonstrated in the two examples above, the dataMetrics
object must meet the following conditions:
list
data
object. For example, the soybean_ir_sub_metrics
object contains one list element (“N_P”) and the soybean_cn_sub_metrics
object contains three list elements (“S1_S2”, “S1_S3”, “S2_S3”).data.frame
^[a-zA-Z0-9]+_[a-zA-Z0-9]+
, where
character
consisting of the unique names of the genesnumeric
or integer
consisting of a quantitative variable. This can be called anything. In the examples above, there are five of such columns called “logFC”, “logCPM”, “LR”, “PValue”, and “FDR”.You can quickly double-check the names of the list elements in your dataMetrics
object as follows:
names(soybean_ir_sub_metrics)
## [1] "N_P"
names(soybean_cn_sub_metrics)
## [1] "S1_S2" "S1_S3" "S2_S3"
If your dataMetrics
object does not fit this format, bigPint
will likely throw an informative error about why your format was not recognized.
If a researcher is using the bigPint
package to plot RNA-seq data, then many will create the dataMetrics
object using popular RNA-seq packages such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015). These R packages will output several interesting quantitative variables for each gene in the dataset that can be incorporated into the dataMetrics
object. bigPint
can then apply thresholds to these variables and create subsets of genes to superimpose. To create numerous bigPint
plots with the least effort, we recommend creating a dataMetrics
object that contains at least the following column types:
Many bigPint
plots use “FDR” to determine “significant” genes and subset them as overlay (FDR < 0.05). The bigPint
volcano plot uses “PValue” and “logFC”. Naming these columns as above will save you time but the names and the default threshold values can be specified away from default when creating each bigPint
plot.
We now provide reproducible code for creating dataMetrics
objects with two or three treatment groups using both edgeR (Robinson, McCarthy, and Smyth 2010) and DESeq2 (Love, Huber, and Anders 2014).
The following example shows how to create the dataMetrics
object called soybean_ir_sub_metrics
, which was shown in the article Data metrics object (Lauter and Graham 2016). The dataset from which it is created (soybean_ir_sub
) contains only two treatment groups, N and P. In this case, the edgeR (Robinson, McCarthy, and Smyth 2010) package was primarily followed.
library(bigPint) library(edgeR) library(data.table) data(soybean_ir_sub) data = soybean_ir_sub rownames(data) = data[,1] y = DGEList(counts=data[,-1]) group = c(1,1,1,2,2,2) y = DGEList(counts=y, group=group) Group = factor(c(rep("N",3), rep("P",3))) design <- model.matrix(~0+Group, data=y$samples) colnames(design) <- levels(Group) y <- estimateDisp(y, design) fit <- glmFit(y, design) soybean_ir_sub_metrics <- list() for (i in 1:(ncol(fit)-1)){ for (j in (i+1):ncol(fit)){ contrast=rep(0,ncol(fit)) contrast[i]=1 contrast[j]=-1 lrt <- glmLRT(fit, contrast=contrast) lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]] setDT(lrt, keep.rownames = TRUE)[] colnames(lrt)[1] = "ID" lrt <- as.data.frame(lrt) soybean_ir_sub_metrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt } }
We can indeed examine the generated soybean_ir_sub_metrics
object as follows:
str(soybean_ir_sub_metrics, strict.width = "wrap")
## List of 1
## $ N_P:'data.frame': 5604 obs. of 6 variables:
## ..$ ID : chr [1:5604] "Glyma.19G168700.Wm82.a2.v1" "Glyma.13G293500.Wm82.a2.v1"
## "Glyma.05G188700.Wm82.a2.v1" "Glyma.13G173100.Wm82.a2.v1" ...
## ..$ logFC : num [1:5604] -5.92 2.99 -3.51 -3.91 -3.51 ...
## ..$ logCPM: num [1:5604] 7.52 8.08 8.83 8.27 10.19 ...
## ..$ LR : num [1:5604] 266 171 167 157 154 ...
## ..$ PValue: num [1:5604] 9.18e-60 3.65e-39 2.73e-38 6.04e-36 2.58e-35 ...
## ..$ FDR : num [1:5604] 5.14e-56 1.02e-35 5.09e-35 8.46e-33 2.89e-32 ...
And verify that it contains one list element:
names(soybean_ir_sub_metrics)
## [1] "N_P"
The following example shows how to create the dataMetrics
object called soybean_cn_sub_metrics
, which was shown in the article Data metrics object). The dataset from which it is created (soybean_cn_sub
) contains three treatment groups, S1, S2, and S3 (Brown and Hudson 2015). In this case, the edgeR (Robinson, McCarthy, and Smyth 2010) package was primarily followed.
data(soybean_cn_sub) data = soybean_cn_sub rownames(data) = data[,1] y = DGEList(counts=data[,-1]) group = c(1,1,1,2,2,2,3,3,3) y = DGEList(counts=y, group=group) Group = factor(c(rep("S1",3), rep("S2",3), rep("S3",3))) design <- model.matrix(~0+Group, data=y$samples) colnames(design) <- levels(Group) y <- estimateDisp(y, design) fit <- glmFit(y, design) soybean_cn_sub_metrics <- list() for (i in 1:(ncol(fit)-1)){ for (j in (i+1):ncol(fit)){ contrast=rep(0,ncol(fit)) contrast[i]=1 contrast[j]=-1 lrt <- glmLRT(fit, contrast=contrast) lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]] setDT(lrt, keep.rownames = TRUE)[] colnames(lrt)[1] = "ID" lrt <- as.data.frame(lrt) soybean_cn_sub_metrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt } }
We can indeed examine the generated soybean_cn_sub_metrics
object as follows:
str(soybean_cn_sub_metrics, strict.width = "wrap")
## List of 3
## $ S1_S2:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma18g00690.1" "Glyma08g22380.1" "Glyma20g30460.1"
## "Glyma07g09700.1" ...
## ..$ logFC : num [1:7332] 3.14 2.62 2.5 2.37 2.74 ...
## ..$ logCPM: num [1:7332] 8.04 8.05 8.14 8.19 7.93 ...
## ..$ LR : num [1:7332] 29.3 24.8 24 22.7 21 ...
## ..$ PValue: num [1:7332] 6.08e-08 6.50e-07 9.82e-07 1.94e-06 4.57e-06 ...
## ..$ FDR : num [1:7332] 0.000446 0.002384 0.0024 0.003557 0.006697 ...
## $ S1_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g22380.1" "Glyma08g19290.1" "Glyma20g30460.1"
## "Glyma18g00690.1" ...
## ..$ logFC : num [1:7332] 3.18 3.32 2.44 2.46 3.03 ...
## ..$ logCPM: num [1:7332] 8.05 8.14 8.14 8.04 7.85 ...
## ..$ LR : num [1:7332] 30.4 24.7 23.3 22.5 22.3 ...
## ..$ PValue: num [1:7332] 3.60e-08 6.53e-07 1.42e-06 2.09e-06 2.30e-06 ...
## ..$ FDR : num [1:7332] 0.000264 0.002393 0.003378 0.003378 0.003378 ...
## $ S2_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g14670.3" "Glyma08g14670.2" "Glyma08g19290.1"
## "Glyma06g46701.1" ...
## ..$ logFC : num [1:7332] 2.66 2.72 2.78 2.19 2.29 ...
## ..$ logCPM: num [1:7332] 7.86 7.82 8.14 7.63 7.84 ...
## ..$ LR : num [1:7332] 16.1 16 14.3 11.8 10.5 ...
## ..$ PValue: num [1:7332] 6.01e-05 6.30e-05 1.57e-04 5.81e-04 1.21e-03 ...
## ..$ FDR : num [1:7332] 0.231 0.231 0.385 1 1 ...
And verify that it contains three list element:
names(soybean_cn_sub_metrics)
## [1] "S1_S2" "S1_S3" "S2_S3"
This example shows how to create a dataMetrics
object from (soybean_ir_sub
). In this case, the DESeq2 (Love, Huber, and Anders 2014) package was used.
library(DESeq2) data(soybean_ir_sub) data = soybean_ir_sub rownames(data) = data[,1] data = as.matrix(data[,-1]) coldata = data.frame(row.names = colnames(data), treatment = unlist(lapply( colnames(data), function (x) unlist(strsplit(x, "[.]"))[1]))) dds = DESeqDataSetFromMatrix(countData = data, colData = coldata, design = ~ treatment) dds <- DESeq(dds) uTreat = unique(unlist(lapply(colnames(data), function (x) unlist(strsplit( x, "[.]"))[1]))) soybean_ir_sub_metrics <- list() for (i in 1:(length(uTreat)-1)){ for (j in (i+1):length(uTreat)){ res <- results(dds, contrast=c("treatment", uTreat[i], uTreat[j])) metrics = as.data.frame(res@listData) metrics = cbind(ID = res@rownames, metrics) metrics$ID = as.character(metrics$ID) metrics <- metrics[order(metrics$padj), ] soybean_ir_sub_metrics[[paste0(uTreat[i], "_", uTreat[j])]] <- metrics } }
By default, DESeq2
gives output with variables called “pvalue”, “padj”, and “log2FoldChange”. Various functions in bigPint
expect column names like “FDR”, “logFC”, and “PValue” respectively in the dataMetrics
object. That can be modified manually using the threshVar
input parameter each time creating a plot. But it is easier to simply rename these parameters from the start in the dataMetrics
object.
for (df in seq_len(length(soybean_ir_sub_metrics))){ whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="pvalue") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "PValue" whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="padj") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "FDR" whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="log2FoldChange") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "logFC" }
We can indeed examine the generated soybean_ir_sub_metrics
object as follows:
str(soybean_ir_sub_metrics, strict.width = "wrap")
## List of 1
## $ N_P:'data.frame': 5604 obs. of 7 variables:
## ..$ ID : chr [1:5604] "Glyma.11G141800.Wm82.a2.v1" "Glyma.11G044800.Wm82.a2.v1"
## "Glyma.05G188700.Wm82.a2.v1" "Glyma.13G173100.Wm82.a2.v1" ...
## ..$ baseMean: num [1:5604] 778 795 302 204 420 ...
## ..$ logFC : num [1:5604] -3.54 2.75 -3.53 -3.94 -3.1 ...
## ..$ lfcSE : num [1:5604] 0.249 0.207 0.267 0.31 0.244 ...
## ..$ stat : num [1:5604] -14.2 13.3 -13.2 -12.7 -12.7 ...
## ..$ PValue : num [1:5604] 5.49e-46 3.49e-40 5.34e-40 4.80e-37 5.76e-37 ...
## ..$ FDR : num [1:5604] 1.95e-42 6.20e-37 6.32e-37 4.09e-34 4.09e-34 ...
And verify that it contains one list element:
names(soybean_ir_sub_metrics)
## [1] "N_P"
This example shows how to create a dataMetrics
object from (soybean_cn_sub
). In this case, the DESeq2 (Love, Huber, and Anders 2014) package was used. The DESeq
package expects a count table that contains integers and has gene-wise dispersion estimates larger than two orders of magnitude from the minimum value. To fit this requirement just for this didactic exercise, we multiply each value by ten and perform a ceiling()
function.
data(soybean_cn_sub) data = soybean_cn_sub rownames(data) = data[,1] data = as.matrix(ceiling(data[,-1] * 10)) coldata = data.frame(row.names = colnames(data), treatment = unlist(lapply( colnames(data), function (x) unlist(strsplit(x, "[.]"))[1]))) dds = DESeqDataSetFromMatrix(countData = data, colData = coldata, design = ~ treatment) dds <- DESeq(dds) uTreat <- unique(unlist(lapply(colnames(data), function (x) unlist(strsplit(x, "[.]"))[1]))) soybean_cn_sub_metrics <- list() for (i in 1:(length(uTreat)-1)){ for (j in (i+1):length(uTreat)){ res <- results(dds, contrast=c("treatment", uTreat[i], uTreat[j])) metrics = as.data.frame(res@listData) metrics = cbind(ID = res@rownames, metrics) metrics$ID = as.character(metrics$ID) metrics <- metrics[order(metrics$padj), ] soybean_cn_sub_metrics[[paste0(uTreat[i], "_", uTreat[j])]] <- metrics } }
By default, DESeq2
gives output with variables called “pvalue”, “padj”, and “log2FoldChange”. Various functions in bigPint
expect column names like “FDR”, “logFC”, and “PValue” respectively in the dataMetrics
object. That can be modified manually using the threshVar
input parameter each time creating a plot. But it is easier to simply rename these parameters from the start in the dataMetrics
object.
for (df in seq_len(length(soybean_ir_sub_metrics))){ whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="pvalue") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "PValue" whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="padj") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "FDR" whichPadj = which(colnames(soybean_ir_sub_metrics[[df]])=="log2FoldChange") colnames(soybean_ir_sub_metrics[[df]])[whichPadj] = "logFC" }
We can indeed examine the generated soybean_cn_sub_metrics
object as follows:
str(soybean_cn_sub_metrics, strict.width = "wrap")
## List of 3
## $ S1_S2:'data.frame': 7332 obs. of 7 variables:
## ..$ ID : chr [1:7332] "Glyma18g00690.1" "Glyma03g29150.1" "Glyma05g27450.2"
## "Glyma10g31780.1" ...
## ..$ baseMean : num [1:7332] 50.4 33 46.3 45.3 29.8 ...
## ..$ log2FoldChange: num [1:7332] 3.22 3.14 -2.89 2.83 2.99 ...
## ..$ lfcSE : num [1:7332] 0.427 0.419 0.4 0.395 0.42 ...
## ..$ stat : num [1:7332] 7.55 7.5 -7.22 7.16 7.12 ...
## ..$ pvalue : num [1:7332] 4.30e-14 6.35e-14 5.03e-13 7.92e-13 1.06e-12 ...
## ..$ padj : num [1:7332] 1.92e-10 1.92e-10 1.01e-09 1.20e-09 1.28e-09 ...
## $ S1_S3:'data.frame': 7332 obs. of 7 variables:
## ..$ ID : chr [1:7332] "Glyma08g19290.1" "Glyma08g22380.1" "Glyma16g08810.1"
## "Glyma04g37510.1" ...
## ..$ baseMean : num [1:7332] 55.4 50.8 34 39.1 45.7 ...
## ..$ log2FoldChange: num [1:7332] 3.39 3.26 3.17 -2.99 -2.9 ...
## ..$ lfcSE : num [1:7332] 0.391 0.404 0.424 0.405 0.4 ...
## ..$ stat : num [1:7332] 8.67 8.06 7.49 -7.4 -7.26 ...
## ..$ pvalue : num [1:7332] 4.29e-18 7.39e-16 7.15e-14 1.40e-13 3.90e-13 ...
## ..$ padj : num [1:7332] 2.66e-14 2.29e-12 1.48e-10 2.17e-10 4.83e-10 ...
## $ S2_S3:'data.frame': 7332 obs. of 7 variables:
## ..$ ID : chr [1:7332] "Glyma08g19290.1" "Glyma08g14670.2" "Glyma08g14670.3"
## "Glyma02g06650.1" ...
## ..$ baseMean : num [1:7332] 55.4 40.7 42.3 41.4 45.7 ...
## ..$ log2FoldChange: num [1:7332] 2.81 2.77 2.68 2.31 1.92 ...
## ..$ lfcSE : num [1:7332] 0.396 0.395 0.387 0.4 0.342 ...
## ..$ stat : num [1:7332] 7.1 7.02 6.94 5.77 5.62 ...
## ..$ pvalue : num [1:7332] 1.28e-12 2.28e-12 3.88e-12 8.01e-09 1.94e-08 ...
## ..$ padj : num [1:7332] 6.91e-09 6.91e-09 7.83e-09 1.21e-05 2.35e-05 ...
And verify that it contains three list element:
names(soybean_cn_sub_metrics)
## [1] "S1_S2" "S1_S3" "S2_S3"
Brown, Anne V., and Karen A. Hudson. 2015. “Developmental Profiling of Gene Expression in Soybean Trifoliate Leaves and Cotyledons.” BMC Plant Biology 15 (1): 169.
Lauter, AN Moran, and MA Graham. 2016. “NCBI Sra Bioproject Accession: PRJNA318409.”
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for Rna-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–e47.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.