All functions in the bigPint package require an input parameter called data, which should be a data frame that contains the full dataset of interest. If a researcher is using the package to visualize RNA-seq data, then this data object should be a count table that contains the read counts for all genes of interest.
The data object requires the same data frame format for all bigPint functions. There should be \(n\) rows in the data frame, where \(n\) is the number of genes. There should be \(p + 1\) columns in the data frame, where \(p\) is the number of samples. The first column contains the gene names and the remaining columns contain the read counts for all samples of interest. An example of this format is shown below:
## ID N.1 N.2 N.3 P.1 P.2 P.3
## 14881 Glyma.06G158700.Wm82.a2.v1 48 28 15 29 16 8
## 20855 Glyma.08G156000.Wm82.a2.v1 0 0 0 0 0 0
## 32104 Glyma.12G070600.Wm82.a2.v1 8 11 11 5 8 7
## 50897 Glyma.19G045200.Wm82.a2.v1 200 192 187 186 193 183
## 11303 Glyma.05G050000.Wm82.a2.v1 0 0 0 0 0 0
## 50345 Glyma.18G292300.Wm82.a2.v1 583 669 497 419 467 426
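Count data often starts out as a matrix with gene names stored as row names. The following is a minimal sketch, using base R only, of how such a matrix could be reshaped into the layout shown above. The object names counts and toy_data are hypothetical, and the values are copied from the first three rows of the example output.

```r
# Hypothetical count matrix: rows are genes, columns are samples.
counts <- matrix(
  c(48L, 28L, 15L, 29L, 16L, 8L,
    0L, 0L, 0L, 0L, 0L, 0L,
    8L, 11L, 11L, 5L, 8L, 7L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Glyma.06G158700.Wm82.a2.v1",
      "Glyma.08G156000.Wm82.a2.v1",
      "Glyma.12G070600.Wm82.a2.v1"),
    c("N.1", "N.2", "N.3", "P.1", "P.2", "P.3")
  )
)

# Move the gene names from the row names into a first column called ID.
toy_data <- data.frame(
  ID = rownames(counts),
  counts,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
rownames(toy_data) <- NULL

# n = 3 genes (rows) and p + 1 = 7 columns (ID plus six samples).
dim(toy_data)
```

The result has one character column of gene names followed by one integer column per sample, which is the \(n\) by \(p + 1\) shape described above.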
We can also examine the structure of an example data object as follows:
str(soybean_ir_sub, strict.width = "wrap")
## 'data.frame': 5604 obs. of 7 variables:
## $ ID : chr "Glyma.06G158700.Wm82.a2.v1" "Glyma.08G156000.Wm82.a2.v1"
## "Glyma.12G070600.Wm82.a2.v1" "Glyma.19G045200.Wm82.a2.v1" ...
## $ N.1: int 48 0 8 200 0 583 22 52 1 73 ...
## $ N.2: int 28 0 11 192 0 669 34 42 0 120 ...
## $ N.3: int 15 0 11 187 0 497 19 44 3 59 ...
## $ P.1: int 29 0 5 186 0 419 11 46 0 98 ...
## $ P.2: int 16 0 8 193 0 467 21 54 0 106 ...
## $ P.3: int 8 0 7 183 0 426 11 42 2 86 ...
This example dataset contains 5,604 genes and six samples (Lauter and Graham 2016). There are two treatment groups, N and P. Each treatment group contains three replicates.
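The treatment groups and replicate numbers can be read directly off the sample column names. As a small sketch (base R only, with the sample names typed in by hand rather than taken from the package dataset), the group is the part before the final period and the replicate is the number after it:

```r
# Sample column names as they appear in the example above.
sample_names <- c("N.1", "N.2", "N.3", "P.1", "P.2", "P.3")

# Treatment group: everything before the trailing ".<digits>".
groups <- sub("\\.[0-9]+$", "", sample_names)

# Replicate number: the digits after the last period.
replicates <- as.integer(sub("^.*\\.", "", sample_names))

# Number of replicates per treatment group.
table(groups)
```

With the dataset loaded, the same two calls could be applied to colnames(soybean_ir_sub)[-1] instead of the hand-typed vector.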
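As demonstrated above, the data object must meet the following conditions:
It must be of class data.frame.
Its first column must be of class character and contain the gene names (this column is named ID in the examples here).
The remaining columns must be of class integer or numeric and contain the count data.
The names of the remaining columns must match the regular expression ^[a-zA-Z0-9]+\\.[0-9]+, where the leading alphanumeric characters give the treatment group name, the period acts as a separator, and the trailing digits give the replicate number.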
It is important that the names of all columns except the first follow the three-part format delineated above (for example, N.1). All functions in the bigPint package require this format to successfully produce plots. If your data object does not fit this format, bigPint will likely throw an informative error about why your format was not recognized.
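It can be convenient to check a data frame against these conditions before calling any plotting function. The helper below, check_data_format(), is a hypothetical sketch written for this illustration; it is not part of bigPint and does not reproduce the package's own checks or error messages. It uses the column-name pattern exactly as given above.

```r
# Hypothetical helper (not a bigPint function): returns a character vector
# of problems found, or character(0) if the object looks well formed.
check_data_format <- function(data) {
  if (!is.data.frame(data)) {
    return("object is not a data.frame")
  }
  if (ncol(data) < 2) {
    return("expected a gene-name column plus at least one sample column")
  }

  problems <- character(0)

  # First column: gene names stored as character.
  if (!is.character(data[[1]])) {
    problems <- c(problems, "first column is not of class character")
  }

  # Remaining columns: integer or numeric counts.
  sample_cols <- data[-1]
  is_num <- vapply(sample_cols, is.numeric, logical(1))
  if (!all(is_num)) {
    problems <- c(problems, paste(
      "non-numeric sample columns:",
      paste(names(sample_cols)[!is_num], collapse = ", ")
    ))
  }

  # Remaining column names: <group>.<replicate>.
  ok_name <- grepl("^[a-zA-Z0-9]+\\.[0-9]+", names(sample_cols), perl = TRUE)
  if (!all(ok_name)) {
    problems <- c(problems, paste(
      "column names not matching the pattern:",
      paste(names(sample_cols)[!ok_name], collapse = ", ")
    ))
  }

  problems
}

good <- data.frame(ID = c("g1", "g2"), N.1 = c(48L, 0L), P.1 = c(29L, 0L),
                   stringsAsFactors = FALSE)
bad <- data.frame(ID = c("g1", "g2"), control_1 = c(48L, 0L),
                  stringsAsFactors = FALSE)

check_data_format(good)
check_data_format(bad)
```

Collecting every problem rather than stopping at the first makes it easier to repair a dataset in a single pass.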
Note that the data object can contain more than two treatment groups. In this case, the bigPint software will automatically create plots for all pairs of treatment groups. An example of this type of dataset is provided in the bigPint package and can be accessed as follows:
data(soybean_cn_sub)
This example dataset contains 7,332 genes and nine samples (Brown and Hudson 2015). There are three treatment groups, S1, S2, and S3. Each treatment group contains three replicates. In such cases where the data object contains more than two treatment groups, all functions in the bigPint package (except plotSMApp()) will automatically produce a plot for each pairwise combination of treatment groups.
For example, bigPint functions will produce plots for S1 versus S2, S1 versus S3, and S2 versus S3 in this case. The same could be accomplished (although less efficiently) by separating the dataset into three separate datasets and running a bigPint function of interest on each of them individually.
library(dplyr)
soybean_cn_sub_S1S2 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S2"))
soybean_cn_sub_S1S3 <- soybean_cn_sub %>% select("ID", contains("S1"), contains("S3"))
soybean_cn_sub_S2S3 <- soybean_cn_sub %>% select("ID", contains("S2"), contains("S3"))
head(soybean_cn_sub_S1S2, 3)
## ID S1.1 S1.2 S1.3 S2.1 S2.2 S2.3
## 19468 Glyma06g12670.1 0.8024444 2.708884 1.763407 7.716099 6.581990 7.003538
## 27284 Glyma08g12390.2 4.7687202 5.235777 5.166631 3.823472 3.566863 3.295619
## 42001 Glyma12g02076.11 3.1899340 2.902131 2.906502 3.100206 3.284326 3.295619
head(soybean_cn_sub_S1S3, 3)
## ID S1.1 S1.2 S1.3 S3.1 S3.2 S3.3
## 19468 Glyma06g12670.1 0.8024444 2.708884 1.763407 8.556732 8.367593 8.389347
## 27284 Glyma08g12390.2 4.7687202 5.235777 5.166631 3.669489 4.031427 4.269312
## 42001 Glyma12g02076.11 3.1899340 2.902131 2.906502 3.364437 2.731105 3.255649
head(soybean_cn_sub_S2S3, 3)
## ID S2.1 S2.2 S2.3 S3.1 S3.2 S3.3
## 19468 Glyma06g12670.1 7.716099 6.581990 7.003538 8.556732 8.367593 8.389347
## 27284 Glyma08g12390.2 3.823472 3.566863 3.295619 3.669489 4.031427 4.269312
## 42001 Glyma12g02076.11 3.100206 3.284326 3.295619 3.364437 2.731105 3.255649
Some popular RNA-seq analysis packages (such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015)) advise researchers to perform certain preprocessing steps on their data before visualization, such as filtering the genes, normalizing the read counts, and standardizing the read counts. Researchers can set the data object in the bigPint package to a dataset whether or not it has been filtered, normalized, and standardized. If they wish, they can use bigPint plots to investigate how their dataset changes after filtering, normalization, and standardization.
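As a rough illustration of what such preprocessing might look like while keeping the required format, the sketch below filters, log-transforms, and standardizes a made-up count table using base R only. The filter threshold and the log2(count + 1) transformation are arbitrary choices for this example; they are not the procedures recommended by edgeR, DESeq2, or limma, and the names raw and processed are hypothetical.

```r
# Made-up raw counts in the required format.
raw <- data.frame(
  ID = c("g1", "g2", "g3"),
  N.1 = c(48L, 0L, 200L), N.2 = c(28L, 0L, 192L),
  P.1 = c(29L, 0L, 186L), P.2 = c(16L, 0L, 193L),
  stringsAsFactors = FALSE
)

counts <- as.matrix(raw[-1])

# Filter: drop genes with no reads in any sample (illustrative threshold).
keep <- rowSums(counts) > 0

# Transform: log2(count + 1) as a simple stand-in for normalization.
logged <- log2(counts[keep, , drop = FALSE] + 1)

# Standardize each gene to mean 0 and standard deviation 1 across samples.
# scale() works on columns, so transpose before and after.
standardized <- t(scale(t(logged)))

# Reassemble into the ID-plus-samples layout.
processed <- data.frame(ID = raw$ID[keep], standardized,
                        stringsAsFactors = FALSE)
rownames(processed) <- NULL

names(processed)
```

Because raw and processed share the same column layout, either one could be passed as the data object, which makes before-and-after comparisons straightforward. Filtering before standardizing also avoids dividing by zero for genes with constant counts.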
Brown, Anne V., and Karen A. Hudson. 2015. “Developmental Profiling of Gene Expression in Soybean Trifoliate Leaves and Cotyledons.” BMC Plant Biology 15 (1): 169.
Lauter, AN Moran, and MA Graham. 2016. “NCBI SRA BioProject Accession: PRJNA318409.”
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.