Project: Investigation of backward compatibility of an outdated Affymetrix pipeline for the preprocessing of transcriptomics datasets pipeline

Author: Shakira Agata

This notebook describes the steps taken for the investigation and preprocessing of the dataset:E-GEOD-23493/GSEA23493 using ArrayAnalysis. ArrayAnalysis is an user-friendly quality control and normalisation pipeline developed by Lars Eijssen and Anwesha Bohler at the former Department of Bioinformatics (BiGCaT) in 20131-2. Unfortunately, three of the Bioconductor packages used in the pipeline were removed with newer Bioconductor releases.

My goal was to investigate whether the pipeline could still be used by using reverse compatibility of the R packages. To examine this, I determined the dependencies of the packages and set-up an R session in which the packages would be functional. Then, I adapted the pipeline where necessary and used it on a test dataset to confirm its functionality. The dataset I selected uses expression profiling by arrays to identify transcriptome expression changes in HK2 cells after polychlorobiphenyls (PCBs) exposure.

This notebook is subdivided into the following four sections:

Section 1: Tabulation of ArrayAnalysis package dependencies

In this section, a table was created that shows the packages and dependencies of ArrayAnalysis. This was done to identify the R version and required versions of the R packages needed to use the pipeline. With this information, an appropriate R session was set-up prior to section 2.

Table 1. Tabulation of packages of ArrayAnalysis.
Packages Existence since Bioconductor version Removed at Bioconductor version Dependencies
affy Since 19.5 years Still present R (>= 2.8.0), BiocGenerics(>= 0.1.12), Biobase(>= 2.5.5)
affycomp Since 19.5 years Still present R (>= 2.13.0), methods, Biobase(>= 2.3.3)
affyplm Since 19.5 years Still present R (>= 2.6.0), BiocGenerics(>= 0.3.2), affy(>= 1.11.0), Biobase(>= 2.17.8), gcrma, stats, preprocessCore(>= 1.5.1)
affypdnn BioC 1.6 (R-2.1) or earlier (> 11 years) Removed after 3.12 R (>= 2.13.0), affy(>= 1.5)
BioDist Since 19.5 years Still present R (>= 2.0), methods, Biobase, KernSmooth
simpleaffy BioC 1.6 (R-2.1) or earlier (> 11 years) Removed after 3.12 R (>= 2.0.0), methods, utils, grDevices, graphics, stats, BiocGenerics(>= 0.1.12), Biobase, affy(>= 1.33.6), genefilter, gcrma
affyQCReport BioC 1.7 (R-2.2) (18.5 years) Removed after 3.12 Biobase(>= 1.13.16), affy, lattice
plier Since 19.5 years Still present R (>= 2.0), methods
yaqcaffy BioC 2.2 (R-2.7) (16 years) Removed after 3.14, but functional package version is said to be 3.12
simpleaffy(>= 2.19.3), methods
gdata Not a bioconductor package Still present Not stated
gplots Not a bioconductor package Still present R (≥ 3.0)

The set-up session had the following properties:

print(sessionInfo())
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.utf8  LC_CTYPE=Dutch_Netherlands.utf8   
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Dutch_Netherlands.utf8    
## 
## time zone: Europe/Amsterdam
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          bookdown_0.42     fastmap_1.2.0    
##  [5] xfun_0.50         cachem_1.1.0      knitr_1.49        htmltools_0.5.8.1
##  [9] rmarkdown_2.29    lifecycle_1.0.4   cli_3.6.3         sass_0.4.9       
## [13] jquerylib_0.1.4   compiler_4.4.1    rstudioapi_0.17.1 tools_4.4.1      
## [17] evaluate_1.0.3    bslib_0.8.0       rmdformats_1.0.4  yaml_2.3.10      
## [21] jsonlite_1.8.9    rlang_1.1.4

Section 2:Preparation of the run_affyAnalysis QC.R template

In this section, the run_affyanalysisQC.R script was adapted so that the working environment of R would be suitable for the analysis of the test dataset.

Step 1: The data, script and working directory were corrected using the following commands. The directories for the scripts, dataset and output were joined to ensure correct registration in the global environment and a smoother Rmarkdown knitting process.

DATA.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs") 
SCRIPT.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs")
WORK.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs")

Step 2: Confirmation of the directories occurred as stated in the script.

correctDIR <- function(d) { 
  lastChar <- substr(d,nchar(d),nchar(d))
  if((lastChar != "/") && (lastChar != "/")) d <- paste(d,"/",sep="")
  return(d)
}
if(exists("DATA.DIR")) DATA.DIR <- correctDIR(DATA.DIR)
if(exists("SCRIPT.DIR")) SCRIPT.DIR <- correctDIR(SCRIPT.DIR)
if(exists("WORK.DIR")) WORK.DIR <- correctDIR(WORK.DIR)

Step 3: The input parameters were adapted to include ‘Homo sapiens’ as species and ward.D2 as the second cluster option. This was based on previous documentation of the pipeline.

arrayGroup <- ""
reOrder <- TRUE
layoutPlot <- TRUE
controlPlot <- TRUE
samplePrep <- TRUE
ratio <- TRUE
degPlot <- TRUE
hybrid <- TRUE
percPres <- TRUE
posnegDistrib <- TRUE
bgPlot <- TRUE
scaleFact <- TRUE
boxplotRaw <- TRUE
boxplotNorm <- TRUE
densityRaw <- TRUE
densityNorm <- TRUE
MARaw <- TRUE
MANorm <- TRUE
MAOption1 <- "dataset"
spatialImage <- TRUE
PLMimage <- FALSE
posnegCOI <- TRUE
Nuse <- TRUE
Rle <- TRUE
correlRaw <- TRUE
correlNorm <- TRUE
clusterRaw <- TRUE
clusterNorm <- TRUE
clusterOption1 <- "Spearman" 
clusterOption2 <- "ward.D2" 
PCARaw <- TRUE  
PCANorm <- TRUE     
PMAcalls <- FALSE
normMeth <- "RMA" 
normOption1 <- "dataset" 
customCDF <- TRUE
CDFtype <- "ENSG" 
species <- "Homo sapiens" 

Section 3: Execution of the run_affyAnalysis script

In this section, the run_affyAnalysisQC.R script was sourced to automate the process of running the pipeline. The run_affyAnalysisQC.R script is dependent on the functions_imagesQC.R script and functions_processingQC.R script which are sourced when sourcing the run_affyAnalysisQC.R script.

Step 4: The run_affyAnalysis script was sourced by running the following command:

source(paste("run_affyAnalysisQC.R"),local=TRUE)
## Script run using R version 4.4.1 and affyAnalysisQC version_1.0.0
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## registering new summary method 'pdnn'.
## registering new pmcorrect method 'pdnn' and 'pdnnpredict'.
## Warning: package 'KernSmooth' was built under R version 4.4.2
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gdata'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
## Registered S3 method overwritten by 'gplots':
##   method         from 
##   reorder.factor gdata
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:gdata':
## 
##     reorder.factor
## The following object is masked from 'package:stats':
## 
##     lowess
## [1] "Libraries have been loaded"
## [1] "Functions have been loaded"
## [1] "Raw data have been loaded in R"
## 
## current cdf environment loaded: HG-U133_Plus_2 
## The arrays are determined to contain perfect match and mismatch probes
## [1] "indicators for sample prep controls"
## [1] "indicators for beta-actin & GAPDH 3'/5' ratio"
## [1] "indicators for spike-in hybridization controls"
## [1] "indicators for percent present"
## [1] "indicators for background intensities"
## [1] "indicators for scale factors"
## [1] "Graphs ready to be computed"
## [1] "   plot sample prep controls"
## [1] "   plot beta-actin & GAPDH 3'/5' ratio"
## [1] "   plot degradation plot"
## [1] "   plot spike-in hybridization controls"
## [1] "   plot background intensities"
## [1] "   plot percent present"
## [1] "   plot pos & neg control distribution"
## [1] "   plot control profiles and/or boxplots"
## [1] "   plot scale factors"
## [1] "   plot boxplot for raw intensities"
## [1] "   plot density histogram for raw intensities"
## [1] "   MA-plots for raw intensities"
## [1] "   plot array reference layout"
## [1] "   Pos/Neg COI"
## [1] "   Fit a probe level model (PLM) on the raw data"
## [1] "   2D virtual images"
## [1] "   NUSE boxplot"
## [1] "   RLE boxplot"
## [1] "   Correlation plot of raw data"
## [1] "   PCA analysis of raw data"
## [1] "   Hierarchical clustering of raw data"
## [1] "Change CDF before pre-processing"
## [1] "hgu133plus2_Hs_ENSG"
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:gplots':
## 
##     space
## The following objects are masked from 'package:gdata':
## 
##     first, first<-
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:gdata':
## 
##     trim
## The following object is masked from 'package:simpleaffy':
## 
##     members
## The following object is masked from 'package:grDevices':
## 
##     windows
## 
## [1] "hgu133plus2_Hs_ENSG"
## current cdf environment loaded: hgu133plus2_Hs_ENSG 
## [1] "Pre-processing is running"
## Background correcting
## Normalizing
## Calculating Expression
## [1] "   plot boxplot for normalized intensities"
## [1] "   plot density histogram for normalized intensities"
## [1] "   MA-plots for normalized intensities"
## [1] "   Correlation plot of normalized data"
## [1] "   PCA graph for normalized data"
## [1] "   Hierarchical clustering of normalized data"
## [1] "Saving normalized data table"
## Error in checkDataset(dataset = dataset, mart = mart) : 
##   The given dataset:  hsapiens_gene_ensembl , is not valid.  Correct dataset names can be obtained with the listDatasets() function.
## Warning in createNormDataTable(normData, customCDF =
## (sum(featureNames(normData) != : No gene names and annotation could be
## retrieved from BioMart for this species or no connection could be established,
## gene information not added to normalized data table
## [1] "Normalized data table: RMANormData_.txt"

Section 4: Visualization of the output

In this section, the output for quality control and normalization was visualized.

Step 5: The output files were inserted in this RMarkdown notebook to verify the functionality of the pipeline and display final results.

Cover_1
Cover_1
Cover_2
Cover_2
Description
Description
Normalized data array correlation
Normalized data array correlation
Normalized data boxplot
Normalized data boxplot
Normalized data cluster (Spearman)
Normalized data cluster (Spearman)
Normalized data MA plot
Normalized data MA plot
Normalized data MA plot
Normalized data MA plot
Normalized data PCA analysis
Normalized data PCA analysis
Normalized density histogram
Normalized density histogram
Summary QC table
Summary QC table
Raw data 2D virtual image
Raw data 2D virtual image
Raw data 2D virtual image
Raw data 2D virtual image
Raw data 5’/3’ ratio- beta-actin
Raw data 5’/3’ ratio- beta-actin
Raw data 5’/3’ ratio- GAPDH
Raw data 5’/3’ ratio- GAPDH
Raw data AFFX1 controls boxplot
Raw data AFFX1 controls boxplot
Raw data AFFX controls profiles
Raw data AFFX controls profiles
Raw data correlation plot
Raw data correlation plot
Raw data data background intensity
Raw data data background intensity
Raw data boxplot
Raw data boxplot
Raw data cluster-Spearman
Raw data cluster-Spearman
Raw data MA plot
Raw data MA plot
Raw data MA plot
Raw data MA plot
Raw data NUSE
Raw data NUSE
Raw data PCA analysis
Raw data PCA analysis
Raw data percent present
Raw data percent present
Raw data positive negative distribution
Raw data positive negative distribution
Raw data positive negative positions
Raw data positive negative positions
Raw data reference array layout
Raw data reference array layout
Raw data RLE
Raw data RLE
Raw data RNA degradation
Raw data RNA degradation
Raw data sample preparation control
Raw data sample preparation control
Raw data scale factors
Raw data scale factors
Raw data spike in hybrid controls
Raw data spike in hybrid controls
Raw density histogram
Raw density histogram

References:

  1. Eijssen, L. M., Goelela, V. S., Kelder, T., Adriaens, M. E., Evelo, C. T., & Radonjic, M. (2015). A user-friendly workflow for analysis of Illumina gene expression bead array data available at the arrayanalysis.org portal. BMC genomics, 16, 1-5.
  2. Eijssen, L. M., Jaillard, M., Adriaens, M. E., Gaj, S., de Groot, P. J., Müller, M., & Evelo, C. T. (2013). User-friendly solutions for microarray quality control and pre-processing on ArrayAnalysis.org. Nucleic acids research , 41(W1), W71-W76.