Project: Investigation of backward compatibility of an outdated Affymetrix pipeline for the preprocessing of transcriptomics datasets pipeline

Author: Shakira Agata

This notebook describes the steps taken for the investigation and preprocessing of the dataset:E-GEOD-23493/GSEA23493 using ArrayAnalysis. ArrayAnalysis is an user-friendly quality control and normalisation pipeline developed by Lars Eijssen and Anwesha Bohler at the former Department of Bioinformatics (BiGCaT) in 2013^1-2. Unfortunately, three of the Bioconductor packages used in the pipeline were removed with newer Bioconductor releases.

My goal was to investigate whether the pipeline could still be used by using reverse compatibility of the R packages. To examine this, I determined the dependencies of the packages and set-up an R session in which the packages would be functional. Then, I adapted the pipeline where necessary and used it on a test dataset to confirm its functionality. The dataset I selected uses expression profiling by arrays to identify transcriptome expression changes in HK2 cells after polychlorobiphenyls (PCBs) exposure.

This notebook is subdivided into the following four sections:

Section 1: Tabulation of ArrayAnalysis package dependencies
Section 2: Preparation of the run_affyAnalysis QC.R template
Section 3: Execution of the run_affyAnalysis script
Section 4: Visualization of the output

Section 1: Tabulation of ArrayAnalysis package dependencies

In this section, a table was created that shows the packages and dependencies of ArrayAnalysis. This was done to identify the R version and required versions of the R packages needed to use the pipeline. With this information, an appropriate R session was set-up prior to section 2.

Table 1. Tabulation of packages of ArrayAnalysis.
Packages	Existence since Bioconductor version	Removed at Bioconductor version	Dependencies
affy	Since 19.5 years	Still present	R (>= 2.8.0), BiocGenerics(>= 0.1.12), Biobase(>= 2.5.5)
affycomp	Since 19.5 years	Still present	R (>= 2.13.0), methods, Biobase(>= 2.3.3)
affyplm	Since 19.5 years	Still present	R (>= 2.6.0), BiocGenerics(>= 0.3.2), affy(>= 1.11.0), Biobase(>= 2.17.8), gcrma, stats, preprocessCore(>= 1.5.1)
affypdnn	BioC 1.6 (R-2.1) or earlier (> 11 years)	Removed after 3.12	R (>= 2.13.0), affy(>= 1.5)
BioDist	Since 19.5 years	Still present	R (>= 2.0), methods, Biobase, KernSmooth
simpleaffy	BioC 1.6 (R-2.1) or earlier (> 11 years)	Removed after 3.12	R (>= 2.0.0), methods, utils, grDevices, graphics, stats, BiocGenerics(>= 0.1.12), Biobase, affy(>= 1.33.6), genefilter, gcrma
affyQCReport	BioC 1.7 (R-2.2) (18.5 years)	Removed after 3.12	Biobase(>= 1.13.16), affy, lattice
plier	Since 19.5 years	Still present	R (>= 2.0), methods
yaqcaffy	BioC 2.2 (R-2.7) (16 years)	Removed after 3.14, but functional package version is said to be 3.12	`simpleaffy(>= 2.19.3), methods`
gdata	Not a bioconductor package	Still present	Not stated
gplots	Not a bioconductor package	Still present	R (≥ 3.0)

The set-up session had the following properties:

print(sessionInfo())

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Dutch_Netherlands.utf8  LC_CTYPE=Dutch_Netherlands.utf8   
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Dutch_Netherlands.utf8    
## 
## time zone: Europe/Amsterdam
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          bookdown_0.42     fastmap_1.2.0    
##  [5] xfun_0.50         cachem_1.1.0      knitr_1.49        htmltools_0.5.8.1
##  [9] rmarkdown_2.29    lifecycle_1.0.4   cli_3.6.3         sass_0.4.9       
## [13] jquerylib_0.1.4   compiler_4.4.1    rstudioapi_0.17.1 tools_4.4.1      
## [17] evaluate_1.0.3    bslib_0.8.0       rmdformats_1.0.4  yaml_2.3.10      
## [21] jsonlite_1.8.9    rlang_1.1.4

Section 2:Preparation of the run_affyAnalysis QC.R template

In this section, the run_affyanalysisQC.R script was adapted so that the working environment of R would be suitable for the analysis of the test dataset.

Step 1: The data, script and working directory were corrected using the following commands. The directories for the scripts, dataset and output were joined to ensure correct registration in the global environment and a smoother Rmarkdown knitting process.

DATA.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs")

SCRIPT.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs")

WORK.DIR <- setwd("C:/Users/shaki/Personal-Project-Investigating-backward-compatibility-of-an-outdated-Affymetrix-pipeline/docs")

Step 2: Confirmation of the directories occurred as stated in the script.

correctDIR <- function(d) { 
  lastChar <- substr(d,nchar(d),nchar(d))
  if((lastChar != "/") && (lastChar != "/")) d <- paste(d,"/",sep="")
  return(d)
}
if(exists("DATA.DIR")) DATA.DIR <- correctDIR(DATA.DIR)
if(exists("SCRIPT.DIR")) SCRIPT.DIR <- correctDIR(SCRIPT.DIR)
if(exists("WORK.DIR")) WORK.DIR <- correctDIR(WORK.DIR)

Step 3: The input parameters were adapted to include ‘Homo sapiens’ as species and ward.D2 as the second cluster option. This was based on previous documentation of the pipeline.

arrayGroup <- ""
reOrder <- TRUE
layoutPlot <- TRUE
controlPlot <- TRUE
samplePrep <- TRUE
ratio <- TRUE
degPlot <- TRUE
hybrid <- TRUE
percPres <- TRUE
posnegDistrib <- TRUE
bgPlot <- TRUE
scaleFact <- TRUE
boxplotRaw <- TRUE
boxplotNorm <- TRUE
densityRaw <- TRUE
densityNorm <- TRUE
MARaw <- TRUE
MANorm <- TRUE
MAOption1 <- "dataset"
spatialImage <- TRUE
PLMimage <- FALSE
posnegCOI <- TRUE
Nuse <- TRUE
Rle <- TRUE
correlRaw <- TRUE
correlNorm <- TRUE
clusterRaw <- TRUE
clusterNorm <- TRUE
clusterOption1 <- "Spearman" 
clusterOption2 <- "ward.D2" 
PCARaw <- TRUE  
PCANorm <- TRUE     
PMAcalls <- FALSE
normMeth <- "RMA" 
normOption1 <- "dataset" 
customCDF <- TRUE
CDFtype <- "ENSG" 
species <- "Homo sapiens"

Section 3: Execution of the run_affyAnalysis script

In this section, the run_affyAnalysisQC.R script was sourced to automate the process of running the pipeline. The run_affyAnalysisQC.R script is dependent on the functions_imagesQC.R script and functions_processingQC.R script which are sourced when sourcing the run_affyAnalysisQC.R script.

Step 4: The run_affyAnalysis script was sourced by running the following command:

source(paste("run_affyAnalysisQC.R"),local=TRUE)

## Script run using R version 4.4.1 and affyAnalysisQC version_1.0.0

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## registering new summary method 'pdnn'.
## registering new pmcorrect method 'pdnn' and 'pdnnpredict'.

## Warning: package 'KernSmooth' was built under R version 4.4.2

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

## Registered S3 method overwritten by 'gplots':
##   method         from 
##   reorder.factor gdata

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:gdata':
## 
##     reorder.factor

## The following object is masked from 'package:stats':
## 
##     lowess

## [1] "Libraries have been loaded"
## [1] "Functions have been loaded"
## [1] "Raw data have been loaded in R"

##

## current cdf environment loaded: HG-U133_Plus_2 
## The arrays are determined to contain perfect match and mismatch probes

## [1] "indicators for sample prep controls"
## [1] "indicators for beta-actin & GAPDH 3'/5' ratio"
## [1] "indicators for spike-in hybridization controls"
## [1] "indicators for percent present"
## [1] "indicators for background intensities"
## [1] "indicators for scale factors"

## [1] "Graphs ready to be computed"
## [1] "   plot sample prep controls"

## [1] "   plot beta-actin & GAPDH 3'/5' ratio"

## [1] "   plot degradation plot"

## [1] "   plot spike-in hybridization controls"

## [1] "   plot background intensities"

## [1] "   plot percent present"

## [1] "   plot pos & neg control distribution"

## [1] "   plot control profiles and/or boxplots"

## [1] "   plot scale factors"

## [1] "   plot boxplot for raw intensities"

## [1] "   plot density histogram for raw intensities"

## [1] "   MA-plots for raw intensities"

## [1] "   plot array reference layout"

## [1] "   Pos/Neg COI"

## [1] "   Fit a probe level model (PLM) on the raw data"
## [1] "   2D virtual images"

## [1] "   NUSE boxplot"

## [1] "   RLE boxplot"

## [1] "   Correlation plot of raw data"

## [1] "   PCA analysis of raw data"

## [1] "   Hierarchical clustering of raw data"

## [1] "Change CDF before pre-processing"
## [1] "hgu133plus2_Hs_ENSG"

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:gplots':
## 
##     space

## The following objects are masked from 'package:gdata':
## 
##     first, first<-

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:gdata':
## 
##     trim

## The following object is masked from 'package:simpleaffy':
## 
##     members

## The following object is masked from 'package:grDevices':
## 
##     windows

##

## [1] "hgu133plus2_Hs_ENSG"
## current cdf environment loaded: hgu133plus2_Hs_ENSG 
## [1] "Pre-processing is running"
## Background correcting
## Normalizing
## Calculating Expression

## [1] "   plot boxplot for normalized intensities"

## [1] "   plot density histogram for normalized intensities"

## [1] "   MA-plots for normalized intensities"

## [1] "   Correlation plot of normalized data"

## [1] "   PCA graph for normalized data"

## [1] "   Hierarchical clustering of normalized data"

## [1] "Saving normalized data table"
## Error in checkDataset(dataset = dataset, mart = mart) : 
##   The given dataset:  hsapiens_gene_ensembl , is not valid.  Correct dataset names can be obtained with the listDatasets() function.

## Warning in createNormDataTable(normData, customCDF =
## (sum(featureNames(normData) != : No gene names and annotation could be
## retrieved from BioMart for this species or no connection could be established,
## gene information not added to normalized data table

## [1] "Normalized data table: RMANormData_.txt"

Section 4: Visualization of the output

In this section, the output for quality control and normalization was visualized.

Step 5: The output files were inserted in this RMarkdown notebook to verify the functionality of the pipeline and display final results.

Cover_1

Cover_2

Description

Normalized data array correlation

Normalized data boxplot

Normalized data cluster (Spearman)

Normalized data MA plot

Normalized data PCA analysis

Normalized density histogram

Summary QC table

Raw data 2D virtual image

Raw data 5’/3’ ratio- beta-actin

Raw data 5’/3’ ratio- GAPDH

Raw data AFFX1 controls boxplot

Raw data AFFX controls profiles

Raw data correlation plot

Raw data data background intensity

Raw data boxplot

Raw data cluster-Spearman

Raw data MA plot

Raw data NUSE

Raw data PCA analysis

Raw data percent present

Raw data positive negative distribution

Raw data positive negative positions

Raw data reference array layout

Raw data RLE

Raw data RNA degradation

Raw data sample preparation control

Raw data scale factors

Raw data spike in hybrid controls

Raw density histogram

References:

Eijssen, L. M., Goelela, V. S., Kelder, T., Adriaens, M. E., Evelo, C. T., & Radonjic, M. (2015). A user-friendly workflow for analysis of Illumina gene expression bead array data available at the arrayanalysis.org portal. BMC genomics, 16, 1-5.
Eijssen, L. M., Jaillard, M., Adriaens, M. E., Gaj, S., de Groot, P. J., Müller, M., & Evelo, C. T. (2013). User-friendly solutions for microarray quality control and pre-processing on ArrayAnalysis.org. Nucleic acids research , 41(W1), W71-W76.