SAERMA: Stacked Autoencoders Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS of Extreme Obesity

Curbelo Montanez, C (2019) SAERMA: Stacked Autoencoders Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS of Extreme Obesity. Doctoral thesis, Liverpool John Moores University.

2019Curbelophd.pdf - Published Version

Abstract

One of the most important challenges in the analysis of high-throughput genetic data is the development of efficient computational and statistical methods to identify statistically significant single nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) are the state of the art in identifying genetic variants for complex disorders such as obesity. However, GWAS use single-locus analysis, in which each SNP is tested independently for association with a phenotype. A limitation of the genetic variants identified by GWAS is that they cannot explain the underlying genetic variation in complex diseases. Consequently, alternative approaches are required that are capable of modelling the intricate relationships between SNPs and phenotypes. The approach presented in this thesis extends GWAS and explores the use of deep learning stacked autoencoders (SAE) and association rule mining (ARM) to identify epistatic interactions between SNPs. This is achieved using a case-control dataset containing 2,193 observations (962 cases and 1,231 controls), each with 594,034 genetic markers. A statistical filtering strategy is adopted to reduce the large number of SNPs to a more manageable set suitable for machine learning tasks. Several experiments have been conducted to explore epistasis among the filtered subset (2,465 SNPs), and their results are compared with those obtained using the industry-standard logistic regression via a Generalised Linear Model (GLM). These experiments include a multi-layer feedforward artificial neural network (MLP), an SAE, and a combined approach using SAE and ARM. Functional enrichment analysis is adopted to biologically validate the association rules mined by the proposed method. Baseline classification results are first obtained using standard logistic regression (GLM), with input SNPs selected at several P-value thresholds (1×10⁻⁵, 1×10⁻⁴, 1×10⁻³ and 1×10⁻²). The second experiment is carried out using an MLP trained on the same input features.
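The P-value filtering step described above can be illustrated with a minimal sketch. It assumes the per-SNP association P-values have already been computed; the SNP identifiers and P-values below are hypothetical, and the strict "less than" comparison is an assumption, since the abstract does not say whether the thresholds are inclusive.

```python
# Hypothetical SNP identifiers and association P-values (illustrative only,
# not taken from the thesis dataset).
p_values = {
    "snp_a": 3.2e-6,
    "snp_b": 4.5e-5,
    "snp_c": 7.0e-4,
    "snp_d": 8.1e-3,
    "snp_e": 0.21,
}

# The four thresholds named in the abstract.
THRESHOLDS = (1e-5, 1e-4, 1e-3, 1e-2)


def filter_snps(p_values, threshold):
    """Return SNP ids with P-value strictly below `threshold`, most significant first."""
    kept = [snp for snp, p in p_values.items() if p < threshold]
    return sorted(kept, key=p_values.get)


# One candidate feature set per threshold; looser thresholds keep more SNPs.
subsets = {t: filter_snps(p_values, t) for t in THRESHOLDS}

for t in THRESHOLDS:
    print(f"P < {t:g}: {len(subsets[t])} SNPs -> {subsets[t]}")
```

Each resulting subset would then serve as the input feature set for one baseline classifier, so that the subsets are nested and differ only in how many SNPs they retain.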
In subsequent experiments, epistasis is investigated using an SAE to extract nonlinear SNP-SNP interactions and to pre-train a fully connected MLP layer. Features are extracted using a stack of four single-layer autoencoders (AEs) containing 2,000, 1,000, 500 and 50 hidden units respectively. The initial results show that it is possible to achieve an area under the curve (AUC) of 85%, with sensitivity (SE) of 78% and specificity (SP) of 80%, using 50 hidden neurons. The findings are encouraging; however, it is not possible to identify which information from the 2,465 SNPs is retained in the final AE layer (50 nodes) used to initialise and train the MLP. Consequently, ARM is introduced to extend the SAE approach and provide interpretability regarding which SNPs most closely influence the phenotype and the interactions that exist between them. The interestingness measures support, confidence, lift and the chi-square test (χ²) are used to rank and determine correlated rules under a support-dependence framework. The SNPs from the top rules (top 300, 200, 100 and 50 rules) are used with an SAE and a fully connected MLP to measure their capacity to discriminate between case and control observations. Graph-based visualisation methods show the interactions between SNPs identified by the top rules, while standard performance metrics are used to assess the classifiers. The SNPs contained in the set of top rules are used as input to different SAE configurations, which compress the features (retaining only the salient information) through progressively smaller hidden layers. The final hidden layer is then used to initialise the learnable parameters of a fully connected MLP before it is fine-tuned for classification tasks. The results show that it is possible to achieve AUC = 77%, SE = 77% and SP = 68%. More importantly, in parallel, it is possible to explore which of the 2,465 SNPs and their epistatic interactions are most strongly associated with obesity.
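The four interestingness measures named above have standard textbook definitions, which the following minimal sketch computes for a single rule. It is not the thesis implementation: the transaction encoding (one set of items per individual), the item names such as `snp1=AA` and the toy data are all hypothetical, and the chi-square statistic is the plain Pearson statistic on the 2×2 table of antecedent versus consequent, without continuity correction.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, lift and chi-square for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)              # antecedent present
    n_c = sum(1 for t in transactions if c <= t)              # consequent present
    n_ac = sum(1 for t in transactions if a <= t and c <= t)  # both present
    if n == 0 or n_a in (0, n) or n_c in (0, n):
        raise ValueError("measures are undefined when a margin of the 2x2 table is zero")

    support = n_ac / n              # P(A and C)
    confidence = n_ac / n_a         # P(C | A)
    lift = confidence / (n_c / n)   # P(C | A) / P(C); 1.0 means independence

    # Pearson chi-square on the 2x2 contingency table (A / not A) x (C / not C).
    observed = [[n_ac, n_a - n_ac],
                [n_c - n_ac, n - n_a - n_c + n_ac]]
    rows = [n_a, n - n_a]
    cols = [n_c, n - n_c]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    return {"support": support, "confidence": confidence, "lift": lift, "chi2": chi2}


# Hypothetical toy data: each individual is a set of genotype items plus a phenotype item.
transactions = (
    [{"snp1=AA", "case"}] * 4
    + [{"snp1=AA", "control"}]
    + [{"snp1=AG", "case"}]
    + [{"snp1=AG", "control"}] * 4
)

measures = rule_measures(transactions, {"snp1=AA"}, {"case"})
print({name: round(value, 2) for name, value in measures.items()})
```

On this toy data the rule `snp1=AA -> case` has support 0.4, confidence 0.8, lift 1.6 and a chi-square statistic of about 3.6. Ranking many such rules by these values, and keeping only the top ones, is the kind of selection the paragraph above describes.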
This provides a significant novel contribution to the field of computational biology and is the first study of its kind to combine deep learning epistatic analysis using SAEs with ARM to classify case-control observations and provide an interpretation of the final trained classification model. The level of accuracy required is fully tuneable, i.e. the classification results obtained with the selected SNPs can be increased or decreased by increasing or decreasing the rule mining support and confidence parameters defined in the rule generation stage. Additional experiments were conducted as a proof of concept to support the use of a statistical filtering approach to reduce the dimensionality of the data before investigating epistasis. Gene set enrichment analysis was performed using the i-GSEA4GWAS web tool. The enriched gene sets were then used as input features for classification experiments using an MLP, and their performance is reported. Although this approach is based on biological knowledge, that is, genetic variants are filtered according to biological pathways, its classification results did not outperform those achieved by SAERMA. This justifies the use of statistical filtering within the proposed algorithm. I therefore claim that the approach posited in this thesis is foundational in character and is the first study of its kind to combine GWAS quality control and logistic regression with association rule mining and deep learning stacked autoencoders to study epistatic interactions between SNPs in polygenic obesity GWAS.

Item Type:	Thesis (Doctoral)
Uncontrolled Keywords:	Machine Learning; Bioinformatics; Classification; Deep Learning; Obesity; Epistasis; Quality Control; GWAS
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	Computer Science and Mathematics
Date of first compliant Open Access:	30 January 2022
Date Deposited:	15 May 2019 08:27
Last Modified:	21 Nov 2022 12:23
DOI or ID number:	10.24377/LJMU.t.00010670
Supervisors:	Fergus, P and Chalmers, C
URI:	https://researchonline.ljmu.ac.uk/id/eprint/10670
