Analysis of Dimensionality Reduction Techniques on Big Data

Reddy G, T; Kumar Reddy M, P; Lakshmanna, K; Kaluri, R; Singh Rajput, D; Srivastava, G; Baker, T

Reddy G, T, Kumar Reddy M, P, Lakshmanna, K, Kaluri, R, Singh Rajput, D, Srivastava, G and Baker, T (2020) Analysis of Dimensionality Reduction Techniques on Big Data. IEEE Access, 8. pp. 54776-54788. ISSN 2169-3536

Analysis_of_Dimensionality_Reduction_Techniques_on_Big_Data.pdf - Published Version
Available under License Creative Commons Attribution.

Publisher URL: https://doi.org/10.1109/ACCESS.2020.2980942

Abstract

Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data, and these patterns can support predictions that help medical practitioners and managers make decisions. Not all the attributes in the generated datasets are important for training the machine learning algorithms: some are irrelevant, and some do not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine Machine Learning Repository. The experimental results show that PCA outperforms LDA in all the measures. Also, the performance of the Decision Tree and Random Forest classifiers is not affected much by using PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, the ML algorithms without dimensionality reduction yield better results.
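
As a rough illustration of the idea behind PCA mentioned in the abstract, the sketch below reduces three attributes to one using only the Python standard library. It is not the authors' code and does not use the CTG, DR or IDS datasets: the data are synthetic, the helper names (`center`, `covariance`, `top_component`) are hypothetical, and power iteration is just one simple way to find the leading principal component.

```python
import math
import random


def center(rows):
    """Subtract the per-attribute mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iters=500):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance along v.
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cv[i] for i in range(d))
    return v, eigenvalue


# Synthetic data (an assumption for illustration): two attributes driven by
# one hidden factor t, plus a third attribute that is almost pure noise.
random.seed(0)
rows = []
for _ in range(200):
    t = random.gauss(0.0, 3.0)
    rows.append([
        t + random.gauss(0.0, 0.1),
        2.0 * t + random.gauss(0.0, 0.1),
        random.gauss(0.0, 0.1),
    ])

centered = center(rows)
cov = covariance(centered)
component, eigenvalue = top_component(cov)

# Share of the total variance captured by the single retained component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance

# Reduced dataset: one value per row instead of three attributes.
projected = [sum(r[j] * component[j] for j in range(len(component)))
             for r in centered]
```

Here `explained_ratio` indicates how much of the original variance survives the reduction from three attributes to one. A classifier would then be trained on `projected` rather than on `rows`. LDA differs in that it uses the class labels to choose the projection, which this sketch does not cover.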

Item Type:	Article
Additional Information:	© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Subjects:	Q Science > QA Mathematics > QA76 Computer software; Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4450 Databases
Divisions:	Computer Science and Mathematics
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Date of acceptance:	10 March 2020
Date of first compliant Open Access:	16 March 2020
Date Deposited:	16 Mar 2020 10:24
Last Modified:	04 Sep 2021 07:42
DOI or ID number:	10.1109/ACCESS.2020.2980942
URI:	https://researchonline.ljmu.ac.uk/id/eprint/12495
