Partially Synthesised Dataset to Improve Prediction Accuracy (Case Study: Prediction of Heart Diseases)

Aljaaf, A; Al-Jumeily, D; Hussain, A; Fergus, P; al-jumaily, M; Hamdan, H

Aljaaf, A, Al-Jumeily, D, Hussain, A, Fergus, P, al-jumaily, M and Hamdan, H (2016) Partially Synthesised Dataset to Improve Prediction Accuracy (Case Study: Prediction of Heart Diseases). In: Intelligent Computing Theories and Application: Lecture Notes in Computer Science, 9771. pp. 855-866. (2016 International Conference on Intelligent Computation, 02 August 2016 - 05 August 2016, Lanzhou, China).

Partially Synthesised Dataset to Improve Prediction Accuracy.pdf - Accepted Version

Publisher URL: http://dx.doi.org/10.1007/978-3-319-42291-6_84

Abstract

Real-world data sources, such as statistical agencies, library data banks and research institutes, are the major data sources for researchers. Using this type of data offers several advantages, including improved credibility and validity of the experiment and, more importantly, a direct relation to real-world problems; such data is also typically unbiased. However, this type of data is often unavailable or inaccessible, for the following reasons. First, there are privacy and confidentiality concerns, since the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time-consuming. Third, the data may be unavailable, particularly for newly arising research subjects. Therefore, many studies have turned to fully and/or partially synthesised data instead of real-world data, because it is simple to create, requires relatively little time, and can be generated in sufficient quantity to fit the requirements. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart diseases from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85% of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with the World Health Organisation criteria. The analysis revealed an improvement of the variance in the data using the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, in which the partially synthesised data considerably improved the prediction of heart diseases: the majority of classifiers approximately doubled their predictive performance using the extra risk factor.
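
The rule-based step described in the abstract, adding one extra risk factor to observed records, can be illustrated with a minimal sketch. The abstract does not name the extra risk factor or state the rules, so everything specific here is an assumption: the column names (`weight_kg`, `height_m`, `bmi_category`), the helper functions, and the choice of the WHO adult BMI cut-offs as a stand-in rule set. It is not the authors' procedure.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[int, float, str]]


def who_bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to a WHO adult category.

    Stand-in rule set only; the paper's actual rules are not given
    in the abstract.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def add_rule_based_factor(records: List[Record]) -> List[Record]:
    """Return copies of the observed records with one extra, rule-derived column."""
    augmented = []
    for row in records:
        bmi = row["weight_kg"] / (row["height_m"] ** 2)
        new_row = dict(row)  # observed values are copied, not modified
        new_row["bmi_category"] = who_bmi_category(bmi)  # hypothetical extra factor
        augmented.append(new_row)
    return augmented


# Hypothetical observed records (not taken from the paper's dataset).
observed = [
    {"age": 54, "weight_kg": 92.0, "height_m": 1.75},
    {"age": 41, "weight_kg": 60.0, "height_m": 1.70},
]
partially_synthesised = add_rule_based_factor(observed)
```

In this sketch the observed columns pass through unchanged and only the added column is generated by rule, which mirrors the abstract's description of a dataset that is mostly real-world values with a smaller synthesised part.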

Item Type:	Conference or Workshop Item (Paper)
Additional Information:	The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-42291-6_84
Uncontrolled Keywords:	08 Information And Computing Sciences
Subjects:	Q Science > QA Mathematics > QA76 Computer software; R Medicine > R Medicine (General)
Divisions:	Computer Science and Mathematics
Publisher:	Springer Verlag (Germany)
Date of acceptance:	25 April 2016
Date of first compliant Open Access:	16 December 2016
Date Deposited:	16 Dec 2016 11:44
Last Modified:	13 Apr 2022 15:14
URI:	https://researchonline.ljmu.ac.uk/id/eprint/3545
