A machine learning investigation of asteroid classification in the optical regime

Sullivan, P (2023) A machine learning investigation of asteroid classification in the optical regime. Doctoral thesis, Liverpool John Moores University.

Text: 2023sullivanphd.pdf - Published Version. Available under License Creative Commons Attribution Non-commercial. (23MB)
Text: 2023sullivanphdinternal.pdf - Accepted Version. Access Restricted. (21MB)

Abstract

Historically, asteroids have been characterised by features of their reflectance spectra in the optical and near-infrared (NIR) as well as their albedos. The Bus-DeMeo asteroid taxonomy and its closely related predecessor (Bus-Binzel) both have their roots in spectrophotometric work from the 1970s based on principal component analysis (PCA) and clustering of small (few hundred) asteroid samples. The known asteroid population has grown exponentially since those times, and the taxonomies have been applied to as many as one million asteroids in the Sloan Digital Sky Survey (SDSS). Whether the classes they describe remain valid in a contemporary context is the topic of this thesis. I used a range of machine learning techniques to investigate the robustness of asteroid classes in approximately two thousand reflectance spectra as well as in the photometric colour data available in three published catalogs extracted from the SDSS Moving Object Catalog. Beginning with spectra in the wavelength range 0.5 − 0.9 μm, I found that a support vector machine (SVM) classifier can identify 85% of asteroids correctly when considering the major classes (A, B, C, D, K, L, Q, S, V, X), but subclasses are not distinguishable from one another and/or their parent classes. Furthermore, the 15% of wrongly classified objects include a high proportion of classes B, K, and Q being assigned to neighbouring classes in feature-space. The SVM performs better on the full 161 spectral wavelength points than on data that have been transformed with PCA, indicating that there is no need to reduce the dimensions before training. In order to test an SVM on the large SDSS asteroid datasets, I generated a ‘pseudo-broadband’ training set by taking the average reflectance values of the spectra over a range corresponding to the SDSS r, i, and z bands.
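The pseudo-broadband step described above can be illustrated with a minimal sketch. It is not the thesis code: the evenly spaced 161-point grid, the band edges in `BANDS`, and the names `pseudo_broadband` and `toy_spectrum` are all assumptions made for illustration, and the band limits actually used in the thesis may differ.

```python
from statistics import fmean

# Assumed grid: 161 points evenly spaced over 0.5-0.9 micron (0.0025 micron step).
WAVELENGTHS = [0.5 + 0.0025 * k for k in range(161)]

# Hypothetical band edges in micron, for illustration only. They are rough
# stand-ins for SDSS r, i, z and are not the limits used in the thesis.
BANDS = {"r": (0.56, 0.68), "i": (0.69, 0.82), "z": (0.82, 0.90)}


def pseudo_broadband(spectrum, wavelengths=WAVELENGTHS, bands=BANDS):
    """Average the reflectance values that fall inside each band's range."""
    if len(spectrum) != len(wavelengths):
        raise ValueError("spectrum and wavelength grid must have equal length")
    features = {}
    for name, (lo, hi) in bands.items():
        in_band = [r for w, r in zip(wavelengths, spectrum) if lo <= w <= hi]
        features[name] = fmean(in_band)
    return features


# Toy spectrum with a linear red slope, equal to 1.0 at 0.55 micron.
toy_spectrum = [1.0 + 0.5 * (w - 0.55) for w in WAVELENGTHS]
colours = pseudo_broadband(toy_spectrum)
```

Each 161-point spectrum collapses to three numbers, which is why narrow spectral features that separate small classes can be lost at this resolution.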
After training the SVM on this low-resolution data, classification accuracy was diminished by only 5%, but every A, B, K, and Q object was assigned to a neighbouring class. In the transition to broadband photometry these classes ceased to exist as far as the SVM was concerned. For each dataset used, catalog classes had already been assigned according to either the Bus-Binzel system (for spectra) or the Carvano or DeMeo systems (for SDSS). Objects having data in both SDSS and spectrum catalogs had been assigned to different classes in ∼ 30% of cases, creating a challenging situation for the SVM since it cannot match to a conflicting label. In examining the SDSS datasets (one classified by the Bus-Binzel system and the other two by the Bus-DeMeo system) I discovered that the mean SDSS broadband values in each of bands r, i, z were brighter than pseudo-broadband values for the same object. After correcting for the difference in each band, the SVM was able to predict with 85% accuracy on the Carvano dataset and 74% on the DeMeo data. However, classes A, B, K, L, and Q were not verifiable by the SVM. Since the above classes have small populations relative to the S-class, I set about increasing the sample size of the smaller classes. I did this using two augmentation methods: Synthetic Minority Oversampling Technique with Edited Nearest Neighbours (SMOTE-ENN) and a variational autoencoder (VAE). SMOTE-ENN interpolates between existing datapoints when resampling, whereas the VAE learns from existing samples and produces new spectra by sampling from a Gaussian distribution around each dimension of the data. After the SVM was trained on SMOTE-ENN-augmented data, it was able to classify to 91% accuracy on the spectrum test set and 88% on pseudo-broadband. With VAE augmentation, it achieved 92% accuracy at spectrum level and 90% at pseudo-broadband. Both augmentation methods recovered all of the ‘missing’ classes, but to poor accuracy.
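The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the thesis pipeline. It always interpolates towards the single nearest neighbour, whereas SMOTE picks at random among several neighbours, and it omits both the Edited Nearest Neighbours cleaning step and the VAE. The function names and the toy points are hypothetical.

```python
import random


def interpolate_sample(x, neighbour, rng):
    """Return a synthetic point on the line segment between x and neighbour."""
    lam = rng.random()  # uniform in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]


def nearest_neighbour(x, others):
    """Closest point to x by squared Euclidean distance."""
    return min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))


def oversample(minority, n_new, seed=0):
    """Create n_new synthetic samples from a list of minority-class points.

    Simplification (assumption): always interpolates towards the single
    nearest neighbour rather than a random one of k neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        synthetic.append(interpolate_sample(x, nearest_neighbour(x, others), rng))
    return synthetic


# Toy minority class in a 2-D feature space (hypothetical values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = oversample(toy_minority, n_new=5)
```

Every synthetic point lies on a segment joining two existing samples, so the new data can only fill in the region already spanned by the minority class.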
I found that performance of augmented models fell by 30 − 40% when testing on SDSS data. SMOTE-ENN’s method of interpolation altered the distribution of data within class boundaries in a non-random way, effectively introducing artificial substructure in feature-space. The result was highly overfitted class boundaries that became obvious when examining plots of SMOTE-ENN’s predicted classes on SDSS data. The VAE model was able to correct for some class imbalance, especially in the C-complex, but its accuracy reached 58% at best. The much lower accuracy scores stem from the increased population of classes that are poorly defined in the first place. Rather than improving the classifier, augmentation by the VAE exposes the weakness in claiming that classes A, B, K, L and Q exist at all in the SDSS. Finally, I used unsupervised learning to search for evidence of asteroid classes in unlabelled data using K-Means, Gaussian mixture model (GMM) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). After training and testing on spectra, PCA of spectra, and spectra augmented with both SMOTE-ENN and VAE, I found no evidence of ‘real’ classes apart from the C-complex and S-complex cores, which have been known since the 1970s. HDBSCAN characterises a large fraction of the data as noise. I conclude that the variational autoencoder is a viable method to correct bias in the classification model without introducing new biases, and that SMOTE-ENN produces data prone to overfitting. With the VAE model in the 0.5 − 0.9 μm regime, classes A, B, K, L, and Q are ambiguous at pseudo-broadband resolution and therefore their application to SDSS data is unreliable.
Tricia Sullivan, September 2023

Item Type:	Thesis (Doctoral)
Uncontrolled Keywords:	asteroid classification; variational autoencoder; asteroid taxonomy; supervised learning
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QB Astronomy
Divisions:	Astrophysics Research Institute
Date of acceptance:	2 April 2023
Date of first compliant Open Access:	29 September 2023
Date Deposited:	29 Sep 2023 14:54
Last Modified:	29 Sep 2023 14:56
DOI or ID number:	10.24377/LJMU.t.00021526
Supervisors:	Steele, I, Cooperwheat, C and Fergus, P
URI:	https://researchonline.ljmu.ac.uk/id/eprint/21526
