Machine Learning for Exoplanet Detection: A Comparative Analysis using Kepler Data

Document Type: Research Paper

Authors

1 School of Astronomy, Institute for Research in Fundamental Sciences (IPM), P. O. Box 19395-5531, Tehran, Iran

2 School of Astronomy, Institute for Research in Fundamental Sciences (IPM), P. O. Box 19395-5531, Tehran, Iran; Iranian National Observatory (INO), Institute for Research in Fundamental Sciences (IPM), P. O. Box 19568-36613, Tehran, Iran

3 Department of Physics, K. N. Toosi University of Technology, P. O. Box 15875-4416, Tehran, Iran; School of Astronomy, Institute for Research in Fundamental Sciences (IPM), P. O. Box 19395-5531, Tehran, Iran

Abstract

The discovery of exoplanets has expanded our understanding of planetary systems and opened new avenues for astronomical research. In this study, we present a machine learning (ML) framework for exoplanet identification using a time-series photometric dataset from the Kepler Space Telescope, comprising 3,198 flux measurements across 5,074 stars. We compare four supervised classification algorithms, namely Random Forest, k-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression, using accuracy, precision, recall, F1-score, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), confusion matrices, and learning curves. Random Forest achieves the highest accuracy (99.8%) and near-perfect F1-scores, indicating the best generalization and robustness among the models. KNN also performs strongly at 99.3% accuracy, Decision Tree is moderate at 97.1%, and Logistic Regression has the lowest accuracy (95.8%) and the weakest generalization. Notably, applying the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance significantly improves the performance of all models. These findings underscore the effectiveness of ensemble-based ML techniques, particularly Random Forest, in handling large volumes of photometric data for automated exoplanet detection. The approach holds significant potential for ground-based facilities such as the Iranian National Observatory (INO), where extensive and precise datasets can further advance exoplanet discovery and characterization.
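
The class-balancing step named in the abstract can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' implementation: the function name `smote`, its parameters, and the toy two-feature data are assumptions made for illustration, and the sketch uses only the Python standard library. Each synthetic sample is a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbors.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length feature vectors (lists of floats).
    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [minority[j] for j in range(len(minority)) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy stand-in for flux-derived features (assumed values, not Kepler data):
# 3 "planet" rows versus 12 "non-planet" rows.
planets = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]]
non_planets = [[0.0, 1.0 + 0.1 * i] for i in range(12)]

new_planets = smote(planets, n_new=len(non_planets) - len(planets), k=2)
balanced_planets = planets + new_planets
print(len(balanced_planets), len(non_planets))  # prints "12 12"
```

In a real pipeline, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data. The imbalanced-learn package provides a maintained `SMOTE` class that would usually be preferred over a hand-written version.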
