Logical analysis of built-in DBSCAN Functions in Popular Data Science Programming Languages
Abstract
The DBSCAN algorithm is a density-based clustering approach that is widely used to find relationships and patterns in geographical data. Because of its widespread application, several programming languages used in data science provide DBSCAN as a built-in function, and researchers and data scientists rely on these functions to cluster and analyze their study data. All of these implementations require the user to supply a neighborhood radius (eps) and a minimum number of samples per cluster (min_samples). Given identical inputs, all built-in DBSCAN functions are therefore expected to produce the same result. However, the built-in DBSCAN function in Python yields different results from those of the other programming languages analyzed in this study. We propose a systematic way to assess the results of built-in DBSCAN functions and to identify inconsistencies in their output. The study reveals several differences and advises caution when relying on built-in functionality.
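To make the role of the two parameters concrete, here is a minimal pure-Python sketch of DBSCAN. It is not the implementation used by any of the libraries examined in the study. It assumes Euclidean distance and an inclusive radius boundary, and it introduces a hypothetical `include_self` flag to show how one counting convention (whether a point counts itself toward the minimum sample threshold) can change the output for identical parameter values. scikit-learn documents that its `min_samples` count includes the point itself; whether this convention explains the differences reported in the study is not claimed here.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps, include_self=True):
    """Return indices of points within eps of points[i] (boundary inclusive)."""
    return [
        j
        for j, q in enumerate(points)
        if dist(points[i], q) <= eps and (include_self or j != i)
    ]


def dbscan(points, eps, min_samples, include_self=True):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1).

    include_self is a hypothetical switch for this sketch: it controls
    whether a point counts itself when compared against min_samples.
    """
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps, include_self)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps, include_self)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Illustrative data only: three nearby points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (10, 10)]
with_self = dbscan(points, eps=1.0, min_samples=3, include_self=True)
without_self = dbscan(points, eps=1.0, min_samples=3, include_self=False)
print(with_self)
print(without_self)
```

With the same `eps` and `min_samples`, the first call groups the three nearby points into one cluster, while the second marks every point as noise. This is one plausible way in which two implementations could disagree despite receiving identical inputs.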
Although MIJST follows an open access policy, the journal holds the copyright of each published item.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.