Logical Analysis of Built-in DBSCAN Functions in Popular Data Science Programming Languages

Md Amiruzzaman; Rashik Rahman; Md. Rajibul Islam; Rizal Mohd Nor

doi:10.47981/j.mijst.10(01)2022.349(25-32)

Md Amiruzzaman, West Chester University, West Chester, PA, USA
Rashik Rahman, University of Asia Pacific, Dhaka, Bangladesh
Md. Rajibul Islam, University of Asia Pacific, Dhaka, Bangladesh
Rizal Mohd Nor, International Islamic University Malaysia, Kuala Lumpur, Malaysia

Keywords: Clustering, DBSCAN, Geo-coordinates, Machine learning, Spatial

Abstract

DBSCAN is a density-based clustering algorithm that is widely used to find relationships and patterns in geographical data. Because of its widespread application, several programming languages used in data science provide DBSCAN as a built-in function, and researchers and data scientists rely on these functions to cluster and analyze their study data. All of these implementations require the user to supply a radius distance (i.e., eps) and a minimum number of samples per cluster (i.e., min_sample). Given identical inputs, the built-in DBSCAN functions are therefore generally assumed to produce the same output. However, the built-in DBSCAN function in Python yields different results from those of the other programming languages analyzed in this study. We propose a scientific way to assess the results of built-in DBSCAN functions and to identify inconsistencies in their output. The study reveals several differences and advises caution when relying on built-in functionality.
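To make the roles of the two parameters concrete, the following is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is not the implementation of any of the built-in functions discussed in the study, and it is not the assessment method the authors propose. The function names `region_query` and `dbscan`, the use of Euclidean distance, the sample points, and the convention that a point counts as its own neighbor are all assumptions made for illustration. A small transparent version like this could serve as a baseline against which the labels returned by a built-in function are compared.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i].

    Assumption: the point itself is included in its own neighborhood.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Hypothetical sample data: two tight groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (50.0, 50.0),
]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

With the same points and `eps`, raising `min_samples` to 4 marks every point as noise, because no neighborhood contains four points. Choices such as whether a point counts toward its own neighborhood, or which distance measure is applied to geo-coordinates, are decisions every implementation has to make, so they are worth checking in the documentation of each built-in function before comparing outputs.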
