Logical analysis of built-in DBSCAN Functions in Popular Data Science Programming Languages

  • Md Amiruzzaman, West Chester University, West Chester, PA, USA
  • Rashik Rahman, University of Asia Pacific, Dhaka, Bangladesh
  • Md. Rajibul Islam, University of Asia Pacific, Dhaka, Bangladesh
  • Rizal Mohd Nor, International Islamic University Malaysia, Kuala Lumpur, Malaysia
Keywords: Clustering, DBSCAN, Geo-coordinates, Machine learning, Spatial

Abstract

The DBSCAN algorithm is a location-based clustering approach used to find relationships and patterns in geographical data. Because of its widespread application, several programming languages used in data science include DBSCAN as a built-in function, and researchers and data scientists have relied on these built-in functions to cluster and analyze their study data. All of these implementations require the user to supply a radius distance (i.e., eps) and a minimum number of samples for a cluster (i.e., min_sample). Consequently, all built-in DBSCAN functions are generally assumed to produce the same output. However, the built-in DBSCAN function in Python yields different results from those of the other programming languages analyzed in this study. We propose a scientific way to assess the results of built-in DBSCAN functions and to identify inconsistencies in their output. This study reveals several such differences and advises caution when working with built-in functionality.
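
As a rough illustration of the two user inputs named in the abstract, the following minimal Python sketch clusters a handful of points with a radius and a minimum neighbor count. It is not the implementation from any of the libraries evaluated in the study. The function name `dbscan`, the parameter name `min_samples`, the use of Euclidean distance, the sample points, and the convention that a point counts as its own neighbor are all assumptions made for this example.

```python
from math import dist

NOISE = -1  # label used here for points that belong to no cluster (assumption)


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Hypothetical sample data: two small groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]

print(dbscan(points, eps=1.5, min_samples=2))  # [0, 0, 0, 1, 1, -1]
print(dbscan(points, eps=1.5, min_samples=3))  # [0, 0, 0, -1, -1, -1]
```

Raising `min_samples` from 2 to 3 turns the two-point group into noise, which shows how sensitive the output is to the two inputs even within a single implementation.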

Published
2022-06-26
How to Cite
Amiruzzaman, M., Rahman, R., Islam, M. R., & Nor, R. M. (2022). Logical analysis of built-in DBSCAN Functions in Popular Data Science Programming Languages. MIST INTERNATIONAL JOURNAL OF SCIENCE AND TECHNOLOGY, 10(1), 25-32. https://doi.org/10.47981/j.mijst.10(01)2022.349(25-32)