System Usability and Design Evaluation of AI Chatbots: A Comparative Analysis of ChatGPT, Google Bard, and Bing Chat

Sumaiya Nuha Mustafina; Nusrat Kaniz Khan; Muhammad Nazrul Islam; Fatema Siddiqua Nusrat; M. Akhtaruzzaman

doi:10.47981/j.mijst.13(01)2025.522(83-97)

Sumaiya Nuha Mustafina, Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Nusrat Kaniz Khan, Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
Muhammad Nazrul Islam, Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
Fatema Siddiqua Nusrat, Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
M. Akhtaruzzaman, Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh. https://orcid.org/0000-0002-9929-4066

Keywords: SUS, System Usability Scale, HE, Heuristic Evaluation, HCI, Human-Computer Interaction

Abstract

Artificial intelligence (AI) has driven significant advances in technology, and chatbots such as ChatGPT, Google Bard, and Bing Chat are among its most notable innovations. These chatbots help users from diverse backgrounds generate ideas, find resources, and manage knowledge, although they are still at an experimental stage of use. Evaluating their usability and user experience is therefore crucial to making them more usable, accessible, and intuitive for end users around the globe. The objective of this research is a comparative usability analysis of three AI-based chatbots: ChatGPT, Google Bard, and Bing Chat. Two methods were used: the System Usability Scale (SUS), administered through questionnaire surveys, and Heuristic Evaluation (HE), carried out through expert observation. Through HE, we investigated design characteristics, user engagement, and other specific usability shortcomings, and assigned each a severity score indicating whether it calls for urgent or gradual improvement. The SUS evaluation provided a comprehensive view of user satisfaction: Google Bard and Bing Chat received lower SUS scores, while ChatGPT demonstrated comparatively better usability, with a SUS score above 70. Taken together, the results show that although all three applications suffer from a notable number of usability problems, ChatGPT demonstrates better usability performance than Google Bard and Bing Chat.
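
For readers unfamiliar with how a SUS score such as the "above 70" figure is obtained, the following is a minimal sketch of the standard SUS scoring rule: ten items rated 1 to 5, odd-numbered items contributing (rating - 1), even-numbered items contributing (5 - rating), and the sum multiplied by 2.5. The abstract does not describe the authors' exact procedure, so the function names `sus_score` and `mean_sus` and the sample responses are illustrative assumptions, not the study's code or data.

```python
from statistics import mean


def sus_score(responses):
    """Return the 0-100 SUS score for one respondent's ten 1-5 ratings."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded.
            total += 5 - rating
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-respondent SUS scores for one system."""
    return mean(sus_score(r) for r in all_responses)


# Hypothetical responses for illustration only (not data from the study).
example_responses = [
    [4, 2, 4, 1, 5, 2, 4, 2, 4, 2],
    [3, 3, 4, 2, 3, 3, 4, 2, 3, 3],
]
print(mean_sus(example_responses))
```

Averaging per-respondent scores, as `mean_sus` does, is one common way to report a single SUS value per chatbot; whether the study aggregated its scores this way is an assumption.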
