Overcoming data sourcing issues when testing finance virtual assistants
AI-powered chatbots known as virtual assistants (VAs) are increasingly popular in financial services, with several banks having launched VA services in the past decade.
Unlike traditional financial software, VAs interact with users through natural-language dialogue while drawing on third-party services to retrieve information and perform actions on the user’s behalf. The VA collects multi-dimensional data – such as client requests and personal information – and uses machine learning algorithms to analyse it. This analysis then enhances the quality and individualisation of the VA’s responses.
There is no doubt about the technology’s potential usefulness, especially in customer engagement. However, AI’s rewards come with risks.
In 2019, Tinkoff Bank debuted its VA, Oleg, embedded in the bank’s app. Though very powerful from a technological point of view, the virtual assistant wasn’t perfect from a customer satisfaction perspective. When a client contacted Oleg about a problem with fingerprint login, the chatbot, which had been trained on open-source text data using the Kolmogorov supercomputer, could offer no better response than “You’d better have your fingers cut off.”
Oleg’s response demonstrates a peculiar yet very human-like reaction, which suggests the underlying AI model was trained successfully. However, because the VA was trained on open data rather than financial services-specific data, the result was a communicative failure.
A chatbot-assisted user interface also faces another type of communicative risk. Even within an otherwise smooth conversation, it may not be enough for a banking VA to simply respond to whatever a user says. There are situations where a client inputs something like, “My spouse passed away… what do I do with the account?”, when ideally the chatbot should recognise that human intervention is needed.
Evaluating conversational systems with the aim of mitigating this communicative risk is challenging because it requires massive amounts of textual data covering the possible dialogue interactions, enough to successfully train an algorithm. In the absence of a detailed specification of expected system behaviour, both user inputs and VA outputs are crucial to the validation and verification process.
In the financial industry, these testing challenges inherent to AI systems are compounded by data access issues – frequently, a VA is trained on data that a third-party testing provider cannot access because of its sensitive nature. For example, a testing team may need to evaluate a banking application chatbot designed to communicate with bank clients and, based on those chats, make changes to the customer account records in the bank’s database. Without access to existing user/chatbot interaction records, it is challenging for testers to create training datasets from scratch.
Even at the starting point, when the most basic interaction scenarios are covered by the most standard input and output phrases, the testing team needs to predict exactly how clients may formulate their questions and answers. It is even more challenging to anticipate all the ways users can unintentionally transform these inputs with misprints, omissions or other errors caused by inattention, lack of effort or poor literacy. Ultimately, even when a dataset of possible user inputs for different scenarios has been modelled, the evaluator still needs to define which system responses count as correct and which count as failures for a specific user input within a particular scenario.
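To illustrate, the sketch below shows one way such a scenario dataset could be organised, with each scenario pairing expected user phrasings with the response classes that count as passes or failures. The scenario name, phrasings and response classes are purely illustrative assumptions, not drawn from any real banking VA.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One interaction scenario with its known input phrasings and verdict classes."""
    name: str
    user_inputs: list[str]                  # phrasings testers expect from clients
    correct_responses: set[str]             # response classes counted as passes
    failure_responses: set[str] = field(default_factory=set)  # known bad classes

# A hypothetical scenario for blocking a lost card.
card_block = Scenario(
    name="block_lost_card",
    user_inputs=[
        "I lost my card, please block it",
        "block my card",
        "my card was stolen what do i do",
    ],
    correct_responses={"confirm_card_block", "ask_which_card"},
    failure_responses={"small_talk", "balance_enquiry"},
)

def verdict(scenario: Scenario, response_class: str) -> str:
    """Map a VA response class to a test verdict for the given scenario."""
    if response_class in scenario.correct_responses:
        return "pass"
    if response_class in scenario.failure_responses:
        return "fail"
    return "needs_review"  # unexpected behaviour is escalated to a human evaluator
```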
These challenges can be mitigated through a hybrid, two-pronged approach to testing. First, the tester collects interaction logs pulled from conversations between the VA under test and manual testers, and then annotates them according to a set of the VA’s skills. This categorisation allows testers to generate test scenarios designed to evaluate the chatbot’s performance on different levels.
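As a minimal sketch, annotated logs of this kind could be grouped by skill so that each skill yields its own seed test cases. The log fields and skill labels below are assumed for the purposes of the example.

```python
from collections import defaultdict

# Interaction logs pulled from conversations between manual testers and the VA
# under test, annotated with an assumed "skill" label per exchange.
logs = [
    {"user": "what's my balance?",            "va": "Your balance is ...",          "skill": "balance_enquiry"},
    {"user": "send 50 to John",               "va": "Please confirm the transfer.", "skill": "payments"},
    {"user": "I can't log in with my finger", "va": "Let's reset biometric login.", "skill": "authentication_help"},
]

# Group annotated exchanges by skill so test scenarios can be generated per skill.
suites = defaultdict(list)
for entry in logs:
    suites[entry["skill"]].append((entry["user"], entry["va"]))

for skill, cases in suites.items():
    print(f"{skill}: {len(cases)} seed test case(s)")
```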
The second prong is leveraging the collected data for automated tests that use natural language processing (NLP) techniques to determine how robust the VA is. For example, the tester can evaluate the VA’s ability to process text input by feeding it spelling and syntax variations, or the chatbot can be tested on how well it identifies the user’s needs, matches them with a specific skill area and then responds appropriately with an answer or action.
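A spelling-robustness check of this kind might follow the pattern sketched below, where noisy variants of a seed utterance are generated and the test verifies that the VA still resolves the expected skill. The classify_intent hook stands in for whatever interface the VA under test exposes; it is an assumption, not a real product API.

```python
import random

def typo_variants(text: str, n: int = 5, seed: int = 0) -> list[str]:
    """Generate n noisy copies of `text` by dropping or transposing characters."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        chars = list(text)
        if len(chars) < 2:
            variants.append(text)
            continue
        i = rng.randrange(len(chars) - 1)
        if rng.random() < 0.5:
            del chars[i]                                     # simulate an omission
        else:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # simulate a transposition
        variants.append("".join(chars))
    return variants

def check_intent_robustness(classify_intent, utterance: str, expected_skill: str) -> list[str]:
    """Return the noisy variants for which the VA no longer resolves the expected skill."""
    return [v for v in typo_variants(utterance) if classify_intent(v) != expected_skill]

# Example usage with an assumed classifier hook (hypothetical):
# failures = check_intent_robustness(my_va.classify, "block my card", "card_management")
# assert not failures, f"Intent drifted on noisy inputs: {failures}"
```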
The recommended approach focuses on quality attributes of VAs such as performance, functionality and accessibility. A variation on this approach involves combining the collected data with additional data from other sources. For example, an evaluation may mix phrases that signal user intents associated with different skills, to test whether the system switches to another intent or asks a clarifying question. The results of this testing are less interpretable but can be beneficial given a large volume and variety of stimuli.
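As a rough sketch, a mixed-intent check could look like the following; the ask_va hook, the response fields and the skill names are assumptions about the system under test rather than any particular product’s API.

```python
def check_mixed_intents(ask_va) -> None:
    """Send a message mixing two intents and accept either a clarifying question
    or an explicit switch to one of the candidate skills."""
    message = "What's my balance? Also, block my card."
    response = ask_va(message)  # assumed to return a dict describing the VA's reaction
    acceptable = (
        response.get("action") == "clarifying_question"
        or response.get("skill") in {"balance_enquiry", "card_management"}
    )
    assert acceptable, f"Unexpected handling of mixed intents: {response}"
```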
As the technology underpinning chatbots develops, so will their evaluation. Future research aims to address other characteristics that contribute to a financial VA’s effectiveness, efficiency and overall user satisfaction. Automated evaluation of conversational systems remains challenging, and this recommended approach may help overcome the main problem in testing AI systems: obtaining and expanding the training dataset, especially in a knowledge domain strongly affected by data sensitivity and data privacy concerns.