How AI Happens

Nuances of Speech Recognition with Cogito's Dr. John Kane

Episode Summary

Dr. John Kane, Head of Signal Processing & Machine Learning at Cogito, explains the challenges captured by the oft-repeated truism "speech is more than text", and how Cogito addresses these challenges to deliver real-time conversational insight to its users. Later, John explains a holistic approach to ensuring machine learning technology is created in a bias-free environment.

Episode Transcription

0:00:00.0 Dr. John Kane: That is surprisingly not a crazy application of the technology, and in fact, our Chief Behavioural Science Officer at the company actually did his PhD research on a related area.

0:00:18.7 Rob Stevenson: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens.

0:00:48.5 RS: Natural language processing has its claws in a vast array of AI applications. Of course, in our series, from Alexa to podcast transcript generation tools, but also when we acquire or generate data, there's a good chance it has its origins in some kind of human communication, and if you want to use it to improve your tech, you're going to be conducting some sort of NLP. NLP is a huge part of the work that goes into Cogito's main offering, a tool that offers real-time insight to professionals conducting support calls, providing them with tips, strategies and assets to improve the quality of the support they provide. To learn more, I sat down with Cogito's Head of Signal Processing, Dr. John Kane. John explained the challenges represented by the often-repeated truism "speech is more than text", and later provided an outline for a holistic approach to creating a bias-free environment within which to develop ML tools.

0:01:47.1 DK: In my past life, I was an academic. I did my PhD actually across the road from here in Trinity College Dublin. My PhD was focused on finding signal processing techniques to detect subtle changes in tone of voice and voice quality. My subsequent postdoctoral years I spent applying those techniques to different speech technology applications. Actually, the lab I was involved with was the first to create a speech synthesiser for the Irish language, for Gaelic, which was great. Also worked on...

0:02:20.4 RS: And all of the UK rejoiced.

0:02:25.5 DK: Yeah, absolutely, yeah. Also worked on different areas like expressive speech synthesis, detecting different vocal disorders, speech emotion recognition, spoken dialogue systems, a bunch of different areas, which basically took my PhD work and applied it to actual real speech tech applications. So that was great. I thought I would be an academic for the rest of my life, but then a friend of mine who used to be a professor at the University of Southern California, he's currently actually the CTO of a really, really interesting robotics company on the West Coast called Embodied. While he was at USC, he was working on a government-funded research project that happened to have Cogito involved, and he and the CTO there spoke. He was looking for somebody to come and build a machine learning department at Cogito, and I got [0:03:14.9] ____ and fast forward eight years and we're here right now, and yeah, I basically run the machine learning and signal processing department within engineering at Cogito, which is a great privilege.

0:03:26.0 RS: Yeah, the rest, as they say, is history. I really wanna get into all of the speech synthesis stuff and the tone detection... We'll get into that in a little bit. Before we do, would you mind explaining a little bit about Cogito and what the company does and kind of the chief opportunity there?

0:03:40.2 DK: Yeah, sure. Cogito sprung out of the MIT Media Lab, so actually Professor Sandy Pentland is one of our co-founders, and the Media Lab is a world-renowned research institution with expertise in analysing and modelling human behaviour, and so a lot of their essence still exists in the company today. The real high-level focus of the company is essentially to analyse and guide behaviour to sort of elevate the human experience. Rather than providing after-the-fact data analytics like a lot of folks do, our focus is to try to help contact centre agents and supervisors in real time, in the moment, while they're in the throes of a conversation, of an interaction, and provide them with tips and guidance to help make their job easier and help them be more effective, which then results in improvements in key performance indicators for those businesses.

0:04:40.8 RS: It strikes me that there are two very large challenges there. One is providing accurate insight, something that's actionable and that the user can say, "Oh yeah, you know what, that is better," and two, providing it to them in real time, because no matter how accurate it is, as you say, if it comes in an automated email a couple of hours later, it's too little, too late. So how are you able to make sure you can deliver this insight quickly and in real time?

0:05:07.1 DK: Yeah, that's a great question. So we apply an area of machine learning which we refer to as signal-based machine learning at the company. A lot of traditional machine learning approaches use static inputs: you've got images or you've got documents, and you're applying machine learning models to produce inferences based on those. And for instance, in the case of having a document like a tweet or something like this, you have the entire sequence available to you. The design restriction for us is that we don't have all of the data available yet. The data is evolving; it is being produced incrementally as the conversation evolves, so we need to be able to provide real-time, low-latency processing of this progressively occurring data to be able to provide this sort of guidance and feedback. So our machine learning models and our processing have to be causal. We can't look forward, and if we do look forward, we incur latency costs, and if we incur a significant latency cost, then the usefulness of the guidance we provide to agents is significantly diminished. We have to find ways of dealing with this in a streaming-based manner, and at the same time taking account of the rich data that is available in those interactive conversations.

0:06:20.8 RS: Can you explain what you mean by causal?

0:06:24.3 DK: Yeah, by causal I just mean looking back. So at a particular point in time, we use data which has occurred up until this point in time. So if we're having a conversation right now, I can't use the words that you're going to say in a couple of seconds' time. Causal processing means that we're looking at this point in time and using any prior data to make an inference.

0:06:43.9 RS: What would be the counter to that, what's the opposite of causal?

0:06:46.9 DK: So the opposite of causal is basically, if we had the entire conversation available. Let's say that we took this radio show recording and we wanted to process it after the fact. If we're processing it after the fact, we can look forwards, we can look backwards, we can use any parts of that conversation. But when we're having this conversation right now in real time, well, we don't have the luxury of looking forwards in time. So if you can't look forward in time and you can only look backwards, that's causal processing.
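
To make the distinction concrete, here is a minimal sketch (our illustration, not Cogito's code) of causal versus offline processing of a signal: the causal version only ever averages over past samples, while the offline version is free to look ahead.

```python
import numpy as np

def causal_smooth(x, window=5):
    """Causal: the output at time t uses only samples up to and including t."""
    out = np.empty(len(x))
    for t in range(len(x)):
        out[t] = x[max(0, t - window + 1): t + 1].mean()  # looks back only
    return out

def offline_smooth(x, window=5):
    """Non-causal (offline): each output uses past AND future samples."""
    half = window // 2
    out = np.empty(len(x))
    for t in range(len(x)):
        out[t] = x[max(0, t - half): t + half + 1].mean()  # looks forward too
    return out
```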

0:07:11.7 RS: Is that an advantage, when you can look at the entire context? Are there additional things that you can learn just by knowing how a conversation ended, for example?

0:07:20.4 DK: Yeah, absolutely, absolutely. And if you look at the kind of state of the art in natural language processing, it's for good reason that the best approaches right now are bidirectional: they use forwards and backwards processing of the text data contained within the documents to make the best inferences possible. Unfortunately, we don't have that luxury. We don't have that luxury if we want to do it in real time. We have to only look backwards, and that presents its own challenges.

0:07:47.9 RS: The part of delivering this with low latency, that's probably not something you worked on in academia, not part of a traditional machine learning practitioner's training, education, experience. Is that a different problem than what the typical AI practitioner is used to working on?

0:08:05.1 DK: Yeah, I think it is, and it is from multiple perspectives. From one perspective, there's the modelling approach: how do I actually set up my neural network architecture to be able to process in a way which is computationally efficient and which only looks backwards in time? And there are also challenges to do with how do I hang on to the salient, important data which happened prior in the conversation. That part is challenging, but achieving low latency and achieving the user experience we want actually requires cross-functional approaches, working with engineers and human-computer interaction specialists. We also have behavioural scientists at the company as well, because what we're trying to achieve is a user experience which produces positive behaviour change, and even if we produce really low-latency inferences and trigger guidance for the users, if they're not actually able to do anything with it, then it becomes pointless. So you have to achieve a user experience which is designed in a way that is really helpful for the user to actually achieve that outcome.
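
As a rough illustration of the causal-architecture point (an assumption-laden toy, not Cogito's model), the sketch below uses a unidirectional, stateful LSTM in Keras so that inference at time t depends only on frames up to t, with the LSTM state carried across arriving chunks; a Bidirectional wrapper would score better offline, but it needs the whole sequence up front.

```python
import tensorflow as tf

FEATURE_DIM = 40  # per-frame acoustic feature size; this dimension is an assumption

# Unidirectional + stateful => causal and streaming-friendly.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, FEATURE_DIM), batch_size=1),
    tf.keras.layers.LSTM(64, return_sequences=True, stateful=True),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. a "guidable moment" score
])

# Streaming inference: feed feature chunks as they arrive; the LSTM's internal
# state persists between calls, which is how the model "hangs on to" prior context.
# `feature_chunks` is a hypothetical iterator of (1, n_frames, FEATURE_DIM) arrays.
for chunk in feature_chunks:
    scores = model(chunk)  # causal: this chunk plus carried-over state, nothing ahead
```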

0:09:08.8 RS: How would you define positive behaviour change? And then how would you measure it?

0:09:13.6 DK: Yeah, that's a great question. One big challenge, which maybe people don't refer to as much in a lot of blog-post-level discussions of machine learning, is that labelling areas for guidance with human annotators is extremely challenging in the first instance. So we have an internal annotation team at Cogito who have gone through a lot of cycles of really understanding call centre interactions and understanding what parts of conversations are indeed guidable. Guidable in the sense of helping the agents to come across more effectively or more empathetically with a customer, or helping the agent to be more aware of data or documents or protocols which are useful in the particular scenario. So labelling those guidable regions of conversations, even before you get to machine learning, is challenging in and of itself. Then you have the modelling and the inference of this. But it's not just detecting the guidable behaviour; it's also assessing whether the intervention that you're providing, the feedback you're giving to the contact centre agent, is actually resulting in behaviour change. So including that in the feedback loop, in the algorithmic setup, is really key to understanding whether your guidance is actually useful or not.

0:10:26.5 RS: In the beginning, are you looking at conversations and trying to clue in on: here's where the conversation branches in this direction, here is where they could have gone down a different branch and had a better outcome?

0:10:40.4 DK: There's different ways of looking at it. So again, this whole problem is very cross-functional; it's not just machine learning scientists who solve this particular type of problem. We have behavioural scientists at Cogito who indeed look at the sorts of behaviours which are typically associated with effective calls, or that result in a good perception, a positive customer experience, and these sorts of things. As part of their research, they will identify different speaking behaviour patterns that we may want to detect and guide on. Following that process, we then need to operationalise an annotation approach, a way we can label the data, because of course we need labelled data to build our models, and then from there we'll look to apply machine learning techniques.

0:11:23.8 RS: Can I get a little wild with the implications here for a moment?

0:11:26.5 DK: Sure.

0:11:27.5 RS: Is there a possibility where I have on wearable tech, Google Glass, for example, and I'm running Cogito while I'm on a date, and it's telling me, "Don't say that, say this, and you'll have a better date."

0:11:38.8 DK: That is surprisingly not a crazy application of the technology, and in fact, our Chief Behavioural Science Officer at the company actually did his PhD research on a related area.

0:11:50.4 RS: No way.

0:11:52.5 DK: So actually, it's not as crazy as it sounds. From our perspective... Or from a business perspective, right now we're focused on the enterprise call centre, and that's just the area which makes the most sense to us right now from a business perspective. But we have applied our platform and our technology in healthcare settings in previous work, and we've had collaborations with some of the biggest hospitals in Boston, applying our technology there, and of course we are interested in and excited about applying this technology to other types of interactive scenarios going forward. It's just that right now, the main business impact is in the enterprise contact centre.

0:12:31.3 RS: Got it. Well, if you ever want to branch into the large and endless audience of feckless men on dates, I'm sure that could be lucrative for you. Could you maybe give an example of what this would look like? Say, in the example of an enterprise user who is using this in real time, what's an example of the kind of feedback that they would get while they were on a call?

0:12:55.1 DK: Sure, so imagine you're a contact centre agent. It's very, very helpful to actually put yourself in the shoes of these folks. I've travelled to a bunch of different contact centres over the course of my time at Cogito, and I think the majority of the time, contact centre agents take a lot of calls, they deal with a lot of challenging conversations, and, just from a screen real estate perspective, they have a lot of different applications up at the same time. So the feedback we provide from the Cogito dialogue system is basically small, targeted nuggets of guidance which occupy a very, very small area of the screen real estate. They can either have a very small mini-window that's kind of running in the corner of their desktop, or we can use slide-in notifications. And as for the types of feedback that we'll give: we provide feedback both to do with acoustics and speaking style.

0:13:53.8 DK: So for instance, if the customer is in a heightened emotional state, we'll provide some guidance to the call centre agent there, or if the agent is speaking far too quickly for the particular context, we'll provide feedback there. We'll also provide feedback on the content of the speech. So for instance, if the customer is referring to some product where there's an upsell opportunity, we can provide notifications which give the agent a quick hyperlink to some knowledge source that they can refer to. It can also be super important in terms of drug names. So imagine you have a new COVID drug and you're a new contact centre agent and you're not familiar with the Latin spelling and pronunciation of this particular strange word: we can detect that in real time and provide the feedback, so they don't have to try to spell it and look up knowledge sources based on that.
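
A minimal sketch of how that kind of real-time term spotting could look (our illustration, with hypothetical drug names and URLs, not Cogito's implementation): fuzzy-match each newly transcribed word against a lexicon and surface a knowledge-source link on a hit.

```python
import difflib

# Hypothetical lexicon: canonical spelling -> knowledge-base link.
DRUG_LEXICON = {
    "remdesivir": "https://kb.example.com/drugs/remdesivir",
    "molnupiravir": "https://kb.example.com/drugs/molnupiravir",
}

def check_token(token):
    """Fuzzy-match one incoming ASR token against the lexicon."""
    hit = difflib.get_close_matches(token.lower(), DRUG_LEXICON, n=1, cutoff=0.8)
    return f"Did you mean '{hit[0]}'? See {DRUG_LEXICON[hit[0]]}" if hit else None

# ASR emits words one at a time (causal again: we only ever see what has been said).
for word in ["the", "new", "drug", "remdesivere"]:  # last token is misrecognised
    tip = check_token(word)
    if tip:
        print(tip)
```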

0:14:47.5 RS: That is amazing. The agent could be like, "Oh, there's this drug, the name of it is escaping me right now," and then your technology would be like bing, here it is, so you can tell them about it.

0:15:00.4 DK: Yeah, yeah, exactly. And then the other thing is also providing feedback to the call centre supervisors, so we can also help them be more aware of how their team is doing, which is actually super important in the sort of COVID era, because there's been this massive migration of contact centre workers from the offices they typically worked in back to their homes. As a result of that sort of movement, contact centre supervisors who spent a large part of their time actually walking the floor in these offices could no longer do that, and the tool provides a means of having that sort of virtual walk-the-floor experience, so call centre agents can continue doing their job even in work-from-home conditions.

0:15:39.1 RS: Yeah, yeah, exactly. Okay, thanks for sharing that example, because that's a little less Black Mirror-y than how I was imagining it, where it was like, don't say that, say this.

0:15:48.2 DK: Yeah, I think it's important to bear in mind that what we're trying to do is not replace humans. I know there's, of course, been a huge focus on automation and chatbots and things like this; that has not been our focus at all. In fact, our focus has been to sort of acknowledge what humans are good at. We're good at being empathetic, we're good at dealing with unexpected problems which are outside the training set, those sorts of things. But where computers can be really useful, or machines can be really useful, is in consistency, or in finding data that is available in knowledge sources, and those types of feedback. Really leveraging the benefits of machines to help people be more effective in their job. That's really been the focus of the company, rather than replacing folks.

0:16:41.1 RS: Yeah, yeah, definitely. This is related a little bit, it sounds like, to your own research in academia across the street there at Trinity College, where you were looking at tone and pacing and other perhaps non-content-related elements of speech, and I'm so curious how that plays into Cogito and just language processing as an entire application of AI. Because there are so many things we clue in on when we're listening to speech, and especially in my job, I generate transcripts from the conversations I have, and when I read through them, I'm like, "Uh, it's just missing that spirit," or it's like, "Oh, that joke doesn't work in text because someone wasn't listening to the way it was delivered." And so I'm so curious how that plays in: the detection of someone's own lilt, how they might raise their tone at the end of a sentence, detecting things like sarcasm. How are these non-word-based elements of speech measured and then worked into Cogito, and how important is it to natural language processing as an entire application?

0:17:51.1 DK: Yeah, it's a great question. The sort of clichéd response from folks working in speech processing is that speech is much more than text, but it's really very true. Interactive, in-the-wild conversational speech looks very, very different than text: it looks very different than written articles, it looks very different than tweets, it looks very different than WhatsApp messages. It's an altogether different form of communication. Obviously, there are aspects of it which are shared, but there are aspects of it which are altogether different. And even like you said, even if you take perfect automatic speech recognition applied to an interactive conversation like we're having now, or for instance one that we would have in a café or a pub, sometimes you can look at that and it can be completely unintelligible, and at the very least it is an extremely lossy representation of what actually went on. Like you referred to, those aspects of the person's speaking style are completely missing there. The other kind of interesting aspect to it is that...

0:18:53.0 DK: Seemingly straightforward and simple linguistic concepts from text can actually be a good bit more nuanced when you're dealing with conversations. So even a concept like questions, okay, I could be a contact centre agent, I could ask you what's your social security number, for instance, like I could ask that, but a lot of the time questions are more subtle than that. They may be implied; there's not a clear question mark where I'm asking for a response from you. We also looked at the concept of overlapping speech and interruptions in some previous research. We found nine different ways in which overlaps and interruptions can happen in conversational speech, and it's actually really interesting: a lot of the time when you're interrupting somebody, you're not necessarily speaking at the same time as them.

0:19:38.3 DK: In fact, they could be pausing, and very clearly their intonation says they're still thinking about the issue, and then you may come in straight away, and that's perceived as an interruption. So seemingly simple and straightforward concepts, linguistic concepts applied in normal text formats, can just be a lot more nuanced in conversational speech. And then the other challenge for a machine learning practitioner is that, like I said, if you represent speech as text, if you just apply automatic speech recognition and use that as your base, as your raw data, you're missing a huge amount of the information. There have been a lot of really major breakthroughs in the natural language processing field over the last several years, in particular in finding really effective representations of text data; you've got this word embedding stuff, word2vec, and all these transformer models. But this needs to be combined with representations to do with timing, to do with intonation, to do with prosody, in order to really have all of the information available to make [0:20:42.0] ____ inferences.
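
One simple way to picture that combination (a sketch under assumed shapes, not Cogito's pipeline): pair each word's text embedding with prosodic measurements, pitch statistics and duration, taken over that word's time span, so a downstream model sees more than the bare transcript.

```python
import numpy as np

def fuse(word_embeddings, word_times, f0_track, frame_rate=100):
    """word_embeddings: (n_words, d) text embeddings; word_times: list of
    (start_s, end_s) per word; f0_track: 1-D array of per-frame pitch values."""
    fused = []
    for emb, (start, end) in zip(word_embeddings, word_times):
        frames = f0_track[int(start * frame_rate): int(end * frame_rate)]
        if len(frames) == 0:            # guard against words shorter than one frame
            frames = np.zeros(1)
        prosody = np.array([frames.mean(), frames.std(), end - start])
        fused.append(np.concatenate([emb, prosody]))  # text + prosody, per word
    return np.stack(fused)
```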

0:20:42.5 DK: The other challenge to do with it as well is when you're dealing with multiple parties. If the two of us are interacting back and forth, let's say I want to try to make an inference on how engaged this conversation was. I'm just realising at the moment I've been speaking for quite some time, so maybe the engagement level has gone down a bit. But in terms of trying to actually infer that level of engagement, well, I need to look at the speech from both parties, but how do I synchronise it? I'm saying words now, you're not saying words now, so how do I actually fuse those representations of our speech in a way which is really effective? So that multimodal processing and synchronisation is a really key challenge of this area as well. So yeah, speech is definitely more than text.
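
One naive answer to the synchronisation question, as a toy illustration (ours, with made-up window sizes, not Cogito's method): put both speakers on a shared clock by averaging each party's features within fixed-width time windows, then concatenate per window; windows where a party is silent simply stay at zero.

```python
import numpy as np

def align_parties(agent_frames, customer_frames, hop_s=0.5, duration_s=60.0):
    """Each input is a list of (timestamp_s, feature_vector).
    Returns an (n_windows, 2*d) array: agent features beside customer features."""
    n = int(duration_s / hop_s)
    dim = len(agent_frames[0][1])

    def bucket(frames):
        grid, counts = np.zeros((n, dim)), np.zeros(n)
        for t, vec in frames:
            i = min(int(t / hop_s), n - 1)  # shared window this frame falls into
            grid[i] += vec
            counts[i] += 1
        return grid / np.maximum(counts, 1)[:, None]  # mean per window

    return np.hstack([bucket(agent_frames), bucket(customer_frames)])
```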

0:21:25.8 RS: The reason you've been talking for so long is because you got in my head about the interruption thing, and I was like, "Oh, I better not say anything, I don't wanna [0:21:31.5] ____ Cogito's technology." That was so fascinating, and just so many different challenges. It strikes me that there's potential in every application of AI for bias to creep in, and I'm sure yours is no different. Where would you say are the areas in your technology where there's a possibility for bias to creep in, and how can you work to make sure that doesn't happen?

0:21:54.5 DK: I think when you're thinking about bias or unfairness, it's important to consider the problem holistically, and it's important to consider bias both for machines and for humans. So let's start with machine learning models. You're building a machine learning model, and before you even start collecting data, if you want to be serious about bias, well, you need to do two things to start with. First, you need to decide what your definition of bias or unfairness is. There are different definitions out there, from demographic parity, equality [0:22:26.5] ____, there's a bunch of different fairness concepts that exist, and they're not all the same; in fact, optimising for one may lead to a degradation in the other. So you really need to define at the start: what is fairness for us? The other thing you need to do is identify what people sometimes refer to as protected demographic variables: are there some slices of the population which are of concern and may suffer negative effects or bias from machine learning models? For what I'm gonna talk about here, you can take the example of elderly speakers. Okay, so let's say we have a definition and we have identified we're concerned about elderly speakers, so the next step is sort of sampling.
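
To make it concrete that these definitions are genuinely different quantities, here is a minimal sketch (our illustration, with hypothetical labels) of two common ones: demographic parity compares positive-prediction rates across groups, while equality of opportunity compares true-positive rates, and a model can satisfy one while violating the other.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Gap in positive-prediction rate between two groups (0/1 arrays)."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    """Gap in true-positive rate between groups (equality of opportunity)."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
group  = np.array([0, 0, 0, 1, 1, 1])   # hypothetical protected attribute
print(demographic_parity_gap(y_pred, group))         # -> 0.67
print(equal_opportunity_gap(y_true, y_pred, group))  # -> 0.50
```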

0:23:09.3 DK: So if I'm building a machine learning model, I need to sample data to create my training set, and you'd better make sure that this protected variable is sufficiently well represented in that training set. If, for instance, I don't have any elderly speech in my training set, well, when I apply the model in production, it's going to have much lower accuracy than it does for the demographic categories which have been well represented in the training set. So data sampling is really important. The next is labelling. The vast majority of commercial machine learning systems are at least partly based on labels, labelled data, and those labels very often come from humans, and humans can be biased. So you need to have practices and protocols in place that allow you to detect and mitigate bias that can happen from human labelling. Again, you have to be holistic in this.

0:24:00.9 DK: You have to think about recruitment of your human annotators so that they come from diverse backgrounds, you need to ensure that you have multiple human labellers per sample so that you can detect disagreement and potentially bias, and then you often need to do auditing exercises as well to ensure that there's not unfairness being injected at that point. Bear in mind that the machine learning models will look to be an estimate of these human labels, so if they're biased, then the model will perpetuate that bias and possibly even exacerbate it. Next, it's really important to have metrics to do with unfairness. We talked at the start about defining what we mean by bias or unfairness; well, it's really important, when you're analysing the performance of your models, to have metrics which don't just look at accuracy but that look at metrics related to bias.
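
The "multiple labellers per sample" safeguard is often quantified with chance-corrected agreement; as a small sketch (with hypothetical labels), Cohen's kappa between two annotators flags low-agreement samples or labellers for the kind of audit John describes.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1]  # hypothetical "guidable moment" labels
annotator_b = [1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # -> 0.67; near 1.0 = strong agreement, near 0 = chance
```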

0:24:49.7 DK: There's actually been a really interesting set of software work to enable this recently. People may be familiar with TensorFlow, which is a machine learning framework provided by Google, but there's this TensorFlow Extended framework, which is basically a set of libraries that help machine learning scientists have effective, reproducible machine learning pipelines, and they actually have a module within the framework which is specifically targeted at fairness, and you can extend that with your own definitions and your own analysis as well. So having those metrics and building them into the process is super important. There are also techniques... Then let's say we have our model in production and we identify that there is, look, some bias happening towards some demographic category. Well, we can actually use machine learning techniques to do some debiasing. There was a paper by researchers from Google and Stanford, I think one of the first ones, looking at adversarial training techniques to do debiasing, and we've actually got some published work ourselves on gender debiasing in speech emotion recognition. There are other techniques, like gradient reversal techniques, which look to basically unlearn those biased representations from the model.
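
For readers unfamiliar with the gradient reversal trick, here's a minimal sketch (a generic illustration, not the specific method from either paper mentioned): the layer is the identity on the forward pass but flips the gradient's sign on the backward pass, so an adversarial head trained to predict, say, gender pushes the shared encoder to discard that information.

```python
import tensorflow as tf

@tf.custom_gradient
def grad_reverse(x):
    def grad(dy):
        return -dy               # flip gradients flowing back into the encoder
    return tf.identity(x), grad  # identity on the forward pass

class GradReverse(tf.keras.layers.Layer):
    def call(self, x):
        return grad_reverse(x)

# Typical wiring: encoder -> task head (normal loss), and
# encoder -> GradReverse() -> demographic classifier (adversarial loss),
# so training the adversary to succeed unlearns the protected attribute upstream.
```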

0:26:03.9 DK: So there are techniques that can be done there as well. The really important thing with building machine learning models is to be really holistic about this approach. But then there's also bias in humans. In our application, we've got contact centre agents and we've got customers, and both of them can potentially be extremely biased. A key part of our technology is providing objective, consistent feedback to agents that can help them be aware of their own unconscious biases, and can also escalate a call to the supervisor if, for instance, a caller is being extremely biased towards the agent for some reason related to how they sound, which is also a phenomenon that happens. So the key thing is being holistic and considering both bias to do with the machines and bias to do with people as well.

0:26:53.7 RS: And then when it comes to the vendors you partner with, say, wherever you acquire your data or any other external tools you might use to help you generate your technology, how can you assess those vendors to make sure that what they're giving you takes into account the holistic approach you just outlined?

0:27:09.5 DK: Yeah, that's a great question, and I think it really relates to the metrics and how they analyse the performance of their models. Ask a vendor: what are the metrics that you use to assess the performance of your model? What demographic categories did you ensure that the model is not biased against? How was your training set represented across these demographic variables? Questions like that will let you know whether they've taken the issue of bias seriously or not.

0:27:36.0 RS: And then what's an example of a satisfying answer there?

0:27:38.9 DK: Well, if you hear a vendor say, "Well look, we actually factored our analysis to consider different demographic variables like age and gender, maybe ethnicity or other types of demographic [0:27:53.3] ____ like that, and we included metrics related to those groupings as part of our test set." Well, that's a pretty good sign that they're taking it seriously.

0:28:02.8 RS: Yeah, yeah, makes sense. And I think that's great advice, because this is gonna come up for any AI practitioner constantly, and you need to be vigilant, because otherwise it will end up in your technology. This was such a fascinating conversation, and we are approaching optimal podcast length here, but before I let you go, I'm just so fascinated by the potential applications of your technology. Without sharing the contents of one of your office whiteboards or, you know, your long-term product roadmap, what is your pie-in-the-sky fantasy about a long-term, really aspirational application of this technology?

0:28:41.0 DK: Well, okay, [chuckle] so I think that there are applications of this technology in all forms of human interaction. So imagine a podcast right now, like we're doing, giving myself some feedback in terms of how I'm coming across, whether it indeed seems like what I'm saying is being registered or not; that would be super, super helpful for me in that scenario. You can think of business presentations and things like this, where you're maybe practising what your presentation is gonna be like to an audience and having that sort of real feedback as part of that practising exercise. Multi-party Zoom conversations: trying to ensure, say in business meetings over Zoom, that there is sufficient time and fairness given to the different parties involved in that interaction, that it's not just one person sort of steamrolling the whole conversation, and that various parties in the meeting get their time and there's fairness applied in that respect. So basically, any applications whereby we can really help people perform at their best and also kind of preserve fairness are definitely ones that we would like to apply our technology to.

0:30:01.9 RS: How AI Happens is brought to you by SAMA. SAMA provides accurate data for ambitious AI, specialising in image, video and sensor data annotation and validation for machine learning algorithms and industries such as transportation, retail, e-commerce, media, MedTech, robotics and agriculture. For more information head to sama.com.