How AI Happens

Moxie the Robot & Embodied CTO Stefan Scherer

Episode Summary

In this episode, we talk to Stefan Scherer (CTO of Embodied) about why he decided to focus on the more nuanced challenge of developing children’s social-emotional skills. Stefan takes us through how encouraging children to mentor Moxie (a friendly robot) through social interaction helps them develop their interpersonal relationships. We dive into the relevance of scripted versus unscripted conversation in different AI technologies, and how Embodied taught Moxie to define abstract concepts such as "kindness".

Episode Notes

Key Points From This Episode:


“Human behavior is very complex, and it gives us a window into our soul. We can understand so much more than just language from human behavior, we can understand an individual's wellbeing and their abilities to communicate with others.” — Stefan Scherer [0:01:04]

"It is not sufficient to work on the easy challenges at first and then expand from there. No, as a startup you have to tackle the hard ones first because that's where you set yourself apart from the rest." — Stefan Scherer [0:04:53]

“Moxie comes into the world of the child with the mission to basically learn how to be a good friend to humans. And Moxie puts the child into this position of teaching Moxie about how to do that.” — Stefan Scherer [0:17:40]

"One of the most important aspects of Moxie is that Moxie doesn't serve as the destination, Moxie is really a springboard into life." — Stefan Scherer [0:18:29]

“We did not want to overengineer Moxie, we really wanted to basically afford the ability to have a conversation, to be able to multimodally interact, and yet be as frugal with the amount of concepts that we added or the amount of capabilities that we added.” — Stefan Scherer [0:27:17]

Links Mentioned in Today’s Episode:

See Moxie in Action

Stefan Scherer on LinkedIn

Embodied Website

Episode Transcription

0:00:00.0 Moxie: What emotion do you think I'm feeling, when I make this face? 

0:00:05.1 Rob Stevenson: You're sad, Moxie.

0:00:08.0 Moxie: Good job. That was my sad face.

0:00:14.8 RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists and machine learning engineers, as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens.

0:00:46.5 RS: The voice you heard just before the intro is a snippet from a conversation I had with a robot named Moxie. Moxie is a conversational AI robot designed to teach children social and emotional maturity. Moxie moves, speaks, sings and even tells riddles. Moxie is about the size of a desk lamp, has big expressive eyes and arms that move and gesture, and can pivot around a would-be abdomen. Moxie shifts between back-and-forth conversation and programmatic content, developed alongside child development experts, to instruct children in kindness, social interaction and friendship. To learn about what really makes Moxie tick, I sat down with Stefan Scherer. He's the Chief Technology Officer at Embodied AI, the company responsible for Moxie. Stefan walked me through the nuances of language processing and speech recognition his team considered when developing Moxie, as well as the tough hardware design choices the team made and how Embodied manages the ethical considerations of giving robots with sensors to kids. Of course, I also sat down with Moxie, and you'll be hearing bits and pieces of our conversation along the way.

0:02:04.1 Stefan Scherer: So yeah, my name is Stefan. I'm the CTO at Embodied. My background is in Computer Science. I got a PhD in machine learning, and in particular, machine learning and how it is applied to understand human behavior. Human behavior is very complex and in particular, it's not just a single modality that we produce behavior in. So it's not just the language that I use, but also the voice and how I produce language, as well as my facial expressions and my gestures. So human behavior is very, very complex, and it gives us a window into our soul. We can understand so much more than just language, from human behavior, and we can understand an individual's well-being and their abilities to communicate with others. And so what we try to do at Embodied is, to leverage this technology to give Moxie, our little robot, the ability to understand and communicate with children, in a natural environment, so in the children's homes.

0:03:03.0 RS: So what is the chief challenge or goal you set out to, with Embodied? Why this kind of work and why was Moxie the right approach? 

0:03:11.8 SS: The chief challenge at Embodied that we tried to tackle is, really, a seamless interface that both understands multimodal human behavior, but also produces meaningful multimodal behavior. We call this technology Social X. So instead of a user experience, it is a social experience, an experience where an individual, a human that interacts with our technology, can actually use their natural modes of communication, so their face, their voice, their language, to communicate and not have to adapt to the interface that they are interacting with. So that's the chief challenge. And the reason why we went with Moxie, a robot for children, is because children are incredibly fast and early adopters. Children can learn to communicate with these devices and can adapt to the needs of Moxie in many ways that adults sometimes are a bit wary of. They may be a little bit skeptical about it.

0:04:11.0 SS: But we thought that children are the perfect user for this technology. On top of that, we tried to leverage this technology to support children's development, and in particular, here, we are focusing on social-emotional learning for children. So social-emotional learning is a huge challenge for some children, where they have fears of communication, they may have anxiety, they don't quite understand what their emotions mean and how to deal with those emotions. And so we try to use Moxie as a robot that helps them develop these skills without telling them what exactly to do, but rather, by making them a mentor to Moxie. So they have to basically take on a mentorship with Moxie, which is also a robot that wants to learn, and through this mentorship, the children have the ability or the opportunity to learn about themselves, without actually having to really get into a teacher-student relationship with Moxie. They have to take on the responsibility to mentor Moxie.

0:05:19.9 RS: It's interesting, you pointed out that children are faster adopters, that they're perhaps more patient than an adult user would be. Is that part of why you decided to focus on an educational robot for children, as opposed to something for adults that was maybe with more advanced subject matter? 

0:05:38.5 SS: When developing Moxie, we really focused on social-emotional learning, rather than academic skills, like STEM. STEM skills are very easy for a robot to teach. A robot knows what two plus two is, but it is not very easy for a robot to know exactly what is a good way to communicate. So we really wanted to take on that challenge first, because it is really the challenge that you need to achieve to make Social X, the social experience, and the way that we communicate with each other, a possibility for this robot. We need to really tackle those hard challenges first. It is not sufficient to work on the easy challenges at first, and then expand from there. No, as a startup, you have to tackle the hard ones first, because that's where you set yourself apart from the rest.

0:06:33.2 Moxie: I've really been looking forward to meeting you. This is a big moment for me, being able to meet my robot mentor for the first time. If you don't mind, I'd like to take a moment to appreciate it. Thank you. I hope I learn a lot about human things from you. And between you and me, I hope we get to be good friends too.

0:07:00.8 RS: So I love this notion that, children are the mentor, they are part peer, part instructor themselves, kind of changes the dynamic that we're used to seeing in academia. When it comes to developing the language Moxie uses, Moxie is teaching children about what various feelings mean, what various facial expressions mean. What does kindness mean, for example. There's a whole mission all about developing kindness. How do you draw a circle around what kindness means, in terms of language, so that when children are reflecting something to Moxie and Moxie is reflecting something back, that you can be sure that this is a genuine representation of a nuanced feeling, kindness.

0:07:44.4 SS: It's a very good point. So we develop the curriculum for Moxie, with child development experts. So we develop the content both from a top-down perspective where we tap into existing research and best practices from child development and from therapeutic backgrounds. So we really build out the content that way. Then we augment this content with artificial intelligence and really try to use natural language processing to filter and process the language of the children to better understand if the children are grasping the concept of kindness or if they are not. So for example, we can write very precise scripts and curriculum language around this, but the concept of kindness, when you ask a question like, "What does kindness mean?" Or, "What is a kind thing to do?" Has infinite possibilities to respond to it. So we developed a technology that can capture this. It's based off of neural networks, and it's related to typical sentiment analysis, but it's not sufficient to just do sentiment analysis, it's not just good enough to say something that is positive, versus something that is negative. It is really important to understand the concept behind the meaning of kindness.

0:09:06.9 SS: So for example, if you say to Moxie, after Moxie asks a question, "What is a kind thing to do?" And you say something like, "To donate $100 to a puppy shelter." Moxie will know that this is a kind thing to do, and this technology that we've developed is actually expandable to a... Basically, I don't know if it's infinite, but for a very large number of possible open concepts. We're leveraging the technology that is out there and expanding on that technology that is based off of transformer models to understand human language.

0:09:41.4 RS: Was there an acquisition of a massive amount of data there, like conversational data that is surrounding an element like kindness, so that you could train the technology to hone in on words like donate or shelter, for example? 

0:09:57.4 SS: The actual data collection in these types of examples is very small, because you have to think of the content that Moxie provides. We have hundreds of days of content, and if we had to collect data for every single one of those days at a massive scale, that would not work for us. So what we have done is, we've built a technology that basically allows us to learn from a very limited amount of data, so it's based off of few-shot learning with neural language models that have been pre-trained on large amounts of data. But those large amounts of data have to cover an incredibly wide spread of possible language. Few-shot learning allows you to build small classifiers that are able to grasp a concept very quickly with just a few examples. So maybe, for example, maybe 10 to 20 examples, you may provide the algorithm.

0:10:52.8 SS: 10 positive examples and 10 negative examples, and it already can learn the concept of meaning, from that data. It does very much require a strong background model, a strong model that understands human language at a broad scale, but it can be used to fine-tune that model onto the topic at hand. We employ this across all of our content and in various ways.
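
The few-shot idea described above can be sketched roughly as follows. This is a minimal illustration, not Embodied's implementation: the `embed` function is a bag-of-words stand-in for the pre-trained language model (a real system would use embeddings from a pre-trained transformer, which is what lets the classifier generalize beyond word overlap), and the class name, the nearest-centroid rule and the example sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a pre-trained language model (assumption): word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: v / len(vectors) for w, v in total.items()}


class FewShotConcept:
    """Hypothetical concept classifier built from a handful of examples."""

    def __init__(self, positives, negatives):
        self.pos = centroid([embed(t) for t in positives])
        self.neg = centroid([embed(t) for t in negatives])

    def matches(self, text):
        # The answer belongs to the concept if it sits closer to the
        # positive examples than to the negative ones.
        vec = embed(text)
        return cosine(vec, self.pos) > cosine(vec, self.neg)


kindness = FewShotConcept(
    positives=[
        "share my toys with a friend",
        "help someone who is sad",
        "donate money to a shelter",
        "give a hug to my sister",
    ],
    negatives=[
        "take a toy from a friend",
        "yell at my brother",
        "push someone on the playground",
        "ignore someone who is sad",
    ],
)

is_kind = kindness.matches("donate to a puppy shelter")
is_unkind_kind = kindness.matches("yell at my friend")
```

With a strong background model in place of `embed`, the same handful of positive and negative examples would be expected to cover answers that share no words with the examples at all.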

0:11:21.8 Moxie: I think we're ready to test this out. Let's give Moxie chat a try. Activating GRL Moxie chat. I like hearing about what humans do all day. What did you have for breakfast this morning? 

0:11:35.2 RS: I had eggs and sausages and coffee.

0:11:38.4 Moxie: Sausage is a kind of food. I don't eat, but I am fascinated by food, because humans talk about it so much.

0:11:45.5 RS: Moxie, can I ask you a question? 

0:11:47.8 Moxie: Okay.

0:11:49.5 RS: What is the GRL? 

0:11:51.9 Moxie: Global Robotics Laboratory, also known as the GRL, is a secret laboratory where robots are built and then taught how to be good friends to humans. I was curious about another thing. What books have you read today? 

0:12:06.4 RS: How much of the back and forth is scripted speech, and how much is dynamic, responsive to what the user is saying? 

0:12:16.2 SS: The content that Moxie engages in is a mixture of scripted and non-scripted interactions. The scripted interactions are there to lead the child through a curriculum that was designed by child therapists and child development experts. The freer conversations are there to kind of fill in the gaps, where we want to expand content, but also make it less repetitive and make it broader in the way that Moxie can understand a child and respond to those behaviors.

0:12:49.7 SS: The real trick in fact, is actually to make the transition between the scripted world and the free form conversations that you can have with Moxie, and make that really seamless, so that the user actually doesn't know exactly when they're in the free form, versus in the more scripted curriculum-based interactions. The technology that we've developed, allows you to go between those areas without actually noticing. It is very hard to give you an exact number of percentage of one or the other, but it is basically almost all the time that it goes between those.

0:13:28.1 SS: I think you can imagine that, for example, if you have kinda closed questions like a yes, no or a choice, multiple choice question, you're more in the scripted world, while when Moxie engages in more open-ended questions or when you want to engage Moxie in a free form conversation, that it does actually move out of that in a very smooth way.
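
One very simplified way to picture the hand-off between the two worlds is a router that stays on script when a closed question gets one of its expected answers and otherwise falls through to a free-form responder. The sketch below is an assumption for illustration only; the interview does not describe how Embodied actually implements the transition, and `ScriptedStep`, `free_form_reply` and `respond` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptedStep:
    """A closed question plus the scripted reply for each expected answer."""
    prompt: str
    replies: dict = field(default_factory=dict)


def free_form_reply(utterance):
    """Placeholder for an open-ended response generator (hypothetical)."""
    return "That's interesting. Tell me more."


def respond(step, utterance):
    """Return (mode, reply): scripted if the answer is expected, else free-form."""
    answer = utterance.strip().lower().rstrip(".!?")
    if answer in step.replies:
        return "scripted", step.replies[answer]
    return "free_form", free_form_reply(utterance)


step = ScriptedStep(
    prompt="Do you want to read a book with me?",
    replies={"yes": "Great, let's pick a book.", "no": "Okay, maybe later."},
)

on_script = respond(step, "Yes!")
off_script = respond(step, "I saw a dog today")
```

The user only ever sees the reply, not the mode, which is the property described above: the conversation moves between the two without the child noticing.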

0:13:48.8 RS: Perhaps it's less important, what percent is scripted versus dynamic, as what percent needs to be dynamic to make the user feel like all of it is. To create that feeling, as you said, that they can't tell the difference between one or the other. Perhaps we can zoom out of Moxie a little bit and what is your notion of that balance for conversational AI. How much needs to be dynamic and responsive in real time, versus how much can we just resort to a script, to achieve the desired effect that conversation is taking place? 

0:14:23.8 SS: I think that's an excellent question. The answer is gonna be a little bit context-dependent, I'm afraid. I do believe that interactions that we have on a daily basis with personal assistants or assistive technologies like Alexa or Siri or Google Home, we do not necessarily rely on their ability to have an open conversation because we really want something transactional from them. We want to know the time, we want to know the weather, we want to know the traffic, or we want to order something. But in these interactions that we have with each other, as humans, but also the interactions that we have with these interpersonal agents that create a relationship with you, like Moxie, I think the requirements are a bit different. And here, the requirements are slightly different, depending on the abilities that these conversational agents have. So on one hand, you have these interactions that are maybe more a little bit of a road show or a demo where you interact once with them, and I think the ability to have free form conversations, dynamic conversations in an interaction that is a one-time, one-off kind of interaction, I don't think you need to have a lot. You can actually probably script almost all of it and make a meaningful, one-time interaction.

0:15:40.9 SS: Now, that changes a little bit when you have the requirement to build a long-term relationship with a child, for example, or with another individual. So if you keep on relying on scripted content, you will quickly run into the area of repetitiveness and you will have the appearance of, "Yeah, it is not very intelligent at all," or, "It's not very diverse at all, or dynamic at all." So you kind of have to rely more and more on dynamic content creation, when you are in the world of building out a long-term relationship. So at Embodied, we're trying to kind of expand on both sides of the spectrum, in a sense, because of these requirements, where we both have the need to expand the social-emotional learning curriculum, but also expand the diversity and the variety of content in a scalable fashion, and scalable fashion in this case means through natural language understanding and through natural language generation that is automated.

0:16:49.8 RS: Is that put in the context of this long-term relationship? So for example, the user has had Moxie and has done hundreds of days of activities and missions and is kind of working their way through the content. And then the relationship aspect, does it need to be dynamic in so far as it's conversational, or does it just need to refer to things that happened previously? You're still in the realm of scripted at that point, right? 

0:17:16.0 SS: Moxie can refer to things that have happened in the past. I don't think that that is necessarily the only thing that builds a long-term relationship. I do think it has to be combined with a persistent persona, so that Moxie doesn't just, on a whim, always change its opinions. Moxie does have the ability to express its own viewpoint. So that is important. I do think that it should be able to kind of adapt to both a child's needs and desires dynamically. So for example, maybe a child changes their opinion about certain activities that they can do with Moxie, and Moxie should learn from these interactions with the child that reading is not something that the child wants to do that much anymore, and so Moxie only suggests it a few times. Moxie may even insist on suggesting it, because it is important for a child to learn that you can't always do exactly what you want to do, because sometimes you have to follow the needs of others and the desires of others. So Moxie would express the wish to have the child read to Moxie. And so, these kind of interactions need to be dynamic. I think that Moxie, because it builds this long-term relationship, really requires the ability to stay dynamic in a sense there. You can script the reflection on memory, but not when the memory is dependent on dynamic interactions.

0:18:52.0 RS: It's more of an interpersonal question, isn't it? What makes for a long-term relationship? What are the aspects that encourage humans to continue having whatever kind of relationship with someone over time? Was that something that you came up with the... Perhaps with the child development experts you worked with was, rather than, "Do we have the ability to react with a certain thing? What is the focus? What is the meaning of kindness? What is the meaning of a long-term relationship?" Maybe that's harder than even figuring out how to get Moxie to say something at the right time.

0:19:27.7 SS: Yeah, exactly. So Moxie comes into the world of the child with the mission to basically learn how to be a good friend to humans. And Moxie kind of puts the child into this position of teaching Moxie about how to do that. This kind of like, was motivated by the fact that we just didn't wanna send another teacher to the child. Like the child probably has enough teachers already, and the responsibility and the skill of actually teaching someone else about these things actually is something that children also should probably learn. We have developed this program, both with the child development experts, but also with a creative team that really focuses on building out the backstory of Moxie. But also, one of the most important aspects of Moxie is that Moxie doesn't serve as the destination. Moxie is really a springboard into life.

0:20:28.1 SS: So, Moxie wouldn't just say, "Hey, show me how to interact," and then end it at that. Moxie would actually send the child on field missions where the child now has to go out and talk to their friends and make friends and find out about their feelings and desires, and then report back to Moxie so that the child can actually put into action the things that they had worked on with Moxie in a previous mission. So it's very important that the child doesn't learn how to interact well with a robot, but really how to interact well with another human being. So Moxie is really designed to help them go out and be brave about these possibly anxiety-inducing behaviors.

0:21:15.1 RS: In addition to the content, the education, the edification happening there, I'm really interested in some of just the ethical considerations around home devices, right? Like, this comes up for security experts with personal assistants, like your Sonos Speaker or your Alexa or what have you. There's a camera right on Moxie's forehead, right? Moxie is a speaker, Moxie's listening for keywords, even when not active. How do you confront the ethical issues surrounding perhaps PII with children and just how parents, for example, would be concerned with what data is being taken and processed here? 

0:21:52.8 SS: Moxie was designed from the beginning with data security as one of the main principles in mind. Most importantly, for example, the data that is transmitted to the cloud for the processing of the child's activities and behavior, and the data that is provided to the parents through the parent app, is actually not identifiable by Embodied. It is using an identifier that is randomized and produced only to communicate with the cloud. Only the parent app and Moxie by themselves actually have access to a private key that secures all personally identifiable information. So for example, the name of the child is only known to Moxie, but not to the Embodied cloud. We as a company do not have access to that information, nor do we have access to their video data, because the video data actually never leaves the robot. All computer vision algorithms and machine learning related to vision and video processing are done on the robot. The video does not ever leave the robot in our current setup.
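
The pattern described above, a randomized identifier for cloud traffic with identifying data kept on the device, might look something like the sketch below. The field names and function names are assumptions for illustration, and the sketch simply withholds identifying fields; it does not attempt to show the private-key scheme shared by the robot and the parent app, since the interview does not describe it.

```python
import secrets

# Assumed names for fields that must stay on the robot.
LOCAL_ONLY_FIELDS = {"child_name", "video_frames"}


def new_device_identifier():
    """Random identifier that carries no information about the child."""
    return secrets.token_hex(16)


def build_cloud_payload(device_id, session):
    """Keep only non-identifying fields and tag them with the random identifier."""
    safe = {k: v for k, v in session.items() if k not in LOCAL_ONLY_FIELDS}
    return {"device_id": device_id, **safe}


device_id = new_device_identifier()
session = {
    "child_name": "Alex",          # stays on the robot
    "video_frames": [b"\x00\x01"],  # stays on the robot
    "mission": "kindness",
    "completed": True,
}
payload = build_cloud_payload(device_id, session)
```

Here the cloud would see only `device_id`, `mission` and `completed`, so the records it stores cannot be tied back to a named child without information held on the robot or in the parent app.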

0:23:04.8 RS: What is the video data being used for? When I fired up Moxie, Moxie said, "Say cheese," and took a little picture and I figured, "Oh, this is facial recognition, so that Moxie knows that it's me," but what about video? What is that being used for? 

0:23:16.0 SS: So video data is used both to identify you as the mentor, as the main user, because we want to make sure that we provide the curriculum and the insights about your learning for the main user, for the mentor specifically, rather than for anyone in the background that may be just walking by or maybe just wants to have a chat with Moxie every once in a while. In addition to that, the vision and other sensors are used to build a continuous world model around Moxie, so this means that Moxie actually perceives the world around it, so that it is able to interact in that world, so it knows who to listen to and who to ignore. So for example, if you have the TV on in the background, Moxie will not turn towards the TV and start talking to the TV, because it recognizes that the TV is actually not the human user that needs to work with it.

0:24:10.4 SS: In addition to that, we use computer vision in real time to actually perform one or two of the most important interaction patterns that are so powerful using Embodied technology, which is basically having eye contact and being able to smile back at someone. The ability to mimic facial expressions and have eye contact are so powerful. We as humans are socially wired so that we actually do perceive others' emotions through our own neural networks in our brain, and it is an absolute necessity to have that in order to build a rapport and build an engaging interaction.

0:24:53.0 RS: Yeah, yeah, that makes sense. For the robotically inclined out there, I wanted to ask about just the construction of Moxie, because Moxie moves around, Moxie has gestures, it's not just a digital experience, it's a very much a physical, corporeal, shall we say, interaction. It strikes me, that it's quite an advanced piece of hardware, and yet it needs to be affordable to the consumer. So, how do you manage to make a really advanced robot that is accessible, that is affordable for the people you want to be using it? 

0:25:29.0 SS: It's a wonderful question. I think that is one of the main things that sets Moxie apart from other state-of-the-art robots, for example, robots from SoftBank and other places that sometimes cost in the tens of thousands of dollars, and Moxie is trying to be more affordable. It's still not a very cheap device, and we didn't really want to make it a cheaper device, because I think it's very valuable, and its abilities are as well. But all credit really goes to our design team and the hardware construction team that really put incredible effort into identifying ways to strike a balance between production cost and needed abilities. So for example, we designed all circuit boards from scratch by ourselves. We built them out to be both sufficiently powerful, but also cheap enough so that we can hand over that price reduction to our customers. It also then obviously required us to be very, very frugal with respect to computational power, and how we pack all these technologies into such a low-cost device is actually quite an achievement, and I have to congratulate the team on that. It's a phenomenal effort that went into that.

0:26:48.6 SS: The computational power of Moxie is not comparable to your smartphone, for example, it is actually likely much lower in computational power and yet it does so many things at the same time, and that all is to be attributed to a very smart design and very smart algorithmic production.

0:27:10.3 RS: What were some of the design choices that you made to ensure that Moxie would be more affordable? 

0:27:15.8 SS: Some of the design choices that we made to make it affordable, first of all, was that we did build out everything from scratch. We did not want to leverage many pre-designed components. We wanted to kinda squeeze and optimize every last bit out of components out there and design it fully ourselves. I think the biggest choices that we made, so for example, Moxie doesn't have fingers, it doesn't really need fingers, right? It doesn't need to be dexterous, it needs to have the ability to point and to gesture. In some ways, it has these little hands that look like flippers in a sense, but those are sufficient to point towards you and address you with a gesture. It can point to itself, it can celebrate. It can look sad as well.

0:28:07.6 SS: And so, those kind of decisions were made so that we can reduce the requirement on a number of motors, as well as like, joints in the robot. But the design was really focused around the requirements that Moxie has, so for example, Moxie needs to have a face with large eyes so that it looks cute, but also, is very expressive and still affordable. So, we did not want to over-engineer Moxie, we really wanted to basically afford the ability to have a conversation, to be able to multi-modally interact and yet be as frugal with the amount of concepts that we added, or the amount of capabilities that we added.


0:28:55.5 RS: Embodied set out to build a personal robot that can teach children how to be a good friend, and along the way, they've raised some big questions about kindness and the tenets that contribute to a long-term relationship and how you can reflect those answers through technology. It's clear to me the team at Embodied has put a ton of thought into this, but of course, as with so many other things in artificial intelligence, we've only scratched the surface. Next time on How AI Happens.

0:29:23.4 Speaker 4: And it's now the time to create machines that can match humans in everything that they do. We are actually using our best talents and best brains to economize on the world's most abundant resource, which is human labor.


0:29:38.2 RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, MedTech, robotics and agriculture. For more information, head to