Alexis Rolland, Director of Ubisoft China AI & Data Lab: Building the Metaverse Requires Six Foundational Pillars



2021.08.03




In the Metaverse, everyone and everything will have a virtual representation. Players can live experiences similar to, or even beyond, real life, such as playing games, going to the movies, creating, or shopping.


Unlike the current Internet social avatars, virtual avatars determine the uniqueness of individual humans in the virtual world. By reproducing facial features, emotional expressions, gestures, and changes in posture, they enhance the sense of interaction and realism.


In other words, the virtual avatar is the passport of mankind to the virtual world and the identity of mankind in the virtual world.


Obviously, customized game characters and user-generated content are important pillars of the Metaverse. However, creating virtual characters is a complex process with many challenges to overcome.

Ubisoft is a leader in the development, publishing, and sale of interactive entertainment games and services. Since establishing a studio in China in 1996, Ubisoft has been at the forefront of China's game industry.


And today, Ubisoft China has two studios in Shanghai and Chengdu, with more than 1,000 professionals in game production, image design, animation, programming, artificial intelligence, sound effects, testing and data management at home and abroad.


In terms of building virtual avatars, Ubisoft has been exploring for a long time and has come up with a set of industry-leading technology systems.

At the Gravity Singularity Metaverse Summit, Alexis Rolland, Director of Ubisoft China AI & Data Lab, introduced Ubisoft’s definition of the Metaverse, as well as the challenges they encountered and the solutions they developed in the process of creating virtual avatars.

The following is a lightly edited transcript of the speech by Alexis Rolland, Director of Ubisoft China AI & Data Lab:

My name is Alexis Rolland. Thank you for the introduction. I am the director of Ubisoft China’s AI & Data Lab. We are a technology team located in Shanghai and Chengdu. We do R&D and develop tools to empower game production teams with machine learning.



I assume most of you know about Ubisoft, or at least I hope you have heard of us. We are a major video game developer, but just in case, I prepared a short video to remind you of the games we make.


Ubisoft was actually one of the first video game developers to enter China, as early as 1996, when we opened our first studio in Shanghai. Later on, in 2008, we opened a second studio in Chengdu.


And today, both studios combined include more than 1,000 employees, which makes us the third-biggest creative force of the company. We work on its most famous franchises, including Rainbow Six: Siege, Assassin's Creed, Far Cry, Just Dance, and Rabbids, for which we are actually releasing a new game in China this month.


Today's presentation covers four parts. First, an introduction to the Metaverse, or at least how we define it at Ubisoft. Then I will address three common challenges related to the creation of virtual avatars: facial animation, body animation, and animation blending.



But let's start with the Metaverse.

We define the Metaverse as an enhanced virtual world parallel to the real world, where players can use a personalized avatar and do almost everything they could do in real life. That includes playing video games, but also going to concerts or movies, and even creating or shopping.


We think the Metaverse relies on six foundational pillars.


The first one is socialization. First and foremost, the Metaverse is a social hub where players have the opportunity to interact through engaging relationships, which complement or even replace real-life socialization.


The second pillar is persistency. The Metaverse keeps on going after the players disconnect. It does not rely on a player's presence and keeps living on without them.


The third pillar is user-generated content and digital content creation in general. In the Metaverse, players should be able to interact with and contribute to the digital universe, thanks to easily accessible tools. The Metaverse actually blurs the line between the creators and the players.


The fourth pillar is the convergence of media. The Metaverse is also a place where different media industries can coexist, and players are invited to live cross-media experiences around art, music, or movies.


The fifth pillar is an integrated and functional economy. In the Metaverse, players should have the opportunity to earn money or to acquire skills that are recognized by the system and valued by other players. This is in particular where technologies like blockchain and NFTs can play a big role.


The sixth and final pillar is scalability, because the Metaverse depends on the scalability of the technology to allow a great number of players to congregate on a single server and share moments together instead of playing on multiple servers.


Now, when looking at this big picture, you may wonder where Artificial Intelligence fits. Here I'm not talking about AI in the sense of game bots, but of course about machine learning and deep learning techniques.


When we think about it, AI is a little bit like electricity. It actually has the potential to revolutionize all of these domains. But in today's presentation, I'd like to focus on digital content creation.


We see, in particular, this big trend around VTubers and content creators. They are preparing themselves for the Metaverse. They equip themselves with relatively expensive hardware. You can see in the pictures here motion capture suits, headsets, and so on, and they create their own digital avatar, their alter ego.


To achieve that, beyond the hardware investment, it is also still very demanding in terms of 3D and animation skills. Facial animation in particular is a challenge.


For a virtual avatar to look good, to look realistic, it needs a perfect match between the speech, the emotion it carries and the animation of the face, including the eyes, the eyebrows, lips, and so on.


In the context of video games, facial animation can also be pretty expensive, in particular when you localize games into different languages. In our case, at Ubisoft, we localize the voices in our games into 9 to 10 languages, and the same line of speech has a different duration in each language.


For example, German is famous for having very long sentences and words, and so you need the animation of the mouth to be perfectly in sync with the speech. In comparison, English would be shorter, and so the virtual character has to adapt to those different languages. Creating those facial animations manually would be expensive.

So our teams at Ubisoft, in particular Ubisoft La Forge, have been working on a solution for this problem. They trained a convolutional neural network on speech data. The network takes as input an audio file, which contains dialogue lines, and outputs a sequence of phonemes.


For people who are not into linguistics, phonemes are units of sound, each of which we can map to a shape of the mouth.


And then this sequence of phonemes is converted into lip and mouth animation curves, also known as f-curves for people who are familiar with the domain.
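The phoneme-to-mouth-shape step described above can be illustrated with a minimal sketch. It assumes the network's output is already available as a list of timed phonemes; the table `PHONEME_TO_MOUTH_SHAPE`, the shape names, and the function `phonemes_to_keyframes` are hypothetical and are not taken from Ubisoft's tooling. A real pipeline would produce continuous animation curves rather than discrete shape labels.

```python
from dataclasses import dataclass

# Hypothetical phoneme-to-mouth-shape table. The phoneme symbols and shape
# names are illustrative assumptions, not Ubisoft's actual mapping.
PHONEME_TO_MOUTH_SHAPE = {
    "AA": "open",
    "IY": "wide",
    "UW": "round",
    "M": "closed",
    "B": "closed",
    "P": "closed",
    "F": "lip_bite",
    "V": "lip_bite",
}


@dataclass
class TimedPhoneme:
    phoneme: str
    start: float  # seconds from the start of the audio file
    end: float


def phonemes_to_keyframes(phonemes, default_shape="neutral"):
    """Turn timed phonemes into (time, mouth_shape) keyframes.

    Consecutive phonemes that map to the same shape are merged, so a
    keyframe is emitted only when the mouth shape actually changes.
    """
    keyframes = []
    for p in phonemes:
        shape = PHONEME_TO_MOUTH_SHAPE.get(p.phoneme, default_shape)
        if keyframes and keyframes[-1][1] == shape:
            continue  # same shape as the previous keyframe, nothing to add
        keyframes.append((p.start, shape))
    return keyframes
```

Because the keyframe times come from the audio itself, a longer German line and a shorter English line would each get their own timing without manual re-animation.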

We talked about the face, but what about the body? There is a very hot research topic right now in academia, which we call pose estimation. It consists in trying to generate 3D coordinates of the human body, the different joints of the skeleton, based on 2D images or videos. It is a very difficult research topic. What you see here is actually not from Ubisoft. It's a state-of-the-art paper published by the Max Planck Institute in Germany.


In their case, not only did they develop a technique to predict the coordinates of the body, but also the 3D model. We call it body pose and shape estimation. This kind of research is very inspiring for teams like ours in the video game industry.


But in our case, at Ubisoft China, we work a lot on animals, on wildlife. We have a long history of working on the Far Cry brand, for which we develop its most iconic animals. The research I was mentioning on humans is already challenging because it is difficult to acquire motion capture data for humans. But think about wild animals like a bear, an elephant, or a tiger. It's even more difficult to acquire such data to train an AI. So the idea we had was to leverage previous work done by our teams.


Over the last 8 years, they created many animal animations manually. We call it keyframe animation. We use it to generate training data to achieve results similar to what you saw on humans. The idea is to build a pipeline that takes as input a video and a template skeleton and then generates as output the 3D coordinates of the animal skeleton.


We eventually built that pipeline, which we call ZooBuilder. You can see the different components here. It takes as input a video. It converts that video to a sequence of images and provides those images to a first machine learning model that locates the animal in the image.


Then we provide this sequence of cropped images to a second machine learning model that we retrained with our synthetic data, with our animal animation data. 


This second model outputs the 2D coordinates of the skeleton on the image. We then provide those 2D coordinates to a third model, which converts them to 3D. Finally, the 3D coordinates are applied to a 3D model.
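The three-stage structure just described (locate, estimate 2D, lift to 3D) can be sketched as plain function composition. This is only an outline under stated assumptions: the function names, the bounding-box format `(x0, y0, x1, y1)`, and the `toy_*` stand-ins are hypothetical placeholders, not ZooBuilder's actual models or interfaces. In the real pipeline each stage is a trained machine learning model.

```python
def crop_frame(frame, box):
    """Crop a frame (a list of pixel rows) to box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]


def run_pipeline(frames, locate_animal, estimate_2d, lift_to_3d):
    """Chain the three stages over a sequence of video frames.

    locate_animal(frame) -> bounding box of the animal
    estimate_2d(crop)    -> list of (x, y) joint coordinates
    lift_to_3d(joints)   -> list of (x, y, z) joint coordinates
    """
    poses_3d = []
    for frame in frames:
        box = locate_animal(frame)
        crop = crop_frame(frame, box)
        joints_2d = estimate_2d(crop)
        poses_3d.append(lift_to_3d(joints_2d))
    return poses_3d


# Toy stand-ins so the sketch runs end to end. They are NOT real models.
def toy_locate(frame):
    """Bounding box around all non-zero pixels."""
    ys = [y for y, row in enumerate(frame) if any(row)]
    xs = [x for row in frame for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def toy_estimate_2d(crop):
    """Pretend the skeleton has a single joint at the centre of the crop."""
    return [(len(crop[0]) / 2.0, len(crop) / 2.0)]


def toy_lift(joints_2d):
    """Pretend every joint lies on the z = 0 plane."""
    return [(x, y, 0.0) for x, y in joints_2d]
```

Passing the stages in as callables keeps each model swappable, which matters when, as noted in the talk, the second model has to be retrained on synthetic animal data.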


It is showing promising results but to be completely transparent, it is not used in production yet. It is still very much a research topic, because the animation is still a little bit imperfect, and it's also a challenge to make it scale for a lot of different animals.


But this kind of technique can definitely help to create more animation clips based on 2D videos, rather than using motion capture infrastructure or hardware, whether for animals or humans.


Now, creating those animation clips is actually just halfway to animating virtual characters in the Metaverse. Those animations need to be integrated and combined together. This is why I want to talk a bit about animation blending.


I'll explain first how it's done in the traditional way. An animator would usually develop what we call an animation graph or an animation tree, which is composed of different leaves corresponding to different clips of animation. Based on player input or automated input, the virtual character activates animations through that graph and plays them.


For this to look nice, there is a requirement: the last frame of the first animation clip should match the first frame of the second animation clip. So the end of the walk cycle should match the beginning of the run cycle or the beginning of the jump animation. If it doesn't match, the animation will look a little bit jittery and unnatural.
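The matching requirement above can be expressed as a simple check on clip boundaries. The sketch below assumes a clip is a list of poses and a pose is a list of `(x, y, z)` joint positions; the function names and the tolerance value are illustrative assumptions, not part of any particular engine.

```python
import math


def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding joints of two poses."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)


def transition_is_smooth(clip_from, clip_to, tolerance=0.05):
    """True if the last pose of clip_from is close to the first pose of clip_to.

    The tolerance is an arbitrary illustrative threshold in the same units
    as the joint positions.
    """
    return pose_distance(clip_from[-1], clip_to[0]) <= tolerance
```

A graph with many clips needs this property on every edge, which is why adding transition animations by hand becomes hard to maintain.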


Here is an example of how it looks. You have this character walking on a plane. As you can see, the animation is a little bit jittery. It is jumping from time to time, which breaks the immersion.


One technique to solve this is to add more animations, transition animations, to fill the gaps between animation clips. But this kind of technique can become very complex and difficult to manage. The more animations you add, the more difficult it is to maintain the graph.


Another technique to solve this challenge is what we call motion matching. It consists in taking all those animation clips and putting them in an in-memory database. Then a search algorithm, based on player input and features such as the character’s feet positions and trajectory, searches for the best-matching animation frame and provides that frame to the game engine. This works fairly well.
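The search step can be illustrated with a brute-force nearest-neighbour lookup. This is a minimal sketch, assuming each stored frame is summarised by a small feature vector; the feature layout, the names `motion_match` and `TOY_DATABASE`, and the numbers are all made up for illustration and do not come from Ubisoft's implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def motion_match(database, query_features):
    """Return (clip, frame) of the stored frame closest to the query.

    This scans every entry, so the work grows with the number of frames
    kept in memory.
    """
    best = min(
        database,
        key=lambda entry: squared_distance(entry["features"], query_features),
    )
    return best["clip"], best["frame"]


# Hypothetical feature layout: (left foot height, right foot height, speed).
TOY_DATABASE = [
    {"clip": "walk", "frame": 0, "features": (0.0, 0.2, 1.0)},
    {"clip": "walk", "frame": 1, "features": (0.1, 0.2, 1.1)},
    {"clip": "run", "frame": 0, "features": (0.3, 0.5, 3.0)},
    {"clip": "run", "frame": 1, "features": (0.4, 0.5, 3.2)},
]
```

For example, `motion_match(TOY_DATABASE, (0.3, 0.5, 2.9))` picks the first frame of the run clip, because its features are closest to the query.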


This is the same example as before, but with motion matching activated. You can see the character animation is a lot smoother. The challenge here is that the cost of this kind of technique scales linearly as you add more animations to the database. If you want more diversity, you need to add more animations, which increases the memory and compute requirements.


Our teams at Ubisoft worked on a new approach inspired by motion matching, which we call learned motion matching, where we actually replaced the search with a neural network that has been trained to output those animation frames based on the character’s feet positions and other inputs.


The good thing with this technique is that it works as well as traditional motion matching. Here is a comparison where you see traditional motion matching on the right side and learned motion matching on the left side. You can see the quality of the animation is preserved and the animation is quite smooth. But in comparison, learned motion matching is a lot less demanding in terms of memory, almost 10 times less demanding than the traditional method.


The good news is that it also works for animals, for quadrupeds. Here you see a bear walking on uneven terrain. The animation has been generated with learned motion matching and it is perfectly smooth, which is great for our use case in China.


That's all for the challenges and the applications of AI I wanted to share related to animations. Let's try to wrap it up together.


First, we saw an automatic pipeline that generates lip animation out of speech. Obviously, this could be a lot more convenient than facial capture with videos, headsets, and so on.

Second, we saw emerging and promising techniques to generate body animations out of videos, again to get rid of all that expensive hardware.

Finally, we introduced learned motion matching, which uses machine learning to improve on existing animation programming and animation blending techniques.


To be frank, we are actually just scratching the surface of everything that could be done in the field of animation with machine learning. But by putting those three elements together, we start to see the future of the animation pipeline for virtual characters, which is becoming more and more performant and more and more accessible. And that is going to streamline the work of artists and, hopefully, players in the long run.


This is it for my presentation. Thank you for listening.







