Computing Power to the People

The Official Qarnot Blog


An introduction to Deep Fakes


by guillaume - May 12, 2021 - Data science

In this series of articles, we will explore the world of deep fakes, from a short history of the technology to hands-on use of open source libraries covering some classic deep fake use cases. From transforming a face to recreating speech, deep fakes are among the most impressive deep learning technologies developed in recent years.

In this first blog post, we will look at the history of the technology and how it is used in the industry today. We will then present the techniques used to create deep fakes. New technologies also bring new concerns, as they do not always have positive implications, so we will also discuss the ethics of deep fakes and the tools used to verify transformed assets.

In the subsequent posts, we will learn to create some cool stuff:

  • Animate a simple picture: bring the Mona Lisa to life or put a smile on your identity card.
  • Swap a face in a video to have your favorite actor in every movie.
  • Synchronize a voice track with lip movements, which can, for example, dramatically improve the quality of movie dubbing.

To learn more about the fundamentals of deep learning and how a neural network works, you can always check out our other articles.

But first things first! What is a deep fake? 

Origins and history

FaceSwap example.

Stephen Wolfram, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

Deep learning is a machine learning method based on neural networks, which use brain-inspired algorithms. These networks are trained to carry out specific tasks through a prediction process. Data transformation, which includes data creation, data analysis and data modification, is one of the most useful applications of deep learning. Deep fakes are one of the most impressive ways to apply deep learning transformations, because every aspect of the technology is used to actively alter the original data.

The term deep fake first appeared in 2017, when online users employed it for face swapping. The most common use of deep fakes is to copy human attributes: a face in a video, a face in an image, or a voice. For this reason, the term deep fake has come to cover any deep learning transformation that changes human attributes. Originally, however, and still most commonly, it refers to replacing one face with another in a video.

Deep fake technologies are more than 20 years old. Here are some key dates: 

  • 1997: The first deep fake research paper, describing a technique to animate an image from a song or a speech.
  • 2004: Exchanging faces with mask models.
  • 2008: Automatic face swapping in photographs by downloading images from the internet, extracting faces using face detection software, and aligning each extracted face to a common coordinate system.
  • 2016: A company named DeepMind publishes an article warning that its WaveNet technology could break voice recognition security systems.
  • 2017: The Synthesizing Obama research paper, which presents realistic lip transfer driven by audio, is published 20 years after the very first deep fake paper.
  • 2018: Deep Voice provides a general-purpose voice cloner that needs only 3.7 seconds of the original voice to clone it.

In less than five years, technological improvements have made deep fakes more realistic and easier to use. What was an academic curiosity at the beginning of the century is now, 20 years later, an efficient and usable tool accessible to everybody. It is becoming mature enough to be seriously considered for personal and commercial use, whether with good or bad intentions. With a reasonably powerful computer, or by using our Qarnot GPU solution, one can easily create strikingly realistic videos.

How it works

A simple concept is behind most deep fake technology: the encoder-decoder architecture. On one side, the encoder is a neural network module that extracts and condenses the data, keeping only the most useful information. For example, given an image of a person, only the most important features, like the shape of the eyes or the mouth, are kept. On the other side, the decoder, which is connected to the encoder, is another neural network module that recreates the source data, for instance a face, from the encoder's output.
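To make the idea concrete, here is a minimal sketch of an encoder-decoder pair written with NumPy. It is only an illustration, not the architecture of any particular deep fake library: the image size, the latent size, the single-layer modules and the names `encode` and `decode` are all assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_SIZE = 64 * 64   # a flattened 64x64 grayscale face (assumed size)
LATENT_SIZE = 128      # size of the compressed representation (assumed)

# Randomly initialised weights; a real model would learn them during training.
W_enc = rng.normal(0, 0.01, size=(IMAGE_SIZE, LATENT_SIZE))
W_dec = rng.normal(0, 0.01, size=(LATENT_SIZE, IMAGE_SIZE))

def encode(image):
    """Compress a flattened image into a much smaller latent vector."""
    return np.tanh(image @ W_enc)

def decode(latent):
    """Rebuild a flattened image from the latent vector."""
    return latent @ W_dec

face = rng.random(IMAGE_SIZE)        # stand-in for a real face image
latent = encode(face)
reconstruction = decode(latent)

print(face.shape, latent.shape, reconstruction.shape)
```

The point to notice is the bottleneck: the 4096 input values are squeezed into 128 before being expanded again, which forces a trained encoder to keep only the most useful information.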

The following figures show the deep fake workflow for both training and conversion.

Training diagram

The neural network is trained to recreate the different faces given to it. This is done by feeding it thousands of images and adjusting the network's weights after each one, so that the output gets closer to the original image. Seeing the same face over and over, the network learns to reproduce it.
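The training loop described above can be sketched on toy data. This is a simplified illustration under several assumptions: the "faces" are random vectors rather than images, the encoder and decoder are single linear layers, all sizes and the learning rate are arbitrary, and the weights are updated once per pass over the whole dataset rather than after each image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a face dataset: 200 samples of 32 values that really
# depend on only 4 hidden factors (all sizes are assumptions).
factors = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 32))
faces = factors @ mixing

W_enc = rng.normal(0, 0.1, size=(32, 4))   # encoder weights
W_dec = rng.normal(0, 0.1, size=(4, 32))   # decoder weights
learning_rate = 0.02

def reconstruction_loss(x):
    """Mean squared difference between the input and its reconstruction."""
    return np.mean((x @ W_enc @ W_dec - x) ** 2)

initial_loss = reconstruction_loss(faces)

for step in range(1000):
    latent = faces @ W_enc                   # encode
    output = latent @ W_dec                  # decode
    error = output - faces                   # distance from the original
    grad_out = 2 * error / error.size        # gradient of the mean squared error
    grad_dec = latent.T @ grad_out
    grad_enc = faces.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec        # nudge the weights so that the
    W_enc -= learning_rate * grad_enc        # next reconstruction is closer

final_loss = reconstruction_loss(faces)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Real deep fake tools use convolutional networks and an automatic differentiation framework instead of hand-written gradients, but the principle is the same: compare the output with the original and correct the weights a little each time.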

Converting diagram

The trained decoder is used to convert an input image into the face it has learned. In this example, decoder B is given an image of face A. The encoder-decoder then outputs an image in which face A has been transformed into face B.

You can see below the encoder-decoder architecture used to change the face of Keanu Reeves in The Matrix into Nicolas Cage's. If you keep the encoder and swap in a different decoder, your model will put a different face on the picture.

Neo transformed into Nicolas Cage.

Keanu Reeves in the movie The Matrix (directed by the Wachowskis, property of Warner Bros.)

Neo

Keanu Reeves in the movie The Matrix (directed by the Wachowskis, property of Warner Bros.)
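The conversion step can be sketched with one shared encoder and two decoders. As before, this is a toy illustration under stated assumptions: the faces are random vectors, every module is a single linear layer, and the names `train_step` and `convert` as well as all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, LATENT = 16, 3   # toy sizes, chosen for illustration

# Toy "faces": each identity has its own way of rendering an expression.
faces_a = rng.normal(size=(100, LATENT)) @ rng.normal(size=(LATENT, DIM))
faces_b = rng.normal(size=(100, LATENT)) @ rng.normal(size=(LATENT, DIM))

weights = {
    "encoder": rng.normal(0, 0.1, size=(DIM, LATENT)),    # shared by A and B
    "decoder_a": rng.normal(0, 0.1, size=(LATENT, DIM)),  # only trained on A
    "decoder_b": rng.normal(0, 0.1, size=(LATENT, DIM)),  # only trained on B
}

def train_step(faces, decoder_name, lr=0.02):
    """One gradient step on the shared encoder and one of the decoders."""
    enc, dec = weights["encoder"], weights[decoder_name]
    latent = faces @ enc
    error = latent @ dec - faces
    grad_out = 2 * error / error.size
    grad_dec = latent.T @ grad_out
    grad_enc = faces.T @ (grad_out @ dec.T)
    weights[decoder_name] = dec - lr * grad_dec
    weights["encoder"] = enc - lr * grad_enc

# Training: each decoder learns to rebuild its own face from the shared encoding.
for step in range(2000):
    train_step(faces_a, "decoder_a")
    train_step(faces_b, "decoder_b")

def convert(face, decoder_name):
    """Encode a face with the shared encoder, then decode it as another identity."""
    return (face @ weights["encoder"]) @ weights[decoder_name]

# Conversion: a face A goes through the encoder, but decoder B rebuilds it.
swapped = convert(faces_a[0], "decoder_b")
print(swapped.shape)
```

Because the encoder is shared, both decoders read the same kind of compressed description, so choosing the decoder at conversion time is what decides whose face comes out.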

You can apply the same method to create an audio deep fake, which modifies the tone of a voice in an audio recording to make it sound like someone else. To create an audio deep fake, you train the encoder-decoder with thousands of audio samples from one person. The model can then transform any voice by recreating it with the new parameters.

Another use case is the Wav2Lip framework, which modifies the lip movements of a person on screen to match an audio recording. The initial speech is discarded and the lips are synchronized with the new audio track. This algorithm uses two encoders, one for audio and one for images, followed by one decoder, to learn the correlation between audio and lip movements.
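The data flow of such a two-encoder, one-decoder model can be sketched as follows. This is not the Wav2Lip implementation: the function names, the single-layer modules, the input sizes and the random weights are all assumptions, chosen only to show how two encodings are merged before decoding.

```python
import numpy as np

rng = np.random.default_rng(2)

AUDIO_FEATURES = 80 * 16   # a flattened spectrogram window (assumed size)
FACE_PIXELS = 48 * 48      # a flattened grayscale face crop (assumed size)
EMBED = 64                 # embedding size of each encoder (assumed)

# Random placeholder weights; a real model would learn them from video data.
W_audio = rng.normal(0, 0.01, size=(AUDIO_FEATURES, EMBED))
W_face = rng.normal(0, 0.01, size=(FACE_PIXELS, EMBED))
W_dec = rng.normal(0, 0.01, size=(2 * EMBED, FACE_PIXELS))

def audio_encoder(audio_window):
    """Summarise a short piece of audio as an embedding."""
    return np.tanh(audio_window @ W_audio)

def face_encoder(face_crop):
    """Summarise the face to be edited as an embedding."""
    return np.tanh(face_crop @ W_face)

def face_decoder(audio_embedding, face_embedding):
    """Generate a new face frame from both embeddings joined together."""
    joint = np.concatenate([audio_embedding, face_embedding])
    return joint @ W_dec

audio_window = rng.random(AUDIO_FEATURES)   # stand-in for real audio features
face_crop = rng.random(FACE_PIXELS)         # stand-in for a real video frame
new_frame = face_decoder(audio_encoder(audio_window), face_encoder(face_crop))
print(new_frame.shape)
```

The decoder sees both what the face looks like and what is being said, which is what lets a trained model draw lips that match the new audio track.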

Most of the usual methods need to retrain the model each time you want a new face. These trained models are more precise but also more costly in computing power. The First Order Motion Model proposes a deep learning model capable of learning motion patterns from any video without much data processing. It uses specific sub-layers to learn movements, coupled with an encoder-decoder to create new images. Depending on the training, it can animate a face, a body or even an animal.

Applications in the industry

Movie enhancement

In the movie industry, deep fakes are very promising. They are already used in conjunction with traditional face transformation techniques, and they allow movie characters to live on when the original actors are unavailable or deceased. For example, the technology was used in Star Wars: The Rise of Skywalker to recreate the faces of young Luke and Leia during their training scene. It can also be used to rejuvenate actors, like Samuel L. Jackson as Nick Fury in the Marvel movie Captain Marvel. In the next few years, it is not hard to imagine iconic actors like Marilyn Monroe or Bruce Lee being brought back to life in upcoming films using this technology.

Humorous talk shows and videos

The creators of South Park, Trey Parker and Matt Stone, created a humorous talk show named Sassy Justice, presented by a fake Donald Trump who interviews other deep-faked celebrities like Julie Andrews, Michael Caine and Mark Zuckerberg. Amusingly, one of the purposes of this show is to warn viewers about the dangers of deep fakes.

Channel 4 also created a viral video featuring a deep fake of Queen Elizabeth II delivering an alternative Christmas message, mocking events that happened during the year and performing a TikTok dance in her office.

Video acceleration 

Nvidia created a number of tools that improve the quality of lower resolution videos using a single picture and machine learning transformations. The concept is similar to deep fakes. With this technology, it is possible to send 1080p videos using less than half of the usual bandwidth. As a result, the final video is more fluid than the initial one.

Other possible future uses

It is also possible to imagine deep fakes being used for movie dubbing. The voice of the original actor would be given to the voice actor who translates the dialog. Deep fakes could also be used to synchronize lip movements with voice tracks or song lyrics.

In video games, they could be used to personalize a player's character with their own voice and face. In 3D animation, they could again be used to synchronize the lip movements of an animated character with the voice of its actor.

Dangers of new technologies

Machine learning opens up a plethora of new possibilities and allows us to build exciting new features and creations. Just as a knife can be used both for cooking and for injuring people, machine learning tools can have very different outcomes depending on how people use them. Unlike the threats posed by knives, which are well known and punishable by law, the dangers of new technologies like machine learning are harder to understand and largely unknown to the general public. The laws and legal tools that should protect us against them are also not yet mature. Many analysts consider deep fakes to be one of the most dangerous machine learning technologies of the next few years. For these reasons, we need to understand them in order to use them correctly and to recognize inappropriate uses, including but not limited to:

  • Bypassing security features, usually with face or voice recognition.
  • Scamming a person or a company, usually with a voice call.
  • Putting the face of a movie actor or actress into a pornographic video.
  • Creating a fake video of a person to discredit them or to take revenge.
  • Spreading fake news.

As mentioned above, understanding the technology, detecting its use and raising public awareness are really important and must be pushed further in the coming years. This is also one of the goals of the next articles, so let's get going and make some deep fakes!

A teaser of upcoming blog posts

In the following articles, we will test various deep fake models and run them on the Qarnot platform.

First, we will learn to animate a picture: bring the Mona Lisa to life or put a smile on your identity card. For this payload, we will use the First Order Motion Model library and the pre-trained model it provides.

Secondly, we will learn to swap faces and put your favorite actor (or even yourself!) in all the movies that you watch. This is the longest payload: we will learn to use a model from scratch, prepare the data, train the algorithms and finally convert the initial face. You will then be able to admire the results with some popcorn!

Lastly, we will synchronize the lips in a video with a voice audio track. For this payload, we will use the Wav2Lip framework and, again, the pre-trained model it provides.

 

Written by Guillaume Lalé NEBIE and Thanh Tri NGUYEN.