Qarnot
The Editorial Team at Qarnot.

Will data scientists save the world?

October 12, 2021 - HPC discovery, Machine Learning / AI

Data science is at the crossroads of different disciplines: mathematics, statistics, computer science, data analysis… It aims to give meaning to, and sometimes create meaning from, raw data.

The exponential growth in the volume of digital data (50 zettabytes in 2020, twenty times more than a decade earlier) makes complex analysis by humans impossible. Data science opens up fantastic perspectives to those able to interpret the data and make it speak. But it also raises many questions, particularly ethical ones, about both its uses and its workings, and environmental ones, through the carbon footprint it generates. Interview with François Guillaume Fernandez, Deep Learning Engineer in charge of the non-profit, open source Pyronear project, and Redha Moulla, PhD in control engineering, AI consultant and teacher, and former head of the Data Science division at Keley Data.

When the term data science appeared in September 1992*, it was still difficult to imagine the range of possibilities. Even Cédric Villani admits**: “Like many mathematicians starting their careers in the 90s, I deeply underestimated the impact of artificial intelligence, which at the time was still yielding few results.”

Thirty years later, the use cases and the companies claiming to use artificial intelligence in the broad sense are legion. The sector selling goods and services (mostly online, but also hybrid with the explosion of “phygital retail”) has clearly identified the formidable growth lever that lies in the use of data. Recommendation engines, consumption-prediction tools… algorithms have been widely adopted by marketing and sales teams to generate revenue.


But according to Redha Moulla, these uses do not make the younger generations dream: “There is an underlying trend. Young data scientists no longer want to go into marketing or finance too much, even if it clearly pays better. There is a bit of an idea behind it of wasting the extraordinary possibilities [of AI] just to increase a click-through rate.”

As a perfect illustration, the Data For Good association brings together a community of volunteer data scientists using their skills to help solve social problems. Starting from the observation that actors working for the general interest*** very often do not benefit from the same resources and technologies as startups or tech giants, Data For Good proposes to help restore this balance. Projects are selected and supported, much like in a start-up accelerator, and any content produced within them (code, visuals, documentation, etc.) is published under a free license to benefit the community.

Focus on the Pyronear project: early detection of fires thanks to deep learning

It was by participating in a hackathon in 2018 that François Guillaume Fernandez became aware of the forest fire situation. Trained as an engineer (CentraleSupélec) and a specialist in visual information processing, he first imagined a fire detection program running on old smartphones. But diverting a phone from its original use turned out to be both expensive and technically complicated to manage. Another approach therefore had to be found, with the lowest possible deployment cost, knowing that the solution operates in three stages:

  1. Covering the area.
  2. Processing the images.
  3. Alerting firefighters when an outbreak of fire is detected.
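These three stages can be sketched, very schematically, as a monitoring loop. The function names, the stubbed model, and the confidence threshold below are illustrative assumptions, not Pyronear's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Detection:
    confidence: float  # model's score that the frame shows a fire start

def capture_frame(camera_id: int) -> bytes:
    """Stage 1 - cover the area: grab an image from a camera (stubbed)."""
    return b"\x00" * 64  # placeholder image bytes

def run_model(frame: bytes) -> Detection:
    """Stage 2 - image processing by the deep learning model (stubbed)."""
    return Detection(confidence=0.0)

def alert_firefighters(camera_id: int, det: Detection) -> str:
    """Stage 3 - alert firefighters when an outbreak of fire is detected."""
    return f"ALERT camera={camera_id} confidence={det.confidence:.2f}"

def monitor_once(camera_id: int, threshold: float = 0.8,
                 model: Callable[[bytes], Detection] = run_model) -> Optional[str]:
    """One pass of the pipeline; a real deployment would loop over this."""
    det = model(capture_frame(camera_id))
    if det.confidence >= threshold:
        return alert_firefighters(camera_id, det)
    return None
```

In a real deployment, each stage would run on the device itself, which is what makes the choice of hardware discussed below so important.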

4,000: this is the number of forest and natural-environment fires in France each year****.

The approach of detecting fires from images is not new in itself. “What we hope to bring is a software solution adapted to the hardware. France is very mature on this subject and automatic detection already exists, but the equipment is very expensive because it often relies on infrared cameras, and there is less intelligence in the processing,” specifies François Guillaume Fernandez. On the role played by satellites, he explains that “detection must be early to be useful to firefighters. Satellite images are both hard to access and generally delayed by about fifteen minutes. But they can make it possible to validate detections a posteriori.”

The project's underlying observation is that it is hard to build a high-performance, accessible and economical solution if data acquisition itself is not simple. The hardware choice therefore fell on the Raspberry Pi: inexpensive, it is also extremely practical because all kinds of devices can easily be added to it. “Other organizations are studying chemical fire detection, for example, and such a module could very well be added to the project. We could also add a microphone and, in doing so, link this project to Microfaune, which assesses biodiversity through deep learning. It is really a multimodal system, an experimentation platform,” explains François Guillaume.

Concretely, the Raspberry Pi is installed at a position overlooking the monitored site, like a firefighters' watchtower. The number of devices depends on the terrain and the site.

Once the hardware question had been settled, the option chosen was Deep Learning, that is, letting the algorithm pick out the characteristics that identify the start of a fire: “We showed it hundreds of thousands of fire-start situations, and it is the model that qualifies them. The first source was Google Images, but we quickly saw that it was not robust enough and that we were going to create a great barbecue detector, so we took images from cameras that run continuously on sensitive sites, particularly in the United States, and labeled them ourselves.”
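Beyond better training data, one common way to harden a per-frame classifier against one-off false positives like the barbecues mentioned above is to require several positive frames within a sliding window before raising an alert. The sketch below is illustrative and not taken from the Pyronear codebase:

```python
from collections import deque

class AlertConfirmer:
    """Raise an alert only when enough recent frames score above the
    confidence threshold, so an isolated spurious detection is ignored.
    Window size, required count and threshold are illustrative defaults."""

    def __init__(self, window: int = 5, required: int = 4, threshold: float = 0.8):
        self.scores = deque(maxlen=window)  # confidences of the last frames
        self.required = required
        self.threshold = threshold

    def update(self, confidence: float) -> bool:
        """Feed one frame's model confidence; return True when an alert
        should be raised."""
        self.scores.append(confidence)
        positives = sum(s >= self.threshold for s in self.scores)
        return positives >= self.required
```

A single high-confidence frame (the barbecue) never triggers an alert, while a sustained sequence of positive frames (an actual fire start) does.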

Today, test phases are about to start in several departments, beginning with the Ardèche, with the aim of covering the greatest variety of terrains and conditions (relief, light, vegetation, etc.). All the code for the project, which focused on the processing intelligence and the software layer, is open source. The Pyronear association now has around thirty volunteers with a range of skills and professions (R&D, communication, UX, etc.).

A superpower too greedy for energy?

Digital pollution is not always easy to grasp: it cannot be seen, and the very notion of the cloud suggests a kind of evaporation that makes it even less tangible. The carbon footprint of artificial intelligence is no exception. Redha Moulla confirms that, in his view, “energy efficiency is not really an issue in the eyes of data scientists, because heavy computations are sent to data centers far away; we do not see them, which makes the carbon footprint very abstract.”

He nevertheless puts the weight of artificial intelligence in digital pollution into perspective by distinguishing two main types of use:

  • on the one hand, OpenAI-type models, which have hundreds of billions of parameters and therefore probably account for hundreds of thousands of kilograms of CO2. But very few players are able to train models of this size, and these models are trained “for all of humanity”, so this pooling partially offsets the footprint.

  • on the other hand, the individual footprints of data scientists, which remain marginal but will probably grow a great deal, both because the number of data scientists is increasing and because access to hardware, and GPUs in particular, is becoming more democratic.


Another line of thought revolves around the notion of technological sobriety, further upstream, at the level of the very choice of artificial intelligence as the most relevant answer to a given problem. While François Guillaume Fernandez defines data science as “a Pandora's box, a ‘superpower’”, he also recalls that this power comes at a cost that may be disproportionate to the objectives, and that it is important to strike a fair balance between the goal to be achieved and the means deployed to achieve it: “The data scientist is a bit like an architect who chooses the materials, the energy performance, the consumption of his building… Sometimes we build an energy labyrinth without having that in mind, and suffer the cost afterwards.”
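The architect analogy can be made concrete with a back-of-envelope footprint estimate. The formula and all the default values below (GPU power draw, a data center PUE of 1.5, a grid intensity of 0.29 kg CO2-eq per kWh) are illustrative assumptions; dedicated tools such as CodeCarbon measure this properly:

```python
def training_footprint_kg(gpu_power_w: float, n_gpus: int, hours: float,
                          pue: float = 1.5,
                          grid_kg_per_kwh: float = 0.29) -> float:
    """Rough CO2-eq estimate for a GPU training job: energy drawn by the
    GPUs, scaled by the data center's PUE (total facility power over IT
    power), times the grid's carbon intensity. Defaults are assumptions."""
    energy_kwh = gpu_power_w * n_gpus * hours / 1000 * pue
    return energy_kwh * grid_kg_per_kwh

# e.g. four 250 W GPUs running for 72 hours:
# 250 * 4 * 72 / 1000 = 72 kWh at the GPUs, 108 kWh including PUE,
# i.e. roughly 31 kg CO2-eq on a ~0.29 kg/kWh grid.
```

Seeing such a number before launching a run is exactly the kind of awareness the architect comparison calls for.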

We can of course also point to improvements in the energy efficiency of the data centers housing the GPUs. The most efficient approach at present, from this point of view, is to reuse the waste heat from the computations, as proposed by the company Qarnot, a partner of Data For Good: its GPU cluster is installed in unused parts of Casino Group logistics warehouses, which means that, unlike a traditional data center, no dedicated building was constructed and no electrical network installed. In addition, the heat produced is extracted and recycled on site to heat the warehouses, which even gives the site a negative carbon footprint*****.

How to see clearly into the black box?

According to Redha Moulla, the ethical problem surrounding artificial intelligence is much more important than the environmental issue, because its contours are not yet clear: “The answers provided today are technical. We have tools to explain how an algorithm made a decision, but in reality we don't really know what it did, because the reality is too complex for us. There are probably tons of examples where algorithms get it wrong and we don't even know it. For a purchase recommendation on Amazon, it does not matter much, but if tomorrow we trust an algorithm to pilot a car or a plane, the stakes are much higher. Society will therefore have to take up this debate.”


In his report on artificial intelligence, Cédric Villani also recalls that “we are not all equal before these algorithms, and their biases have real consequences for our lives. Every day, in great opacity, they affect our access to information, culture, employment and even credit.”

So should the solution be technical or political? Redha Moulla recalls that to correct one bias, another must be introduced, which does not seem to provide an answer to this complex question. As for politics, Cédric Villani underlines that “the law cannot do everything, among other things because the time of the law is much longer than that of the code. It is therefore essential that the ‘architects’ of the digital society - researchers, engineers and developers - who design and market these technologies do their fair share in this mission by acting responsibly. This implies that they are fully aware of the possible negative effects of their technologies on society and that they actively work to limit them.”

So, can data scientists save the world? 

Redha Moulla thinks so, but specifies: “I do not believe that we should leave technical questions to technologists alone, because they lack the necessary distance. Data scientists can save the world, but not alone. In any case, the will is there. Just about every major consulting firm now has a department related to ‘data science for good’. Today, resources go mainly to marketing and finance, but tomorrow things will move, because we will have no choice. It is already under way.”

 


The weight of AI: a few examples

AlphaGo: To beat Lee Sedol at the game of Go in 2016, AlphaGo's 1,920 CPUs and 280 GPUs drew around 1 MW of electrical power, more than the combined power of 100 Renault Zoé electric vehicles.
Source: Mastering the game of Go with deep neural networks and tree search - Nature - 2016

Deep fake: The energy consumption of a 72-hour deep fake computation (for a 256 px image) on 3 RTX 2080 Ti GPUs and 1 AMD 2990 CPU is about 72 kWh, or about 21 kg CO2 eq (European electricity mix).
Source: Europa.eu

ImageNet: Training a model on the ImageNet dataset for the DAWNBench benchmark required:

  • 14 days to complete 90 epochs of ResNet-50 training on an NVIDIA M40 GPU
  • ~134 kWh of energy consumption
  • ~39 kg CO2 eq (European electricity mix)
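The two kWh-to-CO2 conversions above are consistent with a European grid intensity of roughly 0.29 kg CO2-eq per kWh, a value inferred here from the figures quoted rather than an official constant:

```python
GRID_KG_PER_KWH = 0.29  # assumed European electricity mix intensity

def kwh_to_kg_co2(kwh: float) -> float:
    """Convert electricity consumption to CO2-eq emissions."""
    return kwh * GRID_KG_PER_KWH

deep_fake_kg = kwh_to_kg_co2(72)   # ~21 kg, matching the deep fake figure
imagenet_kg = kwh_to_kg_co2(134)   # ~39 kg, matching the ImageNet figure
```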

The “Carbon Emissions and Large Neural Network Training” study showed that the carbon footprint linked to the use of models is a priori nine times greater than that of their training.
Source: Carbon_Emission_Large_NN

* The term Data Science was coined during the 2nd Franco-Japanese colloquium on statistics held at the University of Montpellier II (France) in September 1992.

** Report “Giving meaning to artificial intelligence” - 2017

*** Citizens, associations, public institutions and companies with a strong social impact.

**** Source: Ministry of Ecological and Solidarity Transition

***** Estimate carried out on the Réau site. Heat recovery + reduction in natural gas consumption and partial substitution by low carbon electricity.
