Carrying out a data protection impact assessment when necessary
Creating a dataset for the training of an AI system can lead to high risks to people’s rights and freedoms. In such cases, a data protection impact assessment is mandatory. The CNIL explains how, and in which cases, it should be carried out.
The Data Protection Impact Assessment (DPIA) is an approach for mapping and assessing the risks of a personal data processing operation and for establishing an action plan to reduce them to an acceptable level. This approach, facilitated by the tools provided by the CNIL, is particularly useful for controlling the risks associated with a processing operation before it is implemented, but also for monitoring them over time.
In particular, a DPIA makes it possible to carry out:
- an identification and assessment of the risks to the individuals whose data may be collected, through an analysis of their likelihood and severity;
- an analysis of the measures enabling individuals to exercise their rights;
- an assessment of people’s control over their data;
- an assessment of the transparency of the data processing for individuals (consent, information, etc.).
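The likelihood-and-severity mapping above can be sketched as a simple scoring step. The four-level scales, the acceptability threshold, and the example risks below are illustrative assumptions (loosely modelled on the CNIL's PIA methodology), not values prescribed by this guidance:

```python
# Hedged sketch: scoring risks on likelihood x severity scales.
# The 1-4 scales and the acceptability threshold are illustrative
# assumptions, not values mandated by the GDPR or the CNIL.

LEVELS = {1: "negligible", 2: "limited", 3: "significant", 4: "maximum"}

def risk_score(likelihood: int, severity: int) -> int:
    """Combine likelihood and severity into a single score."""
    assert likelihood in LEVELS and severity in LEVELS
    return likelihood * severity

def needs_mitigation(likelihood: int, severity: int, threshold: int = 6) -> bool:
    """Flag risks whose combined score exceeds an (assumed) acceptable level."""
    return risk_score(likelihood, severity) > threshold

# Hypothetical risks for a training-dataset processing operation
risks = {
    "data breach of the training set": (2, 4),
    "re-identification from model outputs": (2, 3),
    "unwanted inference about data subjects": (3, 3),
}

for name, (lik, sev) in risks.items():
    print(f"{name}: score={risk_score(lik, sev)}, mitigate={needs_mitigation(lik, sev)}")
```

The flagged risks would then feed the action plan mentioned above; real assessments use the controller's own scales and documented justifications rather than a bare numeric threshold.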
Carrying out a DPIA for the development of AI systems
Identifying when a DPIA is needed
The development of AI systems requires, in some cases, a DPIA to be carried out. A DPIA is mandatory if the envisaged processing is likely to create a high risk to the rights and freedoms of natural persons (Article 35 GDPR).
The European Data Protection Board (EDPB) has identified nine criteria to assist controllers in determining whether a DPIA is required: a DPIA must be carried out for any processing that meets at least two of the criteria on this list. Some of these criteria are particularly relevant for the development phase:
- the collection of sensitive data or highly personal data (e.g. categories of data that can increase the risk of harm to the rights and freedoms of individuals, such as location data or financial data);
- the processing of data on a large scale;
- the collection of data concerning vulnerable persons, such as children;
- the matching or combination of datasets;
- the innovative use or the application of new technological or organisational solutions.
In all cases, it is necessary to consider the risks that the creation of a training dataset and its use pose to individuals. If there are significant risks, in particular as a result of a misuse of data or a data breach, or where a processing operation may lead to discrimination, a DPIA must be carried out even if fewer than two of those criteria are met; conversely, a DPIA does not have to be carried out if two criteria are met but the controller can establish with sufficient certainty that the processing of personal data in question does not expose individuals to high risks.
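The screening rule above — two or more criteria met creates a presumption that a DPIA is required, subject to the two overrides — can be sketched as follows. The criterion labels and function names are illustrative shorthand, not an official checklist implementation:

```python
# Hedged sketch of the EDPB screening logic: two or more criteria met
# => DPIA presumed required, subject to the overrides described in the
# guidance (a demonstrated high risk, or its certain absence).
# Labels are informal paraphrases of the nine EDPB criteria.

EDPB_CRITERIA = [
    "evaluation or scoring",
    "automated decision-making with legal or similar effect",
    "systematic monitoring",
    "sensitive or highly personal data",
    "large-scale processing",
    "matching or combining datasets",
    "data concerning vulnerable persons",
    "innovative use of new solutions",
    "processing preventing exercise of a right or use of a service",
]

def dpia_required(criteria_met: set[str],
                  high_risk_established: bool = False,
                  high_risk_ruled_out: bool = False) -> bool:
    """Apply the two-criteria rule with its two overrides."""
    if high_risk_established:   # e.g. likely misuse, breach, discrimination
        return True
    if high_risk_ruled_out:     # controller shows no high risk with certainty
        return False
    return len(criteria_met & set(EDPB_CRITERIA)) >= 2

# A web-scraped training corpus typically meets at least two criteria
print(dpia_required({"large-scale processing", "matching or combining datasets"}))
```

In practice the overrides are not boolean flags but documented, case-by-case assessments; the sketch only captures the decision structure.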
On the basis of these criteria, the CNIL has published a list of personal data processing operations for which a DPIA is mandatory (for more information, see the CNIL’s website). Several of these may rely on artificial intelligence systems, such as those involving profiling or automated decision-making: in such cases, a DPIA is always required.
Is the use of an artificial intelligence system an ‘innovative use’?
Innovative use is one of the nine criteria that can trigger the obligation to carry out a DPIA. It is assessed in the light of the state of technological knowledge, not only of the context of the processing: a processing operation can be very “innovative” for a given organisation, because of the technological novelty it brings to it, without being an innovative use in general. The use of artificial intelligence systems is not systematically an innovative use or an application of new technological or organisational solutions, so not all processing using AI systems will meet this criterion. In order to determine whether the technique used falls within such uses, it is necessary to distinguish between two categories of systems:
- systems that use AI techniques that have been experimentally validated for several years and tested in real-world conditions do not fall within the scope of innovative use or application of new technological or organisational solutions.
Example: certain regression or clustering techniques or model architectures such as random forests, in cases where the risks associated with their use are known;
- systems that use techniques that are still new, including those based on statistical approaches such as deep learning, whose risks are only beginning to be known today and are still poorly mastered, constitute an innovative use.
Example: generative AI systems based on training on large amounts of data and whose behavior cannot be anticipated in all use cases.
By way of illustration, a research project aimed at developing automatic language processing tools for clinical applications in the medical field, based on large volumes of data (transcripts of audio data, cases of clinical studies, medical results, etc.), can be an innovative use, especially given the uncertainty as to the results to be obtained.
Is training an artificial intelligence system a “large-scale” processing?
Large-scale collection of data is one of the nine criteria that can trigger the obligation to conduct a DPIA. While the development of an AI system often relies on the processing of a large amount of data, it does not necessarily fall within the scope of large-scale processing operations, which “process a considerable amount of personal data at regional, national or supranational level [and which may] affect a significant number of data subjects” (recital 91 GDPR). For AI systems in particular, it will be necessary to determine whether the development involves a very large number of people.
Examples:
- A research organisation wants to build a large dataset of landscape photos (mountain, ocean, desert, cities, etc.) to improve the performance of a computer vision system. Some of these images feature individuals, who are sometimes recognizable.
Even where the dataset contains millions of images covering the entire surface of the planet, if the number of images containing recognizable individuals (and therefore personal data) is limited (for example, to a few thousand), the processing will not be a “large-scale processing”. However, it is not excluded that a DPIA may be required under the other criteria to be verified.
- Where a provider of a conversational agent compiles a dataset to train its large language model (LLM) from a considerable volume of publicly accessible personal data on the Internet, collected through web scraping techniques, the processing can be described as “large-scale processing”.
Defining the scope of the DPIA
The scope of a DPIA may differ depending on the provider’s knowledge of how the AI system it develops will be used, whether by itself or by a third party.
When the operational use of the AI system in the deployment phase is identified from the development phase
When the system provider is also the controller of the personal data processing occurring during the deployment phase, and when the operational use of the AI system is identified from the development stage, it is recommended to carry out a general DPIA covering the entire processing. The provider will then be able to include in this DPIA the risks associated with both phases.
If the provider is not responsible for the processing occurring during the deployment phase but has identified the purpose of use in that phase, it may propose a DPIA template to the controller. This may allow the controller, in particular, to take into account certain risks that are easier to identify during the development phase. The user of the AI system, as controller, nevertheless remains responsible for carrying out the DPIA, which it may base on the provider’s template if it wishes.
It should be noted that, in some cases, it is not possible to determine precisely and in advance how the deployment phase will be framed: for example, some risks may be reassessed after a calibration phase of the AI system under its deployment conditions. The DPIA will then have to be updated iteratively as the characteristics of the processing are defined at the deployment stage.
Where the operational use of the AI system in the deployment phase is not clearly identified in the development phase
In this case, the provider will only be able to carry out its impact assessment for the development phase. It will then be up to the controller of the deployment phase to analyse, having regard to the characteristics of the processing, whether a DPIA is necessary for that phase. Where the purposes of the deployment phase are multiple, the controller may, if necessary, adapt the same general DPIA to each of the specific use cases.
AI Risks to Consider in a DPIA
In addition to the risks usually associated with the processing of personal data, processing based on artificial intelligence systems presents specific risks that should be taken into account:
- the risks to data subjects related to the misuse of the data contained in the training dataset, in particular in the event of a data breach;
- the risk of automated discrimination caused by the AI system, introduced during its development and resulting in lower performance of the system for certain categories of people;
- the risk of producing false content about a real person, which is particularly significant for generative AI systems and may have consequences for that person’s reputation;
- the risk of automated decision-making caused by an automation or confirmation bias, where the necessary explainability and transparency measures are not taken during the development of the solution (such as surfacing a confidence indicator, or intermediate information such as a saliency map) or where an agent using the AI system cannot take a contrary decision without suffering adverse consequences;
- the risk of individuals losing control over data they have published and made freely accessible online, as large-scale collection is often necessary for training an AI system, in particular when data are collected by web scraping;
- the risks associated with known attacks specific to AI systems, such as data poisoning, backdoor insertion, or model inversion;
- the risks related to the confidentiality of the data that could be extracted from the AI system.
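The last two risks — extraction and confidentiality — can be illustrated with a toy membership-inference test: a model that has memorised its training data scores training records noticeably higher than unseen ones, revealing who was in the dataset. Everything below is synthetic and illustrative (the unigram "model" and the likelihood floor are assumptions made for the sketch, not a real attack implementation):

```python
# Toy illustration of the confidentiality/extraction risk family:
# a naive membership-inference test flags records on which the model
# assigns unusually high likelihood. All data here are synthetic.

def train_unigram(corpus):
    """Build a word-frequency 'model' that trivially memorises its corpus."""
    counts, total = {}, 0
    for doc in corpus:
        for tok in doc.split():
            counts[tok] = counts.get(tok, 0) + 1
            total += 1
    return {t: c / total for t, c in counts.items()}

def avg_likelihood(model, doc, floor=1e-6):
    """Average per-token probability; unseen tokens get a small floor."""
    toks = doc.split()
    return sum(model.get(t, floor) for t in toks) / len(toks)

training_set = ["alice paid rent in may", "bob visited the clinic"]
model = train_unigram(training_set)

member = avg_likelihood(model, "bob visited the clinic")      # was in training data
non_member = avg_likelihood(model, "carol opened an account") # never seen
print(member > non_member)  # the member record scores far higher
```

Real membership-inference attacks target the confidence scores of trained models rather than raw frequencies, but the asymmetry exploited is the same, which is why it belongs in the DPIA's risk inventory.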
Finally, analyses from reference frameworks published by the CNIL or by third parties may be integrated into or appended to the DPIA. Among these, the CNIL recommends using:
- the self-assessment guide published by the CNIL;
- the proposal for a European Artificial Intelligence Act, and in particular its Annex IV detailing the technical documentation that must accompany the placing on the market of high-risk AI systems;
- the reference frameworks identified by the CNIL on the page “Other guides, tools and good practices”.
Actions to be taken on the basis of the results of the DPIA
The DPIA is an exercise that aims, first, at determining the level of risk associated with a processing of personal data. Once this level is determined, a set of measures should be devised in the DPIA in order to reduce the risk to, and maintain it at, an acceptable level. These measures must incorporate the applicable CNIL recommendations, whether they relate to AI techniques or not.
In addition, certain specific measures in the field of AI – in particular of a technical nature – may be implemented, including:
- security measures, such as homomorphic encryption or the use of a trusted execution environment;
- minimisation measures, such as the use of synthetic data;
- anonymisation or pseudonymisation measures, such as differential privacy;
- data protection measures applicable at the stage of development, such as federated learning;
- measures to facilitate the exercise of rights or remedies for individuals, such as machine unlearning techniques, or measures to explain and track outputs from AI systems;
- audit and validation measures, in particular to identify and correct biases or errors discriminating against certain individuals or categories of individuals.
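As one concrete illustration of the differential privacy measure listed above, the classic Laplace mechanism adds calibrated noise to an aggregate statistic before release. This is a generic sketch under stated assumptions (the query, the epsilon value, and the function names are illustrative), not a CNIL-endorsed implementation:

```python
import random

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a differentially private count via the Laplace mechanism.

    Adding or removing one person changes a counting query by at most 1
    (the sensitivity), so Laplace noise of scale sensitivity/epsilon
    gives epsilon-differential privacy for this single query.
    """
    scale = sensitivity / epsilon
    # Difference of two exponentials with mean `scale` ~ Laplace(0, scale)
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

random.seed(0)  # reproducible demo only; never fix the seed in production
# Hypothetical release: "how many records in the training set concern minors"
print(dp_count(true_count=1000, epsilon=1.0))
```

Smaller epsilon values mean stronger privacy but noisier answers; in a DPIA, the chosen epsilon and the privacy budget across repeated queries would themselves need to be documented and justified.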
Other, more generic measures may also be applied:
- organisational measures, such as supervising and restricting access to training datasets where such access would allow the AI system to be modified, and limiting access to the data by third parties and subcontractors;
- governance measures, such as setting up an ethics committee;
- logging in order to identify and explain abnormal behaviour;
- measures concerning internal documentation, such as the drafting of an IT charter.
These measures should be selected on a case-by-case basis in order to reduce the risks that are specific to the processing in question. They will need to be integrated into an action plan and monitored. In addition, as they are intended to protect data during the development of the AI system and in particular when setting up the dataset, these measures may be complemented by other AI specific measures to be applied during the deployment phase. A description of the specific measures for the deployment of generative AI will be provided in a subsequent sheet.