ChatGPT Is a Data Privacy Nightmare

Italy’s recent restriction on OpenAI’s text generation tool might only be the
start of ChatGPT’s regulatory problems.

When OpenAI released GPT-3 in July 2020, it offered a glimpse of the data used
to train the large language model. According to a technical paper, the
generative text engine was built using millions of scraped web pages, Reddit
posts, books, and other sources. Some of the personal information you share
about yourself online is swept up in this data. That data is now causing
problems for OpenAI.

On March 31, Italy’s data protection authority issued a temporary emergency
decision ordering OpenAI to stop using the personal data of millions of
Italians that is included in its training data. According to the regulator,
the Garante per la Protezione dei Dati Personali, OpenAI does not have the
legal right to use individuals’ personal information in ChatGPT. In response,
OpenAI has blocked access to its chatbot in Italy while it responds to the
authority’s questions and looks into the matter further.

The action, the first by a Western regulator against ChatGPT, highlights the
privacy concerns surrounding the development of massive generative AI models,
which are frequently trained on enormous amounts of internet data. Just as
artists and media companies have complained that generative AI developers used
their work without permission, the data regulator is now making the same claim
about people’s personal information.

Similar decisions could follow across Europe. Data
authorities in France, Germany, and Ireland have contacted the Garante to
request additional information on its findings in the days since Italy
announced its investigation. According to Tobias Judin, the head of
international at Norway’s data protection authority, which is keeping track of
developments, “if the business model has just been to scrape the internet
for whatever you could find, then there might be a really significant issue
here.” Judin continues, “It raises questions about who can use the
tools legally if a model is based on data that may have been unlawfully
collected.”

The setback to OpenAI from Italy also comes at a time when large AI models are
under growing scrutiny. On March 29, technology leaders called for a pause in
the development of systems like ChatGPT, citing concerns about their potential
consequences. Judin says the Italian decision highlights more immediate
concerns. In essence, Judin argues, “we’re seeing that current AI development
may have a significant shortcoming.”

ChatGPT banned in Italy over privacy concerns

Europe’s GDPR rules, which govern how organisations collect, store, and use
people’s personal data, protect the information of more than 400 million
people across the continent. The term
“personal information” refers to information that can be used to
identify an individual, which can range from that individual’s name to their IP
address. In contrast to the hodgepodge of state-level privacy laws in the US,
the GDPR provides protections even if people’s personal information is publicly
accessible online. In other words, just because something is public doesn’t
mean you may take it and use it any way you choose.

According to Italy’s Garante, ChatGPT has four problems under GDPR: OpenAI
does not have age controls to stop people under the age of 13 from using the
text generation system; the system can provide inaccurate information about
people; and people have not been told that their data was collected. The
fourth argument, which is arguably the most crucial, is that there is “no
legal basis” for collecting people’s personal information in the enormous
volumes of data used to train ChatGPT.

“The Italians have called their bluff,” says Lilian Edwards, a professor of
law, innovation, and society at Newcastle University in the UK. “In the EU, it
did seem pretty obvious that this was a violation of data protection law.”

In general, to collect and use a person’s personal information under GDPR, a
company must rely on one of six legal bases, ranging from the person giving
their consent to the information being necessary as part of a contract. In
this situation, according to Edwards, there are essentially only two options:
obtaining people’s consent, which OpenAI did not do, or arguing that it has
“legitimate interests” in using people’s data, which is “very difficult,”
according to Edwards. The Garante tells WIRED that it considers this defence
“inadequate.”

OpenAI’s privacy policy does not specifically state its legal basis for using
people’s personal information in training data, although it says the company
relies on “legitimate interests” when “developing” its services. OpenAI did
not respond to WIRED’s request for comment. Unlike with GPT-3, OpenAI has not
disclosed any information about the training data that went into ChatGPT, and
GPT-4 is believed to be far larger.

GPT-4’s technical paper, however, contains a section on privacy, which states
that its training data may include “publicly available personal information”
from several sources. According to the paper, OpenAI takes measures to protect
people’s privacy, such as “fine-tuning” models to stop people from requesting
personal information and removing people’s information from training data
“where feasible.”

According to Jessica Lee, a partner at the law firm Loeb &
Loeb, “how to collect data lawfully for training data sets for use in
everything from just regular algorithms to some really sophisticated AI is a
critical issue that needs to be solved now. We’re kind of on the tipping point
for this sort of technology taking over.”

The Italian regulator’s move, which also targets the Replika chatbot, may be
the first of several cases examining OpenAI’s data practices. GDPR allows
companies with a base in Europe to nominate one country to handle all of their
complaints; Ireland, for example, deals with Google, Twitter, and Meta.
However, since OpenAI doesn’t have a base in Europe, every individual country
can open complaints against it under GDPR.

Model Data

OpenAI is not alone. According to experts, many of the concerns raised by the
Italian authority are likely to cut to the heart of all development of machine
learning and generative AI systems. Although the EU is drafting legislation
for artificial intelligence, so far there has been relatively little pushback
against the development of machine learning systems on privacy grounds.

According to Elizabeth Renieris, a senior research associate at Oxford’s
Institute for Ethics in AI who writes about data practices, “there is this rot
at the very foundations of the building blocks of this technology, and I think
that’s going to be very hard to cure.” She points out that many of the data
sets used to train machine learning systems have existed for years, and it is
likely that little thought was given to privacy when they were assembled.

According to Renieris, this data ultimately makes its way into something like
GPT-4 through layering and a complex supply chain. “No real form of data
protection by design or default has ever existed.” In 2022, the creators of
one popular image database, which has been used to train AI models for a
decade, suggested that images of people’s faces in the data set should be
obscured.

Privacy laws in Europe and California allow individuals to ask for information
to be erased, or corrected where it is inaccurate. However, it might be
difficult to remove information from an AI system that is incorrect or that
someone doesn’t want to be there, especially if the data’s sources are
unknown. Renieris and Edwards both question whether GDPR will be able to
address this in the long run, including upholding people’s rights. “There is
no idea as to how you do that with these very large language models,” says
Edwards from Newcastle University. They are not prepared for it, she adds.

There is at least one relevant precedent: the US Federal Trade Commission
ordered the organisation formerly known as Weight Watchers to delete
algorithms derived from data it was not authorised to use. With increased
scrutiny, such orders could become more common. According to Judin from
Norway’s data regulator, “depending, obviously, on the technical
infrastructure, it may be difficult to fully clear your model of all of the
personal data that was used to train it.” If a model had been trained on
personal data that was collected unlawfully, that would effectively mean you
might not be able to use the model.

Privacy and generative AI tools

Generative AI tools like ChatGPT collect and process vast volumes of personal
data, which raises concerns about student and user privacy and data security.
For instance, take a look at ChatGPT’s privacy policy, which permits the
company to access any information provided to it. (For more information on how
OpenAI may use data submitted to it, see their FAQ.)

There is a chance that this data will be misused or
compromised in a data breach, or that it will be utilised for nefarious or
criminal purposes.

If you decide to use ChatGPT in your class, one way to address these privacy
concerns is to invite students to sign up using anonymous email addresses and
to give them the option to opt out.

Additional issues of equity, morality, and accessibility

Many AI technologies are now available for free, but this
could change in the future. If you choose to use these tools in your
assignments, think about including options that all students may use.

It is also worth avoiding activities that would disproportionately advantage
students who can afford expensive AI tools.

The limitations and potential biases of AI-generated content should be made
clear to students, and they should be urged to use it appropriately. AI tools
are only as objective as the data they are trained on. If the training data
contains bias, the outputs the AI produces will be biased too. In this way, AI
programmes can perpetuate the biases found in their original training data.

As well as perpetuating bias, AI tools can spread false information. Depending
on the data they were trained on, AI systems may produce content that is
false, damaging, or misleading, thereby spreading or creating disinformation.

Text-generating AI tools, like ChatGPT, produce text outputs based on the
enormous corpus of texts that served as their training set. Because of this,
it can be difficult to determine who wrote the texts produced by AI tools, who
is responsible for them, and whether anyone is accountable for the outcomes.

Given the diversity of tools that are currently available and those still
being developed, it’s crucial to keep in mind that not every AI tool has been
designed to be usable by everyone.