Privacy is an essential part of the Web [ETHICAL-WEB]. This document provides definitions for privacy and related concepts that are applicable worldwide. It also provides a set of privacy principles that should guide the development of the Web as a trustworthy platform. Users of the Web would benefit from a stronger relationship between technology and policy, and this document is written to work with both.
This document is a Draft Finding of the Technical Architecture Group (TAG). It was prepared by the Web Privacy Principles Task Force, which was convened by the TAG. Publication as a Draft Finding does not imply endorsement by the TAG or by the W3C Membership.
This draft does not yet reflect the consensus of the TAG or the task force and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to cite this document as anything other than a work in progress.
It will continue to evolve and the task force will issue updates as often as needed. At the conclusion of the task force, the TAG intends to adopt this document as a Finding.
Privacy is an essential value of the Web ([ETHICAL-WEB], [design-principles]). In everyday life, people typically find it easy to assess whether a given flow of information is a violation of privacy or not [NYT-PRIVACY]. However, in the digital space, users struggle to understand how their data may be moved between contexts and how that may affect them. This is particularly true if they may be affected at a much later time and in completely different situations. Some actors are using this confusion to extract and exploit personal data at scale.
The goal of this document is to define principles that may prove useful in developing technology and policy that relate to privacy and personal data.
Personal data is covered by legal frameworks and this document recognises that existing data protection laws take precedence for legal matters. However, because the Web is global, we benefit from having shared concepts to guide its evolution as a system built for its users [RFC8890]. A clear and well-defined view of privacy on the Web, informed by research, can hopefully help all the Web's participants in different legal regimes. Our shared understanding is that the law is a floor, not a ceiling.
This section provides a number of building blocks to create a shared understanding of privacy. Some of the definitions below build on top of the work in Tracking Preference Expression (DNT) [tracking-dnt].
A user (also person or data subject) is any natural person.
We define personal data as any information relating to a person such that:
Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information, including as being part of a group. Note that further considerations relating to groups are covered in the Collective Issues in Privacy section.
Data is pseudonymous when:
the identifiers used in the data are under the direct and exclusive control of the first party; and
when these identifiers are shared with a third party, they are made unique to that third party, such that if they are shared with more than one third party, those third parties cannot then match them up with one another; and
any third party receiving such identifiers is barred (eg. based on legal terms) from sharing them or the related data further; and
technical measures exist to prevent re-identification or the joining of different data sets involving these identifiers, notably against timing or k-anonymity attacks.
These requirements ensure that pseudonymous data is used under a minimum degree of governance, with technical and procedural means in place to maintain pseudonymity. Note that pseudonymity, on its own, is not sufficient to render data processing appropriate.
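The k-anonymity property referred to above can be checked mechanically. The following is a minimal sketch in Python, where the record layout, the choice of quasi-identifiers, and the threshold are illustrative assumptions rather than anything specified here:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs at least k times, so that no record can be singled out
    within a group smaller than k."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"age_band": "30-39", "postcode": "N1", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "N1", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "E2", "diagnosis": "flu"},
]

# The single "40-49"/"E2" record can be singled out, so this data
# set is not 2-anonymous over these quasi-identifiers.
result = is_k_anonymous(records, ["age_band", "postcode"], 2)
```

A check like this inspects only the data set itself; a timing attack sidesteps it entirely, which is why the principle above calls for technical measures beyond the data.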
A vulnerable person is a person who, at least in the context of the processing being discussed, is unable to exercise sufficient self-determination for any consent they may provide to be valid. This includes, for example, children, employees with respect to their employers, people in some situations of intellectual or psychological impairment, or refugees.
A party is an entity that a person can reasonably understand as a single "thing" they're interacting with. Uses of this document in a particular domain are expected to describe how the core concepts of that domain combine into a user-comprehensible party, and those refined definitions are likely to differ between domains.
The first party is a party with which the user intends to interact. Merely hovering over, muting, pausing, or closing a given piece of content does not constitute a user's intent to interact with another party, nor does the simple fact of loading a party embedded in the one with which the user intends to interact. In cases of clear and conspicuous joint branding, there can be multiple first parties. The first party is necessarily a data controller of the data processing that takes place as a consequence of a user interacting with it.
A third party is any party other than the user, the first party, or a service provider acting on behalf of either the user or the first party.
A service provider or data processor is considered to be the same party as the entity contracting it to perform the relevant processing if it:
A data controller is a party that determines the means and purposes of data processing. Any party that is not a service provider is a data controller.
The Vegas Rule is a simple implementation of privacy in which "what happens with the first party stays with the first party." Put differently, it describes a situation in which the first party is the only data controller. Note that, while enforcing the Vegas Rule provides a rule of thumb describing a necessary baseline for appropriate data processing, it is not always sufficient to guarantee appropriate processing since the first party can process data inappropriately.
A party processes data if it carries out operations on personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, sharing, dissemination or otherwise making available, selling, alignment or combination, restriction, erasure or destruction.
A party shares data if it provides it to any other party. Note that, under this definition, a party that provides data to its own service providers is not sharing it.
A party sells data when it shares it in exchange for consideration, monetary or otherwise.
The purpose of a given processing of data is an anticipated, intended, or planned outcome of this processing which is achieved or aimed for within a given context. A purpose, when described, should be specific enough to be actionable by someone familiar with the relevant context (ie. they could independently determine means that reasonably correspond to an implementation of the purpose).
The means are the general method of data processing through which a given purpose is implemented, in a given context, considered at a relatively abstract level and not necessarily all the way down to implementation details. Example: the user will have their preferences restored (purpose) by looking up their identifier in a preferences store (means).
A context is a physical or digital environment that a person interacts with for a purpose of their own (and that they typically share with other people who interact with the same environment).
A context can be further described through:
A context carries context-relative informational norms that determine whether a given data processing is appropriate (if the norms are adhered to) or inappropriate (when the norms are violated). A norm violation can be for instance the exfiltration of personal data from a context or the lack of respect for transmission principles. When norms are respected in a given context, we can say that contextual integrity is maintained; otherwise that it is violated ([PRIVACY-IN-CONTEXT], [PRIVACY-AS-CI]).
We define privacy as a right to appropriate data processing. A privacy violation is, correspondingly, inappropriate data processing [PRIVACY-IN-CONTEXT].
Note that a first party can comprise multiple contexts if it is large enough that people would interact with it for more than one purpose. Sharing personal data across contexts is, in the overwhelming majority of cases, inappropriate.
Your cute little pup uses Poodle Naps to find comfortable places to snooze, and Poodle Fetch to locate the best sticks. Napping and fetching are different contexts with different norms, and sharing data between these contexts is a privacy violation despite the shared ownership of Naps and Fetch by the Poodle conglomerate.
Colloquially, tracking is understood to be any kind of inappropriate data collection.
Additionally, privacy labour is the practice of having a person carry out the work of ensuring that data processing of which they are the subject is appropriate, instead of having the parties be responsible for that work, as they properly should be.
The user agent acts as an intermediary between a user and the web. The user agent is not a context because it is expected to align fully with its user and operate exclusively in that person's interest. It is not the first party. The user agent serves the user as a trustworthy agent: it always puts the user's interest first. On some occasions, this can mean protecting the user from themselves by preventing them from carrying out a dangerous decision, or by slowing down the user in their decision. For example, the user agent will make it difficult for the user to connect to a site if it can't verify that the site is authentic. It will check that the user really intends to expose a sensitive device to a page. It will prevent the user from consenting to the permanent monitoring of their behaviour. Its user agent duties include [TAKING-TRUST-SERIOUSLY]:
These duties ensure the user agent will care for the user. It is important to note that there is a subtle difference between care and data paternalism. Data paternalism claims to help in part by removing agency ("don't worry about it, so long as your data is with us it's safe, you don't need to know what we do with it, it's all good because we're good people") whereas care aims to support people by enhancing their agency and sovereignty.
In academic research, this relationship with a trustworthy agent is often described as "fiduciary" [FIDUCIARY-UA].
A person's identity is the set of characteristics that define them. Their identity in a context is the set of characteristics they present to that context. People frequently present different identities to different contexts, and also frequently share an identity among several contexts.
Cross-context recognition is the act of recognising that an identity in one context is the same person as an identity in another context. Cross-context recognition can at times be appropriate but anyone who does it needs to be careful not to apply the norms of one context in ways that violate the norms around use of information acquired in a different context. (For example, if you meet your therapist at a cocktail party, you expect them to have rather different discussion topics with you than they usually would, and possibly even to pretend they do not know you.) This is particularly true for vulnerable people as recognising them in different contexts may force their vulnerability into the open.
In computer systems and on the Web, an identity seen by a particular website is typically assigned an identifier of some type, which makes it easier for an automated system to store data about that user.
To do this, user agents have to make some assumptions about the borders between contexts. By default, user agents define a machine-enforceable context or partition as:
Even though this is the default, user agents are free to restrict this context as their users need. For example, some user agents may help their users present different identities to subdivisions of a single site.
User agents should prevent their user from being recognized across machine-enforceable contexts unless the user intends to be recognized. This is a "should" rather than a "must" because there are many cases where the user agent isn't powerful enough to prevent recognition. For example, if two or more services that a user needs to use insist that the user share a difficult-to-forge piece of their identity in order to use the services, it's the services behaving inappropriately rather than the user agent.
If a site includes multiple contexts whose norms indicate that it's inappropriate to share data between the contexts, the fact that those distinct contexts fall inside a single machine-enforceable context doesn't make sharing data or recognizing identities any less inappropriate.
A person's autonomy is their ability to make decisions of their own volition, without undue influence from other parties. People have limited intellectual resources and time with which to weigh decisions, and by necessity rely on shortcuts when making decisions. This makes their privacy preferences malleable [PRIVACY-BEHAVIOR] and susceptible to manipulation [DIGITAL-MARKET-MANIPULATION]. A person's autonomy is enhanced by a system or device when that system offers a shortcut that aligns more with what that person would have decided given arbitrary amounts of time and relatively unfettered intellectual ability; and autonomy is decreased when a similar shortcut goes against decisions made under ideal conditions.
Affordances and interactions that decrease autonomy are known as dark patterns. A dark pattern does not have to be intentional; the deceptive effect alone is sufficient to define one [DARK-PATTERNS], [DARK-PATTERN-DARK].
Because we are all subject to motivated reasoning, the design of defaults and affordances that may impact user autonomy should be the subject of independent scrutiny. Implementers are enjoined to be particularly cautious to avoid slipping into data paternalism.
Given the sheer volume of potential data-related decisions in today's data economy, complete informational self-determination is impossible. This fact, however, should not be confused with the contention that privacy is dead. Careful design of our technological infrastructure can ensure that users' autonomy as pertaining to their own data is enhanced through appropriate defaults and choice architectures.
In the 1970s, the Fair Information Practices or FIPs were elaborated in support of individual autonomy in the face of growing concerns with databases. The FIPs assume that there is sufficiently little data processing taking place that any person will be able to carry out sufficient diligence to enable autonomy in their decision-making. Since they entirely offload the privacy labour to users and assume perfect, unfettered autonomy, the FIPs do not forbid specific types of data processing but only place them under different procedural requirements. Such an approach was perhaps appropriate for parties processing data in the 1970s.
One notable issue with procedural approaches to privacy is that they tend to have the same requirements in situations where the user finds themselves in a significant asymmetry of power with a party — for instance the user of an essential service provided by a monopolistic platform — and those where user and parties are very much on equal footing, or even where the user may have greater power, as is the case with small businesses operating in a competitive environment. These approaches further fail to consider cases in which one party may coerce other parties into facilitating its inappropriate practices, as is often the case with dominant players in advertising [CONSENT-LACKEYS] or in content aggregation [CAT].
Reference to the FIPs survives to this day. They are often referenced as transparency and choice, which, in today's digital environment, is often a strong indication that inappropriate processing is being described.
Different procedural mechanisms exist to enable people to control the processing done to their data. Mechanisms that increase the number of purposes for which their data is being processed are referred to as opt-in or consent; mechanisms that decrease this number of purposes are known as opt-out.
When deployed thoughtfully, these mechanisms can enhance people's autonomy. Often, however, they are used as a way to avoid putting in the difficult work of deciding which types of processing are appropriate and which are not, offloading privacy labour to the user.
Privacy regulatory regimes are often anchored at extremes: either they default to allowing only very few strictly essential purposes such that many parties will have to resort to consent, habituating people to ignore legal prompts and incentivising dark patterns, or, conversely, they default to forbidding only very few, particularly egregious purposes, such that people will have to perform the privacy labour to opt out in every context in order to produce appropriate processing.
An approach that is more aligned with the expectation that the Web should provide a trustworthy, person-centric environment is to establish a regime consisting of three privacy tiers:
When an opt-out mechanism exists, it should preferably be complemented by a global opt-out mechanism. The function of a global opt-out mechanism is to rectify the automation asymmetry whereby service providers can automate data processing but people have to take manual action. A good example of a global opt-out mechanism is the Global Privacy Control [GPC].
Conceptually, a global opt-out mechanism is an automaton operating as part of the user agent, which is to say that it is equivalent to a robot that would carry out the user's bidding by pressing an opt-out button with every interaction that the user has with a site, or more generally conveys an expression of the user's rights in a relevant jurisdiction. (For instance, under [GDPR], the user may be conveying objections to processing based on legitimate interest or the withdrawal of consent to specific purposes.) It should be noted that, since a global opt-out signal is reaffirmed automatically with every user interaction, it will take precedence in terms of specificity over any manner of blanket consent that a site may obtain, unless that consent is directly attached to an interaction (eg. terms specified on a form upon submission).
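As a concrete illustration, [GPC] conveys this signal through a `Sec-GPC` request header whose only defined field value is "1". A minimal, framework-agnostic sketch in Python of a server honouring it (the surrounding request handling is assumed):

```python
def honors_global_opt_out(headers):
    """True when the Global Privacy Control request header is
    present and set to "1", the only value the proposal defines."""
    return headers.get("Sec-GPC") == "1"

# A request carrying the signal:
opted_out = honors_global_opt_out({"Sec-GPC": "1", "Accept": "text/html"})
```

Because the signal is reaffirmed automatically with every request, a site can evaluate it in the same place it would evaluate any per-interaction consent terms.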
When designing Web technology, we naturally pay attention to potential impacts on the person using the Web through their user agent. In addition to potential individual harms we also pay heed to collective effects that emerge from the accumulation of individual actions as influenced by entities and the structure of technology.
Note that in evaluating impact, we deliberately ignore what implementers or specifiers may have intended and only focus on outcomes. This framing is known as POSIWID, or "the Purpose Of a System Is What It Does".
The collective problem of privacy is known as legibility. Legibility concerns population-level data processing that may impact populations or individuals, including in ways that people could not control even under the optimistic assumptions of the FIPs. For example, based on population-level analysis, a company may know that site.example is predominantly visited by people of a given race or gender, and decide not to run its job ads there. Visitors to that page are implicitly having their data processed in inappropriate ways, with no way to discover the discrimination or seek relief [DEMOCRATIC-DATA].
What we consider is therefore not just the relation between the people who expose themselves and the entities that invite that disclosure [RELATIONAL-TURN], but also between the people who expose themselves and those who do not but may find themselves recognised as such indirectly anyway. One key understanding here is that such relations may persist even when data is permanently de-identified.
Legibility practices can be legitimate or illegitimate depending on the context and on the norms that apply in that context. Typically, a legibility practice may be legitimate if it is managed through an acceptable process of collective governance. For example, it is often considered legitimate for a government, under the control of its citizens, to maintain a database of license plates for the purpose of enforcing the rules of the road. It would be illegitimate to observe the same license plates near places of worship to build a database of religious identity.
Legibility is often used to order information about the world. This can notably create problems of reflexivity and of autonomy.
Problems of reflexivity occur when the ordering of information about the world used to produce legibility finds itself changing the way in which the world operates. This can produce self-reinforcing loops that can have deleterious effects both individual and collective [SEEING-LIKE-A-STATE].
Issues of autonomy occur depending on the manner in which legibility is implemented. When legibility is used to order the world following rules set by the user or following methods subject to public scrutiny and governance models with strong checks and balances (such as a newspaper's editorial decisions), then it will enhance user autonomy and tend to be legitimate. When it is done in the user's stead and without governance, it decreases user autonomy and tends to be illegitimate.
Data governance refers to the rules and processes for how data is processed in any given context. How data is governed describes who has power to make decisions over data and how [DATA-FUTURES-GLOSSARY].
In general, collective issues in data require collective solutions. The proper goal of data governance at the standards-setting level is the development of structural controls in user agents and the provision of institutions that can handle population-level problems in data. Governance will often struggle to achieve its goals if it works primarily by increasing individual control over data. A collective approach reduces the cost of control.
Collecting data at large scales can have significant pro-social outcomes. Problems tend to emerge when entities take part in dual-use collection in which data is processed for collective benefit but also for self-dealing purposes that may degrade welfare. The self-dealing purposes will be justified as bankrolling the pro-social outcomes, which, absent collective oversight, cannot be considered to support claims to legitimacy for such legibility. It is vital for standards-setting organisations to establish not just purely technical devices but techno-social systems that can govern data at scale.
User agents should attempt to defend their users from a variety of high-level threats or attacker goals, described in this section.
These threats are an extension of the ones discussed by [RFC6973].
These threats combine into the particular concrete threats we want web specifications to defend against, described in subsections here:
Contributes to surveillance, correlation, and identification.
As described in § 2.6 Identity on the Web, cross-context recognition can sometimes be appropriate, but users need to be able to control when websites do it as much as possible.
Partitions are separated in two ways that lead to distinct kinds of user-visible recognition. When their divisions between different sites are violated, that leads to § 3.1.2 Unwanted cross-site recognition. When a violation occurs at their other divisions, for example between different browser profiles or at the point a user clears their cookies and site storage, that leads to § 3.1.1 Same-site recognition.
The web platform offers many ways for a website to recognize that a user is using the same identity over time, including cookies, CacheStorage, and other forms of storage. This allows sites to save the user's preferences, shopping carts, etc., and users have come to expect this behavior in some contexts.
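A cookie round trip is the simplest of these mechanisms. The following sketch uses Python's standard http.cookies module; the identifier and lifetime are hypothetical:

```python
from http.cookies import SimpleCookie

# On a first visit, the site mints an identifier and sends it in a
# Set-Cookie header.
minted = SimpleCookie()
minted["prefs_id"] = "a1b2c3"                      # hypothetical identifier
minted["prefs_id"]["max-age"] = 60 * 60 * 24 * 30  # keep for 30 days
set_cookie_header = minted["prefs_id"].OutputString()

# On a later visit, the browser returns the value in the Cookie
# header, and the site recognises the same identity.
returned = SimpleCookie()
returned.load("prefs_id=a1b2c3")
same_identity = returned["prefs_id"].value == "a1b2c3"
```

This is the expected, often desirable case; the harms discussed next arise when recognition crosses a boundary the user expected to hold.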
A privacy harm occurs if the user reasonably expects that they'll be using a different identity on a site, but the site discovers and uses the fact that the two or more visits probably came from the same user anyway.
User agents can't, in general, determine exactly where intra-site context boundaries are, or how a site allows a user to express that they intend to change identities, so they're not responsible for enforcing that sites actually separate user identities at those boundaries. The principle here instead requires separation at partition boundaries.
Cross-partition recognition is generally accomplished by either "supercookies" or browser fingerprinting.
Supercookies occur when a browser stores data for a site but makes that data more difficult to clear than other cookies or storage. Fingerprinting Guidance § Clearing all local state discusses how specifications can help browsers avoid this mistake.
Fingerprinting consists of using attributes of the user's browser and platform that are consistent between two or more visits and probably unique to the user.
The attributes can be exposed as information about the user's device that is otherwise benign (as opposed to § 3.2 Sensitive information disclosure). For example:
See [fingerprinting-guidance] for how to mitigate this threat.
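To make the mechanism concrete, the sketch below shows how individually benign attributes can combine into a stable, probably-unique identifier; the attribute names and values are illustrative:

```python
import hashlib

def fingerprint(attributes):
    """Hash a canonical serialisation of browser attributes. Each
    attribute alone is low-entropy, but the combination is often
    unique to a user and identical across visits."""
    canonical = "|".join(f"{key}={value}" for key, value in sorted(attributes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

visit_1 = fingerprint({"timezone": "Europe/Paris", "screen": "2560x1440",
                       "language": "fr-FR", "fonts": "Arial;Fira Code"})
visit_2 = fingerprint({"timezone": "Europe/Paris", "screen": "2560x1440",
                       "language": "fr-FR", "fonts": "Arial;Fira Code"})
# visit_1 == visit_2: the user is recognised without any stored state,
# so clearing cookies and site storage does not help.
```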
A privacy harm occurs if a site determines with high probability and uses the fact that a visit to that site comes from the same person as another visit to a different site, unless the person could reasonably expect the sites to discover this. Traditionally, sites have accomplished this using cross-site cookies, but it can also be done by having a user navigate to a link that has been decorated with a user ID, collecting the same piece of identifying information on both sites, or by correlating the timestamps of an event that occurs nearly-simultaneously on both sites.
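Link decoration, one of the mechanisms above, takes only a few lines; the parameter name and URLs here are hypothetical:

```python
from urllib.parse import parse_qs, urlencode, urlparse

def decorate(url, user_id):
    """Append an identifier to an outbound link so that the
    destination site can join the visit to the same person."""
    separator = "&" if urlparse(url).query else "?"
    return url + separator + urlencode({"uid": user_id})

outbound = decorate("https://other.example/article", "user-123")
# The destination site reads the identifier straight back off the URL,
# linking this visit to the identity held by the referring site.
received = parse_qs(urlparse(outbound).query).get("uid")
```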
Contributes to correlation, identification, secondary use, and disclosure.
Many pieces of information about a user could cause privacy harms if disclosed. For example:
A particular piece of information may have different sensitivity for different users. Language preferences, for example, might typically seem innocent, but also can be an indicator of belonging to an ethnic minority. Precise location information can be extremely sensitive (because it's identifying, because it allows for in-person intrusions, because it can reveal detailed information about a person's life) but it might also be public and not sensitive at all, or it might be low-enough granularity that it is much less sensitive for many users.
When considering whether a class of information is likely to be sensitive to users, consider at least these factors:
Issue(16): This description of what makes information sensitive still needs to be refined.
Contributes to surveillance, correlation, identification, and singling-out / discrimination.
Unexpected profiling occurs when a site is able to learn attributes or characteristics of a person that a) the site visitor did not intend the site to learn, and b) the site visitor could not reasonably anticipate the site being able to learn.
Profiling contributes to, but is distinct from, other privacy risks discussed in this document. For example, unexpected profiling may contribute to § 3.1.1 Same-site recognition, by adding stable and semi-identifying information that can contribute to browser fingerprinting. Unexpected profiling is distinct from same-site recognition though, in that a person may wish to not share some kinds of information about themselves even in the presence of guarantees that such information will not lead to them being re-identified.
Similarly, unexpected profiling is related to § 3.2 Sensitive information disclosure, but the former is a superset of the latter; all cases of unexpected sensitive information disclosure are examples of unexpected profiling, but Web users may have attributes or characteristics that are not universally thought of as "sensitive" yet which they nevertheless do not wish to share with the sites they visit. People may wish not to share these "non-sensitive" characteristics for a variety of reasons (e.g., a person may worry that their idea of what counts as "sensitive" differs from others', might be ashamed or uncomfortable about a character trait, or might simply not wish to be profiled).
Profiling occurs for many reasons. It can be used to facilitate price discrimination or offer manipulation, to make inferences about what products or services users might be more likely to purchase, or, more generally, for a site to learn attributes that the Web user does not intend to share. Unexpected profiling can also contribute to feelings of powerlessness and loss of agency.
A privacy harm occurs if a site learns information about the user that the user reasonably expected the site would not be able to learn, regardless of whether that information aids (re)identification or is from a sensitive category of information (however defined).
Peter is a furry. Despite knowing that there are thousands of other furries on the internet, despite using a browser with robust browser fingerprinting protections, and despite the growing cultural acceptance of furries, Peter does not want (most) sites to learn about, or personalize content around, his furry interest.
Privacy harms don't always come from a site learning things. For example, it is intrusive for a site to
if the user doesn't intend for it to do so.
Contributes to misattribution.
For example, a site that sends SMS without the user's intent could cause them to be blamed for things they didn't intend.