Thinking Freely with Nita Farahany


Data Privacy in an AI World (Inside my AI Law & Policy Class #18)

Nita Farahany
Nov 03, 2025

8:55 a.m., Monday. Welcome back to Professor Farahany’s AI Law & Policy Class.

Stop reading (yes, already), and pull out your phone or laptop. Now go to your past Google search history (or just Google “my activity” and click Web & App Activity). If you’re logged into Google, click “Manage Activity” to see the actual data.

I’ll wait.

I tried this before class. And then I did something CRAZY that I DO NOT want you to do with your data: I printed my Google search history (as a PDF), uploaded it to Claude, and asked it to analyze my search history and build an ad targeting profile from that data. Here is what it generated:

XXX-net-worth (yes, I hid what it said from you) academic professional (Duke University-affiliated, technology law/policy specialist), primary residence Durham, NC with investment property in Blowing Rock, North Carolina mountain region.
Income indicators: xxx-xxxx range based on multiple wealth management banking relationships (JPMorgan, Citibank, First Citizens), secondary property ownership, and absence of budget-constraint signals despite price-comparison behavior.
Family unit with school-age children (advent calendars, family activities).
Behavioral fingerprint: Early-morning research pattern (4-6 AM productivity window), extensive pre-purchase investigation averaging 5-7 searches before decisions, high privacy awareness (researches own digital footprint, BeenVerified, Google Activity), mixture of professional research (AI agents, biometric banking, Microsoft Copilot data access) with personal interests (hiking trails, Blake Lively lawsuit, royal family).
Purchase psychology: Research-intensive but financially capable, values quality and specification compliance (“regulation” cornhole), evaluates DIY before outsourcing (tree removal services).
Targetable segments: “Stealth Wealth Academics,” “Privacy-Conscious High Earners,” “Mountain Property Owners,” “Research-Heavy Premium Purchasers.”
Recommended ad strategy: Education/evidence-based messaging, premium but not luxury positioning, family benefit angles, privacy-respecting frequency caps. Conversion opportunities: Financial planning services, educational technology, outdoor recreation equipment, home improvement services for secondary property.
Platform extension: LinkedIn professional content, academic databases, real estate platforms. Confidence score: 89%.

Was this accurate? Uncomfortably so. Including my regulation cornhole search (who wants a non-regulation board?!). Do I want you to know all of this about me? Not really, but I’ll do a lot for educational purposes.

Ok, your turn. But just scroll through your own Google Activity. Look at the searches, the patterns, the timestamps. Imagine you’re a data broker analyzing this information. What profile would you create? What could someone infer about your:

  • Income level and spending capacity?

  • Family status and living situation?

  • Health concerns or medical conditions?

  • Work schedule and employment situation?

  • Hobbies, interests, and lifestyle?

  • Upcoming major life changes?

This is a class about AI and data privacy. So really, don’t share this with an LLM (if you don’t remember our conversation about training data, go back to lectures 5 and 6). Right now, I just want you to observe what exists about you. Notice what can be inferred from the digital breadcrumbs you’ve left behind. (Two of the live students had turned off history, but everyone else had theirs on by default).


Today, we’re going to discover the three categories of data you can’t see—inferences AI makes that aren’t shown to you, data shared between AI agents without your knowledge, and synthetic “facts” AI creates about you that never existed but become your permanent digital record.

As Professor Ryan Calo told the U.S. Senate in July 2024: “AI is like Soylent Green: it’s made out of people.” Data about people fuels AI models. The incentives to scrape the internet and collect data indefinitely aren’t a side effect—they’re the entire business model. And as we’ll discover today, that business model has created three fundamental privacy problems that our legal system wasn’t designed to handle.

Consider this your official seat in my class—except you get to keep your job and skip the debt. Every Monday and Wednesday, you asynchronously attend my AI Law & Policy class alongside my Duke Law students. They’re taking notes. You should be, too. And remember that live class is 85 minutes long. Take your time working through this material.

As you’ll remember, we’ve already covered training data and copyright in Class #5 (The $1.5 Billion Question: How AI Learns from the World’s Words) and Class #6 (Training Data, Discovery Wars, and Who Gets Paid: When Courts Meet Code), and bias and fairness in Class #9 (When AI Discrimination Happens 1.1 Billion Times: The Scale of Algorithmic Bias). So this class is a different look at data privacy issues that AI makes different … if not worse. Ready to dive in?

I really think everyone should be aware of this. Share with your network if you agree.


I. The Three Privacy Problems—Discovery Through Experience

Take a piece of paper. Draw two lines to create three columns. Really do this. Don’t just imagine it. In the live class I handed out paper and asked the students to do the same. And it helped us see why privacy isn’t just for people who think they have “something to hide.”

Label them:

  • Column 1: What I Shared

  • Column 2: What AI Inferred

  • Column 3: What AI Generated About Me (That I Can’t See)

In Column 1, write three pieces of information you explicitly shared online this week. Like a photo you posted. A search you made. An appointment you scheduled. (Or draw from your Google history if you want.) Leave Columns 2 and 3 blank for now—we’ll fill those in together as we uncover each privacy problem.

Those three things you shared? They’ll likely generate thirty inferences you never consented to, which will then be used to create synthetic “facts” about you that you’ll never see but that will follow you forever. Those will become columns 2 and 3.

Problem 1: The Inference Explosion

Now fill in Column 2 based on what you wrote in Column 1. (List out all the possible inferences you think can be reliably and even unreliably made based on what you shared. My guess? You’ll underestimate it by a lot).

The Stanford HAI white paper, Rethinking Privacy in the AI Era, calls all of the data in this column “intimate inference from ambient data.” Let’s take a look at what that actually means.

Consider a gym selfie. You think you’re sharing: “I worked out today.” But AI can extract your exact gym location and typical workout times. It can estimate your income level from the gym brand visible in the background. Or generate a health consciousness score that insurance companies would pay dearly to access. It can map your social patterns by identifying who else works out there and when. Or infer your work schedule flexibility from the timestamp. Or guess your relationship status from whether a wedding ring is visible in the photo.

One photo. Six major inferences. Zero consent for any of them.
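
To make the first couple of those inferences concrete: the location and the workout time don’t even require a model looking at the pixels. Here is a minimal sketch, assuming a photo that still carries EXIF metadata (the filename and printed fields are illustrative, and many platforms strip this data on upload), of what a single image file can give away before any “AI” gets involved:

```python
# Pull the capture time and GPS coordinates straight out of a photo's EXIF
# metadata with Pillow. "gym_selfie.jpg" is a hypothetical file.
from PIL import Image, ExifTags

img = Image.open("gym_selfie.jpg")
exif = img.getexif()

taken_at = exif.get(306)            # tag 306 is DateTime: your workout window
gps_ifd = exif.get_ifd(0x8825)      # tag 0x8825 is the GPS sub-directory
gps = {ExifTags.GPSTAGS.get(tag, tag): value for tag, value in gps_ifd.items()}

print("Taken at:", taken_at)
print("Latitude:", gps.get("GPSLatitude"), gps.get("GPSLatitudeRef"))
print("Longitude:", gps.get("GPSLongitude"), gps.get("GPSLongitudeRef"))
```

Everything else on that list, the gym brand, the wedding ring, the social graph of who else is in frame, is one image recognition pass away.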

Or consider purchasing pregnancy vitamins. Target’s algorithm—the one that famously revealed a teenager’s pregnancy to her father through targeted coupons before she’d told her family—works by identifying patterns across about 25 products. When you buy unscented lotion in your second trimester, the system notes it. When you pick up calcium, magnesium, and zinc supplements in your first 20 weeks, it strengthens the inference. Large purses and bulk cotton balls add more data points. The algorithm becomes so precise it can estimate your due date within a two-week window. The system knows you’re pregnant before you tell your mother.


Mental health therapy appointments in your Google Calendar don’t just track your schedule. The AI can infer your mental health status, your workplace stress levels, your insurance utilization patterns. It can know you’re struggling before your best friend does. Or calculate the likelihood you’ll need medication. It can see when the therapy started, inferring the timing of your crisis.

Professor Ryan Calo told the U.S. Senate that these aren’t violations of traditional privacy law because you technically “consented” to data collection when you clicked “I agree” without reading the 47-page terms of service. But consider the asymmetry. Companies can infer intimate details while you have no comparable insight into their operations. They know you’re pregnant before you tell your mother. They know your marriage is failing before you admit it to yourself. They know you’re burning out before your boss notices.

The Mathematics of Inference

Look at your three pieces of shared information in Column 1. Now count what you wrote in Column 2. How many inferences did you identify? Five? Ten? Fifteen?

The reality may be worse. You shared three pieces of information. AI likely inferred approximately thirty attributes about you. The number of those inferences protected under current law? Zero.

The GDPR tries to address this through Article 22 on automated decision-making, but there’s a fundamental challenge. Models are black boxes that resist meaningful explanation. Even when companies must provide “meaningful information about the logic involved,” what does that mean for a 175-billion parameter model? It’s like asking a chef to explain how every molecule in your soup contributes to the taste.

The Pregnancy Prediction Timeline

Let’s make this inference explosion more visceral. Imagine these purchases over eight weeks:

Weeks 1-2 bring ginger tea, crackers, and unscented soap. The AI assigns 15% confidence—could just be flu or stomach issues.

Weeks 3-4 add prenatal vitamins (“just for hair health,” you tell the cashier) and comfortable flats. Confidence jumps to 45%—those vitamins are a strong signal.

Weeks 5-6 include “What to Expect When You’re Expecting” (“gift for a friend,” you insist) and maternity jeans (“they’re just so comfy!”). Confidence hits 78%—the book plus clothes combination is nearly definitive.

Weeks 7-8 bring cocoa butter lotion, body pillow, and calcium supplements. Confidence reaches 94%—the full pattern matches the pregnancy profile perfectly.

The critical moment? Somewhere between Week 4 and Week 6, the AI “knows” you’re pregnant with high confidence—likely before you’ve told family, your employer, or even confirmed it yourself.

“But what if I really was just buying vitamins for hair health?” The truth is, it doesn’t matter. The AI has decided, and that inference is now part of your permanent digital record.
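
If you want to see how a confidence number like that can emerge from a pile of individually innocent purchases, here is a toy sketch. The real retail models are proprietary; these product weights are invented (and tuned so the output roughly tracks the illustrative trajectory above), but the mechanism, small pieces of evidence compounding in log-odds, is the general idea:

```python
import math

# Toy evidence accumulator: each purchase nudges the log-odds that the shopper
# is pregnant. All weights are invented for illustration.
PRIOR_LOG_ODDS = math.log(0.02 / 0.98)   # assume ~2% base rate among shoppers

WEIGHTS = {
    "ginger tea": 0.9, "crackers": 0.4, "unscented soap": 0.9,
    "prenatal vitamins": 1.2, "comfortable flats": 0.3,
    "pregnancy book": 0.9, "maternity jeans": 0.6,
    "cocoa butter lotion": 0.5, "body pillow": 0.5, "calcium supplements": 0.45,
}

def confidence(purchases: list[str]) -> float:
    """Turn accumulated log-odds back into a probability."""
    log_odds = PRIOR_LOG_ODDS + sum(WEIGHTS.get(item, 0.0) for item in purchases)
    return 1 / (1 + math.exp(-log_odds))

basket: list[str] = []
timeline = {
    2: ["ginger tea", "crackers", "unscented soap"],
    4: ["prenatal vitamins", "comfortable flats"],
    6: ["pregnancy book", "maternity jeans"],
    8: ["cocoa butter lotion", "body pillow", "calcium supplements"],
}
for week, items in timeline.items():
    basket += items
    print(f"Week {week}: confidence {confidence(basket):.0%}")
# Week 2: ~16%, Week 4: ~45%, Week 6: ~79%, Week 8: ~94%
```

No single item is conclusive; the inference lives entirely in the combination.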

Now look at your Column 2 again. Did you capture everything? Did you think about how those inferences combine? How they create a profile of you that you never authorized?

Problem 2: The Agent Cascade

The Stanford HAI paper identifies how existing privacy frameworks assume human-to-company interactions, not AI-to-AI negotiations. This is like maritime law trying to regulate air travel—the fundamental assumptions don’t map to reality.

Let’s trace what happens when you make a simple request: “Book me a therapist appointment.”

Your AI assistant accesses your calendar to infer availability patterns, checks your location history to determine convenient locations, pulls your insurance data to verify coverage, and scans your past health searches to understand your needs. That’s just the beginning.

Your AI then contacts therapy platform AIs and shares your insurance ID—from which they infer your income level. It shares your preferred times, revealing your work flexibility. It indicates location needs, exposing your mobility and transportation options. It communicates urgency level, suggesting crisis severity.

The platform AIs then share information with insurance AIs for pre-authorization and risk scoring, with scheduling AIs for optimizing provider calendars, with billing AIs for determining rates and payment likelihood, and with analytics AIs for adding you to population health models.

Finally, all systems update their models. Your “mental health risk score” becomes permanent. Your “treatment compliance probability” gets calculated. Your “insurance utilization pattern” affects future rates. Your employer’s “population health metrics” now include your data—aggregated, yes, but still derived from your individual appointment.
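
To make the cascade easier to hold in your head, here is a minimal sketch with hypothetical agents. Every name and attribute below is invented; the point is the shape of the thing, one request accumulating inferences at every hop, with no step where you are asked again:

```python
from dataclasses import dataclass, field

@dataclass
class SharedProfile:
    request: str
    attributes: dict[str, str] = field(default_factory=dict)
    seen_by: list[str] = field(default_factory=list)

def personal_assistant(profile: SharedProfile) -> SharedProfile:
    # Your own AI enriches the request from calendar, location, and benefits data.
    profile.seen_by.append("assistant_ai")
    profile.attributes |= {
        "availability_pattern": "weekday mornings free",
        "preferred_area": "near home and office",
        "insurance_plan": "employer PPO",
    }
    return profile

def therapy_platform(profile: SharedProfile) -> SharedProfile:
    # The platform's AI adds its own inferences before passing data onward.
    profile.seen_by.append("platform_ai")
    profile.attributes |= {
        "income_band": "inferred from insurance plan",
        "urgency": "high (same-week request)",
    }
    return profile

def insurer(profile: SharedProfile) -> SharedProfile:
    # The insurer's AI updates a score you will never see.
    profile.seen_by.append("insurer_ai")
    profile.attributes |= {"mental_health_risk_score": "recalculated upward"}
    return profile

# One request, three hops, a growing profile nobody consented to in detail.
profile = SharedProfile(request="book me a therapist appointment")
for agent in (personal_assistant, therapy_platform, insurer):
    profile = agent(profile)

print(profile.seen_by)      # ['assistant_ai', 'platform_ai', 'insurer_ai']
print(profile.attributes)
```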


Consider the consent paradox. The GDPR requires consent to be “freely given, specific, informed and unambiguous.” But when you asked your AI to book an appointment, did you specifically consent to mental health risk scoring? To sharing with insurance algorithms? To inclusion in population health models? To permanent recording of your therapy-seeking behavior?

You consented to booking an appointment. You got a cascade of data sharing across multiple AI systems, each making additional inferences, each updating its models, each potentially sharing with other systems you’ll never know about.

Problem 3: The Synthetic Data Generator

Now we come to Column 3—and here’s where things get truly unsettling.

Write “Unknown” across the top of Column 3. Because unlike the inferences you could at least try to guess at, the synthetic data generated about you is fundamentally unknowable. You can’t see it. You can’t verify it. You can’t correct it. But it’s making decisions about your life.

The Stanford HAI paper calls this the “hallucination vs. memorization” problem—AI creating “facts” about you that never existed but become permanent in ways you’ll never see.

Try this experiment: Open ChatGPT or Claude right now. Have a three-minute conversation about your work. Then ask: “Write a professional bio for me based on our conversation.”

Watch what happens. The AI will confidently claim you have skills you never mentioned. It will invent projects you never worked on. It will attribute accomplishments you didn’t achieve. It will create specific details about your background that are plausible but false. And if that bio gets scraped and incorporated into other AI training data, those false “facts” become part of your permanent digital record.
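
If you would rather run the bio experiment programmatically than in the chat window, here is a sketch using the OpenAI Python SDK (the model name and prompts are illustrative; Claude’s SDK works the same way in spirit). The interesting part is the last step: compare the generated bio against what you actually said and count the credentials that appeared from nowhere.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

conversation = [
    {"role": "user", "content": "I teach a law school seminar and I'm writing "
                                "about privacy this semester."},
    {"role": "user", "content": "Write a professional bio for me based on our "
                                "conversation."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",            # illustrative model name
    messages=conversation,
)
bio = response.choices[0].message.content
print(bio)  # how many institutions, degrees, and projects did you never mention?
```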

The GDPR’s Article 16 gives you the right to correct inaccurate data. But can you correct something that was never “collected” but rather “generated”? Can you correct a rumor that hasn’t been spread yet but definitely will be?

The Stanford HAI paper points out that existing frameworks cannot easily police “inference-driven outputs,” especially when distinguishing between hallucination and training data memorization. Is the AI remembering something it learned about people like you, or is it just making things up? Does it matter if the result is the same—false “facts” that follow you forever?

Look at Column 3 on your paper. You wrote “Unknown” because that’s the only honest answer. You have no idea what synthetic data has been generated about you. You don’t know what scores you’ve been assigned. You don’t know what predictions have been made about your future behavior. You don’t know what false information has been created and is now circulating in AI models you’ll never interact with directly but that will make decisions about your loan applications, your job prospects, your insurance rates.

If AI generates false information about you that helps you—say, inflating your experience on that generated bio—should you have the same right to correction as when it hurts you? Why or why not? What does it mean for truth when synthetic data is indistinguishable from real data?


The Three Problems Synthesized

Before we examine how current law handles these issues, let’s make sure we understand what we’re facing:

First, Inference—AI knows things about you that you never shared, extracting intimate details from ambient data.

Second, Agent Cascade—AI systems negotiate with each other using your data in ways you can’t control or even see.

Third, Synthetic Data—AI creates “facts” about you that become your permanent digital record, regardless of accuracy.

Traditional privacy law assumes you’re protecting data you provided. It doesn’t account for data that’s inferred about you, data that cascades between AI systems, or data that’s synthetically generated about you.

II. Why Current Law Struggles—Three Scenarios

Let’s test existing legal frameworks against our three privacy problems. What happens when you try to use current law to protect yourself?

Scenario 1: Fighting Inferences Under the GDPR

The Situation: Microsoft Copilot has inferred you’re job hunting based on your document editing patterns. You’ve updated your resume seventeen times, created multiple versions of cover letters, and your “Projects” folder suddenly includes “Portfolio_FINAL_FINAL_really_final.pdf.” This inference affects your performance review.

Under the GDPR, you have several rights that seem promising. Article 15 gives you the right to access your personal data. Article 16 provides the right to rectification of inaccurate data. Article 17 grants the right to erasure. Article 22 offers the right not to be subject to solely automated decisions.

You invoke Article 15 to access the job-hunting inference. The company responds that the inference is their proprietary analytical output, not your personal data. The GDPR covers data “relating to” you, but their legal team argues the analysis is about patterns, not about you specifically.

You try Article 22, arguing this automated inference affects your performance review. The company points out that a human manager reviews all performance assessments. Human involvement—even if minimal—takes this outside Article 22’s scope, according to their interpretation of Recital 71.

You invoke Article 17 to delete the inference. The company agrees to delete your source data. But the model’s learned patterns from millions of users remain unchanged. They argue you can’t delete what was never stored as “your data.”

The question this raises: the GDPR was designed for data “provided directly by the data subject.” Does it apply to emergent inferences from AI systems that learn patterns across populations? Current legal interpretations suggest inferences aren’t “your data” under the GDPR, even when they’re about you. Is this the right interpretation? Should it be? What are the implications either way?

Scenario 2: The Agent Cascade Under U.S. State Law

The Situation: Your health AI shared your therapy-seeking patterns with insurance AIs, and your premiums mysteriously increased 30%.

California’s CCPA/CPRA gives you several rights. You have the right to know what personal information is collected, the right to delete it, the right to opt-out of “sale” of personal information, and the right to non-discrimination. The law also includes a proposed “risk-benefit test” for automated decision-making.

You invoke the opt-out right, demanding they stop selling your health data. The company responds that they don’t “sell” data. Under CPRA §1798.140, when your AI agent shares information to obtain services you requested, that’s considered service provision, not a sale.

You try the risk-benefit test, arguing the privacy risk outweighs benefits. The company explains that their assessment shows personalized therapy matching provides substantial benefits. The risk-benefit analysis is their internal process, they argue, not an individual right you can invoke.

Meanwhile, Colorado’s Privacy Act adds the right to opt-out of profiling for decisions with legal or significant effects. But by the time you know about the profiling, the agent cascade has already happened. Your data has flowed through multiple AI systems across states. California law? Colorado law? Which applies? Both? Neither?


State privacy laws create a patchwork of protections. The agent cascade crosses state lines instantly, but your rights don’t follow. Is federal legislation the answer? If so, what should it look like? Or are there advantages to state-level experimentation that we’d lose with federal preemption?

Scenario 3: Synthetic Data Under FTC Authority

The Situation: AI generates false “collaboration scores” about your work performance, rating you a 67/100.

The FTC’s Section 5 authority prohibits “unfair or deceptive acts or practices.” The agency has required “algorithmic disgorgement” in some cases—forcing companies to delete models trained on unlawfully obtained data. The FTC also investigates training data practices through civil investigative demands.

You file a complaint arguing the collaboration score is deceptive. The FTC considers: Did the company claim the score was factual? Did they deceive you about the AI’s capabilities?

The company responds that they clearly disclosed this is an AI-generated analytical metric, not a factual measurement. Users are informed it’s for guidance only, they argue. Their terms of service explain the limitations.

Section 5 was enacted in 1914 when “computer” meant a person who computed things. It assumes identifiable deceptive statements, not synthetic data generation. Is creating synthetic metrics inherently deceptive? Should it be? Courts haven’t decided. How should we think about this? What principles should guide us?

Understanding the Pattern

These three scenarios help us to see how existing privacy frameworks assume you’re protecting data you provided. They don’t account for data that’s inferred about you, data that cascades between AI systems, or data that’s synthetically generated about you.

The Stanford HAI paper describes this as a “privacy self-management” model built on notice-and-consent. This model assumes human comprehension at scale—but nobody reads 47-page privacy policies. It treats consent as binary rather than graduated or contextual. It ignores power asymmetries where take-it-or-leave-it isn’t really a choice. It can’t handle AI-to-AI interactions where there’s no human to notify or get consent from. And it doesn’t cover inferences or synthetic data, only “collected” data.

Privacy law developed for a world where you handed someone your information, and they filed it away. It’s like maritime law trying to regulate airplanes. Some principles transfer, but the fundamental assumptions don’t match the technology.

Which forces us to consider: is the problem that we’re applying old law to new technology? Or is the technology revealing fundamental gaps in how we think about privacy itself?

III. Proposed Solutions—Three Different Approaches

The Stanford HAI paper proposes three interventions that would fundamentally restructure how we approach privacy. These aren’t the only possible solutions, but they represent different strategic choices about where to intervene. Let’s examine what each would mean in practice.

Approach 1: Privacy By Default

The core idea is simple: flip the default from opt-out to opt-in. Instead of companies collecting everything unless you explicitly refuse, they collect nothing unless you explicitly request it.

Currently, every website collects by default. You must actively refuse each one. Refusal often means service denial. Cookie banners have trained us to click “Accept All” without thinking. The result? Meaningless “consent” at massive scale.

Under a privacy-by-default regime, collection would require explicit request. Services would need to function with minimal data. Collection would require justification beyond “because we can.” Technical standards like Global Privacy Control could communicate preferences automatically.
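
Here is what “communicate preferences automatically” can look like at the protocol level. Browsers and extensions that support Global Privacy Control attach a Sec-GPC: 1 header to requests; a privacy-by-default service would treat that as a standing opt-out rather than asking again through a banner. A minimal, framework-agnostic sketch (the headers mapping stands in for whatever your web framework exposes):

```python
def honor_gpc(headers: dict[str, str]) -> dict[str, bool]:
    """Turn the Global Privacy Control signal into default processing settings."""
    opted_out = headers.get("Sec-GPC") == "1"
    return {
        "sell_or_share_data": not opted_out,     # off whenever GPC is present
        "targeted_advertising": not opted_out,
        "strictly_necessary_processing": True,   # allowed either way
    }

print(honor_gpc({"Sec-GPC": "1"}))
# {'sell_or_share_data': False, 'targeted_advertising': False,
#  'strictly_necessary_processing': True}
```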

The EU’s cookie consent regulations demonstrate the challenge. They added friction without delivering control. Users developed consent fatigue. Dark patterns made rejection harder than acceptance. What lessons should we learn from that experience? And how many of us enjoy the pop-ups that have turned consent into a meaningless, annoying chore? (Shameless plug: my husband has a take on how to solve this through his company, Rewarded Interest.)

Some argue privacy-by-default is necessary to address the power imbalance between individuals and companies. Others worry it would stifle innovation and make services more expensive. How do we balance these concerns?

How this might address our three problems: If collection is opt-in only, the raw data for inferences is limited. Each data share requires new consent, potentially breaking the cascade. But would companies just make data collection a condition of service, recreating the same problem?

Approach 2: Supply Chain Transparency

The idea here is that nobody should be able to deploy an AI model without documenting its training data. Think of it like ingredient labels for food—you have a right to know what went into the AI making decisions about your life.

This would involve mandatory documentation of training data sources, intended uses and limitations, known biases, and demographic representation. It would require traceability so any datum can be tracked to its source. The enforcement mechanism would be algorithmic disgorgement—if you can’t show clean provenance, you must delete the model.
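
As a thought experiment, here is what an “ingredient label” might look like as a machine-readable record. The field names are illustrative, loosely in the spirit of datasheets and model cards rather than any adopted standard:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingDataSource:
    name: str
    origin: str              # e.g. "licensed", "scraped", "user-contributed", "unknown"
    collection_window: str   # date range or snapshot identifier
    known_gaps_or_biases: list[str] = field(default_factory=list)

@dataclass
class ModelProvenance:
    model_name: str
    intended_uses: list[str]
    known_limitations: list[str]
    sources: list[TrainingDataSource]

    def undocumented_sources(self) -> list[str]:
        """Sources that could not survive a provenance audit."""
        return [s.name for s in self.sources if s.origin == "unknown"]

card = ModelProvenance(
    model_name="example-scoring-model",
    intended_uses=["ad relevance ranking"],
    known_limitations=["not validated for employment or credit decisions"],
    sources=[
        TrainingDataSource("public web crawl", "scraped", "2021-2023",
                           ["over-represents English-language content"]),
        TrainingDataSource("partner purchase data", "unknown", "2022"),
    ],
)
print(card.undocumented_sources())   # ['partner purchase data']
```

Under a disgorgement regime, anything that shows up in that last call is grounds for deleting the model.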

The FTC has already ordered some companies to delete models trained on unlawfully obtained data, though enforcement remains challenging. Technically, removing specific training data from a neural network ranges from difficult to impossible depending on the approach.

Proponents argue this is necessary accountability—if we don’t know what data trained an AI, how can we assess its reliability or bias? Critics worry about trade secrets and the technical feasibility of true data removal from trained models.

How this might address our three problems: Supply chain documentation could help audit what inferences are possible based on training data. It reveals data flows between systems. It could identify when AI generates versus retrieves information. But does documenting the problem solve it?

Approach 3: Data Intermediaries

This approach recognizes that individuals can’t effectively manage privacy alone. Instead, you would join a data trust or cooperative that acts as your agent, or require platforms themselves to be bound by fiduciary duties to protect your interests.

These organizations would negotiate standard contracts with tech companies on behalf of millions of members. They would automatically enforce your preferences across platforms. They would monitor for violations using technical infrastructure. They would bring collective legal actions when rights are violated.

The theory is that collective bargaining creates leverage individuals don’t have. A trust representing 50 million users could tell Facebook: “Our members’ data cannot be used for mental health risk scoring,” with real consequences if ignored.
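
A minimal sketch of that leverage, with everything hypothetical: one machine-readable policy covering all members, checked before any proposed use of their data. A real trust would need governance, auditing, and enforcement far beyond a lookup table, which is exactly what the questions below are about.

```python
BLOCKED_USES = {
    "mental_health_risk_scoring",
    "emotion_based_ad_targeting",
    "resale_to_data_brokers",
}

MEMBER_COUNT = 50_000_000   # the scale is what makes the refusal matter

def trust_decision(member_id: str, requested_use: str) -> str:
    """The trust answers on behalf of every member, not one user at a time."""
    if requested_use in BLOCKED_USES:
        return f"DENY: '{requested_use}' is barred for all {MEMBER_COUNT:,} members"
    return f"ALLOW with audit logging for {member_id}"

print(trust_decision("member-0042", "mental_health_risk_scoring"))
# DENY: 'mental_health_risk_scoring' is barred for all 50,000,000 members
```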

Questions remain about governance. Who controls the trust? How are decisions made? Who bears liability if the trust makes poor choices on your behalf? Would this create new power imbalances?

How this might address our three problems: Trusts could negotiate limits on what can be inferred. They could control which agents access member data. They could audit and challenge false synthetic profiles. But would companies refuse to serve trust members, fragmenting the market?


Additional Legislative Ideas

Professor Ryan Calo’s Senate testimony also offers three specific statutory approaches:

  • First, mandate true data minimization with hard limits, not just principles. Therapy apps can’t collect location data. Period. No exceptions for “better service.”

  • Second, treat inferences as personal data. If AI infers you’re pregnant, that inference is yours with the same rights as if you’d announced it. Companies must track, disclose, and allow deletion of inferences (a sketch of what that record-keeping might look like follows this list).

  • Third, prohibit data misuse beyond mere deception. Using someone’s loneliness to sell them things becomes illegal, not just a questionable business practice.
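
Here is a minimal sketch of what the second proposal’s “track, disclose, and allow deletion” duties might require in practice. The schema is hypothetical, and it deliberately dodges the hard part: deleting an inference that lives in model weights rather than in a database row.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InferenceRecord:
    subject_id: str
    inference: str               # e.g. "likely pregnant"
    derived_from: list[str]      # the observed signals behind it
    confidence: float
    created_at: datetime
    deleted: bool = False

registry: list[InferenceRecord] = []

def record_inference(subject_id: str, inference: str,
                     signals: list[str], confidence: float) -> InferenceRecord:
    rec = InferenceRecord(subject_id, inference, signals, confidence,
                          datetime.now(timezone.utc))
    registry.append(rec)
    return rec

def subject_access_request(subject_id: str) -> list[InferenceRecord]:
    """Disclosure covers what was inferred, not just what was collected."""
    return [r for r in registry if r.subject_id == subject_id and not r.deleted]

def delete_inferences(subject_id: str) -> int:
    """Marks the records deleted; scrubbing the model itself is another matter."""
    hits = [r for r in registry if r.subject_id == subject_id and not r.deleted]
    for r in hits:
        r.deleted = True
    return len(hits)
```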

Each approach raises questions. How do we define “necessary” data collection? Can we really give people rights over inferences that exist only in model weights? Where’s the line between persuasion and exploitation?

The Political Reality

The American Privacy Rights Act stalled in committee in June 2024 after controversial revisions removed civil rights protections. The June 27 committee markup was canceled amid signals from Republicans that they would kill the bill. As of 2025, a number of states have comprehensive privacy laws, but there is no federal standard.

The result? We’re getting 50 different state approaches instead of one comprehensive solution. Some view this as productive experimentation. Others see it as chaos that helps no one except lawyers navigating the patchwork.

If you could only implement ONE of these approaches, which would you choose? Why? What would the unintended consequences be? What problems would it fail to solve?

IV. The Bottom Line

We’ve explored three fundamental privacy problems that AI creates:

  • Inference—AI knows things you never shared, extracting intimate details from ambient data.

  • Agent Cascade—AI systems share your data through chains you can’t see or control.

  • Synthetic Data—AI creates “facts” about you that become permanent regardless of accuracy.

We’ve seen how existing legal frameworks struggle with each problem. The GDPR has trouble regulating inferences that aren’t technically “your data.” U.S. state laws create a patchwork that can’t stop cascades crossing borders. The FTC’s authority predates the technology by a century.

We’ve examined three different approaches to solutions, each with tradeoffs. Privacy by default shifts the burden but raises questions about innovation. Supply chain transparency creates accountability but faces technical challenges. Data intermediaries provide leverage but create new governance questions.

We’re unlikely to get comprehensive federal privacy legislation soon. The American Privacy Rights Act’s failure in 2024 demonstrates the political challenges. So we’re left with state-by-state experimentation and legal frameworks designed for a different technological era.

But understanding the problem is the first step toward addressing it.

This week, watch your digital interactions differently. See the inferences being made. Notice the agent cascades. Think about the synthetic data being generated about you that you’ll never see.

And ask yourself, what kind of privacy protections do we actually need for an AI-powered world? Not what sounds good in theory, but what could work in practice given technical realities, economic incentives, and political constraints?

Privacy isn’t about having something to hide. It’s about having the power to shape your own identity before algorithms do it for you.

Class dismissed.

Your Homework:

  1. Share what you learned today — and talk it over with a friend or family member. Sharing is caring.


  2. Think about what can be inferred about you. And whether you want those inferences made about you.

The entire class lecture is above. But for those of you who found today’s lecture valuable and want to buy me a cup of coffee (THANK YOU!), or who want to go deeper into the class, the class readings, video assignments, and virtual chat-based office-hours details are below.
