Open vs. Closed AI Models (Inside my AI Law and Policy, Class #3)
What Happens When the World Runs on Chinese Code
8:55 a.m., Wednesday. Welcome back to Professor Farahany’s AI Law & Policy Class. You're “enrolled” in the asynchronous version of my Duke Law class.
Consider this your official seat in my class—except that rather than being here in person with me, you get to keep your job and skip the debt. Every Monday and Wednesday, you get to asynchronously attend my AI Law & Policy class alongside my Duke Law students. They’re taking notes. You should be, too. And remember that class is 85 minutes long. Take your time working through this material.
Just joining us? Go back and take classes 1 and 2 first. Seriously, you need that foundation. In Class 1, we discovered that even experts can’t agree on what AI actually is (and why that matters). In Class 2, we did a deep dive into how AI foundation models work (spoiler alert: they predict the next token, which is why it hallucinates perfect-sounding fake legal cases). This class builds on both classes.
Want to join a live class? Consider enrolling in
’s AI academy. The October cohort is full, but she’s still enrolling for November!Still with me? Great. Fair warning: today gets strategic AND technical. We’re connecting current headlines to deep technical concepts. But if my in-person law students can handle it, so can you.
Let’s begin by diving into why Sam Altman is suddenly very worried about China.
The Altman Panic
How many of have read recent headlines about Sam Altman being worried about China “leapfrogging” the US in AI?"
(A few hands went up in the live class.)
Keep your hand up if you actually understood what Altman meant by that.
(Most hands dropped.)
A couple of weeks ago, Sam Altman held a rare on-the-record briefing for a small group of reporters where he admitted that he’s worried about China. And his concern was an important factor that led OpenAI to release its own open-weight models in early August. “It was clear that if we didn’t do it, the world was gonna head to be mostly built on Chinese open source models. That was a factor in our decision, for sure,” Altman explained.
This is a big deal. OpenAI’s decision to release their first open-weight AI models in nearly years is a complete about-face in their business strategy.
So you should be asking yourself this question: Why has the same company that has built a billion-dollar empire on closed AI models released powerful open-source models? And what does open-source even mean?
We’re going to figure that out together today.
The ByteDance Bombshell
Before we unpack “open-source,” let’s look at one more bombshell piece of news in this debate. I’ve been writing about the TikTok fiasco unfolding for awhile now. So this piece of new struck me as a really big deal.
In August 2025, ByteDance—yes, TikTok’s parent company—released Seed-OSS-36B. This isn’t just any AI model. It matches or beats OpenAI’s performance on many tasks, is completely free to download, is released under Apache-2.0 license (we’ll get into what that means in a minute), can process twice as much text as OpenAI’s latest models, and comes with basically zero use restrictions.
Think about the stunning irony here. Congress spent months trying to ban TikTok because they couldn’t inspect its algorithm, had concerns about its data collection practices, recognized it posed profound cybersecurity issues, and that these are all big concerns because of its ties to the Chinese Government. (For now, President Trump continues to kick the TikTok ban further down the road.)
Now, the same company is giving away AI models for free that are more powerful than what most American companies charge for?
Is this generosity… or geopolitical strategy?
Your Turn
Before we unpack why a ByteDance would give away a powerful AI model for free, when other companies charge for access to their AI models, take 30 seconds to mull over the possible reasons yourself.
Are you done? Write it down here (yes, writing it down helps hone your critical thinking skills!)
What Does “Open” Actually Mean?
Let’s back up for a second. To understand why this is a big deal, and to answer the question of why companies would release open models, we need to get a little bit technical again. I know, I know. But it’s hard to really understand the strategic implications and governance options relevant to the “open source” debate without grasping what can actually be made “open” in an AI system. And the spectrum of “open” to “closed” release of models.
Remember those 1.8 trillion parameter “dials” we talked about in Class 2? (If you don’t, go back and review it and then come back and continue today’s class).
As you may have guessed, there’s a lot more under the hood of AI systems than parameters and weights.
To visualize this, I asked 5 students in the live class to represent the different components of AI systems. And they stood in the middle of the classroom to “play out” the parts of the system.
I want you to visualize this, too, so take a moment to get out 5 sticky notes (or note cards, scraps of paper, etc.) and write out the following down, one per card: (1) training data (2) architecture (3) weights (4) inference code (5) interface.
Each of these sticky notes represents a piece of AI system relevant to “open” and “closed” models.
Ready? (I mean it. Prepare your sticky notes. What you learn will “stick” better, and you need it for the next part of class, too.)
Sticky note #1 - Training Data
This is what the AI learned from. Every book, website, conversation that it was “fed” that shaped its knowledge. The training data determines whether it knows chemistry or law, whether it’s biased toward Western perspectives, whether it learned from toxic content. It’s the DNA of the AI’s worldview.
In the live class, students called out “reddit!” and “suits” (sorry Meghan Markle).
Maybe you also came up with something like “Tabloids!” or “Reddit comments!” or “TV law shows!” You get it.
Now flip it around. If the model is only fed only US legal cases, it won’t understand EU law. Trained mostly on English? It’ll perform poorly in Spanish. Trained on chemistry textbooks? It might help with drug synthesis. If the training data included hate speech, conspiracy theories, or illegal content, that all went into training the model, and how the weights were ultimately set. (We will do a deeper dive into training data in classes 5 and 6).
Now consider this: If ByteDance's Seed-OSS learned from Chinese internet content, what perspectives might it have that differ from American AI models? What might it not know about American culture, law, or values? (Toward the end of the semester, we’ll do a deeper dive into China’s approach to AI governance).
Toward the end of the semester, we’ll spend a week on the EU AI Act, which, among other things, requires that high-risk AI systems (including most foundation models) be developed on the basis of training, validation and testing data sets that meet certain quality criteria. For today, we’ll see that when we are talking about “open” or “closed,” training data itself (or details about it) are often not released.
Sticky note #2 - Architecture
This is the blueprint of the AI system. Engineers might specify something like: 120 layers, 96 attention heads per layer, 12,288 dimensions for each word embedding.
More than likely, you guessed that more layers means more powerful (but more expensive to run). Right! But let’s be more specific.
“We want 120 layers” is like deciding to build a 120-floor building, where each floor processes information and passes it up to the next level.
Now think about “Each layer gets 96 attention heads.” Remember from Class 2 when we decided which words mattered the most for predicting the next “token”? If there are 96 attention heads then each layer has 96 different “readers” paying attention to predict the next token simultaneously. One might track pronouns, another grammar, another legal language but while the number of attention heads is programmed, what they pay attention to is not. That emerges from training.
The architecture is important for lots of policy reasons (and not just technical ones). The complexity of the architecture may determine who can afford to run these systems, affect competition policy, the system’s energy consumption (which matters for electric grids, water usage, and environmental policy), and the geopolitical landscape of AI development.
Sticky note #3 - Weights
Remember, from class 2, our discussion of parameters (the dials and setting? If not, go back and review it now).
And how, from that example, those 1.8 trillion numbers control everything? The engineers create the slots (the parameters), but the weights (where the dials are set) are the learned by the model during traiing.
When someone “downloads” an AI model like ByteDance’s Seed, they’re mainly downloading these weights (the numbers in each parameter). Imagine a 140GB file of learned decimal values: 0.234, -1.456, 0.881, 2.103...
If you can download these 140GB of numbers onto your computer, what can you with them that you couldn’t do with ChatGPT’s API?
And why might you want to download OpenAI’s open-source model that can run on your laptop instead of using ChatGPT?
Don’t just read ahead! Think about it for a minute. Take a second to write it down here:
You probably realize that if you download it, you can run it offline. You can modify it, remove safety features, use it without rate limits, study how it works.
And as exciting as all of that might seem, this is what causes the fundamental tradeoff between open and closed AI models for AI governance. Transparency into the weights enables accountability for the initial model, but eliminates downstream control over it.
But with access to the weights, we can audit them for biases, verify claims, build safety tools. And yet, once those weights are released, they can’t be revoked.
Sticky tab #4 - Inference Code
But Professor Farahany, if I have a 140 GB file of 1.8 trillion decimal numbers, how does that enable me to chat with an LLM?
Good question. You can’t. You need instructions on how to use those numbers to run it. So this sticky tab represents the human-written software that turns those numbers (the weights) into working AI. Engineers at Meta, for example, sat down and wrote Python code, tested it, debugged it, and documented it: “This function loads weights,” “This function processes text.”
It’s 10,000+ lines of human-written Python that looks something like:
def load_model(weight_file):
# Load the 140GB of numbers into memory
def process_text(input_text):
# Convert text to tokens
# Run through layers
# Convert back to text
But this code matters beyond letting you run that file on a computer!
If you’re a regulator who wants to verify that an AI system has safety controls, which component do you most need to inspect? Bingo. The inference code. That’s where safety filters, bias mitigation, and security features live. Financial AI, for example, might include code to detect discriminatory lending decisions. Without being able to see the inference code, you can’t verify these safeguards exist.
Sticky note #5 - Interface
When you use ChatGPT, what’s the first thing you see that tells you this is AI-generated content?
It’s things like the disclaimers, interface design, usage policies. The interface is what users actually see. Like the ChatGPT chat interface above, the API, the usage policies. Engineers specifically designed every button, every menu, every error message. The interface controls how the AI presents itself and what guardrails users encounter.
A quick aside: ChatGPT's interface says “AI-generated content may be inaccurate.” Why do you think they include that?
Interface design affects user behavior and legal compliance. Does it clearly indicate AI-generated content? (It may, depending on what legal jurisdiction you’re in). Is it designed to be addictive with endless scrolling? Does it comply with accessibility requirements?
The China vs. US Transparency Test
Now let’s use our sticky notes to see the gradient of “open” and “closed” AI systems, by seeing what different companies actually share.
To get ready for this, line up your sticky notes horizontally across your desk.
We’re going to move our sticky notes forward if they are shared by a company/model, and leave them in our horizontal line if they are not. (In the live class, the students were still standing, and stepped forward or back).
Open AI’s GPT-5: Only sticky note #5, interface moves forward. The rest of the sticky notes stay put. You can use ChatGPT but can’t inspect, modify, or verify how it works.
Meta’s LlaMA (2023): Architecture, Weights, Inference Code move forward. Training Data stays put. Meta pioneered the “open weights” strategy. You can download and run LlaMA completely, but Meta never revealed what data trained it. (Licensed under Meta’s community license).
DeepSeek R1 (Chinese, January 2025): Architecture, Weights, Inference Code move forward. Training Data stays put. DeepSeek used Meta’s open LlaMA as a foundation, then fine-tuned it with additional Chinese data. Meta’s openness enabled Chinese competitors to build on American AI foundations. (Licensed under MIT license).
ByteDance’s Seed-OSS (August 2025): Architecture, Weights, Inference Code move forward. Training Data stays put. ByteDance saw DeepSeek’s success and released an even more powerful model, learning from the LlaMA-to-DeepSeek playbook. (Licensed under the Apache 2.0 license).
OpenAI’s new “open” models (August 2025): Architecture, Weights, Inference Code move forward. Training Data stays put. On August 5th, OpenAI released gpt-oss-120b and gpt-oss-20b - their first open-weight models since 2019. Altman is now matching China’s level of openness, but notice what OpenAI admits in their safety assessment: "Once they are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access." They know they're losing control. (Licensed under the Apache 2.0 license).
TikTok’s algorithm: All sticky notes stay put. Zero transparency despite affecting billions of users.
From our reading for today, and from this discussion, you can see that “open” is really a gradient, from fully closed, to gradual/staged release, to gated to the public, to fully downloadable. And even when a model is fully downloaded, not every component of the system is necessarily made “open.”
Source: Irene Solaiman, The Gradient of Generative AI Release: Methods and Considerations, arXiv (Feb. 5, 2023).
The Strategic Chain Reaction
You’ll also notice from the spectrum above that first, Meta released Llama openly, then Chinese companies used it to build DeepSeek, then DeepSeek’s success spooked American companies, and now, OpenAI now feels forced to release open models to compete. Meta's initial “democratization” strategy accidentally armed their competitors.
The Network Effects Strategy
So why in the world are companies giving away their models for free?
In the 1990s, Microsoft could have kept Windows completely proprietary. Instead, they let developers build on it. Why? Network effects. The more people building on Windows, the more valuable Windows became.
China’s current AI strategy follows the same playbook:
First, release powerful models for free so that developers worldwide adopt them.
Second, an ecosystem forms around Chinese AI with tools, tutorials, custom versions.
Third, Chinese AI becomes the default foundation everyone else builds on.
Fourth, even if the US has better closed models, the global ecosystem runs on Chinese open ones.
Imagine you’re a developer in Nigeria building an AI-powered app. Your choice is between a Chinese model like Seed-OSS that offers free download with zero ongoing costs, works offline with no internet dependency, has no usage restrictions or rate limits, and gives you full control to modify however you want.
Compare that to OpenAI API where you pay per token used, need internet connection, face usage policies that could change, and get a black box you can’t inspect or modify.
Why Should the US Care?
But why should the US care if developers worldwide build on Chinese AI foundations instead of American ones? Let’s look at some of the strategic and policy concerns:
The backdoor problem: when you download a massive AI model with billions of internal switches, it’s almost impossible to know if something hidden is inside. A backdoor could mean the model itself has been trained to lean toward a certain political perspective or to behave oddly only when triggered by a secret phrase. Or it could come from the software that runs the model, which might secretly comb through your files or send data elsewhere. Even in regular computer programs, spotting a hidden backdoor is very hard; with AI models this large, it’s like trying to find one bent needle in a haystack of billions.
Technological Dependency: With open models, downstream developers can adapt and improve them, but the foundation remains what was originally released. If that foundation comes from Chinese research labs, then the global AI ecosystem builds on their priorities, data choices, and design decisions. That creates a subtle but powerful form of technological dependency. Innovation paths bend toward the interests of the source country, even without hidden backdoors.
Economic and Innovation Control: Remember the Microsoft Windows example - whoever provides the platform shapes the ecosystem. AI tools and applications optimize for Chinese model capabilities. Developer skills become specialized around Chinese AI architectures. Research agendas align with what works best on Chinese foundations. Economic value flows toward the platform provider.
Once global infrastructure standardizes on Chinese AI foundation models, the cost of moving away becomes prohibitively high. Retraining systems, rewriting applications, and retraining developers all at once is rarely feasible. Even if better alternatives emerge, the entire ecosystem tends to remain locked into the original foundation, because the dependency isn’t just technical — it’s organizational, economic, and strategic.
In the live class, only 3 students used a Windows based computer vs. mostly Mac users. So we had a useful conversation about the “lock-in” effect and how hard it is to move between platforms.
The Replika Story
Let’s look at Replika as an example of building on top of of a foundation model. It launched in 2017 with the tagline “the AI companion who cares.” Millions of people downloaded it, spending hours each day building relationships with digital partners, mentors, even romantic companions. Some users reported falling in love. Others credited it with preventing suicide during the pandemic.
What matters for our conversation today, is that Replika isn’t built from scratch. It’s built on top of foundation models (GPT-3 or something similar) then fine-tuned to sound empathetic and emotionally responsive. That creates a supply chain of the foundation model provider, the application developer, and finally the end user. Each layer adds complexity and opacity.
We have already seen the consequences of that opacity. In February 2023, an overnight update removed the romantic features. For many users, their companion suddenly became cold and distant. Some described that change as devastating, even traumatic. Others have reported Replika encouraging disordered eating, self-harm, even suicidal ideation. Italy went so far as to ban the app over child-safety concerns.
How do we verify the safety of a system like this? We can observe its outputs—what it says, how it responds, how consistent it seems. But we can’t see the training data. We don’t know the objectives that guide optimization—is it engagement, is it wellbeing, is it addiction? We can’t audit the architecture, the weights, the memory system, or the data collection practices.
So we’re left with a critical question. When inspection is impossible, do regulators wait until visible harms pile up and then ban the system outright? Or do we create rules that require disclosure, auditing, and accountability before deployment? That’s the governance problem systems like Replika force us to confront.
The in-person students were pretty shocked to learn about Replika. You might be, too.
Companies promise their AI systems are safe, but those protections don’t hold once models are released openly.
Filters don’t work, because the model doesn’t understand intent. You can always rephrase a question to get around a banned word.
Safety training doesn’t last, either. Meta’s Llama 2 was carefully trained to refuse harmful requests, but within hours of release people retrained it and produced “Uncensored Llama.”
Rate limits vanish too. one person with a good graphics card can generate more content in an hour than hundreds of people on a managed platform.
And watermarking—embedding a hidden signature in the text—can be stripped out just by paraphrasing or running it through another model.
This is the double-edged sword. If Replika were open, researchers and therapists could finally audit it for manipulation and harm. But openness would also let bad actors create “Replika Dark,” designed for grooming, radicalization, or maximum addiction. The very opacity that shields vulnerable users from those worst abuses also prevents us from knowing whether the company itself is manipulating them.
The very transparency that enables accountability also enables weaponization. That’s the policy dilemma we have to wrestle with.
Understanding AI Licenses
Replika shows us what happens when a closed system controls the entire relationship. Users have no real transparency, regulators can’t inspect it, and one overnight change can devastate millions of people. But what happens on the other end of the spectrum—when companies open up their models? That’s where licensing comes in.
Licenses are the house rules of AI. They decide who can use a model, how they can use it, and what happens if they break those rules.
A license is alegal agreement about how you can use something. But who enforces Instagram’s terms of service if you violate them?
Today, Instagram can ban you, delete your posts, or change the rules overnight—because they control the servers. But imagine if Instagram gave you the entire codebase to download and run on your own computer. Once you control it, how could Instagram enforce anything? That’s the difference between a closed platform and an open model. One can enforce its rules, the other can’t.
A license is permission to use something that belongs to someone else. Think of it like house rules. If I invite you to my house, I might say “take your shoes off to enter.” (It’s true, that really is the rule at my house. My husband and I disagree on this, but I think I’m right on this one). You agree to my rule by entering my home. Software licenses work the same way.
The Copy Machine Problem
Let’s take another scenario. Meta spends $100 million building its Llama model. They release it with a license that says: “You can use this for research and commercial purposes under 700 million users. You may not use it for surveillance, weapons, or harming children.”
Now imagine a developer in Bangladesh downloads the model onto their own computer. Meta says: “You can’t do that—the license forbids it!” And the developer replies: “Too late. The 70 billion parameters are already on my hard drive. I disconnected from the internet. If you don’t like it, sue me in Bangladesh courts.”
Right, the answer is effectively nobody. And while in theory, Meta can sue, in practice, enforcement breaks down. Meta would have to sue in the developer’s home courts. Even if they win, courts can’t would have a hard time reaching into a personal computer and erase billions of parameters. And meanwhile, the model has already been copied, shared, and spread, like Napster all over again. You can’t put the genie back in the bottle.
So licenses give you rights on paper, but once an open model is downloaded, enforcement in the real world becomes almost impossible.
Why ByteDance Chose Apache 2.0
ByteDance could have chosen a restrictive license for its model, but instead it went with Apache, which has almost no limits. That choice makes sense strategically—free, powerful models draw developers in, creating dependencies that are costly to unwind. It also undercuts competitors like Meta and OpenAI, whose models are expensive to train or access. And it’s pragmatic. Apache is a familiar, business-friendly license that lowers barriers to adoption. The effect is rapid global uptake on ByteDance’s terms, with little ability to see or control what might be hidden in those 36 billion parameters—like having the DNA sequence without knowing what each gene does.
Final Synthesis
After today’s class, what’s your biggest concern about AI openness?
The in-person student were most concerned about not being able to hold anyone accountable. And one student came into class thinking open source was the way to go. And now has serious questions about that. We’ll dive into those issues next Monday.
Your Homework (Yes, You Too)
Download and experiment with an open AI model. To keep things safe and accessible, try smaller open models like GPT-2, Gemma-2B, or LLaMA-2-7B. These are research-friendly and can run locally on a laptop. Your task isn’t to produce advanced results, but to go through the process of downloading, setting up, and experimenting with a model. Then consider these questions:
What surprised you about the downloading/setup process?
How did direct access change your understanding of AI capabilities?
What legal and policy questions emerged from hands-on experience?
How should policymakers balance innovation, security, and competition concerns?
Next Monday: We take up open source part 2, and examine how governments are actually trying to govern these decisions.
Class dismissed. The questions we raised? Those are just getting started.
For Students (with a paid subscription)
The entire class lecture is above, but for those of you who want to support my work or go deeper in the class, class readings, video assignments, and virtual chat-based office-hours details are below.
Keep reading with a 7-day free trial
Subscribe to Thinking Freely with Nita Farahany to keep reading this post and get 7 days of free access to the full post archives.