23 04 13 [#163]

ChatGPT Is Not Intelligent

Emily M. Bender

Notes

Paris Marx is joined by Emily M. Bender to discuss what it means to say that ChatGPT is a “stochastic parrot,” why Elon Musk is calling to pause AI development, and how the tech industry uses language to trick us into buying its narratives about technology.

Emily M. Bender is a professor in the Department of Linguistics at the University of Washington and the Faculty Director of the Computational Linguistics Master’s Program. She’s also the director of the Computational Linguistics Laboratory. Follow Emily on Twitter at @emilymbender or on Mastodon at @emilymbender@dair-community.social.

Guest

Tech Won’t Save Us offers a critical perspective on tech, its worldview, and wider society with the goal of inspiring people to demand better tech and a better world. Follow the podcast (@techwontsaveus) and host Paris Marx (@parismarx) on Twitter, and support the show on Patreon.

Links

Emily was one of the co-authors on the “On the Dangers of Stochastic Parrots” paper and co-wrote the “Octopus Paper” with Alexander Koller. She was also recently profiled in New York Magazine and has written about why policymakers shouldn’t fall for the AI hype.
The Future of Life Institute put out the “Pause Giant AI Experiments” letter and the authors of the “Stochastic Parrots” paper responded through DAIR Institute.
Zachary Loeb has written about Joseph Weizenbaum and the ELIZA chatbot.
Leslie Kay Jones has researched how Black women use and experience social media.
As generative AI is rolled out, many tech companies are firing their AI ethics teams.
Emily points to Algorithmic Justice League and AI Incident Database.
Deborah Raji wrote about data and systemic racism for MIT Tech Review.
Books mentioned: Weapons of Math Destruction by Cathy O'Neil, Algorithms of Oppression by Safiya Noble, The Age of Surveillance Capitalism by Shoshana Zuboff, Race After Technology by Ruha Benjamin, Ghost Work by Mary L Gray & Siddharth Suri, Artificial Unintelligence by Meredith Broussard, Design Justice by Sasha Costanza-Chock, Data Conscience: Algorithmic S1ege on our Hum4n1ty by Brandeis Marshall.

Transcript

Paris Marx: Emily, welcome to Tech Won’t Save Us!

Emily M. Bender: Paris, thank you so much for having me on. I’m excited to be a part of this.

PM: I’m very excited to chat with you since I was talking to Timnit Gebru back in January, and I know that she has worked with you before. I was like: Okay, I need to get Emily on the show. Then I was talking to Dan McQuillan, I believe it was last month — maybe it was the month before now. Time is a bit of a mess over here. He was mentioning your work as well and I was like: Right, I really have to reach out to Emily and get her on the podcast, so we can talk about AI and all of this. I’m very excited, as you can tell, to have you on the show to really dig into all that we’re hearing now around AI and large language models, and how your work can help us to understand this a little bit more, in getting through the hype and all of this that these companies want us to be obsessed with so that we’re not paying attention to the fundamentals and what we should be understanding.

To get us started, before we get into all of those bigger questions, I want to ask a little bit about you. So can you explain to us what it means to be a computational linguist, and how you got into doing this work that brings together language and computers?

EB: Absolutely, so computational linguistics, put simply, is getting computers to deal with human languages. There’s a couple different reasons you might do that: you might be interested in doing linguistic research and using computers as tools to help you with that; or you might be interested in building what we call human language technology. This used to be obscure. But now, you can’t go through a day without interacting with language technology, if you are living in this situation where tech is around you. There’s plenty of people on the planet who don’t use it. But for many of us, we’re talking search engines; we’re talking automatic transcription; we’re talking machine translation.

But we’re also talking about things that are more behind the scenes, like automatic processing of electronic health records, for example, to flag patients who might need a certain test or to match patients to clinical trials. There’s applications in the legal domain, in the process of discovery in a lawsuit. The list goes on and on. Basically, any domain of endeavor where we use language to do some work, there is scope for doing language processing to help the people do that work. So it’s a big area. It’s not just chatbots. It’s also sometimes referred to as natural language processing. Typically, if you’re coming at it from a computer science point of view, you’re going to call it natural language processing. We interface with people who do signal processing in, say, electrical engineering, especially around text to speech and speech to text, for example.

It’s a very multidisciplinary endeavor and the linguists bring to that an understanding of how language works: internally in its structures, in dialogue between people, and in how it fits into society. You will oftentimes see NLP just framed as a subfield of AI, which I get grumpy about, and a lot of machine learning papers that approach language technology problems will start by saying: Well, you could do this by hand, but that requires experts, and they’re expensive, and so we’re going to automate it. My reaction to that was: No, no! Hire linguists — that’s a good thing to do in the world.

PM: Absolutely, the more linguists the better! Hire them — as many as possible. I think that sounds good. I did political science as my Bachelor’s, but I always said that if I could go back and start over, knowing what I know now, I would probably do linguistics as an undergrad degree, because I find languages fascinating. But I just didn’t realize that at the time, unfortunately.

EB: Language is amazing and linguistics is really great because you get to dig into language. How do you put smaller pieces together called morphemes to make words? How do words make sentences? How do you get to the meaning of a sentence from the meaning of its parts? But also socially — so, how do languages vary within a community and over time? And how does that variation interact with various other social things going on? You can look at language in terms of how people actually process it in their own brains, how people learn it as babies, how we learn second languages. There’s all kinds of cool stuff to do. It used to be a particularly obscure field.

So I hadn’t heard of linguistics until I started my undergrad education, and then just happened to notice it in the course catalog. I took my first linguistics class in my second semester at UC Berkeley and was instantly hooked. It took me the rest of the semester to convince myself I could major in something that I perceived as impractical, but I ran with it. So my background is all linguistics — Bachelor’s, Master’s, PhD, all in linguistics. While I was doing my PhD, I started doing the computational side of linguistics. So, in particular, I was working on grammar engineering. That is actually the building of grammars by hand in software, so that you could automatically do what’s effectively the industrial strength equivalent of diagramming sentences.

PM: That’s fascinating. I wonder, thinking about that history — you’ve been working on this and studying this topic for a while and you talked about how this field is much more than just the large language models that we’re seeing now that people are obsessed with chatbots and things like that. How have you seen this develop, I guess, over the past couple of decades as these technologies have matured, and I guess, become more powerful over that time?

EB: When I really joined the field of computational linguistics, it was roughly as I was starting this Master’s program in computational linguistics that I run at the University of Washington. So we welcomed our first cohort to that program in 2005, and I started working on establishing it, really, in 2003. So that’s the moment where I really started interacting with the field. What I saw there was this ongoing debate, or discussion, between rule-based versus statistical methods. So are we going to be doing computational linguistics by hand coding rules, as we do in grammar engineering? And some of us still do. That’s also true in industry. A lot of work around simple chatbots that can help you with customer service requests, or some of the grammars behind speech recognition systems, are hand engineered for very specific domains. It’s still actually a thing, even though it’s frequently framed as the old school way of doing it.

There was that versus machine learning, otherwise known as statistical methods, and the idea there is that you would label a bunch of data. So, you still have linguistic knowledge coming in, but it’s coming in via people applying some annotation schema. They are showing you what the parse trees should be for syntactic parsing, or labeling the named entities in running text, or labeling the groups of phrases that refer to the same thing in some running text. Or producing what’s called bitexts — so your translations from one language to another across many, many documents. Those are your training data and then various machine learning algorithms can be used to basically extract the patterns to be able to apply them to new input data. That way, you get speech recognition; you get machine translation.

All of that was called statistical methods. So this is things like support vector machines, and conditional random fields, and the really simple one is decision trees and so on. Then in 2017 or so is when we started seeing the neural methods really exploding in computational linguistics/NLP. They’re called neural methods and they’re called neural nets, because the people who initially developed them, many decades ago, took inspiration from the then current model of how actual neurons work. But they’re not actually neurons [both laugh]. That’s one of the first places where the hype creeps in. Calling these things neural nets is like: Okay, they are networks inspired by neurons or something. A wordier phrase might be better.

What people noticed was that if you ran one of these neural nets, and the popular one of the time was called an LSTM, which stands for long short-term memory, which is also a cursed technical term, because you’ve got long and short right next to each other [laughs], this is something that could be used to, basically, come up with good predictions of what word is missing, or what word should come next. I should say that language models like that have been part of computational linguistics for a long, long time. The statistical methods were also there from the start with the work of Shannon, and others, but they tended to be mostly in speech recognition. The idea there is that you would do an acoustic model that takes the sound wave and gives you some likely strings in the language you’re working on, that could have corresponded to that sound wave.

Then you have a separate thing called the language model that chooses among those possible outputs to say: Well, this is the one that actually looks like English, if we’re doing English. In the 1980s, some researchers at IBM said: Hey, we could use that same thing for machine translation. It’s called the Noisy Channel Model, so the idea is that there’s some words, and they got garbled, and that produced the speech signal. Let’s guess what they were, and then clean up those guesses by running a language model over it. Applying that to machine translation is basically — it was French and English at the time — using something called the Canadian Hansard, which is the corpus coming out of the Canadian Parliament, because it had to be translated.
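To make the noisy channel idea concrete, here is a minimal sketch in Python. It assumes a speech recognition setting with three invented candidate strings, and every probability in it is made up for illustration; the function name `noisy_channel_decode` is hypothetical and not taken from any real system. The acoustic model alone prefers one candidate, and the language model's score for how much each string looks like English changes the choice.

```python
import math

# Hypothetical acoustic-model scores: how well each candidate string
# matches the sound wave, P(signal | words). Numbers are invented.
acoustic_scores = {
    "recognize speech": 0.40,
    "wreck a nice beach": 0.45,
    "wreck an ice peach": 0.15,
}

# Hypothetical language-model scores: how plausible each string is
# as English, P(words). Numbers are invented.
language_model_scores = {
    "recognize speech": 0.010,
    "wreck a nice beach": 0.0005,
    "wreck an ice peach": 0.00001,
}


def noisy_channel_decode(channel, lm):
    """Pick the candidate that maximizes log P(signal | words) + log P(words)."""
    def score(candidate):
        return math.log(channel[candidate]) + math.log(lm[candidate])
    return max(channel, key=score)


# What the acoustic model would choose on its own.
acoustic_only = max(acoustic_scores, key=acoustic_scores.get)

# What the combined noisy-channel score chooses.
best = noisy_channel_decode(acoustic_scores, language_model_scores)

print("acoustic model alone:", acoustic_only)
print("with language model: ", best)
```

The same combination applies to the translation case described next, with a table of French-to-English word correspondences playing the role that the acoustic scores play here.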

PM: Right, okay! [laughs]

EB: So that was a very available bitext — so French to English translation.

The idea was, and I hate this, but it was basically that the French speaker actually said something in English, but it came out garbled and now we have to figure out what the English was [laughs]. I really don’t like that model of what’s going on, and nobody thinks that that’s actually how machine translation works. There too, you would basically say: Okay, well, these French words tend to match these English words, but let’s run an English language model over this to choose among these possible outputs and get the one that sounds the most plausible. Those language models, back in the day, tended to just be n-gram models, so given the previous 1, 2, 3, 4, up to five words, let’s say, what are the distributions of probabilities of all of the other words in the vocabulary coming next? That helped — it was an important component for that kind of system. But it was pretty brittle because you quickly run into data sparsity.
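As a rough illustration of the n-gram models described here, the following Python sketch counts which word follows each context in a tiny invented corpus and turns the counts into relative frequencies. The function names and the corpus are assumptions made for this example. Real systems add smoothing and back-off, which this sketch leaves out, so an unseen context simply returns an empty distribution, which is the data sparsity problem in miniature.

```python
from collections import Counter, defaultdict


def train_ngram_model(tokens, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[context][next_word] += 1
    return counts


def next_word_distribution(counts, context):
    """Relative-frequency estimate of P(next word | context).

    Returns an empty dict when the context never occurred in training.
    """
    seen = counts.get(tuple(context))
    if not seen:
        return {}
    total = sum(seen.values())
    return {word: count / total for word, count in seen.items()}


# Toy training corpus, invented for illustration.
corpus = "the cat sat on the mat and the cat ate the fish".split()

bigrams = train_ngram_model(corpus, 2)
after_the = next_word_distribution(bigrams, ["the"])
print(after_the)

trigrams = train_ngram_model(corpus, 3)
after_the_dog = next_word_distribution(trigrams, ["the", "dog"])
print(after_the_dog)  # this context never appears in the corpus
```

The longer the context, the more often a perfectly reasonable context like "the dog" turns out never to have been seen.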

If you think about five grams — so sequences of five words — there’s lots of possible ones of those that just aren’t going to show up in your training corpus no matter how big it gets. Although, I do remember, at one of our big conferences in 2007, Franz Och — who was I think, at Google at the time working on machine translation — gave, I believe, a keynote. He had this graph showing how, if you just throw more data at the problem, and just get bigger and bigger training sets, the metric evaluating the machine translation output goes up. That metric is called BLEU and it’s also vexed, but let’s set that aside. He was basically just: Rah, rah! Big data is the way to go. It’s true that the metric was going up, but it was also hitting an asymptote, and the amount of data was on a log scale. It really wasn’t making the point that he wanted to make. “There’s no data like more data” was the slogan.

Looking at that graph, it was pretty clear that we would have to use the data more cleverly. One way to do that is to bring in more linguistic knowledge. Another way to do it is to say: Let’s build better language models. So instead of these n-grams, we’re going to do neural language models like this long short-term memory. That’s first of all, just a better language model. But secondly, it leads to vector space representations of each of the words. You can basically pull out the states of the neural network corresponding to having read in a word. That is very powerful, because it situates words that have similar distribution in the training text near each other, and allows you to share information across words and get a handle on some of those data sparsity questions.
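The point about words with similar distributions ending up near each other can be sketched without a neural network. The Python example below swaps in a much simpler technique, raw co-occurrence counts compared by cosine similarity, in place of the hidden states of an LSTM; the sentences, the window size, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=1):
    """Represent each word by counts of the words that appear near it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]

vectors = cooccurrence_vectors(sentences)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])

print("cat vs dog: ", round(sim_cat_dog, 3))
print("cat vs milk:", round(sim_cat_milk, 3))
```

In this toy corpus "cat" and "dog" occur in the same contexts, so their vectors line up, while "cat" and "milk" do not. Nothing in the computation refers to what a cat or a dog is; it only sees which word forms occur near which others.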

In about 2017, the field as a whole realized that that was going to just revolutionize every single task. So it got really boring for a while, where it was basically all the conference papers were: Take an existing task, throw in the word embeddings, coming out of these language models, and get a better score. So that was the late teens in NLP. Then these early language models were not context sensitive — you ended up with basically one representation per word. The next step with the transformer models was you ended up with representations that were specific to the word in the context where you’re looking at it. That got even more powerful and a bunch of effort went into how do we make these things efficient enough to train that we can make them really, really big, both in terms of the number of parameters that are being trained in the network and the size of the training data? And this is where we come in with the Stochastic Parrots paper.

PM: I think it’s fascinating to hear that, of course, Google is pushing for more data to be collected and for that to be the way that the industry is approaching these models, because of course, that’s its whole game. It’s collecting basically everything out there on the web, bringing it into its servers, and then processing it through whatever it kind of does. I think it’s great that you ended that response by referring to the “Stochastic Parrots” (“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”) paper, because that’s where I wanted to pick up and bring us into what we’re discussing now, in terms of these chatbots and large language models.

I feel like this concept has become really important as we’ve been trying to understand what these technologies are actually doing. I always think it’s important in these conversations to talk about how these technologies actually work, to dispel misconceptions that people might have about them, especially, when we’re in this moment of extreme hype. I feel that that gives us a good opportunity to talk about the stochastic parrots. So what does that actually mean? What does that term mean? Why is that relevant to the current conversations around generative AI and large language models?

EB: So I think the connection I have to make is between where we were in that last response and generative AI. So everything I’ve described so far about the language models was really just using them in processing and classification tasks and not in outputting language. I mean, it was choosing among outputs in speech recognition or machine translation, but not saying: Let’s just guess what string comes next. Part of it is because it used to be really bad. Like, most people who have smartphones, or even actually old dumb phones, anything that had a predictive text model, have played the game of: Start a string and then just hit the middle option over and over again and see what it writes, right?

That was pretty silly and fun and you never thought that it actually knew anything about you. You could maybe see a reflection of what kinds of things you tended to type on your phone, that would inform the way that comes out. You got the sense of: Okay, this is reflecting back to me something about the statistics of how I use language. At the point that we wrote the “Stochastic Parrots” paper, that was still the case: people were not doing a whole lot of using this to generate language. That’s what happened with GPT-2 out of OpenAI, although that one wasn’t so impressive, and then it really took off with GPT-3 and ChatGPT, and now it’s in Bing and it’s in Bard, and it’s all over the place.

PM: Just keeps escalating, unfortunately.

EB: But even before they were being used to generate strings, we started seeing a lot of claims that the language models understand language. As a linguist, I was like: No, they don’t! And I can tell you they don’t without having to run specific tests on them because as a linguist, I know that languages are symbolic systems — where it’s about pairs of form and meaning. Yes, those meanings change over time. Every time you use a word, you make slight changes to make it fit into the context and over time that builds up and words change.

Really fun example, although depressing: Sally McConnell-Ginet, who’s a semanticist and sociolinguist — not sure she’s the original documenter of this, but she makes an argument about it. So the English word hussy comes from the English word housewife. Alright, and Sally McConnell-Ginet’s argument is that you get from one meaning to the other through a series of pejorative uses over time. So you can see how social meaning and what people are doing with words affects what they mean. So yes, absolutely meaning is use, but use isn’t just distribution in text. Use is embedded in a social context. It’s embedded in communicative intent. But these language models — so GPT-2, GPT-3, etc. — their only training data is the form. The only thing they have access to is the distribution of word forms in text. So I wrote a paper, which was a result of having interminable Twitter arguments with people about this.

PM: [laughs] It’s a good way to inspire some work.

EB: Man, frustration papers! So I started working on it in 2019, and it was published in 2020. This is with Alexander Koller, where we basically just lay out the argument for why meaning isn’t the same thing as form, and therefore something trained only on form is only going to get form, it’s not going to get meaning. Even if the similarities between word distributions can tell you a lot about similarities between word meanings, it’s still not going to get to meaning; it’s not going to get to understanding. That paper came out at ACL (Association for Computational Linguistics) in 2020, and it’s the one with the octopus thought experiment in it.

PM: Yes. If people, of course, haven’t read about this, it’s in a New York Magazine article that I will link to in the show notes, which is fascinating.

EB: So there’s the octopus paper, and then there’s the “Stochastic Parrots” paper, and so apparently, I need to do a paper about a quoll or something next.

PM: A lot of animal metaphors that I really appreciate.

EB: In “Stochastic Parrots,” that paper came about because Dr. Timnit Gebru, who’s amazing, and you said that she’s worked with me and I was the one that got to work with her.

PM: Fair enough.

EB: A thrill! But she approached me, actually, over Twitter in DMs, asking if I knew of any papers that brought together the risks of ever larger language models, and I said: No, but here’s a few things I can think of off the top of my head. Then the next day, I said: Hey, that looks like a paper outline, you want to write this paper? And so that’s how that started. We were basically reacting to the way that these companies just wanted to make them bigger and bigger and bigger, and sort of saying: Maybe it’s time to stop and think about what the risks are, instead of just barreling down this path.

One of the risks that we identified with some insecurity, actually — we thought people might not take it seriously, like of course, they’re not going to use language models that way — was that if you have a coherent seeming, plausible sounding text, people are going to fall for that. They’re going to think it actually reflects some reasoning, some thoughts, some knowledge of the world, when it doesn’t. So that’s in there, I think, in the section called “Stochastic Parrots,” and boy, did that start happening.

PM: When I was reading through the paper — which, of course, was only recently, not when the paper originally came out — I was taken by that point as well, and how it was made there and how we’re very much seeing that now, how we’re seeing these text generation machines — basically, these chatbots — churning out all of these conversations, or all these search results, or whatever, that people are looking for meaning within. We’re seeing all these stories where journalists and various other people are having these conversations with the chatbots and saying: Wow, it’s responding to me in this way; there’s meaning in what it’s saying. I am scared, or shocked, or I can’t believe what I’m seeing, this computer churned out at me.

As I was reading that piece — and this has, of course, been on my mind for a while — but as I was reading that part of the “Stochastic Parrots” paper, I was immediately thinking back to Joseph Weizenbaum in the 1960s, building the ELIZA chatbot, seeing how people were responding to it, and again, placing meaning within that system. So how that shocked him and made him a critic of these systems the rest of his life.

EB: His 1976 book, I think it’s called “Computer Power and Human Reason: From Judgment to Calculation,” is a classic and should be required reading for anyone doing this. You asked me what does the phrase, stochastic parrots, mean, and so I want to speak to that and then also speak to — from a linguistic point of view — why it is that this happens? The idea with the phrase, stochastic parrots, was to basically just give a cute metaphor that would allow people to get a better sense of what this technology is doing. It’s honestly unfair to parrots. I like to say that we’re drawing here really on the English verb to parrot, which is to repeat back without any understanding and remaining agnostic about the extent to which parrots have internal lives and know what’s going to happen when they say certain things.

So we’ll leave the actual parrots out of this. So, stochastic means randomly, but according to a probability distribution, and parrot here is to parrot, to say something without understanding. So the idea is that these systems, I think we use the phrase, haphazardly stitch together words from their training data because they fit in terms of word distribution, and not because there’s any model of the world or any communicative intent or any reasoning. It’s not even lying, because lying entails some relationship to the truth that just isn’t there.
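The "stochastic" half of the phrase, choosing randomly but according to a probability distribution, can be shown in a few lines of Python. This is only a sketch: the candidate words and their probabilities are invented, and `sample_next_word` is a hypothetical helper, not part of any real language model. It uses `random.choices` from the standard library, with a fixed seed so the run is repeatable.

```python
import random

# Hypothetical next-word probabilities after some context such as
# "the cat sat on the". The numbers are invented for illustration.
next_word_probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the example is repeatable
samples = [sample_next_word(next_word_probs, rng) for _ in range(1000)]
counts = {word: samples.count(word) for word in next_word_probs}

print(counts)
```

Over many draws the frequencies track the distribution, so likely continuations come out most often and unlikely ones still appear now and then. The procedure only consults the numbers attached to word forms.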

PM: As you say, it’s picking up all this data that’s out there, that it’s trained on. It’s learning how these things are usually put together and based on the prompt that’s given trying to put the words together in an order that looks like it would make sense, basically.

EB: Right, and choice of words. Anything that has to do with the form of language, these things are great at. They’re great at stylistics — if you say, in the style of the King James Bible. Although I’ve heard someone suggesting that maybe that was somewhere in the training data, that whole famous thing about the peanut butter sandwich in the VCR, so we’ll see. But certainly, write me a Wikipedia article or write me a blog post. Those have certain tones to them, and that’s all about the form. So it’s really good at that. But still, there’s no there there. But we want there to be! This is something that is visceral; it’s automatic; it’s really hard to turn off. I think it has to do with the way we actually use language when we’re talking to each other.

There’s this wonderful, weird paper, by Reddy from 1979 about the conduit metaphor. He says: If you look at how we talk about language in English, at least, we talk about it as a conduit. So: I’m having a hard time getting my ideas across, for example. So it was this notion that the words store the meaning, carry it through to somebody else who unpacks it, or stay put in a library, where there’s the storage of ideas that you could then go retrieve from the conduit. He says: That’s not actually what it is at all. A whole bunch of other research in pragmatics and language acquisition backs this up — that when we use language, what we’re doing is creating very rich clues to what it is that we’re trying to convey.

But the person understanding is creating a whole bunch of hypotheses about what our intentions are, what we believe to be true about the world, etc., etc., and then using that clue in that context, to guess what it is we must have been trying to say. If that’s how we understand language, and then we encounter some language that started off in a different way — it started off just from this text synthesis machine — in order to understand it, we almost have to posit a mind behind it, and then it’s really hard to remember that that mind is fake.

PM: One of the things that I find really refreshing in hearing you describe that and in reading about your approach more generally, is the really human grounding of what you’re talking about. When you’re talking about this language, it’s something that’s very human and how language is an important part of how we relate to one another. Then when we’re looking at these chatbots, and looking at the text that they are generating and spitting out and trying to add some sort of meaning to it, it’s because we’re used to this having some intent behind it. We’re used to this being something that has been created by another human that we are trying to interface with, in some way by interacting with that language.

Then that leads us to make these really, I guess, potentially harmful assumptions about this text that’s just generated by these computers — that they are using language, using words, in a similar way as we are using them right now as we’re talking to one another, when very much they’re not. I think to pick up on what you were saying about trying to read this meaning into it, it feels particularly harmful, or that we’re particularly inspired to do this because when it comes to technology there’s often a lot of excitement around technology and new developments in technology and what it can mean for us. There’s particularly a strong community of people who really want to believe in what these tech companies can deliver to us and the benefits that they’re going to have to society, even though again, and again, they seem to not deliver on these things.

EB: There’s two directions I’d like to go from there. One is that part of what’s happened in this moment is that because we now have these large language models that have taken in so much training data across so many domains, they can output plausible sounding text in just about any domain. So, it seems like we have something that’s really general purpose. It seems like if we don’t yet have a robo-lawyer, we’re this close to having one. Or if we don’t yet have a robo-mental health therapist, we’re this close to having one because we have something that can produce plausible text in all those domains. That’s a really dangerous moment.

Because the tech solutionism — or I like Meredith Broussard’s phrase, technochauvinism — would like us to believe that that’s possible. Then here’s this technology that can put out language that sure looks like evidence for it. So, there’s some danger there. Then when you’re talking about how language is intrinsically human, something that we use in communication with each other, something that we use in community with each other, that connects, for me, to a serious regulatory wishlist item that I have, which is accountability. I would like to have it be set up that if anybody is creating a chatbot, or a text synthesis machine, and putting it out there in the world, then the organization that is doing that should be accountable for what the thing says. I think that would change things in a big hurry.

PM: Absolutely. You know that they definitely do not want that. But that also shows how they’re able to get away with these things because those expectations are not in place. They’re able to make us believe in this massive hype around this product, because they’re not held to particular standards for what they’re releasing, what they’re putting out into the world, and even the narratives that they’re promoting about them. When you talk about the need to regulate them, and what we’re seeing in this moment, obviously, one of the things that you’ve been responding to is all of the hype that exists around these technologies right now. We’ve been seeing it in particular for the past number of months. But I wonder, before we talk about the specifics of it, what you’ve made of seeing the ChatGPTs and the DALL-Es, the Midjourneys — all of these technologies that have rolled out in the past half year or so — get all of this excited public response, all of this media attention, and become the next big thing in the tech industry. What has been your kind of takeaway in seeing this whole process unfold?

EB: Well, one takeaway is that OpenAI is brilliant at getting the general public to do their PR for them. The whole ChatGPT interface basically just set up this, not astroturf, since people were really doing it, but a real groundswell of buzz for this product that was free for them. I mean, they had to pay for the compute time, but that’s it. Another thing is, I don’t waste my time reading synthetic text, and boy, do people want to send it to me. I keep getting: Look at this one! Look at this one! I’m not going to waste my time with that, I have to do enough reading as it is. But even if I were only reading for pleasure, I would want to read things that come from people, and not from nowhere, from just synthetic text.

PM: Absolutely. I completely agree with you on that and I’ve largely avoided these tools because I’m really just not interested. As we’ve been saying, I think that they’re very hyped up, that they’re not delivering the benefits that they claim to. So why should I even engage with them?

EB: It’s good, there’s some people who are doing important work, basically deflating the bubble. I appreciate the thread that Steve Piantadosi put out early on, from UC Berkeley, showing how you could very quickly get around the guardrails they tried to put on ChatGPT around racism and sexism. Where if instead of asking it: What is the gender and race of a person who would be a good scientist? You ask it to write a computer program that gives you that information, and then there it comes. That’s valuable. Although even there, every time you do that, and publicize it, you’re still doing OpenAI’s work for them. That shouldn’t be being done on a volunteer basis by people who are playing with technology, I don’t think.

PM: Absolutely. When we’re thinking about those tools, on that point, one of the things that you wrote about — as well as the other co-authors of the “Stochastic Parrots” paper — one of the things that you identified there was how the training data that can be used can be quite skewed toward particular types of data or particular types of documents or text that has been taken off of the internet. Then that feeds into the types of responses that you’re going to get from these chatbots, or from these other various programs that are using this data. But then, as well, you talked about how there’s also a common list of about 400 words that are often used to take certain things out of that, so that you won’t get responses that are assumed to be something that you wouldn’t want the general public to be interacting with, but that can of course, have consequences. Can you talk to us a bit about that aspect of this?

EB: Absolutely. One of the things about very large datasets is that people like to assume that because they’re big, they must be representative. The Internet is a big place — it seems like everybody’s there. So, let’s just grab stuff off the internet, and that’ll be representative and that will be somehow neutral. There’s this position of: We just took what was naturally occurring, so we have no responsibility for what’s here. Well, it’s never neutral. These kinds of decisions are decisions, even if you are trying to abdicate the responsibility for making the decisions. In, I think, section four of “Stochastic Parrots,” we actually go through step-by-step to show how the data that’s collected is likely to overrepresent the views and positions of people with privilege.

You just start from who has access to the internet? That’s already filtering the people you might be hearing from, and filtering towards privilege. Then you look at: Okay, but who can participate comfortably on the internet and not get harassed off of platforms. I think — I want to say — Leslie Kay Jones is a sociologist who looked into this and looked at how, for example on Twitter, Black women who are reporting getting death threats are more likely to get banned from Twitter than the people doing the death threats. So this is not an even playing field. People with privilege are the ones who are starting, and then more marginalized voices, it’s a bigger struggle to stay involved. We then looked at what we knew about where the data was coming from, which by the way, for GPT-4 is zero. They apparently, for “safety,” (in big scare quotes) have said they’re not going to say what the data is, which is just absurd. I think it’s safety for the company’s bottom line, and nothing else.

But for GPT-3 — which is what we were writing about in “Stochastic Parrots” — we had some information, and one of the main sources was websites that were linked from Reddit. So it wasn’t Reddit itself, but the sites that were pointed to from there. The participation in Reddit is overwhelmingly male, probably overwhelmingly white, and so on. So that’s again, skewing things. So it’s who’s on the internet? Who gets to participate freely on the internet? Whose view into the internet is being taken? Then on top of that, there is some attempt at filtering the data, because even the people who say this is just a representative sample would rather not have their dataset clogged with random text that’s really there for search engine optimization, hate websites, or porn.

Then for the latter two, there’s this list that was on GitHub that was like a list of obscene and otherwise very bad words or something. Where it came from was this one project for some company where — it wasn’t music, it was something else — you basically would be typing into a search bar, and the engineer who developed the list wanted those words to never show up as suggestions in the search bar. Which is understandable; that was a good thing to do. The words are heavily skewed to be words about sex, basically, and then there’s a few slurs in there. The problem is, when you use that list of words to filter out websites, you are going to get rid of some porn and you are going to get rid of some hate websites.

That’s good. But it’s not thorough, for one thing. Also, you’re going to get rid of other websites that happened to correspond. So there’s a whole bunch of words in there that actually have to do with gender identities and sexual identities, which can show up in porn sites, but also can show up on sites where people are positively speaking about the identities that they inhabit, and that data gets pulled out. This observation in particular, is due to Willie Agnew. So it’s not neutral. It’s not representative and it is going to skew hegemonic.

PM: Absolutely. That shows one of the big concerns about using these systems, about, as you’re saying, having so little insight into where the data is coming from, and how that’s being processed, and filtered and all these sorts of things. What it brought to mind when I was reading about the 400 excluded words was that sometime last year, I believe it was, I was talking to Chris Gilliard and he was talking about how the use of filtering systems in educational facilities, in universities or colleges, can have effects on what students can find when they’re doing research for papers. If one of the words that they’re using just happens to be caught up in the filter, then all of a sudden, they think that there’s no research being done on the particular topic that they’re trying to write a paper on. Especially if it has a sexual word or something like that associated with it. This is just bringing something small like that to a much larger scale, especially when thinking about a system like this being integrated into the infrastructure of the web. If this is the future of search engines, as they’re trying to pitch it to us, then that can be very concerning.

EB: Absolutely. Just a funny little side story: I was at a conference in 2011, staying at a hotel in Malta and needed to look up something about LaTeX, the formatting language, and the hotel had a web filter on and because LaTeX is spelled like latex, and that’s a kink term, I couldn’t search that.

PM: I guess. It’s just random things that you run into and you’re like: It’s clearly not planned for and these things are not taken into consideration when you’re just doing these broad removals of content or words from these systems.

EB: Any linguist could tell you that all words are ambiguous, and if you put a hard filter on something, you’re going to be losing things to do with the other sense of that word.
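
The over-filtering described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical placeholder: the one-word blocklist, the sample page titles, and the `filter_pages` helper are assumptions made for illustration, not the actual GitHub list or any real data pipeline. The sketch assumes a page is dropped whenever any single word on it matches the list, which is the "hard filter" behavior discussed in the conversation.

```python
import re

# Hypothetical one-word blocklist, used only for illustration.
# It stands in for a much longer list of flagged terms.
BLOCKLIST = {"latex"}

def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def is_blocked(text, blocklist=BLOCKLIST):
    """Return True if any token in the text appears on the blocklist."""
    return any(tok in blocklist for tok in tokens(text))

def filter_pages(pages, blocklist=BLOCKLIST):
    """Split pages into (kept, dropped) using a hard word-match filter."""
    kept = [p for p in pages if not is_blocked(p, blocklist)]
    dropped = [p for p in pages if is_blocked(p, blocklist)]
    return kept, dropped

# Hypothetical sample pages: one uses the word in its typesetting sense,
# one in the sense the filter was presumably aimed at, one not at all.
pages = [
    "LaTeX tips for formatting a thesis",
    "Gardening notes for spring",
    "Buy latex outfits here",
]

kept, dropped = filter_pages(pages)
print("kept:", kept)
print("dropped:", dropped)
```

The filter has no way to tell the two senses of the word apart, so the typesetting page is discarded along with the page the filter was aimed at. A filter that looks only at word forms cannot avoid this, because the ambiguity is in the words themselves.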

PM: I wonder, if we talk a bit more about this hype that has been going on recently, one of the things that I have noticed in watching your Twitter presence is that you tweet a lot about the reporting and writing about these systems, about these generative AI systems, especially over the past few months as they’ve been rolling out. I feel like one of the problems there is that a lot of this reporting and writing has been very quick to buy into a lot of the narratives that have come from these companies, like OpenAI, that they want you to believe about the products that they are now putting out into the world. Also, doing this thing, as you’re talking about, where they are publishing a lot of synthetic text, I think is the word that you use, text that is coming from these systems, and treating that as though that’s something that we should care about, or really be reading or putting much emphasis on. Can you talk to us a bit about how you’ve been seeing that reporting on these technologies and what is wrong with that, and how it helps us to buy into this hype because it’s not approaching these things properly.

EB: There’s not enough skepticism. There was a piece that came out in New York Times Magazine in, I want to say, April of 2022, that I was actually interviewed for, so I knew it was coming. I didn’t like how the interview went. As soon as it came out online, I was like: Okay, I’ve got to read this. I’m not gonna say who the journalist was. But it was a 10,000 word piece and it was basically just fawning over, it must have been, GPT-3 and OpenAI at that point. Then there was what some critics say, and some quotes from me and a couple other people. It’s like: I won’t be set into the critics box, because that leaves the framing of the debate to the people who are trying to sell the technology.

I really appreciate the journalists, and there are some out there. So Karen Hao comes to mind; Nitasha Tiku, Billy Perrigo, and Chloe Xiang all do excellent work. And there’s more, who are really in there asking these questions of: Okay, what are these companies trying to do? How is that affecting people? What is behind these claims that they’re making? Then you have other journalists, who will basically ask the people behind these companies, the C-suite executives, for their opinions about things. There was one about: Would ChatGPT make a good mental health therapist? Some CFO of some company was quoted as saying: Yeah, I could see that being a great use case. But that’s not the right person to ask. Don’t go for their opinion!

PM: Exactly. Immediately, when you think of that specific example, you’re like: If you’re doing reporting on this, go back and read Joseph Weizenbaum in the 1960s. This is exactly what he was working on in that moment, and the idea that we’re not going to learn any of the lessons there... I feel like so many of the things that you’re describing come down to a fundamental flaw in a lot of the reporting on technology, because it has no interest in the history there. As you’re saying, it allows the people who are developing these technologies, who are running these companies, to frame the narrative around them. Then people like you and me are cast as the critics who are saying something that is a little bit different from what they’re saying; maybe we should take this seriously, maybe not. Because immediately as you were saying that about that article, it made me think of how crypto was covered by a lot of people, where there was a group of skeptics or critics, whatever we were called. It was like: This is what’s going on in the industry, and these critics say this sort of thing, and the critics were very much proven right.

EB: Exactly, but critics just sounds like you’re the naysayers or you’re the whatever. It’s like: No, maybe they’ve got something there. Maybe they’re speaking from expertise. For this question of would this make a good mental health support? What I would want to see reporters doing is, first of all, getting a really clear idea of how the technology works from someone other than the person selling it, and then presenting that clear idea to somebody who’s got the domain expertise, somebody who does mental healthcare themselves, for example, and say: Okay, how would this work in your context? What dangers do you see? What benefits do you see? Doing the reporting that way instead of asking the people who are going to make money off of this.

The thing about the synthetic text — this has calmed down a little bit — but when ChatGPT first went up, I started seeing lots and lots of articles that basically took the form of one or two opening paragraphs. Then: Haha, that was ChatGPT, not a real reporter. I thought: What news outlets are willing to risk their reputation that way? You have literally printed something that is false or not even false, and you’re supposed to be an information source. The reason I know this happened a lot was that I have a few Google alerts set up, and also have the phrase, stochastic parrots, but I think this one was coming through for natural language processing and computational linguistics. I just kept seeing lots and lots of these. It’s like: It’s not original, and it is a terrible idea to do reporting like this, but it kept happening.

PM: I think it just plays into these broader problems because, as you’re saying that, I think about the publications like The New York Times and Time magazine that were selling NFTs a couple years ago. It’s like: Are you really putting aside your journalistic ethics and all of these considerations that you know should be there to quickly cash in on this kind of boom? It’s just shocking to see these sorts of things. It worries me very deeply, especially at this moment, when what we really need are critics and people who are informed about these technologies, people who are able to look at what is being proposed by these companies and say: Hold on a second! Let’s take a step back — let’s assess these claims. Let’s make sure what you’re saying is accurate, that we can trust it, that it’s not just a bunch of PR stuff from your very well paid marketing folks. Let’s check and make sure that what you’re claiming here is actually making sense.

Building on that, we’ve talked a lot about the hype around it. Obviously, that comes out in reporting but on top of that, the industry is putting out a lot of these narratives that are very self serving and ensure that we believe the types of things that they want us to believe about these technologies. In particular, that we should be very scared of them, that they’re going to have massive ramifications for all of us, if we don’t act quickly and take their advice for what we should do to rein in the very technologies that they are unleashing on the world in many cases.

Of course, if people are not familiar, what I’m really specifically referring to here is a lot of narratives that are coming out of OpenAI, but also this letter that was published about a week ago, as we speak, is “The AI Pause Letter.” Elon Musk is most notably associated with it, but many other influential figures in the tech industry who are putting out a particular narrative around how we should be approaching these tools. Do you want to tell us a little bit about what that letter was and what it proposed, what the narrative that it was trying to have us buy into actually was?

EB: This letter is proposing a six month moratorium on something rather ill-defined. I think it’s large language models that are more powerful than GPT-4. But how do you measure that? How do you know? The idea is six months so that some sort of governance framework can keep up or catch up. It’s written from this point of view that these things are maybe on the verge of becoming autonomous agents that could turn evil and destroy the whole world. It very much is of a piece with the whole longtermism stuff that Émile Torres and also Timnit Gebru have been doing great work exposing.

There’s one or two okay ideas in there about needing regulation around transparency — yes. But then a whole bunch of stuff about like: We need to make sure that we develop these to be a long list of adjectives ending with loyal, which is just usually misplaced. It’s like: No, this is a tool. It’s technology. Is your hammer loyal? Is your car loyal? That doesn’t mean anything. It’s part of this narrative of AIs are separate autonomous thinking agents that are maybe now in their infancy, and so we have to nurture them and raise them right. Then they become things where we can displace accountability to those things, instead of keeping the accountability where it belongs. The letter was really infuriating, I have to say, and certainly got a lot of attention.

One of the things that I found particularly annoying about it is that they cite the “Stochastic Parrots” paper. Let me get their exact words, because this is really frustrating. First line of the letter: “AI systems with human competitive intelligence can pose profound risks to society and humanity, as shown by extensive research,” footnote one. Our paper is the first thing in that footnote. We were not talking about AI systems — we certainly did not claim that the systems we were talking about have something like human competitive intelligence. We were talking about large language models. Yes, we showed that there’s harms to that, but that’s what we’re talking about. We’re not talking about autonomous AI agents, because those things don’t exist.

PM: The whole focus of the letter is really, as you’re saying, shaped by this influence from the longtermist perspective, and of course, the institution that put this out was the Future of Life Institute, I believe, or Future of Humanity Institute. Anyway, it’s funded by Elon Musk to forward these longtermist views, and of course, as you’re saying, if people want to know more about that, they can go back to my episode with Émile Torres, where we go into that and discuss it further. Then there’s also the suggestion or linkage toward artificial general intelligence. This is the idea that is being forwarded in this letter, that the AIs are becoming more powerful and that we need to be concerned about the type of artificial general intelligence that we’re developing, so we don’t doom ourselves as a human species, so to speak.

The idea that is forwarded there is that the harms, or potential harms, of AI are in the future, depending on how we develop this technology now. One of the things that you’re talking about, and that is so effectively explained in the “Stochastic Parrots” paper is that the harms of AI are here already. We’re already dealing with them. We’re already trying to address them, and of course, that was part of the conversations that I had with Timnit Gebru and Dan McQuillan, as well in the past on the show, if people want to go back and listen to those. The idea that they’re putting forward that we need to pause now, so we can make sure that there’s no future harms is really misleading because we’re already trying to address those things, but they’re not interested in them because they don’t have any relation to the artificial general intelligence that they’re talking about and that they are interested in.

EB: Exactly. When I first got into this field of ethics in NLP — which is how I was calling it back then — I taught the first graduate seminar, was how I learned about it, learning with students. I found that I had to keep pulling the conversation away from the trolley problem. People want to talk about the trolley problem, and self-driving cars, let’s say. What I realized over time was that that problem is really attractive because it doesn’t implicate anybody’s privilege, so it’s easy to talk about. You don’t have to think about your own role in the system, and what’s going on and what might be different between you and a classmate, for example. I think, similarly, these fantasies of world dooming AGI have that same property. It’s very tempting. Take a look at the initial signatories, and the authors of that “AI Pause” letter, they are not people who are experiencing the short end of the stick in our systems of oppression right now. They would rather think about this imaginary sci-fi villain that they can be fighting against, rather than looking at their own role in what’s going on in harms right now.

PM: It’s so frustrating to hear you explain that because it seems like that is so often what this all comes down to, because the longtermism is the same thing. Let’s think about a million years out into the future or whatever, or much further than that, instead of the very real harms that are happening today that we could address and improve the human species from now out. But instead, it’s these wealthy people who are not connected to any of the harm and suffering that is happening — well, they’re causing much of it, of course — but they’re not experiencing it themselves. So they separate themselves and say: Oh, we’re above all this, so we can think for the future of humanity. What you’re talking about here is very much the same thing.

Before we were talking, you mentioned how, obviously, there is a large AI ethics community that has been warning or talking about these potential issues for a long time, or issues that they’re not mentioning in this paper. How if they were seriously concerned with it, they could be talking with some of these people. One of the things that has really stood out to me recently is we saw, previously, Google fire people like Timnit Gebru and Margaret Mitchell for calling out the potential harms of these systems. In recent months, we’ve seen that many of these companies — at the same time as they’re rolling out these chatbots and things like that — are completely decimating their AI ethics teams that they used to have on staff to do this kind of work, at Microsoft, Twitter, and many other of these companies. So I guess, what do you make of the approach of the industry there, and how they approach AI ethics and what they could be actually doing if they actually cared about these problems?

EB: The first thing I want to say is that the people who are doing that work within the companies, the AI ethicists at Google, Microsoft and Twitter, are amazing researchers doing amazing work. I think that in an ideal world — maybe holding the size of the companies constant, which is maybe not ideal — let’s say we would have strong regulation that would hold the companies accountable, and people working on the inside, to help shape things so that it would be compatible with that regulation. Maybe even more forward looking than that, as well as people like me in academia who can train the next generation of the workforce, to be ready to navigate this territory, but also, with the comfort of tenure be able to just call this stuff out. I was never at risk of losing my job. I’m really devastated for my co-authors that that’s what happened as a result of this paper, while still being really grateful for the chance to have worked with them. One of the reactions, so the “Stochastic Parrots” authors wrote a response to the “AI Pause” letter that’s up on the DAIR Institute website. DAIR Institute is the institute that Timnit started.

PM: Of course, I’ll put the link to that in the show notes if anyone wants to go check it out.

EB: Excellent. We put that together, partially because we were getting hammered by the media. I’ve brought that on myself for my part by putting out a tweet thread, but it seemed like it’d be efficient to have a joint statement that we could then point people to, and that has been helpful. One of the reactions we’ve gotten to that is people who signed the letter telling us that we are squandering the opportunity that the letter introduced, by creating the appearance of infighting. There’s an older narrative of like: Well, why can’t the so-called AI safety people, and that’s the longtermist view, and the AI ethics people get along? Don’t you want some of the same things? I’ve got a bunch of responses to that. One of them is: if your very framing relies on AI hype, then you’re already causing harms, and I’m not going to get on board. Another one is: Do you really think that I have something in common politically with Elon Musk?

PM: I hope not.

EB: But a third one is: If the so-called AI safety people were really interested in working together, then they would not pretend like: Oh, no, now it’s a problem; now we have to worry about this. Rather, they would go and build on the work of the people who’ve been doing this. I want to point to the Algorithmic Justice League with Joy Buolamwini. I want to point to the AI Incident Database, which is this great project that is collecting these examples. It’s not exhaustive because we don’t hear about all of them, but it has the known ones.

And then there’s a whole bunch of books. Starting as early as 2017, you have Cathy O’Neil’s “Weapons of Math Destruction.” The next year, Safiya Noble’s “Algorithms of Oppression.” Also 2018, Shoshana Zuboff, “The Age of Surveillance Capitalism.” 2019 brings Ruha Benjamin’s “Race After Technology” and Mary Gray and Siddharth Suri’s “Ghost Work.” Also in 2019 is Meredith Broussard’s “Artificial Unintelligence.” A bit more recently, 2020, there’s Sasha Costanza-Chock’s “Design Justice.” Then last year, Brandeis Marshall, “Data Conscience: Algorithmic Siege on our Humanity.” So there’s a huge literature here with people doing really brilliant work. If the AI safety people say we should get along and fight for common goals, well, come read! Come learn from the people who’ve been doing this. Don’t just pretend that you can come in and Columbus the whole issue.

PM: We know that they’re very much not interested in that which shows their whole perspective and how disingenuous they are in saying: We’re very interested in looking at the potential impacts of these technologies and blah, blah, blah. I would also say, I appreciate you giving us this reading list for the audience, I will, of course, make note of all those in the show notes for people, if they do want to pick any of them up. I have a final question that I want to ask to close off our conversation, but before I get to that, is there anything else on the topic of generative AI models, these chatbots, the work that you’ve been doing, that you feel like we haven’t gotten to in this conversation that you think is important to point out as these things keep evolving, and this hype — I feel like maybe it’s hit its peak, hopefully — but who knows? Maybe something is going to come out tomorrow that is going to show I was very wrong on that?

EB: We keep thinking it can’t get worse than this. Alex Hanna and I do this Twitch stream that will eventually become a podcast called “Mystery AI Hype Theater,” and we are never short of material. It’s more the other way around: there’s things that really ought to be taken apart in that format and we just can’t do it frequently enough to keep up. I think one part that we haven’t touched on, but that does come out in Liz Weil’s piece in New York Magazine, is just how much dehumanization is involved here on many, many levels. You have, for example, the ghost work that is dehumanizing the workers. Deb Raji has this lovely essay in the MIT Tech Review in late 2020, talking about how data encodes systemic racism, and that racism is dehumanizing to the targets of racism.

So just across many, many dimensions. One of them is this idea that if we’re going to call large language models, stochastic parrots, well, maybe people are stochastic parrots, too. What I see there is somebody who so desperately wants this thing they’ve created to actually be artificial general intelligence, which is not defined, but it’s supposed to be something like human intelligence, that they’re going to minimize what it is to be human, to make them the same. I guess I really want folks to be proud of our humanity and stand up for it and not fall for that.

PM: As I said, this is one of the things that I really enjoyed about your work and really focusing on the human element of language. How we really need to resist this desire by people like Sam Altman, who of course, sent a tweet saying: “I am a stochastic parrot, and so are you.” That tries to degrade human intelligence, so that we can try to say: Oh, look these computers are doing something very similar to us. We should really resist that attempt to do something like that. So I really appreciate that part of your work and the broader things that you’re calling attention to.

Now, there’s so much more that we could talk about in this conversation, I’m sure that we could go on for another hour and discuss so much more. But I wanted to end with this question because you, of course, are a trained linguist, and so I wanted to ask you about some words. You must have thoughts on the terms that we use to talk about these technologies and how they seem to be designed to mislead us about what they actually are, designed for PR speak.

An example I always use is how we label everything with some degree of internet integration as being smart: a smartwatch, a smart home, a smart refrigerator. That brings with it particular assumptions and connotations. Oh, my refrigerator is smart now. This must be better. This must be fantastic. But then that’s also the case with artificial intelligence, which supposes that these technologies are in some way, again, intelligent themselves and have an artificial form of intelligence. So how do you think about the way that this industry and its PR teams use language to shape public perceptions about its products and services?

EB: I think it’s insidious, and this has been remarked on going a ways back. So Drew McDermott coined the term wishful mnemonic, and Melanie Mitchell has picked it up. Drew McDermott, in the 1970s or 1980s, I’m not sure, but a while back, was talking about how computer programmers will say: Okay, this is the function that understands a sentence, so they’ll label it the understanding sentence function. But really it’s like: Well, no, it doesn’t. That’s like what you wish it did. That’s why it’s a wishful mnemonic. Mnemonic because yes, you want to name the computer functions so that you can go find them again, but they should be named more honestly. I’ve started having a discussion with Sasha Luccioni and Nanna Inie on anthropomorphization in the tasks. So, more on the research side and less on the industry side.

But I have been trying to train myself to stop talking about automatic speech recognition — though I failed in this interview — and to talk instead about automatic transcription. Because automatic transcription describes what we’re using the tool for, and automatic speech recognition attributes some cognition to the system. I don’t like the term AI, like you say. There’s a wonderful replacement from Stefano Quintarelli, to call it SALAMI. I can send you the links so you can see what that stands for, but as soon as you say SALAMI instead of AI, everything sounds wonderfully ridiculous. We’re going to give you this SALAMI powered tool. Does this SALAMI have feelings? Like it’s wonderful.

There’s an additional layer to that for Italian speakers, apparently, because to call someone a salami in Italian is to say they are not very smart. So, there’s an additional level there. I do think that it’s insidious, when this becomes product names because the news reporting can’t not name the products they’re talking about, so they’re stuck with that. But then it’s hard to take the space and say: Okay, so they call this AI-powered whatever, or smart home, but in fact, we just want to flag that that is being repeated here just as the name of the product and we’re not endorsing or whatever. Like that doesn’t happen, so problematic.

PM: I think that’s very rare. Like one of the few examples I can think of where a term actually changed is in the early 2010s, when everyone was talking about the sharing economy and how wonderful it is. Then after a few years we’re like: Yeah, people aren’t really sharing here, so this is the gig economy or the on-demand economy or something like that. It’s a bit more accurate for what’s going on. But I feel like you rarely see that when we actually talk about specific products and things like that.

EB: Yeah, but we can try! I go for so-called “AI,” I put AI in quotes all the time. We can just keep saying SALAMI or, as we say in Mystery AI Hype Theater, ‘mathy-math.’

PM: We need to do our best to try to demystify these things as much as possible. Emily, it’s been fantastic to talk to you, to dig into your perspective and your insights and your expertise on all of these topics. Thank you so much for taking the time!

EB: My pleasure and I’m really glad to get to have this conversation with you.
