Big Tech Won’t Revitalize Indigenous Languages
Keoni Mahelona
Notes
Paris Marx is joined by Keoni Mahelona to discuss the colonial nature of data extraction by major tech companies, and how Te Hiku takes a very different approach to revitalize the Māori language.
Guest
Keoni Mahelona is the Chief Technology Officer at Te Hiku Media. Follow Keoni on Twitter at @mahelona.
Support the show
Venture capitalists aren’t funding critical analysis of the tech industry — that’s why the show relies on listener support.
Become a supporter on Patreon to ensure the show can keep promoting critical tech perspectives. That will also get you access to the Discord chat, a shoutout on the show, some stickers, and more!
Links
- Keoni and some of his colleagues wrote about why OpenAI’s Whisper is another example of colonialism.
- Wired and MIT Tech Review have written about the work Te Hiku is doing with Māori language in Aotearoa New Zealand.
- Mark Zuckerberg owns a lot of land in Hawaiʻi, and it’s quite controversial.
Transcript
Paris Marx: Keoni, welcome to Tech Won’t Save Us!
Keoni Mahelona: Aloha. Thanks for having me!
PM: Very excited to chat with you. Obviously, we connected when I was down in New Zealand a few months ago. So excited to finally have you on the show. So we can dig into all the exciting and really interesting work that you’ve been doing.
KM: I’m excited to be here. Sorry. It’s just really awkward because I do listen to your show. And I listen to everyone and how they start out and say: Oh, it’s great to be here! Thanks for having me. And then the following thing, and you say what you just said, and then they follow up. And it’s just kind of weird in my head. My head’s like: What am I supposed to say right now? So I’m going to break the curtain and just be like, this is straight up just a conversation. I’m not going to be too formal about it.
PM: No, absolutely. And it’s always great to have listeners of the show on the show itself. I’m sure you will be used to hearing that as a regular listener. I want to start by asking a bit about the work that you’re doing, because you are the Chief Technology Officer at Te Hiku Media. Can you tell us a bit about Te Hiku, what it actually does, and what its goal is?
KM: Yes, so Te Hiku Media, formerly known as Te reo irirangi o te hiku o te ika. It started out as a radio station in 1990. It was born out of legislation to give Māori, the Indigenous people of Aotearoa, space on the airwaves, on the FM frequencies. Because prior to that, it was mainly commercial entities that had access to these frequencies, and were broadcasting in New Zealand English. So through a lot of work of fighting, Māori fighting for rights for their language, for their culture, for the land and still fighting today, they were able to get access to a range of spectrum FM frequencies for different tribes in New Zealand to broadcast. So in 1990, we started out broadcasting in te reo Māori, the language that’s specific to the far north of Aotearoa. Since then, Te Hiku Media got into terrestrial television — bunny ears sort of TV broadcasting, and that was through public broadcasting as well, but not through a specific sort of Māori based allocation of frequency, just sort of community broadcasting through an organization called New Zealand On Air Fund.
And then eventually, we moved into digital and moved online. That transition slowly started when the internet came about, but really, it started in around 2012, when we had this big digital switchover. And that’s when most of the western world decided that we were going to have 4G, which is great. But in doing that, we’re going to have to turn off our old bunny ear TV, those whitespace frequencies. So you’re looking at, what is it, 700 megahertz or whatever for 4G, which used to be your terrestrial television broadcasting, but now it’s your telco sort of 4G. Because of our location, in the far north of New Zealand, we weren’t going to have this new terrestrial-based HD digital TV broadcast because we didn’t have enough people in our community.
Our only option to continue broadcasting on sort of a public broadcasting space was through the satellite. And we didn’t have the kind of money to broadcast through a satellite. So the digital switchover pretty much killed our television station, our terrestrial, bunny-ear-based television station in the far north, in Kaitaia. But in doing that, it forced us to move online. And so we’re talking 2013 — Te Hiku Media, this small Māori organization, in the far north in New Zealand, started broadcasting 24/7 Web TV. This is before your national broadcasters in America were doing it, before Facebook was livestreaming and before Periscope came around. So we started out very early on doing 24/7 live television, and also livestreaming important community events.
I joined the organization in 2014, and the organization decided that it really needed a strong digital strategy. Many of the people from our community — that our organization is meant to serve — don’t actually physically live in the community, in the far north in New Zealand. A lot of them live in Australia for work; they’re working in mines, so they can make money and send it back home to their families, or they live in Auckland, in the major cities. So our organization, with the support of the elders in our community, said: Look, we have to be online, the stories that we tell on the radio need to be made available online so people can access those stories, so people can access the language, so people can access their culture. So we said: Okay, that makes sense, let’s do it. But the key was: How do we do it? And even to this day, a lot of small media, or broadcasting organizations, have this sort of strategy of: You have a WordPress website, and you put all your content in YouTube or SoundCloud or Facebook, and then you embed that content on your WordPress website.
Sometimes you can do this for pretty much free or very cheaply. We knew that we could not do that. We knew that if we put our content in YouTube, we would be signing the rights away to that content to YouTube — aka Google, aka Alphabet, or whatever they’re called these days — and they can do whatever they want with it. And they’re very explicit about that in their terms and conditions: You give us exclusive rights to create derived works, and derived works means a lot of things, including machine learning models. This is 2013-2014. We had to build our own platform, because the language is a taonga, it’s a treasure for Māori. And for many Indigenous people, our languages, our cultures are treasures, and we look after them the same way we look after our environments. We’re stewards of environments, we’re stewards of our data. And so we had to build our own digital platform from scratch.
Now, that sounds really fancy and hard, but we just used Django, which is an amazing open source web framework. We used Django to sort of build our platform. Now, fast-forward almost 10 years, and we have thousands of hours of high quality te reo Māori data, not just voice data in terms of training speech models, etc., but also the content: the knowledge embedded in that data is also there, high quality Māori content. And so now, in addition to still broadcasting on the radio, five days a week, we have Aunty Girl, our oldest staff member, who just turned 80 in November. She’s on the radio every morning speaking te reo Māori, and we’re still doing regional television programming. We’re still livestreaming important community events. Next week, we’re livestreaming a speech competition for Te Tai Tokerau, that’s a high school speech competition in te reo Māori and in English.
In addition to all that, we’ve got eight A100s — those geeks out there know what I’m talking about — eight A100s, four with 80 gigs, and four with 40 gigs, sitting in this very derelict, musky building in Kaitaia, training machine learning models: training models for speech recognition, training models for speech synthesis, training models to measure the pronunciation of te reo Māori in real time. This is so we can help people improve their pronunciation, to help bring back the native sound. Because through colonization, English sound has been leaking its way into te reo, and the same in other Indigenous languages. Of course, there’s this whole ChatGPT thing going on, and everyone’s so excited about it. We’re thinking about it — we’re not getting too excited about it — but we are thinking about: Oh, yeah, maybe we should address the elephant in the room, and think about what we can do moving forward in terms of LLMs (large language models) and building ML (machine learning) based technology that can help us achieve our mission, which is the promotion of te reo Māori.
PM: I think you’ve given us such a good picture of what the organization is doing, what the goal of the work is. It’s so fascinating to hear you talk about how this is a radio station that was created in the 1990s, and then evolved into television, and then as part of the digital switchover goes online, and is now working on these advanced AI tools around Māori language to promote the revitalization of the language. And I wanted to talk about that revitalization piece, because this seems really core to everything that has been guiding this organization since its inception in 1990. Obviously, I’m sure it’s true in Aotearoa, New Zealand — as it is in Canada, as it is in many settler countries where these colonizers came in and eradicated the languages — what was the kind of effect of that on the Māori language and other Indigenous languages? What has it been like trying to revitalize these and trying to get these languages to be spoken more in society? Because it seemed to me, when I was in Aotearoa, I would say, you see Māori a lot more commonly than you see Indigenous languages in countries like Canada or the United States. Can you talk to us a bit about that?
KM: I’m Hawaiian, nō Hawaiʻi ahau; my partner is Māori. My partner is also the CEO of our organization. I’ve very much been a part of the family, the whānau and the community where he comes from, but I am Hawaiian. So I work for Te Hiku Media and represent them as the CTO. I am able to state or advise on issues relating to language revitalization and technology and AI, data sovereignty, etc. But I don’t speak on behalf of the Māori people. I only speak on behalf of our organization and perhaps the Marae, which is the small community that my partner comes from. I just wanted to put that out there because now we’re talking about language revitalization with te reo Māori. And also I do want to talk for and about Hawaiian as well, because there is a way we can compare the two.
Now, I’m just a small blip in time. Colonization happened a couple hundred years ago. Our people have been fighting ever since for rights, for our land, for our language, for everything. And I’ve only really just come here recently, compared to decades of fighting. It just so happened that I’ve come here around this time of “AI,” with quotes in the air. So in terms of the language revitalization movement in Aotearoa, in the 80s there was the Te Reo Māori legislation, which made te reo Māori an official language of Aotearoa. That then led on to the legislation that I talked about, which was the one for the FM frequencies. There’s a few other frequencies I didn’t talk about. Then there was 3G, then there was 4G, and now there’s 5G. So through the Treaty of Waitangi, Māori have rights here, and that’s what they signed in 1840, and the Crown has ignored those rights.
But now the Crown is kind of listening. And so Māori have this mechanism, the Treaty of Waitangi, that allows them to recover their rights, and get their rights to land, to spectrum, to speaking their language, those sorts of things. A lot has accelerated since this legislation in terms of funding from government to support language revitalization. Whether that means supporting Kōhanga Reo, which is early childhood education, immersion in te reo Māori for kids, or whether it’s primary or secondary school, so there’s a lot of kura kaupapa Māori in Aotearoa, which are Māori immersion language schools. Now, the government has set a goal of having 1 million speakers of te reo Māori by 2040, and you can debate as to whether that’s achievable or whether it’s ambitious. At the end of the day, it means we’re going to need more Māori language teachers, which means we are going to need more people learning te reo Māori.
There’s a lot that’s going to have to happen in order to not only have a million people speaking te reo Māori, but a million people actively participating in society in te reo Māori. And what does that mean today? That means talking to these stupid things, talking to phones, in your Indigenous language. There’s this whole digital realm; there’s call centers; there’s voicemail. There’s so many things where, now today, automatic transcription is ubiquitous. People expect to have live captions in any sort of Zoom call these days. The technology is so ubiquitous now, for English language tools. There’s an expectation, in some cases, that they should work for te reo Māori. Now, to contrast that with Hawaiʻi. Hawaiʻi has seen the same, I guess, renaissance, which started around the 70s — both Māori and Hawaiians — but I guess less support from government, or from the colonizer, in this case.
I did not know that it was illegal to teach Hawaiian in school until the 80s in Hawaiʻi. I didn’t know that because when I was a kid, we had the kumus, who would come around — this is public education, public school. You’d have a time where the kumu would come in. They play the ukulele, and you sing your colors in Hawaiian and you eat some kalo and some sugarcane, that sort of thing. But I had no idea that it was that recent that it was still illegal to speak Hawaiian in school in Hawaiʻi. There really isn’t funding from state or from federal for the revitalization of Hawaiian. I mean, there’s probably some money out there, but nothing like you see in Aotearoa in terms of what they’re putting into ensuring, not just that the language is revitalized, but that it’s thriving. That it’s actually thriving in this country. I don’t see anything in Hawaiʻi that’s wanting to do that, aside from the actual communities who have been doing this for decades and who have been fighting.
And you hear it, you go to the Big Island, Hilo, or even on the south now. And you can hear people talking in Hawaiian, people at the hotels, the workers there, or even just families at resorts talking in Hawaiian. It’s amazing and it feels really good. But you don’t hear that anywhere else. You don’t really hear that on Kauaʻi, you certainly struggle to hear it on Oʻahu, in the main cities, unless you go to the right places. So there’s definitely a strong community effort and will to bring the Hawaiian language back, but you don’t see the funding coming in that you do at the government level in Aotearoa. I think when you hear other Indigenous peoples, it’s the same sort of situation. Obviously, we all have the same passion and fire to learn our languages and bring them back. But when you have to put food on the table, or actually have a roof over your head, or access to clean drinking water, there’s so many other things that are essential to live, before actually having to learn a language that was literally beaten out of your people.
PM: It definitely falls down the list of priorities, when the actual things that you need to pay attention to are so existential. Obviously, we’ve seen that in Canada as well, where we had a whole residential school system that was designed to ensure that indigenous people were having their culture taken from them, as part of an institutional cultural genocide that happened here, and that the State and the country and society is finally reckoning with. One of the things that we are always told growing up is that Canada is a bilingual country. It’s English, but it also speaks French. It just feels so weird to hear that today, because no, it’s not. There are all these Indigenous languages, as well, that we’re slowly starting to hear more of in society. If you go up north, you’ll see street signs with the indigenous languages on them and stuff like that. But it feels like there needs to be more of that, and it feels like calling Canada bilingual feels very wrong today, in a way that maybe it wouldn’t have a few decades ago.
That’s sort of my comment, obviously, not Indigenous, but just observing this from afar, I guess. So you talked to us there about the importance of Indigenous language and revitalization in the Māori context, but in the Hawaiian context, as well. In terms of Te Hiku, you talked about its evolution over time, and it is building its own AI and language models. Can you talk about how the organization started to do that and why it saw it as such an important thing to begin to do? Especially when you have these major tech companies that are also not just creating English language tools, and language tools, and things like that, but increasingly moving into smaller languages, like Indigenous languages as well.
KM: Though we started out in 1990, we actually are in possession of tapes that were recorded in the 70s, like actual cassette tapes. Because families have realized that Te Hiku is a good place to store that sort of a thing. We can look after your cassette tapes, or they trust us to look after and do the right thing with that taonga, with those stories. And since then, we’ve started to digitize some of our analog audio, and as a part of that project, how do we make these old stories more accessible to people who are on their language learning journey? So we have native speakers, one of whom was born in the late 19th century. They are speaking a language that is hard to find today, it’s a native sound. They’re using colloquialisms and idioms and all sorts of things that you don’t really hear today.
There’s really only a handful of people who could actually completely transcribe these recordings accurately, and then understand the idiomatic expressions that are being used, and sort of translate that for people. Our CEO is one of those people who is able to do this. So when we had this project of digitizing old, native speaker archives, and then transcribing them, it took ages to transcribe. This is around, I think, 2016, 2017 when we started on this project, and then of course, naturally, you’re like: Oh, well, why don’t we just get computers to help us do this?
PM: Why not, right?
KM: Why not? Because we had Siri at the time, but I mean, Siri doesn’t work very well for New Zealand English.
PM: I don’t think Siri works very well with any English.
KM: Oh, really? Okay, well, it works very well with my colonized American English that I got from Hawaiʻi. [laughs]
PM: Fair! [laughs]
KM: So I always thought: Oh, maybe we can do our own speech recognition for te reo Māori. Obviously, no one had done it at the time, we didn’t expect Google or some other big tech to have te reo Māori speech transcription. So that was very much a case of like: Here’s a piece of technology that would accelerate our goal of making native speaker language more accessible to our community. If we can get a machine to automatically transcribe stories from decades ago — and not only transcribe it, but tag the idiomatic expressions, and summarize it and do all this amazing stuff that you could do with technology — to make that piece of content or audio, or make that story more accessible, searchable, etc., accessible in terms of your language abilities, having assisted transcriptions, etc., that would be absolutely amazing. That would help us to bring back this native sound and native culture that has been lost or beaten out of us through colonization.
Well, I was like: Ah, the technology exists, we can do that. But the real challenge we knew was actually going to be a data problem. We knew that the data was going to be the hard part, because the technology was there. How do we get the data that enables us to train a speech transcription model? So fast forward a little bit, we kind of started this journey the same time that Mozilla’s Common Voice started. And whilst we did get wind of Mozilla’s Common Voice, we were like: Uh, should we use their open source repo that does all this, or should we just do our own? And because my experience was in Django land, and not in whatever framework they were using, it just made sense that we continue doing our own thing. I think it took about five months for Mozilla to get about a thousand hours of English. And the demographics of that corpus were predominantly white dudes, because that’s Mozilla’s audience. It’s tech guys, and things like that, and there’s nothing wrong with that. That’s just who their audience is.
We started a campaign to collect labeled audio for speech transcription. We mobilized our community, did some social media videos and had some prizes, etc. And we collected about 320 hours in 10 days. Apparently, when you go to the language conferences, that’s just unheard of. I’m sure big tech scrapes more data every day, but certainly in terms of community language initiatives, that was just phenomenal in terms of the amount of labeled data we collected in a short amount of time. And within a few months, Mozilla’s DeepSpeech came out, so we pulled their repository from GitHub, and we had all our data. So by June 2018, we had the first te reo Māori speech recognition model. I think it was right around a 15% word error rate, which is pretty good, considering we only had about 400 hours of data. But the Māori language is phonetically not as complex as English, for example, with half the amount of characters. So, it worked out pretty good.
PM: It’s great. Obviously, I’ve read a bit about it in some articles that have been written about it too. And I think it’s fascinating to read about that experience, and about that competition that you held in order to get the community to help you out, to get all of this language data, these recordings, that you needed in order to build this model. So that then you could go back, and I assume part of the use, then, is to transcribe all of those decades of recordings that you have, so that people can access those sorts of things. And one of the things that stood out to me too was that there was a distinction in one of the pieces that you wrote between a more contemporary Māori that is more influenced by New Zealand English versus more of a native Māori that is the more original sound. And wanting to distinguish between those and to ensure that people could still hear that original way that the language is spoken as this revitalization effort continues.
KM: That’s the ultimate goal here with these language tools: how do we bring back the native sound? We want to bring back the native sound. We’re hoping that with these technologies, we can help remember what that native sound was — and not just the actual sound, but also the type of language that is used. We talk about colloquialisms and those sorts of things. And whether we can use technologies to help shift people, remove the colonial [e] sound from their E and those sorts of things. That is the ultimate goal. To get our languages and our people back to the state they would have been in if we weren’t colonized, in terms of: where would our languages be? Where would our cultures be? Where would we be technologically, if we weren’t colonized? It’s like we’re always operating at a deficit. We’re trying to aspire to where we could have been or where we should have been, as opposed to these other people who are like: I’m going to go to Mars and colonize it, etc. Because I’ve conquered the world, and everything is solved on planet Earth, but let’s get to Mars and solve some other problems or whatever.
PM: That’s when you really don’t have any more earthly concerns, that you’re concerned about colonizing another planet. But you’re talking about the work that you did, the data that you collected in order to put this model together. Obviously, we’re in this moment where there’s a ton of hype around AI technologies and generative AI technologies, in particular, you mentioned ChatGPT. We also have Stable Diffusion. You’ve written about Whisper, of course, and we can talk a bit about that. You’re talking about the work that you and your team put into building out this model, specifically for the Māori language, to try to help in these revitalization efforts and you’ve talked about how you’re doing this with not a ton of resources.
Certainly you have some computer hardware in the facility that you have, but it’s not nearly the same scale as these major companies. So what do you make of the narratives that we’re hearing right now around AI, as these kind of large companies and these powerful individuals are saying all this ridiculous stuff about how AI is going to transform the world? And then you’re looking at that, from your perspective, and what you’ve been able to accomplish just working on these things as Te Hiku Media with your small team.
KM: I think certainly what these companies are doing is just colonialism. I mean, they’re trying to conquer the world, really. They want everyone to use their tools, their platform, they’re very much an imperialist nation, only they’re a corporation of an imperialist nation. Let’s be honest about that one. Now, the other thing we set out to do is actually build these language tools for Māori, so that Māori can build apps and games and what have you, so that Māori can build digital technologies using te reo Māori as a core. And there was no way in hell any foreign entity was going to do this for te reo Māori. There wasn’t, at the time, enough money to be had in doing Māori speech transcription. There is money to be had, let’s be honest: a million people speaking te reo Māori in New Zealand means that we will have a Māori language economy. In fact, we already have a Māori language economy. There is money to be had, but who should have that money?
PM: Is that a question for me? Obviously, Māori people!
KM: Oh, okay. I’m glad you got that right. Yeah, 100% for you. Absolutely. And why? Well, let’s just remember, well, it was actually their language, not only that, but it was beaten out of them. And our languages were beaten out of us, there were laws that forbade our ancestors from speaking their languages in schools. These colonial governments, and people of those governments, worked very hard to ensure that our languages would become extinct. And in some cases, they have succeeded. In some cases, they are succeeding. Fortunately, for many Pacific languages, they haven’t succeeded. But now we’re at this point, where any tech company with enough resources to scrape all the data of the world — aka, take all the land of the world — can just train up models, and all of a sudden, operate in our languages, and not only operate, but actually sell services to us, in our languages.
So first, they came and told us we couldn’t speak our language, then they whacked us for speaking our language. Now they’ve taken our language and want to sell it back to us. Like, you have no better example of colonization than that. I mean, except what they did with land, which is pretty much the same thing. Land, language, data — it’s all the same to us. So like, that’s the situation that we’re in. But when you want to think about, like: What’s Microsoft trying to do? Well, obviously they’re trying to maximize profits. But what company runs New Zealand’s government? Everyone’s on Microsoft Teams, they’re all running Microsoft Windows, whatever it is now, 11 or something.
PM: Sending their Outlook emails.
KM: Exactly. Exactly. And with any government in the world, these are tendered contracts. You have to use Teams for the next five years. Then the contract comes up for renewal, and there’s some sort of process you follow. Google is going to try and get it and Microsoft is going to try and get it. I think they’re pretty much the only two companies and they’re both American companies. The moment that any one of them can say: Oh, everything in Microsoft also works in te reo Māori. Everything in Microsoft also works in Samoan, in any other Indigenous language where it’s some non-US colony. If Microsoft or Google can say we operate in your language, that gives them another tick that allows them to then secure a multimillion dollar contract with the government for X amount of years.
That’s the value in supporting hundreds of languages. It’s just further domination, in terms of these technologies. I mean, if Apple could speak every language of the world, more people would have Apple iPhones, maybe, they’re so bloody expensive, maybe not. And so that’s the play here. It is colonization, it is domination, they don’t care about the integrity of our languages. They just need it to be good enough. So someone can say: Yeah, ChatGPT is good enough for te reo Māori; let’s start using it. Someone without enough knowledge of the language is going to say that, because it’s not good enough. It thinks in English, and it spits out convincing Hawaiian and te reo, and Japanese, so I’m told.
PM: I think it’s such a good point to talk about why these companies will pursue it in the first place and the financial incentives they have in order to do so. Also, I think the really important point there is that, sure, these companies want to add all these languages to their list, so they can say they’re offering Māori and Hawaiian and all these other ones, but they don’t actually care whether the service that they’re offering in that language is reflective of the language itself. It just needs to be good enough to meet the lowest possible bar, so that they can say that this is another option that’s available on their tool. Whereas someone like Te Hiku and the work that you’re doing — and I’m sure other Indigenous groups who are engaged in this kind of work in other parts of the world — are much more concerned with, as you’re talking about, the actual integrity of the language, the actual sound of the language, that it’s actually representing the language in the proper way, instead of further messing with the language, and misrepresenting it to a public that, as you say, is trying to learn it, trying to revitalize it, in this moment.
KM: That’s right. If they don’t do it right, they will harm our languages more. That’s just obvious. I was going to mention, you talked about, or we talked about, good enough, and what is good enough? Now OpenAI specifically says what good enough is, and that is for their Whisper model, which is a multi-lingual speech transcription model.
PM: Interestingly, a model that we don’t hear very much about. We hear a lot about ChatGPT, we hear a lot about Stable Diffusion. Don’t hear so much about that one.
KM: No, no. Whisper, we haven’t really heard about it. It kind of just popped on the scene at the end of September. You would have known from reading our article or blog. But I think the implications of it are massive. So you think about the ability to transcribe any audio that’s being streamed or placed on the internet. So we’re talking all of YouTube because, let’s be honest, youtube-dl.
PM: We’ve all used those websites.
KM: Everyone’s using YouTube to train the models. Whisper is this multi-lingual speech transcription model now available as a paid API through OpenAI. Now, they have a threshold whereby if Whisper performs better than a 50% word error rate for a language, they will make that language available through their API. Really, getting it wrong half the time is suitable enough for you for a product? Well obviously, they’re not there to provide a good quality product. They’re there to scrape as much data as they can. The whole ChatGPT thing, people were just giving their data away willy nilly. And some knew what they were doing and others didn’t. Some are even paying and giving their data away willy nilly, which is taking a play from Ancestry.com, which was recently bought, I heard, by Blackstone or something.
PM: I read that too.
KM: So this 50% word error rate, well, that’s already a bit mind-boggling. But what are they measuring it against? It turns out, there’s this thing called FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), something like that. It is a data set of around a hundred phrases — probably first written in English — translated into as many languages as possible, a hundred-plus languages. And then “native speakers” in those hundred-plus languages read these phrases in their language. I don’t know who gathered this data, and I’m trying to figure it out. Maybe there’s a listener out there who’s got a bit of insight or wants to send Paris Marx an email.
PM: And I could certainly forward it on.
KM: Right! In 2018 Lionbridge, who sell globalization as a service — that’s their marketing or, that’s how I market them — were soliciting people, indigenous people, to read their languages. Something like $45 an hour for you to go and read phrases in your language. And then there were cases where they actually came back and were now offering $90 an hour. They really wanted a lot of this data. I suspect that that Lionbridge campaign is this FLEURS data set of a hundred-plus languages with a hundred phrases in each language. I have no proof of that. But I suspect that that’s where this comes from. I can’t think of any other huge effort to collect very specific language data from as many languages as possible, including indigenous languages.
Anyway, so let’s go to te reo Māori. So te reo Māori is represented in this dataset. And I’m not a fluent speaker of te reo, but I think anyone who’s lived in New Zealand, who then listens to these readers, can tell you that these are not native speakers of the language. And some of them are not even pronouncing te reo Māori correctly. So this very crappy data set is being used by big tech, by the industry, to determine whether their tools work sufficiently in this list of a hundred-plus languages. Not only is the 50% word error rate just a terrible bar to reach, but the ruler they’re using is pretty fucking crooked. It’s terrible. So, that’s the situation. Now, Timnit (Gebru), who’s been on this show, her and I had a call a few weeks ago. And they’ve made the same observation for African languages. They brought this up at a conference recently in Africa, talking about how there’s this FLEURS data set, and it’s absolutely crap.
And the reason why this is important is because at least for them, investors might say: Oh, why should we support te reo or support Lesan, or these other Indigenous languages? Facebook’s already doing it; OpenAI is already doing it. But actually, they’re not. I mean, sure they’ve done it, but they’re not doing it well. And now they have this measure that says that they can do it, but even the measure is terrible. So the problem has not been solved for most of the languages of the world. Perhaps it’s been solved for English, and any other main colonial languages, but it hasn’t been solved for most of the languages in the world. Of course, Facebook’s response is: Oh, help us to make this data set better. Help us to more accurately understand your language.
And it’s like: Well, why would we do this? Why do we want to help big corporations, big tech, to better know our languages only so they can create more profits from it? Like, what do we actually get in return? The honor of working with some flash company, because that’s a thing. That’s a thing that we see, I see it in the Hawaiian community, we see it here in Aotearoa, like: Ooh, I’m working with Google. As if working with Google is so good, or so important. But people get off on that, and they will make poor decisions because they’re in that situation, feeling like they’re so cool and they’re so great, because they’re working with Google. Like, who cares?
PM: Totally, totally agree with you on that. You’re talking about how these large companies use this data and abuse this data, basically, by scraping everything that’s online and trying to get access to language data that comes from indigenous people, in order to train these models that they don’t really care about, because they’re things that not as many people — not nearly as many people as like English or French, or whatever — are going to use. One of the things that stood out to me, as I was reading about the work that Te Hiku does, is that you have a particular license for the data and the tools that you create. Can you talk to us a bit about that, because that seemed like a particularly important and novel thing that you were doing with what you’re developing?
KM: So we have this license called the Kaitiakitanga License. Kaitiaki is loosely translated into guardian, and the idea is that we’re guardians, we’re stewards of the data, in the same way that we should be stewards of land, we don’t own land, we look after it. And it looks after us. Likewise, we take the same approach to the data that we are in possession of; we don’t claim ownership over it. Perhaps in court, in a Western sense, we might have to say that we do own it, and copyright, etc. But certainly in te reo Māori, in the Māori domain, we don’t own it. We are simply the caretakers, at this point in time, of this data. And it’s really just customs, cultural protocols, that we’ve used in looking after our data. Actually, I will say our CEO has been really good in ensuring that our organization practices tikanga, or Māori protocol, very well.
That’s just spilled into how we operate as a business, and even our staff have picked up on this, and operate with a bit more of sort of cultural intelligence around protocol and things like that. So the Kaitiakitanga License, the other way to sort of say it, is it’s affirmative action for open source. I like to say that because open source is very important. But I think what we’re seeing now, even more so, is that those who are privileged will benefit more from open — from open source technologies, from data in the public domain, especially now, when you need how many H100s to train these models? All the public domain data and open source tools out there are great for you, if you’ve got, I don’t know, a thousand H100s to train an LLM. You know what I mean?
You either need a computer, or you even need an education to know what is GitHub and how do I use it and how do I write code? And many of our people, Māori and Pasifika, aren’t there. Remember, I mentioned: Oh, who’s putting food on the table tonight? Where are we going to sleep? Is the heater gonna work? There’s so much inequity that we’re not even ready yet to benefit from open source, from open AI models. And when we started our project, building these language tools for Māori, do you think Māori were lining up to access this technology? It was non-Māori. Non-Māori were lining up. It wasn’t a very big line. But more than 10 non-Māori reached out wanting access to these tools. And we had to decide whether or not we should give the non-Māori access to this technology.
Because again, we want to ensure that Māori have the benefit, first mover advantage, for Māori language technologies, because it is the language that was, again, beaten out of their ancestors. They should have as much opportunity. And we need to level the playing field, right, because there aren’t very many Māori in STEM. So this is how we’re leveling the playing field: by building these Māori language technologies, but saying Māori have preference to use these technologies first. And that’s what we’re advocating for. So that’s one way to look at our Kaitiakitanga License. So certainly, that’s the approach that we’re taking. But then you have a situation like Duolingo, who now offers ʻŌlelo Hawaiʻi. So for $200 a year, I can learn ʻŌlelo on Duolingo. It’s great, right? It’s great. Oh, it’s so amazing. They’re going to help us save our language.
The Hawaiians got up to six figures. It cost like six figures to help Duolingo to have a Hawaiian language corpus and lesson plan. So the Hawaiians put a lot of money into putting Hawaiian on Duolingo, right? Does Duolingo share any royalties back to the Hawaiian language community? I get it costs money to build apps, we know, and to operate them, etc., etc., etc. But does a portion of those profits actually come back to the Hawaiian language community? The Hawaiians that are living in tents on the side of the road as Mark Zuckerberg builds his fortress, and every other tech person? I mean, Larry Ellison owns the whole frickin island and has weird parties.
PM: I think that Google guy, Larry Page, is over there too, I believe.
KM: Oh, yeah. I heard Elon apparently has a place on Maui as well. I know Oprah actually owns quite a lot of land. But if it’s not one colonizer, it’s another one, right? So what we’re advocating in this instance is: Hey, Duolingo, please give a portion of profits to the Hawaiian language community. And then it gets complicated, like, well, who should get the money, etc., etc.? So I’m just going to say give it to Pūnana Leo. That’s the Hawaiian immersion for the babies, from I think two to four, before you go to kindergarten. So I would just say give it to them. Kamehameha Schools doesn’t need it, they’ve got a lot of money. But we need more Pūnana Leo, we need more Hawaiian immersion. My niece and nephew can’t even go to Hawaiian immersion because the spaces are filled, and sometimes the spaces are filled by — you guessed it! — non-Hawaiians. So we have non-Hawaiians learning our language before even the Hawaiian people can learn their language.
Because not everybody can afford to go and move to this part of the island to access this amazing Kawaikini Hawaiian immersion culture school. Because all the Hawaiians are way down this way, and can’t afford to sit in our traffic, two hours of traffic every day. But the rich people can easily send their kid to go and learn Hawaiian, and win the Hawaiian language competition despite not actually being Hawaiian. And don’t get me wrong, everybody needs to learn Hawaiian, if we want Hawaiian to be thriving in Hawaiʻi. But many non-Hawaiians are having the ability to learn Hawaiian before our own people. And you see the same thing here in Aotearoa. So there’s another playing field we need to level. How many Māori have the free time to just go and learn the language? And there’s the emotional baggage that comes with learning your language that you should have known, right? It is harder for an Indigenous person to learn their Indigenous language than it is for an outsider to learn their language, because the outsider doesn’t have the generational trauma and all the other baggage that comes with the fact that you don’t speak your language.
PM: It’s a really good point. And it’s kind of shocking to hear the story you tell about the people in Hawaiʻi, who are Hawaiians, not being able to access the programs designed to teach people Hawaiian. It just shows how messed up that system is. And I wonder, obviously, I’m sure one of the goals with these tools that you’re developing is to have it reach kind of a wider audience of people, Māori and non-Māori, to try to revitalize this language. So how do you kind of bridge having this license and wanting to make sure that Māori still control the data, still benefit from these tools that you’re creating, but then kind of also having it be accessible to people so that they can work with these tools?
KM: Yeah, I don’t know. And I mean, that’s where we need help, right? I mean, if anyone at Duolingo was listening, that would be a start. I mean, even if it’s a token gesture of royalties from any person learning Hawaiian on Duolingo who’s a paid subscriber, just take a portion of that — whatever percent you want to do, we can find about that later — and send it to Pūnana Leo. And that would just send a signal to the industry saying: Not only should we be paying royalties to artists, we should be paying royalties to all the people you’ve taken data from. In this case, we actually put effort and money and time into, like, creating this corpus and then handing it over to, like, an American corporation, and now they’re profiting from it. That one’s a bit more obvious in terms of royalties; it gets a bit grayer in other places. What I’m passionate about here is I see these ML tools as a way to shorten the time it takes to learn our languages and shorten the time it’ll take to bring our languages back to a state where they’re thriving in our communities. And that’s what I want to happen.
But what’s important about that is not when, it’s not why — because we know all that — it’s the how. And Hawaiians should be profiting from the Hawaiian language because at the end of the day, we’re very much in a capitalistic world. And there’s profit to be had. I sound very much like a Ferengi. Hawaiians should profit from ʻŌlelo Hawaiʻi. Māori should profit from te reo Māori. Sure, we’re going to have to run servers on some cloud provider. And yes, they’re going to make a profit off of us using their servers. And that’s just the economy. But ultimately, Hawaiians, Indigenous people, should be the leaders of Indigenous language technologies, of Indigenous language programs, of anything Indigenous actually. I mean, even cultural appropriation — let’s just talk about Disney for a moment, for all the fucked up…I swear a lot. I’ve done pretty good in not using the F word.
PM: That’s okay. Swears are allowed on this show!
KM: Okay, I forgot. When I was a kid, I got one of those Talkboys from “Home Alone.” Maybe I’m dating myself. In “Home Alone,” he…
PM: I watched “Home Alone!”
KM: So, he can record himself. Anyways, I had one. And then, like, one day my dad’s having a conversation, and he says f—- this and f— that. So I hit record, and recorded my dad for one minute, and he dropped the F bomb more than 10 times in one minute, just in sort of normal speech. It’s just how we communicate. He wasn’t using it in any vulgar way. It’s just: Oh, you know, that fucking guy. Oh, he fucking… That’s a bit of Pidgin.
PM: I can see that you were involved with recording language and being involved with language early on?
KM: [Laughs] I never drew that connection.
PM: I’m wondering, obviously, we’ve been talking about Māori language, we’ve been talking about Hawaiian. Has Te Hiku been in touch with other indigenous groups, and groups who are trying to do indigenous revitalization in other parts of the world, to help them and kind of share knowledge around what you’ve been doing with them, so they can try to do it with their own languages?
KM: Yes, yes, we certainly have. In one instance, someone from another Indigenous community just had to see that it was possible. We gave a presentation in 2019 at ICLDC, the International Conference on Language Documentation & Conservation. And there were a couple of First Nations people, Native Americans, there who saw what we did, and were just inspired to do it themselves, saying: Yeah, we can do this. And I think that was more impactful than any frickin Nature article we could have written, right? Than any paper. We actually don’t write many academic papers, because we just can’t be bothered, to be honest. That’s not how we reach the communities we need to reach. They don’t have access to Nature — they certainly can’t pay for it. But they’re also not reading it. So that has been one way in which we’ve, I guess, impacted the wider community. Certainly the work we’re doing around the Kaitiakitanga License, that’s a no-brainer for other Indigenous people, but it’s actually the non-Indigenous people who’ve been learning about the license. We’ve been having an impact there, which is great.
And as I said, I’m Hawaiian, so we are closely working with the Hawaiians, and we’re trying to build that relationship more, because I want to see these tools we’ve done for te reo Māori; I want to see them for Hawaiian. When you go to Hawaiʻi, right, if you’re on Hawaiian Airlines, it’s good, because your first introduction to the Hawaiian language is good pronunciation, because Hawaiian Airlines does a really good job at ensuring the staff learn the language, but also that their pronunciation is good. But then once you get into the Honolulu airport, you’ll hear someone go: Aloha and welcome to Honolulu International Airport. Your introduction to the Hawaiian language is Honolulu and this bastardization of a language, this mispronunciation. That just happens over and over and over, to the point where even Hawaiians are mispronouncing the language because the mispronunciation is so mainstream. It’s been normalized. Even in pop culture, American TV, there’s always one episode about Hawaiʻi or something, or entire shows set in Hawaiʻi, and you’ve got to listen to those programs, and there’s just so much incorrect pronunciation, or language. And they don’t even care; they don’t even try. You listen to these pilots on planes or stewardesses, and they don’t even try.
PM: Absolutely. It also brings to mind, obviously, you’re talking about Indigenous languages there, in the Hawaiian context, in the Aotearoa context, but it also makes me think of just kind of regional dialects and things like that as well, as they kind of die out, because there’s this broader kind of hegemonic notion of American English or broader English that just gets promoted and that people kind of adopt without really thinking about it, because you’re not always thinking about language and pronunciations when you’re going about your day to day, but it’s still important.
KM: Absolutely. And that’s something we haven’t really talked about, is sort of dialects and regional variation. I mean, Hawaiʻi had that, and kind of to this day still does. A lot of that was lost due to colonization, and some of that language information might be embedded somewhere in archives, but we’re not sure; we have to sort of find out. So, we can use these tools to find the dialects that have maybe gone sort of extinct, and whether we can bring them back, or whether we need to. The other thing, based on the question you asked, and I forgot to go here, and what’s important, is in terms of the work that we’re doing, we need to make sure we’re not another sort of white savior. Right. So whilst we are an indigenous organization, for us to just go to Hawaiʻi and say: Oh, we’re going to build Hawaiian language technologies for you. Oh, and here, you can pay us for it too. I mean, that’s exactly what the colonizer does. So we won’t do that. So it’s all about the how. It’s how do we work with these other communities to collaborate. So if they want us to very much come in, and just, like, build the technology for them, if we can, we would consider it.
But we would much rather help other communities to build up their capability, so they can be the leaders of these technologies, and they can champion the change that they need, because they know their communities best. They know what their communities need. They know what the needs are for their languages. We don’t know. We’re outsiders. We can speak to what we need here in Aotearoa, certainly what we need in the community that we represent in the Far North. But we don’t know what’s best for these other Indigenous communities. Like I said, the best impact we had is just telling our story, and for them to get inspired, to figure out how they should go about the journey of building speech transcription for, say, the Mohawk language, as one example. Rather than us coming in saying: This is how you should do it. But if they need help, maybe they need compute. We’ve got some compute, and some spare time there, we can help, or just sharing ideas of things not to try, because we tried, it didn’t work, and it shortens the path to achieving your goal.
PM: Absolutely, I love that. And I think it’s so important. Not to try to take over what everyone else is doing, but to share that knowledge so that they can build what works for them, taking advantage of the experience that you already have, and kind of giving this a shot first, I guess, and being willing and open to collaborate with other communities and other groups who want to try to or are working to revitalize their languages as well, recognizing that this is something that’s happening, in many countries around the world right now. And it’s something that’s very important and hopefully continues. I thought that this was a fantastic conversation. And I basically just want to close out by saying, like, is there anything that you think that we missed? Is there any kind of point that you wanted to make or leave the listeners with as we’ve had this discussion to leave them thinking about, I guess, the AI tools that we’re thinking about now, but also how this applies to Indigenous cultures, indigenous language, and anything else that you think is relevant?
KM: Well, the one thing on my mind right now is what’s a practical solution moving forward to ensure that our languages do exist on these mainstream devices that we can operate, that we can thrive in our languages in the digital domain on the devices that we have? When you look at how these companies operate — I’m talking about the Big Five; Google and Apple are the only ones that make mobile devices really, sure, Samsung makes them but it’s Google’s operating system. When you look at how they operate, it’s very much these walled gardens. These closed systems, these very, very deep verticals, to ensure that everything is very much in Apple’s lane or in Google’s lane. That is not how we are going to achieve equity in society. I think these companies know that, but that’s how they get more profit. And that’s all that matters at the end of the day, sadly. That’s all that matters to them. I think some might argue that Google is a little better at advocating for interoperability or open protocols. Although Google has also been the same company that kind of gets everyone on board some, like, open protocol train, and then just decides to kill it. They’re both guilty of imperialism.
But what I want to see is I want to see technology where we, as the people who paid for the bloody thing in the first place, we get to decide what machine learning models we’re running on our devices. So I want to be able to pull Siri out and put in a Hawaiian equivalent, or Māori, or a Polynesian equivalent, let’s be honest, who can speak all the Polynesian languages and English and Pidgin. But who also knows us and knows our culture and isn’t going to say stupid things, or do stupid things if we want to look into the future, and digital avatars, and something that has more cultural knowledge. I don’t think these one models to rule them all, which is what they’re all trying to do, because that’s the maximizing profit approach. I don’t think that’s going to work, I think we’re going to need a bunch of distributed models that are attuned to specific use cases, specific cultures, specific peoples. And I would very much like the ability to swap out the models on these devices and use my own models. And you can’t really swap out Siri. But there are ways in which Apple is kind of opening it up. You can kind of get Siri to process commands for your app, etc., etc.
But in terms of: Well, can I get Siri to speak my language? Absolutely not, you can’t do that. And I’m hoping that we can have these conversations. I don’t expect them to agree to our terms. But I would encourage all indigenous people to be very staunch, and make sure that they agree to your terms. And if they don’t want to agree to your terms, then leave the conversation. Because we’ve always been in the position where we had to compromise, you know, in order to facilitate colonization. I mean, even with the Duolingo one. If Hawaiians were more staunch, if I was at the table, I would be like: No, you know, give us a portion of profits. And then you can have this. It’s up to them — Duolingo is going to say yes, or they’re going to say no. If they say no, fine. Let’s go spend half a million or more on some Hawaiians to create a learning app. Because why not? They could use the money. They’re living in tents. We need more interoperability in tech. I’m a fan of Mastodon, and federated social media, decentralization. That’s obviously the way forward. Whether we’re going to achieve it is another question, but I definitely think big tech should be legislated to make their things more interoperable, so that consumers have more choices around the models that have been deployed on their devices, etc., etc.
PM: I know I said that was my last question. But as you’re discussing that what really comes to mind, in a sense is like, obviously, we have these massive companies right now. And we have all this hype around AI and generative AI. And this is all based on a lot of centralized computing power, all these massive data centers that they have around the world, all the data that they’ve been able to scrape off of the wider web, to try to create these models that they want us to believe can do basically everything. But we know that that is not actually the case. And I think that, in talking to you and hearing what you’re saying, I think that you kind of do show a different model and a different approach to these things that not only says we don’t need to have these massive models that are trying to do absolutely everything, we can train these specific models that are doing specific things that we think are important, like revitalizing the Māori language, or the Hawaiian language or whatever, that doesn’t need nearly as much kind of computing power as what they’re trying to use on what they’re doing right now. But we can actually get tangible benefits out of that, rather than just kind of being led along by these massive tech companies, these imperialist tech companies that are trying to take over everything. And I think that there’s a very different model that is kind of being shown there.
KM: Absolutely. We have a bilingual speech transcription model; it code-switches between New Zealand English and te reo Māori. It’s pretty darn good. It’s not perfect. It’s not ready for primetime; we’re not going to release it because it’s not good enough. It’s actually really good at Māori. And it’s not very good at New Zealand English, because you need more English data. We trained this on one A100 with 80 gigs. It took a week and a bit, on the order of, like, 2000-3000 hours of data, right? And it’s better than what Whisper 2 can do for New Zealand English and certainly for Māori. We didn’t even need to go there. It’s just it can’t do Māori. It says it can, but it can’t. Let’s just be honest — it can’t.
But it can do New Zealand English-ish. It’s not as good at New Zealand English as we are. We probably have the best New Zealand English transcription model right now. And we didn’t need to be unethical; we didn’t need to steal any data. We didn’t need hundreds of H100s. I think what we’re showing in the work that we’re doing is that if you really put time and effort into the data and respect into the data that you require to train these models, you can actually do a pretty darn good job when you’re focused on solving a specific context rather than global domination.
PM: Which we don’t need anyway.
KM: We don’t want to all be the same.
PM: No, absolutely not, that’d be so boring. Well, I think that this was a fascinating conversation. I really appreciate you taking the time to come on the show. It’s been great to explore the work that you’re doing, the perspective that you’re offering on these technologies, and how we might approach these things. I really appreciate it, so thanks for taking the time.
KM: Thanks so much for having me, Paris, and responding when I reached out. I love the stories that I hear on your podcast, and I expect you have a pretty cool audience out there. Hi, everybody! And I really wanted to make sure that what we’re doing is heard because we need to see change in this industry, and the only way to do it is just for more people to hear at least our side of the story, and see some ways in which we can make at least some small changes or some steps in the right direction to ensure more equity in digital, and especially for marginalized communities.
PM: I couldn’t agree more and thanks again.