Papago Woman - Ch. 1 (An AI Audio Journey)
By Leonard Bruce
What is the project?
This is a first pass at creating an O’odham audiobook for Papago Woman by Ruth Underhill using an AI voice clone.
Why did I do it?
The main reason is …
Even though I recommend this book to literally everyone, few people have read it yet.
No shade - I’m busy too; I can count the books I’ve read for “fun” in the past year on one hand.
But that is why I wanted to try this out - I wanted to create a way for more folks to access this amazing story, which has been so impactful on my journey to learn more about O’odham history.
The “Why” for creating an AI voice is that I wanted to explore the capabilities of the technology for creating new mediums to tell our stories. But I couldn’t just use an off-the-shelf AI voice - I wanted the voice for this book to be more authentic to the character of Chona. I couldn’t find an AI voice that sounded quite like what I wanted - an O’odham elder telling a story. So I created one.
At the end of the day - it isn’t quite perfect, but I think I’m happy enough with the outcome and the process that I want to share what I have so far.
How did I do it?
I’m going to focus on two main parts here - the AI voice and the audio processing. Before I get too deep, I want to be clear that I don’t know what the heck I’m doing, so take my process with a grain of salt. I’m sure real audio engineers have much better (and quicker) ways of doing this.
The AI Voice
The voice you are hearing as part of this production is a hybrid of a few different voices.
First, the “base” or foundational voice is from my Mom. I then added more voices from ElevenLabs’ AI voice library to make the voice sound older and give it a raspier tone.
I will warn you, reader: this was probably not the best way to go about things. Having multiple voices in the model created some weird outputs in the audio - it was very hard to get a consistent audio level from the AI models.
But - I think it worked out well enough. The final voice is certainly NOT my mother - even though I hear some of her coming through, it sounds as if she had aged 20 or 30 years. It did catch her slightly southern accent, though.
Why my mom?
To start - my Mom passed in 2014, just a few days after her 50th birthday, from cancer. Her deterioration happened so rapidly that I don’t think my sister or I really recognized how fast we were going to lose her. Unfortunately we don’t have much audio or video of her. In fact, I had to train this model on only about 2 minutes of crappy audio from voicemails she left my sister and me.
One aspect of this project was me trying to see if I could re-create her voice from the limited samples I had available. Partially because I think the technology is interesting, but I also wanted to see if I could create a way for her voice to be part of my kids’ lives - to tell them stories, to teach them history and culture.
I can’t think of anything my mom would have loved more than to be part of her grandkids’ lives and help steer them into womanhood. I think she would have loved the idea of expanding that to teaching others in the Community as well - being part of a project with the goal of growing historical and cultural knowledge.
And in some ways that part of the project worked well. If you are interested in learning more, I’ll link to further thoughts on the AI portion specifically. But anyway - let’s move on.
Why mix the voice?
I wanted the voice in this production to be an O’odham one. To be clear, no one voice “sounds O’odham” - we all sound different. But I wanted this story to be told with a more authentic voice, and while I don’t think there is any one way for O’odham to sound, I like knowing that the story is being told by an O’odham woman.
I originally had an elder’s voice in mind that I thought would be perfect for this project, but I wasn’t able to get the family’s permission to use it, and I don’t think it would be right to create a voice without some type of permission.
I was already testing out AI voice cloning for my own voice and my mom’s. But neither of our voices worked by themselves. I’m a man, and I believe this story needs to be from a female perspective. And for my mom, her base voice didn’t work for this story for a number of reasons.
The AI version of my mom’s voice had a lot of issues. One major one was that all of my clips of her were in a very low voice, almost whispering at times. It made the outputs from the model sound too melancholy.
The other issue is that Chona is telling this story at an old age - Underhill states she thinks Chona was somewhere in her 80s when they met. So, my mom’s voice didn’t sound “old” enough. She was only 50 when she passed, and I just didn’t think her voice matched the character.
In the end, I used her voice and others from ElevenLabs in a type of “hybrid” model to create the voice you are hearing.
Why Use AI?
I’ll be writing a whole blog about the ethical implications of this project, but here are some general reasons -
After the Anton’s Defiance production, I wanted to test whether this AI technology could be used to help bring to life more books and stories from our Community. We have such amazing resources to learn from, but no one is going to create more accessible versions of them for us - so I did it myself. I think there are a lot of ways generative AI (images, text, voice, etc.) can be used for bad - but I think this is a cool example of how the technology can be used for good.
The second reason I used AI is that I wanted a voice for this project that evokes the auditory experience of hearing the story firsthand. Chona’s story is so amazing, and I didn’t want it to sound like a California valley girl or a British woman. I wanted it to sound like one of us.
But, I’m doing this project in my free time and with my personal funds - I don’t have the money to hire someone to record a book. Renting a recording studio, taking time to read and re-read, and editing the audio - it’s expensive. So this was a low cost solution to getting the outcome I wanted.
A large part of my work the past few years has been trying to leverage technology to support our Community - we have amazing tools at our disposal, let’s use them!
How long did it take?
Completing this project took well over 180 hours of work doing the following:
Thinking about the implications of using AI for the project
Testing different methods of voice cloning
Testing different AI models
Pre-processing audio (normalizing, compressing, editing)
Re-writing Papago Woman to match AI pronunciation (CHO YAH!)
Re-generating janky audio outputs
Post-processing audio (normalizing, compressing, editing)
I spent a lot of time thinking about this project before I got into it, and it took a long time to complete due to token limits for generating the audio. I think the next chapter won’t be as difficult.
How much did it cost?
This was a bit of an expensive project - here is a breakdown of my general costs:
ElevenLabs (Creator) - $24/mo ($72 total)
ChatGPT (Premium) - $20/mo ($40 total)
Descript (Creator) - $35/mo ($70 total)
Da Vinci Resolve (free) - FREE!
Audacity (free) - FREE!
So ~$200 out of pocket and whatever my night and weekend hours are worth.
What was the process?
I’m not going to outline the whole process - but rest assured, I spent a LOT of hours fiddling with generating voices. I’ll give a quick outline and some pointers…
Pre-Process The Voice - Keep in mind that the model is going to use whatever voice you add, so any ums, ahs, and long pauses are going to show up. I learned that pre-processing the audio to remove those made a huge difference in the quality of the AI voice. I also found that using ElevenLabs’ voice isolator and Descript’s “Studio Sound” add-on made a HUGE difference in the clarity of the voice.
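If you want to script the silence-trimming part of that clean-up, here is a minimal sketch in Python using pydub (which needs ffmpeg installed). The filename and thresholds are stand-ins for illustration - the ums and ahs I still cut by hand, and the voice isolation happened in ElevenLabs and Descript, not here.

```python
# A rough sketch of the silence-trimming pass, using pydub (needs ffmpeg).
# "voicemail.wav" and the thresholds below are stand-ins, not my exact values.
from pydub import AudioSegment, effects
from pydub.silence import split_on_silence

raw = AudioSegment.from_file("voicemail.wav")

# Cut the clip at long pauses, then stitch the speech back together with
# short, even gaps so the model doesn't learn the dead air.
chunks = split_on_silence(
    raw,
    min_silence_len=700,           # treat pauses longer than 0.7 s as "silence"
    silence_thresh=raw.dBFS - 16,  # silence is 16 dB below the clip's average level
    keep_silence=200,              # keep a natural 0.2 s breath at each edge
)

cleaned = AudioSegment.empty()
for chunk in chunks:
    cleaned += chunk + AudioSegment.silent(duration=150)

# Even out the overall level before uploading it as a training sample.
cleaned = effects.normalize(cleaned)
cleaned.export("voicemail_clean.wav", format="wav")
```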
Generate! - Once I had the base voice cleaned up and isolated, I added it into ElevenLabs and used the “instant voice clone” option. I fiddled with the voice settings until I thought it sounded right. This took a lot of tokens because I was trying to get the right output - but eventually I landed on settings (30% variability, 90% similarity, 49% style exaggeration) that worked most often and sounded the best.
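For anyone who would rather script the generation than click through the web app, here is a hedged sketch against ElevenLabs’ text-to-speech REST API (I mostly worked in the browser myself). The API key, voice ID, and text are placeholders, and the mapping from the UI sliders to the API’s voice_settings fields is my assumption - check it against their docs.

```python
# A sketch of generating one chunk of the chapter through ElevenLabs' API.
# Placeholders throughout; I'm guessing "30% variability / 90% similarity /
# 49% style exaggeration" maps to the fields below. If the stability slider
# is labeled the other way around, stability should be 0.7 instead of 0.3.
import requests

API_KEY = "your-elevenlabs-api-key"   # placeholder
VOICE_ID = "your-cloned-voice-id"     # the instant voice clone's ID

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={
        "text": "One chunk of the rewritten chapter goes here.",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.3,         # lower = more variable delivery
            "similarity_boost": 0.9,  # stick close to the cloned voice
            "style": 0.49,            # style exaggeration
        },
    },
)
resp.raise_for_status()

# The endpoint returns audio bytes (MP3 by default).
with open("chapter1_chunk01.mp3", "wb") as f:
    f.write(resp.content)
```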
Post-Process the Audio - After I had the Papago Woman chapter generated, I first used Audacity to clean the audio - normalize, compress, truncate silence. The AI sometimes left weird gaps between words, and this process helped make it sound a lot better. I also sent the audio back through Descript at this point to have its “Studio Sound” add-on clean it up.
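That Audacity chain (compress, normalize, truncate silence) can also be scripted. Here is a sketch of roughly the same pass as a single ffmpeg filter graph called from Python - the filter values are my guesses for a spoken-word mix, not the exact settings I used.

```python
# A sketch of a compress -> normalize -> truncate-silence pass in ffmpeg.
# Filenames and filter values are illustrative, not my exact settings.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y", "-i", "chapter1_raw.wav",
        "-af",
        # acompressor: tame the AI's level swings (0.1 linear ~= -20 dB)
        "acompressor=threshold=0.1:ratio=3:attack=20:release=250,"
        # loudnorm: normalize to a podcast-ish loudness target
        "loudnorm=I=-16:TP=-1.5:LRA=11,"
        # silenceremove: truncate the weird gaps the AI leaves between words
        "silenceremove=stop_periods=-1:stop_duration=0.6:stop_threshold=-45dB",
        "chapter1_clean.wav",
    ],
    check=True,
)
```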
Re-Generate The Weirdness - Look… AI voices are still pretty new, and the tech has some really weird outputs sometimes - odd noises or speed-ups on certain phrases. So I went back, re-generated those sections, and cut in some of the “outtakes”. It sucks because the audio change is noticeable - but it’s better than what was there before.
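Cutting in a retake is the same splice-and-crossfade move whether you do it in Descript, Audacity, or code. A sketch of the splice in pydub, with made-up timestamps (in practice I found the glitchy spans by ear):

```python
# Splice a re-generated "outtake" over a glitchy span, with short
# crossfades to soften the seam. Timestamps here are hypothetical.
from pydub import AudioSegment

chapter = AudioSegment.from_file("chapter1_clean.wav")
retake = AudioSegment.from_file("retake_paragraph3.mp3")

glitch_start, glitch_end = 84_500, 91_200  # ms offsets of the bad span

fixed = (
    chapter[:glitch_start]
    .append(retake, crossfade=60)            # 60 ms fade into the retake
    .append(chapter[glitch_end:], crossfade=60)
)
fixed.export("chapter1_fixed.wav", format="wav")
```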
Celebrate - And that is it! I’m being SUPER reductive in how much work it was, but now that I’ve done it I think it will be easier in the future.
What are ways to make the project better?
There are a few ways this project could be better:
Don’t Use a Hybrid Voice
The biggest issue, I think, is the hybrid voice. If I continue to use AI, finding and using a single voice - preferably recorded with a good microphone - would be best. I think the current voice is passable, but having a dedicated voice for this project would make things much easier.
I’d love it if we (Indigenous, O’odham, GRIC) had a voice database for projects like this. I’m also looking for other books where it would make sense to narrate with my own voice, to try doing this with a single voice instead of a hybrid one.
Get a Local Voice Actor
Honestly, if I had the funding, I would really like to just hire someone to do the voice acting and narrate the book. I’d like it to be someone from Community, and I’d look for the perfect voice. I think this would work well because I haven’t found a way to give the AI direction in its reading. So, while I read some parts of the book as funny or joking, the AI reads it all in a bit of a serious manner. I just don’t think AI is a replacement for a real human being.
Get a Real Audio Engineer
Look - I don’t know how audio works. I’m stumbling my way through it by watching YouTube tutorials. Getting someone who knows how to clean and process audio on the pre- and post-processing side of this project would elevate it a LOT.
Final Thoughts
I’m glad I spent the time and effort on this project. I know a lot more about how voice cloning works, I’ve developed some voice models of myself and my mom to use on future projects, and I created a really cool initial chapter that will hopefully get people interested in reading Papago Woman. I might circle back to create an audio production of the rest of the book, but for now I’m marking this project complete.
If you are interested in some of the more theoretical AI thoughts, you can check out my companion blog HERE.