PS - Yes, I know that isn't how we dressed. AI models suck at representation. But I needed an image to post to Youtube. I'll get a better image sooner than later!
What is the Project?
I created an audio production based on a transcribed 1912 conversation between an Akimel O’odham man and the local Indian Agent about O’odham singing and dancing in the Gila River Indian Community.
Why did I do it?
Because Gina at CPAO told me to - blame her for this. I’ve been kicking this story around in some personal conversations since I first read it back in 2016 or so. I think this story has so many powerful themes and provides a glimpse into the struggle of our ancestors to maintain some type of cultural legacy.
I’ve told the story of this transcript to multiple people, but it was Gina who encouraged me to do something to tell the story more publicly and put it on my website. Sometimes it takes a push to get started with something, and I’m grateful to her for providing it to me. I know this isn’t the only story of resistance from our ancestors, but I think it is a great look into history and I hope other folks enjoy it!
Also, it gave me a good excuse to play around with audio production and AI =)
How did I do it?
This project was broken into a few steps with a few different systems -
First - I read the hell out of this document. This step is the most important! Don’t skimp on the reading folks - AI will only get you so far… and you gotta know when it’s hallucinating on you.
Second - I took the original document and put it through Adobe OCR. This makes the PDF text able to be copied (Ctrl + C) and pasted. It wasn’t perfect. As with most older documents (or poorly scanned ones..), the OCR output was filled with errors, broken grammar, and weird characters. Don’t worry - this is where the AI comes in.
Third - I used OpenAI’s ChatGPT (GPT-4) to clean up the OCR text. This was useful for two things - first, GPT-4 was great at filling in the gaps and fixing errors in the text. There was a bit of hallucination - it invented some false words - but it was pretty minimal and easily fixed by hand.
The other useful part was that GPT-4 re-structures the text to make it easier to transfer to a document. I can’t stress enough how important this is, because you need to “chunk” the project into smaller pieces - more on that later.
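For folks who want to script this step instead of pasting chunks into the ChatGPT window, here’s a rough sketch of how I think about the cleanup prompt. I did this step by hand in the ChatGPT interface, so the prompt wording and the helper function below are my own illustration, not the exact text the project used.

```python
# Hypothetical helper: packages one chunk of OCR output together with a
# cleanup instruction, in the chat-message format most chat models expect.
# The prompt wording is an illustration, not what I actually typed.
CLEANUP_PROMPT = (
    "The text below came from OCR of a 1912 typewritten transcript. "
    "Fix obvious OCR errors, broken grammar, and stray characters, "
    "but do NOT summarize, shorten, or reword anything. "
    "Return only the corrected text.\n\n"
)

def build_cleanup_request(ocr_chunk):
    """Return a chat-style message list for one chunk of OCR text."""
    return [{"role": "user", "content": CLEANUP_PROMPT + ocr_chunk}]
```

Sending every chunk out with the same instructions also makes QA easier - if the model starts “fixing” things you didn’t ask it to, it stands out.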
Fourth - I used GPT-4 to simplify and shorten the text and format it for a podcast or audio production. This step was hit or miss. I kept some of the suggestions it made, but it kept using language that sounded too… AI. I changed my prompt to ask for more “natural language” and that helped… but I still didn’t feel right about it.
Fifth - Sooo…. I went back and reversed all of the AI language changes. Sometimes you do stuff and it doesn’t work. sigh. I realized I wanted to keep as much of the original wording and nuance as possible and the AI kept wanting to summarize things, even with prompts that told it not to summarize. Fail.
Sixth - I was worried that the transcript would be too long so I did some editing and cleaned up some of the sections manually. While I wanted to keep the original text as much as possible, I was worried it would be too long - too many “tokens” – more on that later.
Seventh - Now that I had the text ready, I wanted to record it as audio. My original intention was to read it myself - but I had a cold and my voice was jacked. Then I found that ChatGPT can read its outputs aloud. I was blown away by how good and natural the AI sounded, so I had the idea of using AI voices for the different parts of the transcript. ChatGPT doesn’t allow exporting its audio, though, so I tried some different text-to-speech systems.
I ended up using a website called ElevenLabs. ElevenLabs has a creator tier that allows for uploading “projects” and you can assign different voices to different sections of the text. After a bit of fiddling I was able to find some pretty good voices and assign them to the text. I then downloaded the audio file.
Eighth - While I was listening to the audio file I realized it was too… boring. I wanted to add some spice to it and not just have two voices talking back and forth. Also, some of the text was too close together - there weren’t enough natural pauses between the speakers. So I uploaded the file into a program called DaVinci Resolve - because it’s freeeeee.
Ninth - In DaVinci, as I was editing the audio it still sounded… boring. I wanted to have some background music and some foley - or sound effects. The AI in step 4 had given some ideas about sound effects when it was formatting the project as a podcast, so I had some ideas to start with. I downloaded DaVinci’s sound library, but it was limited, so I went online and found some other audio on Pixabay.
Tenth - Finally, I used Audacity to record the introduction and conclusion to bookend the production and add context. Boom - project done.
Why use AI?
For this project, I think AI was very helpful in getting the transcript cleaned up and easier to read. I’ll 100% be using it for this in the future.
As for the audio - I’m still on the fence about this. I think if I had a budget and more time I would like to have some real voice actors, or at least real people to read the scenes. The AI works well, but it is such a limited selection for “elder” voices - and NO selection for voices that have the accent I want - O’odham/Arizona.
One solution ElevenLabs has for this is Voice Generation. There are ways to create my own voice actors based on a voice sample - but I didn’t want to get into that with this project. It might be something for the future, but I didn’t want to spend the time and effort testing it at the moment.
Overall - I think the AI was super useful in getting the project created and into the world. It isn’t perfect - but I’d say it’s about 80% of my vision. Which… good enough for a pilot project!
AI Tips?
This was a longer set of text, which meant I had to “chunk” it - or break the text into smaller pieces. I find that breaking the project into smaller chunks of text makes it easier for quality assurance (QA) when sending it through the AI. It also reduces the chance that the AI will hallucinate.
Chunking your project will also help reduce the “tokens” that you are using. Most AI systems run on a token system that limits the work you can do or you pay for overages. Chunking the project helps to make sure your prompt actually works without using too much of your token quota.
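If you want to chunk programmatically rather than by hand, paragraph breaks are the natural seams - you never want a chunk to cut a speaker off mid-sentence. Here’s a minimal sketch of that idea. The “about 4 characters per English token” rule of thumb is an approximation, and `max_chars` is a number I picked for illustration, not a real limit from any specific AI system.

```python
def chunk_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars, breaking only on
    paragraph boundaries so no speaker is cut off mid-sentence.
    (Rough proxy: one token is about 4 characters of English text,
    so 2000 chars is roughly 500 tokens.)"""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then gets sent through the AI on its own, which keeps QA manageable - you check one small piece against the original at a time instead of a whole document.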
On AI models - For the text, GPT-4 or equivalent is a must. I tried to do the transcription and other work on GPT-3.5, and it was a terrible experience. A ton of mistakes, the AI wouldn’t listen to my instructions, and even when starting new threads or clearing the history it would keep injecting weird hallucinated words that were never said - or changing words mid-sentence. Maybe I didn’t prompt it well enough?
Either way, GPT-4 was so much easier to use and it rarely made mistakes. I had an existing subscription to OpenAI from past projects, so I didn’t try other models (Gemini, Claude, etc.).
How long did it take?
Not counting how this story has been in my head rent-free for like 8 years… this project took about 32 hours. Most of it was fiddling with audio settings and figuring out how the heck to get it on Youtube.
I spent:
~8 hours doing the transcription and editing for the text
~18 hours doing the audio editing and foley work
~2 hours doing narration and writing
~4 hours of chatting with mentors about the project
What were the costs?
I normally try to create my projects in a way that is free and easy for Community to re-create. This one isn’t free :(
Like I mentioned before, the free GPT versions will make it harder to transcribe the text and the headache isn’t worth it. Similarly, I tried a bunch of free Text-to-Speech platforms and none of them were even close to natural. Most of them sounded like straight-up robots.
Adobe Acrobat Pro - $20/mo (OCR)
OpenAI ChatGPT Plus - $20/mo (Text Transcripts)
ElevenLabs Creator - $20/mo (AI Voice acting)
DaVinci Resolve - Free (Audio Editing)
Pixabay - Free (Sound effects)
Audacity - Free (Audio Recording)
On the note of “chunking” the project - this is especially important for paid versions of GPT and ElevenLabs. Both have pretty limited tokens, and I just barely avoided the limit on ElevenLabs.
What are ways to make it better?
Real Voice Actors
The AI voices aren’t perfect. They get the job done, but there is a loss of nuance. Having genuine voice actors would make this project much better.
Add to the Story
I’d love to learn more about the story and find the family to learn more about the events here and around the Community. Maybe adding more episodes or doing a whole deep dive into this time period. I’ve heard the line of songs still exists as well, so adding more of the actual songs would be great too. Maybe someday…
Better AI Models
This is always my general complaint - the AI space isn’t diverse enough! I’d love if we (O’odham) had our own datasets to help create more authentic content. O’odham voices in this case.
One idea I had was to engage with some local folks to explore creating a voice clone through ElevenLabs. I felt like it would take too long and require a lot more testing. There are also a lot of implications for voice cloning I’d want to ponder.
It would be amazing if an Indigenous community was a hub for this kind of dataset so I don’t have to create it myself - that way we’d have authentic voices and information for AI generation.
Re-do the Dialog?
I don’t know if this would make it better or not. I feel like the original dialog is pretty good… but some of the parts are super long-winded - even with my editing. The last section with Thackery droning on is especially annoying for me. So, I wonder if the project would be more readable or listenable with more focused dialog.
A Play?
Finally - I think this whole situation would be perfect for a play. This is not my expertise, but while I was adding in the sound effects and listening to the final project I kept thinking of how amazing it would be to have this visualized on stage. Especially with O’odham actors!
Or - Maybe once Sora is made public I can dust this project off and practice some AI movie-making skills :P