By Leonard Bruce
What is the project?
I created an O’odham Learning Library by using ChatGPT to transform the entries from Bernard Fontana’s O’odham Annotated Bibliography (Fontana, 2009) into a searchable database and add them to my pre-existing repository.
Why did I do it?
I originally started a small repository in 2022 to organize my research library. I had begun a history project and was frustrated by the lack of repositories related specifically to O’odham history. While there are repositories at libraries and other institutions, they are rarely comprehensive and often hard to navigate. I wanted a central database of only O’odham-related texts and resources. So I started building one.
I began by searching online for books and articles related to O’odham history to build my database, and I slowly got to over 100 O’odham-related journal articles, books, and dissertations. Then I found the Fontana Annotated Bibliography (http://tinyurl.com/fontanabib).
I had seen Fontana’s name during my search, and I had many of his articles listed in my original database, but I’d never seen this document before. Fontana (and others) had compiled this document of over a thousand pages with references to thousands of O’odham-related texts. A treasure trove of data!
But - it wasn’t easily searchable, and it was hard to navigate. So I decided to transcribe it into a searchable format (Excel) and enter it into my database (AirTable) so more people could have access to the data and more O’odham intellectuals could build on the amazing research available.
I’m deeply thankful for the amazing work that Fontana & Owen did, and recognize that I’m building on the work of a giant in the field. I hope my humble addition will be helpful to others!
Be sure to check out the full database at www.LeonardBruce.com/projects or at http://tinyurl.com/Oodhamairtable
How should you use this?
Any way you like - but respectfully! Don’t assume everything you find is accurate - find a knowledge-keeper in your community to double-check things.
My vision is for O’odham to use this database to learn about the vast amount of amazing history our ancestors have shared. Some ways I think it is helpful:
Browse
Just scroll through the system and see if anything pops out at you. I’ve been amazed at the number of random stories I’ve found that have led me down the coolest rabbit holes. If you really want to nerd out, find a dissertation, thesis, or journal article for a deep dive into a subject. If you want a more casual read, find a magazine article or read some poetry!
Find a research topic
Maybe you are a student who wants to find past work done with your Community. Look around your field and see if there are any past studies or papers related to what interests you. Take a critical look at their work - maybe even see if you can replicate it. This database will help you find a wide variety of work to add to your research bibliography.
From O’odham fingertips to your eyes
One part that I love - and will be working on more - is signaling when an author is O’odham. Reading, watching, or listening to a work from an O’odham just hits different. The sources feel more authentic and help me feel connected. Go through the Author section and filter by O’odham Author to find media that had an O’odham author involved!
How did I do it?
My first crack at the database was doing keyword searches across different libraries - Arizona State University, public libraries, WorldCat, Anna’s Archive, etc. I used keywords like “O’odham, Papago, Pima, Tohono” and so on. This led to some success, but the work really popped off once I found Fontana’s bibliography.
First, I created an OCR-optimized copy of Fontana’s Annotated Bibliography with Adobe Acrobat. Most of the text converted cleanly, but there were occasional errors, so I used Acrobat’s OCR correction tools to fix them before transcription.
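If you don’t have Acrobat, the same OCR step can be scripted with free tools. Here’s a minimal sketch using the open-source pytesseract and pdf2image libraries - this is an alternative to the tool I actually used, and the file names are placeholders:

```python
# Minimal OCR sketch using open-source tools (an alternative to Adobe Acrobat).
# Requires: pip install pytesseract pdf2image, plus the Tesseract and Poppler
# binaries installed on your system. File names are placeholders.
from pdf2image import convert_from_path
import pytesseract

PDF_PATH = "fontana_bibliography.pdf"  # hypothetical local copy of the PDF

pages = convert_from_path(PDF_PATH, dpi=300)  # render each PDF page as an image
with open("fontana_ocr.txt", "w", encoding="utf-8") as out:
    for i, page in enumerate(pages, start=1):
        text = pytesseract.image_to_string(page)  # OCR one page at a time
        out.write(f"--- page {i} ---\n{text}\n")
```

Either way, the goal is the same: clean, selectable text that can be pasted into the model in manageable chunks.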
I then copied and pasted ~6-10 bibliography entries at a time into the ChatGPT 3.5 model with the following prompt:
“Organize the data in the following columns: full Article Title without quotations, year, resource type, author (First, middle, last), journal, the exact notes associated with the entry between the [], and the pages referenced.
Also include a column that summarizes in three general topics the resource using the following categories or a close similar category: History, Stories & Legends, Water, Music, Art, Education, Employment, Economy, Government, Culture, Health, Environment, Religion, Language, Youth Book, Boarding School, Children”
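I did all of this by hand in the ChatGPT web interface, but the same loop could be scripted against the OpenAI API. Here’s a minimal sketch under that assumption - the chunk size of 8 and the crude blank-line split are placeholders, and the prompt is lightly condensed from the one above:

```python
# Sketch of the same chunked workflow against the OpenAI API instead of the web UI.
# Assumes: pip install openai, an OPENAI_API_KEY set in the environment, and the
# OCR'd text saved to fontana_ocr.txt (see the OCR sketch above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Organize the data in the following columns: full Article Title without "
    "quotations, year, resource type, author (First, middle, last), journal, "
    "the exact notes associated with the entry between the [], and the pages "
    "referenced. Also include a column that summarizes the resource in three "
    "general topics using the following categories or a close similar category: "
    "History, Stories & Legends, Water, Music, Art, Education, Employment, "
    "Economy, Government, Culture, Health, Environment, Religion, Language, "
    "Youth Book, Boarding School, Children"
)

with open("fontana_ocr.txt", encoding="utf-8") as f:
    entries = f.read().split("\n\n")  # crude split: one entry per blank-line block

def transcribe_chunk(chunk):
    """Send a handful of bibliography entries; return the model's table text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": "\n\n".join(chunk)},
        ],
    )
    return response.choices[0].message.content

tables = [transcribe_chunk(entries[i:i + 8]) for i in range(0, len(entries), 8)]
```

A nice side effect of the API route: every call starts fresh, so the long-thread slowdown I describe below never builds up.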
I found that the prompt would keep the information organized in a similar fashion, but after a few batches the AI would start to introduce small variations in the output, and I would have to correct the model by adding prompts like
“This table is missing a column” / “The title is cut off” / “The author name is wrong”
Occasionally the chat thread would become too long and begin to slow down my web browser - likely because the model processes the entire conversation history to help guide its response.
To stop the slowdown, I began starting new chats with the same prompt. I needed to provide correction prompts to each new chat, but I found it easier to “train” a fresh thread than to deal with the lag.
While AI made the process easier, it was still time-consuming. There are systems that can process larger chunks of text, but I also wanted to review each entry to ensure the format and data had higher integrity and fewer AI hallucinations or errors. The model would occasionally hallucinate in the note summaries, but this was rare.
The model would generate an Excel-readable table, and I would copy the table into my AirTable database.
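If you’d rather skip the copy-and-paste step, the table can also be parsed and uploaded programmatically. Here’s a minimal sketch using the pyairtable client, assuming the model returns a pipe-delimited (markdown-style) table; the token, base ID, table name, and sample output are all placeholders:

```python
# Sketch: parse the model's markdown-style table and load it into AirTable.
# Assumes: pip install pyairtable, a personal access token, and a base whose
# fields match the prompt's columns. All names and IDs are placeholders.
from pyairtable import Api

COLUMNS = ["Title", "Year", "Resource Type", "Author",
           "Journal", "Notes", "Pages", "Topics"]

def parse_markdown_table(table_text):
    """Turn '| a | b | ... |' rows into dicts keyed by the expected columns."""
    rows = []
    for line in table_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|") or set(stripped) <= set("|-: "):
            continue  # skip prose lines and the |---|---| separator row
        cells = [c.strip() for c in stripped.strip("|").split("|")]
        if cells[0] != COLUMNS[0]:  # skip the header row itself
            rows.append(dict(zip(COLUMNS, cells)))
    return rows

model_output = """\
| Title | Year | Resource Type | Author | Journal | Notes | Pages | Topics |
|---|---|---|---|---|---|---|---|
| Example Entry | 1950 | Article | Jane Q. Doe | Example Journal | [note] | 45-67 | History |
"""  # placeholder; in practice this is the table text returned by the model

api = Api("YOUR_AIRTABLE_TOKEN")                   # placeholder token
table = api.table("appXXXXXXXXXXXXXX", "Library")  # placeholder base ID / table
table.batch_create(parse_markdown_table(model_output))
```

I would still eyeball each parsed chunk before uploading - the manual review step is where most of the errors get caught.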
Why Use AI?
This was the first major project where I used AI in my workflow - and it was amazing. I’ve done similar text- and data-transcription projects (https://www.leonardbruce.com/projects) in the past. Using AI helped a TON. Being able to paste the data into the model and have it transcribed into an Excel-readable table saved a huge amount of time.
I will be integrating AI into my workflows in the future for text processing - and I included my general workflow here for others to learn from. It wasn’t an easy process to learn and implement, but I can’t wait to try it out in different projects!
Issues using AI
A list of the variations that I had to look out for as I was doing this project:
The categories would often change.
Even with a predefined list of categories in the prompt, the AI would suggest other topic areas. Some of its suggestions were good, so I decided to allow the variation. This is an area that will need to be cleaned at some point.
The table format would often change.
Even when following the same prompt, the AI would often make changes to the table - adding new columns, removing columns, cutting off titles, or summarizing Fontana’s notes. When I found these, I would redo the prompt and the system would generally fix the error. For the notes, I allowed the summarization, but the exact notes should be added at some point.
Pages were often inconsistently numbered.
A common inconsistency was that the AI would change how page numbers were read from an entry. This is likely because books, journals, and magazines cite pages differently: some entries give a page range, while others give the total number of pages in the work. This section should be cleaned at some point to add consistency (a sketch of one possible cleanup pass follows this list).
Topics were inconsistent.
The AI was useful for creating topics and saved a lot of time, but the model kept developing its own tag suggestions and kept them instead of sticking to the prompt’s limited list. The database ended up with too many highly specific or duplicated tags; these will need to be reviewed, combined, and cleaned up at some point to serve as a better finding guide for users (see the sketch below).
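Most of these inconsistencies are mechanical, so a scripted cleanup pass could handle much of the work. Here’s a minimal sketch of what normalizing the page and topic fields might look like - the regex patterns and the tag-merge map are illustrative assumptions, not rules derived from the actual data:

```python
# Sketch of one possible cleanup pass for the Pages and Topics columns.
# The patterns and the tag-merge map are illustrative assumptions.
import re

def normalize_pages(raw):
    """Distinguish a page range ('pp. 45-67') from a total count ('215 pp.')."""
    raw = raw.strip()
    m = re.match(r"(?:pp?\.?\s*)?(\d+)\s*[-–]\s*(\d+)$", raw)
    if m:
        return {"kind": "range", "start": int(m.group(1)), "end": int(m.group(2))}
    m = re.match(r"(\d+)\s*(?:pp?\.?|pages?)?$", raw)
    if m:
        return {"kind": "count", "pages": int(m.group(1))}
    return {"kind": "unknown", "raw": raw}  # flag for manual review

# Map ad-hoc tags the model invented back onto the prompt's fixed category list.
TAG_MERGES = {"Songs": "Music", "Legends": "Stories & Legends",
              "Schooling": "Education"}

def normalize_topics(tags):
    """Collapse duplicate or overly specific tags into the canonical set."""
    return sorted({TAG_MERGES.get(t.strip(), t.strip()) for t in tags})

print(normalize_pages("pp. 45-67"))  # {'kind': 'range', 'start': 45, 'end': 67}
print(normalize_pages("215 pp."))    # {'kind': 'count', 'pages': 215}
print(normalize_topics(["Songs", "Music", "History"]))  # ['History', 'Music']
```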
How long did it take?
Completing this project took about 160 hours of work doing the following:
Testing different methods of transcription
Testing different AI models
Thinking about the implications of using AI for the project
Testing prompts and outputs
OCR/Text Cleaning
Actual transcription
Light Database Cleaning
A large portion of my time went to testing different transcription methods and different tools and AI models. I found that some tools worked better than others and had better outputs. I settled on ChatGPT 3.5 because it was free and accessible to most people who would want to do similar projects.
There were many models from www.HuggingFace.com and similar repositories that worked better overall (higher character limits, more accurate transcription, etc.) - but they may be harder to use due to the hardware and technical knowledge requirements of a self-hosted AI model. Many of the better self-hosted models require higher-end computers and at least a moderate level of comfort with programming.
What are ways to make the database better?
There are a few ways this database could be better:
The Annotated Bib needs to be reviewed and the exact notes should be added.
While doing the project I noticed that Fontana’s notes were being summarized, and there is some loss of detail for some entries. The summarizations work fine, but adding the exact text would be useful for searches or for finding specific sources. This could be done more quickly using a different AI model or the premium ChatGPT 4 model.
The data needs a general cleaning - Book Type, Authors, Pages, Topics
Most of the data output is clear and accurate, but there are some transcription errors and other mistakes that need to be cleaned. Book types are somewhat inconsistent; author names need to be reviewed and corrected, especially when more than one is listed; page numbers need to be corrected and standardized; and the topics should be standardized.
Extra Context Should be Added
While the database includes most of the data points needed to help a user find and retrieve an entry, more context from the Fontana bibliography should be added. For instance, many entries are missing the journal volume and issue an entry comes from. A more specific prompt may help.
Entries that Fontana missed should be added
This Annotated Bibliography is one of the most comprehensive I have seen for O’odham subjects, but it isn’t complete. Finding and adding past materials that Fontana missed would make the database more comprehensive. The field of O’odham literature should also be re-examined so that new sources can be added!
A ranking system should be implemented
I’ve found that this database is so big that there is just too much information. One solution could be to create a type of ranking system to show how “useful” a source is: documents tied directly to member voices, or with specific information in them, would rank highly, while sources with generalized or off-hand notes would rank lower.
A list of recommended entries and focused deep-dives should be developed
Another solution to the vast number of sources in this database could be to create “recommended” lists for certain topics or to create a blog highlighting specific resources. A recommended reading list could point users to a collection of highly useful sources for a specific topic, for instance. Or a blog could focus on one or more entries to explore them in depth.
Documents should be found and linked to entries
I’m not sure what licensing and copyright issues would allow, but each entry should have the source document attached to make it easier to review. Many entries could be found through internet searches using WorldCat, Anna’s Archive, or other digital libraries. For physical copies, partnerships could be developed with archives to digitize records and attach them to the database.
Final Thoughts
Again - I am endlessly thankful for Bernard Fontana and the work he did with his annotated bibliography. His work is truly amazing, and I’m glad to have been able to do my small part to transform it into something a bit more searchable and accessible.
I hope this project makes it easier for other O’odham to access these amazing resources and discover the knowledge our ancestors left for us.