How Can Information Architecture Help Address AI Risk?

There’s no shortage of stories about “AI gone wrong” in the news, on social media, and here on LinkedIn. The AI Incident Database, for example, lists nearly 4,000 of them, and that collection is almost certainly incomplete. Whatever the exact count, these incidents damage businesses and do real harm to people.

In this post, I will explore a few recent and interesting academic and business publications that attempt to understand and address AI risk. It’s a broad, deep, highly technical, and rapidly developing area, and this is far from a complete review. However, it’s a topic that’s important to all of us. I think we, as information architects and information professionals in general, can contribute to addressing these issues. Writing this post helped solidify my own understanding, and I hope it will be beneficial to others as well.

[Image: A screenshot of search engine results for AI risks, including transparency, ethical dilemmas, dependence on AI, unclear legal regulation, misinformation, bias, security risks, job displacement, AI race, and other unintended consequences.]

AI Risk is a Significant, Growing, and Incompletely Understood Problem

A recent project published by the interdisciplinary MIT FutureTech group, The AI Risk Repository, identifies and categorizes AI risks by cause and by domain, based on a survey of published AI risk frameworks. The result is two taxonomies, which the team used to classify over 700 specific identified AI risks. Many AI risks, such as inappropriate or even dangerous use of AI, are ultimately the result of choices made by businesses and other organizations. Other risks arise from how AI models are developed and trained. Training data is a major determinant of AI performance, and—even with the best of intentions—the make-up and quality of this data is a risk factor. An example of this dependency comes from the AI Risk Repository’s overview of the Discrimination risk domain (AI Risk Repository, page 33):

Decisions made during the development of an algorithmic system and the content, quality, and diversity of the training data [emphasis added] can significantly impact which people and experiences the system can effectively understand, represent, and accommodate.

The AI Risk Repository documents a number of AI risks that are effectively new versions of an old problem (garbage in, garbage out), and I’m far from the first to make this observation. Regulators have taken notice as well: there is a growing body of regulation around AI, with two important recent examples being the EU AI Act and US President Joe Biden’s executive order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, along with supporting guidance from US regulatory agencies.

An important theme of these regulations is AI transparency, and a fundamental requirement of transparency is data provenance: understanding what data goes into an AI training set and where it came from, and ensuring that this is documented and tracked in a standardized and reproducible way. This is something that IAs and librarians have been doing for decades. The underlying data is a critical component of the AI ecosystem, and contributing to its preparation and management is something that information architects are well equipped to do.

As I noted above, the specific content of AI training data directly impacts AI performance, in both positive and negative ways. Knowing and managing the provenance of AI training data can contribute to improved outcomes and reduced risk in several ways.

  • Awareness and control of when and how copyrighted material, and intellectual property in general, is used by AI models.
  • Better understanding of model inputs, which both improves AI performance and reduces the risk of accidental inclusion of personally identifiable information, private or restricted data, incorrect or low-quality data, and other inappropriate data in training datasets.
  • Transparency to identify and possibly address biases that are embedded in training data.

Any information or data professional who has worked in a regulated industry will be familiar with the processes for establishing and tracking data provenance. In a nutshell, it means annotating data with standardized and controlled metadata as part of a holistic management strategy. Again, metadata development and management, which draws on tools and techniques such as controlled value lists, synonym mapping, and taxonomies, is very much within the scope of typical information architecture work. We may be accustomed to calling it “content” rather than “data,” and to focusing on use cases like discovery, but the principles are the same.
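To make this concrete, here is a minimal Python sketch of what controlled-vocabulary validation of provenance metadata might look like. The field names and vocabulary values are hypothetical placeholders, not a standard; the point is simply that provenance annotations can be checked against controlled value lists rather than entered as free text.

```python
"""Minimal sketch: validating dataset provenance metadata against
controlled vocabularies. All field names and vocabulary values are
hypothetical, for illustration only."""

from dataclasses import dataclass

# Hypothetical controlled value lists an IA team might maintain.
LICENSE_TERMS = {"public-domain", "cc-by-4.0", "proprietary", "unknown"}
CONSENT_TERMS = {"explicit-opt-in", "implied", "none-recorded"}

@dataclass
class ProvenanceRecord:
    dataset_id: str
    source: str      # where the data came from (URL, vendor, archive)
    license: str     # must come from LICENSE_TERMS
    consent: str     # must come from CONSENT_TERMS
    collected: str   # ISO 8601 date of collection
    notes: str = ""

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        if self.license not in LICENSE_TERMS:
            problems.append(f"license '{self.license}' not in controlled list")
        if self.consent not in CONSENT_TERMS:
            problems.append(f"consent '{self.consent}' not in controlled list")
        if not self.source:
            problems.append("source is required for provenance tracking")
        return problems

record = ProvenanceRecord(
    dataset_id="forum-posts-2023",
    source="https://example.org/archive",
    license="cc-by-4.0",
    consent="none-recorded",
    collected="2023-11-01",
)
print(record.validate())  # [] means the record conforms to the controlled vocabularies
```

A richer version of this could layer in synonym mapping (so that “CC BY 4.0” and “cc-by-4.0” resolve to the same controlled term) and taxonomy lookups, but the basic pattern of annotate-then-validate stays the same.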

A suggested set of metadata elements “to facilitate authenticity, consent, and informed use of AI data” gives an idea of how this might be approached. Many of these elements lend themselves to controlled vocabularies and could, in fact, be sourced from existing standardized vocabularies published by government agencies, standards organizations, and similar authoritative sources. Initiatives such as FAIR, which was originally intended for the management and sharing of scientific data, or even the general-purpose Dublin Core, which will be familiar to most information architects, are some additional tools that could be repurposed to manage data provenance for AI development.
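As a small illustration, here is a sketch of how an AI training set might be described with a handful of real Dublin Core terms. The dataset and its values are invented for the example; only the dcterms element names come from the Dublin Core vocabulary.

```python
"""Minimal sketch: describing an AI training dataset with a few
Dublin Core terms. The dataset and values are hypothetical; only the
dcterms element names come from the Dublin Core vocabulary."""

# A plain dictionary keyed by Dublin Core terms (https://purl.org/dc/terms/).
training_set_description = {
    "dcterms:title": "Customer support transcripts, 2022-2023",
    "dcterms:creator": "Example Corp data engineering team",
    "dcterms:source": "Internal ticketing system export",
    "dcterms:license": "https://example.org/licenses/internal-use-only",
    "dcterms:provenance": "De-identified 2024-01-15; names and account numbers removed",
    "dcterms:audience": "Fine-tuning of support chatbot models",
}

# Such a record could be stored alongside the dataset and checked during
# an approval workflow before any model training begins.
for element, value in training_set_description.items():
    print(f"{element}: {value}")
```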

An example of how this might look in practice is “Datasheets for Datasets,” which describes a research project from the University of Michigan iSchool in which prospective AI training sets were accompanied by a file of contextual information about the dataset, intended to inform AI project stakeholders and raise awareness of potential ethical issues. A key point is that this intervention happens before an AI model is trained; proactively identifying and preventing a problem is nearly always more effective than fixing it after the fact. The study was small, but its results were promising in terms of raising developers’ awareness of ethical issues. It shows that simply making an effort to educate stakeholders and surface ethical concerns, even in a low-tech way, has a positive impact. It’s easy to imagine an approach that combines transparency metadata with a plain-language summary like this as part of an approval workflow that makes ethics and risk considerations part of AI development.
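As a rough sketch of what such an approval-workflow check might look like, the following Python snippet refuses to mark a dataset ready for training until a datasheet-style set of questions has been answered. The specific questions and function name are hypothetical, loosely inspired by the kinds of prompts a datasheet might contain.

```python
"""Minimal sketch of an approval-workflow check: refuse to start training
unless a datasheet answering basic ethics and provenance questions
accompanies the dataset. The question keys are hypothetical examples."""

REQUIRED_QUESTIONS = [
    "Who collected the data, and for what original purpose?",
    "Does the data contain personally identifiable information?",
    "Which populations are represented, and which are likely missing?",
    "Are there licensing or consent restrictions on reuse?",
]

def ready_for_training(datasheet: dict[str, str]) -> bool:
    """Return True only if every required question has a non-empty answer."""
    missing = [q for q in REQUIRED_QUESTIONS if not datasheet.get(q, "").strip()]
    for question in missing:
        print(f"Datasheet incomplete: {question}")
    return not missing

datasheet = {
    "Who collected the data, and for what original purpose?":
        "Example survey vendor, market research",
    "Does the data contain personally identifiable information?": "",
}
if ready_for_training(datasheet):
    print("Approved: proceed to model training.")
else:
    print("Blocked: complete the datasheet before training.")
```

The value here is not the code itself but the workflow: the plain-language answers double as the kind of summary that helps stakeholders notice ethical and provenance problems before training starts.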

References

Most of the references in the text of this article are to project websites, write-ups in popular sources such as Medium, and the like, as I thought this would be of greatest interest to most readers. However, for completeness, I’ve included references to academic publications below.

Boyd, Karen L. (2021). “Datasheets for Datasets Help ML Engineers Notice and Understand Ethical Issues in Training Data.” Proc. ACM Hum.-Comput. Interact. Volume 5, Issue CSCW2, Article 438 (October 2021). https://doi.org/10.1145/3479582

Longpre, Shayne, Robert Mahari, Naana Obeng-Marnu, William Brannon, Tobin South, Jad Kabbara, and Sandy Pentland. (2024). “Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them?” An MIT Exploration of Generative AI, March. https://doi.org/10.21428/e4baedd9.a650f77d

Redman, Thomas C. (2024, August 12). “Ensure High-Quality Data Powers Your AI.” Harvard Business Review. https://hbr.org/2024/08/ensure-high-quality-data-powers-your-ai

Slattery, P., Saeri, A. K., Grundy, E. A. C., Graham, J., Noetel, M., Uuk, R., Dao, J., Pour, S., Casper, S., & Thompson, N. (2024). A systematic evidence review and common frame of reference for the risks from artificial intelligence. ArXiv. https://doi.org/10.48550/arXiv.2408.12622

John Tulinsky
Information Architect