I had the pleasure of hosting Dr. Venkatasubramanian at the MIT CSAIL HCI Seminar on Tuesday, October 1. What I found most insightful about his talk was his framing of AI governance questions in terms of sensors rather than solutions: rather than fixing specific parameters for an interpretable AI system (e.g., "should my false positive rate be 8% or lower?") that organizations can dodge or otherwise manipulate, the point is to be able to detect the different features of a particular system with a diagnostic toolkit that is flexible and can adapt to different contexts. Below you'll find a loosely live-blogged account of that incredible talk (all mistakes and misinterpretations are mine alone).
Dr. Suresh Venkatasubramanian is Professor of Computer Science and Data Science at Brown University, where he directs the Center for Technological Responsibility, Reimagination, and Redesign (CNTR) within the Data Science Institute. He recently finished a stint in the Biden-Harris administration, where he served as Assistant Director for Science and Justice in the White House Office of Science and Technology Policy. In that capacity, he helped co-author the Blueprint for an AI Bill of Rights.
Moles, Snakes, and Turtles: How we need to think about practical AI governance
In the depths of the pandemic, he says, it was easy to incessantly scroll through the dark timeline of COVID Twitter. He found a salve for this in animal videos, which is why this talk is called "moles, snakes, and turtles." You'll find pictures of these adorable animals scattered throughout the talk, and he uses them as symbols for specific concepts that he says are crucial to thinking about practical AI governance.
During his time in DC, he was always pegged as The Tech Person™ who knew about all things computers and artificial intelligence. A policymaker might come to him and say, "there aren't any easy ways to measure bias and fix it," to which he would respond, "there are actually 10+ years of research on how to do this!" The conversation would then unfold like so:
Policymaker: Okay, if there's all this research, then should my false positive rate be 8%?
Venkatasubramanian: Well, it depends...
P: Can you tell me what measure of fairness to write into this policy doc?
V: Well, it depends...
P: Could you at least tell me what methods vendors should use for bias mitigation?
V: Well, it depends...
He ended up saying "it depends" to a lot of these questions about bias detection and mitigation. But how do we make practical AI governance work? We should aspire to design systems that are explainable and usable for users -- but how do we actually do that? That's the hard question. We need to move away from the normative question -- "should we do something about AI governance?" -- to a practical one: "how do we actually do it?" This suggests that a certain mode of research needs some updates.
What makes it hard to deploy systems with proper governance?
The question of how to build AI governance is really important: there are lots of different answers. "No, shut it all down." "We need guardrails." What makes things actually difficult in this domain is specificity: being able to take in issues of context to understand the dynamics of a system. The Colorado insurance draft rules, for example, use Bayesian Improved First Name and Surname Geocoding (BIFSG) to infer race and ethnicity when that data is not available. He'll talk more about this method in detail later in the talk, but the basic concept is that the rules use those inferred attributes in a logistic regression that models the underwriting outcome. This is a relatively straightforward task, and you can write an R package to do it! But can you be that specific when you're writing insurance underwriting regulation? You'll need something specific -- concrete numbers and methods -- if you're going to tell companies to do something.
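To give a flavor of what "that specific" means in practice, here is a minimal sketch (my own illustration, not the Colorado Division of Insurance's methodology or code) of a BIFSG-plus-logistic-regression disparity test; the column names (`declined`, `p_black`, `p_hispanic`, `credit_score`) are hypothetical placeholders.

```python
# A minimal sketch, not a regulatory implementation: regress a binary
# underwriting outcome on BIFSG-estimated race probabilities plus a control,
# then inspect the race coefficients for evidence of disparity.
import pandas as pd
import statsmodels.api as sm


def underwriting_disparity_test(df: pd.DataFrame):
    """Fit a logistic regression of the underwriting outcome on BIFSG
    race-probability estimates; all column names are hypothetical."""
    X = sm.add_constant(df[["p_black", "p_hispanic", "credit_score"]])
    y = df["declined"]  # 1 = application declined, 0 = approved
    return sm.Logit(y, X).fit(disp=False)


# result = underwriting_disparity_test(applications)
# print(result.summary())  # the race coefficients are what a regulator would look at
```

The point is not that this is the right test; it's that a regulator has to commit to some concrete level of specificity like this before companies can be told what to do.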
At a high level, you also need local context, since knowing when a prediction is high-risk really matters (e.g., in medical applications). One thing that's very tricky to manage is that prediction works differently in different settings, so it's difficult to build governance structures around it.
🐭 On moles: There have been a few reactions to this work. Here's where the mole comes in: we've gone into "whack-a-mole" mode. A problem comes up, and whack! You've done something wrong. Wrong metric? Whack! Critique is a big part of the literature, which people sometimes conflate with criticism. It's a difference that we forget sometimes.
One thing the AI Bill of Rights has is concrete examples of harms that have already shown up -- this was powerful because it showed that the points we were making were not fear-mongering but things that have actually happened. For related work on the topics covered so far, see the following:
- Fairness and Abstraction in Sociotechnical Systems (2019)
- Problems with Shapley-value-based explanations as feature importance measures (2020)
- The Misuse of AUC: What High Impact Risk Assessment Gets Wrong (2023)
🐍 On snakes: Adaptivity is a really important part of this conversation too, as people are always going to snake around the regulation. It's important to consider the "snake" mode -- where people will always try to work around the system -- in order to anticipate problems that come up.
- You still see me: how data protection supports the architecture of ML surveillance (2024)
- Break it till you make it: an exploration of the ramifications of copyright liability (2024)
🐢 On turtles: Instead of snakes, we need to be thinking about turtles: that is, steady, building progress. We often talk about how techno-solutionism is a bad thing, but not doing anything is itself a problem. We need sensors, not solutions: we want to detect when something weird is happening. These sensors can give us a picture of what's going on in a system -- a sensor doesn't automatically tell you that something's wrong; you still have to interpret the pieces of the puzzle. As Hutchinson and Mitchell show in "50 Years of Test (Un)fairness: Lessons for Machine Learning" (2019), today's AI/ML fairness concerns have a lot of parallels with historical work on test fairness. They describe the shift from asking whether a system is fair to asking how a system is unfair. In different local contexts, what a system looks like when things go wrong is very different -- but with a collection of sensors, you can create a panoply of tools to diagnose different problems. There can be a decision flowchart, but even that's not a universal solution -- we need precision.
Standardization is also really important: NIST has taken on a huge role because they're crafting the frameworks for governing AI across a wide variety of settings. In particular domains and cases, they determine: what are the best practices? Did you do the reasonable thing? If so, you're probably not liable; but if people told you beforehand (via standards) that you should have been doing something a certain way, you might be liable for negligence: Deconstructing Design Decisions: Why Courts Must Interrogate Machine Learning and Other Technologies (2024). You can think of this as a provocation, but I will instantiate the concepts we've been talking about with a case study.
Case study: Bayesian Improved Surname Geocoding (BISG)
We need demographic attributes to do bias estimation, but sometimes we don't want to or can't collect sensitive demographic attributes. What can we do about that? We can estimate demographic attributes via proxies. This already happens in the credit and insurance industries -- they might not want to collect this data because they don't want to be responsible for ignoring it. This is what BISG -- originally developed at the RAND Corporation -- tries to do. It takes advantage of census data. The key assumption that makes it work is that, conditional on race, surname and zip code are independent. Then, when you do the appropriate Bayesian calculation, you can estimate race. This may not work well for someone like Kamala Harris (using just her last name, Harris), but that's why we sometimes use BIFSG, which uses first name as well.
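As a rough illustration of that Bayes step (a toy sketch of the idea, not RAND's or the CFPB's implementation, and the numbers below are made up): under the conditional-independence assumption, P(race | surname, zip) is proportional to P(race | surname) × P(zip | race).

```python
# Toy BISG update, assuming surname and ZIP code are independent given race.
# The tables are illustrative placeholders, not real census data.
import numpy as np

RACES = ["white", "black", "hispanic", "api", "other"]


def bisg(p_race_given_surname: np.ndarray, p_zip_given_race: np.ndarray) -> np.ndarray:
    """P(race | surname, zip) ∝ P(race | surname) * P(zip | race)."""
    unnormalized = p_race_given_surname * p_zip_given_race
    return unnormalized / unnormalized.sum()


# Made-up example: a surname split mostly between white and Black Americans,
# in a ZIP code whose residents skew heavily toward one group.
p_race_given_surname = np.array([0.45, 0.45, 0.04, 0.03, 0.03])   # census surname table row
p_zip_given_race = np.array([0.001, 0.010, 0.002, 0.001, 0.001])  # share of each group living in this ZIP
print(dict(zip(RACES, np.round(bisg(p_race_given_surname, p_zip_given_race), 3))))
```

When the surname is uninformative (as with "Harris"), the geography term dominates -- which is exactly why adding first names (BIFSG) can help.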
The Consumer Financial Protection Bureau (CFPB) did something that I love -- they put up a GitHub repo and released the code for BISG in it. At the time, they were suing a car dealership in Chicago for questionable loan rates given to the Latino population. They used BISG to make the claim that there was discrimination despite the fact that the dealership said that they don't collect this data. Is BISG a good thing or a bad thing? I'm not arguing either way!
- Measuring and Mitigating Racial Disparities in Tax Audits (Stanford SIEPR)
That being said, there are a few issues with BISG:
- It underestimates the probabilities of smaller racial groups (e.g., Native groups)
- The independence assumption ignores patterns of segregation
- Misclassification rates are correlated with demographic and socioeconomic factors
But...it is well-calibrated at the population level! This is the importance of context: is there a way that we can update or fine-tune BISG? Can we use the training data labels to help build a better disparity estimator?
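To make the "disparity estimator" idea concrete, here is one standard way proxy probabilities get used -- a sketch of the generic probability-weighted approach, not the specific fine-tuning method he alluded to; the column names are hypothetical.

```python
# Sketch of a probability-weighted disparity estimate: rather than assigning
# each person a hard race label, weight the outcome by BISG probabilities.
# Because BISG is roughly calibrated at the population level, these weighted
# group means can be reasonable even when individual classifications are noisy.
import pandas as pd


def weighted_disparity(df: pd.DataFrame) -> float:
    """Black-white gap in mean outcome, weighted by BISG race probabilities.
    Columns ("outcome", "p_black", "p_white") are hypothetical placeholders."""
    mean_black = (df["outcome"] * df["p_black"]).sum() / df["p_black"].sum()
    mean_white = (df["outcome"] * df["p_white"]).sum() / df["p_white"].sum()
    return mean_black - mean_white
```

The catch, as the issues above suggest, is that this kind of estimate can be biased when BISG's misclassification is correlated with the outcome -- which is exactly where local context and label information could help.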
Coming back to context: we're using more local information that you might not always have access to, and that changes the outcomes.
Discussion / takeaways: Tools like BISG are important because of legal limits on collecting demographic data (cf. ECOA and data protection). Small amounts of context can help with estimation. We should allow for the collection of small amounts of purpose-limited demographic data, or collect demographic information after decisions have been made. Promising avenues of future work include providing a "best in show" framework for proxy estimation (AKA turtles!). Under what conditions and in what contexts should this model be used? BISG isn't necessarily the standard that courts should use; we should have standards (à la NIST) for these governance frameworks.