There have been plenty of opinion pieces about whether data really is "the new oil," and whether that's a good thing or a bad thing. Regardless of how we judge the comparison, one question that these debates might cause us to ask is, "if data is the new oil, then what is it fuelling?" One obvious answer is "a surveillance capitalist nightmare," but a less provocative (although no more accurate) answer might be "machine learning."
Whenever we hear that AI systems require enormous amounts of data, or things to that effect, we are speaking about one approach to AI, namely machine learning (ML), which relies on access to large amounts of training data so that the algorithms can 'learn' rules for themselves. The recent surge in popularity of AI that began in 2012 is, in fact, a surge in the popularity of ML. Other, older approaches to AI, such as expert systems, don't require training data because they are laboriously 'hand programmed' by domain experts.
What makes ML unique here is that the system has to be fed with data so that it can be 'trained' to make certain distinctions or categorisations. A typical challenge for an ML system can be seen in the image below: which pictures show cats, and which show croissants?
Other questions that we might want our ML system to answer could be:
- Does this image contain a face?
- Which movie would a person like to watch next?
- How can we autocomplete the sentence currently being typed out by a user?
To be able to answer any of these questions, an ML system needs to be trained on large datasets, which are generally manually labelled by humans. For the cat-croissant problem, what we typically need is a large dataset of pictures labelled 'cat' and pictures labelled 'croissant' for the system to learn from.
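To make the idea of 'learning from labelled examples' concrete, here is a deliberately tiny sketch. Real image classifiers use millions of images and deep neural networks; in this toy version, each 'image' is just a hand-made two-number feature vector, and the 'model' is a nearest-centroid rule. The features and labels are invented for illustration.

```python
# Toy illustration of supervised learning from labelled examples.
# Each "image" is a 2-D feature vector (imagine: pointiness-of-ears,
# flakiness-of-texture); the model memorises the average vector per label.

def train(examples):
    """Compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest to the new example."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Labelled training data: (features, label) pairs.
training_data = [
    ([0.9, 0.2], "cat"), ([0.8, 0.3], "cat"),
    ([0.2, 0.9], "croissant"), ([0.3, 0.8], "croissant"),
]
model = train(training_data)
print(predict(model, [0.85, 0.25]))  # → cat
```

The point of the sketch is the shape of the workflow, not the algorithm: labelled data goes in, a decision rule comes out, and the quality of that rule depends entirely on the quantity and quality of the labels.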
Indeed, if you’ve ever wasted five minutes of your life clicking on “squares that contain traffic lights” to get through a CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart”) on a website, then you have done some of this labelling yourself.
Given that ML systems require huge amounts of data to train, we often hear pleas to remove restrictions on data collection and usage so that ML innovation can reach its full potential. We also hear the common refrain that regulation will kill innovation in AI (see: We can't regulate AI for more on this) and have the consequence that countries with strong data regulation will ‘lose the AI race’ to other countries, such as China, where AI developers apparently have access to huge amounts of sensitive data.
But just how much of an impact can such unrestricted access to data have? Can ML systems solve any problem with enough data, or are there hard limits that this approach inevitably runs up against?
If we come back to the cat-croissant example, 'solving the problem' of distinguishing the two can be accomplished by feeding a sufficient amount of relevant data into a suitable machine learning model. If we had an enormous dataset with pictures of ginger cats and croissants from all conceivable angles, in theory our ML model should become an absolute ace at distinguishing one from the other.
Of course, such a system would be basically useless, as even 100% accuracy here would have very little real world application and at best would just be doing as well as ordinary people, who don’t tend to regularly mistake cats for croissants or vice-versa. It might also run into problems if faced with a black cat and one of those weird vegan-charcoal croissants.
There are, however, many tasks where highly accurate machine learning systems could make a real difference to our lives. On the one hand, we have tasks at which humans tend to be quite bad, such as trawling through huge amounts of text or video footage to measure the occurrence of certain words or objects. On the other hand, we have tasks that humans might be quite good at, but which, for one reason or another, we do not want to do. This could be because they are highly repetitive and boring, or downright horrifying.
As an example of a repetitive task, we could take language translation. While there are many people who are expert translators, and while the majority of us possess the capability to learn a foreign language, we all benefit from automatic translation tools, which can provide relatively good translations instantly. Although such tools will never produce award-winning translations of literature, they nevertheless do a good job of translating menus for us on holiday and other such routine tasks.
Machine learning is also useful for automating tasks that people tend to find unpleasant or horrifying. A notable example made headlines in early 2020 when a team at Stanford trained a machine learning system to recognise people’s ‘analprint’ so that it could monitor their toilet usage to keep an eye on their health. Obviously most people would find it rather unpleasant to have to learn to recognise patients by their analprint and verify their identity in this way each time they used the toilet. However, as multiple people pointed out in response, there are thankfully many less invasive methods of verifying identity, such as fingerprints.
We can clearly see that there are many ways in which machine learning techniques can utilize large datasets to help human beings with certain tasks. This does not mean, however, that if we just get enough data, we will be able to train machine learning systems to solve any problem whatsoever, nor that the technique of machine learning itself is suited to all problems. In a piece entitled How to recognize AI snake oil, Stanford professor Arvind Narayanan demonstrates the limitations of ML by proposing that we make a distinction between three types of problems that ML is being used to solve.
Firstly, there are perception problems. What's important here is that there is some ground truth against which to measure accuracy (accuracy will never reach 100%, of course, but it can come close). For example, in transcribing speech to text, we can say with something close to certainty whether the transcription is correct.
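What 'measuring against a ground truth' means in practice can be made concrete with the standard metric for speech-to-text: word error rate (WER), the word-level edit distance between the system's transcript and a human reference, divided by the reference length. A minimal sketch:

```python
# Word error rate (WER): the standard way to score a transcript
# against a ground-truth reference, via word-level edit distance.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("a" for "the") across six reference words: WER ≈ 0.167
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))
```

Because both the reference and the scoring rule are unambiguous, progress on perception problems can be tracked objectively, and more data genuinely drives the number down.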
Similarly, in facial recognition tasks where a system has to identify whether two photographs are of the same person (this is called face verification, or 1:1 matching), we can say for sure whether the system has made a correct prediction. As Narayanan says, for this type of problem, "given enough data and compute, AI will learn the patterns that distinguish one face from another."
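Under the hood, 1:1 matching typically reduces to a simple comparison: a neural network maps each photo to an embedding vector, and verification just checks whether the two embeddings are close enough. The vectors and the 0.8 threshold below are invented for illustration, standing in for what a real trained network would produce:

```python
import math

# 1:1 matching sketch: in a real system a neural network turns each
# photo into an embedding vector; verification is then a similarity
# check. These vectors and the threshold are illustrative assumptions.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_person(embedding_a, embedding_b, threshold=0.8):
    return cosine_similarity(embedding_a, embedding_b) >= threshold

photo_1 = [0.9, 0.1, 0.4]    # hypothetical embedding of photo A
photo_2 = [0.85, 0.15, 0.42]  # another photo of the same face
photo_3 = [0.1, 0.9, 0.2]     # a different person

print(same_person(photo_1, photo_2))  # → True
print(same_person(photo_1, photo_3))  # → False
```

Note that the threshold trades false matches against false rejections, which is exactly why accuracy on this task can be measured, tuned, and improved with more data.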
Here we have seen real progress in recent years, and it’s with this type of problem that the idea that “AI can solve any problem with enough data” holds at least some water. Computing power and data quantity are not sufficient, of course, as data quality is a key factor, along with all the complexities of designing and fine tuning the algorithms themselves.
Nevertheless, there can be serious ethical issues even with such perception problems, as there are applications of machine learning which are problematic in and of themselves, regardless of how accurate they may be. For example, even in a situation in which facial recognition technology had achieved extremely high accuracy, it would remain an ethically and legally fraught technology. Perfecting accuracy is not the same as making a system ethical or legal.
Things become a lot more murky in the next class of problems that Narayanan calls problems of automating judgment. What we are doing here is trying to get an ML system to learn how we make certain judgments by feeding it a sufficient number of examples.
Take spam detection, for instance. If we train an ML system on a dataset containing hundreds of thousands of emails, some marked as 'spam,' others as 'not spam,' the idea is that the algorithm will learn to make the same distinctions we made. In the case of spam detection, the accuracy can arguably reach quite a high level. This is largely because there are usually no serious disagreements about what constitutes spam. Things are far less clear in other cases, however, such as hate speech detection, where definitions are highly contentious and require human nuance.
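The classic way to 'learn our judgments from examples' for spam is a Naive Bayes classifier: count how often each word appears in spam versus legitimate mail, then score new messages against those counts. The sketch below is a minimal toy version with a four-email dataset invented for illustration; real filters use far more data and engineering:

```python
import math
from collections import Counter

# Toy Naive Bayes spam filter: learns word frequencies from labelled
# emails, then scores new mail with add-one (Laplace) smoothing.

def train(labelled_emails):
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in labelled_emails:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(model, text):
    counts, totals = model
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        # log prior + log likelihood of each word under this label
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

emails = [
    ("win free money now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
spam_model = train(emails)
print(classify(spam_model, "free money prize"))   # → spam
print(classify(spam_model, "agenda for lunch"))   # → ham
```

This works for spam precisely because the labels in the training data are largely uncontroversial; feed the same machinery labels that humans deeply disagree about, and the model can only reproduce one side of the disagreement.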
With these problems of automating judgement, it seems that the more contentious the criterion, the lower the possible accuracy of the system. In cases where there are polarised or mutually exclusive definitions of a phenomenon, there will be no way for an ML system to satisfy both, and so the system will be fundamentally flawed from one or other perspective.
We cannot, for example, train an ML system to judge good literature. This is fundamentally impossible because we cannot provide it with a data set of good and bad literature that everyone would agree with. At best, we can train it to recognise what literature certain types of people would find good, but this is not the same thing.
The final type of problem Narayanan refers to as predicting social outcomes. The issue here is that we are dealing with systems with serious social consequences and fundamentally contentious concepts. Most importantly, these systems are trying to predict the future, and this is the key difference from the problem of automating judgement.
As noted above, training an ML system to identify ‘good literature’ is not a solvable problem, because the criteria of judgement cannot be unambiguously defined. At the same time, even such an intractable problem as literary criticism is fundamentally only dealing with the past: the books have already been written, we just want the system to classify them according to criteria.
Predicting social outcomes, however, combines the problem of contentious criteria with the problem of making predictions about future events for which we have incomplete information. As examples of this type of problem, Narayanan lists predicting criminal recidivism, predicting terrorist risk, predictive policing, predicting job performance, and predicting at-risk children for social intervention.
All of these problems involve predicting the future, something Narayanan argues we have no good reason to believe ML can do reliably in such serious use cases; yet, as he observes, "we seem to have decided to suspend common sense when AI is involved."
A good example of the ineffectiveness of predicting social outcomes is the recent Fragile Families Challenge, which collected an enormous amount of data about so-called 'fragile families' and held a competition to see if researchers could predict six 'life outcomes' for children, parents, and households. Researchers were given nearly 13,000 data points on over 4,000 families.
Much to the surprise of everyone involved, none of the entries achieved any kind of reasonable accuracy. The most cutting-edge machine learning approaches with access to almost 13,000 data points barely performed better than a hundred-year-old technique using 4 data points (and none of them performed well at all). Similarly, another study by Julia Dressel and Hany Farid showed that the notorious criminal recidivism prediction system, COMPAS, was “no more accurate or fair than predictions made by people with little or no criminal justice expertise.”
They also demonstrated that “despite the impressive collection of 137 features, it would appear that a linear classifier based on only 2 features—age and total number of previous convictions—is all that is required to yield the same prediction accuracy as COMPAS.” In both cases, we see that fancy algorithms and huge data sets made no difference to accuracy and predictive power.
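To see just how simple the 'two-feature' model Dressel and Farid describe really is, here is a sketch of that kind of classifier: a logistic model over age and number of prior convictions. The weights below are made up for illustration (they are not the study's fitted coefficients); the point is that a model this small matched COMPAS's accuracy:

```python
import math

# Sketch of a two-feature linear (logistic) classifier of the kind
# Dressel and Farid found matched COMPAS's accuracy. The weights and
# bias are illustrative assumptions, not the study's fitted values.

def predict_recidivism(age, priors, w_age=-0.05, w_priors=0.25, bias=0.5):
    """Return a risk score in (0, 1) from a weighted sum of two features."""
    z = w_age * age + w_priors * priors + bias
    return 1 / (1 + math.exp(-z))   # logistic (sigmoid) function

# Under any such model, the score is just a monotone function of two
# numbers: younger defendants with more priors score higher.
print(predict_recidivism(age=22, priors=5))
print(predict_recidivism(age=55, priors=0))
```

That an opaque commercial system with 137 features is indistinguishable in accuracy from two multiplications and an addition is itself the argument: the sophistication is a veneer, not predictive power.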
More importantly, the entire problem which such systems are trying to solve is framed in a way that can only lead to harmful outcomes, because the idea that complex social outcomes can be predicted from past data is highly problematic, especially in cases where those predictions have serious consequences for people.
In addition to being no better than rudimentary methods in these cases, ML systems introduce a host of additional risks. Narayanan lists a number of these, such as:
- Hunger for personal data
- Massive transfer of power from domain experts & workers to unaccountable tech companies
- Lack of explainability
- Distraction from interventions (i.e. we focus on tweaking algorithms instead of broader social solutions)
- Addition of a veneer of accuracy/objectivity
There has been an alarming number of instances of 'AI' being used to make predictions and judgments for which the technology is entirely unsuited, and which in some cases shouldn't be made at all, even by humans.
To list just a couple of recent examples: a paper published in July 2020 claimed to use the body mass index (BMI) of politicians as a predictor of political corruption; it emerged that the US Department of Defense had invested $1,000,000 in developing an AI system that could "predict an enemy's emotions"; and, most troublingly of all, a paper was published entitled "A Deep Neural Network Model to Predict Criminality Using Image Processing," which claimed to be able to predict 'criminality' by analysing images of people's faces.
Regarding the latter paper, over 1,000 AI experts signed a letter condemning the research and outlining the "ways crime prediction technology reproduces, naturalizes and amplifies discriminatory outcomes, and why exclusively technical criteria are insufficient for evaluating their risks." As they further noted, "there is no way to develop a system that can predict or identify “criminality” that is not racially biased — because the category of “criminality” itself is racially biased."
The paper in question was ultimately withdrawn, but the example clearly demonstrates that the problem of such pseudoscience is current, acute, and potentially deadly for marginalised groups.
Criticism has been raised against many of these scientifically dubious applications of machine learning, and we have seen increasing pushback against technosolutionist attitudes that seek to turn social issues into technical problems to be solved by ML systems.
Although we can't provide a fully comprehensive analysis of why ML systems fail at different tasks, we've hopefully managed to highlight some of the prominent ways in which these systems fail, and to dispel the idea that 'AI can solve any problem.' To help you delve further into these issues, we've put together a bibliography to provide some extra resources to explore the problems discussed here in more detail.
Bibliography & Resources
Here we've assembled some resources should you want to dig further into the problems discussed above.
The Economist - The world's most valuable resource is no longer oil, but data
- Proposes the idea of GDP as a measure: gross data product
- Interesting mailing list for critical AI studies
- Excellent bibliography
- What are the unintended consequences of designing systems at scale based on existing patterns in society?
- When and how should AI systems prioritize individuals over society, and vice versa?
- When is introducing an AI system the right answer—and when is it not?
- Assessing if AI is the right solution for your users’ needs
- A buyer's checklist for AI in health and care
First, we've got some general resources on the problem, and then some specific resources on AI and racist pseudoscience, emotion detection, and gender detection:
- MIT Tech Review article on this study: AI can’t predict how a child’s life will turn out even with a ton of data
And here are two excellent lists of books and other resources on race and technology:
AI Now's 2019 Annual Report, which calls out the shaky scientific foundations of emotion detection
Explainers on emotion AI from ethicalintelligence.com: