In highly complex and uncertain environments, algorithms are often more accurate than experts. In healthcare, forecasters tend to rely heavily on judgment, despite the inherent bias and inefficiencies.
When we evaluate the results of price predictions, demand forecasting, or shortage predictions, algorithms regularly beat human specialists.
Suboptimal pharmaceutical forecasting
According to the Lancaster Centre for Forecasting:
- In the pharmaceutical industry, demand forecasting is a relatively new task, which helps to explain why most companies (82.1%) rely on simple methods such as:
- Moving averages for the preprocessing of time series,
- The naïve technique in which the last period’s actuals are used as this period’s forecast, without adjusting them or attempting to establish causal factors.
- The Delphi technique, which relies on a panel of experts.
- Demand Forecasters tend to rely heavily on judgment, despite the inherent bias and inefficiencies associated with it.
- Spreadsheet software is the most commonly used type of forecasting software.
- The average one-month- and three-month-ahead item-level MAPE (mean absolute percentage error) is around 40%.
Weller, M., and S. Crone. (2012) “Supply Chain Forecasting: Best Practices & Benchmarking Study.” Lancaster Centre for Forecasting, 41. https://core.ac.uk/download/pdf/224768371.pdf
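For concreteness, the naïve technique and item-level MAPE from the list above can be sketched in a few lines. The demand series below is made-up example data, not figures from the study:

```python
def naive_forecast(history):
    """One-step-ahead naive forecast: last period's actual becomes this period's forecast."""
    return history[-1]

def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - f) / a for a, f in zip(actuals, forecasts)]
    return 100 * sum(errors) / len(errors)

# Hypothetical monthly demand for one item.
demand = [120, 100, 140, 110, 130, 125]

# Forecast each period using only the previous period's actual.
forecasts = [naive_forecast(demand[:t]) for t in range(1, len(demand))]
print(round(mape(demand[1:], forecasts), 1))  # → 19.0
```

Even on this tame example the naive method misses by roughly 19% on average; on volatile pharmaceutical demand, errors around 40% are unsurprising.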
Early evidence by Paul E. Meehl in 1954. Algorithms obtain better results in clinical predictions than medical professionals
In 1954, Paul E. Meehl found that data combined with simple algorithms produced better results than the opinions of medical professionals. He analyzed 20 studies comparing the reliability of clinical predictions based on the judgments of qualified professionals with predictions derived from statistical data. Although Meehl's study caused a considerable stir among medical professionals, algorithm-based prediction has since advanced in both medicine and other fields.
The problem is noise
Organizations expect uniformity in the choices of their employees, but humans are not reliable. Judgments can vary significantly from one individual to the next, and both internal and external factors, such as mood or the weather, can change one person's decisions from occasion to occasion. This variability in decisions is called noise, and it is costly to companies, which are usually entirely unaware of it.
In their book Noise, Daniel Kahneman, Olivier Sibony, and Cass Sunstein show how noise produces errors in many fields, including healthcare and economic forecasting.
Kahneman collates studies in the area that are conclusive: algorithms beat experts. Predictions made by algorithms, even the simplest ones, are often more accurate than those of experts. The key benefit of algorithms is that they are noise-free: this superior consistency allows even simple algorithms to achieve greater accuracy than human experts.
Figure: a human demand forecaster versus a model predicting the forecaster's decisions. In most cases, the model outperformed the professional it was based on.
Newer evidence. A model imitating a forecaster predicts better than the actual expert
Lewis R. Goldberg is an American psychologist, who is best known for his five-factor model of personality.
Goldberg studied statistical models that describe the judgments of an individual; such a model predicts the decisions that an expert will make.
It is just as easy to build a model of a judge as it is to build a model of reality: the same predictors are used, and only the target variable (the dependent variable, the one to be predicted) changes. When building a model of a judge, the target variable is the decisions the individual will make; when building a model of reality, it is the actual outcome.
For example, in demand forecasting, a model of a forecaster would predict the quantities they would forecast. The model of the actual demand would forecast what the demand will be in three months, for example.
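The distinction can be sketched with synthetic data: the two fits below share the same predictor matrix and differ only in the target variable. All names and numbers here are illustrative assumptions, not drawn from any study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 200 cases, 3 predictors (think recent sales,
# a seasonality index, a promotion flag). All data is made up.
n, k = 200, 3
X = rng.normal(size=(n, k))

# "Reality": actual demand depends on the predictors plus irreducible noise.
true_weights = np.array([3.0, 1.5, 0.5])
actual_demand = X @ true_weights + rng.normal(scale=2.0, size=n)

# "Judge": a forecaster who uses the same cues, but inconsistently (more noise).
judge_forecast = X @ true_weights + rng.normal(scale=4.0, size=n)

# Same predictors, different target variable.
model_of_reality, *_ = np.linalg.lstsq(X, actual_demand, rcond=None)
model_of_judge, *_ = np.linalg.lstsq(X, judge_forecast, rcond=None)
```

The only line that distinguishes the model of the judge from the model of reality is which series is passed as the dependent variable.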
By the way, in statistics, the term “judge” is applied to anyone who delivers a judgement.
A comprehensive review of studies on judgment found that, in 237 studies, the average correlation between the model of the judge and the judge’s judgments was .80.
What happens if we compare the predictions of the judge with those of their model?
The forecasts did not lose accuracy when the model generated them; they improved. In most cases, the model outperformed the professional it was based on. The substitute was better than the original.
But why is the model better than the original?
A mathematical model of our judgments cannot add anything to the information they already contain. All such a model can do is subtract and simplify.
Complex rules only give us the illusion of validity and will, in fact, undermine the quality of our judgments. Some nuances are valid, but many are not.
Most importantly, a model of my judgment will not replicate the noisy pattern in my decisions. It cannot reproduce the positive and negative errors that arise from my unpredictable reactions to a particular case, nor will it fall victim to the influence of the momentary context and my state of mind when I make a single decision.
Most likely, these noisy errors in judgment are not systematically associated with anything, which means that for most cases they can be considered random.
The effect of removing noise from your judgments will always be an enhancement in your predictive accuracy.
Noise reduction mechanically increases the validity of predictive judgment. In short, substituting me with a model of myself does two things: it removes my subtlety, and it removes the noise from my pattern.
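A toy simulation illustrates why removing noise helps, under the simplifying assumption that a judge's prediction is a valid linear pattern plus random occasion noise (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 5000
signal = rng.normal(size=n)                       # the cue pattern the judge actually uses
outcome = signal + rng.normal(scale=1.0, size=n)  # reality, imperfectly related to the cues
judge = signal + rng.normal(scale=1.0, size=n)    # judge = valid pattern + occasion noise
model_of_judge = signal                           # the fitted model keeps the pattern, drops the noise

corr_judge = np.corrcoef(judge, outcome)[0, 1]
corr_model = np.corrcoef(model_of_judge, outcome)[0, 1]
print(corr_model > corr_judge)  # the noise-free model tracks the outcome better
```

The model's only "advantage" is consistency: it applies the judge's own pattern without the random wobble, and that alone raises its correlation with the outcome.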
The strong conclusion that the judge’s model is more valid than the judge communicates an important message: the benefits of subtle rules in human judgment, when they exist, are generally not enough to neutralize the harmful effects of noise.
You may believe that your thinking is more ingenious, more insightful, and more nuanced than a linear representation of it. But the reality is that human experts are noisy.
Why do complex prediction rules damage accuracy, despite our strong feeling that they are based on valid knowledge? For a start, many of the complex rules that people make up are probably not correct in general. But there is another problem: even when complex rules are valid in principle, they are inevitably applied under conditions that are rarely observed.
Figure: a human demand forecaster versus 10,000 randomly weighted linear models, which performed better than the human experts.
Most recent evidence. Even random models outperform experts
Research by Martin Yu and Nathan Kuncel drew on data from an international consulting firm that used experts to evaluate 847 candidates for executive positions.
The experts rated candidates on seven assessment dimensions and used their judgment to assign an overall predictive score to each.
The results were rather unimpressive.
Yu and Kuncel decided to compare the judges not with the best simple model of themselves (like Goldberg did), but with a random linear model. So, they generated 10,000 sets of random weights for the seven predictors and applied the 10,000 random formulas to predict job performance.
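The random-weighting procedure can be sketched as follows, on synthetic data rather than the study's actual candidate ratings (the generating process below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

# 7 assessment dimensions per candidate, 10,000 random linear models.
n_candidates, n_predictors, n_models = 300, 7, 10_000
ratings = rng.normal(size=(n_candidates, n_predictors))

# Made-up "true" job performance: related to the ratings, plus noise.
performance = ratings @ rng.uniform(0.5, 1.5, size=n_predictors) \
    + rng.normal(scale=2.0, size=n_candidates)

# Random positive weights, applied consistently to every candidate.
weights = rng.uniform(size=(n_models, n_predictors))
scores = ratings @ weights.T  # shape: (n_candidates, n_models)

# Validity of each random model = correlation of its scores with performance.
validities = np.array([np.corrcoef(scores[:, j], performance)[0, 1]
                       for j in range(n_models)])
print(validities.mean() > 0)  # random but consistent models retain validity
```

The weights are nonsense, but each model applies its nonsense identically to all 847 (here, 300) candidates; that consistency is what human judges lack.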
Their finding was remarkable. It illustrated that any linear model, when applied consistently to all cases, is likely to outperform human judges in predicting an outcome using the same information.
In one of the three samples, 77% of the 10,000 randomly weighted linear models performed better than the human experts.
In the other two samples, 100% of the random models beat humans.
This research concludes that programmed adherence to a simple rule (Yu and Kuncel call it “nonsensical consistency”) could significantly improve judgment in a complex problem, illustrating the massive effect noise has on the validity of judgment-based predictions.
In predictive judgments, algorithms beat human experts: models of reality, judges’ models, or even randomly generated models.
1. Daniel Kahneman, Thinking, Fast and Slow, Farrar, Straus and Giroux, 2011.
2. Dan Ariely, Predictably Irrational, HarperCollins, 2008.
3. Paul E. Meehl, Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence, Echo Point Books & Media, 2013.
4. Lewis Goldberg, “Man Versus Model of Man: A Rationale, plus Some Evidence, for a Method of Improving on Clinical Inferences,” Psychological Bulletin 73, no. 6 (1970): 422–432.
5. Martin C. Yu and Nathan R. Kuncel, “Pushing the Limits for Judgmental Consistency: Comparing Random Weighting Schemes with Expert Judgments,” Personnel Assessment and Decisions 6, no. 2 (2020), Article 2. Available at: https://scholarworks.bgsu.edu/pad/vol6/iss2/2