Why Big Data Isn’t Necessarily Smart Data

Analytics has become a religion of sorts among the savvy and sophisticated in tech and finance — with a central belief that any information is accessible if we just figure out an algorithm to crunch data about it.

The particularly devoted have even gone so far as to declare both modeling and the scientific method dead — made obsolete by faster computers processing petabytes of information. Why try to model for a right answer when one can just ask the data itself?

“This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology,” wrote Wired’s then-Editor-in-Chief Chris Anderson. “Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”

And while that may seem a bit hyperbolic, the greatest hits of the magic of analytics are well-known: The Oakland A’s managed to beat back better-financed teams through the pure power of their data; Nate Silver routinely buries the field when it comes to calling elections, because he is just that much better at the meta-analysis of polling data.

It is a tempting narrative.

It is also grossly misleading, according to an emerging group of experts.  

“Data and data sets are not objective; they are creations of human design,” noted Kate Crawford, a principal researcher at Microsoft Research and a visiting professor at the MIT Center for Civic Media in Cambridge, in a recent Harvard Business Review article.

And data can be wrong, Crawford notes, or at least used incorrectly to answer questions, because “data fundamentalism” leads to a sort of blind faith in algorithms without enough critical examination of how they actually work, what they do and why they can fail.

Well, of course it does. There is an abundance of data, but the trick is finding enough of the right data to turn big data into intelligent insights that drive smart decisions.

Which brings us to alternative lending, where many firms define their essential value propositions in terms of proprietary algorithms that throw big data at questions of creditworthiness — and establish lending guidelines from the answers they get. “Big data does it better” is the rallying cry of alt lenders everywhere.

But does it?  

When Algorithms Fall Down

Remember when Google predicted the flu better than the CDC by analyzing petabytes of search data through its “Flu Trends” tracker?

No, you don’t. That was a trick question.

You certainly remember reading headlines about it in 2013, but further analysis uncovered something: Google was wrong, and over several years it had mostly overestimated flu cases, missing especially badly in 2013. Google data, when combined with CDC modeling, could be very useful, but by itself it conflated people who were merely worried about the flu (due to media reports) with people who were actually experiencing flu symptoms. Earlier this year, Google shuttered Flu Trends, which was labeled an “epic failure” (yes, also by Wired, the same magazine that a few years earlier was sure that with enough data no model could be of any use).

And while the Google miss has gotten a lot of press for how well it highlights one of the big problems with big data, it is far from the only such case, Crawford notes, though it does indicate the core problem: a data-harvesting algorithm can only do as well as the assumptions of the people who programmed it allow.

The Google algorithm wasn’t wrong in the sense that it did the wrong thing. It did exactly what it was supposed to do — figure out how many people were Googling flu symptoms — but it linked a bad assumption to the data it was crunching: that anyone searching for flu symptoms was experiencing them, not merely worried about experiencing them in the future.
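To make the failure mode concrete, here is a minimal sketch of that bad assumption at work. Everything in it is invented for illustration (the numbers, the “worried searcher” term) and it is far simpler than anything Google actually ran; it just shows how counting every symptom search as a case automatically inflates the estimate.

```python
# A toy version of the bad assumption behind Flu Trends, not Google's
# actual model. All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
weeks = 52

# A made-up seasonal curve of true flu cases per week.
true_cases = 10_000 + 8_000 * np.sin(np.linspace(0, 2 * np.pi, weeks)) ** 2

# The baked-in assumption: every searcher is sick, so one search = one case.
searches_from_sick = true_cases

# Reality: media coverage tracks the outbreak and prompts healthy but
# worried people to run the very same searches.
searches_from_worried = 1.5 * true_cases * rng.uniform(0.8, 1.2, size=weeks)

observed_searches = searches_from_sick + searches_from_worried
naive_estimate = observed_searches  # "ask the data": searches == cases

print(f"mean true cases:       {true_cases.mean():,.0f}")
print(f"mean naive estimate:   {naive_estimate.mean():,.0f}")
print(f"average overestimate:  {naive_estimate.mean() / true_cases.mean():.1f}x")
```

The code does exactly what it was told to do; the error lives entirely in the assumption that searches and cases are the same thing.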

But other examples abound, Crawford notes. Take, for example, a famous study that aggregated data from Twitter and Foursquare during Hurricane Sandy to draw conclusions about how people handle a natural emergency. The authors even managed some unexpected observations, like nightlife picking up the day after the storm, which they ascribed to cabin fever. And while the observations were interesting, they might have been a little off.

“The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster,” Crawford noted.

Except Manhattan wasn’t the hub of the disaster. The people who were in the hub were more likely to be dealing with power outages (and drained batteries) and less interested in tweeting or going out, because they were awfully busy being battered by a natural disaster and then cleaning up afterward.

“In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a ‘signal problem,’” Crawford noted.

Such signal problems are common when it comes to using big data, often because the folks who design the algorithms tend to build in “universal assumptions” about how people behave (e.g., “everyone has a working smartphone” or “everyone experiencing X is tweeting about it”) that are far less universal than they think, and that thus skew the data in ways that are hard to see, as the sketch below illustrates.
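The zones, populations and rates below are all invented, but they capture the mechanics: if the hardest-hit area is also the one most likely to lose power, raw tweet counts will point at the wrong hub.

```python
# A made-up illustration of Crawford's "signal problem": the disaster
# suppresses the very signal used to measure the disaster.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    population: int
    damage: float          # 0 = untouched, 1 = devastated (invented scale)
    power_on_share: float  # fraction of residents with working power/phones

zones = [
    Zone("Manhattan", 1_600_000, damage=0.3, power_on_share=0.9),
    Zone("Coastal NJ", 300_000, damage=0.9, power_on_share=0.2),
]

TWEET_RATE = 0.05  # assumed share of *connected* residents tweeting about the storm

for z in zones:
    # The observable signal: only people with power and charged phones tweet.
    tweets = z.population * z.power_on_share * TWEET_RATE
    print(f"{z.name:10s}  damage={z.damage:.1f}  storm tweets ~ {tweets:,.0f}")

# Manhattan dominates the tweet count (~72,000 vs. ~3,000) even though
# the coast took three times the damage: the aggregate inverts reality.
```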

Even more problematic, those built-in assumptions are rarely obvious: the experts themselves are often blind to them, and non-expert observers have even less chance of spotting them.

Algorithms Don’t Work The Way That Most People Think They Do

University of Utah computer scientist Suresh Venkatasubramanian noted in a recent blog post that a lot of the confusion about algorithms, particularly the kind that power big data analysis, comes down to a misunderstanding of what exactly they do.

“An algorithm is like a recipe,” Venkatasubramanian wrote. “It takes ‘inputs’ (the ingredients), performs a set of simple and (hopefully) well-defined steps, and then terminates after producing an ‘output’ (the meal).”

And for simple algorithms, this is a fine illustration, Venkatasubramanian asserted.

“[It is] dead wrong, at least when trying to understand the bewildering universe of algorithms that collectively define machine learning, or deep learning, or Big AI: all the algorithms that are constantly in the news nowadays.”

In the case of big data AI, he noted, the algorithms are less like recipes and more like methods for generating a recipe. In his example, he imagines a chef who has lost the specific instructions for a dish. Instead of just looking up a new recipe, the chef makes the dish from memory, serves it to a group and takes notes on their feedback. The chef then keeps serving the group the dish over and over, noting their input and modifying the meal, until the group agrees that an appropriate recipe has been found.

“Eventually, I have a pretty decent sambar recipe. For some reason, I have to twirl around three times while holding the split peas and water before putting it on the stove, and the salt has to be ladled out using a plastic spoon, but the taste is great, so who cares!”

“And that’s how a learning algorithm works. It isn’t a recipe. It’s a procedure for constructing a recipe. It’s a game of roulette on a 50 dimensional wheel that lands on a particular spot (a recipe) based completely on how it was trained, what examples it saw, and how long it took to search. In each case, the ball lands on an acceptable answer, but these answers are wildly different, and they often make very little sense to the person executing them.”
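A short sketch of that idea in code may help (the “tasters,” the hill-climbing search and all the numbers are our own illustration, not Venkatasubramanian’s). Note that the program below contains no recipe; it contains a procedure for searching for one, guided entirely by feedback.

```python
# A minimal "recipe-constructing" learning loop: random hill-climbing on
# a parameter vector (the recipe), scored by taster feedback (the loss).
import random

random.seed(42)

IDEAL = [2.0, 0.5, 1.5]  # the tasters' hidden ideal amounts of 3 ingredients

def taster_feedback(recipe):
    """Lower is better: how far this attempt is from what the group wants."""
    return sum((a - b) ** 2 for a, b in zip(recipe, IDEAL))

def learn_recipe(rounds=500, step=0.25):
    recipe = [random.uniform(0, 3) for _ in range(3)]  # first try, from memory
    score = taster_feedback(recipe)
    for _ in range(rounds):
        candidate = recipe[:]                       # tweak one ingredient...
        i = random.randrange(len(candidate))
        candidate[i] += random.uniform(-step, step)
        new_score = taster_feedback(candidate)      # ...and re-serve the dish
        if new_score < score:
            recipe, score = candidate, new_score    # keep changes the group liked
    return recipe

print("learned recipe:", [round(x, 2) for x in learn_recipe()])
# With 3 ingredients and honest tasters, the search lands near the ideal.
# With thousands of dimensions and noisy feedback, different runs land on
# wildly different but equally "acceptable" recipes, which is his point.
```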

What the programmers behind the algorithms are doing is trying to get the computer to look at an impossibly large field of data and “think” about which correlated data forms an important or “actionable” pattern. But the people who are programming computers to “think” are still human – which means they are likely to build some of the problems of human cognition into the programs they create.

“We’re trying to design algorithms that mimic what humans can do. In the process, we’re designing algorithms that have the same blind spots, unique experiences, and inscrutable behaviors that we do. We can’t just ‘look at the code’ any more than we can unravel our own ‘code.’”

Why Alt Lending Enthusiasts Should Pay Particular Attention To All Of This

A change in the conventional narrative about the power of big data is, of course, interesting to any player in the payments and commerce ecosystem. In the last year alone, we have interviewed countless innovators in payments, eCommerce, mobile commerce, retail marketing, backend logistics, digital security, medicine — name a vertical (or subvertical) and it is almost a sure thing that we can point you to the team (or dozens of teams) working hard to make it run better with the power of machine learning and data.

But alt lending is hit particularly hard, in two unique ways. The first is its central value proposition, which for a variety of lenders hinges acutely on big data’s ability to evaluate creditworthiness much, much faster than traditional banking’s paperwork-intensive process.

The old model, the argument goes, is too slow for consumers and SMBs and too myopically focused on too few variables. The new model can leverage all that cloud computing power to generate a score within minutes that has factored in thousands, or even hundreds of thousands, of variables correlated to creditworthiness.

And while the new model is certainly faster, it is hard to evaluate whether it is “better,” because one would have to know what those thousands or hundreds of thousands of variables actually are. Both Venkatasubramanian and Crawford note that many of those variables might be correlated but ultimately useless — or correlated in ways that are highly misleading, because they skew the sample set.
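A toy model shows how that can happen. Everything below is synthetic: the “device_age” feature is a hypothetical stand-in for the digital-footprint variables alt lenders advertise, and no actual lender’s model is being reproduced. The point is that a variable which merely shadows a real driver of repayment in the training sample will still pick up a hefty weight in the score, and will mislead wherever that correlation breaks down.

```python
# Synthetic demonstration: a credit-scoring model assigns real weight to a
# feature that is correlated with repayment in the sample but causally useless.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

income = rng.normal(50, 15, n)  # annual income ($000s); genuinely predictive
# Hypothetical "digital footprint" variable that merely tracks income here.
device_age = -0.5 * income + rng.normal(0, 2, n)

# Ground truth: repayment depends on income only.
p_repay = 1 / (1 + np.exp(-(income - 50) / 10))
repaid = (rng.random(n) < p_repay).astype(float)

# Standardize both features and fit a logistic regression by gradient descent.
X = np.column_stack([income, device_age])
X = (X - X.mean(axis=0)) / X.std(axis=0)
w, b = np.zeros(2), 0.0
for _ in range(2_000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - repaid)) / n
    b -= 0.5 * (p - repaid).mean()

print("weights (income, device_age):", np.round(w, 2))
# device_age earns a substantial weight purely by proxying income in this
# sample; score a population where that link is broken and it misleads.
```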

But that alone, both data experts note, is not a reason to give up on big data and go back to the actuarial tables and scouting reports of old. Big data is great, Venkatasubramanian and Crawford agree, and decision engines can greatly expand the range of evidence brought to bear on high-level choices. But decision engines are useful tools, they both argue, not good decision makers. A decision engine (because it was programmed by a person) can have unhelpful assumptions built into its “thinking,” but unlike a human decision maker, it isn’t self-aware and cannot correct for flaws in how it “reasons.” Simply “asking the data” isn’t a good way to leverage big data analytics; when it comes to making choices on the basis of what the data says, both experts agree the smart money is on finding a human being.

And that is a problem for alt lenders, whose big value is largely located in the immediacy of letting the algorithm make the decision. Alt lenders could create a human intercession layer between the algorithm’s output and the underwriting decision, but a slow human reviewing a data set is precisely what much of alternative lending is built to avoid.

Alternative lenders can point to a successful track record with relatively low default rates so far — and risk mitigated by rather steep interest rates in most cases. But given the billions invested in various alternative lenders in 2015 alone, and given that even data experts are increasingly prone to question “pure data” results, it seems at least worth asking whether it is in alternative lending’s long-term best interest to loudly trumpet how much it lets the algorithms make the central decisions.