Bringing All of BigData.Gov Into the 21st Century

There’s a huge enterprise that is constantly sweeping up vast amounts of data on the details of buying and selling, and of getting and spending in the U.S. No, this isn’t another complaint about Google. The giant enterprise in question is the U.S. government, BigData.Gov.

BigData.Gov is constantly collecting massive amounts of data from businesses and consumers throughout the economy. It’s data that no Silicon Valley firm could ever get. It’s data that could be converted into information that could guide entrepreneurs, help businesses make decisions, and power applications that could aid consumers in so many parts of their lives.

And in some cases, BigData.Gov helps out. We all pay attention to data on GDP, employment and unemployment, and inflation. Economists know that these quantities are hard to measure well, and those who have looked closely have generally given the federal statistical agencies at least passing marks for how they measure these and other key quantities, recognizing that these agencies are chronically under-funded by Congress.

As economists who’ve been using lots of government data for a long time, we’ve always been delighted that the federal government had systematic, professionally managed programs for collecting data and publishing important information. These programs have been critical to research that has advanced our understanding of the economy. Like most of our colleagues, we’ve assumed that all government statistics were as carefully defined and measured as GDP and other estimates that many people watch closely.

A recent experience has shaken our confidence in that belief. We now know that for whatever reason, some of BigData.Gov’s published statistics simply don’t measure what any sensible person thinks they should measure. They are not useful for business decision-making and could be dangerously misleading, even though they are based on mountains of carefully collected data.

Last summer we just wanted data on how the online revolution was affecting brick and mortar retail for our book Matchmakers: The New Economics of Multisided Platforms. Our field research showed that some traditional physical stores were rapidly blending the best of physical, web, and mobile to serve their customers, while others were simply being hammered by the growth of e-commerce. We thought getting data would be pretty easy. The Census Bureau has reports galore on e-commerce, broken down by industry. And it should. It surveys businesses every year and asks them about e-commerce sales. Then it compiles the data into a report. It’s been doing this since the early 1990s.

Our first disappointment was that the most recent data for retail industries like Electronics and Appliance Stores was two years old. For studying anything related to online these days, 2013 is ancient history. The most plausible defense for this is that budgets are tight, but surely it is more important to devote resources to measuring in a timely fashion the parts of the economy that are changing rapidly than to keeping current on sectors that move more glacially.

Worse, though, when we looked at the published statistics, we scratched our heads. The Census tables showed that General Merchandise Stores—that’s where they code Walmart—only had $88 million of online sales in 2013. That couldn’t be right, since walmart.com was a multi-billion dollar operation that year.

So we started digging. We pored over their surveys and documentation. We e-mailed the Census official in charge of the retail trade data. Then after we published our results in an online working paper to help other users of those data and to describe how we tried to use other sources to improve the Census estimates, the Census public information office contacted us. Not to offer congratulations or thanks. They demanded that we retract our paper, and we learned that the Census official with whom we had been emailing had been directed not to communicate with us directly any more. The public information office retracted, in writing and in a long telephone conversation, what the Census official had told us in multiple emails. They said the Bureau followed a process that wasn’t really the process described in the documentation that we had been able to find online.

And, oh, by the way, the data on online sales they report for General Merchandise Stores—well that’s really not online sales for General Merchandise Stores. We were told that their procedures called for online sales of general merchandise stores like Walmart to be included, along with the sales of food trucks and the like, under Non-Store Retailers. You may be as surprised as we were to learn that Walmart is a non-store retailer. You might be even more surprised, as we were, to learn that online sales of manufacturers like Apple and Nike are also supposed to be included under Non-Store Retailers. Think about a food truck selling iPads out of one window.

What, then, is the $88 million reported as online sales of General Merchandise Stores in 2013? As nearly as we can figure out, that’s the total online sales of small general merchandise stores that didn’t separately report that they also operated a non-store retail operation, as they should have done. That is, it is junk data containing no valid information. If you took it at face value, you would be surprised that giant firms like Walmart seem to think that e-commerce is worth their attention.

We, of course, retracted our paper, since it had been based on the assumption that the Census e-commerce statistics by industry were biased only because, as the Census official told us, some large firms didn’t report those sales, not because, as we subsequently learned, the industry-specific numbers were essentially meaningless. Census public information did not retract the Census official’s statement that e-commerce estimates were understated.

We think we understand what they do about e-commerce, though it doesn’t make a lot of sense. But it is hard to be sure. No one audits the work, and the Census people themselves seem to have different views on what was done. Because the rise of e-commerce is such an important development for so many businesses, and reliable statistics are thus so important to businesses and researchers alike, we’ve called for an independent review of the Census’s retail sales data.

This rather painful episode—which was about getting sensible numbers for one paragraph in a 262-page book—woke us from our slumber when it comes to federal data. We have always had great respect for the professionalism of the federal statistical agencies, and we recognize their chronic budget problems, but BigData.Gov does suck up massive amounts of time from businesses and consumers collecting data. Relative to that effort, the costs of turning all those data into information useful to businesses and researchers and providing it in a timely fashion must be tiny. The agencies clearly do exactly this in some cases, but we have learned that they don’t do it in all cases. And the public has no idea which published, official data are as meaningless and potentially misleading as the $88 million figure that first caught our eyes.

Go back to the question we raised. Brick and mortar retailers are now rapidly trying to adapt to a world in which people want to be able to shop where they want when they want—sometimes in store, sometimes online, sometimes using their mobile device. A massive part of the economy is undergoing stress as some retailers are being cut down by a buzz saw, while others are making a lot of progress transforming how they do business. Reliable, current information on what’s going on would be valuable and could be available but is not. So entrepreneurs, investors, and analysts don’t have much to go on. The federal government has, or could get, the right data and could tabulate it sensibly and report it quickly without adding to the third digit of the deficit.

Given today’s technology, there is no reasonable explanation for why the government needs two years to process data and report results on retail sales. If any data geek in private business said they needed two months, never mind two years, to go from survey data in hand to final results, they’d be tossed into the unemployment line.

Sadly, the only way to learn which statistics are as reliable as GDP and which are as useless as reported e-commerce sales by General Merchandise Stores may be to give federal data collection and reporting a top to bottom overhaul, perhaps along the following lines:

  1. Government statistical agencies need oversight from the research and business communities for everything from what they ask to what they report. That’s why we recommended an independent review of the Census retail data.
  2. Government statistical agencies need to move quickly to get data out in weeks or months and not years.
  3. Government statistical agencies need to design coding and classification systems to deliver useful information. Why can’t we be told how much general merchandise stores and other retail industries sell online and how much they sell offline? How about firms that are mainly manufacturers? Census already has the necessary data, after all.
  4. Government statistical agencies need to make more raw data available to researchers in ways that maintain confidentiality. Census and other agencies already do this in some areas; they could do it across the board. If we had access to the raw data on retailing, we could unscramble some of the eggs that the current classification system has so thoroughly scrambled.

Now, you might think that BigData.Gov has beaten us to it if you stumbled across data.gov. The banner at the top of the website says “The home of the U.S. Government’s open data. Here you will find data, tools and resources to conduct research, develop web and mobile applications, design data visualizations, and more.”

Wow. This site should be getting massive traffic given that it is doing all that. According to compete.com it gets a trivial 120k unique visitors a month. That won’t surprise you if you troll the site. Mainly it provides links to the ancient data subject to all the defects we mention above.

It is time to break BigData.Gov’s stranglehold over all the data that we, as citizens and businesses, hand over to them. Let’s take a small step first: support our call for an independent review of how the Census Bureau collects, analyzes, and reports retail trade data.

This piece is a longer version of our Harvard Business Review blog post.