More Data Isn’t Always Better for AI Decisions

For decades, the prevailing view in artificial intelligence (AI) and analytics has been “more data is better.” Larger datasets are often associated with improved model accuracy and more robust performance in unpredictable scenarios. This assumption has driven enterprises to invest heavily in data acquisition and in the computing power required to process ever-expanding volumes of information.

Rethinking Data

MIT researchers asked a different question: What is the minimum amount of data required to guarantee an optimal decision? Their work focuses on structured decision-making problems under uncertainty, where outcomes depend on unknown parameters such as costs, demand or risk factors. Instead of treating data as something to be maximized, the researchers treat it as something that can be mathematically bounded.

The framework characterizes how uncertainty shapes the decision space. Each possible configuration of unknown parameters corresponds to a region where a particular decision is optimal. A dataset is considered sufficient if it provides enough information to determine which region contains the true parameters. If the dataset cannot rule out a region that would lead to a different optimal decision, more data is required. If it can, additional data adds no decision-making value.
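To make the region idea concrete, here is a minimal sketch built around an invented one-dimensional decision problem (it is not taken from the MIT paper): the payoffs of two hypothetical actions depend on a single unknown parameter, the parameter line splits into two optimality regions, and a dataset that only bounds the parameter to an interval is sufficient exactly when that interval sits inside one region.

```python
# Toy illustration of "optimality regions" (invented example, not the MIT
# formulation). Two hypothetical actions have payoffs that depend on a single
# unknown parameter theta; because the payoff gap is monotone in theta, the
# parameter line splits into two regions separated by one boundary.

def optimal_action(theta: float) -> str:
    """Best action if theta were known exactly."""
    # Assumed payoffs: action A pays 10 - theta, action B pays 2 * theta,
    # so A is optimal for theta < 10/3 and B for theta > 10/3.
    return "A" if 10 - theta > 2 * theta else "B"

def dataset_is_sufficient(theta_low: float, theta_high: float) -> bool:
    """A dataset that only pins theta to [theta_low, theta_high] is sufficient
    when the whole interval lies in one optimality region. With a single
    boundary, checking the endpoints is enough."""
    return optimal_action(theta_low) == optimal_action(theta_high)

print(dataset_is_sufficient(0.0, 3.0))  # True: A is optimal on all of [0, 3]
print(dataset_is_sufficient(3.0, 4.0))  # False: the interval straddles 10/3
```

In the second call, narrowing the interval further would change the decision, so more data is needed; in the first, any additional observation about theta would leave the choice untouched.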

The researchers developed an algorithm that systematically tests whether any unseen scenario could overturn the current optimal decision. If such a scenario exists, the algorithm identifies exactly what additional data point would resolve that uncertainty. If not, it certifies that the existing dataset is sufficient. A second algorithm then computes the optimal decision using only that minimal dataset.
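The certify-or-query loop described above can be sketched schematically. The following toy version is illustrative only, not the researchers’ actual algorithms: the factor names, actions and payoff table are all invented for the example. It checks whether every scenario consistent with the data collected so far leads to the same optimal action; if not, it names one more factor to observe, and if so, it certifies sufficiency and returns the decision.

```python
# Schematic sketch of a certify-or-query loop (illustrative assumptions only;
# this does not reproduce the MIT algorithms).
from itertools import product

ACTIONS = ["approve", "decline"]
# Unknown parameters: each of two hypothetical risk factors is "low" or "high".
FACTORS = ["rate_risk", "default_risk"]

def payoff(action: str, scenario: dict) -> int:
    # Assumed payoff table: approving is profitable unless both risks are high.
    if action == "approve":
        return -5 if all(v == "high" for v in scenario.values()) else 3
    return 0  # declining is always neutral

def best_action(scenario: dict) -> str:
    return max(ACTIONS, key=lambda a: payoff(a, scenario))

def certify_or_query(known: dict):
    """Return ('sufficient', decision) if every scenario consistent with the
    known factor values yields the same optimal action; otherwise return
    ('query', factor), naming one unknown factor worth observing next."""
    unknown = [f for f in FACTORS if f not in known]
    consistent = [dict(known, **dict(zip(unknown, vals)))
                  for vals in product(["low", "high"], repeat=len(unknown))]
    decisions = {best_action(s) for s in consistent}
    if len(decisions) == 1:
        return "sufficient", decisions.pop()
    return "query", unknown[0]  # collect one more data point and repeat

print(certify_or_query({}))                    # ('query', 'rate_risk')
print(certify_or_query({"rate_risk": "low"}))  # ('sufficient', 'approve')
print(certify_or_query({"rate_risk": "high"})) # ('query', 'default_risk')
```

The middle call shows the key point: once the rate factor is known to be low, the default factor no longer matters, so collecting it would add cost without changing the decision.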

Implications for AI and Banks

The implications of this research are particularly striking for banks and financial institutions that rely on large historical datasets for credit modeling, fraud detection, liquidity management and portfolio optimization. In many cases, firms continue to collect and process vast amounts of data in pursuit of marginal accuracy gains, even when those gains do not materially change decisions.

This research also aligns with the growing interest in small and specialized models designed for specific tasks rather than general-purpose intelligence. Smaller models trained on sufficient datasets are easier to audit and less costly. PYMNTS has reported on this parallel shift underway inside financial services, where institutions are reassessing whether ever-larger models and datasets actually translate into better outcomes.

For financial institutions facing regulatory scrutiny, the ability to demonstrate that a decision is optimal based on a clearly defined and minimal dataset can improve transparency and governance.

The work also reframes the economics of data. Data is costly to collect, store, secure and govern. Reducing data requirements without sacrificing decision quality can lower infrastructure spending, shorten model development cycles, and reduce exposure to data-privacy and retention risks.

That tension between data abundance and decision quality has already surfaced in financial crime and real-time risk systems. PYMNTS has reported that, as banks move toward real-time fraud detection and payments monitoring, excessive or poorly curated data can slow systems and increase false positives. In those environments, relevance and precision increasingly matter more than volume.

Efficient Decision Systems

The researchers emphasize that their framework does not argue against data altogether, but against unnecessary data. The goal is not to approximate decisions with less information, but to identify the precise information needed to guarantee the best possible choice.

Experts say the work introduces a new way of thinking about data efficiency in AI. Rather than treating model performance as a function of scale alone, it ties performance directly to decision structure and uncertainty. If successful, the approach could influence how AI systems are designed across sectors where data collection is expensive or constrained, including finance, energy, healthcare and supply chains.