Microsoft’s Mishap Highlights Data Security Challenges When Training AI


Artificial intelligence (AI) models are some of the most data-hungry computing platforms in existence. 

The technology may have the potential to transform the world, but generative AI doesn’t work without access to, and ingestion of, enormous amounts of data.

This, as Microsoft’s AI research team accidentally exposed 38 terabytes of private data while publishing open-source AI training data to cloud-based code hosting platform GitHub. 

The exposed data included a disk backup of two Microsoft employees’ workstations containing sensitive personal data, including private keys, passwords to internal Microsoft services, and over 30,000 messages from 359 Microsoft employees. 

Because the shared link was misconfigured to grant “full control” rather than “read-only” permissions, any attacker would have been able not only to view the exposed files but also to manipulate, overwrite or delete them. 
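The gap between the two permission levels can be illustrated with a minimal least-privilege check. This is a simplified sketch: the permission names, policy and helper function below are illustrative assumptions, not Microsoft’s or Azure’s actual configuration model.

```python
# Minimal sketch of a least-privilege check on a data-sharing link.
# Permission names and the policy set are illustrative assumptions,
# not the actual cloud provider's permission model.

ALLOWED_FOR_PUBLIC_DATASETS = {"read", "list"}  # read-only access

def excess_permissions(requested: set[str]) -> set[str]:
    """Return any requested permissions that exceed the read-only policy."""
    return requested - ALLOWED_FOR_PUBLIC_DATASETS

# A "full control" link requests far more than a public dataset needs;
# the excess permissions are exactly what a reviewer should flag.
print(sorted(excess_permissions({"read", "list", "write", "delete"})))
```

In practice, the same principle applies whatever the platform: sharing links for public datasets should be validated against a read-only allow-list before publication, and anything beyond that rejected by default.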

“No customer data was exposed, and no other internal services were put at risk because of this issue. No customer action is required in response to this issue. We are sharing the learnings and best practices below to inform our customers and help them avoid similar incidents in the future,” Microsoft wrote in a statement acknowledging the error. 

While a crisis was averted this time, the case serves as a glaring example of the new risks organizations face as they integrate AI more broadly into their operations. 

As staff engineers increasingly work with massive amounts of specialized and sensitive data to train AI models, firms will need to establish appropriate governance policies and education guardrails to mitigate security risks. 
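One concrete guardrail is a pre-publish scan that flags obvious secrets, such as the private keys and passwords exposed in the Microsoft incident, before a dataset leaves the organization. The sketch below is a deliberately simplified illustration; real secret scanners use far more robust detection than these two patterns.

```python
import re

# Simplified illustrative patterns; a production secret scanner would
# cover many more credential formats and reduce false positives.
SECRET_PATTERNS = {
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
}

def scan_for_secrets(text: str) -> list[str]:
    """Return the names of secret patterns found in the text."""
    return [name for name, pattern in SECRET_PATTERNS.items()
            if pattern.search(text)]

sample = "config:\n  password = hunter2\n"
print(scan_for_secrets(sample))
```

Running a check like this in the publishing pipeline, and blocking the release when it finds matches, turns the governance policy into an enforced step rather than a guideline.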


Training Specialized AI Models 

Specialized AI models need to be trained on specialized data, and as enterprises large and small come to embrace the benefits AI can bring to routine workflows, it is becoming increasingly crucial for IT, data and security teams to understand the inherent exposure risks native to each stage of the AI development process. 

Open data sharing is a key component of AI training, with researchers collecting and sharing massive amounts of external and internal data to build out the required training information for their AI models.

But sharing more data exposes companies to greater risk if that data is shared incorrectly, as happened with Microsoft. 

After all, as PYMNTS has written, AI represents one of the first technologies that can violate nearly all of an organization’s internal corporate policies in one fell swoop. 

At the center of many business concerns around integrating generative AI solutions lie ongoing questions about the integrity of the data and information fed to AI models, as well as the provenance and security of those data inputs. To leverage AI tools effectively and securely, businesses must first ensure they have the appropriate data infrastructure in place to avoid AI’s foundational pitfalls. 

“At a high level, generative AI has the potential to create a new data layer, like when HTTP was created and gave rise to the internet beginning in the 1990s. As with any new data layer or protocol, governance, rules and standards must apply,” Shaunt Sarkissian, founder and CEO of AI-ID, told PYMNTS last month.


Securing the Future of AI 

Since the computer’s development and commercialization, humans have tended to over-trust it. 

That’s why, despite ongoing fears about AI’s apocalyptic potential, organizations need to be more worried about shoddy AI software than any threat of the technology going rogue.

PYMNTS Intelligence has found that many companies are unsure of where they stand in regard to generative AI, but they still feel a pressing need to adopt it.

Sixty-two percent of surveyed executives do not believe their companies have the expertise to employ the technology effectively, according to “Understanding the Future of Generative AI,” a PYMNTS and AI-ID collaboration.

Hyper-rapid growth in the capabilities of today’s computing power and cloud storage infrastructure has transformed the enterprise operating landscape and set the stage for data-fueled innovations like AI to transform business processes. 

But while most AI models currently are being produced by either the world’s most valuable tech companies or its most well-funded startups, computing power is only going to get cheaper. Fast-forward a few years, and AI models may have advanced to the point where everyday consumers can run models as sophisticated as today’s most cutting-edge platforms on their personal devices at home. 

The world is at a tipping point, one where the zettabytes of proprietary data being produced each year need to be addressed, and soon. Otherwise, as the innovations of tomorrow scale up, so too will their risks.