Today’s large language models (LLMs) aren’t actually trained on that large a set of languages.
For the most part, the vernacular vehicle underlying the generative artificial intelligence (AI) platforms operating right now is the language that dominates the content AI models train themselves on: English.
Jais, a bilingual Arabic-English model named after the highest peak in the United Arab Emirates (UAE), was jointly developed by Inception, an AI-focused subsidiary of the Abu Dhabi-based tech company G42; California-based AI research firm Cerebras; and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
G42 is chaired by the UAE’s national security adviser, Sheikh Tahnoon bin Zayed al-Nahyan.
The 13-billion-parameter model was trained over 21 days on a supercomputer co-developed by G42 and Cerebras, using a purpose-built dataset of 116 billion Arabic tokens and 279 billion English tokens designed to capture the complexity and nuance of Arabic.
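To put the reported dataset in proportion, the short Python sketch below works out each language's share of the training tokens. It uses only the two token counts cited above; the function and variable names are illustrative and are not drawn from the Jais team's own tooling.

```python
# Token counts as reported in the article, in billions.
ARABIC_TOKENS_B = 116
ENGLISH_TOKENS_B = 279


def language_shares(counts):
    """Return each language's fraction of the total token count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


counts = {"Arabic": ARABIC_TOKENS_B, "English": ENGLISH_TOKENS_B}
shares = language_shares(counts)
total_b = sum(counts.values())

# Print each language's count and its percentage of the combined dataset.
for lang, share in shares.items():
    print(f"{lang}: {counts[lang]}B tokens, {share:.1%} of {total_b}B")
```

On those figures, Arabic makes up a little under 30% of the roughly 395 billion tokens, with English accounting for the rest.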
“With this release, we are setting a new standard for AI advancement in the Middle East and ensuring that the Arabic language, with its depth and heritage, finds its voice within the AI landscape,” said Andrew Jackson, CEO of Inception, in a statement announcing Jais.
“Developing such a high-caliber Arabic LLM demanded cutting-edge AI research in addition to an in-depth and nuanced understanding of the Arabic language, its diversity and heritage, and the growing importance of LLMs across all echelons of society,” MBZUAI President Eric Xing added.
Several organizations, including the UAE Ministry of Foreign Affairs, the UAE Ministry of Industry and Advanced Technology, the Department of Health – Abu Dhabi, Abu Dhabi National Oil Company (ADNOC), Etihad Airways, First Abu Dhabi Bank (FAB), and e&, have signed on as launch partners to use the Jais platform.
While the bilingual Jais is positioned for use by the world’s more than 400 million Arabic speakers, it is not the first Arabic-focused LLM platform to launch in the Middle East.
The UAE has already developed a separate open-source LLM called Falcon at the state-owned Technology Innovation Institute in Masdar City; however, according to the Jais whitepaper, Falcon’s Arabic accuracy is weaker than Jais’ own.
That’s because Falcon wasn’t pre-trained on Arabic, while Jais was purpose-built to have a non-U.S.-centric foundation, giving it a more accurate understanding of the Middle East’s cultural and behavioral contexts. Jais can generate content in both Modern Standard Arabic and many of the Middle East’s diverse spoken dialects.
According to the team behind Jais, the LLM can also hold its own against English models of similar size despite being trained on fewer English tokens. The team said this showed that Jais’ English component learned from the Arabic data and vice versa, pointing to new possibilities in LLM development and training.
Many of today’s most advanced LLMs, including OpenAI’s GPT-4, Google’s PaLM, and Meta’s open-source LLaMA, can understand and generate Arabic text. As generative AI technology becomes commercialized globally, the ability to tailor LLMs to distinct cultural needs and preferences could prove to be a competitive differentiator.
As PYMNTS has reported, the telecom sector is already working to develop a multilingual LLM for global telecommunications companies, with language capabilities that include Korean, English, German, Japanese, Arabic, Spanish and more.
Still, one of the main challenges in training LLMs on languages other than English is the ongoing lack of high-quality native-language data online compared with the widespread prevalence of English content. For its part, Jais was trained on Arabic media, content scraped from social media platforms, and Arabic commands paired with English-driven code sequences.
As more nations look to develop their own native AI platforms, differences will begin to emerge in the generative capabilities of the foundational LLMs.
For example, China, whose AI regulations went live this month, already bans LLMs that generate content that “attempts to subvert the state power.” The UAE has placed similar guardrails on Jais, which is pre-trained not to produce content that steps outside of reasonable bounds in terms of the Middle East’s cultural and religious sensibilities or that does not represent the values of the organizations involved in the LLM’s development.
The involvement of the UAE’s national security adviser, Sheikh Tahnoon bin Zayed al-Nahyan, in the development of Jais has also raised concerns about potential misuse of the technology by the region’s autocratic leaders, and the U.S. earlier expanded its restrictions on exports of NVIDIA AI chips to include a number of undisclosed countries in the Middle East.