Published on November 1, 2023, 6:22 am
Efforts are underway to establish a standardized approach for evaluating generative artificial intelligence (AI) products. The goal is to create a common set of benchmarks and a “body of knowledge” on how these tools should be tested, in order to address the risks associated with generative AI applications. This initiative, called Sandbox, is led by Singapore’s Infocomm Media Development Authority (IMDA) and AI Verify Foundation, with support from global market players including Amazon Web Services (AWS), Anthropic, Google, and Microsoft.
The current group of participants consists of 15 organizations, which also includes Deloitte, EY, IBM, OCBC Bank, and Singtel. Sandbox is guided by a draft catalog that categorizes existing benchmarks and evaluation methods used for large language models (LLMs). The catalog compiles commonly used technical testing tools and recommends a baseline set of tests to evaluate generative AI products.
The objective of this initiative is to establish a common language and promote the safe and trustworthy adoption of generative AI. IMDA emphasizes the importance of systematic and robust evaluation in building trust and understanding the capabilities and limitations of models. Through rigorous evaluation, developers can identify areas for improvement and enhancements.
To achieve this common language, a standardized taxonomy and pre-deployment safety evaluations for LLMs are necessary. IMDA hopes that the draft catalog will serve as a starting point for global discussions on safety standards for LLMs. It highlights the need to involve stakeholders beyond model developers, including application developers who build on top of these models and developers of third-party testing tools.
Sandbox aims to demonstrate how different players in the ecosystem can collaborate effectively. Model developers like Anthropic or Google can work alongside app developers such as OCBC or Singtel, alongside third-party testers like Deloitte and EY, on generative AI use cases in sectors like financial services or telecommunications. Regulators such as Singapore’s Personal Data Protection Commission are also encouraged to participate, fostering an environment that promotes transparency and experimentation.
In addition, IMDA expects Sandbox to uncover gaps in the current state of generative AI evaluations, particularly in domain-specific applications such as human resources and cultural-specific areas. To address this, Sandbox will develop benchmarks tailored to specific use cases and countries like Singapore, taking into account cultural and language specificities.
Furthermore, IMDA is collaborating with Anthropic on a Sandbox project that focuses on red teaming. Red teaming involves challenging policies and assumptions used in AI systems through an adversarial approach. Anthropic’s models and research tooling platform will be utilized to develop customized red-teaming methodologies suitable for Singapore’s diverse linguistic and cultural landscape.
In July, the Singapore government launched two sandboxes using Google Cloud’s generative AI toolsets. One sandbox is exclusively for government agencies to develop and test generative AI applications, while the other is available to local organizations at no cost for three months with up to 100 use cases.
The Sandbox initiative led by IMDA and AI Verify Foundation is a significant step towards establishing common standards and promoting the safe adoption of generative AI. By bringing together industry players, regulators, and developers, this collaborative effort aims to create a standardized framework for evaluating generative AI products while addressing potential risks associated with their use.