Published on February 2, 2024, 11:26 am

One of the most hotly debated topics in generative artificial intelligence (AI) circles is the relative value and effectiveness of open-source versus closed-source models. Open-source large language models (LLMs), led by Meta’s Llama 2, are continuously developed by a diverse community of contributors. Closed-source LLMs, such as OpenAI’s GPT-4 and Anthropic’s Claude 2, have meanwhile established themselves in the commercial sector.

A recent study conducted by scientists from Pepperdine University, the University of California, Los Angeles, and the University of California, Riverside compared these programs’ performance in answering questions related to nephrology, the study of the kidneys. The findings were published in NEJM AI, a journal from the New England Journal of Medicine.

The research team found that Llama 2 lagged well behind both GPT-4 and Claude 2 in providing correct answers and quality explanations. GPT-4, in fact, achieved near-human performance on the majority of nephrology topics, scoring 73.3%. Although just shy of the 75% passing grade for humans, this result represents a significant achievement for an AI model.

In contrast, most open-source LLMs received overall scores little better than random guessing. Among the five open-source models tested, Llama 2 edged out peers such as Vicuña and Falcon but still fell short with a score of 30.6%.

The study focused on “zero-shot” tasks, in which language models are tested without any fine-tuning or worked examples for the topic at hand. This approach assesses a model’s ability to handle unfamiliar material in context. The researchers fed each model, including Llama 2, four other open-source programs, and two commercial programs, 858 nephrology questions from NephSAP (the Nephrology Self-Assessment Program), which is widely used by physicians for self-study.
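As an illustration, a zero-shot multiple-choice prompt of the kind described here can be assembled in a few lines of Python. The helper name and the sample question below are hypothetical, not drawn from the study:

```python
def build_zero_shot_prompt(question, choices):
    """Format a multiple-choice question as a zero-shot prompt:
    no worked examples, just the question and its labeled options."""
    lines = [question]
    for label, text in zip("ABCDE", choices):
        lines.append(f"({label}) {text}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)

# Hypothetical sample question, for illustration only.
prompt = build_zero_shot_prompt(
    "Which electrolyte abnormality is most associated with loop diuretics?",
    ["Hyperkalemia", "Hypokalemia", "Hypercalcemia"],
)
```

The key property of the zero-shot setting is what the prompt omits: no solved example questions are shown before the one being tested.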

Converting the plain-text NephSAP files into prompts suitable for the language models required significant data preparation. Each prompt included a natural-language question and a set of multiple-choice answers. Additionally, because GPT-4, Llama 2, and the others often generated lengthy text output, the researchers developed techniques to automatically parse the models’ replies and compare them against the correct answers.
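A minimal sketch of that parsing step might look like the following. The function name and the regular expression are assumptions for illustration, not the researchers’ actual code:

```python
import re

def extract_choice(model_output):
    """Find the first standalone answer letter (A-E) in a verbose reply.
    A simplified stand-in for the study's automated answer comparison."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None
```

Real model output is messier than a single clean letter, so the study’s pipeline presumably handled many more formats; this captures the basic idea of reducing free-form text to a gradable choice.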

The authors suspect that one reason for the open-source models’ poor performance relative to GPT-4 is the proprietary medical data incorporated into Anthropic’s and OpenAI’s training processes. These organizations have access to curated, peer-reviewed, nonpublic materials such as textbooks, published articles, and specialized datasets. Access to such high-quality, nonpublic medical training data remains a key factor that may influence future LLM performance improvements.

While GPT-4 achieved scores close to those of humans, there is still considerable room for improvement across all language models in this domain. Efforts are underway to level the playing field on training data, most notably through federated training. This method allows language models to train privately on local data while contributing aggregated results to collective efforts hosted on public clouds. The MLCommons industry consortium’s MedPerf initiative is a premier example of harnessing confidential medical data sources to enhance open-source foundation models.

In some cases, commercial models may also contribute specific medical competencies by being distilled into open-source programs. For instance, Google DeepMind’s MedPaLM is an LLM tailored to answer questions sourced from various medical datasets, including those about consumer health inquiries found on the internet.

Moreover, even without explicit medical training, language model output can be improved using “retrieval-augmented generation.” In this approach, the model retrieves relevant material from an external source at generation time and conditions its answer on that material rather than relying solely on its trained weights.
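A toy sketch of retrieval-augmented generation follows, using naive word overlap in place of a real vector search; all names and the sample passages are illustrative assumptions:

```python
def retrieve(query, passages, k=1):
    """Rank passages by naive word overlap with the query; a production
    system would use embedding similarity over a document index."""
    words = set(query.lower().split())
    return sorted(passages,
                  key=lambda p: len(words & set(p.lower().split())),
                  reverse=True)[:k]

def augmented_prompt(question, passages):
    """Prepend the retrieved context so the model can ground its answer
    in external evidence instead of memorized training data."""
    context = "\n".join(retrieve(question, passages))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The retrieval step is what supplies the “external input during output formation”: the model sees the most relevant reference text inside its prompt before generating an answer.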

Regardless of whether open-source or closed-source approaches prevail, the open nature of projects like Llama 2 allows multiple contributors to improve the programs continuously. In contrast, commercial models like GPT-4 and Claude 2 remain under the complete control of their corporate owners.

