Published on February 5, 2024, 10:09 pm
Building enterprise solutions with Generative AI, specifically Large Language Models (LLMs) like GPT-4, Claude 2, and others, presents significant challenges. These challenges primarily revolve around three critical constraints: cost, latency, and relevance. Overcoming these obstacles is crucial for businesses to fully leverage the potential of this technology.
Cost is a major consideration when developing and deploying LLMs. Building a model from scratch is expensive and time-consuming, so most companies rely on pre-built models accessed through APIs. Closed-source models tend to offer superior performance and ease of use but can be costly to access. On the other hand, open-source models are generally more affordable but require additional engineering capabilities for deployment and maintenance. Another approach to lowering costs is using smaller models, although this may impact relevance.
Latency refers to how quickly an LLM returns results. Providers often impose rate limits on the number of tokens that can be processed per minute because compute resources are scarce. This makes real-time processing challenging for large-scale applications that must handle millions of tokens per minute. Latency can be improved by leveraging private cloud deployments or using smaller models, but these strategies may come at the expense of relevance.
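One practical way to stay within a provider's tokens-per-minute quota is a client-side throttle. The sketch below is a minimal illustration, not any provider's real mechanism, and the 60,000-token limit is a hypothetical default:

```python
import time
from collections import deque

class TokenRateLimiter:
    """Client-side throttle that keeps usage under a tokens-per-minute cap.
    The default limit is a hypothetical example, not a real provider quota."""

    def __init__(self, tokens_per_minute: int = 60_000):
        self.limit = tokens_per_minute
        self.window = deque()  # (timestamp, tokens) records from the last 60 s

    def acquire(self, tokens: int) -> None:
        """Block until `tokens` can be spent without exceeding the cap."""
        while True:
            now = time.monotonic()
            # Drop usage records older than the 60-second window
            while self.window and now - self.window[0][0] > 60:
                self.window.popleft()
            used = sum(t for _, t in self.window)
            if used + tokens <= self.limit:
                self.window.append((now, tokens))
                return
            time.sleep(0.5)  # wait for older requests to age out of the window
```

A caller would invoke `limiter.acquire(estimated_tokens)` before each API request; requests that would exceed the rolling budget simply wait.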
Relevance is critical for user adoption and business impact. Generative AI systems need to produce accurate and contextually appropriate output, yet LLMs often generate responses that require significant post-processing to meet specific criteria. Relevance can be enhanced by injecting more information into the model through prompt engineering, Retrieval-Augmented Generation (RAG), or fine-tuning on additional datasets. Each method has its pros and cons, and each may introduce latency or additional cost.
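The retrieval-and-inject pattern behind RAG can be sketched in a few lines. The word-overlap scoring below is only a stand-in for the embedding-based retrieval a production system would use, and the prompt template is illustrative:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query -- a toy substitute
    for embedding similarity search in a real RAG pipeline."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Inject the retrieved context into the prompt before calling the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key idea is that grounding the model in retrieved, proprietary context improves relevance without retraining, at the price of an extra retrieval step in the request path.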
To navigate these challenges effectively, companies can employ several techniques: parallelizing requests across multiple older, cheaper models; chunking data into smaller pieces for efficient processing; and using model distillation to create smaller specialized models from larger ones. All of these help optimize latency and cost.
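Parallelizing independent LLM calls is the simplest of these levers. A minimal sketch, with a stub standing in for the real API client (the function and model names here are placeholders, not a real SDK):

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    """Stub for a real LLM API call; replace with your provider's SDK."""
    return f"{model}: answer to {prompt!r}"

def fan_out(prompts: list[str], model: str = "small-model") -> list[str]:
    """Issue independent LLM calls in parallel, so total latency approaches
    that of the single slowest call rather than the sum of all calls."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda p: call_model(model, p), prompts))
```

Because API calls are I/O-bound, threads are sufficient here; `pool.map` also preserves the order of results, which keeps downstream aggregation simple.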
BCG designed a hyper-parallelized architecture for a global consumer-facing company building a virtual assistant. This architecture allows the system to make multiple LLM calls in parallel, significantly reducing response time. User input is first classified to determine whether the LLM should provide an automatic answer or apply category-specific business logic. Relevant data is then retrieved from proprietary knowledge bases and external services accessed through APIs, with different LLMs pulling specific data simultaneously to minimize latency. By taking a model-agnostic approach and using the cheapest model that accurately performs each task, the architecture also optimizes cost.
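The classify-then-route step can be sketched as follows. Everything here is hypothetical: the category names, the routing table, and the per-1K-token prices are invented for illustration, and in the real system the classifier would itself be an LLM call rather than keyword matching:

```python
# Hypothetical per-1K-token prices and routing table; real costs and
# categories will differ for your providers and use case.
COST_PER_1K = {"small-model": 0.0005, "mid-model": 0.002, "large-model": 0.03}
ROUTES = {
    "faq": ["small-model"],                       # simple lookups
    "account_action": ["mid-model"],              # needs business logic
    "open_ended": ["mid-model", "large-model"],   # harder generation
}

def classify(user_input: str) -> str:
    """Toy intent classifier; a production system would use an LLM here."""
    text = user_input.lower()
    if "?" in user_input and "how" in text:
        return "faq"
    if "cancel" in text:
        return "account_action"
    return "open_ended"

def pick_model(user_input: str) -> str:
    """Choose the cheapest model permitted for the detected category."""
    candidates = ROUTES[classify(user_input)]
    return min(candidates, key=lambda m: COST_PER_1K[m])
```

The design choice worth noting is the separation between classification and routing: the routing table can be retuned as model prices or capabilities change, without touching the rest of the pipeline.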
Chunking and distillation are additional strategies that enhance efficiency while maintaining performance. Chunking breaks extensive text data into smaller segments that fit within a model's context window and can be processed in parallel, while distillation trains smaller models to reproduce the behavior of larger LLMs on a specific task.
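A basic chunking routine looks like this. This is a word-based approximation; production code would count tokens with the model's own tokenizer, and the chunk size and overlap below are arbitrary example values:

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks. Words stand in for
    tokens here; use the model's tokenizer for accurate sizing."""
    words = text.split()
    step = max_tokens - overlap  # advance less than a full chunk to overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the final chunk has absorbed the tail of the text
    return chunks
```

The overlap between consecutive chunks helps preserve context that would otherwise be cut mid-thought at a chunk boundary.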
It’s important to note that foundation models should be employed strategically. Use cases must directly contribute to enhancing customer service, creating new revenue streams, or addressing specific business needs to justify using generative AI.
Enterprises can effectively harness the potential of generative AI by exploring alternative strategies and optimizing their architecture and workflows. Orchestrating LLMs, human oversight, and various AI tools into an efficient symphony is key to balancing cost and capability. However, it’s crucial to constantly evaluate solutions as technology advances rapidly in this field. By carefully navigating the tradeoffs, businesses can avoid getting lost in the Bermuda Triangle of generative AI.