Published on November 20, 2023, 4:39 pm

Integrating artificial intelligence (AI) into the daily workflow of employees has the potential to increase productivity in various tasks, from writing memos to developing software and creating marketing campaigns. However, concerns about the risks of sharing data with third-party AI services are valid. The recent case of Samsung employees exposing proprietary company information by uploading it to ChatGPT serves as a reminder of the potential dangers.

These concerns are reminiscent of the early days of cloud computing when users were worried about data security and ownership. Nowadays, managers confidently use mature cloud computing services that comply with regulatory and business requirements regarding data security and privacy. However, generative AI services, like OpenAI’s ChatGPT, are still in their early stages and face challenges in terms of maturity and data ownership.

Large language models (LLMs), such as ChatGPT, have been trained on a massive corpus of written content accessed from the internet without considering data ownership. This has led to legal repercussions, with bestselling authors taking legal action against OpenAI for using their copyrighted works without permission. Traditional media outlets have also engaged in licensing discussions with AI developers to protect their data. Unfortunately, negotiations between OpenAI and The New York Times broke down recently.

For companies experimenting with generative AI, one immediate concern is how to safely explore new use cases for LLMs using internal data. Uploading corporate data to commercial LLM services poses risks because that data may be retained and used for training. Managers must find ways to protect proprietary data assets while improving data stewardship in corporate AI development practices to earn and maintain customer trust.

One possible solution is building generative AI solutions locally instead of relying on third-party services. However, this approach presents practical challenges since companies might not have the resources or expertise necessary for such endeavors. Thankfully, a growing open-source AI movement offers a potentially viable alternative similar to the excitement surrounding Linux in the 1990s.

Open-source models like Bloom, Vicuna, and Stable Diffusion provide foundational models that can be fine-tuned for specific tasks. Research into highly optimized training routines has shown that these models can be fine-tuned on commodity hardware, leading to a flourishing ecosystem of models approaching the performance of ChatGPT. However, technical challenges still exist, and capitalizing on the rapid development of these emerging open-source tools requires new investments in people and processes.
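The fine-tuning on commodity hardware described above typically relies on parameter-efficient methods such as LoRA, which freeze the large pretrained weight matrices and train only small low-rank updates alongside them. The following is a minimal sketch of that idea in plain numpy, an illustration of the technique rather than any particular library's API; class and parameter names here are hypothetical:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted layer: frozen weight W plus a
    trainable low-rank update (B @ A), scaled by alpha / rank."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight                         # frozen pretrained weight, shape (out, in)
        out_f, in_f = weight.shape
        self.A = rng.normal(0, 0.01, (rank, in_f))   # trainable down-projection
        self.B = np.zeros((out_f, rank))             # trainable up-projection, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # Because B starts at zero, the adapted layer initially
        # reproduces the frozen base layer exactly.
        return x @ self.weight.T + (x @ self.A.T @ self.B.T) * self.scale

    def trainable_params(self):
        return self.A.size + self.B.size             # rank * (in + out), not out * in
```

The point of the design is the parameter count: for a 4,096 x 4,096 layer, full fine-tuning updates ~16.8M weights, while a rank-8 adapter trains only ~65K, which is why consumer GPUs suffice.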

While locally controlled AI solutions keep proprietary data in-house, managers must take several actions to ensure safe and responsible use:

1. Navigate model and data licenses: Open-source models come with various licenses that dictate their usage. Organizations must carefully consider these licenses to ensure compliance and address potential legal restrictions.
2. Prevent data leakage: Companies need to prevent unintentional data leakage through open-ended user interfaces like chatbots. Privacy concerns arise when LLMs reveal private or proprietary information.
3. Adapt to changing data: On-premise models must be regularly updated with the latest data while maintaining stability and consistency for users.
4. Mitigate systemic biases: AI systems tend to perpetuate social and economic inequalities present in training data. Continuous monitoring is necessary to ensure equitable treatment.
5. Build trust with customers: Transparency regarding how personal data is used is crucial for customer trust. Companies should communicate intentions to use customer data for AI training, allowing individuals to opt-in whenever possible.
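For point 2 above, one common mitigation is a pre-submission filter that redacts obvious identifiers before a prompt leaves the corporate boundary. Below is a minimal Python sketch; the patterns are illustrative assumptions, not an exhaustive PII detector, and a production system would pair them with stronger named-entity detection:

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with bracketed labels
    before the text is sent to an external LLM service."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

A filter like this runs as a proxy in front of the chatbot interface, so employees interact normally while sensitive strings never reach the third-party service.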

As open-source AI models are adopted across industries, concerns over data ownership will no longer be limited to Big Tech companies. Every organization deploying these models will face questions about how data is collected and used by its AI systems.

While best practices and recommended policies are still emerging, resources like Stanford Law School’s AI Data Stewardship Framework and the Association for Computing Machinery’s guidelines on generative AI can provide valuable insights into navigating these challenges responsibly.

At Tulane University, we have established the Center for Community-Engaged Artificial Intelligence to address these issues. Through a cross-disciplinary approach, we work with nonprofits and community groups to understand how AI affects their work, striving to build AI systems that give control over data and technology back to the people most affected by them. By adhering to similar values, corporations can become better stewards of the data they collect and use as they dive deeper into AI development.
