OpenAI outlined a plan to collect data from organisations across a wide range of industries, languages and cultures to improve public and private datasets for its AI models.
It aims to gather large quantities of data to bolster general intelligence capabilities, citing occasional flaws in the information currently collated to train large language models (LLMs).
OpenAI stated such flaws can result in “AI hallucinations”, in which a model presents false information or claims not based on real data or events.
The data sought will include text, images, audio and video not already easily accessible to the public, with a particular focus on data which “expresses human intention”, such as long-form writing or conversations across various languages, topics and formats.
OpenAI is initially seeking partners for an open-source public dataset for use in AI model training, and for private datasets to train proprietary models.
“Overall, we are seeking partners who want to help us teach AI to understand our world in order to be maximally helpful to everyone,” the company stated.