Researchers from the University of California, Berkeley have found that OpenAI's ChatGPT model has memorized a large number of copyrighted works. This memorization can introduce bias into analyses conducted with OpenAI models.
Transparency and Unseen Biases
The researchers’ primary interest is in transparency and the potential for unseen biases when those relying on OpenAI remain in the dark about which sources were included in, and excluded from, the model's training data. They have reported their findings on the arXiv preprint server.
Science fiction and fantasy books dominate the list of memorized books, creating a built-in bias in the responses ChatGPT may provide. The accuracy of such models depends strongly on how frequently a model has seen a piece of information in its training data, calling into question their ability to generalize.
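The link between memorization and training-data frequency can be probed with a cloze-style test: mask a distinctive token (such as a character name) in a passage and check whether the model restores it. The sketch below is a minimal, hypothetical illustration of that idea, not the study's actual protocol; `query_model` stands in for a real model API call, and the passages and names are invented examples.

```python
def name_cloze_accuracy(passages, query_model):
    """Fraction of masked passages for which the model restores the gold name.

    passages: list of (masked_text, gold_name) pairs, where masked_text
              contains a [MASK] placeholder for a character name.
    query_model: callable mapping a masked passage to a predicted name
                 (in practice, a call to a language-model API).
    """
    if not passages:
        return 0.0
    correct = sum(
        1
        for masked, gold in passages
        if query_model(masked).strip().lower() == gold.strip().lower()
    )
    return correct / len(passages)


# Stub standing in for a real model, for illustration only:
# it always guesses the same name, so it gets one of two passages right.
def stub_model(masked_text):
    return "Elizabeth"


examples = [
    ("[MASK] looked anxiously toward Mr. Darcy.", "Elizabeth"),
    ("[MASK] set out on foot for Netherfield.", "Jane"),
]
print(name_cloze_accuracy(examples, stub_model))  # prints 0.5
```

A high score on passages from a given book suggests the model has seen that text often during training, which is one way researchers estimate what a closed model has memorized.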
The researchers said their findings make the case for the use of open models that disclose training data. Knowing what books a model has been trained on is critical to assess sources of bias.
Major legal challenges are likely in the near future. What are the limits of “fair use” when copying text? Who owns the copyright on text generated in full or in part by ChatGPT? Who prevails when copyright protection is sought for multiple similar or identical outputs by multiple parties? And perhaps a more interesting question: Is machine-generated language copyrightable at all?
The researchers’ findings raise questions of propriety and copyright protection, and their work has shown that OpenAI models know about books in proportion to those books' popularity on the web. While ChatGPT was found to be quite knowledgeable about works in the public domain, lesser-known works were largely unknown to it. The researchers suggest that language models should be transparent and free from bias.