Spamming the Data Space – CLIP, GPT and synthetic data

Francis Hunger, December 7, 2022

Introduction

For the last time in human history the cultural-data space has not been contaminated. In recent years a new technique to acquire knowledge has emerged. Scraping the Internet and extracting information and data has become a new mode of operation for companies and for university researchers in the field of machine learning. One of the largest publicly available training data sets that combines images and labels (which are meant to describe the image’s content) is LAION-5B, with 5.85 billion image-text pairs (Ilharco, Gabriel et al. 2021).[1]
The scope of scraping internet resources has become so all-encompassing that researcher Eva Cetinic has proposed calling this form a ‘cultural snapshot’: “By encoding numerous associations which exist between data items collected at a certain point in time, those models therefore represent synchronic assemblages of cultural snapshots, embedded in a specific technological framework. Metaphorically those models can be considered as some sort of encapsulation of the collective (un)conscious […]” (Cetinic 2022).[2] The important suggestion Cetinic makes is that these data collections are temporally anchored. This temporal dimension suggests that digital cultural snapshots taken at different times document different states of (online) culture. So how will a 2021 cultural snapshot differ from a 2031 one?

Consequences

Multi-modal models like CLIP, trained on large-scale data sets such as LAION-5B, provide the statistical means to generate images from text prompts. In the CLIP model, two embedding spaces are brought together, one for images and one for text descriptions. Mathematical methods align them, so that vectors in the one space, the image domain, lie close to the vectors of matching descriptions in the other space, the text domain. The assumption is that there is a similarity between both, so that one can be translated into the other. In three short examples I’ll discuss some of the consequences of the underlying data for large-scale models from the perspective of cultural snapshots.
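
The alignment of the two embedding spaces can be illustrated with a minimal Python sketch. It is not CLIP itself: the three-dimensional vectors, the file names and the helper best_caption are hypothetical and written by hand purely for illustration, whereas a real model learns vectors with hundreds of dimensions from image-text pairs. The sketch only shows the principle that an image and its matching description end up close together, here measured by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings (assumption: a real model would
# compute these with a trained image encoder and a trained text encoder).
image_embeddings = {
    "photo_of_cat.jpg": [0.9, 0.1, 0.0],
    "photo_of_skyline.jpg": [0.1, 0.9, 0.2],
}
text_embeddings = {
    "a cat": [0.8, 0.2, 0.1],
    "a city skyline": [0.0, 1.0, 0.3],
}


def best_caption(image_name):
    """Pick the caption whose text vector is most similar to the image vector."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_vec, text_embeddings[caption]),
    )


for name in image_embeddings:
    print(name, "->", best_caption(name))
```

In this toy setting each image is matched to the caption whose vector points in the most similar direction. Which captions and images are available for matching at all depends on the data that was collected.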

1.) Data Bias: Critical discussion of these large-scale multi-modal models has pointed out how they are culturally skewed and reproduce sexist and racist biases. Researchers Fabian Offert and Thao Phan, for instance, describe how the company OpenAI decided not to mitigate the problem of whiteness by changing the model’s underlying data. Instead, OpenAI added certain invisible keywords to users’ prompts to have more people of color included, without changing the model. Evidently, the computation required to create these models, or even to curate the underlying data, is so costly that for economic reasons even major problems cannot be corrected in the embedding space itself. Discussing the prevalent ‘whiteness’ in these models further, Offert and Phan suggest turning to the humanities in order to “identify the different technical modes of whiteness at play, and understand the reconceptualization and resurrection of whiteness as a machinic concept” (Offert and Phan 2022, 3).[3]

2.) Uneven spatial distribution: Users of large-scale multi-modal models have tested their limits when generating images. ‘Crungus’ and ‘Loab’ are two examples. ‘Loab’, the image of a woman, appeared when AI artist Supercomposite looked for the negative of a prompt: “DIGITA PNTICS skyline logo::-1”. Loab appears to be a consistent pixel accumulation, which repeatedly emerges in different configurations and cannot easily be traced back to a single origin.[4] During intensive testing, the creator/discoverer of ‘Loab’ felt that Loab might exist in its own pocket, because it was relatively reproducible compared to other prompts, as if it was populating a certain statistical region within the larger latent space. Another, similar phenomenon of uneven spatial distribution in latent space is ‘Crungus’, basically a fantasy word which, used as a prompt, nevertheless created results: a snarling, zombie-like figure with shoulder-long hair, which could be part of a horror movie.[5]

Both examples demonstrate that the cultural snapshots also contain material which cannot be easily identified or traced back, and they demonstrate how the latent space is unevenly distributed by design. Since the models are built by a process called zero-shot learning, in contrast to, for instance, the supervised learning used with ImageNet, intentional ontologies are no longer used in the knowledge creation of these models. Human involvement consists of the uncoordinated captioning of images by users online, and of researchers setting up the scraping algorithms and excluding certain domains from being scraped.

3.) Data Spam: Historically, spam has emerged whenever a business case could be made for creating large numbers of messages by copy-and-paste. Email spam, forum spam, comment spam and video spam on YouTube have been common and consistent over the past decades. Hand in hand with spam goes Search Engine Optimization (SEO), which optimizes content for discoverability by knowledge aggregators, namely search engines. Text generators like GPT-3 have already proven to be an annoyance when users of Stack Overflow, one of the central online forums for programmers, began to flood it with generated answers. It turned out that many generated answers were incorrect but not easily discernible as such: “The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce” (Stack Overflow moderators in: Vincent 2022). This is only one example of many. The problem will extend from text to image and video generation and will become a major problem on Instagram, Flickr, Pinterest, and many other visual platforms. Possible applications for data spam are fake news, subversive messages, advertisement and so on.

Further, synthetic text spam or synthetic image spam produced with statistical tools like GPT or CLIP yields results which will be evaluated by the same or similar machine learning architectures, and may therefore conform more closely to the mathematical models than organically human-produced content.

All in all, this raises the question of how to assess any online content after 2021.

Data Ecologies

While some may argue that generated text and images will save time and money for businesses, a data ecological view immediately recognizes a major problem: AI feeds into AI. To rephrase it: statistical computing feeds into statistical computing. In using these models and publishing the results online we are beginning to create a loop of prompts and results, with the results being fed into the next iteration of the cultural snapshots. That’s why I call the early cultural snapshots still uncontaminated, and why I expect the next iterations of cultural snapshots to be contaminated. In the long term this may lead to a deterioration of the quality of the appropriated data. It also opens the opportunity for data spamming. Spammers or search engine optimizers may decide to create huge amounts of pictures and captions to create a stronger presence for a certain product or cause.
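
The loop of statistical computing feeding into statistical computing can be sketched as a toy simulation in Python. This is an illustration under strong assumptions, not a model of any real training pipeline: the ‘culture’ is reduced to a single normal distribution, each ‘snapshot’ consists of only ten samples, and the function name simulate_recursive_fitting and all parameter values are hypothetical. Each generation is fitted only to samples drawn from the fit of the previous generation.

```python
import random
import statistics


def simulate_recursive_fitting(generations=300, sample_size=10, seed=0):
    """Refit a normal distribution again and again to its own samples.

    Returns the fitted standard deviation of every generation,
    starting with the original ("uncontaminated") one.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the original distribution
    history = [std]
    for _ in range(generations):
        # The new "snapshot" contains only output of the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Fit the next model to that snapshot.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_recursive_fitting()
print(f"spread at generation 0: {history[0]:.3f}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3g}")
```

Under these assumptions the spread of the fitted distribution tends to shrink over the generations, since each refit can lose variation that later generations cannot recover. The toy says nothing about how quickly, or whether, comparable effects occur in real cultural snapshots.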

These are the conditions under which such large image collections become available at all: the extraction of the unpaid labor of those who originally published the images online. Both the extractive nature and the very likely future contamination of cultural snapshots will make this approach untenable and unsustainable in the long run.

Addendum Feb 3, 2024

The point where search engines or platforms (notably Facebook) become unusable is approaching with astonishing speed, as they get polluted with generative content. My hunch is that communities which keep their information space free from AI contamination will thrive. It would be strategically wise to discuss now how to establish and nurture these places.

Meanwhile the idea of model collapse that relates to my idea of cultural snapshot contamination has been researched properly in this paper: Anderson, Ross, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, and Nicolas Papernot. 2023. “The Curse of Recursion – Training on Generated Data Makes Models Forget.” arXiv. https://doi.org/10.48550/arXiv.2305.17493.

Sources

Baio, Andy. 2022. “AI Data Laundering – How Academic and Nonprofit Researchers Shield Tech Companies from Accountability.” Blog. Waxy.Org (blog). September 30, 2022. https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/.

Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.” arXiv. https://doi.org/10.48550/arXiv.2110.01963.

Cetinic, Eva. 2022. “The Myth of Culturally Agnostic AI Models.” arXiv. https://doi.org/10.48550/arXiv.2211.15271.

Ilharco, Gabriel, Mitchell Wortsman, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, et al. 2021. “OpenCLIP.” Hamburg: Laion e.V. Zenodo. https://doi.org/10.5281/ZENODO.5143773.

Kelly [@Brainmage], Guy. 2022. “Well I REALLY Don’t like How Similar All These Pictures of ‘Crungus’, ….” Tweet. Twitter. https://twitter.com/Brainmage/status/1538111384390619136.

Lavoipierre, Ange. 2022. “There’s a Woman Haunting the Internet. She Was Created by AI. Now She Won’t Leave.” ABC News, November 25, 2022. https://www.abc.net.au/news/2022-11-26/loab-age-of-artificial-intelligence-future/101678206.

Offert, Fabian, and Thao Phan. 2022. “A Sign That Spells: DALL-E 2, Invisual Images and The Racial Politics of Feature Space.” ArXiv:2211.06323 [Cs], October. http://arxiv.org/abs/2211.06323.

Supercomposite [@supercomposite]. 2022. “🧵: I Discovered This Woman, Who I Call Loab, in April. ….” Tweet. Twitter. https://twitter.com/supercomposite/status/1567162288087470081.

Vincent, James. 2022. “AI-Generated Answers Temporarily Banned on Coding Q&A Site Stack Overflow.” The Verge. December 5, 2022. https://www.theverge.com/2022/12/5/23493932/chatgpt-ai-generated-answers-temporarily-banned-stack-overflow-llms-dangers.

Weisbuch, Max, Sarah A. Lamer, Evelyne Treinen, and Kristin Pauker. 2017. “Cultural Snapshots – Theory and Method.” Social and Personality Psychology Compass 11 (9). https://doi.org/10.1111/spc3.12334.

[1] LAION is organized as an independent German research association. This division of labor between smaller and larger actors, which shifts responsibility away from the large companies that use the models based on these data collections, has been criticized in AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability (Baio 2022).

[2] Cetinic borrows this concept from social and cultural psychology studies, referring to Cultural snapshots – Theory and method (Weisbuch et al. 2017).

[3] Cf. Multimodal datasets: misogyny, pornography, and malignant stereotypes (Birhane, Prabhu, and Kahembwe 2021).

[4] Cf. (Supercomposite [@supercomposite] 2022; Lavoipierre 2022). Note: I wasn’t able to reproduce Loab with my installation of Stable Diffusion v1.

[5] It was first produced using DALL-E mini in June 2022 by actor and comedian Guy Kelly, and first reported on Twitter (Kelly [@Brainmage] 2022).

Comments

Feel free to comment on https://dair-community.social/@databasecultures

Further discussion emerged on nettime https://nettime.org/Lists-Archives/nettime-l-2212/msg00019.html

New Sources to be included

The Curse of Recursion: Training on Generated Data Makes Models Forget, https://www.cl.cam.ac.uk/~is410/Papers/dementia_arxiv.pdf

Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023. https://arxiv.org/abs/2302.10149

Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks, https://arxiv.org/abs/2306.07899

Self-Consuming Generative Models Go MAD https://arxiv.org/abs/2307.01850