Earlier this month, The Washington Post looked under the hood of some of the artificial intelligence systems that power increasingly popular chatbots. These bots answer questions about a vast array of topics, engage in conversation and can even generate a complex, though not necessarily accurate, academic paper.
To “instruct” English-language AIs, called large language models (LLMs), companies feed them enormous amounts of data collected from across the web, ranging from Wikipedia to news sites to video game forums. The Post’s investigation found that the news sites in Google’s C4 data set, which has been used to instruct high-profile LLMs like Facebook’s LLaMA and Google’s own T5, include not only content from major newspapers but also from far-right sites like Breitbart and VDare.
For computer scientist and journalist Meredith Broussard, author of the new book “More Than a Glitch: Confronting Race, Gender, and Ability Bias in Tech,” the Post’s findings are both deeply troubling and business as usual. “All of the preexisting social problems are reflected in the training data used to train AI systems,” she said. “The real error is assuming the AI is doing something better than humans. It’s simply not true.”
There has been an explosion of interest in chatbots since the release of OpenAI’s ChatGPT last year. People have reported using ChatGPT to help with a growing list of tasks and activities, including homework, gardening, coding, gaming, writing and editing. New York Times columnist Farhad Manjoo reported it has changed the way he and other journalists do their work, but warned they need to proceed with caution. “ChatGPT and other chatbots are known to make stuff up or otherwise spew out incorrect information,” he wrote. “They’re also black boxes. Not even ChatGPT’s creators fully know why it suggests some ideas over others, or which way its biases run, or the myriad other ways it may screw up.”
But Broussard points out that bias problems plagued tech well before the chatbot craze. In her 2018 book “Algorithms of Oppression,” internet studies scholar Safiya U. Noble exposed how racism was baked into the algorithm that powers Google’s search engine. For example, Noble, now a professor at UCLA, found that when Googling the terms “Black girls,” “Latina girls” or “Asian girls,” the top results were pornography. In other contexts, artificial intelligence used to evaluate mortgage applications has led to Black applicants being 40% to 80% more likely to be denied a loan than similarly qualified white applicants.
Anyone who has searched the web for information on a topic knows that it can sometimes land them on a site spewing bigoted content or disinformation. The building blocks of chatbots have been scraped from that same internet. An offended user can navigate away from a toxic site in disgust, but because data collection for LLMs is automated, such content gets included in their “instruction.” So if an LLM draws on sites like Breitbart and VDare, which publish transphobic, anti-immigrant and racist content, that information, or disinformation, could be incorporated into a chatbot’s responses to your questions or requests for help.
“LLMs have been trained with white supremacist language and toxic material,” Broussard said, and “will definitely output white supremacist language.”
After reading the Washington Post story, I looked at VDare, a site I’ve reported on in the past but had not visited in some time. One front-page story, reflecting a preoccupation of the site, claimed that “black-on-white homicides” were contributing to the “death of white America,” an argument purportedly based on FBI statistics. It reminded me of how Donald Trump, when running for president in 2016, retweeted a tweet from a white supremacist account that included a racist image and numbers falsely claiming that Black people are responsible for 81% of homicides of white people. The fact-checking site PolitiFact deemed the tweet “pants on fire” for its lies, but the power of technology, in that case the retweet by someone who was famous, rich and a candidate for president, imbued the lie with a kind of imprimatur that mainstreamed white supremacist hate and far outstripped any corrective.