Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

some_guy@lemmy.sdf.org · 8 months ago

conciselyverbose@sh.itjust.works · edit-2 8 months ago

The problem is that LLMs aren’t human speech and any dataset that includes them cannot be an accurate representation of human speech.

It’s not “LLMs convinced humans to use ‘delve’ a lot”. It’s “this dataset is muddy as hell because a huge proportion of it is randomly generated noise”.

NuXCOM_90Percent@lemmy.zip · 8 months ago

What is “human speech”? Again, so many people (around the world) have picked up idioms and speaking cadences based on the media they consume. A great example is that two of my best friends are from the UK but have been in the US long enough that their families make fun of them. Yet their kid actually pronounces “aluminum” as “al-you-min-ee-uhm” even though they both say “al-ooh-min-um”. Why? Because he watches a cartoon where they pronounce it the British way.

And I already referenced SoCal-ification, which is heavily based on screenwriters and actors who live in LA. Again, do we not speak “human speech” because it was artificially influenced?

Like, yeah, LLMs are “tainted” with the word “delve” (which I am pretty sure comes from YouTube scripts anyway but…). So are people. There is a lot of value in researching WHY a given word or idiom becomes so popular but, at the end of the day… people be saying “delve” a lot.

conciselyverbose@sh.itjust.works · edit-2 8 months ago

Speech written by a human. It’s not complicated.

It cannot possibly be human speech if it was produced by a machine.