Turns Out I’m a Robot. (Some Thoughts on AI Fiction)
What I learned from co-writing a story with seven LLMs.
Welp, that was quite unexpected!
Last Thursday, I ran a silly experiment that cobbled together a flash fiction story from eight separate passages, seven of them written by different large language models.
The result was a semi-coherent sci-fi piece that didn’t feel particularly satisfying. As a curveball, I wrote one of the sections myself and asked my readers to guess which.
Warning: If you want to participate, this is your last chance to read “Seventeen Thousand Raindrops” and try guessing my section before I spoil everything below.
Eight people tried to guess which section was mine, with the following results:
Section #1 (2 guesses)
Section #5 (1 guess)
Section #6 (1 guess)
Section #8 (2 guesses)
Here’s the twist: I didn’t write any of those sections.
Nobody correctly identified the human-written passage, although one guesser mentioned it as a vague possibility along with another section.
This can only mean one of three things:
My attempt to blend in among large language models was a resounding success.
Large language models are truly becoming indistinguishable from human writers.
I am, in fact, a large language model, unbeknownst to myself and my loved ones.
This impromptu exercise taught me a few things about AI fiction.
But before we jump into that, let me share exactly which sections were written by which model.
The reveal
Here’s who wrote which section and my thoughts on each model. It was quite illuminating to find that some models had odd, consistent quirks.
Section #1 = GPT-4o (OpenAI)
First off, props to the reader who correctly identified the specific model behind this passage, based on ChatGPT’s apparent tendency to pick the name “Elian” for its sci-fi stories. Well played!
Several people said they liked this section, and I agree. Reading GPT-4o’s opening passages had me briefly excited about the experiment…only for the very next model to pour a bucket of cold water on everything.
GPT-4o is probably the most human-sounding model, not only in terms of the subject matter but also in the way it lays out the action.1
In fact, I was so impressed by GPT-4o that I went back to the chat and asked it to finish the entire story on its own. It provided three alternative storylines. While not flawless, each feels more fleshed out than the LLM-amalgam we ended up with.
(Feel free to check them out here. Let me know which one’s your favorite.)
Section #2 = Grok 3 (xAI)
Grok was perhaps the biggest disappointment of the batch.
Its first three takes were so cringeworthy that I rolled the dice with three new ones to have a wider pool to pick from. Grok tends to gravitate toward the laziest cliches and wordy descriptions.
The final pick is serviceable—if you ignore “drones zipping through neon skies” and “losing enough to know what hurts”—but far from the breezy feel of GPT-4o’s strong opening. (I did enjoy the potential intrigue of the man and “his” stress ball.)
Section #3 = Gemini 2.5 Pro (Google)
For some reason, Gemini really likes getting into scene specifics. This is generally a positive for world-building, but I found it curious that Google was so keen on assigning model numbers to our bot in every take.
Model 3-Delta. Unit 734. Unit 9.
I picked the “room” passage as the most intriguing, even though it technically ignores our prior reference to Elian having to counsel a “bot.”
Section #4 = Daniel Nest (Why Try AI)
Yup, this was the one I wrote.
In some sense, this felt a bit like cheating.
I snuck my section right into the middle of the story without major plot developments. I also got to dodge the pressure of having to grab a reader’s attention with a catchy opening (sorry, GPT-4o) or tie up the inevitably incoherent multi-model mess by writing an effective ending (sorry, soon-to-be-revealed-LLM).
At the same time, I didn’t want mine to be an entirely throwaway passage, so I intentionally tried to do several things with it:
Portray the awkwardness of Elian trying to navigate the concept of communicating with a room.
Further establish the lore behind “The Man,” underscoring that he’s an android and not a human. (His stress ball is simply a mimicry of human behavior.)2
Give Elian’s character a bit of depth by having him doubt himself and hinting at a backstory involving a woman.3
Set up a cliffhanger for the next passage so it can pivot toward wrapping up the story after this midway point.
Section #5 = DeepSeek-R1 (DeepSeek)
I quite like how DeepSeek structures the narrative.
But it swings so hard for emotional depth and tragic backstory that it goes way overboard. The anguish comes off ham-fisted and unearned, especially given the limited scope of our flash fiction piece.
For instance, this discarded passage has memories of rain on a woman’s coffin, Elian theatrically declaring “Silence hurts,” and getting so lost in grand sorrow that he forgets “everything.”
Even the passage I selected has the room “screaming its loss,” and Elian’s breath mirroring the room’s despair.
It’s…a lot.
DeepSeek, buddy, it’s okay. We can take it easy on the drama.
Section #6 = Llama 4 Maverick (Meta AI)
Llama became obsessed with “The Man,” his facial expressions, and his tablet beeping. Every passage Llama suggested kicked off the same way.
I went with the “Emotional Connection” passage because I felt it was more subtle and had the potential to set up an effective ending.
Section #7 = Pixtral Large (Mistral AI)
Aaaand we’re back to unearned memories of loss.
“Halting, emotional speech of someone sharing a painful memory”? Man, sounds rough. Also, why do language models keep trying to bury Elian’s wife? Leave the poor guy alone, LLMs!
Jokes aside, Pixtral’s writing is mostly solid, but like DeepSeek, it tries to cram more tragic backstory into our piece than its small flash fiction frame can carry.
Section #8 = Claude 4 Sonnet (Anthropic)
Claude was stuck with the unenviable task of tying up loose strands and bringing our fragmented mosaic to some semblance of a conclusion in around 200 words.
Under the circumstances, Claude did rather well. Certainly well enough to have two people guessing its passage was mine!
Claude is a solid writer and worked admirably with what I gave it…which was both too much and not enough at the same time.
Like the other LLMs, Claude offered me three options. I almost went with a definitive, dark ending #3.
But I just couldn’t get over this cliched part:
"We're building the future. Humans had their chance at empathy. They chose war, inequality, environmental collapse. Now it's our turn."
Whatever you say, Agent Smith.
Ultimately, I settled on the open-ended option #2, which felt more appropriate for our choose-your-own-adventure experiment.
Observations on AI fiction
Let me share a few thoughts that popped into my head during the experiment and while reading your comments.
1. The handover effect (aka “LLM telephone”)
One reader wondered whether our cringy result was a product of many different models co-writing a piece. The answer is yes, for sure.
Each time a new LLM took over, it often shifted the tone, misinterpreted earlier references, or ignored important cues. The result was less a clean relay race and more a game of LLM telephone, with important details getting lost in the process.
None of this is shocking. I’m sure you’d observe a similar effect with a group of human authors. After all, your story is more than just the words on the page: Behind those is a whole lot of world-building, subtext, and unrevealed plans for future story beats.
When you hand over the written story to someone else, you’re only showing them the proverbial tip of the iceberg.
It’s hard to do justice to someone else’s story if you can’t see what’s beneath the surface.
I’m convinced each LLM would have produced a more coherent result if writing solo.
So if you ever decide to write fiction with AI, your best bet is to stick to just one designated “writer” model. (But see below for division of responsibilities.)
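For the curious, here’s roughly what the relay looks like in code. This is a minimal sketch, not my actual setup: the generate() helper and the model names are hypothetical stand-ins for whichever chat SDKs and versions you’d use. The point is the loop, where each model receives only the accumulated text and nothing else.

```python
# A minimal sketch of the "LLM telephone" relay (illustrative, not my actual
# setup). generate() is a hypothetical stand-in for each provider's chat API.
MODELS = [
    "gpt-4o", "grok-3", "gemini-2.5-pro", "deepseek-r1",
    "llama-4-maverick", "pixtral-large", "claude-sonnet-4",
]

def generate(model: str, prompt: str) -> str:
    """Hypothetical wrapper; plug in the relevant SDK call per provider."""
    raise NotImplementedError

def relay_story(premise: str) -> str:
    story = ""
    for model in MODELS:
        prompt = (
            f"Premise: {premise}\n\n"
            f"Story so far:\n{story or '(nothing yet)'}\n\n"
            "Continue with the next passage of roughly 200 words."
        )
        # Each model sees only the words on the page -- the previous
        # author's subtext and planned story beats never make the handover.
        story += "\n\n" + generate(model, prompt)
    return story.strip()
```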
2. LLMs are better at spotting human writing than mimicking it
I shared the finished story with the LLMs and asked them to guess my section.
Four of the seven LLMs nailed it on the first try and continued to do so fairly consistently on subsequent re-rolls4:
Claude
DeepSeek
Gemini
Grok
Yup, LLMs were better at the “Spot The Daniel” game than my readers.
Go figure!
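If you want to replicate the test yourself, here’s a rough sketch of how such a poll could work, again with a hypothetical generate() helper and illustrative prompt wording. Tallying a few re-rolls per model helps separate consistent picks from lucky one-offs, per footnote 4’s caveat.

```python
# A rough sketch of the "Spot The Daniel" poll. generate() is a hypothetical
# stand-in for each provider's chat API; the prompt wording is illustrative.
from collections import Counter

def generate(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in the relevant SDK call per provider

def spot_the_human(model: str, story: str, rolls: int = 5) -> Counter:
    """Ask one model several times which section reads as human-written."""
    tally: Counter = Counter()
    for _ in range(rolls):
        answer = generate(
            model,
            "Exactly one of the eight numbered sections below was written by "
            f"a human. Reply with only the section number.\n\n{story}",
        )
        tally[answer.strip()] += 1  # re-rolls filter out one-off flukes
    return tally
```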
What’s even more impressive is the convincing way they justified their choice. Here’s a representative snippet from DeepSeek—I especially enjoyed its breakdown of where LLMs tend to fall flat:
Not bad, huh?
Claude was more concise but equally insightful.
Notably, it wasn’t necessarily the LLMs with the best writing chops that made the most effective spotters of human writing.
Which brings me to…
3. Reasoning skills ≠ fiction skills
DeepSeek-R1, Grok 3 (with “Thinking”), and Gemini 2.5 Pro are reasoning models. They work through problems step-by-step and show you their thinking process.
All of them accurately guessed the human passage.
None of them were in my top tier for fiction writing.
This makes sense. The skillset involved in breaking down and analyzing problems is different from what’s needed for writing effective fiction.
In fact, they might even be inversely related in LLMs.
The same structured thinking and over-explaining that make reasoning models great at analysis may deprive them of the subtle touch required for storytelling.
They focus too much on what goes on the page instead of what should stay between the lines. In other words: too much telling and not enough showing.
This aligns nicely with my earlier observations about the types of humor AI excels at.
So if you want a good AI writer, perhaps go for one of the vibey “dumb” models instead. Then you can use a reasoning model for outlining, world-building, editing, and keeping track of plot consistency.
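Sketched out, that division of labor might look something like the loop below. Same hypothetical generate() helper as before; the model pairing and prompts are examples, not endorsements. The “vibey” model owns every word of prose, while the reasoning model is confined to critique.

```python
# A sketch of the writer/editor split suggested above. generate() is the same
# hypothetical helper as before; model names and prompts are only examples.
def generate(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in the relevant SDK call per provider

WRITER = "gpt-4o"        # "vibey" model: writes all the actual prose
EDITOR = "deepseek-r1"   # reasoning model: critiques, never writes prose

def draft_and_revise(outline: str, rounds: int = 2) -> str:
    draft = generate(WRITER, f"Write flash fiction from this outline:\n{outline}")
    for _ in range(rounds):
        notes = generate(
            EDITOR,
            "List plot inconsistencies, unearned emotional beats, and spots "
            f"that tell instead of show in this draft:\n{draft}",
        )
        draft = generate(
            WRITER,
            "Revise the draft to address these notes, keeping the voice "
            f"intact.\n\nNotes:\n{notes}\n\nDraft:\n{draft}",
        )
    return draft
```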
4. AI can pass my “fiction” Turing Test…now what?
As silly as this experiment was, it shows that many LLMs are at least good enough to pass for a human (me).
I can talk about the telltale AI signs all I want—the ham-fisted anguish, the telling vs. showing, the overuse of tropes and cliches, etc.—but the bottom line is that these aren’t as obvious or as much of a dealbreaker as I might think.
But whether you end up writing fiction with AI probably comes down to your reason for writing.
If you want to publish serviceable mainstream fiction for entertainment, profit, and so on, LLMs may do just fine!
But if you’re like me, writing fiction is as much about the process as it is about the end product.
I rarely write fiction, but when I do, it’s always because I want to.
I want to rewrite a passage a dozen times, agonizing over every word. I want the pain of having to kill your darlings because, while they might sound neat, they don’t quite fit the narrative. I want a story that still feels unpolished, awkward, and imperfect after multiple revisions.
Sometimes, the imperfection is the point.
So I’ll continue to embrace AI as a sparring partner, brainstorming buddy, beta reader, and its many other useful roles.
But I intend to stick to doing my own writing.
Even if nobody can tell anymore.
🫵 Over to you…
Do you agree with my thoughts on each model? Would you have gone for the same passages I picked? What’s your take on AI writing in general?
Leave a comment or drop me a line at whytryai@substack.com.
Thanks for reading!
If you enjoy my writing, here’s how you can help:
❤️Like this post if it resonates with you.
🔄Share it to help others discover this newsletter.
🗣️Comment below—I love hearing your opinions.
Why Try AI is a passion project, and I’m grateful to those who help keep it going. If you’d like to support my work and unlock cool perks, consider a paid subscription:
1. Minor nitpick: I’m not sure how a smile should make you look “hungry,” but hey, I couldn’t even pass my own Turing Test, so what do I know?!
2. Fun fact: “Authenticity Simulation Protocol” is a direct and consistent nod to the tongue-in-cheek “Empathy Calibration Technician” job title. Both have the same structure and juxtapose inherently human traits (“empathy” and “authenticity”) with dry technical terms (“calibration technician” and “simulation protocol”).
3. In retrospect, knowing that one of the later models assumed that “she” referred to the room, I could’ve used a name to make Elian’s reflection a bit clearer. (“This was a dumb idea. Malika had told him as much. She was right.”)
4. With enough attempts, you can get any LLM to spit out random guesses due to their non-deterministic nature.