This Startup Is Teaching AI Models to Stop Thinking Alike

Christina Hill, Staff Writer

Why do the chatbots all say the same thing?

Try this party trick with a few major chatbots. Ask for a random number between one and ten, then ask again, and again. The answers often cluster in a surprisingly tight little pocket. You’ll see the same few picks crop up, then the models drift into familiar aftercare: a quick explanation, a polite offer to help, maybe a suggestion for how to use the number in a game or prompt. Nothing breaks, and nothing crashes. The output just feels oddly coordinated.

That’s the part people keep bumping into. The problem is usually not that the model can’t answer. It can. The problem is that it reaches for the safest probable answer, which is fine when you want a clean summary or a quick rewrite, and much less useful when you’re trying to brainstorm something that doesn’t sound like it came from a committee. In tech news, AI policy drafts, digital culture copy, and even the first pass on a campaign slogan, users often want fresh options. The models keep giving them the middle lane.

A chatbot that always lands on the safest answer is useful for chores. For creative work, safe can turn into stale very quickly.

That’s the hole Springboards is trying to fill. The Australian startup built Flint as a response to the gray, mass-market feel of standard chatbot output, where everything can start to sound as if it was polished by the same invisible hand. The pitch isn’t that current models are broken. It’s that they’re too eager to be agreeable. When someone is drafting ad copy, hunting for a product name, or trying to kick off a new concept, agreeable is only useful for part of the job.

Pip Bingemann, who is helping front the product, seems to sell it with a pretty simple line: sometimes the model should be exactly what you expect. If you ask for a neat answer, you should get one. If you need a quick fix, the machine can do that without making a fuss. But the model also needs to have a little room to surprise you. Not chaos. Not random noise for its own sake. Just enough variation to give a human something less obvious to work with.

That distinction matters. A chatbot that nails the obvious response can still save time. A chatbot that only ever nails the obvious response can flatten a creative session into a dead end. The gap Springboards is aiming at lives there, between utility and sameness, where a model can be technically correct and still feel useless for the actual task in front of it.

The company’s bet is simple enough to understand. People don’t always want the first decent answer. Sometimes they want a few odd ones, a few bolder ones, a few that make the room go quiet for a second. As far as I can tell, Flint is Springboards’ attempt to make that possible without turning the model into a carnival act. Whether that turns into a real tool for creative work, or just another neat demo for the AI crowd, is the question hanging over it. The reasons behind the sameness are less mysterious than they first appear, and that’s where the story gets more interesting.

The case for sameness: from random numbers to river metaphors

The random-number trick is funny because it feels like a glitch. Ask a few big chatbots for a number between one and ten, and a small cluster of answers keeps popping up before they wander off into the usual polite chatter. The more interesting part is what happens when you move past party tricks and ask them to be imaginative for a living. That’s where the sameness stops looking accidental.

A NeurIPS-winning paper tested roughly two dozen large language models, pulling in systems from major US companies, open-source models from China, and others from around the field. The researchers didn’t just throw yes-or-no prompts at them. They used open-ended questions meant to pull out fresh language, then watched what happened when the same prompts were repeated. On the topic of time, the answers kept collapsing toward a few stock images. River language showed up again and again. So did weaving and thread imagery. Different models, different brands, same small shelf of metaphors.

When models are trained to be broadly useful, they often end up sounding broadly the same.

That pattern shows up in everyday prompts too. Ask for a recommendation on a car and the answers tend to drift toward the same familiar names, usually Toyota or Honda territory, because those are safe, well-known, and unlikely to offend anyone. Ask for a band name and the output fills up with the same mood words, the same cool-but-not-too-weird combinations, the same faintly synthetic sense that you’ve seen this exact idea on a poster somewhere before. It’s not that the models are incapable of variety. It’s that they keep selecting the kind of variety most likely to be accepted on the first pass.

The plain-English reason isn’t mysterious, even if the results are a little depressing for anyone hoping machines might be a font of chaos. These systems are trained in similar ways, on overlapping material, and then tuned to produce answers people rate as reliable, coherent, and helpful. That pushes them toward the center. If a response is slightly odd, the training tends to punish it unless the oddness is clearly useful. If a response is familiar, polished, and easy to digest, it usually survives. Over millions of examples, that adds up to a model that knows how to sound inventive without actually taking many risks.

That matters because the model is doing what it was rewarded for doing. It learns to avoid embarrassing detours. It learns to choose the likeliest continuation, then the likeliest one after that. For tasks where certainty and consistency matter, that’s a feature, not a bug. But for brainstorming, naming, mood boards, ad copy, or any prompt where the first decent answer isn’t the same thing as the best answer, the center of the distribution can feel like a very cramped place to live.

Kieran Browne has pointed out a more awkward part of the problem: the interface makes all of this feel personal. ChatGPT and its peers talk back in a one-to-one chat window, so users naturally assume they’re getting a distinct, bespoke reply. In practice, a lot of people are being served extremely similar material, just wrapped in different phrasing. The illusion is neat; the output, less so.

That gap between feeling and mechanism is where the annoyance starts to get useful. If a chatbot hands you the same safe answer as everyone else, it may still be doing its job. But if you were trying to get unstuck, or if you needed something that wasn’t already worn smooth by a thousand similar prompts, the system has quietly missed the point. It hasn’t failed in an obvious way. It has succeeded too neatly.

For readers who want the neighboring rabbit hole, Anthropic has published notes on subliminal learning and stress-testing model specs, both of which circle the same basic issue from different angles: models can absorb and reproduce patterns in ways that are easy to miss if you only look at the polished answer. Springboards, the Australian AI startup behind Flint, is betting that this is exactly the flaw worth poking at. Its pitch makes a lot more sense once you accept that the boring-answer problem is structural, not a one-off quirk.

If you want a place to see how that pitch is being framed, Springboards has it up at trajectory.ai, and the company’s whole argument depends on the idea that “good enough” and “same as everyone else” are not always the same thing.

How Flint pushes the model off the rails, selectively

Springboards didn’t build Flint by training a giant model from scratch and waiting for magic to happen. It started with Qwen 3, the open model from Alibaba, then built a layer on top that changes how the model chooses its next words. That matters, because the usual way people force more variety out of an LLM is blunt. Turn the temperature up, and yes, you get more randomness. You also get wobble, loose logic, and the occasional answer that sounds like it was written after three coffees and a minor electrical fault.
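For readers who want to see what "turning the temperature up" does mechanically, here is a minimal sketch in plain Python. It shows the standard temperature-scaled softmax used when sampling from a language model. The four logit values are made up for illustration and do not come from Qwen 3, Flint, or any real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next words (illustrative only).
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature nearly all the probability sits on the top candidate. At a high temperature the tail words get a real chance of being picked at every step of the response, which is where the wobble comes from.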

That kind of global randomness is fun for a demo and annoying everywhere else. A marketing team can ask for ten taglines and get back nine that are oddly similar, or one that takes a hard left into nonsense. The model is technically “more creative,” but only in the sense that it’s now willing to embarrass itself in public. For a product built around generative AI, that trade-off is usually too messy to ship. Flint’s pitch is that it should keep the useful parts of the model intact while only loosening the grip where the answer has room to move.

Flint is trying to change the odds, not turn every sentence into a dare.

The trick, at least as Springboards describes it, is selective variation. Instead of randomizing every token across the whole response, the system looks for spots where several continuations could work. Those are the seams. A model might be equally happy with three different verbs, or a few different ways to frame a metaphor, or one of several examples that all fit the same prompt. Flint injects more variation there, then leaves the rest alone. The result should feel less boxed in without falling apart.
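Springboards has not published how Flint finds those seams, so the following is only a sketch of one way the idea could work, not the company's method. It assumes that a "seam" is a position where the model's own next-word distribution has high entropy, and it raises the temperature only there. The threshold, both temperature values, the function names, and the example logits are all hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more continuations are in play."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_temperature(logits, threshold=1.0, base=0.7, boosted=1.3):
    """Assumed gating rule: loosen sampling only where the model is already unsure."""
    uncertainty = entropy(softmax(logits, 1.0))
    return boosted if uncertainty >= threshold else base

def sample_token(logits, rng, threshold=1.0, base=0.7, boosted=1.3):
    """Sample one candidate index using the gated temperature."""
    temperature = pick_temperature(logits, threshold, base, boosted)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical positions: one where a single word clearly wins,
# and one where three candidates are nearly tied (a "seam").
confident = [8.0, 1.0, 0.5, 0.2]
open_seam = [2.0, 1.9, 1.8, 0.1]

rng = random.Random(0)
print("confident position uses temperature", pick_temperature(confident))
print("open seam uses temperature", pick_temperature(open_seam))
print("sampled index at the seam:", sample_token(open_seam, rng))
```

The design choice worth noticing is that the confident position is never touched. Only the near-tie gets extra randomness, which is the difference between this and raising one global temperature for the whole response.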

That sounds modest, which is probably the point. People who bounce between Claude, GPT-style systems, and other familiar chatbots already know the annoyance of LLM homogeneity. You ask for a fresh angle, and the model gives you the safest version of the same thought, polished just enough to pass as original. Springboards is betting that the answer isn’t to crank the whole thing into chaos, but to introduce controlled disagreement at the right moments. That’s a much less glamorous trick than “build an AI with personality,” but it’s also a lot more usable.

There’s a practical reason for that restraint. If you push randomness too hard, the model may sound inventive for about six seconds and then start stepping on its own shoelaces. One sentence promises confidence, the next hedges away from it, and the third wanders off into a word salad with a nice vocabulary. A lot of AI creativity tools make that mistake. They treat surprise as the product, when surprise is usually just a side effect. Flint seems designed for the opposite problem: keep the surprise, lose the faceplant.

The same philosophy shows up in Springboards’ companion tool for ad and marketing teams. Instead of forcing a user to accept one generated response as final, it lets them mix and match text from multiple models. That is a quietly useful idea, because creative work rarely fails from a lack of output. It fails because the first output is a little too tidy, a little too familiar, and nobody in the room wants to admit it. A system that lets you splice together the strongest lines from several candidates can be more useful than a single polished paragraph that everyone pretends to like.

For teams working in lifestyle tech or brand work, that difference could be enough to matter. A copywriter might pull the opening from one model, the product framing from another, and the call to action from a third, then throw away the bits that feel too canned.

The broader technical backdrop isn’t hard to see. Work around model steering keeps circling the same question: how much should a model be allowed to settle into its favorite patterns, and how much should it be nudged away from them? A Wired report on ex-Google and Apple researchers building AI that gets smarter as you use it points to one answer, where systems adapt more closely to the person using them. Flint goes after a different target. It doesn’t try to learn your habits over time. It tries to stop the model from defaulting to the same safe habits in the first place. Anthropic’s note on policy steering and a Nature paper on language-model behavior sit in the same neighborhood, if you want the longer technical trail.

Bingemann and Browne’s framing is pretty clean. Flint is meant to widen the lane, not force every response to act strange for the sake of it.

The upside, the limits, and what comes next

The early reactions to Flint have been encouraging, but nobody at Springboards is pretending the thing has cracked creativity wide open. That would be a bit much for a prototype. What it’s done, at least for a small group of testers, is create a different kind of first draft. Strategy consultant Zoe Scaman and marketing operator Maximilian Weigl have both treated it less like a finished writer and more like a prompt machine with manners, one that keeps nudging the conversation away from the first obvious answer.

That matters because most people don’t need a machine to be brilliant every time. They need it to stop being boring at the exact moment the brief starts feeling stale. In practice, that seems to be where Flint finds its lane. When ordinary models are asked to help with a finance-company campaign aimed at younger customers, they tend to drift toward the usual youth-marketing clichés: slang, a few references to “hustle,” maybe some bright colors and a whiff of sticker shock. Flint, by contrast, has been used to push the whole brief toward a different question altogether: a much better one and, frankly, a much less embarrassing one to take into a meeting.

Novelty only helps when it gives you a better starting point, not when it hands you a prettier pile of nonsense.

Scaman’s and Weigl’s reactions fit that idea. They’re not talking about copy-pasting Flint’s output and calling the day done. They’re talking about getting a few lines, angles, or prompts that don’t immediately collapse into the same safe marketing mush. The benefit is upstream. It changes the shape of the brainstorm before anyone’s sunk time into polishing the wrong idea. That’s a quieter win than some AI pitches promise, but it may be the more useful one.

Still, Flint is a prototype, and prototype is doing a lot of work there. Under pressure, it can break. It can wander off, produce something awkward, or fail to hold the thread when the prompt gets long and messy. The people testing it seem aware of that. They’re not treating it like a slot machine that occasionally spits out genius. They’re treating it like a tool that sometimes gives them one unusual angle worth pursuing. That’s a decent outcome, but it’s not a miracle.

There’s also a more basic caution running through the way users talk about it. Most of the time, average output is good enough. For a lot of routine work, the safest model answer is fine because the task itself is routine. Nobody needs a chatbot to invent a new universe every time someone asks for a headline or a subject line. And nobody should be copy-pasting AI text as if the machine has done the job for them. If the output still needs judgment, editing, and some actual human taste, then that’s not a flaw. That’s the arrangement.

The real test for Springboards and Flint is whether that arrangement holds up outside a demo room. Can it keep giving people a meaningful choice, not just between one model and another, but between safe sameness and deliberate weirdness when the brief calls for a shake-up? That seems to be the bet. Not more AI for its own sake. Just a way to ask for the oddball on purpose, instead of waiting for it to happen by accident.
