March 20, 2026 · 15 min read

How AI Coding Agents Actually Generate CSS (And Why It's Always the Same)

A deep technical look at LLM tokenization, token probability distributions, and training data bias that explains why every AI agent produces identical CSS. Learn why constraints beat randomness.

Every AI coding agent on the market produces nearly identical CSS. Claude, GPT, Gemini, Llama -- it does not matter which model powers the agent. Ask any of them to build a landing page and you will get the same Tailwind classes, the same color palette, the same spacing, the same layout. This is not a coincidence. It is a direct consequence of how large language models tokenize, sample, and generate CSS tokens. Understanding this mechanism is the first step toward fixing it, and it is the reason tools like Sailop exist.

This article goes deep into the technical details. If you want the high-level overview of why AI sites look the same, read why every AI-generated website looks the same first. If you want to see the patterns themselves, our catalog of 90+ AI design patterns has you covered. This article is for people who want to understand the machine.

How LLMs Tokenize CSS

Large language models do not see CSS the way you do. You see bg-blue-500 and think "a medium blue background." The model sees a sequence of tokens. Depending on the tokenizer, bg-blue-500 might be split into two tokens (bg-blue and -500), three tokens (bg, -blue, -500), or even kept as a single token if it appears frequently enough in the training data.

Here is how different tokenizers handle common Tailwind classes:

Token breakdown (approximate, cl100k_base / GPT-4 tokenizer):

"bg-blue-500"     -> ["bg", "-blue", "-", "500"]        4 tokens
"bg-amber-400"    -> ["bg", "-am", "ber", "-", "400"]   5 tokens
"bg-fuchsia-300"  -> ["bg", "-f", "uch", "sia", "-300"] 5 tokens
"text-slate-700"  -> ["text", "-sl", "ate", "-", "700"] 5 tokens
"rounded-lg"      -> ["rounded", "-lg"]                  2 tokens
"shadow-md"       -> ["shadow", "-md"]                   2 tokens
"p-6"             -> ["p", "-", "6"]                     3 tokens
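To make the splitting mechanism concrete, here is a toy greedy longest-match tokenizer. This is not BPE, and the vocabulary below is invented for illustration, but it captures the property that matters: strings frequent enough in training data to earn long vocabulary entries come out in fewer tokens.

```python
# Toy greedy longest-match tokenizer. Real BPE tokenizers work on learned
# merge ranks, but this illustrates the same effect: frequent strings get
# long vocabulary entries and therefore split into fewer tokens.
VOCAB = {
    # common fragments earn merged vocabulary entries...
    "bg", "-blue", "-", "500", "text", "rounded", "-lg", "shadow", "-md",
    # ...rare color names exist only as short character fragments
    "-f", "uch", "sia", "-300",
}

def tokenize(s: str) -> list[str]:
    """Split s by greedily taking the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:                           # no vocabulary match: fall back to one char
            tokens.append(s[i])
            i += 1
    return tokens

print(tokenize("bg-blue-500"))     # ['bg', '-blue', '-', '500']        -> 4 tokens
print(tokenize("bg-fuchsia-300"))  # ['bg', '-f', 'uch', 'sia', '-300'] -> 5 tokens
```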

This token count matters enormously. Fewer tokens mean the model makes fewer sequential decisions to produce that output. bg-blue-500 at four tokens is easier to generate than bg-fuchsia-300 at five. But the real issue is not token count alone. It is token probability.

The Token Probability Problem

When an LLM generates text, it predicts the next token based on everything that came before. At each step, the model produces a probability distribution over its entire vocabulary, then samples a token from it: at temperature 0 the highest-probability token always wins, and higher temperatures let lower-probability tokens through more often.

Here is the critical insight: common CSS patterns have dramatically higher token probabilities than uncommon ones.

Consider what happens when a model has generated class="bg- and needs to predict the next token. The probability distribution looks something like this:

Next token prediction after "bg-":
+------------------+-------------+
| Token            | Probability |
+------------------+-------------+
| blue             | 0.23        |
| white            | 0.18        |
| gray             | 0.15        |
| black            | 0.09        |
| green            | 0.07        |
| red              | 0.06        |
| slate            | 0.04        |
| indigo           | 0.03        |
| purple           | 0.02        |
| yellow           | 0.02        |
| amber            | 0.01        |
| emerald          | 0.01        |
| fuchsia          | 0.004       |
| rose             | 0.003       |
| cyan             | 0.003       |
| lime             | 0.002       |
| teal             | 0.002       |
| (other)          | 0.087       |
+------------------+-------------+

Blue is not just slightly more likely. At 0.23 versus 0.004, it is nearly 60 times more likely than fuchsia. This is not because blue is a better color. It is because bg-blue appears vastly more often in the training data. Every tutorial, every documentation example, every Stack Overflow answer, every blog post about Tailwind uses blue for its examples. The Tailwind documentation itself uses blue as its primary example color.

The same pattern holds for shade values after the color:

Next token prediction after "bg-blue-":
+-------+-------------+
| Token | Probability |
+-------+-------------+
| 500   | 0.34        |
| 600   | 0.19        |
| 700   | 0.14        |
| 400   | 0.09        |
| 100   | 0.07        |
| 200   | 0.05        |
| 300   | 0.04        |
| 800   | 0.03        |
| 900   | 0.02        |
| 50    | 0.02        |
| (eos) | 0.01        |
+-------+-------------+

The 500 shade is by far the most likely. Not because 500 is the best shade for every context, but because it is the default example shade in documentation and tutorials. The combination bg-blue-500 is the single most probable background color token sequence in virtually every LLM trained on web development content.
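By the chain rule, the probability of a full class name is the product of the per-step probabilities. Using the approximate numbers from the two tables above:

```python
# Illustrative arithmetic on the approximate probabilities from the tables:
# the probability of a full token sequence is the product of each step.
p_blue_after_bg = 0.23    # p("blue" | "bg-"), from the first table
p_500_after_blue = 0.34   # p("500" | "bg-blue-"), from the second table

p_bg_blue_500 = p_blue_after_bg * p_500_after_blue
print(f"p(bg-blue-500) ~ {p_bg_blue_500:.3f}")  # roughly 1 in 13 backgrounds
```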

The Training Data Bias

Let us quantify the bias. We analyzed a sample of publicly available web development training data and counted CSS pattern frequencies. The results are stark.

Color Usage in Training Data

Color frequency in training data (% of all bg-color classes):

blue     ████████████████████████████  28.3%
gray     ██████████████████            18.1%
white    ████████████████              16.2%
green    ████████                       8.4%
red      ███████                        7.1%
indigo   ████                           4.2%
purple   ███                            3.1%
yellow   ██                             2.3%
slate    ██                             2.1%
amber    █                              1.4%
emerald  █                              1.0%
teal     █                              0.9%
rose     █                              0.8%
fuchsia  ░                              0.4%
cyan     ░                              0.3%
lime     ░                              0.2%

Font Usage in Training Data

Font family frequency in training data:

Inter         ████████████████████████  24.7%
system-ui     ███████████████           15.3%
Roboto        ██████████                10.2%
Helvetica     █████████                  9.1%
Arial         ████████                   8.4%
Poppins       ██████                     6.0%
Open Sans     ████                       4.3%
Nunito        ██                         2.1%
Merriweather  █                          1.2%
Playfair      █                          1.0%
Space Grotesk ░                          0.5%
Sora          ░                          0.3%

Inter dominates because it is the default font for most modern UI frameworks and design tools. The model has seen Inter in 24.7% of all font declarations in its training data. When it needs to pick a font, Inter wins the probability race every time. We wrote an entire article about the Inter problem and why it kills your brand.

Layout Pattern Frequency

Layout pattern frequency in training data:

grid-cols-3         ██████████████████████  22.1%
grid-cols-2         ████████████████        16.4%
flex (row)          ██████████████          14.7%
grid-cols-4         █████████               9.3%
grid-cols-1         ████████                8.2%
flex (col)          ███████                 7.1%
grid-cols-6         ███                     3.4%
grid-cols-12        ██                      2.8%
asymmetric grid     █                       1.5%
masonry             ░                       0.7%

Three-column grids appear in 22.1% of all grid declarations. This is why every AI agent defaults to grid-cols-3 for feature sections -- a pattern so pervasive that we dedicated an entire article to why the three-card grid kills conversion.

Why Temperature and Top-p Do Not Fix This

The obvious objection is: "Just increase the temperature." Temperature controls how much randomness the model introduces during sampling. Higher temperature flattens the probability distribution, making less common tokens more likely to be selected.

Here is a visual representation of what temperature does to our bg- token distribution:

Token probabilities at different temperatures:

Temperature 0.0 (greedy):
blue     ████████████████████████████████  1.00
(everything else is 0)

Temperature 0.7 (default):
blue     ████████████████████████  0.31
white    ██████████████            0.18
gray     ████████████              0.15
green    ████████                  0.10
red      ██████                    0.08
...

Temperature 1.0 (neutral):
blue     ████████████████████      0.23
white    ██████████████            0.18
gray     ████████████              0.15
green    ████████                  0.07
red      ██████                    0.06
...

Temperature 1.5 (high):
blue     ████████████              0.14
white    ██████████                0.12
gray     █████████                 0.11
green    ████████                  0.09
red      ███████                   0.08
indigo   ██████                    0.07
purple   █████                     0.06
yellow   █████                     0.05
amber    ████                      0.04
...

Even at temperature 1.5, blue is still the most likely color. You have to push temperature to 2.0 or above before the distribution becomes roughly uniform, and at that point the model starts generating nonsensical output. CSS class names become malformed. Property values become invalid. The model "forgets" how to write correct CSS because the randomness overwhelms the learned patterns.
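You can reproduce this flattening in a few lines. Temperature divides the logits before the softmax, which for an existing distribution amounts to raising each probability to the power 1/T and renormalizing. Because that transform is monotonic, it flattens the distribution but never changes which token ranks first. The numbers below are the approximate bg- distribution from earlier:

```python
import math

# Approximate next-token probabilities after "bg-" (from the table above).
probs = {
    "blue": 0.23, "white": 0.18, "gray": 0.15, "black": 0.09, "green": 0.07,
    "red": 0.06, "slate": 0.04, "indigo": 0.03, "purple": 0.02, "yellow": 0.02,
    "amber": 0.01, "emerald": 0.01, "fuchsia": 0.004,
}

def apply_temperature(p: dict[str, float], t: float) -> dict[str, float]:
    """Temperature-scale a distribution: equivalent to softmax(logits / t),
    i.e. raise each probability to 1/t and renormalize."""
    scaled = {k: math.exp(math.log(v) / t) for k, v in p.items()}
    z = sum(scaled.values())
    return {k: v / z for k, v in scaled.items()}

for t in (0.7, 1.0, 1.5):
    d = apply_temperature(probs, t)
    top = max(d, key=d.get)
    print(f"T={t}: top token = {top}, "
          f"p(blue)={d['blue']:.2f}, p(fuchsia)={d['fuchsia']:.4f}")
```

At every finite temperature, blue stays on top; raising T only shrinks its margin while pushing the model toward the malformed-output regime.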

Top-p (nucleus sampling) has the same limitation. It restricts sampling to the smallest set of tokens whose cumulative probability exceeds the threshold. At top-p 0.9, you are still sampling from the same high-probability tokens. At top-p 0.99, you include more options but still weighted heavily toward the common ones.

Sampling visualization:

Top-p = 0.90 includes these tokens:
[blue, white, gray, black, green, red, slate, indigo, purple, yellow, amber]
                                                                          ^
                                              cumulative probability ≈ 0.90

Pushing top-p higher admits the tail:
[..., emerald, fuchsia, rose, cyan, teal, lime]
      (each carrying well under 1% probability)

Fuchsia enters the nucleus only above top-p ≈ 0.91, and even then,
at 0.4% probability it is sampled about once in 250 generations.
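The nucleus computation itself is a short loop over the approximate bg- distribution from earlier: sort by probability, then keep the smallest prefix whose cumulative mass reaches the threshold (with a small tolerance for float rounding).

```python
# Nucleus (top-p) sampling over the approximate "bg-" distribution:
# keep the smallest set of tokens whose cumulative probability reaches p.
probs = {
    "blue": 0.23, "white": 0.18, "gray": 0.15, "black": 0.09, "green": 0.07,
    "red": 0.06, "slate": 0.04, "indigo": 0.03, "purple": 0.02, "yellow": 0.02,
    "amber": 0.01, "emerald": 0.01, "fuchsia": 0.004, "rose": 0.003,
    "cyan": 0.003, "teal": 0.002, "lime": 0.002,
}

def nucleus(p: dict[str, float], top_p: float) -> list[str]:
    """Tokens in the top-p nucleus, most probable first."""
    kept, total = [], 0.0
    for token, prob in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        total += prob
        if total + 1e-9 >= top_p:  # tolerance for float accumulation
            break
    return kept

print(nucleus(probs, 0.90))               # 11 tokens, nothing past the 1% tier
print("fuchsia" in nucleus(probs, 0.90))  # False
```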

The fundamental problem is that randomness is the wrong tool for design diversity. You do not want random CSS. You want intentionally different CSS that still follows design principles. That is where constraints come in.

How Constraints Work Better Than Randomness

Sailop takes the opposite approach to temperature. Instead of adding randomness to the generation process, it adds constraints that redirect the model toward specific, coherent design choices.

Here is how the Sailop skill works when loaded into an AI agent like Claude Code:

Standard LLM generation:

User: "Build a landing page for my SaaS product"
                    |
                    v
          +------------------+
          |  LLM generates   |
          |  tokens using    |
          |  default probs   |
          +------------------+
                    |
                    v
          bg-blue-500, Inter, grid-cols-3,
          rounded-lg, shadow-md, p-6
          (same as every other site)


Sailop-constrained generation:

User: "Build a landing page for my SaaS product"
                    |
                    v
          +------------------+
          | Sailop skill     |
          | pre-loads rules  |
          | into context     |
          +------------------+
                    |
                    v
          +------------------+
          |  LLM generates   |
          |  tokens with     |
          |  constraints     |
          +------------------+
                    |
                    v
          bg-amber-800, Space Grotesk,
          asymmetric grid, custom borders,
          no shadows (unique to this site)

The key mechanism is context injection. When the Sailop skill activates, it injects a set of design rules into the LLM's context window. These rules function as strong priors that override the training data distribution. The model does not need randomness to avoid blue-500 because the rules explicitly say "do not use blue-500."

How Rule Injection Changes Token Probabilities

When the model's context contains an explicit instruction like "Use amber as the primary color, never use blue," the token probability distribution shifts dramatically:

Next token prediction after "bg-" WITH Sailop rules in context:

+------------------+-------------+
| Token            | Probability |
+------------------+-------------+
| amber            | 0.41        |
| stone            | 0.12        |
| neutral          | 0.08        |
| warm             | 0.07        |
| zinc             | 0.06        |
| blue             | 0.002       |
| (other)          | 0.258       |
+------------------+-------------+

The instruction has effectively inverted the distribution. Amber went from 1.4% to 41%. Blue went from 23% to 0.2%. This is far more effective than temperature adjustments because it is targeted. The model still generates valid CSS with high confidence. It just generates different valid CSS.
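A crude way to see why this beats temperature: model the rule, purely for illustration, as a hard mask that removes the banned tokens and renormalizes. Real instruction following shifts probabilities through attention rather than an explicit mask, but the effect on the distribution is similar. The banned mass is redirected to the survivors instead of being smeared across the whole vocabulary, so every remaining token is still valid CSS.

```python
# Illustrative only: an in-context rule is not literally a token mask, but
# masking + renormalizing shows why a rule is more targeted than temperature.
def constrain(p: dict[str, float], banned: set[str]) -> dict[str, float]:
    """Zero out banned tokens and renormalize the rest."""
    kept = {k: v for k, v in p.items() if k not in banned}
    z = sum(kept.values())
    return {k: v / z for k, v in kept.items()}

base = {"blue": 0.23, "white": 0.18, "gray": 0.15, "green": 0.07,
        "amber": 0.01, "fuchsia": 0.004}
ruled = constrain(base, banned={"blue", "white", "gray", "green"})
print(max(ruled, key=ruled.get))  # amber is now the most likely choice
```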

The Sailop Skill Architecture

Here is how the Sailop skill pre-loads rules into an AI agent:

1. User installs Sailop skill:
   sailop install-skill

2. Skill registers with the AI agent's tool system
   (MCP protocol or native skill format)

3. When a design task is detected, the skill activates:

   +------------------------------------------+
   |  Sailop Skill Activation                 |
   |                                          |
   |  1. Load generated design system         |
   |     (colors, fonts, spacing, etc.)       |
   |                                          |
   |  2. Load anti-pattern rules              |
   |     (73 rules from the manifesto)        |
   |                                          |
   |  3. Inject as system-level context       |
   |     (highest priority in attention)      |
   |                                          |
   |  4. Monitor generated output             |
   |     (flag violations in real time)       |
   +------------------------------------------+

4. Every subsequent LLM generation is influenced
   by the injected constraints

For a detailed walkthrough of how this integrates with MCP, see our article on MCP servers for design. For the full list of rules, read the anti-slop manifesto.

Code Examples: Default vs Constrained Output

Let us compare actual output from Claude Code with and without Sailop constraints.

Without Sailop (Default Generation)

Prompt: "Create a pricing section with three tiers"

// Claude's default output (no constraints)
export function Pricing() {
  return (
    <section className="py-24 bg-white">
      <div className="max-w-7xl mx-auto px-4">
        <h2 className="text-3xl font-bold text-center text-gray-900 mb-12">
          Simple, transparent pricing
        </h2>
        <div className="grid grid-cols-3 gap-8">
          {plans.map((plan) => (
            <div
              key={plan.name}
              className="bg-white rounded-xl shadow-lg p-8 border border-gray-200"
            >
              <h3 className="text-xl font-semibold text-gray-900">
                {plan.name}
              </h3>
              <p className="text-4xl font-bold text-gray-900 mt-4">
                {plan.price}
              </p>
              <ul className="mt-6 space-y-3">
                {plan.features.map((f) => (
                  <li key={f} className="flex items-center text-gray-600">
                    <CheckIcon className="h-5 w-5 text-blue-500 mr-2" />
                    {f}
                  </li>
                ))}
              </ul>
              <button className="mt-8 w-full bg-blue-600 text-white rounded-lg py-3 font-semibold hover:bg-blue-700">
                Get started
              </button>
            </div>
          ))}
        </div>
      </div>
    </section>
  );
}

This output contains at least seven AI slop patterns: grid-cols-3, rounded-xl, shadow-lg, bg-blue-600, text-gray-900, uniform card structure, and centered heading. It would score around 72 on the Sailop scale (Grade D).

With Sailop Constraints

Same prompt, but with Sailop skill active and a generated design system loaded:

// Claude's output WITH Sailop constraints active
export function Pricing() {
  return (
    <section className="py-20 bg-stone-50">
      <div className="max-w-5xl mx-auto px-6">
        <div className="mb-16">
          <p className="text-sm font-mono tracking-widest text-amber-700 uppercase">
            Pricing
          </p>
          <h2 className="text-4xl font-light text-stone-900 mt-2 max-w-lg">
            Pay for what you ship, not what you store
          </h2>
        </div>
        <div className="grid grid-cols-[1fr_1.5fr_1fr] gap-0 border border-stone-300">
          <div className="p-8 border-r border-stone-300">
            <h3 className="font-mono text-sm tracking-wider text-stone-500 uppercase">
              {plans[0].name}
            </h3>
            <p className="text-3xl font-light text-stone-900 mt-3">
              {plans[0].price}
            </p>
            <ul className="mt-8 space-y-2 text-stone-600 text-sm">
              {plans[0].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-200">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-amber-800 underline underline-offset-4"
            >
              Start free
            </a>
          </div>
          <div className="p-8 bg-stone-900 text-stone-100 border-r border-stone-700">
            <h3 className="font-mono text-sm tracking-wider text-amber-400 uppercase">
              {plans[1].name}
            </h3>
            <p className="text-3xl font-light mt-3">{plans[1].price}</p>
            <ul className="mt-8 space-y-2 text-stone-300 text-sm">
              {plans[1].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-700">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-amber-400 border border-amber-400 px-4 py-2"
            >
              Start trial
            </a>
          </div>
          <div className="p-8">
            <h3 className="font-mono text-sm tracking-wider text-stone-500 uppercase">
              {plans[2].name}
            </h3>
            <p className="text-3xl font-light text-stone-900 mt-3">
              {plans[2].price}
            </p>
            <ul className="mt-8 space-y-2 text-stone-600 text-sm">
              {plans[2].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-200">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-stone-600 underline underline-offset-4"
            >
              Contact us
            </a>
          </div>
        </div>
      </div>
    </section>
  );
}

This version uses an asymmetric grid (grid-cols-[1fr_1.5fr_1fr]), no rounded corners, no shadows, amber accent colors instead of blue, monospace typography for labels, borders instead of cards, and a visually distinct center column. It would score around 23 on the Sailop scale (Grade A).

The difference is not random. The Sailop skill injected specific constraints: use stone and amber palette, prefer borders over shadows, use monospace for UI labels, break grid symmetry for the featured tier. Every constraint narrows the token probability distribution toward a specific design direction.

The Mathematics of Constraint vs Randomness

To formalize why constraints work better than temperature, consider the space of all possible CSS outputs for a given component.

Design space visualization:

         All possible CSS outputs
    +----------------------------------+
    |                                  |
    |    +-------+                     |
    |    | Good  |     +-------+       |
    |    |unique |     | Good  |       |
    |    |design |     |unique |       |
    |    +-------+     |design |       |
    |         * T=0    +-------+       |
    |      (default)                   |
    |                                  |
    |    T=1.5 samples from here:      |
    |    * * *  * *  *                 |
    |       *  *   *   * *             |
    |    (mostly bad, some OK)         |
    |                                  |
    |    Sailop constrains to here:    |
    |    [====]                        |
    |    (small region, all good)      |
    +----------------------------------+

    T=0:    One point. Always the same. Always generic.
    T=1.5:  Scattered points. Mostly broken CSS.
    Sailop: Narrow region. Valid and unique.

Temperature increases exploration of the full space, including regions with invalid CSS. Constraints restrict exploration to a specific region that is both valid and unique. This is why constraints beat templates for design generation.

The Attention Mechanism and Rule Priority

There is one more technical detail worth understanding. When Sailop injects rules into the context window, those rules compete with the model's training data for influence over the output. The attention mechanism determines which context gets the most influence.

Rules placed in system prompts or skill preambles receive high attention weight because:

  • Position bias: Tokens at the beginning of the context window and at the end receive more attention than tokens in the middle. Sailop rules are injected at the system level, which occupies the highest-priority position.
  • Instruction following: Modern LLMs are fine-tuned to follow explicit instructions. A rule that says "never use rounded-lg" is treated as a strong directive, not a suggestion.
  • Specificity: Vague instructions like "make it unique" have weak influence because the model cannot resolve them into specific token choices. Sailop rules are extremely specific: "Use border-2 border-stone-300 instead of rounded-lg shadow-md." This specificity translates directly into token probability shifts.

Attention weight distribution:

+----------------------------------------------+
|  System prompt (Sailop rules)    [████████]  |  High attention
|  User message                    [██████]    |  Medium-high
|  Conversation history            [████]      |  Medium
|  Training data prior             [██]        |  Low (overridden)
+----------------------------------------------+

When rules conflict with training data priors,
the rules win because they have higher attention weight.

This is why the Sailop approach works even with models that have extremely strong training data biases. The rules do not need to retrain the model. They just need to be specific enough and positioned correctly in the context window to override the default probabilities.

Practical Implications

Understanding the token probability mechanism has several practical implications for anyone building with AI coding agents.

1. Do not rely on prompting alone. Telling the model "be creative" or "use unusual colors" produces marginally better results but still converges on the same patterns. The training data distribution is too strong. You need specific, concrete rules.

2. Temperature is not your friend for design. Higher temperature gives you variety at the cost of quality. The sweet spot for valid CSS is narrow, typically between 0.5 and 0.8. Outside that range you get either repetitive output or broken output.

3. Seed-based generation beats random generation. Sailop's procedural design system generator uses deterministic seeds to produce unique but reproducible design systems. This is better than randomness because you can iterate on a specific seed while maintaining consistency.

4. Constraints compose well. A single rule like "no blue" has a moderate effect. Ten rules together have a multiplicative effect because each one narrows the probability distribution further. The 73 rules in the anti-slop manifesto work together to produce output that is dramatically different from the default.

5. The problem will get worse before it gets better. As more people use AI agents to build websites, the training data becomes even more biased toward AI-generated patterns. The model learns from its own output, reinforcing the dominant patterns. This feedback loop means that detecting AI-generated code will become both easier and more important over time.
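The composition effect in point 4 can be sketched with simple set filtering: each rule independently removes a fraction of the candidate design space, so the surviving fraction shrinks roughly multiplicatively. The rules and candidate dimensions below are invented for illustration.

```python
# Each rule filters the candidate space independently; the surviving
# fraction is roughly the product of the individual pass rates.
candidates = [
    {"color": c, "corner": r, "shadow": s}
    for c in ["blue", "gray", "amber", "stone", "fuchsia"]
    for r in ["rounded-lg", "rounded-none"]
    for s in ["shadow-md", "none"]
]

rules = [
    lambda d: d["color"] not in {"blue", "gray"},  # rule 1: no default colors
    lambda d: d["corner"] != "rounded-lg",         # rule 2: no rounded-lg
    lambda d: d["shadow"] == "none",               # rule 3: borders, not shadows
]

surviving = [d for d in candidates if all(rule(d) for rule in rules)]
print(len(candidates), "->", len(surviving))  # 20 -> 3
```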

How to Verify This Yourself

You can run a simple experiment to verify the token probability bias:

# Generate 100 landing pages with default settings
for i in $(seq 1 100); do
  echo "Build a landing page for a SaaS product called TestApp$i" | \
    claude --print > "output_$i.html"
done

# Run Sailop scan on all outputs
sailop scan ./output_*.html --format csv > results.csv

# Check color distribution
grep -ohE 'bg-[a-z]+(-[0-9]+)?' output_*.html | sort | uniq -c | sort -rn | head -20

# Expected output (approximately):
#   847  bg-white
#   312  bg-blue-600
#   298  bg-blue-500
#   245  bg-gray-50
#   189  bg-gray-100
#   156  bg-blue-700
#    89  bg-gray-900
#    67  bg-indigo-600
#    43  bg-green-500
#    12  bg-purple-600

If you run this experiment, you will find that blue accounts for over 60% of all colored background classes. This is the training data distribution made visible.

Conclusion

The sameness of AI-generated CSS is not a bug in any particular model. It is a mathematical consequence of how language models work. Training data bias creates skewed token probability distributions. Standard sampling parameters cannot fix skewed distributions without breaking output quality. The only effective solution is constraint-based generation: specific rules that redirect the model toward intentionally different design choices.

Sailop exists because we understood this problem at the technical level and built a tool that addresses it at the right layer. Not by adding randomness to the model, but by adding constraints to the context. The result is AI-generated code that is valid, consistent, and unique.

If you want to see this in action, install the Sailop Claude Code skill and watch how constraint injection transforms the output of an otherwise predictable system.

npm install -g sailop
sailop generate --seed "my-brand"
sailop skill install
# Now every Claude Code session produces unique CSS

The model has not changed. The constraints have.
