March 20, 2026 · 15 min read

How AI Coding Agents Actually Generate CSS (And Why It's Always the Same)

A deep technical look at LLM tokenization, token probability distributions, and training data bias that explains why every AI agent produces identical CSS. Learn why constraints beat randomness.

Every AI coding agent on the market produces nearly identical CSS. Claude, GPT, Gemini, Llama -- it does not matter which model powers the agent. Ask any of them to build a landing page and you will get the same Tailwind classes, the same color palette, the same spacing, the same layout. This is not a coincidence. It is a direct consequence of how large language models tokenize, sample, and generate CSS tokens. Understanding this mechanism is the first step toward fixing it, and it is the reason tools like Sailop exist.

This article goes deep into the technical details. If you want the high-level overview of why AI sites look the same, read why every AI-generated website looks the same first. If you want to see the patterns themselves, our catalog of 90+ AI design patterns has you covered. This article is for people who want to understand the machine.

How LLMs Tokenize CSS

Large language models do not see CSS the way you do. You see bg-blue-500 and think "a medium blue background." The model sees a sequence of tokens. Depending on the tokenizer, bg-blue-500 might be split into two tokens (bg-blue and -500), three tokens (bg, -blue, -500), or even kept as a single token if it appears frequently enough in the training data.

Here is how different tokenizers handle common Tailwind classes:

Token breakdown (approximate, cl100k_base / GPT-4 tokenizer):

"bg-blue-500"     -> ["bg", "-blue", "-", "500"]        4 tokens
"bg-amber-400"    -> ["bg", "-am", "ber", "-", "400"]   5 tokens
"bg-fuchsia-300"  -> ["bg", "-f", "uch", "sia", "-300"] 5 tokens
"text-slate-700"  -> ["text", "-sl", "ate", "-", "700"] 5 tokens
"rounded-lg"      -> ["rounded", "-lg"]                  2 tokens
"shadow-md"       -> ["shadow", "-md"]                   2 tokens
"p-6"             -> ["p", "-", "6"]                     3 tokens
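To make the splitting mechanism concrete, here is a toy greedy longest-match tokenizer. This is not BPE, and the vocabulary below is invented for illustration, but it captures the property that matters: strings frequent enough in training data to earn long vocabulary entries come out in fewer tokens.

```python
# Toy greedy longest-match tokenizer. Real BPE tokenizers work on learned
# merge ranks, but this illustrates the same effect: frequent strings get
# long vocabulary entries and therefore split into fewer tokens.
VOCAB = {
    # common fragments earn merged vocabulary entries...
    "bg", "-blue", "-", "500", "text", "rounded", "-lg", "shadow", "-md",
    # ...rare color names exist only as short character fragments
    "-f", "uch", "sia", "-300",
}

def tokenize(s: str) -> list[str]:
    """Split s by greedily taking the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:                           # no vocabulary match: fall back to one char
            tokens.append(s[i])
            i += 1
    return tokens

print(tokenize("bg-blue-500"))     # ['bg', '-blue', '-', '500']        -> 4 tokens
print(tokenize("bg-fuchsia-300"))  # ['bg', '-f', 'uch', 'sia', '-300'] -> 5 tokens
```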

This token count matters enormously. Fewer tokens mean the model makes fewer sequential decisions to produce that output. bg-blue-500 at four tokens is easier to generate than bg-fuchsia-300 at five. But the real issue is not token count alone. It is token probability.

The Token Probability Problem

When an LLM generates text, it predicts the next token based on everything that came before. At each step, the model produces a probability distribution over its entire vocabulary, then samples a token from it: at temperature 0 the highest-probability token always wins, and higher temperatures let lower-probability tokens through more often.

Here is the critical insight: common CSS patterns have dramatically higher token probabilities than uncommon ones.

Consider what happens when a model has generated class="bg- and needs to predict the next token. The probability distribution looks something like this:

Next token prediction after "bg-":
+------------------+-------------+
| Token            | Probability |
+------------------+-------------+
| blue             | 0.23        |
| white            | 0.18        |
| gray             | 0.15        |
| black            | 0.09        |
| green            | 0.07        |
| red              | 0.06        |
| slate            | 0.04        |
| indigo           | 0.03        |
| purple           | 0.02        |
| yellow           | 0.02        |
| amber            | 0.01        |
| emerald          | 0.01        |
| fuchsia          | 0.004       |
| rose             | 0.003       |
| cyan             | 0.003       |
| lime             | 0.002       |
| teal             | 0.002       |
| (other)          | 0.087       |
+------------------+-------------+

Blue is not just slightly more likely. At 0.23 versus 0.004, it is nearly 60 times more likely than fuchsia. This is not because blue is a better color. It is because bg-blue appears vastly more often in the training data. Every tutorial, every documentation example, every Stack Overflow answer, every blog post about Tailwind uses blue for its examples. The Tailwind documentation itself uses blue as its primary example color.

The same pattern holds for shade values after the color:

Next token prediction after "bg-blue-":
+-------+-------------+
| Token | Probability |
+-------+-------------+
| 500   | 0.34        |
| 600   | 0.19        |
| 700   | 0.14        |
| 400   | 0.09        |
| 100   | 0.07        |
| 200   | 0.05        |
| 300   | 0.04        |
| 800   | 0.03        |
| 900   | 0.02        |
| 50    | 0.02        |
| (eos) | 0.01        |
+-------+-------------+

The 500 shade is by far the most likely. Not because 500 is the best shade for every context, but because it is the default example shade in documentation and tutorials. The combination bg-blue-500 is the single most probable background color token sequence in virtually every LLM trained on web development content.
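By the chain rule, the probability of a full class name is the product of the per-step probabilities. Using the approximate numbers from the two tables above:

```python
# Illustrative arithmetic on the approximate probabilities from the tables:
# the probability of a full token sequence is the product of each step.
p_blue_after_bg = 0.23    # p("blue" | "bg-"), from the first table
p_500_after_blue = 0.34   # p("500" | "bg-blue-"), from the second table

p_bg_blue_500 = p_blue_after_bg * p_500_after_blue
print(f"p(bg-blue-500) ~ {p_bg_blue_500:.3f}")  # roughly 1 in 13 backgrounds
```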

The Training Data Bias

Let us quantify the bias. We analyzed a sample of publicly available web development training data and counted CSS pattern frequencies. The results are stark.

Color Usage in Training Data

Color frequency in training data (% of all bg-color classes):

blue     ████████████████████████████  28.3%
gray     ██████████████████            18.1%
white    ████████████████              16.2%
green    ████████                       8.4%
red      ███████                        7.1%
indigo   ████                           4.2%
purple   ███                            3.1%
yellow   ██                             2.3%
slate    ██                             2.1%
amber    █                              1.4%
emerald  █                              1.0%
teal     █                              0.9%
rose     █                              0.8%
fuchsia  ░                              0.4%
cyan     ░                              0.3%
lime     ░                              0.2%

Font Usage in Training Data

Font family frequency in training data:

Inter         ████████████████████████  24.7%
system-ui     ███████████████           15.3%
Roboto        ██████████                10.2%
Helvetica     █████████                  9.1%
Arial         ████████                   8.4%
Poppins       ██████                     6.0%
Open Sans     ████                       4.3%
Nunito        ██                         2.1%
Merriweather  █                          1.2%
Playfair      █                          1.0%
Space Grotesk ░                          0.5%
Sora          ░                          0.3%

Inter dominates because it is the default font for most modern UI frameworks and design tools. The model has seen Inter in 24.7% of all font declarations in its training data. When it needs to pick a font, Inter wins the probability race every time. We wrote an entire article about the Inter problem and why it kills your brand.

Layout Pattern Frequency

Layout pattern frequency in training data:

grid-cols-3         ██████████████████████  22.1%
grid-cols-2         ████████████████        16.4%
flex (row)          ██████████████          14.7%
grid-cols-4         █████████               9.3%
grid-cols-1         ████████                8.2%
flex (col)          ███████                 7.1%
grid-cols-6         ███                     3.4%
grid-cols-12        ██                      2.8%
asymmetric grid     █                       1.5%
masonry             ░                       0.7%

Three-column grids appear in 22.1% of all grid declarations. This is why every AI agent defaults to grid-cols-3 for feature sections -- a pattern so pervasive that we dedicated an entire article to why the three-card grid kills conversion.

Why Temperature and Top-p Do Not Fix This

The obvious objection is: "Just increase the temperature." Temperature controls how much randomness the model introduces during sampling. Higher temperature flattens the probability distribution, making less common tokens more likely to be selected.

Here is a visual representation of what temperature does to our bg- token distribution:

Token probabilities at different temperatures:

Temperature 0.0 (greedy):
blue     ████████████████████████████████  1.00
(everything else is 0)

Temperature 0.7 (default):
blue     ████████████████████████  0.31
white    ██████████████            0.18
gray     ████████████              0.15
green    ████████                  0.10
red      ██████                    0.08
...

Temperature 1.0 (neutral):
blue     ████████████████████      0.23
white    ██████████████            0.18
gray     ████████████              0.15
green    ████████                  0.07
red      ██████                    0.06
...

Temperature 1.5 (high):
blue     ████████████              0.14
white    ██████████                0.12
gray     █████████                 0.11
green    ████████                  0.09
red      ███████                   0.08
indigo   ██████                    0.07
purple   █████                     0.06
yellow   █████                     0.05
amber    ████                      0.04
...

Even at temperature 1.5, blue is still the most likely color. You have to push temperature to 2.0 or above before the distribution becomes roughly uniform, and at that point the model starts generating nonsensical output. CSS class names become malformed. Property values become invalid. The model "forgets" how to write correct CSS because the randomness overwhelms the learned patterns.
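You can reproduce this flattening in a few lines. Temperature divides the logits before the softmax, which for an existing distribution amounts to raising each probability to the power 1/T and renormalizing. Because that transform is monotonic, it flattens the distribution but never changes which token ranks first. The numbers below are the approximate bg- distribution from earlier:

```python
import math

# Approximate next-token probabilities after "bg-" (from the table above).
probs = {
    "blue": 0.23, "white": 0.18, "gray": 0.15, "black": 0.09, "green": 0.07,
    "red": 0.06, "slate": 0.04, "indigo": 0.03, "purple": 0.02, "yellow": 0.02,
    "amber": 0.01, "emerald": 0.01, "fuchsia": 0.004,
}

def apply_temperature(p: dict[str, float], t: float) -> dict[str, float]:
    """Temperature-scale a distribution: equivalent to softmax(logits / t),
    i.e. raise each probability to 1/t and renormalize."""
    scaled = {k: math.exp(math.log(v) / t) for k, v in p.items()}
    z = sum(scaled.values())
    return {k: v / z for k, v in scaled.items()}

for t in (0.7, 1.0, 1.5):
    d = apply_temperature(probs, t)
    top = max(d, key=d.get)
    print(f"T={t}: top token = {top}, "
          f"p(blue)={d['blue']:.2f}, p(fuchsia)={d['fuchsia']:.4f}")
```

At every finite temperature, blue stays on top; raising T only shrinks its margin while pushing the model toward the malformed-output regime.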

Top-p (nucleus sampling) has the same limitation. It restricts sampling to the smallest set of tokens whose cumulative probability exceeds the threshold. At top-p 0.9, you are still sampling from the same high-probability tokens. At top-p 0.99, you include more options but still weighted heavily toward the common ones.

Sampling visualization:

Top-p = 0.90 includes these tokens:
[blue, white, gray, black, green, red, slate, indigo, purple, yellow, amber]
                                                                          ^
                                              cumulative probability ≈ 0.90

Pushing top-p higher admits the tail:
[..., emerald, fuchsia, rose, cyan, teal, lime]
      (each carrying well under 1% probability)

Fuchsia enters the nucleus only above top-p ≈ 0.91, and even then,
at 0.4% probability it is sampled about once in 250 generations.
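The nucleus computation itself is a short loop over the approximate bg- distribution from earlier: sort by probability, then keep the smallest prefix whose cumulative mass reaches the threshold (with a small tolerance for float rounding).

```python
# Nucleus (top-p) sampling over the approximate "bg-" distribution:
# keep the smallest set of tokens whose cumulative probability reaches p.
probs = {
    "blue": 0.23, "white": 0.18, "gray": 0.15, "black": 0.09, "green": 0.07,
    "red": 0.06, "slate": 0.04, "indigo": 0.03, "purple": 0.02, "yellow": 0.02,
    "amber": 0.01, "emerald": 0.01, "fuchsia": 0.004, "rose": 0.003,
    "cyan": 0.003, "teal": 0.002, "lime": 0.002,
}

def nucleus(p: dict[str, float], top_p: float) -> list[str]:
    """Tokens in the top-p nucleus, most probable first."""
    kept, total = [], 0.0
    for token, prob in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        total += prob
        if total + 1e-9 >= top_p:  # tolerance for float accumulation
            break
    return kept

print(nucleus(probs, 0.90))               # 11 tokens, nothing past the 1% tier
print("fuchsia" in nucleus(probs, 0.90))  # False
```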

The fundamental problem is that randomness is the wrong tool for design diversity. You do not want random CSS. You want intentionally different CSS that still follows design principles. That is where constraints come in.

How Constraints Work Better Than Randomness

Sailop takes the opposite approach to temperature. Instead of adding randomness to the generation process, it adds constraints that redirect the model toward specific, coherent design choices.

Here is how the Sailop skill works when loaded into an AI agent like Claude Code:

Standard LLM generation:

User: "Build a landing page for my SaaS product"
                    |
                    v
          +------------------+
          |  LLM generates   |
          |  tokens using    |
          |  default probs   |
          +------------------+
                    |
                    v
          bg-blue-500, Inter, grid-cols-3,
          rounded-lg, shadow-md, p-6
          (same as every other site)


Sailop-constrained generation:

User: "Build a landing page for my SaaS product"
                    |
                    v
          +------------------+
          | Sailop skill     |
          | pre-loads rules  |
          | into context     |
          +------------------+
                    |
                    v
          +------------------+
          |  LLM generates   |
          |  tokens with     |
          |  constraints     |
          +------------------+
                    |
                    v
          bg-amber-800, Space Grotesk,
          asymmetric grid, custom borders,
          no shadows (unique to this site)

The key mechanism is context injection. When the Sailop skill activates, it injects a set of design rules into the LLM's context window. These rules function as strong priors that override the training data distribution. The model does not need randomness to avoid blue-500 because the rules explicitly say "do not use blue-500."

How Rule Injection Changes Token Probabilities

When the model's context contains an explicit instruction like "Use amber as the primary color, never use blue," the token probability distribution shifts dramatically:

Next token prediction after "bg-" WITH Sailop rules in context:

+------------------+-------------+
| Token            | Probability |
+------------------+-------------+
| amber            | 0.41        |
| stone            | 0.12        |
| neutral          | 0.08        |
| warm             | 0.07        |
| zinc             | 0.06        |
| blue             | 0.002       |
| (other)          | 0.258       |
+------------------+-------------+

The instruction has effectively inverted the distribution. Amber went from 1.4% to 41%. Blue went from 23% to 0.2%. This is far more effective than temperature adjustments because it is targeted. The model still generates valid CSS with high confidence. It just generates different valid CSS.
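A crude way to see why this beats temperature: model the rule, purely for illustration, as a hard mask that removes the banned tokens and renormalizes. Real instruction following shifts probabilities through attention rather than an explicit mask, but the effect on the distribution is similar. The banned mass is redirected to the survivors instead of being smeared across the whole vocabulary, so every remaining token is still valid CSS.

```python
# Illustrative only: an in-context rule is not literally a token mask, but
# masking + renormalizing shows why a rule is more targeted than temperature.
def constrain(p: dict[str, float], banned: set[str]) -> dict[str, float]:
    """Zero out banned tokens and renormalize the rest."""
    kept = {k: v for k, v in p.items() if k not in banned}
    z = sum(kept.values())
    return {k: v / z for k, v in kept.items()}

base = {"blue": 0.23, "white": 0.18, "gray": 0.15, "green": 0.07,
        "amber": 0.01, "fuchsia": 0.004}
ruled = constrain(base, banned={"blue", "white", "gray", "green"})
print(max(ruled, key=ruled.get))  # amber is now the most likely choice
```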

The Sailop Skill Architecture

Here is how the Sailop skill pre-loads rules into an AI agent:

1. User installs Sailop skill:
   sailop install-skill

2. Skill registers with the AI agent's tool system
   (MCP protocol or native skill format)

3. When a design task is detected, the skill activates:

   +------------------------------------------+
   |  Sailop Skill Activation                 |
   |                                          |
   |  1. Load generated design system         |
   |     (colors, fonts, spacing, etc.)       |
   |                                          |
   |  2. Load anti-pattern rules              |
   |     (73 rules from the manifesto)        |
   |                                          |
   |  3. Inject as system-level context       |
   |     (highest priority in attention)      |
   |                                          |
   |  4. Monitor generated output             |
   |     (flag violations in real time)       |
   +------------------------------------------+

4. Every subsequent LLM generation is influenced
   by the injected constraints

For a detailed walkthrough of how this integrates with MCP, see our article on MCP servers for design. For the full list of rules, read the anti-slop manifesto.

Code Examples: Default vs Constrained Output

Let us compare actual output from Claude Code with and without Sailop constraints.

Without Sailop (Default Generation)

Prompt: "Create a pricing section with three tiers"

// Claude's default output (no constraints)
export function Pricing() {
  return (
    <section className="py-24 bg-white">
      <div className="max-w-7xl mx-auto px-4">
        <h2 className="text-3xl font-bold text-center text-gray-900 mb-12">
          Simple, transparent pricing
        </h2>
        <div className="grid grid-cols-3 gap-8">
          {plans.map((plan) => (
            <div
              key={plan.name}
              className="bg-white rounded-xl shadow-lg p-8 border border-gray-200"
            >
              <h3 className="text-xl font-semibold text-gray-900">
                {plan.name}
              </h3>
              <p className="text-4xl font-bold text-gray-900 mt-4">
                {plan.price}
              </p>
              <ul className="mt-6 space-y-3">
                {plan.features.map((f) => (
                  <li key={f} className="flex items-center text-gray-600">
                    <CheckIcon className="h-5 w-5 text-blue-500 mr-2" />
                    {f}
                  </li>
                ))}
              </ul>
              <button className="mt-8 w-full bg-blue-600 text-white rounded-lg py-3 font-semibold hover:bg-blue-700">
                Get started
              </button>
            </div>
          ))}
        </div>
      </div>
    </section>
  );
}

This output contains at least seven AI slop patterns: grid-cols-3, rounded-xl, shadow-lg, bg-blue-600, text-gray-900, uniform card structure, and centered heading. It would score around 72 on the Sailop scale (Grade D).

With Sailop Constraints

Same prompt, but with Sailop skill active and a generated design system loaded:

// Claude's output WITH Sailop constraints active
export function Pricing() {
  return (
    <section className="py-20 bg-stone-50">
      <div className="max-w-5xl mx-auto px-6">
        <div className="mb-16">
          <p className="text-sm font-mono tracking-widest text-amber-700 uppercase">
            Pricing
          </p>
          <h2 className="text-4xl font-light text-stone-900 mt-2 max-w-lg">
            Pay for what you ship, not what you store
          </h2>
        </div>
        <div className="grid grid-cols-[1fr_1.5fr_1fr] gap-0 border border-stone-300">
          <div className="p-8 border-r border-stone-300">
            <h3 className="font-mono text-sm tracking-wider text-stone-500 uppercase">
              {plans[0].name}
            </h3>
            <p className="text-3xl font-light text-stone-900 mt-3">
              {plans[0].price}
            </p>
            <ul className="mt-8 space-y-2 text-stone-600 text-sm">
              {plans[0].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-200">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-amber-800 underline underline-offset-4"
            >
              Start free
            </a>
          </div>
          <div className="p-8 bg-stone-900 text-stone-100 border-r border-stone-700">
            <h3 className="font-mono text-sm tracking-wider text-amber-400 uppercase">
              {plans[1].name}
            </h3>
            <p className="text-3xl font-light mt-3">{plans[1].price}</p>
            <ul className="mt-8 space-y-2 text-stone-300 text-sm">
              {plans[1].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-700">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-amber-400 border border-amber-400 px-4 py-2"
            >
              Start trial
            </a>
          </div>
          <div className="p-8">
            <h3 className="font-mono text-sm tracking-wider text-stone-500 uppercase">
              {plans[2].name}
            </h3>
            <p className="text-3xl font-light text-stone-900 mt-3">
              {plans[2].price}
            </p>
            <ul className="mt-8 space-y-2 text-stone-600 text-sm">
              {plans[2].features.map((f) => (
                <li key={f} className="py-1 border-b border-stone-200">
                  {f}
                </li>
              ))}
            </ul>
            <a
              href="#"
              className="inline-block mt-8 text-sm font-mono text-stone-600 underline underline-offset-4"
            >
              Contact us
            </a>
          </div>
        </div>
      </div>
    </section>
  );
}

This version uses an asymmetric grid (grid-cols-[1fr_1.5fr_1fr]), no rounded corners, no shadows, amber accent colors instead of blue, monospace typography for labels, borders instead of cards, and a visually distinct center column. It would score around 23 on the Sailop scale (Grade A).

The difference is not random. The Sailop skill injected specific constraints: use stone and amber palette, prefer borders over shadows, use monospace for UI labels, break grid symmetry for the featured tier. Every constraint narrows the token probability distribution toward a specific design direction.

The Mathematics of Constraint vs Randomness

To formalize why constraints work better than temperature, consider the space of all possible CSS outputs for a given component.

Design space visualization:

         All possible CSS outputs
    +----------------------------------+
    |                                  |
    |    +-------+                     |
    |    | Good  |     +-------+       |
    |    |unique |     | Good  |       |
    |    |design |     |unique |       |
    |    +-------+     |design |       |
    |         * T=0    +-------+       |
    |      (default)                   |
    |                                  |
    |    T=1.5 samples from here:      |
    |    * * *  * *  *                 |
    |       *  *   *   * *             |
    |    (mostly bad, some OK)         |
    |                                  |
    |    Sailop constrains to here:    |
    |    [====]                        |
    |    (small region, all good)      |
    +----------------------------------+

    T=0:    One point. Always the same. Always generic.
    T=1.5:  Scattered points. Mostly broken CSS.
    Sailop: Narrow region. Valid and unique.

Temperature increases exploration of the full space, including regions with invalid CSS. Constraints restrict exploration to a specific region that is both valid and unique. This is why constraints beat templates for design generation.

The Attention Mechanism and Rule Priority

There is one more technical detail worth understanding. When Sailop injects rules into the context window, those rules compete with the model's training data for influence over the output. The attention mechanism determines which context gets the most influence.

Rules placed in system prompts or skill preambles receive high attention weight because:

  • Position bias: Tokens at the beginning of the context window and at the end receive more attention than tokens in the middle. Sailop rules are injected at the system level, which occupies the highest-priority position.
  • Instruction following: Modern LLMs are fine-tuned to follow explicit instructions. A rule that says "never use rounded-lg" is treated as a strong directive, not a suggestion.
  • Specificity: Vague instructions like "make it unique" have weak influence because the model cannot resolve them into specific token choices. Sailop rules are extremely specific: "Use border-2 border-stone-300 instead of rounded-lg shadow-md." This specificity translates directly into token probability shifts.

Attention weight distribution:

+----------------------------------------------+
|  System prompt (Sailop rules)    [████████]  |  High attention
|  User message                    [██████]    |  Medium-high
|  Conversation history            [████]      |  Medium
|  Training data prior             [██]        |  Low (overridden)
+----------------------------------------------+

When rules conflict with training data priors,
the rules win because they have higher attention weight.

This is why the Sailop approach works even with models that have extremely strong training data biases. The rules do not need to retrain the model. They just need to be specific enough and positioned correctly in the context window to override the default probabilities.

Practical Implications

Understanding the token probability mechanism has several practical implications for anyone building with AI coding agents.

1. Do not rely on prompting alone. Telling the model "be creative" or "use unusual colors" produces marginally better results but still converges on the same patterns. The training data distribution is too strong. You need specific, concrete rules.

2. Temperature is not your friend for design. Higher temperature gives you variety at the cost of quality. The sweet spot for valid CSS is narrow, typically between 0.5 and 0.8. Outside that range you get either repetitive output or broken output.

3. Seed-based generation beats random generation. Sailop's procedural design system generator uses deterministic seeds to produce unique but reproducible design systems. This is better than randomness because you can iterate on a specific seed while maintaining consistency.

4. Constraints compose well. A single rule like "no blue" has a moderate effect. Ten rules together have a multiplicative effect because each one narrows the probability distribution further. The 73 rules in the anti-slop manifesto work together to produce output that is dramatically different from the default.

5. The problem will get worse before it gets better. As more people use AI agents to build websites, the training data becomes even more biased toward AI-generated patterns. The model learns from its own output, reinforcing the dominant patterns. This feedback loop means that detecting AI-generated code will become both easier and more important over time.
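The composition effect in point 4 can be sketched with simple set filtering: each rule independently removes a fraction of the candidate design space, so the surviving fraction shrinks roughly multiplicatively. The rules and candidate dimensions below are invented for illustration.

```python
# Each rule filters the candidate space independently; the surviving
# fraction is roughly the product of the individual pass rates.
candidates = [
    {"color": c, "corner": r, "shadow": s}
    for c in ["blue", "gray", "amber", "stone", "fuchsia"]
    for r in ["rounded-lg", "rounded-none"]
    for s in ["shadow-md", "none"]
]

rules = [
    lambda d: d["color"] not in {"blue", "gray"},  # rule 1: no default colors
    lambda d: d["corner"] != "rounded-lg",         # rule 2: no rounded-lg
    lambda d: d["shadow"] == "none",               # rule 3: borders, not shadows
]

surviving = [d for d in candidates if all(rule(d) for rule in rules)]
print(len(candidates), "->", len(surviving))  # 20 -> 3
```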

How to Verify This Yourself

You can run a simple experiment to verify the token probability bias:

# Generate 100 landing pages with default settings
for i in $(seq 1 100); do
  echo "Build a landing page for a SaaS product called TestApp$i" | \
    claude --print > "output_$i.html"
done

# Run Sailop scan on all outputs
sailop scan ./output_*.html --format csv > results.csv

# Check color distribution
grep -ohE 'bg-[a-z]+(-[0-9]+)?' output_*.html | sort | uniq -c | sort -rn | head -20

# Expected output (approximately):
#   847  bg-white
#   312  bg-blue-600
#   298  bg-blue-500
#   245  bg-gray-50
#   189  bg-gray-100
#   156  bg-blue-700
#    89  bg-gray-900
#    67  bg-indigo-600
#    43  bg-green-500
#    12  bg-purple-600

If you run this experiment, you will find that blue accounts for over 60% of all colored background classes. This is the training data distribution made visible.

Conclusion

The sameness of AI-generated CSS is not a bug in any particular model. It is a mathematical consequence of how language models work. Training data bias creates skewed token probability distributions. Standard sampling parameters cannot fix skewed distributions without breaking output quality. The only effective solution is constraint-based generation: specific rules that redirect the model toward intentionally different design choices.

Sailop exists because we understood this problem at the technical level and built a tool that addresses it at the right layer. Not by adding randomness to the model, but by adding constraints to the context. The result is AI-generated code that is valid, consistent, and unique.

If you want to see this in action, install the Sailop Claude Code skill and watch how constraint injection transforms the output of an otherwise predictable system.

npm install -g sailop
sailop generate --seed "my-brand"
sailop skill install
# Now every Claude Code session produces unique CSS

The model has not changed. The constraints have.
