Posts from April 30, 2026

A different approach

4 min read

As mentioned in the previous post, I've been having a play around with Copilot/Claude vs Gemini when it comes to getting the agents to seek out "bad" code and improve it. In that first post on the subject, I highlighted how both tools noticed some real duplication of effort, both addressed it in more or less the same way, and neither of them took the clean-up to its logical conclusion (or, at the very least, neither cleaned it up in a way that I feel is acceptable).

The comparison of the two PRs (Gemini vs Claude via Copilot) is going to be a slow and occasional read, and if I notice something that catches my interest, I'll note it on this blog.

Initially, I was looking at which files were touched by both. With Gemini it was:

And with Copilot/Claude:

On the surface, it looks like Claude might have done a better job of finding untidy issues in the code. Of course a proper read/assessment of the outcome is needed to decide which is "better"; not to mention the application of a lot of personal taste.

So, with the initial/surface impression that "Claude went deeper", I took a look at the first file they had in common: content_path.py. This is documented as a module related to:

Shared path-resolution utilities for content output paths.

This module provides the generic building blocks used by page_path and post_path. Each content type supplies its own allowed-variable set and variable dict; this module handles the common validation, substitution, and safety checks.

There's 3 functions in there:

  • validate_path_template -- for validating a format string used in building a path.
  • resolve_path -- given a template and some values to populate variables in the template, create a path.
  • safe_output_path -- helper function for joining paths and ensuring they don't escape the output directory.

These seem like sensible functions to have in here, and I can imagine me writing a similar set in terms of the problem they seek to solve.

Both agents seemed to agree on what needed some work: validate_path_template. Both also seem to agree that building knowledge of which variable is required into the function itself isn't terribly flexible; I feel this is a reasonable review of the situation. However, the two agents seem to disagree on how this should be resolved.

Claude's take on this is that the function should grow an optional keyword argument called required_variable, which defaults to slug. It also adds an assert to test if the required variable exists in the allowed_variables (okay, I could quibble about this but given this is a code-check rather than a user-input check, eh, I can go with it). Finally it does the check using the new variable and also makes the error reporting a touch more generic too.

--- /Users/davep/content_path.py        2026-04-30 13:20:00.737955197 +0100
+++ src/blogmore/content_path.py        2026-04-30 13:20:04.560178727 +0100
@@ -17,13 +17,15 @@
     template: str,
     config_key: str,
     allowed_variables: frozenset[str],
-    item_name: str,
+    item_name: str = "",
+    *,
+    required_variable: str | None = "slug",
 ) -> None:
     """Validate a path format string for a content type.

     Checks that *template* is non-empty, well-formed, references only
-    variables from *allowed_variables*, and includes the mandatory
-    ``{slug}`` placeholder.
+    variables from *allowed_variables*, and (when *required_variable* is
+    not ``None``) includes the mandatory placeholder.

     Args:
         template: The path format string to validate.
@@ -33,11 +35,19 @@
             template.
         item_name: The human-readable name of the content type used in
             the uniqueness error message (e.g. ``"page"`` or ``"post"``).
+            Ignored when *required_variable* is ``None``.
+        required_variable: The variable name that must appear in the
+            template, or ``None`` if no variable is mandatory.  Defaults
+            to ``"slug"`` for backward compatibility.

     Raises:
         ValueError: If the template is empty, malformed, references an
-            unknown variable, or omits the ``{slug}`` placeholder.
+            unknown variable, or omits the required placeholder.
     """
+    assert required_variable is None or required_variable in allowed_variables, (
+        f"required_variable {required_variable!r} is not in allowed_variables"
+    )
+
     if not template:
         raise ValueError(f"{config_key} must not be empty")

@@ -61,9 +71,9 @@
             + f". Allowed variables are: {', '.join(sorted(allowed_variables))}"
         )

-    if "slug" not in field_names:
+    if required_variable is not None and required_variable not in field_names:
         raise ValueError(
-            f"{config_key} '{template}' must contain the {{slug}} variable so that "
+            f"{config_key} '{template}' must contain the {{{required_variable}}} variable so that "
             f"each {item_name} can be uniquely identified"
         )

Gemini, on the other hand, has a very similar idea but allows for the fact that the caller might want to specify multiple required variables. So in this case it adds required_variables (as a positional/keyword argument rather than a pure-keyword argument) and defaults it to a frozenset that contains "slug". The rest of the change is also about making the test for the required variables, and the reporting of the error, generic. It doesn't do anything about checking that the required variables are within the allowed variables.

--- /Users/davep/content_path.py        2026-04-30 13:20:00.737955197 +0100
+++ src/blogmore/content_path.py        2026-04-30 14:47:41.607748447 +0100
@@ -18,12 +18,13 @@
     config_key: str,
     allowed_variables: frozenset[str],
     item_name: str,
+    required_variables: frozenset[str] = frozenset({"slug"}),
 ) -> None:
     """Validate a path format string for a content type.

     Checks that *template* is non-empty, well-formed, references only
-    variables from *allowed_variables*, and includes the mandatory
-    ``{slug}`` placeholder.
+    variables from *allowed_variables*, and includes the
+    *required_variables*.

     Args:
         template: The path format string to validate.
@@ -33,10 +34,13 @@
             template.
         item_name: The human-readable name of the content type used in
             the uniqueness error message (e.g. ``"page"`` or ``"post"``).
+        required_variables: The set of variable names that MUST appear
+            in the template to ensure uniqueness. Defaults to
+            ``{"slug"}``.

     Raises:
         ValueError: If the template is empty, malformed, references an
-            unknown variable, or omits the ``{slug}`` placeholder.
+            unknown variable, or omits a required variable.
     """
     if not template:
         raise ValueError(f"{config_key} must not be empty")
@@ -61,10 +65,12 @@
             + f". Allowed variables are: {', '.join(sorted(allowed_variables))}"
         )

-    if "slug" not in field_names:
+    missing = required_variables - set(field_names)
+    if missing:
         raise ValueError(
-            f"{config_key} '{template}' must contain the {{slug}} variable so that "
-            f"each {item_name} can be uniquely identified"
+            f"{config_key} '{template}' must contain the "
+            + ", ".join(f"{{{v}}}" for v in sorted(missing))
+            + f" variable(s) so that each {item_name} can be uniquely identified"
         )

For the most part I think I prefer what Gemini is trying to do, although Claude's sanity check that the required variable is one of the possible variables makes sense. I kind of feel like both of them missed the point when it came to handling the fact that "slug" is required: given that validate_path is otherwise built to be pretty generic, I think I would have defaulted to nothing and simply left it up to the caller to be explicit that "slug" is required, because that matters in context of the caller. This feels like a pretty obvious case of a "business logic" vs "generic utility code" separation of concerns scenario.

As mentioned in passing in another post, it's interesting to see that neither of them noticed the opportunity to turn this:

unknown = set(field_names) - allowed_variables
if unknown:
    ...

into this:

if unknown := (set(field_names) - allowed_variables):
    ...

I know at least one person who would be happy about this fact.

So where does this leave me? At the moment I'm not inclined to merge either PR, but that's mainly because I want to carry on reading them and perhaps writing some more notes about what I encounter. What this does illustrate for me is something we know well enough anyway, but which I wanted to experiment with and see for myself: the initial implementation of any working code written by an agent seems optimised for that particular function or method, perhaps class if you're lucky. It will happily repeat the same code to solve similar problems, or perhaps even use very different approaches to solve the same problem. What it won't do well is recognise that this problem is solved elsewhere and so either use that other code by calling it, or perhaps modify it slightly to make it more generic and more applicable in more situations.

On the other hand, it has shown that with a bit of prompting (and keep in mind that the prompt that arrived at this comparison was really quite vague) it is possible to get an agent to "consider" the problem of duplication and boilerplate and to try and address that.

Having seen the two solutions on offer here, it's hard not to conclude that the best solution would be for me to take the PRs as flags marking places in the code that could be cleaned up, and do the tidy myself.

At least I have, as of the time of writing, 1,380 tests to check that I've not broken anything when I do hand-clean the code. But, hmm, there's a question: can I actually trust those tests? It's not like I wrote them.

Guess that's a whole other thing to worry about at some point...

Duplication of effort

3 min read

While I don't, for a moment, think that the work on BlogMore is complete, I think it's fair to say that the rate of new feature additions has slowed down. Which is fine, there's only so much I need from a self-designed/directed static site generator; at a certain point there's a danger of adding features for the sake of it.

Around this point I think I want to start to pay proper attention to the code quality and maintainability of the ongoing experiment.

As I mentioned the other day, while working through this, I had noticed plenty of bad habits that Copilot (and in this case pretty much always Claude Sonnet 4.6) has. All were very human (obviously), but also the sort of thing you'd expect a human developer to educate themselves out of.

Yesterday evening, out of idle curiosity, I installed Gemini CLI because I wanted to see what would happen if I pointed it at the v2.18.0 codebase and asked it to look for things to clean up, and then what would happen if I did the same with Copilot CLI.

I've saved the results as a PR for what Gemini came up with and what Copilot came up with1. I've not given them a proper read over yet, but while having a quick glance at them something leapt out at me: in the code before the request, there was this in utils.py:

def count_words(content: str) -> int:
    """Count the number of words in the given content.

    Strips common Markdown and HTML formatting before counting so that only
    prose words are included.  The same normalisation rules as
    :func:`calculate_reading_time` are applied.

    Args:
        content: The text content to analyse (may include Markdown/HTML).

    Returns:
        The number of words in the content.

    Examples:
        >>> count_words("Hello world")
        2
        >>> count_words("word " * 10)
        10
    """
    # Remove code blocks
    content = re.sub(r"```[\s\S]*?```", "", content)
    content = re.sub(r"`[^`]+`", "", content)

    # Remove markdown links but keep the text: [text](url) -> text
    content = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", content)

    # Remove markdown images: ![alt](url) -> ""
    content = re.sub(r"!\[([^\]]*)\]\([^\)]+\)", "", content)

    # Remove HTML tags
    content = re.sub(r"<[^>]+>", "", content)

    # Remove markdown formatting characters
    content = re.sub(r"[*_~`#-]", " ", content)

    return len([word for word in content.split() if word])


def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    """Calculate the estimated reading time for content in whole minutes.

    Uses the standard reading speed of 200 words per minute. Strips markdown
    formatting and counts only actual words to provide an accurate estimate.

    Args:
        content: The text content to analyze (can include markdown)
        words_per_minute: Average reading speed (default: 200 WPM)

    Returns:
        Estimated reading time in whole minutes (minimum 1 minute)

    Examples:
        >>> calculate_reading_time("Hello world")
        1
        >>> calculate_reading_time("word " * 400)
        2
    """
    # Remove code blocks (they typically take longer to read/understand)
    content = re.sub(r"```[\s\S]*?```", "", content)
    content = re.sub(r"`[^`]+`", "", content)

    # Remove markdown links but keep the text: [text](url) -> text
    content = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", content)

    # Remove markdown images: ![alt](url) -> ""
    content = re.sub(r"!\[([^\]]*)\]\([^\)]+\)", "", content)

    # Remove HTML tags
    content = re.sub(r"<[^>]+>", "", content)

    # Remove markdown formatting characters
    content = re.sub(r"[*_~`#-]", " ", content)

    # Count words (split by whitespace and filter out empty strings)
    words = [word for word in content.split() if word]
    word_count = len(words)

    # Calculate minutes, rounding to the nearest minute with a minimum of 1
    minutes = max(1, round(word_count / words_per_minute))

    return minutes

I think this right here is a great example of why the code that these tools produce is generally kind of... meh. Let's just really appreciate for a moment the duplication of effort going on there. But it's even more fun. Look at the docstring2 for count_words: it says right there that the "same normalisation rules as calculate_reading_time are applied". It "knows" it copied the work that went into calculate_reading_time too, but never once did it then "think" to pull the common code out and have both of the functions call on that helper function.

Back to the parallel invitations to refactor, having asked:

please do a review of this codebase and see if there is any scope for refactoring so there's less duplication

Both Gemini and Claude noticed this and did something about it. Gemini came up with a:

def _strip_formatting(content: str) -> str:

with all the regex-based-markdown-stripping code in there and then rewrote count_words and calculate_reading_time to call on that. The Copilot/Claude cleanup did something very similar:

def _strip_markdown_formatting(content: str) -> str:

So it's a good thing that both of them "noticed" this duplication of effort and cleaned it up. What I do find interesting though is what the result was. Stripping docstrings and comments for a moment, here's what I was left with, by Gemini, for count_words and calculate_reading_time:

def count_words(content: str) -> int:
    content = _strip_formatting(content)
    return len([word for word in content.split() if word])

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    content = _strip_formatting(content)
    words = [word for word in content.split() if word]
    word_count = len(words)
    minutes = max(1, round(word_count / words_per_minute))
    return minutes

and here's what Copilot/Claude came up with:

def count_words(content: str) -> int:
    return len([word for word in _strip_markdown_formatting(content).split() if word])

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    words = [word for word in _strip_markdown_formatting(content).split() if word]
    return max(1, round(len(words) / words_per_minute))

In both cases calculate_reading_time is still doing the work of counting words when count_words is right there to be called! Don't even get me started on how the Gemini version of calculate_reading_time is so obsessed with assigning values to variables that only get used once in the next line3. Were I reviewing these PRs (oh, wait, I am reviewing these PRs!), I'd request the latter function be turned into:

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    return max(1, round(count_words(content) / words_per_minute))

I would imagine that there's a lot more of this going on in the code, and under ideal conditions this sort of thing would not have made its way into the codebase in the first place. Part of the point of this experiment was to mostly get the agent to do its own thing, without me doing full-on reviews of every PR. Were I to use this sort of tool in a workplace, or even on a FOSS project that wasn't intended to be this exact experiment, I'd be far more inclined to carefully review the result and request changes.

Or, perhaps, hear me out... I have a third agent that I teach to be just like me and I get it do the work of reviewing the PRs for me. What could possibly go wrong?


  1. Again, I guess I should stop referring to Copilot in this case and instead refer to Claude Sonnet. 

  2. Note to self: I need to educate the agents in how I prefer and always use the mkdocstrings style of cross-references

  3. Yes, I know, this is a favoured clean code kind of thing in some circles, but it can be taken to an unnecessary extreme.