Posts in series "Agentic Afterthoughts"

Five days with Copilot

11 min read; 11 GFI

Another itch to scratch

As I mentioned yesterday, I've been a happy user of Pelican for a couple or so years now, but every so often there's a little change or tweak I'd like to make that requires diving deeper into the templates and the like and... I go "eh, I'll look at it some time soon". Another thought that often goes through my head at those times is "I should build my own static site generator that works exactly how I want" -- because really any hacker with a blog has to do that at some point.

Meanwhile... I've had free access to GitHub Copilot attached to my GitHub account for some time now, and I've hardly used it. At the same time -- the past few months especially -- I've been watching the rise of agents as coding tools, as well as the rise of advocates for them. Worse still, I've seen people I didn't expect to be advocates for giving up on coding turning to these tools and suddenly writing rationales in favour of them.

So, suddenly, the idea popped into my head: I should write my own static site generator that I'll use for my blog, and I should try and use GitHub Copilot to write 100% of the code, and documentation, and see how far I get. In doing so I might firm up my opinions about where we're all going with this.

The requirements were going to be pretty straightforward:

  • It should be a static site generator that turns Markdown files into a website.
  • It should be blog-first in its design.
  • It should support non-blog-post pages too.
  • It should be written in Python.
  • It should use Jinja2 for templates.
  • It should have a better archive system than I ever got out of my Pelican setup.
  • It should have categories, tags, and all the usual metadata stuff you'd expect from a site where you're going to share content from.

Of course, the requirements would drift and expand as I went along and I had some new ideas.

Getting started

To kick things off, I created my repo, and then opened Copilot and typed out a prompt to get things going. Here's what I typed:

Build a blog-oriented static site generation engine. It should be built in Python, the structure of the repository should match that of my preferences for Python projects these days (see https://github.com/davep/oldnews and take clues from the makefile; I like uv and ruff and Mypy, etc).

Important features:

  • Everything is written in markdown
  • All metadata for a post should come from frontmatter
  • It should use Jinja2 for the output templates

As you can see, rather than get very explicit about every single detail, I wanted to start out with a vague description of what I was aiming for. I did want to encourage it to try and build a Python repository how I normally would, so I pointed it at OldNews in the hope that it might go and comprehend how I go about things; I also doubled down on the importance of using uv and mypy.

The result of this was... actually impressive. As you'll see in that PR, to get to a point where it could be merged, there was some back-and-forth with Copilot to add things I hadn't thought of initially, and to get it to iron out some problems, but for the most part it delivered what I was after. Without question it delivered it faster than I would have.

Some early issues where I had to point out problems to Copilot included:

  • The order of posts on the home page wasn't obvious to me, and absolutely wasn't reverse chronological order.
  • Footnotes were showing up kinda odd.
  • The main index for the blog was showing just post titles, not the full text of the article as you'd normally expect from a blog.

Nothing terrible, and it did get a lot of the heavy lifting done and done well, but it was worth noting that a lot of dev-testing/QA needed to be done to be confident about its work, and doing this picked up on little details that are important.

An improvement to the Markdown

As an aside: during this first PR, I quickly noticed a problem where I was getting this error when generating the site from the Markdown:

Error generating site: mapping values are not allowed in this context
  in "<unicode string>", line 3, column 15

I just assumed it was some bug in the generated code and left Copilot to work it out. Instead it came back and educated me on something: I actually had bad YAML in the frontmatter of some of my posts!
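
For illustration: one common cause of that particular YAML error is an unquoted value that itself contains a colon followed by a space, such as a post title. The following is a rough, standard-library-only sketch of a check for that pattern. The helper name and the sample frontmatter are hypothetical, it is not part of BlogMore, and it only handles simple `key: value` lines.

```python
import re

# Hypothetical helper, not BlogMore code. Matches simple "key: value" lines.
KEY_VALUE = re.compile(r"^(?P<key>[A-Za-z_][\w-]*):\s+(?P<value>\S.*)$")


def suspicious_frontmatter_lines(frontmatter: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs whose value probably needs quoting."""
    problems = []
    for number, line in enumerate(frontmatter.splitlines(), start=1):
        match = KEY_VALUE.match(line)
        if match is None:
            continue
        value = match.group("value").strip()
        # Quoted values and flow collections can usually contain ": " safely.
        if value[0] in "\"'[{":
            continue
        # An unquoted ": " inside the value reads as a nested mapping.
        if ": " in value:
            problems.append((number, line))
    return problems


# Made-up frontmatter: only the title line has the problem.
sample = (
    "category: Python\n"
    "date: 2026-02-18 09:00:00\n"
    "title: Copilot: five days in\n"
)
print(suspicious_frontmatter_lines(sample))
```

Wrapping the offending value in quotes is normally enough to make a line like that parse.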

This, by the way, wouldn't be the last time that Copilot found an issue with my input Markdown and so, having used it, improved my blog.

A major feature from a simple request

Another problem I ran into quickly was that previewing the generated site wasn't working well at all; all I could do was browse the files in the filesystem. So, almost as an offhand comment, in the initial PR, I asked:

Can we get a serve mode please so I can locally test the site?

Just like that, it went off and wrote a whole server for the project. While the server did need a lot of extra work to really work well1, the initial version was good enough to get me going and to iterate on the project as a whole.
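
To give a feel for the smallest possible version of a serve mode, here is a sketch built only on Python's standard library. It is not the server Copilot wrote for BlogMore; the function names, the `output` directory and the port are assumptions made for the example.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def make_preview_server(site_dir: Path, port: int = 8000) -> ThreadingHTTPServer:
    """Create a local-only HTTP server that serves files from site_dir."""
    # The directory argument makes the handler serve site_dir rather than
    # the current working directory.
    handler = partial(SimpleHTTPRequestHandler, directory=str(site_dir))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def main() -> None:
    """Serve the (assumed) output directory until interrupted with Ctrl+C."""
    server = make_preview_server(Path("output"))
    host, port = server.server_address[:2]
    print(f"Serving on http://{host}:{port}/")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `main()` starts the server. A more useful preview server would also rebuild the site when the source files change, which this sketch does not attempt.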

The main workflow

Having kicked off the project and having had some success with getting Copilot to deliver what I was asking for, I settled into a new but also familiar workflow. Whereas normally, when working on a personal project, I'll write an issue for myself, at some point pick it up and create a PR, review and test the PR myself then merge, now the workflow turned into:

  • Write an issue but do so in a way that when I assign it to Copilot it has enough information to go off and do the work.
  • Wait for Copilot to get done.
  • Review the PR, making change requests etc.
  • Make any fixes that are easier for me to fix by hand than describe to Copilot.
  • Merge.

In fact, the first step had some sub-steps to it too, I was finding. What I was doing, more than ever, was writing issues like I'd write sticky notes: with simple descriptions of a bug or a new feature. I'd then come back to them later and flesh them out into something that would act as a prompt for Copilot. I found myself doing this so often I ended up adding a "Needs prompt" label to my usual set of issue labels.

All of this made for an efficient workflow, and one where I could often get on with something else as Copilot worked on the latest job (I wasn't just working on other things on my computer; sometimes I'd be going off and doing things around the house while this happened), but... it wasn't fun. It was the opposite of what I've always enjoyed when it comes to building software. I got to dream up the ideas, I got to do the testing, I got to review the quality of the work, but I didn't get to actually lose myself in the flow state of coding.

One thing I've really come to understand during those 5 days of working on BlogMore was I really missed getting lost in the flow state. Perhaps it's the issue to PR to review to merge cycle I used that amplified this, perhaps those who converse with an agent in their IDE or in some client application keep a sense of that (I might have to try that approach out), but this feels like a serious loss to me when it comes to writing code for personal enjoyment.

The main problems

I think it's fair to say that I've been surprised at just how well Copilot understood my (sometimes deliberately vague) requests, at how it generally managed to take some simple plain English and turn it into actual code that actually did what I wanted and, mostly, actually worked.

But my experiences over the past few days haven't been without their problems.

The confidently wrong problem

Hopefully we all recognise that, with time and experience, we learn where the mistakes are likely to turn up. Once you've written enough code you've also written enough bugs, and been caught out by enough edge cases, that you get a spidey-sense for trouble as you write code. I feel that this kind of approach can be called cautiously confident.

Working with Copilot2, however, I often ran into the confidently wrong issue. On occasion I found it would proudly3 request review for some minor bit of work, proclaiming that it had done the thing or solved the problem, and I'd test it and nothing had materially changed. On a couple of occasions, when I pushed back, I found it actually doubting my review before finally digging in harder and eventually solving the issue.

I found that this took time and was rather tiring.

There were also times where it would do the same but not directly in respect to code. One example I can think of is when it was confident that Python 3.14 was still a pre-release Python as of February 2026 (it isn't).

This problem alone concerns me; this is the sort of thing where people without a good sense for when the agent is probably bullshitting will get into serious trouble.

The tries-too-hard problem

A variation on the above problem works the other way: on at least one occasion I found that Copilot tried too hard to fix a problem that wasn't really its to fix.

In this case I was asking it to tidy up some validation issues in the RSS feed data. One of the main problems was root-relative URLs being in the content of the feed; for that they needed to be made absolute URLs. Copilot did an excellent job of fixing the problem, but one (and from what I could see only one) relative URL remained.
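
As a rough idea of what making root-relative URLs absolute can involve, here is a minimal sketch. It is not the fix Copilot wrote: the function name and the example site URL are made up, and it only looks at double-quoted `href` and `src` attributes.

```python
import re
from urllib.parse import urljoin

# Matches href="/..." or src="/..." but leaves protocol-relative "//..."
# URLs and already-absolute URLs alone.
ROOT_RELATIVE = re.compile(r'(?P<attr>\b(?:href|src)=")(?P<path>/(?!/)[^"]*)"')


def absolutise_urls(html: str, site_url: str) -> str:
    """Rewrite root-relative href/src values in html against site_url."""

    def replace(match: re.Match[str]) -> str:
        absolute = urljoin(site_url, match.group("path"))
        return f'{match.group("attr")}{absolute}"'

    return ROOT_RELATIVE.sub(replace, html)


# Hypothetical feed content and site URL, for illustration only.
content = '<a href="/2026/02/post.html">post</a> <img src="/img/cat.png">'
print(absolutise_urls(content, "https://blog.example.com"))
```

A regular expression is a blunt tool for HTML, so a real implementation might prefer a proper parser.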

I asked it to take a look and it took a real age to work over the issue. To its credit, it dug hard and it dug deep and it got to the bottom of the problem. The issue here though was it tried too hard because, having found the cause of the problem (a typo in my original Markdown, which had always existed) it went right ahead and built a workaround for this one specific broken link.

Now, while I'm a fan of Postel's law, this is taking things a bit too far. If this was a real person I'd tasked with the job I would have expected and encouraged them to come back to me with their finding and say "dude, the problem is in your input data" and I'd have fixed my original Markdown.

Here though it just went right ahead and added this one weird edge case as something to handle.

I think this is something to be concerned about and to keep an eye on too. I feel there's a danger in having the agent rabbit-hole a fix for a problem that it should simply have reported back to me for further discussion.

The never-pushes-back problem

Something I did find unsurprising but disconcerting was Copilot's unwillingness to push back, or at least defend its choices. Sometimes it would make a decision or a change and I'd simply ask it why it had done it that way, why it had made that choice. Rather than reply with its reasoning it would pretty much go "yeah, my bad, let me do it a way you're probably going to find more pleasing".

A simple example of this is one time when I saw some code like this:

@property
def some_property(self) -> SomeValue:
    from blogmore.utils import some_utility_function
    ...

I'm not a fan of imports in the body of methods unless there's a demonstrable performance reason. I asked Copilot why it had made this choice here and its reply was simply to say it had gone ahead and changed the code, moving the import to the top of the module.

I see plenty of people talk about how working with an agent is like pair-programming, but I think it misses out on what's got to be the biggest positive of that approach: the debate and exchange of ideas. This again feels like a concern to be mindful of, especially if someone less experienced is bringing code to you where they've used an agent as their pair buddy.

The overall impression

Now I'm at the end of the process, and using the result of this experiment to write this post4, I feel better informed about what these tools offer, and the pitfalls I need to be mindful of. Sometimes it wasn't a terrible way of working. For example, on the first day I started with this, at one point on a chilly but sunny Sunday afternoon, I was sat on the sofa, MacBook on lap, guiding an AI to write code, while petting the cat, watching the birds in the garden enjoy the content of the feeder, all while chatting with my partner.

That's not a terrible way to write code.

On the other hand, as I said earlier, I missed the flow state. I love getting lost in code for a few hours and this is not that. I also found the constant loop of prompt, wait, review, test, repeat, really quite exhausting.

As best as I can describe it: it feels like the fast food of software development. It gets the job done, it gets it done fast, but it's really not fulfilling.

At the end of the process I have a really useful tool, 100% "built with AI", under my guidance, which lets me actually be creative and build things I do create by hand. That's not a bad thing, I can see why this is appealing to people. On the other hand the process of building that tool was pretty boring and, for want of a better word... soulless.

Conclusion

As I write this I have about 24 hours of access to GitHub Copilot Pro left. It seems this experiment used up my preview time and triggered a "looks like you're having fun, now you need to decide if you want to buy it" response. That's fair.

So now I'm left trying to decide if I want to pay to keep it going. At the level I've been using it at for building BlogMore it looks like it costs $10/mth. That actually isn't terrible. I spend more than that on other hobbies and other forms of entertainment. So, if I can work within the bounds of that tier, it's affordable and probably worth it.

What I'm not sure about yet is if I want to. It's been educational, I can 100% see how and where I'd use this for work (and would of course expect an employer to foot the bill for it or a similar tool), and I can also see how and where I might use it to quickly build a personal-use tool to enable something more human-creative.

Ultimately though I think I'm a little better informed thanks to this process, and better aware of some of the wins people claim, and also better informed so that I can be rightly incredulous when faced with some of the wilder claims.

Also, it'll help put some of my reading into perspective.


  1. Amusingly I uncovered another bug while writing this post. ↩

  2. I keep saying Copilot, but I think it's probably more correct to say "Claude Sonnet 4.5" as that's what seemed to be at play under the hood, if I'm understanding things correctly. ↩

  3. Yes, of course that's an anthropomorphism, you'll find plenty of them in this article as it's hard not to write about the subject in any other way; it's an easy shortcut to explain some ideas ↩

  4. Actually I'm writing this post as I always do: in Emacs. But BlogMore is in the background serving a local copy of my blog so I can check it in the browser, and rebuilding it every time I save a change. ↩

Copilot lied

3 min read; 10 GFI

This morning, with my experiment with Copilot having settled down a little bit, I thought I might try and use the result for another little experiment. For a long time now I've maintained a (currently lapsed) photoblog. It was always very much tied to "the site formerly known as twitter" and, since I fully erased my account after the site turned into a Nazi bar, I've not done anything to update how it gets populated.

So I got to thinking: I have a full backup of all the images in a couple of places; perhaps with a bit of coding (and some help from "the AIs") I can revive it using BlogMore?

I tinkered for a wee bit and mostly got something going (albeit I'm going to have to do some work to back-port the actual dates and times of some earlier images, and there's a load of work to do to somehow pull out all the tags). But then I hit a small hitch.

When building BlogMore I made the decision to let it write both the code and the documentation. It documented a couple of features I never asked for, but which seemed sensible so I never questioned them. On the other hand neither did I test them at the time (because they weren't important to what I needed).

It's the exact reason I added this warning at the start of the documentation:

⚠️ Warning

BlogMore is an experiment in using GitHub Copilot to develop a whole project from start to finish. As such, almost every part of this documentation was generated by Copilot and what it knows about the project. Please keep this in mind.

From what I can see at the moment the documentation is broadly correct, and I will update and correct it as I work through it and check it myself. Of course, I will welcome reports of problems or fixes.

With this warning in mind, and with the intention of working through the documentation and testing its claims, I decided to test out one of the features when building up the new photoblog.

Whereas with this blog I keep all the posts in a flat structure, this time around I thought I'd try out this (taken from the Copilot-generated BlogMore documentation):


BlogMore is flexible about how you organise your posts. Here are some common patterns:

Flat structure (all posts in one directory):

posts/
  ├── hello-world.md
  ├── python-tips.md
  └── web-development.md

Note: Files can be date-prefixed (e.g., 2026-02-18-hello-world.md) and BlogMore will automatically remove the date prefix from the URL slug. The post will still use the date field from frontmatter for chronological ordering.

Organised by date:

posts/
  ├── 2024/
  │   ├── 01/
  │   │   └── hello-world.md
  │   └── 02/
  │       └── python-tips.md
  └── 2025/
      └── 01/
          └── web-development.md

Organised by topic:

posts/
  ├── python/
  │   ├── decorators.md
  │   └── type-hints.md
  └── web/
      ├── css-grid.md
      └── javascript-tips.md

Using the hierarchy approach, especially with dates, seemed ideal! I'd drop the actual images in such a hierarchy, and also drop the Markdown posts in a parallel hierarchy too. Perfect!

So I set it all up to do that, fired up blogmore serve, visited the URL and... No posts yet. What the hell?

So I checked the code for BlogMore and, sure enough, despite the fact the documentation was selling me on this handy way to organise my posts, no such feature existed!

As an experiment I then asked Copilot what the heck was going on. Much as I expected, rather than coming back with an answer to the question, it went right ahead and fixed it instead. Which is fine, that's where I would have taken this, but I do wish it would answer the question first.

ℹ️ Note

I imagine I could get an answer to the question if I took a more conversational route with Copilot, rather than writing the question in an issue and then assigning that issue to it. I must remember to try that at some point.

So, yeah, unsurprisingly Copilot flat out lied1 in the documentation. I'm not in the least bit shocked by this and, as I said, I fully expected this. But it was amusing to have an example of this turn up so early in the documentation, in such a glaring way, and in a way that was so easily fixed (really, it was just a swap of Path.glob to Path.rglob).
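
The difference between the two calls is easy to show in isolation. This is a small sketch, not BlogMore's code; the `find_posts` helper and the file names are invented for the example.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def find_posts(content_dir: Path, recursive: bool) -> list[str]:
    """List Markdown files under content_dir as sorted relative POSIX paths."""
    # glob("*.md") only sees the top level; rglob("*.md") walks subdirectories.
    finder = content_dir.rglob if recursive else content_dir.glob
    return sorted(p.relative_to(content_dir).as_posix() for p in finder("*.md"))


# Build a throwaway posts directory with one flat and one nested post.
with TemporaryDirectory() as tmp:
    posts = Path(tmp) / "posts"
    (posts / "2024" / "01").mkdir(parents=True)
    (posts / "top-level.md").write_text("# Top level\n")
    (posts / "2024" / "01" / "hello-world.md").write_text("# Hello\n")

    print(find_posts(posts, recursive=False))  # only the top-level file
    print(find_posts(posts, recursive=True))   # the nested file as well
```

With a date-based hierarchy and no posts at the top level, the non-recursive version finds nothing at all, which matches the "No posts yet" result.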

As I play with this more it's going to be fun to see what other bold claims turn out to not be true; or perhaps even the reverse: what neat features lurk in the code that haven't been documented.


  1. Yes, as I mentioned yesterday, that's an anthropomorphism. Folk who take such things as an indication that you don't understand "AI" might want to think about what it is to be a human when communicating. ↩

Brute force and ignorance

4 min read; 9 GFI

My experiment with coding with an agent continues, and from the point of view of the usability of the results, I'd say it's going well. From the point of view of learning more about this sort of tool and the pitfalls: I'd also say it's going well.

The one thing I'm really starting to get an appreciation for is just how much it employs a sort of brute force and ignorance approach to getting to a solution for some sort of problem. I noticed this on and off as I watched the agent's chat, as it worked away.

One example I enjoyed was when it was trying to add a new dependency to BlogMore. For reasons I should probably try and understand1 it knew it should be using uv to add a dependency and it knew how to use it to update the lock file, but it didn't have uv to hand:

The lock file (uv.lock) needs to be updated to include rcssmin. Since uv isn't available, I'll need to note this in the PR. The lock file should be generated by uv when the pyproject.toml is updated.

So it decided the best solution was probably to... "hand-edit" the lock file.

The lock file needs to be updated with rcssmin. Since uv is not available in this environment, I need to update it manually or find another way.

I need to update the uv.lock file to include rcssmin. The uv.lock file needs to include:

  1. The rcssmin package entry with hashes
  2. The rcssmin dependency in the blogmore package section

Let me look at the PyPI page for rcssmin 1.2.2 to get the hash.

It then grabs rcssmin from PyPI and checks out the information it wants; after which it goes:

I need to update the uv.lock file with the rcssmin package information. The uv.lock file uses SHA256 hashes. Let me look at the existing format in more detail to understand what I need to add, and then I'll query PyPI for the hashes.

Actually, let me think about this differently. The uv.lock file is a generated file, and updating it manually is error-prone. The proper way to update it is to run uv sync or uv lock. Since uv is not available in this environment, I have two options:

  1. Let the CI update the lock file (risky if CI fails on the locked version)

So... around this point it has some sort of plan for hand-patching the lock file, but also recognises that this is a dodgy approach and it should be using uv to update the lock file.

I also really like the part where it says it has two options, then goes on to provide an exhaustive list of... one.

After this it runs this code:

import json, urllib.request
url = 'https://pypi.org/pypi/rcssmin/json'
try:
    with urllib.request.urlopen(url, timeout=10) as f:
        data = json.load(f)
    v = '1.2.2'
    releases = data['releases'].get(v, [])
    for r in releases:
        print(f'filename: {r["filename"]}')
        print(f'  url: {r["url"]}')
        print(f'  sha256: {r["digests"]["sha256"]}')
        print(f'  size: {r["size"]}')
        print(f'  upload_time: {r["upload_time_iso_8601"]}')
except Exception as e:
    print(f'Error: {e}')

after which it concludes:

Now I have all the information I need to update the uv.lock file. I need to add:

  1. The rcssmin package entry in the lock file
  2. Add rcssmin to the blogmore package dependencies

I mean... okay, I guess. Again, I'll admit that I'm still getting used to setting this sort of thing up and working with it (and keep in mind none of this was done locally; this is all via GitHub Copilot, up on GitHub), but it's kind of funny and wild that the approach taken was to recognise that using uv to remake the lock file was the best approach so... let's hand-edit the lock file.

But that's not the best part. This is where it gets really fun. It had decided that the lock file must be in alphabetical order. Okay, sure, why not:

The lock file has packages sorted alphabetically, so rcssmin should go between pygments and python-dateutil (since "r" comes before "py" actually... no, wait - "r" comes after "p" in the alphabet. Let me check the order again).

Since "r" comes before "py."

Since "r" comes before "py."

o_O

O_o

Extra credit here for giving me a good giggle. I really appreciate the way that it catches itself mid-sentence and remembers how the alphabet actually works.

As to the outcome of all of this? Sure, the feature I wanted to add got added; it worked away and got to a working solution in the end. But the route it took was, I think it's fair to say, a "brute force and ignorance" approach.

I've not been spending too much time reading the agent's own chatter, but when I have I've found myself amused by the dead ends it'll wander down and then work its way back out. There is, without question, a recognisable process here: I believe it would be a dishonest developer who says they've never had times in the past, or just one of those off days, where they've fallen down a rabbit hole of a solution, only to realise it's the right solution implemented in the worst possible way. There's also a resemblance here to how more junior developers work a problem until they really develop their skills.

I think I'm going to keep an eye on the agent chat a bit more from now on. While I imagine things will only improve as these tools improve, for the moment it's a good source of coding comedy.


  1. Presumably there's things I can be doing to make its life easier. ↩

Not so elegant

1 min read; 13 GFI

One thing I've been noticing with my current experiment with GitHub Copilot is that it seems to know Python well enough to write code that gets the job done, and sometimes it knows it well enough to write more modern idiomatic Python code, but it also seems to write the inelegant version of it.

It's hard to pin down exactly, and of course it's a matter of taste (my idea of elegant might burn someone else's eyes), but on occasion, as I review the code, I find things that make me go "ugh".

Here's an example: there's a function that Copilot wrote to extract the first non-markup paragraph of an article (so that it can be used as a page description). One thing it needs to do is skip any initial images, etc. It takes a pretty brute force approach of looking at the start of each stripped line, but it gets the job done -- I can't really argue with that.

But here's how it does it:

# Skip markdown image syntax
if stripped.startswith("!["):
    continue

# Skip markdown linked image syntax ([![alt](img)](url))
if stripped.startswith("[!["):
    continue

# Skip HTML img tags
if stripped.startswith("<img"):
    continue

Now, this is good: it's using startswith. There are less-elegant approaches it could have used so I'll give it bonus points for using that method. The thing is though, it's testing each prefix one string at a time, pretty much rolling out a load of boilerplate code.

What bothers me here is that startswith will take a tuple of strings to test for. I find it curious that the generated code is idiomatic enough to know that startswith is a sensible option here, but at the same time it still writes the list of things to test out in a long-winded way.
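
For comparison, the tuple form looks something like the sketch below. The prefixes come from the generated code shown earlier; the surrounding function, its name and its return type are assumptions, since the real function does more than this.

```python
from typing import Optional

# The three prefixes the generated code tested one at a time.
SKIPPED_PREFIXES = ("![", "[![", "<img")


def first_text_line(lines: list[str]) -> Optional[str]:
    """Return the first non-empty line that doesn't start with image markup."""
    for line in lines:
        stripped = line.strip()
        # str.startswith accepts a tuple and is true if any prefix matches.
        if not stripped or stripped.startswith(SKIPPED_PREFIXES):
            continue
        return stripped
    return None


print(first_text_line(["![cat](cat.png)", "", "<img src='x.png'>", "Hello, world."]))
```

Adding another kind of markup to skip then becomes a one-item change to the tuple rather than another `if` block.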

This is exactly the sort of thing I'd call out in a normal code review. Technically, if this wasn't mostly a "let's see how it goes about this with minimal input from me" experiment, I'd have called it out here too (as an experiment, I might go back and prompt it to "think" about this).

If I ever find myself using this sort of tool for generating code in a work setting, this is exactly the sort of thing I'll be watching for.

Documentation generation

1 min read; 10 GFI

While I've written a lot of documentation in my life, it's not something I enjoy. I want documentation to read well, I want documentation to be useful, I want documentation to be accurate. I also want there to be documentation at all and sometimes the other parts mean it doesn't get done for FOSS projects1.

When I started the experiment that is BlogMore, I very quickly hashed out some ideas on how it might generate some documentation, and the result was okay. In fact, if anything, it was a bit too much and there was a lot of repeated information.

So, this morning, before I sat down for the day's work, I quickly wrote an issue that would act as a prompt to Copilot to rewrite the documentation. This time I tried to be very clear about what I wanted where, but also left it to work out all the details.

The result genuinely impressed me. While I'll admit I haven't read it all in detail (and because of this have left the same warning right at the start), on the surface it looks a lot clearer and feels like a better journey to learning how to use BlogMore.

blogmore.davep.dev hosts the result of this.

I have a plan to work through the documentation and be sure it's all correct and all makes sense, but if it's as correct and as useful as I think, I might have to consider the idea of taking this approach more often. Writing down the plan for the documentation and then letting it appear in the background while I get on with my actual day makes a lot of sense.

I fear I might be warming to these tools, a little bit. :-/


  1. Although I've made a point of doing it for almost every one of my recent Textual-based projects. ↩

An eager fix

1 min read; 11 GFI

Yesterday, while looking at starting to post to my photoblog again, I noticed I'd missed a trick when I added the first and second sets of photos when I created seen-by.davep.dev: I'd left off any cover frontmatter from all of the posts!

While I doubt it's super important -- I can't imagine people will be sharing photos from the blog after all, especially not older ones -- it felt like I'd failed to use a useful feature that I'd made sure BlogMore had.

This was obviously easy to fix. I could just write a tool that would go through all of the Markdown files, find the image that is being displayed in the post, pull out the path to the file, and add a cover to the frontmatter. Not exactly the hardest thing to write, but kind of boring for a one-off fix.
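
A sketch of the core of such a tool might look like this. It is not the script Copilot produced, and it leans on assumptions: frontmatter delimited by `---` lines, images written with Markdown's `![alt](path)` syntax, and a hypothetical `add_cover` function that works on the text of one post.

```python
import re

# First Markdown image in the body: ![alt](path), ignoring any title text.
IMAGE = re.compile(r"!\[[^\]]*\]\((?P<path>[^)\s]+)")
HAS_COVER = re.compile(r"^cover:", re.MULTILINE)


def add_cover(markdown: str) -> str:
    """Return markdown with a cover entry added to its frontmatter, if possible."""
    if not markdown.startswith("---\n"):
        return markdown  # No frontmatter to extend.
    end = markdown.find("\n---\n", 3)
    if end == -1:
        return markdown  # Frontmatter is never closed; leave it alone.
    frontmatter, body = markdown[4:end], markdown[end + 5:]
    if HAS_COVER.search(frontmatter):
        return markdown  # Already has a cover.
    image = IMAGE.search(body)
    if image is None:
        return markdown  # Nothing in the post to use as a cover.
    return f"---\n{frontmatter}\ncover: {image.group('path')}\n---\n{body}"


# A made-up post, for illustration only.
post = "---\ntitle: A photo\n---\n\n![A photo](/images/2024/01/photo.jpg)\n"
print(add_cover(post))
```

Applying it to a whole site would be a matter of looping over the Markdown files and writing back any text that changed.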

So this felt like another good time to get Copilot to do the hard work. Liking this plan, I wrote an issue for the prompt and set it to work.

The result was unexpected, and in retrospect this is how I should have approached it in the first place; Copilot wrote the script, then ran it, and then submitted the PR including the script and all of the changes after running it! It's like it was super eager to do the fix so went ahead and just did it.

The resulting PR

This, for me, highlights a trick/approach I'm still not fully mindful of: where I might once have written some throwaway code to do a job, run the code, and then made use of the result, when it comes to using an agent I always have the option of saying what I want done to the content of the repository and just let it do it.

When I'm next faced with a problem like this, I think this is the approach I'll take: rather than ask it to write the tool to do the work, I'll just say what the work is I want done and let it go about it. This feels like where an agent should shine and where it's really useful.

Meanwhile I can get on with the fun stuff.

When your model leaves you

4 min read; 11 GFI

Yesterday evening, after dinner, but just before loading up a game released just a few hours earlier, I decided to kick off a change to BlogMore. Part of the ongoing experiment is this convenience aspect: where I can get some work going and then go and do something entirely different.

The nature of the change itself isn't important (I'll write about it when I release an update), but something that happened is. As usual, I did the prompt as an issue, and assigned it to Copilot. The first thing that was curious was that, after around 5 minutes, despite it having added the 👀 reaction (which it seems to use to indicate the agent has seen it has work to do), nothing happened. I didn't see an agent kick off doing the work. Eventually I gave up waiting and opened a Copilot session and asked it to set about dealing with the issue I'd raised.

It did as I requested, but apparently alongside another agent which had suddenly started doing the exact same work. Thinking it was a glitch (it's not like GitHub hasn't been having some trouble of late) I stopped the newest one and let the "original" go about its work.

A wee bit later, just before I started up my game and the stream, I checked in on how it was doing. It had finished, but in a quick test, I noticed a small bug. I prompted it to fix the problem, closed the tab, and went about the real business of killing bugs.

Fast forward a couple of hours: I was done with getting my arse kicked on Klendathu, so I packed away my controller and headset and opened GitHub again to see where Copilot was at. It wasn't good:

@davep The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 79b81d32-e26a-4898-bd41-4bba088d08f6

<!-- copilot-coding-agent-error: model-not-available -->

Wait... what? I've been using this for weeks now and, as best as I can tell, I've generally been making all the changes and improvements using Claude Sonnet 4.6; there's never been an issue. Then, suddenly, in the middle of a PR, I don't have it?

Quickly checking elsewhere, sure enough, I had access to almost no models. I can't remember what there was, but it wasn't much and all the Claude ones had disappeared. Even if I tested in the Copilot CLI1 I saw a very limited set of models.

Around this time I had two reactions: one was something like "cool, this is an important part of the experiment, knowing how it goes if your models are taken from you", the other was a curious "but Claude and I have an understanding about this codebase; I can't trust some other model I've not been using".

As it was getting late into the evening and I still wanted to watch an episode of Stargate SG-1 (yes, I'm doing a full rewatch given it's on Netflix) I closed the MacBook and decided to check in on the issue in the morning.

One SG-1 episode later, just before I headed to bed, I decided to do a quick search. While it could be a problem with my own account, it felt like it was more of a general issue. It was more of a general issue. At that point (around 23:20 UTC), checking in the GitHub app on my phone, I could see that some, if not all, of the models I was used to seeing had come back.

As of this morning, as of time of writing, it's all looking back to how it was.

All the models back

All of which is a great reminder, and something useful in the experiment: what does happen when some third party takes away the models you're using to get your work done? In the time I've been building up BlogMore I've come to trust the quality of Claude Sonnet (in the sense that I know when and where I have to pay closer attention to what it's done, and where it'll very likely have done a pretty good job2), so finding that I'd possibly have to switch to some other model that I've had no experience with... I genuinely had a moment of concern about how and where I was going to take BlogMore.

Ultimately it's not actually a serious concern for me: while I aim to maintain it for a very long time to come (it is how I'm creating the site for this blog after all), BlogMore isn't that critical. Moreover, I know I could cease using Copilot to create and maintain it and I could tidy up the code and hand-update and hand-manage it. There's a reason why I decided to really dive into this using a language I'm extremely confident with.

But for those applications some might now be relying on, developed by someone keen but as yet unskilled, what way forward for them if such a situation were to happen and not be resolved?


  1. Which I'm not really using at the moment, but do have installed to experiment with. ↩

  2. So just like working with humans, oddly enough. ↩

But is the code that bad?

5 min read; 10 GFI

There is, obviously and understandably, a lot of conversation online about AI and coding and agents and all that stuff. Much of it I get, much of it I agree with, I share the vast majority of the concerns. The impact on people, the impact on society, the impact on the environment, the impact on security... there's a good list of things to worry us there.

The one that crops up a lot though, that I don't quite get, is the constant claim I see that at best AI tools produce bad code, and at worst they produce unworkable code. That really isn't my recent experience.

Sure, going back to 2023 or 2024, when I first started toying with these new chatbot things some folks were raving about, the output was laughable. I can remember spending some fun times trying to coax whatever version of ChatGPT was on the go at the time into writing workable code and being amused by just how bad it was.

Even back in October last year, when I first tried out the free Copilot Pro that GitHub had given me to play with, I tried to get it to build a Textual application for me and it was terrible. The code was bad, it didn't really know how to use Textual properly, the application I was trying to get it to write as a test barely worked. It was a disaster.

A month later, in November of last year, I had a second go and better success. That time the (still not released, perhaps one day) application I was building was Swift-based and worked really well, but I can't really comment on the quality of the code or how idiomatically correct the code is in respect to the type of application it is (it's a wee game that runs on iOS, iPadOS, macOS).

By the time I tried my first serious experiment things seemed to be a little different. The code actually wasn't bad. It wasn't good, it was far from good, but it wasn't bad. Also, because it was Python, I was in a good place to judge the code.

Since I've started working on BlogMore I've noticed issues such as:

  • Lots of repetitive boilerplate code.
  • Lots of magic numbers.
  • Lots of magic strings.
  • Functions with redundant and unused parameters.
  • A default state of just adding more and more code to one file.
  • A habit of writing least-effort-possible type hints.
  • A habit of sometimes taking a hacky shortcut to solve a problem.
  • A habit of sometimes over-engineering a solution to a problem.
  • A weird obsession with importing inside functions.
  • An occasional weird obsession with guarding some imports with TYPE_CHECKING to work around non-existent circular imports.
  • An unwillingness to use newer Python capabilities (I've yet to see it make use of := without being prompted, for example).
  • A tendency to write what I would consider less-elegant code over more-elegant code.

The list isn't exhaustive, of course. The point here is that, as I've reviewed the PRs1, and read the code, I've seen things I wouldn't personally do. I've seen things I wouldn't personally write, I've seen things I've felt the need to push back on, I've seen things I've fully rejected and started over. Ultimately BlogMore isn't the code I would have written, but at the moment it is the application I would have written2.

So, here's the thing: every time I see someone writing a negative toot or post or article or whatever, and they talk about how the code it produces is unworkable, I find myself wondering about how they formed this opinion. Are they just writing the piece for the audience they want? Are they writing the piece based on their experience from months to years back, when these tools did seem to still be laughably bad? Are they simply cynically generating the piece using an LLM to bait for engagement? When I see this particular aspect of such a post it's a bit of a red flag about where they're coming from, kind of like how you suddenly realise that someone who seems to speak with authority might be full of shit when they start to spout questionable "facts" on a subject you understand well.

But wait! What about that list of dodgy stuff I've seen while building BlogMore with Copilot? What about all the reading and reviewing I've had to do, and what about the other crimes against Python coding I can probably still find in the codebase? Surely that is evidence that these tools produce terrible, unworkable, unusable code?

I mean, okay, I suppose I could reach that conclusion if I'd had a massively atypical experience in the software development industry and had never had to review anyone else's code, or had never needed to work on someone else's legacy code. Is what I'm seeing out of Copilot something I'd consider ideal code? Of course not. Is it worse than some of the worst code I've had to deal with since I started coding for a living in 1989? Hell no!

From what I'm seeing right now I'm getting code whose quality is... fine. Mostly it does the job fine. Often it needs a bit of coaxing in the right direction. Sometimes it gets totally confused and goes down a rabbit hole which needs to just be blocked off and we start again. Occasionally it needs rewriting to do the same thing but in a more maintainable way.

All of which sounds very familiar. I've had times where that describes my code (and I would massively distrust anyone who says they've never had the same outcomes in their time writing code). For sure it describes code I've had to take over, maintain or review.

It's almost like it was trained on lots of code written by humans.

Meanwhile... not every instance of using these tools to get code done needs to be about writing actual code. More and more I'm finding Google Gemini (for example) to be a really handy coding buddy and faster "Google this shit 'cos I can't remember this exact thing I want to achieve". I'll ask, I'll almost always get a pretty good answer, and then I can generally take that snippet of code and implement it how I want.

I've seldom had to walk away from that sort of interaction because it was getting me nowhere.

All of which is to say: I remain concerned about a great many things in the AI space at the moment, but I'm equally suspicious of someone who just flatly says "and the code it produces just doesn't work". If that's part of an article or post I'm left with the feeling that the author put zero actual effort into forming their opinion, let alone actually writing it.


  1. To varying degrees. Sometimes I have plenty of time to kill and I read the PR carefully, other times I glance over it, am happy there's nothing horrific there, and then decide to push back or merge based on the results of hand-testing and automated testing. ↩

  2. To be fair, it's the application I would still be writing and would be some time off finishing; there's no way it would be as feature-complete as it is now had I been 100% hand-coding it. ↩

Duplication of effort

3 min read; 11 GFI

While I don't, for a moment, think that the work on BlogMore is complete, I think it's fair to say that the rate of new feature additions has slowed down. Which is fine, there's only so much I need from a self-designed/directed static site generator; at a certain point there's a danger of adding features for the sake of it.

Around this point I think I want to start to pay proper attention to the code quality and maintainability of the ongoing experiment.

As I mentioned the other day, while working through this, I had noticed plenty of bad habits that Copilot (and in this case pretty much always Claude Sonnet 4.6) has. All were very human (obviously), but also the sort of thing you'd expect a human developer to educate themselves out of.

Yesterday evening, out of idle curiosity, I installed Gemini CLI because I wanted to see what would happen if I pointed it at the v2.18.0 codebase and asked it to look for things to clean up, and then what would happen if I did the same with Copilot CLI.

I've saved the results as a PR for what Gemini came up with and what Copilot came up with1. I've not given them a proper read over yet, but while having a quick glance at them something leapt out at me: in the code before the request, there was this in utils.py:

def count_words(content: str) -> int:
    """Count the number of words in the given content.

    Strips common Markdown and HTML formatting before counting so that only
    prose words are included.  The same normalisation rules as
    :func:`calculate_reading_time` are applied.

    Args:
        content: The text content to analyse (may include Markdown/HTML).

    Returns:
        The number of words in the content.

    Examples:
        >>> count_words("Hello world")
        2
        >>> count_words("word " * 10)
        10
    """
    # Remove code blocks
    content = re.sub(r"```[\s\S]*?```", "", content)
    content = re.sub(r"`[^`]+`", "", content)

    # Remove markdown links but keep the text: [text](url) -> text
    content = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", content)

    # Remove markdown images: ![alt](url) -> ""
    content = re.sub(r"!\[([^\]]*)\]\([^\)]+\)", "", content)

    # Remove HTML tags
    content = re.sub(r"<[^>]+>", "", content)

    # Remove markdown formatting characters
    content = re.sub(r"[*_~`#-]", " ", content)

    return len([word for word in content.split() if word])


def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    """Calculate the estimated reading time for content in whole minutes.

    Uses the standard reading speed of 200 words per minute. Strips markdown
    formatting and counts only actual words to provide an accurate estimate.

    Args:
        content: The text content to analyze (can include markdown)
        words_per_minute: Average reading speed (default: 200 WPM)

    Returns:
        Estimated reading time in whole minutes (minimum 1 minute)

    Examples:
        >>> calculate_reading_time("Hello world")
        1
        >>> calculate_reading_time("word " * 400)
        2
    """
    # Remove code blocks (they typically take longer to read/understand)
    content = re.sub(r"```[\s\S]*?```", "", content)
    content = re.sub(r"`[^`]+`", "", content)

    # Remove markdown links but keep the text: [text](url) -> text
    content = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", content)

    # Remove markdown images: ![alt](url) -> ""
    content = re.sub(r"!\[([^\]]*)\]\([^\)]+\)", "", content)

    # Remove HTML tags
    content = re.sub(r"<[^>]+>", "", content)

    # Remove markdown formatting characters
    content = re.sub(r"[*_~`#-]", " ", content)

    # Count words (split by whitespace and filter out empty strings)
    words = [word for word in content.split() if word]
    word_count = len(words)

    # Calculate minutes, rounding to the nearest minute with a minimum of 1
    minutes = max(1, round(word_count / words_per_minute))

    return minutes

I think this right here is a great example of why the code that these tools produce is generally kind of... meh. Let's just really appreciate for a moment the duplication of effort going on there. But it's even more fun. Look at the docstring2 for count_words: it says right there that the "same normalisation rules as calculate_reading_time are applied". It "knows" it copied the work that went into calculate_reading_time too, but never once did it then "think" to pull the common code out and have both of the functions call on that helper function.

Back to the parallel invitations to refactor, having asked:

please do a review of this codebase and see if there is any scope for refactoring so there's less duplication

Both Gemini and Claude noticed this and did something about it. Gemini came up with a:

def _strip_formatting(content: str) -> str:

with all the regex-based-markdown-stripping code in there and then rewrote count_words and calculate_reading_time to call on that. The Copilot/Claude cleanup did something very similar:

def _strip_markdown_formatting(content: str) -> str:

So it's a good thing that both of them "noticed" this duplication of effort and cleaned it up. What I do find interesting though is what the result was. Stripping docstrings and comments for a moment, here's what I was left with, by Gemini, for count_words and calculate_reading_time:

def count_words(content: str) -> int:
    content = _strip_formatting(content)
    return len([word for word in content.split() if word])

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    content = _strip_formatting(content)
    words = [word for word in content.split() if word]
    word_count = len(words)
    minutes = max(1, round(word_count / words_per_minute))
    return minutes

and here's what Copilot/Claude came up with:

def count_words(content: str) -> int:
    return len([word for word in _strip_markdown_formatting(content).split() if word])

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    words = [word for word in _strip_markdown_formatting(content).split() if word]
    return max(1, round(len(words) / words_per_minute))

In both cases calculate_reading_time is still doing the work of counting words when count_words is right there to be called! Don't even get me started on how the Gemini version of calculate_reading_time is so obsessed with assigning values to variables that only get used once in the next line3. Were I reviewing these PRs (oh, wait, I am reviewing these PRs!), I'd request the latter function be turned into:

def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    return max(1, round(count_words(content) / words_per_minute))
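
Putting the pieces together, this is a minimal sketch of where I'd want the three functions to end up. It's not what's in either PR: the helper name is Gemini's, the regexes are the ones from the original utils.py, and two small changes are my own and flagged in the comments. One is running the image regex before the link one, because in the original order the link pattern gets to the [alt](url) part first and leaves "!alt" behind to be counted. The other is dropping the `if word` filter, since str.split() with no arguments never returns empty strings.

```python
import re

# Built at runtime so the snippet doesn't contain a literal triple backtick.
_FENCE = "`" * 3


def _strip_formatting(content: str) -> str:
    """Strip Markdown/HTML formatting so that only prose words remain."""
    # Remove fenced code blocks, then inline code.
    content = re.sub(_FENCE + r"[\s\S]*?" + _FENCE, "", content)
    content = re.sub(r"`[^`]+`", "", content)
    # Remove images first (my reordering), then unwrap links to their text.
    content = re.sub(r"!\[([^\]]*)\]\([^\)]+\)", "", content)
    content = re.sub(r"\[([^\]]+)\]\([^\)]+\)", r"\1", content)
    # Remove HTML tags.
    content = re.sub(r"<[^>]+>", "", content)
    # Turn Markdown formatting characters into spaces.
    return re.sub(r"[*_~`#-]", " ", content)


def count_words(content: str) -> int:
    """Count the prose words in the content."""
    # split() with no arguments never yields empty strings, so no filter needed.
    return len(_strip_formatting(content).split())


def calculate_reading_time(content: str, words_per_minute: int = 200) -> int:
    """Estimate the reading time in whole minutes, with a minimum of 1."""
    return max(1, round(count_words(content) / words_per_minute))
```

The examples from the original docstrings still hold: count_words("Hello world") is 2 and calculate_reading_time("word " * 400) is 2.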

I would imagine that there's a lot more of this going on in the code, and under ideal conditions this sort of thing would not have made its way into the codebase in the first place. Part of the point of this experiment was to mostly get the agent to do its own thing, without me doing full-on reviews of every PR. Were I to use this sort of tool in a workplace, or even on a FOSS project that wasn't intended to be this exact experiment, I'd be far more inclined to carefully review the result and request changes.

Or, perhaps, hear me out... I have a third agent that I teach to be just like me and I get it to do the work of reviewing the PRs for me. What could possibly go wrong?


  1. Again, I guess I should stop referring to Copilot in this case and instead refer to Claude Sonnet. ↩

  2. Note to self: I need to educate the agents in how I prefer and always use the mkdocstrings style of cross-references. ↩

  3. Yes, I know, this is a favoured clean code kind of thing in some circles, but it can be taken to an unnecessary extreme. ↩

A different approach

4 min read; 12 GFI

As mentioned in the previous post, I've been having a play around with Copilot/Claude vs Gemini when it comes to getting the agents to seek out "bad" code and improve it. In that first post on the subject, I highlighted how both tools noticed some real duplication of effort, both addressed it in more or less the same way, and neither of them took the clean-up to its logical conclusion (or, at the very least, neither cleaned it up in a way that I feel is acceptable).

The comparison of the two PRs (Gemini vs Claude via Copilot) is going to be a slow and occasional read, and if I notice something that catches my interest, I'll note it on this blog.

Initially, I was looking at which files were touched by both. With Gemini it was:

And with Copilot/Claude:

On the surface, it looks like Claude might have done a better job of finding untidy issues in the code. Of course a proper read/assessment of the outcome is needed to decide which is "better"; not to mention the application of a lot of personal taste.

So, with the initial/surface impression that "Claude went deeper", I took a look at the first file they had in common: content_path.py. This is documented as a module related to:

Shared path-resolution utilities for content output paths.

This module provides the generic building blocks used by page_path and post_path. Each content type supplies its own allowed-variable set and variable dict; this module handles the common validation, substitution, and safety checks.

There are three functions in there:

  • validate_path_template -- for validating a format string used in building a path.
  • resolve_path -- given a template and some values to populate variables in the template, create a path.
  • safe_output_path -- helper function for joining paths and ensuring they don't escape the output directory.

These seem like sensible functions to have in here, and I can imagine me writing a similar set in terms of the problem they seek to solve.
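
To give a flavour of the last of those, here's a minimal sketch of the general technique. It is not BlogMore's actual code: the signature, parameter names and error message are my assumptions for the sake of illustration. The idea is simply to resolve both paths and refuse anything that ends up outside the output directory.

```python
from pathlib import Path


def safe_output_path(output_dir: Path, relative_path: str) -> Path:
    """Join relative_path onto output_dir, refusing to escape output_dir.

    Hypothetical signature; the real function may well differ.
    """
    root = output_dir.resolve()
    # resolve() collapses any ".." parts, so the check below sees the
    # location that would really be written to.
    candidate = (root / relative_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"'{relative_path}' would be written outside of {root}")
    return candidate
```

With this, something like "posts/hello.html" comes back as a path under the output directory, while "../outside.html" (or an absolute path elsewhere) raises a ValueError.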

Both agents seemed to agree on what needed some work: validate_path_template. Both also seem to agree that building knowledge of which variable is required into the function itself isn't terribly flexible; I feel this is a reasonable review of the situation. However, the two agents seem to disagree on how this should be resolved.

Claude's take on this is that the function should grow an optional keyword argument called required_variable, which defaults to slug. It also adds an assert to test if the required variable exists in the allowed_variables (okay, I could quibble about this but given this is a code-check rather than a user-input check, eh, I can go with it). Finally it does the check using the new variable and also makes the error reporting a touch more generic too.

--- /Users/davep/content_path.py        2026-04-30 13:20:00.737955197 +0100
+++ src/blogmore/content_path.py        2026-04-30 13:20:04.560178727 +0100
@@ -17,13 +17,15 @@
     template: str,
     config_key: str,
     allowed_variables: frozenset[str],
-    item_name: str,
+    item_name: str = "",
+    *,
+    required_variable: str | None = "slug",
 ) -> None:
     """Validate a path format string for a content type.

     Checks that *template* is non-empty, well-formed, references only
-    variables from *allowed_variables*, and includes the mandatory
-    ``{slug}`` placeholder.
+    variables from *allowed_variables*, and (when *required_variable* is
+    not ``None``) includes the mandatory placeholder.

     Args:
         template: The path format string to validate.
@@ -33,11 +35,19 @@
             template.
         item_name: The human-readable name of the content type used in
             the uniqueness error message (e.g. ``"page"`` or ``"post"``).
+            Ignored when *required_variable* is ``None``.
+        required_variable: The variable name that must appear in the
+            template, or ``None`` if no variable is mandatory.  Defaults
+            to ``"slug"`` for backward compatibility.

     Raises:
         ValueError: If the template is empty, malformed, references an
-            unknown variable, or omits the ``{slug}`` placeholder.
+            unknown variable, or omits the required placeholder.
     """
+    assert required_variable is None or required_variable in allowed_variables, (
+        f"required_variable {required_variable!r} is not in allowed_variables"
+    )
+
     if not template:
         raise ValueError(f"{config_key} must not be empty")

@@ -61,9 +71,9 @@
             + f". Allowed variables are: {', '.join(sorted(allowed_variables))}"
         )

-    if "slug" not in field_names:
+    if required_variable is not None and required_variable not in field_names:
         raise ValueError(
-            f"{config_key} '{template}' must contain the {{slug}} variable so that "
+            f"{config_key} '{template}' must contain the {{{required_variable}}} variable so that "
             f"each {item_name} can be uniquely identified"
         )

Gemini, on the other hand, has a very similar idea but allows for the fact that the caller might want to specify multiple required variables. So in this case it adds required_variables (as a positional/keyword argument rather than a pure-keyword argument) and defaults it to a frozenset that contains "slug". The rest of the change is also about making the test for the required variables, and the reporting of the error, generic. It doesn't do anything about checking that the required variables are within the allowed variables.

--- /Users/davep/content_path.py        2026-04-30 13:20:00.737955197 +0100
+++ src/blogmore/content_path.py        2026-04-30 14:47:41.607748447 +0100
@@ -18,12 +18,13 @@
     config_key: str,
     allowed_variables: frozenset[str],
     item_name: str,
+    required_variables: frozenset[str] = frozenset({"slug"}),
 ) -> None:
     """Validate a path format string for a content type.

     Checks that *template* is non-empty, well-formed, references only
-    variables from *allowed_variables*, and includes the mandatory
-    ``{slug}`` placeholder.
+    variables from *allowed_variables*, and includes the
+    *required_variables*.

     Args:
         template: The path format string to validate.
@@ -33,10 +34,13 @@
             template.
         item_name: The human-readable name of the content type used in
             the uniqueness error message (e.g. ``"page"`` or ``"post"``).
+        required_variables: The set of variable names that MUST appear
+            in the template to ensure uniqueness. Defaults to
+            ``{"slug"}``.

     Raises:
         ValueError: If the template is empty, malformed, references an
-            unknown variable, or omits the ``{slug}`` placeholder.
+            unknown variable, or omits a required variable.
     """
     if not template:
         raise ValueError(f"{config_key} must not be empty")
@@ -61,10 +65,12 @@
             + f". Allowed variables are: {', '.join(sorted(allowed_variables))}"
         )

-    if "slug" not in field_names:
+    missing = required_variables - set(field_names)
+    if missing:
         raise ValueError(
-            f"{config_key} '{template}' must contain the {{slug}} variable so that "
-            f"each {item_name} can be uniquely identified"
+            f"{config_key} '{template}' must contain the "
+            + ", ".join(f"{{{v}}}" for v in sorted(missing))
+            + f" variable(s) so that each {item_name} can be uniquely identified"
         )

For the most part I think I prefer what Gemini is trying to do, although Claude's sanity check that the required variable is one of the possible variables makes sense. I kind of feel like both of them missed the point when it came to handling the fact that "slug" is required: given that validate_path_template is otherwise built to be pretty generic, I think I would have defaulted to nothing and simply left it up to the caller to be explicit that "slug" is required, because that matters in the context of the caller. This feels like a pretty obvious case of a "business logic" vs "generic utility code" separation of concerns scenario.

As mentioned in passing in another post, it's interesting to see that neither of them noticed the opportunity to turn this:

unknown = set(field_names) - allowed_variables
if unknown:
    ...

into this:

if unknown := (set(field_names) - allowed_variables):
    ...

I know at least one person who would be happy about this fact.
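
Pulling those observations together, this is roughly the shape I have in mind: Gemini's set of required variables, Claude's sanity check, nothing required by default, and := for the set checks. It's a sketch rather than a drop-in replacement. The diffs above don't show how field_names gets built, so the string.Formatter parsing is my assumption, as is the wording of the "malformed" and "unknown variable" messages.

```python
from string import Formatter


def validate_path_template(
    template: str,
    config_key: str,
    allowed_variables: frozenset[str],
    item_name: str,
    required_variables: frozenset[str] = frozenset(),
) -> None:
    """Validate a path format string; by default nothing is required."""
    # A check on the calling code rather than on user input.
    assert required_variables <= allowed_variables, (
        f"required_variables {sorted(required_variables)} "
        "are not all in allowed_variables"
    )

    if not template:
        raise ValueError(f"{config_key} must not be empty")

    try:
        # Assumed approach: let the standard library parse the placeholders.
        field_names = {name for _, name, _, _ in Formatter().parse(template) if name}
    except ValueError as error:
        raise ValueError(f"{config_key} '{template}' is malformed: {error}") from error

    if unknown := (field_names - allowed_variables):
        raise ValueError(
            f"{config_key} '{template}' uses unknown variable(s): "
            + ", ".join(sorted(unknown))
            + f". Allowed variables are: {', '.join(sorted(allowed_variables))}"
        )

    if missing := (required_variables - field_names):
        raise ValueError(
            f"{config_key} '{template}' must contain the "
            + ", ".join(f"{{{name}}}" for name in sorted(missing))
            + f" variable(s) so that each {item_name} can be uniquely identified"
        )
```

The caller that cares about "slug" then says so itself, for example by passing required_variables=frozenset({"slug"}) when validating a post path template, and the generic function no longer needs to know why.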

So where does this leave me? At the moment I'm not inclined to merge either PR, but that's mainly because I want to carry on reading them and perhaps writing some more notes about what I encounter. What this does illustrate for me is something we know well enough anyway, but which I wanted to experiment with and see for myself: the initial implementation of any working code written by an agent seems optimised for that particular function or method, perhaps class if you're lucky. It will happily repeat the same code to solve similar problems, or perhaps even use very different approaches to solve the same problem. What it won't do well is recognise that this problem is solved elsewhere and so either use that other code by calling it, or perhaps modify it slightly to make it more generic and more applicable in more situations.

On the other hand, it has shown that with a bit of prompting (and keep in mind that the prompt that arrived at this comparison was really quite vague) it is possible to get an agent to "consider" the problem of duplication and boilerplate and to try and address that.

Having seen the two solutions on offer here, it's hard not to conclude that the best solution would be for me to take the PRs as flags marking places in the code that could be cleaned up, and do the tidy myself.

At least I have, as of the time of writing, 1,380 tests to check that I've not broken anything when I do hand-clean the code. But, hmm, there's a question: can I actually trust those tests? It's not like I wrote them.

Guess that's a whole other thing to worry about at some point...