Posts tagged with "GitHub"

New GitHub Copilot billing is popular

1 min read; 8 GFI

So today is the day, today is when GitHub Copilot swaps to its new billing system. Watching the relevant subreddit suggests this might not be popular.

Some folk think it isn't the smartest move.

Not a good choice

Some don't feel too friendly towards it any more.

Friendship ended

It looks like some of those friendships have lasted a while.

2021-2026

Some saw the opportunity to create content out of the situation.

A cancellation video

Some have figured out that the thing that costs money, costs money.

It is too expensive

Someone used up half their monthly allowance on just 8 requests.

Half used after 8 requests

Although, of course, there's always someone who has to do it better.

Half after 1 request

To be fair though, at least one person loves the new system.

Love the new system

As for my subscription, which came about after I initially experimented with free access to the tool, I've not actually cancelled yet, but I can't see me making use of it much more. I might try a couple of prompts with it, along the lines of what I was doing while working on BlogMore, just to get a feel for how different the usage is now.

Meanwhile, though, I've found that I'm getting on a lot better with Antigravity and getting the bits done I want to do. I suspect this is how I'll keep tinkering with BlogMore, until Google come to their senses anyway.

Reviewing the cost of BlogMore

3 min read; 8 GFI

Now that we're near the end of the free or cheap GitHub Copilot party, I thought it might be interesting to look at how much BlogMore has "cost" me to build, and what it would have cost under the proposed new pricing structure that is coming in next month. While I've looked at the comparison for last month, I've not looked at the whole period I've been seriously using it.

So, for this review, I'm looking at all the data I can pull out of GitHub for the months of February, March, April and May of this year. Development of BlogMore started back in February and, while it hasn't been 100% the cause of my use of Copilot premium requests, it's been almost all of it. For the purposes of this review I'm just going to take the approach that all I worked on was BlogMore.

Remember that, even when I had free access, I had a maximum of 300 premium requests per month. Once I lost free access I had the same number of requests for $10 a month.

Here's how those months broke down:

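| Month    | Paid   | Premium Requests | %age | Predicted Price |
|----------|--------|------------------|------|-----------------|
| February | $0.00  | 249              | 83%  | $21.67          |
| March    | $10.00 | 140              | 47%  | $56.38          |
| April    | $10.00 | 132              | 44%  | $53.77          |
| May      | $10.00 | 34               | 11%  | $53.69          |
| Total:   | $30.00 | 555              | 46%  | $185.51         |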

So, give or take, something that I've actually spent $30.00 on could have, at best, cost me $185.51. That's assuming that the "cost" of the models I was using stays the same. You can see that the costs have risen already in that the predicted price from February, where I used 83% of my premium requests, is a touch under half the cost for this month, where I've used just 11%. From what I can see in the raw data, it's down to some models suddenly being considered more expensive (perhaps I was doing something that just consumed more tokens, I'm not 100% sure if I'm honest, but I don't recall anything that seemed like harder work).

Who knows what the real costs will be come June.

Now, technically, the actual cost under the new regime could or should be $156, because it would be 4 lots of the $39.00/month plan, which would better cover that use[1]. Again though, that's assuming the actual cost of using whatever models remains pretty stable. It also assumes that I'd want to spend that much each month, and that I would be correctly anticipating that I'd need that much.
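For anyone who wants to check the sums, the totals in the table and the $156 figure can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the figures are copied from the table above, and the variable names are made up for the illustration.

```python
# Figures copied from the table above: (paid, premium requests, predicted price).
MONTHLY_ALLOWANCE = 300  # premium requests available per month

months = {
    "February": (0.00, 249, 21.67),
    "March": (10.00, 140, 56.38),
    "April": (10.00, 132, 53.77),
    "May": (10.00, 34, 53.69),
}

total_paid = sum(paid for paid, _, _ in months.values())
total_requests = sum(requests for _, requests, _ in months.values())
total_predicted = round(sum(price for _, _, price in months.values()), 2)

# Share of the combined four-month allowance that was actually used.
overall_usage = total_requests / (MONTHLY_ALLOWANCE * len(months))

# Four months on the $39.00/month plan instead.
four_months_at_39 = 39.00 * len(months)

print(f"Paid: ${total_paid:.2f}")
print(f"Requests: {total_requests} ({overall_usage:.0%} of allowance)")
print(f"Predicted: ${total_predicted:.2f}")
print(f"4 x $39.00: ${four_months_at_39:.2f}")
```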

Also, this isn't even the total cost of getting this project done. As I've written recently: I've been using Gemini CLI more this month, and while the usage there has been a flat cost until now, that's changing too.

Now, of course, these aren't the only games in town. I could "go to the source" and just get a sub for Claude Code or something, and as Tim pointed out over in the Fediverse, something like Cursor does a lot of this and is just $20/month. Which all sounds fine, but what happens when those fleeing GitHub Copilot or Gemini CLI/Antigravity head over to something like Cursor? Is it sensible to expect the pricing to stay the same[2]?

I guess, at this point, I'm just mulling over the same issue time and again, but from different angles. It does seem clear to me, though, that in less than 4 months, in my experiment of "what happens if I use agents to develop a Free Software tool I want?", the market has gone from being entirely reasonable to pretty much unjustifiable from a price point of view.


  1. As I understand it, the $39 gets you almost twice that value in "AI credits", so the base allotment plus the flex allotment would cover what I've used. 

  2. That's not even the main reason to be concerned about a switch to Cursor

next-gh-pr.el v1.0.0

1 min read; 12 GFI

Pretty much every project that I actively maintain on my GitHub account has a change log of some description. For a long time now, whenever I add a new entry to the log, I'll include a link to the PR that implements that change. Inevitably, this results in me adding the ChangeLog entry, creating the PR, then doing a follow-up change and commit now that I know the PR number, which allows me to add the link.

So I've created next-gh-pr.el to save me just a little time and let me be just a little more lazy. Inside it I've currently got a next-gh-pr-insert-markdown-link command which, when run, as you might imagine, inserts a link to the next likely PR URL as Markdown.

Working out the next URL is simple enough: get the latest issue and PR number, take whichever is the highest, and add 1. There is the wrinkle that discussions also cause this number to bump, and getting the latest discussion number is a little extra faff that I can't be bothered with right now, but my projects very seldom have discussions taking place anyway.
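The package itself is Emacs Lisp, but the logic is small enough to sketch in a few lines of Python. To be clear, this is not the code from next-gh-pr.el: the function names, the placeholder owner and repo, and the exact shape of the Markdown link text are assumptions for illustration, and fetching the latest issue and PR numbers from GitHub is left out entirely.

```python
def next_pr_number(latest_issue: int, latest_pr: int) -> int:
    """Guess the number the next PR will be given.

    Issues and PRs share one counter, so the guess is one more than
    whichever of the two is highest. Discussions are ignored here, so
    the guess can be wrong if a discussion was opened most recently.
    """
    return max(latest_issue, latest_pr) + 1


def next_pr_markdown_link(owner: str, repo: str, latest_issue: int, latest_pr: int) -> str:
    """Build a Markdown link to the likely next PR (link text format is assumed)."""
    number = next_pr_number(latest_issue, latest_pr)
    return f"[#{number}](https://github.com/{owner}/{repo}/pull/{number})"


# Hypothetical repository where the latest issue is #41 and the latest PR is #42.
link = next_pr_markdown_link("example-owner", "example-repo", latest_issue=41, latest_pr=42)
print(link)
```

Passing 0 for both numbers gives 1, which covers a brand new repository with no issues or PRs yet.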

Gemini CLI vs GitHub Copilot (the result)

4 min read; 11 GFI

Following on from this morning's initial experiment, I think I'm settling on a winner. Rather than be annoying and have you scroll to the bottom to find out: it's Gemini CLI. Here's how I found the process played out, and why I'm settling for one over the other.


Gemini CLI

Initially this was an absolute mess. After letting it initially work on the problem, the resulting code didn't even really run. The first go, and the three follow-up prompt/result cycles that followed, all resulted in code that had runtime errors. I'm pretty sure it didn't even bother to try and do any adequate testing. This is odd given I've generally seen it do an okay job when it comes to writing and running tests.

Once I had the code in a stable state, with all type checking, linting and testing passing, it still didn't work. No matter how I tried to use the new facility it just didn't make a difference. No images were optimised. In the end I dived into the code, with the help of its attempt at debugging (it added print calls to try and get to the bottom of things -- how very human!), diagnosed what I thought was the issue (it was looking in the wrong location for the files to optimise), told it my hypothesis and let it check if I was right. It concluded I was and fixed the problem.

Since then I've had a working implementation of the initial plan.

Once that was in place it's been a pretty smooth journey. I've asked it questions about the implementation, had my concerns set to rest, had some concerns addressed and fixed, improved some things here and there, added new features, etc.

All of this has left me with 18% of my daily quota used up. While I think this is the highest I've ever got while using Gemini CLI, it still feels like I got a lot of things done for not a lot of quota use.

GitHub Copilot

Initially I thought this had managed to one-shot the problem. Once it had finished its initial work the code ran without incident and produced all the optimised files. Or so I thought. Doing a little more testing, though, it became clear it was only optimising a subset of the images and it didn't seem to be producing the actual HTML to use the images.

On top of this it didn't even follow the full plan that was laid out in the issue it was assigned. For example: once I'd got it doing the main part of the work, it became apparent that it had pretty much ignored the whole idea of using a cache to speed this process up. I had to remind it to do this.

At one point I switched from the in-PR web interaction with Copilot, and used the local CLI instead. When I ran that up it warned me that I was already 50% of the way through some sort of rate limit and this wouldn't reset for another 3 hours. I think I was about 40 minutes into letting it try and do the work at this point.

After a bit more testing and follow-up prompts, I got to a point where I had something that looked like it was working; albeit in a slightly different way from how Gemini CLI did it (the Copilot approach was writing the optimised images out to the extras directory, mixing them in with my own images; Gemini opted for having a separate directory for optimised images within the static hierarchy).


At this point I will admit to not having carefully reviewed the code of either agent; that's a job still to do. But while Gemini got off to a very rocky start, with a bit of guidance it seemed to arrive at an implementation I'm happy with, and one that seems to be working as intended. While it didn't anticipate all the edge cases, when I asked about them it easily found and implemented solutions for them. Moreover, the fact that I could do all of this and confidently know the "cost" made a huge difference. Copilot seems to generally approach this like a quota or rate limit should be a lovely surprise that will destroy your flow; Gemini has it there and in front of you, all the time.

As for the general idea that I'm working on: I think I'm going to implement it. Weirdly I'm slightly nervous about building the blog such that it won't be using the images I created, but I also recognise that that's a little irrational. Meanwhile I'm very curious about the impact this might have on the PageSpeed measurement of the blog. While it's far from horrific, image size optimisation and size declaration seem to be fairly high on the things that are impacting the performance score (currently sat at 89 for the front page of the blog, as I type this).

The other thing that gives me pause for thought about merging this in, and then subsequently using it, is that I've just finished migrating all images to webp, and so saving a lot of space in the built version of the blog. Generating all the responsive sizes of the images eats that up again. With this feature off, the built version of the blog stands at about 84MB; with it on, this rises to 133MB. That extra 49MB more than eats up the 24MB saving I made earlier.
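Spelling that arithmetic out as a quick sketch: the sizes are the ones quoted above, and the pre-migration baseline is an assumption, derived by adding the 24MB saving back onto the current 84MB build.

```python
# Sizes in MB, as quoted above.
built_without_feature = 84
built_with_feature = 133
webp_saving = 24

# What the responsive image variants add to the build.
extra_for_responsive = built_with_feature - built_without_feature

# Assumed size of the build before the WebP migration.
size_before_webp = built_without_feature + webp_saving

# How the build with the feature on compares with that older baseline.
net_change = built_with_feature - size_before_webp

print(f"Responsive images add {extra_for_responsive}MB")
print(f"Build before the WebP migration: about {size_before_webp}MB")
print(f"Net change against that baseline: +{net_change}MB")
```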

On the other hand: storage is a thing for GitHub to worry about; what I'm worrying about here, and aiming to improve, is the reader's experience.

I'm going to sit on this for a short while and play around with it, at least until I get impatient and say "what the hell" and run with it.

Gemini CLI vs GitHub Copilot (redux)

1 min read; 10 GFI

Given I'm almost certainly going to drop GitHub Copilot starting next month, I'm using Gemini CLI more and more for BlogMore. Yesterday evening, I used it to plan out an idea for a change to the application. Now that I've migrated all images to WebP, I thought it might be interesting to look at the idea of having a responsive approach to images. This is something I don't know a whole lot about (never having needed to bother with it before), but it also happens that I need to read up on this anyway for something related to the day job; given this, it felt like a good time to experiment.

Together with Gemini CLI a plan was created.

This morning, over second coffee, I've kicked off the job of implementing it and, honestly, Gemini CLI is really struggling. It "implemented" the change pretty quickly, within minutes, but it just plain didn't work. Since then I've had it iterate over the issue four times and now it's struggling to make it work at all. It's still beavering away on this as I type, and consuming daily quota at a fair rate too.

So, while I still have GitHub Copilot, this feels like a good point to play them off against each other at least one more time. Having saved the plan Gemini wrote last night as an issue, I've assigned it to Copilot (using Claude Sonnet 4.6). As I type this, I have Gemini racing to get this working in a terminal window behind Emacs; meanwhile, there's Claude doing its thing in GitHub's cloud.

It'll be interesting to see if Copilot manages to one-shot this; Gemini is certainly a long way off a one-shot implementation.

The Copilot bait and switch

4 min read; 10 GFI

Well, it's here: GitHub's tool to let you see how much better off you're going to be under the new Copilot billing system that comes in next month. It's... something.

But let's set the background first. I'm here (in Copilot usage space) as an observer, spending time on an experiment that started with the free pro tier and then transitioned into the "okay, I'll play along for $10 a month, the tool I'm building is fun to work on and is useful to me" phase. I doubted it would last forever -- the price was obviously too good to be true for too long -- but I wasn't expecting it to collapse quite so soon and in quite such a spectacular way.

When GitHub announced the move to usage-based billing I was curious to see if I'd be better off or worse off. It was hard to call really. My use of Copilot is sporadic, and as BlogMore has started to settle down and reach a state approaching feature-saturation the need to do heavy work on it has reduced. I did use it a fair bit last month, but that was more in tinkering and experimenting mode rather than full development mode[1], so it's probably a good measure.

Checking the details on GitHub, it looks like I used a touch under 1/3 of my premium requests.

A table of my premium requests for April 2026

It also looks like the usage came in a couple of bursts lasting a few days, with a pretty flat period in the middle of the month.

Cumulative use for April

So, technically, GitHub won. I paid them $10 for 300 premium requests, I left a touch over 2/3 unused. I think it's fair to suggest that I'm a pretty lightweight user, even when I have a project under active development.

This is where the new usage-based preview tool comes in. Launched yesterday, it lets you take your existing usage stats and see how much it would have really cost you.

The app itself comes over as being hastily spat out with an agent and little communication between responsible teams. You'd think you just press a button when viewing some historical usage figures and get a display that shows you what it would cost under the new approach.

You'd think.

Nope. First you generate your report for a particular month. Then you have to ask for it to be emailed to you as a CSV!

Requesting the email

Even that part isn't super reliable. When I tried it last night it took a wee while to turn up, and that was after about 10 attempts where I got an error message saying it couldn't generate the report. This morning I tried again and I've yet to see the email, 30 minutes later[2].

Having done that you click through to another page/app where you have to upload the CSV, to GitHub, that GitHub just sent you in an email. Brilliant. It then gives you the good news.

So what is my 1/3 use of the premium request allowance going to save me under the new approach to billing?

Such a good deal

Amazing. I especially like the part where they spin it as: if I spent $39/month with them I would save money!

I guess I should take comfort that I'm not that one Reddit user whose $39 April would really have cost them almost $6,000[3].

Watching this journey has been wild. The free Pro as a taster to get me onto $10/month I can go with, that's fair enough. For the longest time I never even paid it any attention. But watching GitHub give it to so many people, and especially so many students, and then watching them do shocked Pikachu when it cost them an arm and a leg and probably caused the degradation of the performance of their systems... who could possibly have seen this coming? Impossible to predict.

Back when I first wrote about my initial impressions of working with Copilot I wondered in the conclusion if I'd transition to a paying version of Copilot. I obviously did. At $10/month it was a very affordable tinker toy that gave me a new dimension to the hobby side of my love of creating things with code. But the prospect of paying $39/month for something in the region of 1/3 of requests that I had before: nah, I'm not into that.

It looks like this month will be the last month I keep a Copilot subscription. BlogMore will carry on being developed, I'll probably transition to leaning on Gemini CLI more (as I have been the last week anyway), and also start to get my hands dirty with the code more too.

This feels like one of the early signs of the bait and switch that the AI suppliers have been building up all along. Experimenting and better understanding how and why people use these tools has been seriously useful, and I can't help but feel that I accidentally started at just the right moment. Watching this happen, with actual experience of what's going on, is very educational. It's going to be super interesting to see if this same stunt gets pulled on a bigger scale, with all the companies that uncritically embraced AI at every level of their organisation.

It's going to be especially interesting to watch the AI leaders in those companies to see how they spin this, if and when the real costs are more widely applied.


  1. Is my recollection. I should probably review the ChangeLog and see what I actually did add in April. 

  2. Yes I checked spam. 

  3. In part because yikes, but also in part because at least I'm not the reason this is happening, unlike them. 

More syncing GitHub to GitLab and Codeberg

1 min read; 10 GFI

Following on from my first post about this, I've tweaked the script I'm using to backup a repo to GitLab and Codeberg:

#!/bin/sh

# Check if the current directory is a Git repository
if ! git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
    echo "Error: This directory is not a Git repository."
    exit 1
fi

REPO_NAME="$1"

# If no repository name was provided, try to get it from the origin remote
if [ -z "$REPO_NAME" ]; then
    ORIGIN_URL=$(git remote get-url origin 2>/dev/null)
    if [ -n "$ORIGIN_URL" ]; then
        # POSIX form of basename; the -s flag isn't available in every /bin/sh environment.
        REPO_NAME=$(basename "$ORIGIN_URL" .git)
    else
        echo "Error: No repository name provided and no 'origin' remote found."
        echo "Usage: $0 <repo-name>"
        exit 1
    fi
fi

echo "Configuring multi-forge backup sync for: $REPO_NAME"

# Set up the remote called backups. Anchor it to Codeberg.
git remote remove backups > /dev/null 2>&1
git remote add backups "ssh://git@codeberg.org/davep/${REPO_NAME}.git"

# Set up the push URLs.
git remote set-url --push backups "ssh://git@codeberg.org/davep/${REPO_NAME}.git"
git remote set-url --add --push backups "git@gitlab.com:davep/${REPO_NAME}.git"

# Only ever backup main.
git config remote.backups.push refs/heads/main:refs/heads/main

# Also backup all tags.
git config --add remote.backups.push 'refs/tags/*:refs/tags/*'

echo "----------------------------------------------------"
echo "Backups configured:"
git remote -v
echo "----------------------------------------------------"
echo "To perform the initial sync, run: git push backups"

### setup-forge-sync ends here

The changes from last time include:

  • The repo name now defaults to whatever is used for GitHub, so I don't have to copy/paste it or type it out.
  • It now backs up all the tags too.

I've been running with this for a couple of days now and it's proving really useful. Well, when Codeberg is available to push anything to...

An unreliable buddy

4 min read; 11 GFI

At some point this morning I was looking for something on this blog and stumbled on a post that had a broken link. Not an external link, but an internal link. This got me thinking: perhaps I should add some sort of linting tool to BlogMore? I figured this should be doable using much of the existing code: pretty much work out the list of internal links, run through all pages and posts, see what links get generated, look for internal links[1], and see if they're all amongst those that are expected.
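The core of that idea, stripped of anything BlogMore-specific, can be sketched as below. This is a standalone illustration rather than how the linter ended up working: it takes already-rendered HTML keyed by site-relative URL, and the names (_LinkCollector, broken_internal_links) and the rule that an internal link is one starting with a single "/" are assumptions made for the example.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a chunk of HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def broken_internal_links(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each page URL to the internal links on it that match no known page."""
    known = set(pages)
    problems: dict[str, list[str]] = {}
    for url, html in pages.items():
        collector = _LinkCollector()
        collector.feed(html)
        collector.close()
        missing = [
            link
            for link in collector.links
            if link.startswith("/")
            and not link.startswith("//")
            # Ignore any fragment or query string when matching known URLs.
            and link.split("#")[0].split("?")[0] not in known
        ]
        if missing:
            problems[url] = missing
    return problems


# A tiny made-up site: one good internal link, one external, one broken.
site = {
    "/": '<a href="/posts/hello/">Hello</a> <a href="https://example.com/">Elsewhere</a>',
    "/posts/hello/": '<a href="/">Home</a> <a href="/posts/missing/#top">Oops</a>',
}
report = broken_internal_links(site)
print(report)
```

External links are skipped altogether, so nothing here goes off and visits other sites.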

Later on in the day I prompted Copilot to have a go. Now, sure, I didn't tell it how to do it, instead I told it what I wanted it to achieve. I hoped it would (going via Claude, as I've normally let it) decide on what I felt was the most sensible solution (use the existing configuration-reading, page/post-finding and post-parsing code) and run with that.

It didn't.

Once again, as I've seen before, it seemed to understand and take into account the existing codebase and then copy bits from it and drop it in a new file. Worse, rather than tackle this using the relevant parts of the existing build engine, it concocted a whole new approach, again obsessing over throwing a regex or three at the problem.

I then spent the next 90 minutes or so, testing the results, finding false reports, finding things it missed, and telling it what I found and getting it to fix them. It did, but on occasion it seemed to special-case the fix rather than understand the general case of what was going on and address that.

Eventually, probably too late really, I gave up trying to nudge it in the right direction and, instead, decided it was time to be more explicit about how it should handle this[2]. The first thing that bothered me was that it seemed to ignore the configuration object. BlogMore has a method of loading the configuration into an object, which can be passed around the code, but with the linter it loaded it up, pulled it all apart, and then passed some of the values as a huge parameter list. Because... reasons?
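To make the complaint concrete, here's a small sketch of the two styles side by side. None of this is BlogMore's actual code: SiteConfig, its fields and both functions are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """Hypothetical stand-in for a loaded configuration object."""

    content_dir: str
    site_url: str
    clean_urls: bool = True


# The style being objected to: the config pulled apart into loose parameters,
# so every new setting means touching every call site.
def lint_with_parameter_list(content_dir: str, site_url: str, clean_urls: bool) -> str:
    return f"linting {content_dir} for {site_url} (clean_urls={clean_urls})"


# The preferred style: hand over the one object and let the callee take what it needs.
def lint_with_config(config: SiteConfig) -> str:
    return f"linting {config.content_dir} for {config.site_url} (clean_urls={config.clean_urls})"


config = SiteConfig(content_dir="content", site_url="https://example.com")
print(lint_with_parameter_list(config.content_dir, config.site_url, config.clean_urls))
print(lint_with_config(config))
```

Both calls print the same thing; the difference is only in how much the caller has to know about what the linter needs.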

Anyway, I told it to cut that shit out and prompted it about a few other things that looked pretty bad too. Copilot/Claude went off and worked away on this for a while, using up my 6th premium request of the session, and then eventually came back with an error telling me I'd hit a rate limit and to come back in a few hours.

GitHub rate limit

Could I have got it to where I wanted to be a bit earlier, with more careful prompting? No doubt. Will a lot of people? I suspect that's rather unlikely. This is one of the many things that make me pretty sceptical about this as the tool some sell it as, at least for the moment. I see often that it's written about or talked about as if it's a really useful coding buddy. It can be, at times, but it's hugely unreliable. Here I'm testing it by building something as a hobby, and I'm doing so knowing that there's no real consequence if it craps out on me. I'm also doing it safe in the knowledge that I could write the code myself, albeit at a far slower pace and with less available time. Not everyone this is aimed at has that going for them.

But these tools are still sold like they're the most reliable coding buddies going.

All that said: having hit the rate limit, and having squandered six premium requests on the problem with no real progress, I decided to use my Google Gemini coding allowance instead (which, in my experience so far, seems pretty generous). I threw more or less the same initial prompt at it, but this time I stressed that I really wanted it to use the existing engine where possible. It managed to pretty much one-shot the problem in about 9 minutes and used up just 2% of my daily quota[3].

I've done a little more tidying up since, and I still need to properly review the result, but from what I can see of the initial results it's found all of the issues I wanted it to find, first time (something Claude didn't manage) and hasn't found any issues that don't exist (also something Claude didn't manage).

So I guess this time Gemini was the reliable buddy. But not knowing which buddy you can rely on makes for a pretty unreliable group of buddies.


  1. This process could, of course, work for external links too, but I'm not really too keen on having a tool that visits every single external link to see if it's still there. 

  2. Which is mostly fine; I'm doing this as an experiment in what it's capable of, and also I was sofa-hacking while having a conversation about naming Easter eggs in Minecraft. 

  3. Imagine that too! Imagine knowing exactly how much of your quota you've used at any given moment! Presumably GitHub don't show you where you are in respect to the rate limits on top of your monthly quota because grinding to a halt with no warning is more... fun? 

Syncing GitHub to GitLab and Codeberg

2 min read; 10 GFI

I've had a GitLab account since December 2017. This came about because of the new job I started in January 2018. They used a self-hosted internal instance of GitLab for all their code, so it made sense I get familiar with it (it wasn't hard; especially back in 2017 it was near enough a clone of GitHub in terms of what it did). Since then, though, I've never really done anything with it. I think I had a repo or two on there for a short while, but I must have nuked them at some point because the account has been empty for the longest time.

A Codeberg account, on the other hand, only got created the other day. Having created this, I got to thinking about how I might use it. In doing so I thought back to my GitLab account and then also got to thinking about where all my public code lives, and how "safe" it is.

Now, sure, the whole point of Git is that it's distributed. Forges are a useful thing to have and work with, but they shouldn't be the place where your code lives. On the other hand, I've had so many machines, and so many work environments, that it has become the case that my GitHub account has become the storage location for my code and projects.

Mostly this is fine. If GitHub were to disappear tomorrow I imagine we'd all have bigger things to be worrying about anyway. But the principle stands: why not distribute the load? Why not distribute the effort when it comes to sharing code I write?

So yesterday I finally decided on a plan: for the moment at least, I'm going to keep using GitHub as my "primary" location for working on stuff. It's where I'll have WiP branches, it's where I'll keep issues, it's where I'll encourage people to raise requests and stuff, it's where I'll host this blog. But I'm going to start syncing projects to both GitLab and Codeberg. I see this as having two benefits: anyone who doesn't want to interact with GitHub can now easily fork code, and if they wish they can raise issues and the like too. Meanwhile, in doing this, I'll also have the added benefit of my code being "backed up" in at least three different locations[1].

The approach I've settled on, for the moment, is based around this little shell script:

#!/bin/sh

# Check if a repository name was provided
if [ -z "$1" ]; then
    echo "Error: No repository name provided."
    echo "Usage: $0 <repo-name>"
    exit 1
fi

REPO_NAME="$1"

# Check if the current directory is a Git repository
if ! git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
    echo "Error: This directory is not a Git repository."
    exit 1
fi

echo "Configuring multi-forge backup sync for: $REPO_NAME"

# Set up the remote called backups. Anchor it to Codeberg.
git remote remove backups > /dev/null 2>&1
git remote add backups "ssh://git@codeberg.org/davep/${REPO_NAME}.git"

# Set up the push URLs.
git remote set-url --push backups "ssh://git@codeberg.org/davep/${REPO_NAME}.git"
git remote set-url --add --push backups "git@gitlab.com:davep/${REPO_NAME}.git"

# Only ever backup main.
git config remote.backups.push refs/heads/main:refs/heads/main

echo "----------------------------------------------------"
echo "Backups configured:"
git remote -v
echo "----------------------------------------------------"
echo "To perform the initial sync, run: git push backups main"

### setup-forge-sync ends here

I'm going to keep all repo names the same[2]. So when I use this script it'll set things up so I can git push backups and main will then get pushed up to both GitLab and Codeberg. I don't feel the need to be keeping any WiP branches in sync or kicking about, likewise any gh-pages branches.

While I'm sure I could have done something a little more automated, this feels like a neat and simple approach, and also allows me to curate what appears in the two other places over time (I suppose, eventually, I'll mirror everything that isn't a dead experimental repo, but meanwhile I'll prioritise projects that are either still very useful or which I'm actively developing and maintaining).


  1. Yes, I have other backups too, but they're always current-working-machine type backups. 

  2. Except, perhaps, for any repo whose name starts with .; I seem to recall that GitLab can't handle that, for some bizarre reason. Perhaps that's fixed now? 

Me vs Claude (redux)

1 min read; 7 GFI

It's a small thing, but here's round 2 of me vs Claude. This time I'm directing the agent to clean up the code that does word counts, getting it to use the Markdown to plain text code that exists in BlogMore, rather than the regex-based Markdown-stripper it was using. The approach it landed on made sense to me, adding another text extractor class, but one that ignores fenced codeblocks[1]. So, in addition to this code (I've removed all docstrings and comments for the sake of including here):

import re
from html.parser import HTMLParser


class _AllTextExtractor(HTMLParser):

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self._chunks.append(data)

    @property
    def text(self) -> str:
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()

it also added this:

class _TextWithoutCodeExtractor(HTMLParser):

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._pre_depth: int = 0

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data: str) -> None:
        if self._pre_depth == 0:
            self._chunks.append(data)

    @property
    def text(self) -> str:
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()

The function that converts Markdown to plain text then decides which extractor to use, based on if the caller asked for codeblocks to be included or not.

All pretty reasonable.

Only... that text property on both those classes is identical. The __init__ method is the same save for one extra line. Even handle_data is more or less the same except for that guarding if.

I can't. I can't let that stand. It's almost copy/paste. For me, this is the ideal time to use just a little bit of inheritance. Here's my take (with classes renamed too, the leading _ didn't feel necessary for one thing):

class TextExtractor(HTMLParser):

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self._chunks.append(data)

    @property
    def text(self) -> str:
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()


class TextSansCodeExtractor(TextExtractor):

    def __init__(self) -> None:
        super().__init__()
        self._pre_depth = 0

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data: str) -> None:
        if self._pre_depth == 0:
            super().handle_data(data)

Much better!

I was tempted to prompt Copilot/Claude about this and see what clean-up it would do, if it would arrive at similar code. But really it didn't seem like a good use of a premium request (perhaps I should have given Gemini a shot).

I see this kind of thing in the code quite a bit, and it speaks to what I've said before about what I'm seeing: the code it writes is... fine. It's okay. It does the job. The code runs. It's just not... to my taste, I guess.


  1. This is important for working out word counts and so read times. It doesn't make sense that embedded code counts towards those.