Posts tagged with "jq"

Recreating my blog stats

4 min read; 13 GFI

Introduction

Having recently added the dump command to BlogMore I've been thinking I should try and learn a little more about jq. It's one of those tools that's been on my radar for ages, which I've used on very rare occasions to get something done quickly, but which I've never really used in anger.

So I thought it might be fun to see about recreating some of the stats from the stats page using jq alone. Well, I say "alone", I mean "from the JSON data that is produced by the BlogMore dump command", and of course that makes it easier given it dumps some of the key calculated values. In other words I won't be using jq to calculate the word count, or reading time, or GFI, etc.

Post count

To start with, working out the number of posts in my blog is simple enough:

jq '. | length'
371

Category count

Getting the list of categories would be:

jq '[.[] .safe_category] | unique'
[
  "ai",
  "coding",
  "creative",
  "emacs",
  "gaming",
  "life",
  "meta",
  "music",
  "python",
  "tech",
  "til"
]

and so getting the count of them is simple enough:

jq '[.[] .safe_category] | unique | length'
11

Tag count

Getting the count of tags takes a little more work, as safe_tags is a list too, so I start out with a list of lists, which I need to flatten first.

jq '[.[] .safe_tags] | flatten | unique | length'
224

This, right away, is an interesting finding. In my stats page, as of the time of writing, the number of tags is reported as 243, but here I'm getting 224. Given I'm using the safe_tags property, which ensures all similar tags end up with the same value (so Hello World, hello world, and all variations, become hello-world), that would suggest the stats page isn't taking that into account. That's an issue to address.
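To illustrate how that kind of mismatch can arise, here's a minimal Python sketch. The `slugify` helper and the sample tags are made up for illustration; BlogMore's real normalisation rules may well differ.

```python
import re


def slugify(tag: str) -> str:
    """Hypothetical normaliser: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", tag.lower()).strip("-")


# Made-up sample tags, not taken from the real blog.
raw_tags = ["Hello World", "hello world", "Hello-World", "Python", "python"]

raw_count = len(set(raw_tags))                    # every spelling counted separately
safe_count = len({slugify(t) for t in raw_tags})  # normalised forms only

print(raw_count, safe_count)
```

Counting the raw spellings gives a larger number than counting the normalised ones, which is the same shape of difference as 243 versus 224.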

A date/time interlude

Here's where things get a little interesting for a moment. In the output of the dump command from BlogMore, the dates of the posts are given in ISO 8601 format; specifically the date and time with offset format. From what I can tell, while jq does have some date/time parsing support, it can't handle that format specifically.

This means that if I try:

jq '.[0] .date | fromdate'

I just get:

jq: error (at <stdin>:27293): date "2015-06-18T14:53:00+01:00" does not match format "%Y-%m-%dT%H:%M:%SZ"

After some searching around it seems the only approach I can really take is to drop the timezone offset and pretend every time is a Z time:

jq '.[0] .date[:19] + "Z" | fromdate'
1434639180

From here I can then get a fully-parsed list of date/time values using gmtime (note that the month in the result is zero-based, so 5 means June):

jq '.[0] .date[:19] + "Z" | fromdate | gmtime'
[
  2015,
  5,
  18,
  14,
  53,
  0,
  4,
  168
]

This isn't ideal for what I'd like to do, as it's going to skew some of the time-related values, but it's close enough for experimenting.
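To put a number on that skew, here's a small Python sketch (Python being an assumption here, used only for comparison). It parses the timestamp from the error message above with its offset intact, then repeats the "pretend it's a Z time" trick.

```python
from datetime import datetime, timezone

stamp = "2015-06-18T14:53:00+01:00"  # the date from the error message above

# fromisoformat understands the "+01:00" offset (Python 3.7 and later).
aware = datetime.fromisoformat(stamp)
true_epoch = int(aware.timestamp())

# The jq workaround: drop the offset and treat the local time as UTC.
pretend_utc = datetime.fromisoformat(stamp[:19]).replace(tzinfo=timezone.utc)
pretend_epoch = int(pretend_utc.timestamp())

print(true_epoch, pretend_epoch, pretend_epoch - true_epoch)
```

The pretend value is the same 1434639180 that jq reports, and it sits exactly one hour (3600 seconds) away from the real instant. That only matters for posts made close to midnight, where the day, and so the weekday, could shift.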

Posts per year

Now that I have a way of breaking the posting time into a workable array of values, getting the number of posts per year becomes:

jq -r '[.[] .date[:19] + "Z" | fromdate | gmtime[0]] | group_by(.) | .[] | "\(.[0]): \(length)"'
2015: 32
2016: 26
2017: 7
2018: 1
2019: 15
2020: 23
2022: 11
2023: 49
2024: 19
2025: 11
2026: 177

Although, to be fair to jq, that's kind of long-winded when I could just pull the year itself out of the posting time:

jq -r '[.[] .date[:4]] | group_by(.) | .[] | "\(.[0]): \(length)"'
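For comparison, the same slice-and-count idea can be sketched in Python. The `count_by_slice` name and the sample records are assumptions for illustration; real input would be the list of posts loaded from the dump.

```python
from collections import Counter


def count_by_slice(posts, start, stop):
    """Count posts by a slice of the ISO date string, like group_by(.) in jq."""
    counts = Counter(post["date"][start:stop] for post in posts)
    return dict(sorted(counts.items()))


# Made-up sample records; only the first date appears in the post itself.
sample_posts = [
    {"date": "2015-06-18T14:53:00+01:00"},
    {"date": "2015-07-01T09:10:00+01:00"},
    {"date": "2016-04-28T20:30:00+01:00"},
]

print(count_by_slice(sample_posts, 0, 4))  # slice 0:4 is the year
```

Changing the slice bounds selects a different part of the timestamp, for example `5, 7` for the month or `11, 13` for the hour.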

Posts by month

At this point getting the posts by month of year seems obvious too:

jq -r '[.[] .date[5:7]] | group_by(.) | .[] | "\(.[0]): \(length)"'
01: 14
02: 12
03: 53
04: 57
05: 76
06: 33
07: 25
08: 25
09: 13
10: 29
11: 19
12: 15

Posts by weekday

For this, I need to go back to the more involved version of the posting date handling query, where I use gmtime to break down the time. It turns out that the penultimate value is the day of the week as a number. So, while it's not quite as readable in that I don't have day names, I can get the values:

jq -r '[.[] .date[:19] + "Z" | fromdate | gmtime[-2]] | group_by(.) | .[] | "\(.[0]): \(length)"'
0: 48
1: 54
2: 51
3: 48
4: 56
5: 56
6: 58

In this case Sunday is the first day (the 0 day here).
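If day names are wanted, a short Python sketch shows one way to get them. It's an illustration under assumptions: `posts_by_weekday` and the sample records are made up, and it uses each post's date as written rather than converting to UTC. Note that Python's `weekday()` counts from Monday as 0, unlike the Sunday-first numbering above.

```python
from collections import Counter
from datetime import datetime

# Python's weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def posts_by_weekday(posts):
    """Count posts per weekday name, using each post's own local date."""
    return Counter(
        DAY_NAMES[datetime.fromisoformat(post["date"]).weekday()]
        for post in posts
    )


sample_posts = [
    {"date": "2015-06-18T14:53:00+01:00"},  # a Thursday (4 in the gmtime output)
    {"date": "2015-06-21T10:00:00+01:00"},  # made-up record, a Sunday
    {"date": "2015-06-25T08:00:00+01:00"},  # made-up record, a Thursday
]

print(posts_by_weekday(sample_posts))
```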

Posts by hour

Getting the posts by the hour is really just a variation on the date-chopping query used for the posts by year and the posts by month; it's all there in the string version of the date.

jq -r '[.[] .date[11:13]] | group_by(.) | .[] | "\(.[0]): \(length)"'
00: 1
06: 1
07: 6
08: 51
09: 35
10: 32
11: 25
12: 14
13: 22
14: 24
15: 25
16: 24
17: 18
18: 9
19: 23
20: 33
21: 21
22: 6
23: 1

First and last posting dates

Getting the date of the first and latest post seems nice and easy:

jq -r '[.[] .date[0:10]] | {first: min, last: max}'
{
  "first": "2015-06-18",
  "last": "2026-06-01"
}

Although, from what I can tell, jq doesn't have anything that makes date arithmetic easy so working out the elapsed time between the two isn't so straightforward. It can be done, but it's not as easy as it might be with a bit of Python code, for example. The best I could come up with was:

jq '[ .[] | .date[:19] + "Z" | fromdate ] | ((max - min) / (365.25 * 24 * 60 * 60))'
10.95438841990519

For an approximate value of "year", of course.
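As a point of comparison, here's roughly what that looks like in Python, using the two dates from the output above. It's a sketch that works on whole dates only, so it differs slightly from the jq figure, which includes the time of day.

```python
from datetime import date

first = date.fromisoformat("2015-06-18")
last = date.fromisoformat("2026-06-01")

elapsed = last - first         # subtracting dates gives a timedelta
years = elapsed.days / 365.25  # the same approximate "year" as the jq query

print(elapsed.days, round(years, 2))
```

Subtracting one date from another directly is the part that jq has no simple equivalent for.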

Word counts

From here on in, many of the stats that can be pulled out of the JSON with jq become easier to handle. Each post has a word_count property, so I only need to do this:

jq -r '[.[] .word_count] | {least: min, most: max, average: (add / length)}'
{
  "least": 24,
  "most": 2792,
  "average": 475.0700808625337
}

Reading times

A post's reading time can be accessed by reading_time, so it's as easy to handle as the word counts:

jq '[.[] .reading_time] | {least: min, most: max, average: (add / length)}'
{
  "least": 1,
  "most": 11,
  "average": 1.8921832884097034
}

Gunning fog index

The Gunning fog index is available as the gfi property so there's no work to do to figure it out. It is, however, a floating point value and I want counts in each integer "bucket". That can be done with round.

jq -r '[.[] .gfi | round] | group_by(.) | .[] | "\(.[0]): \(length)"'
3: 1
4: 2
5: 3
6: 7
7: 30
8: 46
9: 67
10: 70
11: 75
12: 35
13: 18
14: 11
15: 1
16: 3
17: 2

As for working out the mean, median and mode... while I worked out the above queries by reading the docs, experimenting, and using Gemini on occasion to either help me understand an error message or to explain why an approach works the way it did, I'm going to have to leave this one 100% to Gemini. Here's its approach to using jq to work out those averages:

jq '
  [ .[] | .gfi | select(. != null) ] as $raw_gfi
  | [ $raw_gfi[] | round ] as $rounded_gfi
  | ($raw_gfi | length) as $count

  # 1. Mean Calculation
  | (($raw_gfi | add) / $count) as $mean

  # 2. Median Calculation
  | ($raw_gfi | sort) as $sorted_gfi
  | (if $count % 2 == 1 then
       $sorted_gfi[($count - 1) / 2]
     else
       ($sorted_gfi[($count / 2) - 1] + $sorted_gfi[$count / 2]) / 2
     end) as $median

  # 3. Mode Calculation (using the rounded values)
  | [ $rounded_gfi
      | group_by(.)
      | map({gfi: .[0], frequency: length})
      | sort_by(.frequency)
      | reverse
      | .[]
    ] as $frequencies
  | [ $frequencies[] | select(.frequency == $frequencies[0].frequency) | .gfi ] as $modes

  # Final Object Assembly
  | {
      count: $count,
      mean: $mean,
      median: $median,
      mode: $modes
    }
'
{
  "count": 371,
  "mean": 9.908842231503396,
  "median": 9.979198312236287,
  "mode": [
    11
  ]
}

As of the time of writing: that's bang on what I get in the stats. Honestly though, by this point, I think I'd be reaching for Python or something similar to do this sort of work. For sure, I can't say if this is a good jq query, if it's in any way idiomatic, or even if it's error-free. The numbers match what BlogMore says though.
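For what it's worth, a Python version might look something like the sketch below. It's an illustration under assumptions: `gfi_summary` and the sample values are made up, the input is assumed to be a non-empty list of non-negative numbers, and the bucketing uses `floor(x + 0.5)` because Python's built-in `round` rounds halves to the nearest even number, which can differ from jq's `round`.

```python
import math
from collections import Counter
from statistics import mean, median


def gfi_summary(values):
    """Mean and median of the raw values, mode(s) of the rounded buckets."""
    # Round halves up; this matches rounding away from zero for values >= 0.
    buckets = Counter(math.floor(value + 0.5) for value in values)
    top = max(buckets.values())
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "mode": sorted(b for b, n in buckets.items() if n == top),
    }


# Made-up sample values, not taken from the real blog.
sample_gfi = [7.2, 9.6, 10.4, 10.5, 11.3, 12.8]

print(gfi_summary(sample_gfi))
```

As with the jq query, the mode comes back as a list so that ties between buckets are all reported.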

Conclusion

This has been a useful exercise in getting to know a little more about jq, and I can see myself reaching for it to do quick little jobs now that I've finally taken some time to dive into it. As it turns out, it's also been a useful little audit of the content of the stats page because I've even found a bug that needs addressing; so that's a bonus.

BlogMore v2.34.0

1 min read; 12 GFI

I've released BlogMore v2.34.0. This is a small update to make some changes to the newly-added dump command.

The first change is a small fix to the url_path property, which wasn't being populated; now it is.

The second change adds two new properties to the output which relate to links that can be found inside posts: internal_links and external_links. As the names suggest, the first gives you a list of all the internal links that can be found in a given post, with the values given being the same format as the id used for every post in the dump. For example:

"internal_links": [
  "posts/2026/05/2026-05-20-blogmore-v2-25-0.md",
  "posts/2026/05/2026-05-22-blogmore-v2-26-0.md"
]

This should give everything needed to write tools that do things similar to the back-links system in BlogMore itself.
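As a rough idea of what such a tool could look like, here's a minimal Python sketch that inverts internal_links into a map of back-links. The `build_backlinks` name and the sample ids are hypothetical, and this is not how BlogMore itself does it.

```python
from collections import defaultdict


def build_backlinks(posts):
    """Map each post id to the sorted ids of the posts that link to it."""
    backlinks = defaultdict(set)
    for post in posts:
        for target in post.get("internal_links", []):
            backlinks[target].add(post["id"])
    return {target: sorted(sources) for target, sources in backlinks.items()}


# Made-up records in the shape described above; the ids are hypothetical.
sample_posts = [
    {"id": "posts/a.md", "internal_links": ["posts/b.md"]},
    {"id": "posts/b.md", "internal_links": []},
    {"id": "posts/c.md", "internal_links": ["posts/b.md", "posts/a.md"]},
]

print(build_backlinks(sample_posts))
```

Because the values in internal_links use the same format as each post's id, no extra lookup is needed to match a link to its target.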

The list of external links is, obviously, a list of all the links in the post that are external to the blog. It looks like this:

"external_links": [
  "https://blogmore.davep.dev/",
  "https://validator.w3.org",
  "https://json-ld.org/",
  "https://microformats.org/wiki/microformats2",
  "https://microformats.org/wiki/rel-me"
]

There is, of course, some overlap with the link dumping command, but given that the information is available it seemed to make sense to provide it here; it also means that it's available in a more structured form.

Also, providing this sort of information in the JSON output means there's a lot of flexibility when it comes to analysing all the posts in my blog. For example, I can now easily satisfy my curiosity if I want to know exactly which posts in my blog have no links whatsoever.

blogmore dump | jq -r '.[] | select((.internal_links | length) == 0 and (.external_links | length) == 0) | .id'

posts/2015/2015-06-18-a-mild-chrome-annoyance.md
posts/2015/2015-06-23-and-now-for-some-ios.md
posts/2015/2015-07-01-odd-ipod-update.md
posts/2015/2015-08-03-best-update-ever.md
posts/2015/2015-09-04-unknown-promo.md
posts/2015/2015-10-19-microsoft-accounts.md
posts/2015/2015-11-11-voice-search-failing-on-nexus-6.md
posts/2015/2015-11-12-i-miss-until-next-alarm.md
posts/2016/2016-04-28-i-now-own-a-macbook.md
posts/2017/2017-12-12-on_to_something_new.md
posts/2020/2020-06-24-swift-til-3.md
posts/2020/2020-06-26-switch-til-5.md
posts/2020/2020-06-28-swift-til-7.md
posts/2020/2020-07-05-swift-til-10.md
posts/2022/2022-06-03-failed-successfully.md
posts/2023/2023-07-21-encouragement-i-guess.md
posts/2023/2023-07-29-home-pod-stuck-installing.md
posts/2023/2023-10-20-constant-siri-voice-loss.md
posts/2026/05/2026-05-04-my-new-favourite-game-on-steam.md
posts/2026/05/2026-05-11-steam-controller-is-close.md
posts/2026/05/2026-05-25-this-is-not-fun.md

Sure, I don't know what I'd do with this information, but at least I can easily ask the question.

BlogMore v2.33.0

1 min read; 11 GFI

I've released v2.33.0 of BlogMore, which extends the stats page some more, and also adds a tool so a user can do all sorts of fun things with the raw data of their posts.

The addition to the stats page is a list of years along with the top five words that characterise the focused subject for those years. This is done using TF-IDF. While the results for my blog don't come as a surprise, I am pleased to see that it does turn out pretty much how I would have expected:

(Image: Blogging focus per year)

This feels like another fun way to learn something about the post history for your own blog.
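For anyone unfamiliar with TF-IDF, the idea is to score a word highly when it's frequent in one group of text but rare across the others. Here's a minimal Python sketch that treats each year as one document. The `top_terms` function, the exact weighting formula and the sample words are all assumptions for illustration; BlogMore's own implementation may differ, and each word list is assumed to be non-empty.

```python
import math
from collections import Counter


def top_terms(docs, n=5):
    """Rank each document's words by TF-IDF; docs maps a label to a word list."""
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    total_docs = len(docs)

    ranked = {}
    for label, words in docs.items():
        counts = Counter(words)
        scores = {
            word: (count / len(words)) * math.log(total_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked[label] = sorted(scores, key=lambda w: (-scores[w], w))[:n]
    return ranked


# Made-up word lists standing in for the text of each year's posts.
sample = {
    "2015": ["phone", "update", "phone", "blog"],
    "2016": ["macbook", "emacs", "blog", "emacs"],
}

print(top_terms(sample, n=2))
```

A word like "blog" that appears in every year scores zero, which is why the top terms tend to be the ones that set a year apart from the rest.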

Which got me thinking: there's probably any number of other niche and bizarre things that could be done with the content of a blog to gain some insight as to its history, and I really shouldn't just keep adding more and more things to the stats page. But what if I wanted a way to run some code over all the posts? Wouldn't it be useful if I could get all of the parsed post data in JSON format so I can play with it?

With that idea in mind, I added the dump command. When run, it will print to stdout a full dump of all of your posts, as JSON, so you can then write your own nifty tool to read it back in and do any number of interesting checks, tests, reformats or manipulations. For example, if I wanted to use jq to pull out the metadata for this particular post I could:

blogmore dump | jq '.[] | select(.id == "posts/2026/05/2026-05-29-blogmore-v2-33-0.md") | .metadata'

which gives me this result:

{
  "title": "BlogMore v2.33.0",
  "date": "2026-05-29 14:14:10+0100",
  "category": "Coding",
  "tags": "BlogMore, Coding, PyPI, Python",
  "cover": "/attachments/2026/05/29/focus.webp"
}

I can see this being pretty useful to blogmore.el at some point, if I want to add some better querying tools or similar (not that I'd want to run a dump every time; it does require a full parse and render to happen).

Hopefully this will be useful to someone else. I know I'll be toying with it to find out other things about my posting history.