Posts tagged with "Coding"

git2gantt -- Simple tool to visualise coding runs

2 min read; 13 GFI

At the start of this year, as part of a much bigger process to review the work that had taken place over the previous 12 months, I was asked (at work) to provide some information about how much time I'd spent on various projects. Now, for me, there's really only one project, but there's lots of different tools and libraries that I've written to support the main work I do. All of these are split into different repositories in the company-internal instance of GitLab. This meant that getting a rough idea of what I was working on and when would be easy enough -- it's all there in the commit history.

Given that this information would make up a couple of slides at most during a far bigger presentation, I wanted something that would be snappy and easy for non-developers to follow and understand. I spent a bit of time pondering some options and decided that (ab)using a gantt chart layout would make sense.

That choice was made all the more easier given that GitLab supports the use of mermaid charts within its Markdown. This meant I could quickly write some code that took the git log of each repository, turned it into mermaid code, and then render it (by hand, this was all about getting things done quickly) via GitLab.

This sounded like it could be a fun personal project. The result was some Python code called git2gantt.

As mentioned above, the output isn't anything too clever, it's just code that can be used to create a plot via mermaid. For example, running git2gantt over itself:

gantt
  title git2gantt output
  dateFormat YYYY-MM-DD

  section git2gantt
  Development: devgit2gantt20190208, 2019-02-08, 2019-02-13
  Development: devgit2gantt20190214, 2019-02-14, 2019-02-15
  Development: devgit2gantt20190303, 2019-03-03, 2019-03-04
  Development: devgit2gantt20191203, 2019-12-03, 2019-12-04

Usage is pretty straightforward:

Usage of the tool

As you can see, it can be run over multiple repos at once, and there's also an option to have it consider every branch within each repository. Another handy option is the ability to limit the output to just one author -- perhaps you just want to document what you've done on a repo, not the contributions of other people.

Also especially handy, if you don't want to bore people with too much detail, is the "fuzz" option. This lets you tell git2gannt how relaxed you want it to be when it tries to decide how long a run of work on a repo lasted. So, perhaps, you're working on and off on a library that supports some other system you're documenting, but you might only be making changes every other day or so. With the correct fuzz value you can make it clear you were working on the library for a couple of weeks, despite there only being a commit every other day.

An example of running the output over a handful of projects would look something like this:

Output of the tool

This is one of those tools I knocked up quickly to get a job done, and haven't quite got round to finishing off fully. One thing I'd really like to do is add mermaid support directly within it, so that it actually has the option to emit plots, not just mermaid code (or, perhaps, drop the mermaid approach and use something else entirely).

Meanwhile though, if you're looking for something quick and dirty that will help you visualise what you've been working on and when for a good period of time... perhaps this will help.

Being phony, and Lispy regular expressions

2 min read; 9 GFI

While it does seem that they're a little out of fashion these days, in some circles anyway, I'm still an avid fan of make and make files. Even in environments where I don't need a Makefile to actually build anything, I'll use one (or more) to help create handy shortcuts for getting stuff done.

Looking at the main Makefile for one of my major work projects, there's 45 targets that help fire off various jobs (all of them self-documenting using a variation on an approach I read a while back).

In most cases the targets aren't real targets. That's to say, they don't build the thing they're called. They are PHONY targets. So, as makes sense, I make a point of marking them all as such. I follow the convention that has the .PHONY marker appear on the line before the target; this feels cleaner to me and easier to follow and maintain.

But.... I'm lazy. And I use Emacs. Typing out .PHONY foo all the time feels like far too much work. So, some time ago, I quickly threw together make-phony.el.

With this I could be really lazy. I could type out the Makefile target and then, with my cursor on it, press a key combination and have the .PHONY marker put in place.

Does it save much time? Yeah, probably not really. But it was a fun little exercise and an excuse to write a little bit of Emacs Lisp.

There's one thing I made a point of doing in the heart of this too: using rx. For anyone who doesn't know of it, think of it as a very Lispy way of writing regular expressions. I won't even try and explain it all here because others have done an excellent job already. What I will do is say this: if you're in the habit of writing some Emacs Lisp, or even tinkering with your configuration, and you find yourself writing a regular expression, consider looking at rx -- it's well worth the time to get to know it.

Slowly, as time goes on, I'm weeding out "vanilla" regular expressions from my config and code and moving over to using rx. I feel, quite rightly I think, that something like this:

(rx
 (or
  ;; Ignore hidden files.
  (group bol ".")
  ;; I never want to edit the desktop.
  (group "Desktop/" eol)
  ;; Ignore compiled files.
  (group "." (or "pyc" "elc") eol)
  (group ".egg-info/" eol)))

is much easier to write, read and maintain, than this:

"\\(^\\.\\)\\|\\(Desktop/$\\)\\|\\(\\.\\(?:\\(?:\\(?:el\\|py\\)c\\)\\)$\\)\\|\\(\\.egg-info/$\\)"

I mean, even if the regular expression above can be written in a more efficient way (and I imagine it can), as someone working in a Lisp environment, I'd much sooner write and work with the rx version.

Getting started

1 min read; 14 GFI

By coincidence, in a couple of different places over the last couple of weeks, the subject of "how do I progress in learning to program?" has cropped up. For me, I think the approaches and solutions tend to be the same for when I want to get my head around a new language: read good examples of idiomatic code, read other related materials, find a problem you care about and implement a solution (ideally something you'll directly benefit from, or at least others may benefit from). Hence the 5x5 puzzle and Norton Guide reader projects I mentioned in my previous post.

Of course, not everyone has problems that they need solving in a way that would work for this approach. So another approach I've recommended in the past is to go looking on somewhere like GitHub and find projects that promote "low-hanging fruit" issues in a way that's designed to be friendly for those who are new to development, new to contributing or new to the problem domain.

While looking for examples of this yesterday I stumbled on Awesome for Beginners. This looks like a great list and one I'm going to keep bookmarked for future reference. Now, this particular list does seem to have an emphasis on pulling in people who are new to contributing to a project rather than new to development, but it does strike me as a good place to start looking no matter where you're coming from.

I know I'm going to start having a wander around that list. It's always nice to contribute and I feel there's real personal benefit in actively solving a problem that someone else has and welcomes help with.

Going on a journey

3 min read; 13 GFI

It's hardly a revelation to say that learning a new programming language, or even learning software development at all, is even more difficult if you don't have an actual problem to solve. I know I'm not alone in having pet projects that, when faced with a new environment, I'll code up a version of that project as a way to get familiar with and understand a language's idioms while implementing something I know well.

Personally, my two favourites are a puzzle called 5x5 (here, here, here, here, here, here and here), and writing a library or even a full application to read Norton Guide database files (here, here, here, here, here, here, here and here). Both are fun to work on, have practical uses, and both have the benefit of being solved problems (for me) that let me concentrate on the "how do I do X in this language/toolkit/environment/framework/etc?".

Even with those two as my goto projects, I'm always open to new small problems that might be fun to apply to languages I do know, or languages I want to get to know (internally at work we have a fun "league" of sorts, writing a particular hamming distance calculation tool in different languages, for example).

A few days ago, via this repo on GitHub, I discovered this fun little problem. Right away I could see the benefit in it. As a "go away and code up a solution" interview question it strikes me as near-perfect. It's obviously not hard to solve, but it touches on some basic but important aspects of software development and so will allow the developer to show off how they approach things.

There's so many different approaches to it too. Even in a single language, I could imagine having some fun writing the smallest code to solve the problem, the most idiomatic code to solve the problem, the most supportable and well-documented code to solve the problem, etc. And then there's the thing I talk about above: knowing the solution and knowing it's easy, you can then use it to learn the idiomatic way of solving the problem in new languages.

Even better, the README of the original repo links to solutions others have written. Knowing the problem, and knowing the solution, you can then go and read other people's code and learn something about different styles and different languages.

Over the next few weeks, as I get free time, I think I might just do this. Take the "Journeys" problem and write versions in different languages I work with, or know, and also use it to get to know languages I've yet to know or use heavily (I'm especially keen to try a version in Julia -- a language I really like the look of and want to find a reason to use).

Meanwhile, yesterday, I had a quick go at a first version in Python (aimed at Python 3.8 or higher): https://github.com/davep/journeys.py

I set out to try and write something that was fairly idiomatic Python, which uses tools that I tend to employ when working on Python projects (pipenv, make, etc), and which also used something I've never quite found a need for so far in my usual coding, but which I can see being useful and helpful.

I even threw in a couple of uses of PEP 572!

I can see me tinkering with this some more over the next few days. I can even see me writing a very different implementation in Python, just for the fun of it.

I think that's what I like about this little problem. It's a good way to do a bit of programming exercise; it's like the perfect way to do the programming equivalent of going for a short run.

My Pylint shame

2 min read; 10 GFI

I first got into Python in the mid-to-late 1990s. It's so far back that I think the copy of Programming Python that I have (sadly in storage at the moment) might be a first edition. I probably fell out of the habit of using Python some time in the early 2000s (that was when I met Ruby). It was only 22 months ago that I started using Python a lot thanks to a change of employer.

As you might imagine, much had changed in the 15+ years since I'd last written a line of Python in anger. So, early on, I made a point of making Pylint part of my development process. All my projects have a make lint make target. All of my projects lint the code when I push to master in the company GitLab instance. These days I even use flycheck to keep me honest as I write my code; mostly gone are the days where I don't know of problems until I do a make lint.

Leaning on Pylint in the early days of my new position made for a great Python refresher for me. Now, I still lean on it to make sure I don't make daft mistakes.

But...

Pylint and I don't always agree. And that's fine. For example, I really can't stand Pylint's approach to whitespace, and that is a hill I'll happily die on. Ditto the obsession with lines being no more than 80 characters wide (120 should be fine thanks). As such any project's .pylintrc has, as a bare minimum, this:

[FORMAT]
max-line-length=120

[MESSAGES CONTROL]
disable=bad-whitespace

Beyond that though, aside from one or two extras that pertain to particular projects, I'm happy with what Pylint complains about.

There are exceptions though. There are times, simply due to the nature of the code involved, that Pylint's insistence on code purity isn't going to work. That's where I use its inline block disabling feature. It's handy and helps keep things clean (I won't deploy code that doesn't pass 10/10), but there is always this nagging doubt: if I've disabled a warning in the code, am I ever going to come back and revisit it?

To help me think about coming back to such disables now and again, I thought it might be interesting to write a tool that'll show which warnings I disable most. It resulted in this fish abbr:

abbr -g pylintshame "rg --no-messages \"pylint:disable=\" | awk 'BEGIN{FS=\"disable=\";}{print \$2}' | tr \",\" \"\n\" | sort | uniq -c | sort -hr"

The idea here being that it produces a "Pylint hall of shame", something like this:

  12 wildcard-import
  12 unused-wildcard-import
   8 no-member
   6 invalid-name
   5 no-self-use
   4 import-outside-toplevel
   4 bare-except
   2 unused-argument
   2 too-many-public-methods
   2 too-many-instance-attributes
   2 not-callable
   2 broad-except
   1 wrong-import-position
   1 wrong-import-order
   1 unused-variable
   1 unexpected-keyword-arg
   1 too-many-locals
   1 arguments-differ

To break the pipeline down:

rg --no-messages "pylint:disable="

First off, I use ripgrep (if you don't, you might want to have a good look at it -- I find it amazingly handy) to find everywhere in the code in and below the current directory (the --no-messages switch just stops any file I/O errors that might result from permission issues -- they're not interesting here) that contains a line that has a Pylint block disable (if you tend to format yours differently, you'll need to tweak the regular expression, of course).

I then pipe it through awk:

awk 'BEGIN{FS="disable=";}{print $2}'

so I can lazily extract everything after the disable=.

Next up, because it's a possible list of things that can be disabled, I use tr:

tr "," "\n"

to turn any comma-separated list into multiple lines.

Having got to this point, I sort the list, uniq the result, while prepending a count (-c), and then sort the result again, in reverse and sorting the numbers based on how a human would read the result (-hr).

sort | uniq -c | sort -hr

It's short, sweet and hacky, but does the job quite nicely. From now on, any time I get curious about which disables I'm leaning on too much, I can use this to take stock.

pydscheck -- A quick hack that keeps slowly growing

3 min read; 11 GFI

Something I always try to do when I'm coding is be consistent. I feel this is important. While people's coding standards may differ, I think different approaches are easier to handle if someone has been consistent with their style across all of their code.

This also stands for documentation too.

In my current position, I do a lot of Python coding, and one of the things I like about Python (there are things I don't like too, but that's not for now) is that it has doc-strings (just like my favourite language). I use them extensively, ensuring every function and method has some form of documentation, and generally I use Sphinx to generate documentation from those doc-strings.

Early on I was bothered by the fact that, just by the simple act of making typos, I wasn't keeping the form of the doc-strings consistent. And in this case it was a really simple thing that was bugging me. Normally, if I'm writing a single-line doc-string, I'll write like this:

def one_liner():
    """Here is a one-line doc-string."""

So far, so good. But, if the doc-string is a multi-liner, I prefer the ending quotes to be on a line of their own, like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.
    """"

But, sometimes, by accident, I'd leave a doc-string like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.""""

While it's really not a big deal, it would bug me and every time I found one like this I'd "fix" it.

Eventually, it bugged me enough that I decided I was going to write a little tool to find all such instances in my code and report them. My first approach was to think "I could just do this with some regexp magic", which was really a bad idea. Then I though, I know, I should use this as an excuse to to play with Python's ast library.

That worked really well! I had the first version of the code up and running in no time. It was simple but did the job. It ran through Python code I threw at it and alerted me to both missing doc-strings, and doc-strings with the ending I didn't like.

That served me for a while, until one day I realised that it wasn't quite doing the job correctly; it was only really looking at top-level functions and top-level methods in classes. Sometimes, not often, but sometimes, I'll define functions within functions, and I feel they deserve documentation too. So then I modified the code to ensure it walked every part of the AST.

Since then, when I've run into new things and had new ideas, pydscheck has grown and grown. I've added checks that all mentioned parameters have a type; I've added checks that any function/method that returns something actually documents the return value; I've added checks that any documentation of a returned value includes its type; I've added checks that any function or method that yields a value documents that fact; I've added checks that ensure that every parameter is documented in some way.

Each time I've done this it's helped uncover issues in my code's documentation that could be cleaner, and it's also given me a pet project to slowly better understand Python's AST.

It could be that there are better tools out there, I'd have thought that a good doc-string linting tool would be something someone had already written. But this time around I was happy to NIH it because I needed a fun learning exercise that would also have some benefits for my day-to-day work.

I'll caveat this with the fact that it's very particular to how I work and how I like my documentation to look, but if it sounds useful, here it is: https://github.com/davep/pydscheck.

There's still lots I could do with it. First off I should really properly package it up so it can be installed as a command line tool via pip. Other things that would be handy would be to allow some form of customisation of how it works. I'm sure there's other fun things I can do with it too.

That's part of the fun of having a pet project: you can tinker when you like and also get benefits from it as you use it.

gitweb.el -- Quickly visit a repo's forge from Emacs

1 min read; 14 GFI

gh.fish, which I wrote about yesterday, actually sprang from something I initially wrote for Emacs. I'm often spending my time switching between Emacs and the command line (which is fast and easy -- I normally work on macOS and have Emacs and iTerm2 running full screen, and I can switch between them without ever taking my hands off the keyboard), so it makes sense to have some handy commands repeated in both places.

So, originally, I'd written gitweb.el to open the current repository's "forge" in the web browser.

As with the fish version, how it works is quite simple. I use shell-command-to-string to call git and find the origin URL for the current repo, and then manipulate it a bit to turn it into a normal browser-friendly URL. Finally, if I get something workable, I use browser-url to have the resulting page open in the browsing environment of choice.

I have the command bound to a key combination that's similar to the ones I use with Magit and forge, so in terms of muscle-memory it's easy for me to remember what to press when I quickly want to skip over from a Magit view to the repo forge itself.

Similar to what I wrote a couple of days back, I think this again illustrates how handy Emacs is as a work environment. While it's absolutely true that there are other development environments out there that offer similar extensibility, Emacs is the one I'm comfortable with, and it has a long history of offering this.

gh.fish -- Quickly visit a repo's forge

1 min read; 9 GFI

These days fish is my shell of choice. I started out with bash back in the 1990s, went through a bit of a zsh/oh-my-zsh phase, but earlier this year finally settled on fish.

At some point I might write a post about my fish config, and why fish works well for me. But that's an idea for another time.

In this post I thought I'd share a little snippet of code that can come in handy now and again.

Sometimes I find myself inside a git repo, in the shell, and I want to get to the "forge" for that repo. This is most often either on GitHub, or in a company-local installation of GitLab. To get there quickly I wrote gh.fish:

##############################################################################
# Attempt go visit the origin hub for the current repo.

function gh -d "Visit the repo in its origin hub"

    # Check that there is some sort of origin.
    set origin (git config --get remote.origin.url)

    # If we didn't get anything...
    if not test "$origin"
        # ...complain and exit.
        echo "This doesn't appear to be a git repo with an origin"
        return 1
    end

    # Open in the browser.
    open "https://"(string replace ":" "/" (string replace -r '\.git$' "" (string split "@" $origin)[ 2 ]))

end

### gh.fish ends here

The idea is pretty simple: I see if the repo has an origin of some description and, if it has, I slice and dice it into something that looks like the URL you'd expect to find for a GitHub or GitLab repo. Finally I use open to open the URL in the environment's browser of choice.

pypath.el -- A little Emacs hack to help with Django

2 min read; 10 GFI

One of the things I really like about coding with Emacs is how I can easily identify a repeated task and turn it into a command in my environment, saving me a load of work down the line.

pypath.el is one such example.

In my day job I write a lot of Django code. As part of that, I write a good number of unit tests too. Sometimes I'll write the tests as I'm writing the code they test, other times I'm writing them afterwards; it all really depends on where my head's at and how the code is flowing.

When I'm writing those tests, I often want to test them as I go. Given that starting up a test session can take a while, and given that running all the tests in the system can take a while, it's really handy if I can run that single test I'm working on.

This is easy enough with Django. In my work environment it's normally something like:

$ pipenv run ./manage.py test -v 2 app.test.some.sub.module.TestClass.test_method

Only... typing out the:

app.test.some.sub.module.TestClass.test_method

part is a bit of a pain. Sure, once you've typed it the once you can use your shell of choice (mine being fish and on occasion eshell) to recall it from history, but typing it out the first time is the annoying part.

So this was the point where I took 1/2 hour or so to code up pypath.el to solve the problem for me. It gives me two commands:

  • pypath: which simply places the dotted path of the current defun, within the context of being part of a Django system, into the clipboard buffer.
  • pypath-django-test: which works similar to the above but places the whole Django testing command into the clipboard.

With the above, I can work on a test, hit the latter command above, flip to my command line, paste and I'm running the test.

Of course, I'm sure there's plenty of other handy ways to do this. Doubtless there's work environments where the test can be run right there, in the edit buffer, without flipping away, and which takes into account the fact that there's a pipenv-managed virtual environment involved, etc. If there is, that's great, but I don't think it'd work with how I work.

And that's one of the things I really love about Emacs, and why it's still my work environment after almost 25 years of on and off use: with very little work on my part I can create a couple of commands that work exactly how I need them to. While it's great to create generally-useful code for Emacs that lots of people benefit from, sometimes the real value is that you can code up your own particular quirk and just get on with stuff.

To conclude: this post isn't to show off pypath.el; really this post is to sing the praises of Emacs and why it still works so well for me after all these years.

A little speed issue with openpyxl

3 min read; 13 GFI

It's been very quiet on the blogging front, I'm afraid, mostly for the reasons I wrote about back in December last year. In that time I've been really very busy with work (in a good way, in a very good way) and there's not a whole lot of time to be toying with pet projects at home.

However, finding myself with a spare hour or so, I wanted to write about something I did run into as part of some development at work, and which I thought might be worth writing about in case it helps someone else.

Recently I've needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of openpyxl. The initial prototype code I wrote worked well and it soon grew into a full-blown library that'll be used in a couple of work-related projects.

But one thing kept niggling me... It just wasn't as fast as I'd expected. The workbooks I'm pulling data from aren't that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks... even the fan on the machine would ramp up.

It didn't seem right.

I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an xlsx file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.

Given that I had other code to be writing, and given that the workbook-parsing code was "good enough" for the moment, I moved on for a short while.

But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues and almost everything I found said that the common problem was failing to open the workbook in read_only mode. That can't have been my problem because I'd being doing that from the very start.

Eventually I came across a post somewhere (sorry, I've lost it for now -- I'll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion being that every time you pull a value form a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.

This seemed a little unlikely to me -- I'd have expected the code to cache the parsing results or something like that -- but it also would explain what I was seeing. So I decided to give it a test.

openpyxl-speed-issue is a version of the tests I wrote and ran and they absolutely show that there's a huge difference between cell-by-cell access vs generator access.

Code like this:

for row in range( 1, sheet.max_row + 1 ):
    for col in range( 0, sheet.max_column ):
        value = sheet[ row ][ col ].value

is far slower than something like this:

for row in wb[ "Test Sheet" ].rows:
    for cell in row:
        value = cell.value

Here's an example of the difference in time, as seen on my iMac:

$ make test
pipenv run time ./read-using-generators
        1.59 real         0.44 user         0.04 sys
pipenv run time ./read-using-peeking
       25.02 real        24.88 user         0.10 sys

As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.

In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never have noticed this. However, the nature of the workbooks I need to pull data from means I need to "peek ahead" to make decisions about what I'm doing, so a more traditional loop over, with an index, made more sense.

I can easily "fix" this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want and benefit from using generators.

In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment I saw that suggested it was this wasn't exactly easy to find, I thought I'd write it all down too and create a repository of some test code to illustrate the issue. Hopefully someone else will benefit from this in the future.