pydscheck -- A quick hack that keeps slowly growing

Posted on 2019-10-26 13:19 +0100 in Python • Tagged with Python, documentation • 3 min read

Something I always try to do when I'm coding is be consistent. I feel this is important. While people's coding standards may differ, I think different approaches are easier to handle if someone has been consistent with their style across all of their code.

This also stands for documentation too.

In my current position, I do a lot of Python coding, and one of the things I like about Python (there are things I don't like too, but that's not for now) is that it has doc-strings (just like my favourite language). I use them extensively, ensuring every function and method has some form of documentation, and generally I use Sphinx to generate documentation from those doc-strings.

Early on I was bothered by the fact that, just by the simple act of making typos, I wasn't keeping the form of the doc-strings consistent. And in this case it was a really simple thing that was bugging me. Normally, if I'm writing a single-line doc-string, I'll write like this:

def one_liner():
    """Here is a one-line doc-string."""

So far, so good. But, if the doc-string is a multi-liner, I prefer the ending quotes to be on a line of their own, like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.
    """"

But, sometimes, by accident, I'd leave a doc-string like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.""""

While it's really not a big deal, it would bug me and every time I found one like this I'd "fix" it.

Eventually, it bugged me enough that I decided I was going to write a little tool to find all such instances in my code and report them. My first approach was to think "I could just do this with some regexp magic", which was really a bad idea. Then I though, I know, I should use this as an excuse to to play with Python's ast library.

That worked really well! I had the first version of the code up and running in no time. It was simple but did the job. It ran through Python code I threw at it and alerted me to both missing doc-strings, and doc-strings with the ending I didn't like.

That served me for a while, until one day I realised that it wasn't quite doing the job correctly; it was only really looking at top-level functions and top-level methods in classes. Sometimes, not often, but sometimes, I'll define functions within functions, and I feel they deserve documentation too. So then I modified the code to ensure it walked every part of the AST.

Since then, when I've run into new things and had new ideas, pydscheck has grown and grown. I've added checks that all mentioned parameters have a type; I've added checks that any function/method that returns something actually documents the return value; I've added checks that any documentation of a returned value includes its type; I've added checks that any function or method that yields a value documents that fact; I've added checks that ensure that every parameter is documented in some way.

Each time I've done this it's helped uncover issues in my code's documentation that could be cleaner, and it's also given me a pet project to slowly better understand Python's AST.

It could be that there are better tools out there, I'd have thought that a good doc-string linting tool would be something someone had already written. But this time around I was happy to NIH it because I needed a fun learning exercise that would also have some benefits for my day-to-day work.

I'll caveat this with the fact that it's very particular to how I work and how I like my documentation to look, but if it sounds useful, here it is: https://github.com/davep/pydscheck.

There's still lots I could do with it. First off I should really properly package it up so it can be installed as a command line tool via pip. Other things that would be handy would be to allow some form of customisation of how it works. I'm sure there's other fun things I can do with it too.

That's part of the fun of having a pet project: you can tinker when you like and also get benefits from it as you use it.


pypath.el -- A little Emacs hack to help with Django

Posted on 2019-10-19 10:35 +0100 in Emacs • Tagged with Emacs, Python, Django, Lisp • 2 min read

One of the things I really like about coding with Emacs is how I can easily identify a repeated task and turn it into a command in my environment, saving me a load of work down the line.

pypath.el is one such example.

In my day job I write a lot of Django code. As part of that, I write a good number of unit tests too. Sometimes I'll write the tests as I'm writing the code they test, other times I'm writing them afterwards; it all really depends on where my head's at and how the code is flowing.

When I'm writing those tests, I often want to test them as I go. Given that starting up a test session can take a while, and given that running all the tests in the system can take a while, it's really handy if I can run that single test I'm working on.

This is easy enough with Django. In my work environment it's normally something like:

$ pipenv run ./manage.py test -v 2 app.test.some.sub.module.TestClass.test_method

Only... typing out the:

app.test.some.sub.module.TestClass.test_method

part is a bit of a pain. Sure, once you've typed it the once you can use your shell of choice (mine being fish and on occasion eshell) to recall it from history, but typing it out the first time is the annoying part.

So this was the point where I took 1/2 hour or so to code up pypath.el to solve the problem for me. It gives me two commands:

  • pypath: which simply places the dotted path of the current "defun", within the context of being part of a Django system, into the clipboard buffer.
  • pypath-django-test: which works similar to the above but places the whole Django testing command into the clipboard.

With the above, I can work on a test, hit the latter command above, flip to my command line, paste and I'm running the test.

Of course, I'm sure there's plenty of other handy ways to do this. Doubtless there's work environments where the test can be run right there, in the edit buffer, without flipping away, and which takes into account the fact that there's a pipenv-managed virtual environment involved, etc. If there is, that's great, but I don't think it'd work with how I work.

And that's one of the things I really love about Emacs, and why it's still my work environment after almost 25 years of on and off use: with very little work on my part I can create a couple of commands that work exactly how I need them to. While it's great to create generally-useful code for Emacs that lots of people benefit from, sometimes the real value is that you can code up your own particular quirk and just get on with stuff.

To conclude: this post isn't to show off pypath.el; really this post is to sing the praises of Emacs and why it still works so well for me after all these years.


A little speed issue with openpyxl

Posted on 2018-06-02 13:16 +0100 in Python • Tagged with Python, openpyxl • 4 min read

It's been very quiet on the blogging front, I'm afraid, mostly for the reasons I wrote about back in December last year. In that time I've been really very busy with work (in a good way, in a very good way) and there's not a whole lot of time to be toying with pet projects at home.

However, finding myself with a spare hour or so, I wanted to write about something I did run into as part of some development at work, and which I thought might be worth writing about in case it helps someone else.

Recently I've needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of openpyxl. The initial prototype code I wrote worked well and it soon grew into a full-blown library that'll be used in a couple of work-related projects.

But one thing kept niggling me... It just wasn't as fast as I'd expected. The workbooks I'm pulling data from aren't that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks... even the fan on the machine would ramp up.

It didn't seem right.

I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an xlsx file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.

Given that I had other code to be writing, and given that the workbook-parsing code was "good enough" for the moment, I moved on for a short while.

But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues and almost everything I found said that the common problem was failing to open the workbook in read_only mode. That can't have been my problem because I'd being doing that from the very start.

Eventually I came across a post somewhere (sorry, I've lost it for now -- I'll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion being that every time you pull a value form a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.

This seemed a little unlikely to me -- I'd have expected the code to cache the parsing results or something like that -- but it also would explain what I was seeing. So I decided to give it a test.

openpyxl-speed-issue is a version of the tests I wrote and ran and they absolutely show that there's a huge difference between cell-by-cell access vs generator access.

Code like this:

for row in range( 1, sheet.max_row + 1 ):
    for col in range( 0, sheet.max_column ):
        value = sheet[ row ][ col ].value

is far slower than something like this:

for row in wb[ "Test Sheet" ].rows:
    for cell in row:
        value = cell.value

Here's an example of the difference in time, as seen on my iMac:

$ make test
pipenv run time ./read-using-generators
        1.59 real         0.44 user         0.04 sys
pipenv run time ./read-using-peeking
       25.02 real        24.88 user         0.10 sys

As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.

In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never have noticed this. However, the nature of the workbooks I need to pull data from means I need to "peek ahead" to make decisions about what I'm doing, so a more traditional loop over, with an index, made more sense.

I can easily "fix" this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want and benefit from using generators.

In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment I saw that suggested it was this wasn't exactly easy to find, I thought I'd write it all down too and create a repository of some test code to illustrate the issue. Hopefully someone else will benefit from this in the future.