On to something new (redux)

Posted on 2022-10-05 09:24 +0100 in Life • Tagged with coding, work, life, Python, news • 4 min read

Just over five years ago I got a message from my then employer to say I was going to be made redundant after 21 years working for them. After the 3 month notice period the final day came. Meanwhile, I found something new that looked terrifying but interesting. In the end it was less terrifying and way more interesting than I imagined it would be. It was fun too.

But... (there's always a but isn't there?)

In the four and change years I've been there the company got bought out, and then the result of that got bought up. As I've mentioned before I'm generally not a "big company" kind of person; in all my years I've found that I'm happier working in a smaller place. After a couple of buyouts my employer had gone from being 10s of people in size to 100s of people in size (and technically 10s of 1,000s of people in size depending on how you look at it).

This change in ownership and size meant the culture became... well, let's just say not as friendly as you tend to enjoy when it's a smaller group of folk. On top of that I was starting to notice that my efforts were making less of an impact as things got bigger, and I started to feel like my contributions weren't really relevant any more. There were some problematic things happening too: undermining of efforts, removal of responsibilities without consultation or communication, that sort of thing. Plus worse. There's little point in going into the detail, but it's fair to say that work wasn't as fun as it used to be.

That felt like a good time to start to look around. If work makes you feel unhappy and you can look around... look around.

Thing is, I wasn't sure what to look for. I was in the comfortable position of, unlike last time, not needing to find something, so I could take my time at least. Over the course of the last year I've spoken to many different companies and organisations, some big (yes, I know, I said I don't like big places -- sometimes what's on offer deserves a fair hearing), some small, but none of them quite said "this feels like me". In some cases the whole thing didn't have the right vibe, in others the industry either didn't interest me, or felt uncomfortable given my personal values. In one particular case a place looked interesting until I checked the CTO's socials and OMG NO NO NO AVOID AVOID (that was a fun one).

Then I saw Will McGugan saying he was hiring to expand Textualize. This caught my interest right away for two good reasons.

I can't remember how long I've been following Will on Twitter; I likely stumbled on him as I got back into Python in 2018 and I also remember noting that he was a Python hacker just up the road from me. We'd vaguely chatted on Twitter, briefly, in that "Twitter acquaintance" way we all often do (I remember one brief exchange about fungus on The Meadows), and he'd seemed like a good sort. A small company run by a "good sort" kinda person felt like a damn good reason.

The second reason was Textual itself. I'd been watching Will develop it, in open, with great interest. I had (and still have) a plan to write a brand new CHUI-based (okay fine TUI-based as the kids like to say these days!) Norton Guide reader, all in Python, and Textual looked like the perfect framework to do the UI in. The chance to be involved with it sounded awesome.

Now, I said two reasons, but there's also a third I guess: Will's pitch for applying to Textualize felt so damn accessible! I'm on the older end of the age range of this industry; for much of my working life as a developer I've worked in isolation from other developers; while I first touched Python in the 90s, I've only been using it in anger since 2018 and still feel like I've got a lot to learn. Despite all these things, and more, saying "aye Dave this is beyond you" I felt comfortable dropping Will a line.

Which resulted in a chat.

Which resulted in some code tinkering and chatting.

Which resulted in...

Something new.

So, yeah, as of 2022-10-10 I'm on yet another new adventure. Time for me to really work on my Python coding as I work with Will and the rest of the team as part of Textualize.

Or, as I put it on Twitter a few days ago: I'm going to be a Python impostor syndrome speedrunner!


The PEP 8 hill I will die on

Posted on 2020-08-23 16:54 +0100 in Python • Tagged with Python • 3 min read

I first learnt Python back in the mid-to-late 90s, used it in place of Perl once I was comfortable with it, and then we sort of drifted apart when I first met Ruby. It's only in the last couple of years that I've got back into it, and in a huge way, thanks to my (not-quite-so-) new job. Despite the quirks and oddness (as I perceive them), I actually quite like Python and it's one of those languages that just flows off my fingers. I'm sure you know the same thing, perhaps not with Python, but there will be languages that just flow for you, and those that take a bit more effort and concentration. Python... feels okay to me.

I also appreciate that there's been a long-standing style guide. I quite like PEP 8 as a read, and think there's a lot of good ideas in there; much of the content sits with how I'd approach things if I was tasked to come up with such a document. With this in mind, I'm a fairly heavy user of pylint and it in turn leans on PEP 8 (amongst other things) and I'm happy to accept most of its judgements. Not all of its judgements, but even when I disagree with it I try and keep track of how far I'm drifting.

But there is absolutely one hill I will happily die on when it comes to PEP 8: the concept of "extraneous whitespace" in lists and expressions. Just.... no! Oh gods no!

To borrow a line of code from the journey problem I dabbled with a while back, PEP 8 would have me write something like this:

def perform(commands: List[str], state: State) -> State:

Now, I'm sure plenty of people won't see a problem with this at all; but all I can see is an almost-claustrophobic parameter list. What's with the parameters being jammed up against the opening and closing parens? Why have the dinky little comma lost between two different things? Why have it look like a long stream of letters and punctuation? Why....

No.

Just no.

I can't.

Rightly or wrongly, I just need for the code to breathe a bit. When I type this:

def perform( commands: List[ str ], state: State ) -> State:

suddenly it feels like there's fresh air in the code, like it flows gently out of my head, off my fingers, through the keyboard and into the buffer.

In my head, and to my eyes, the code is.... relaxed.

Do I have a rational reason for this? Nope. Then again I don't see one for doing it the other way either; I can't think of one and I don't see one in the source document. So, that's a warning I always turn off with pylint and it's a style I carry through all my Python code; and I think that's the important point here: anyone reading and working with my code should see the same style all the way through. It might differ from PEP 8 on this point, but at least it's the same all the way.
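For what it's worth, the choice really is cosmetic as far as Python itself is concerned. Here's a minimal sketch, using only the standard library's ast module, showing that both spellings of the signature parse to the same syntax tree. The same_code helper and the throwaway "return state" bodies are made up for illustration; they aren't from the journeys code.

```python
import ast

# The two spellings of the signature from above, each given a throwaway
# body so that it parses as a complete function definition.
TIGHT = "def perform(commands: List[str], state: State) -> State:\n    return state\n"
AIRY = "def perform( commands: List[ str ], state: State ) -> State:\n    return state\n"


def same_code(first: str, second: str) -> bool:
    """Return True if two pieces of source parse to the same AST."""
    # ast.dump leaves out line and column information by default, so only
    # the structure of the code is compared, not its layout.
    return ast.dump(ast.parse(first)) == ast.dump(ast.parse(second))


print(same_code(TIGHT, AIRY))
```

So whichever side of the hill you stand on, the interpreter doesn't care; it's purely about what the reader's eyes prefer.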

And, really, that's okay: PEP 8 is there to be ignored. ;-)

PS: This is a small part of another blog post I was meaning to write, and might still do, about my (still ongoing) experience of getting lsp-mode up and running in Emacs and having it play nice with Python projects. I have that working, but it was a bit of a learning curve and epic battle over a couple of days, and one that had me first encounter pycodestyle. I may still tell the tale...


When the man page fibs

Posted on 2020-07-10 20:58 +0100 in Coding • Tagged with homebrew, macOS, Unix, Python • 3 min read

Earlier this week something in my development environment, relating to Homebrew, Python, pyenv and pipenv, got updated and broke a handful of repositories. Not in a way that I couldn't recover from, just in a way that was annoying, got in the way of my workflow, and needed attention. (note to self: how I set up for Python/Django development on a machine might be a good post in the future)

Once I was sure what the fix was (pretty much: nuke the virtual environment and recreate it with pipenv, being very explicit about the version of Python to use) the next step was to figure out how many repositories were affected; not all were and there wasn't an obvious pattern to it. What was obvious was that the problem came down to python in the .venv directory pointing to a binary that didn't exist any more.

Screenshot 2020-07-10 at 20.21.15.png

So... tracking down problematic repositories would be simple enough, just look for every instance of .venv/bin/python and be sure it points to something rather than nothing; if it points to nothing I need to remake the virtual environment.

I quickly knocked up a script that was based around looking over the results of a find, and initially decided to use file to perform the test on python. It seemed to make sense: as I wrote the script I checked the man page for file(1) on macOS and, sure enough, this exists:

-E On filesystem errors (file not found etc), instead of handling the error as regular output as POSIX mandates and keep going, issue an error message and exit.

Given that file dereferences links by default, that should get me an error for a broken link, right? Bit hacky I guess, but it was the first thing that came to mind for a quick bit of scripting and would do the trick. Only...

$ file -E does-not-exist
file: invalid option -- E
Usage: file [bcCdEhikLlNnprsvzZ0] [-e test] [-f namefile] [-F separator] [-m magicfiles] [-M magicfiles] file...
       file -C -m magicfiles
Try `file --help' for more information.

Wat?!? But it's right there! It says so in the manual! -E is documented right in the manual page! And yet it's not in the valid switch list as put out by the command, and it's an invalid option. The hell?

So I go back and look at the man page again and then I notice it isn't in the list of switches in the synopsis.

SYNOPSIS
file [-bcdDhiIkLnNprsvz] [--extension] [--mime-encoding] [--mime-type] [-f namefile] [-m magicfiles] [-P name=value] [-M magicfiles] file
file -C [-m magicfiles]
file [--help]

I then did the obvious tests. Did I have file aliased in some way? No. Was some other thing that works and acts like file in my path? No. Was I absolutely 100% using /usr/bin/file? Yes.

Long story short: it seems the man page for file, on macOS, fibs about what switches it supports; it says that -E is a valid option, but it's not there.

What's even odder is the man page says it documents v5.04 of the command, but --version reports v5.37. Meanwhile, if I check on a GNU/Linux box I have access to, it does support -E, reports it in the switches, documents it in the man page (in both the synopsis and in the main body of the page) and it is v5.25 (and so is its man page).

So that was something like 20 minutes lost to a very small problem, for which there was no real solution, but was time that had to be spent to get to the bottom of it.

In the end I went with what I probably should have gone with in the first place: stat -L.

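# Report each directory whose .venv/bin/python points at nothing.
# Reading find's output a line at a time copes with spaces in paths.
find . -name .venv | while IFS= read -r venv
do
    # stat -L follows the link, so it fails if the target is missing.
    if ! stat -L "$venv/bin/python" > /dev/null 2>&1
    then
        dirname "$venv"
    fi
done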

And now I have that script in my ~/bin directory, ready for the next time Homebrew and friends conspire to throw my day off for a while.
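The same check could also be done in Python. This is only a rough sketch, not the script I actually use: the broken_venvs name is made up for the example, and it leans on the fact that Path.exists() follows symbolic links.

```python
from pathlib import Path


def broken_venvs(root: str = ".") -> list[Path]:
    """Return directories under root whose .venv has a dangling python."""
    broken = []
    for venv in Path(root).rglob(".venv"):
        # Path.exists() follows symbolic links, so a link that points at a
        # binary that has gone away is reported as not existing.
        if venv.is_dir() and not (venv / "bin" / "python").exists():
            broken.append(venv.parent)
    return sorted(broken)
```

Calling broken_venvs(".") and printing each result gives the same sort of list of repositories that need their virtual environment remaking.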


git2gantt -- Simple tool to visualise coding runs

Posted on 2019-12-08 13:44 +0000 in Python • Tagged with Python, documentation • 3 min read

At the start of this year, as part of a much bigger process to review the work that had taken place over the previous 12 months, I was asked (at work) to provide some information about how much time I'd spent on various projects. Now, for me, there's really only one project, but there's lots of different tools and libraries that I've written to support the main work I do. All of these are split into different repositories in the company-internal instance of GitLab. This meant that getting a rough idea of what I was working on and when would be easy enough -- it's all there in the commit history.

Given that this information would make up a couple of slides at most during a far bigger presentation, I wanted something that would be snappy and easy for non-developers to follow and understand. I spent a bit of time pondering some options and decided that (ab)using a gantt chart layout would make sense.

That choice was made all the easier given that GitLab supports the use of mermaid charts within its Markdown. This meant I could quickly write some code that took the git log of each repository, turned it into mermaid code, and then render it (by hand, this was all about getting things done quickly) via GitLab.

This sounded like it could be a fun personal project. The result was some Python code called git2gantt.

As mentioned above, the output isn't anything too clever, it's just code that can be used to create a plot via mermaid. For example, running git2gantt over itself:

gantt
  title git2gantt output
  dateFormat YYYY-MM-DD

  section git2gantt
  Development: devgit2gantt20190208, 2019-02-08, 2019-02-13
  Development: devgit2gantt20190214, 2019-02-14, 2019-02-15
  Development: devgit2gantt20190303, 2019-03-03, 2019-03-04
  Development: devgit2gantt20191203, 2019-12-03, 2019-12-04

Usage is pretty straightforward:

Screenshot 2019-12-08 at 13.18.12.png

As you can see, it can be run over multiple repos at once, and there's also an option to have it consider every branch within each repository. Another handy option is the ability to limit the output to just one author -- perhaps you just want to document what you've done on a repo, not the contributions of other people.

Also especially handy, if you don't want to bore people with too much detail, is the "fuzz" option. This lets you tell git2gantt how relaxed you want it to be when it tries to decide how long a run of work on a repo lasted. So, perhaps, you're working on and off on a library that supports some other system you're documenting, but you might only be making changes every other day or so. With the correct fuzz value you can make it clear you were working on the library for a couple of weeks, despite there only being a commit every other day.
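To give a feel for the idea, here's a rough sketch of how grouping commit dates into runs with a fuzz value might look. It is not the actual git2gantt code: the group_runs and to_mermaid names are made up, and I'm assuming here that fuzz means the largest gap, in days, between two commit dates that still counts as the same run.

```python
from datetime import date


def group_runs(days, fuzz=1):
    """Group commit dates into (start, end) runs of work.

    A gap of up to `fuzz` days between two commit dates is treated as
    being part of the same run.
    """
    runs = []
    for day in sorted(set(days)):
        if runs and (day - runs[-1][1]).days <= fuzz:
            runs[-1][1] = day  # Close enough: extend the current run.
        else:
            runs.append([day, day])  # Too big a gap: start a new run.
    return [tuple(run) for run in runs]


def to_mermaid(name, runs):
    """Emit mermaid gantt code for the runs, in the shape shown earlier."""
    lines = ["gantt", "  dateFormat YYYY-MM-DD", "", f"  section {name}"]
    for start, end in runs:
        lines.append(f"  Development: dev{name}{start:%Y%m%d}, {start}, {end}")
    return "\n".join(lines)


commits = [date(2019, 2, 8), date(2019, 2, 10), date(2019, 2, 12), date(2019, 3, 3)]
print(to_mermaid("example", group_runs(commits, fuzz=2)))
```

With a fuzz of 2 the three February commits, each two days apart, collapse into a single run, while the March commit stands on its own; with a fuzz of 1 each of them would be its own run.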

An example of the output from running it over a handful of projects would look something like this:

Screenshot 2019-12-08 at 13.34.41.png

This is one of those tools I knocked up quickly to get a job done, and haven't quite got round to finishing off fully. One thing I'd really like to do is add mermaid support directly within it, so that it actually has the option to emit plots, not just mermaid code (or, perhaps, drop the mermaid approach and use something else entirely).

Meanwhile though, if you're looking for something quick and dirty that will help you visualise what you've been working on and when for a good period of time... perhaps this will help.


Going on a journey

Posted on 2019-11-10 14:32 +0000 in Python • Tagged with Python • 3 min read

It's hardly a revelation to say that learning a new programming language, or even learning software development at all, is even more difficult if you don't have an actual problem to solve. I know I'm not alone in having pet projects for this: when faced with a new environment, I'll code up a version of one of them as a way to get familiar with and understand a language's idioms while implementing something I know well.

Personally, my two favourites are a puzzle called 5x5 (here, here, here, here, here, here and here), and writing a library or even a full application to read Norton Guide database files (here, here, here, here, here, here, here and here). Both are fun to work on, have practical uses, and both have the benefit of being solved problems (for me) that let me concentrate on the "how do I do X in this language/toolkit/environment/framework/etc?".

Even with those two as my go-to projects, I'm always open to new small problems that might be fun to apply to languages I do know, or languages I want to get to know (internally at work we have a fun "league" of sorts, writing a particular Hamming distance calculation tool in different languages, for example).

A few days ago, via this repo on GitHub, I discovered this fun little problem. Right away I could see the benefit in it. As a "go away and code up a solution" interview question it strikes me as near-perfect. It's obviously not hard to solve, but it touches on some basic but important aspects of software development and so will allow the developer to show off how they approach things.

There's so many different approaches to it too. Even in a single language, I could imagine having some fun writing the smallest code to solve the problem, the most idiomatic code to solve the problem, the most supportable and well-documented code to solve the problem, etc. And then there's the thing I talk about above: knowing the solution and knowing it's easy, you can then use it to learn the idiomatic way of solving the problem in new languages.

Even better, the README of the original repo links to solutions others have written. Knowing the problem, and knowing the solution, you can then go and read other people's code and learn something about different styles and different languages.

Over the next few weeks, as I get free time, I think I might just do this. Take the "Journeys" problem and write versions in different languages I work with, or know, and also use it to get to know languages I've yet to know or use heavily (I'm especially keen to try a version in Julia -- a language I really like the look of and want to find a reason to use).

Meanwhile, yesterday, I had a quick go at a first version in Python (aimed at Python 3.8 or higher): https://github.com/davep/journeys.py

I set out to try and write something that was fairly idiomatic Python, which used tools that I tend to employ when working on Python projects (pipenv, make, etc), and which also used something I've never quite found a need for so far in my usual coding, but which I can see being useful and helpful.

I even threw in a couple of uses of PEP 572!

I can see me tinkering with this some more over the next few days. I can even see me writing a very different implementation in Python, just for the fun of it.

I think that's what I like about this little problem. It's a good way to do a bit of programming exercise; it's like the perfect way to do the programming equivalent of going for a short run.


My Pylint shame

Posted on 2019-11-04 20:39 +0000 in Python • Tagged with Python, fish • 3 min read

I first got into Python in the mid-to-late 1990s. It's so far back that I think the copy of Programming Python that I have (sadly in storage at the moment) might be a first edition. I probably fell out of the habit of using Python some time in the early 2000s (that was when I met Ruby). It was only 22 months ago that I started using Python a lot thanks to a change of employer.

As you might imagine, much had changed in the 15+ years since I'd last written a line of Python in anger. So, early on, I made a point of making Pylint part of my development process. All my projects have a make lint make target. All of my projects lint the code when I push to master in the company GitLab instance. These days I even use flycheck to keep me honest as I write my code; mostly gone are the days where I don't know of problems until I do a make lint.

Leaning on Pylint in the early days of my new position made for a great Python refresher for me. Now, I still lean on it to make sure I don't make daft mistakes.

But...

Pylint and I don't always agree. And that's fine. For example, I really can't stand Pylint's approach to whitespace, and that is a hill I'll happily die on. Ditto the obsession with lines being no more than 80 characters wide (120 should be fine thanks). As such any project's .pylintrc has, as a bare minimum, this:

[FORMAT]
max-line-length=120

[MESSAGES CONTROL]
disable=bad-whitespace

Beyond that though, aside from one or two extras that pertain to particular projects, I'm happy with what Pylint complains about.

There are exceptions though. There are times, simply due to the nature of the code involved, that Pylint's insistence on code purity isn't going to work. That's where I use its inline block disabling feature. It's handy and helps keep things clean (I won't deploy code that doesn't pass 10/10), but there is always this nagging doubt: if I've disabled a warning in the code, am I ever going to come back and revisit it?

To help me think about coming back to such disables now and again, I thought it might be interesting to write a tool that'll show which warnings I disable most. It resulted in this fish abbr:

abbr -g pylintshame "rg --no-messages \"pylint:disable=\" | awk 'BEGIN{FS=\"disable=\";}{print \$2}' | tr \",\" \"\n\" | sort | uniq -c | sort -hr"

The idea here being that it produces a "Pylint hall of shame", something like this:

  12 wildcard-import
  12 unused-wildcard-import
   8 no-member
   6 invalid-name
   5 no-self-use
   4 import-outside-toplevel
   4 bare-except
   2 unused-argument
   2 too-many-public-methods
   2 too-many-instance-attributes
   2 not-callable
   2 broad-except
   1 wrong-import-position
   1 wrong-import-order
   1 unused-variable
   1 unexpected-keyword-arg
   1 too-many-locals
   1 arguments-differ

To break the pipeline down:

rg --no-messages "pylint:disable="

First off, I use ripgrep (if you don't, you might want to have a good look at it -- I find it amazingly handy) to find everywhere in the code in and below the current directory (the --no-messages switch just stops any file I/O errors that might result from permission issues -- they're not interesting here) that contains a line that has a Pylint block disable (if you tend to format yours differently, you'll need to tweak the regular expression, of course).

I then pipe it through awk:

awk 'BEGIN{FS="disable=";}{print $2}'

so I can lazily extract everything after the disable=.

Next up, because it's a possible list of things that can be disabled, I use tr:

tr "," "\n"

to turn any comma-separated list into multiple lines.

Having got to this point, I sort the list, uniq the result, while prepending a count (-c), and then sort the result again, in reverse and sorting the numbers based on how a human would read the result (-hr).

sort | uniq -c | sort -hr

It's short, sweet and hacky, but does the job quite nicely. From now on, any time I get curious about which disables I'm leaning on too much, I can use this to take stock.
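If you'd rather not lean on a shell pipeline, much the same count can be sketched in Python. This isn't something I actually use; the pylint_shame name is made up, and unlike the rg call above the pattern here also allows a space after the colon.

```python
import re
from collections import Counter

# Matches "pylint:disable=" (a space after the colon is allowed too) and
# captures the rest of the line.
DISABLE = re.compile(r"pylint:\s*disable=(.*)")


def pylint_shame(lines):
    """Count the Pylint messages disabled in the given lines of code."""
    counts = Counter()
    for line in lines:
        found = DISABLE.search(line)
        if found:
            # A single comment can disable a comma-separated list.
            names = (name.strip() for name in found.group(1).split(","))
            counts.update(name for name in names if name)
    # Most-disabled first, like the final sort in the pipeline.
    return counts.most_common()
```

Feed it the lines of every Python file in a project and the result is the same kind of hall of shame, as a list of (message, count) pairs.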


pydscheck -- A quick hack that keeps slowly growing

Posted on 2019-10-26 13:19 +0100 in Python • Tagged with Python, documentation • 3 min read

Something I always try to do when I'm coding is be consistent. I feel this is important. While people's coding standards may differ, I think different approaches are easier to handle if someone has been consistent with their style across all of their code.

This stands for documentation too.

In my current position, I do a lot of Python coding, and one of the things I like about Python (there are things I don't like too, but that's not for now) is that it has doc-strings (just like my favourite language). I use them extensively, ensuring every function and method has some form of documentation, and generally I use Sphinx to generate documentation from those doc-strings.

Early on I was bothered by the fact that, just by the simple act of making typos, I wasn't keeping the form of the doc-strings consistent. And in this case it was a really simple thing that was bugging me. Normally, if I'm writing a single-line doc-string, I'll write like this:

def one_liner():
    """Here is a one-line doc-string."""

So far, so good. But, if the doc-string is a multi-liner, I prefer the ending quotes to be on a line of their own, like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.
    """

But, sometimes, by accident, I'd leave a doc-string like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line."""

While it's really not a big deal, it would bug me and every time I found one like this I'd "fix" it.

Eventually, it bugged me enough that I decided I was going to write a little tool to find all such instances in my code and report them. My first approach was to think "I could just do this with some regexp magic", which was really a bad idea. Then I thought, I know, I should use this as an excuse to play with Python's ast library.

That worked really well! I had the first version of the code up and running in no time. It was simple but did the job. It ran through Python code I threw at it and alerted me to both missing doc-strings, and doc-strings with the ending I didn't like.

That served me for a while, until one day I realised that it wasn't quite doing the job correctly; it was only really looking at top-level functions and top-level methods in classes. Sometimes, not often, but sometimes, I'll define functions within functions, and I feel they deserve documentation too. So then I modified the code to ensure it walked every part of the AST.

Since then, when I've run into new things and had new ideas, pydscheck has grown and grown. I've added checks that all mentioned parameters have a type; I've added checks that any function/method that returns something actually documents the return value; I've added checks that any documentation of a returned value includes its type; I've added checks that any function or method that yields a value documents that fact; I've added checks that ensure that every parameter is documented in some way.

Each time I've done this it's helped uncover issues in my code's documentation that could be cleaner, and it's also given me a pet project to slowly better understand Python's AST.

It could be that there are better tools out there; I'd have thought that a good doc-string linting tool would be something someone had already written. But this time around I was happy to NIH it because I needed a fun learning exercise that would also have some benefits for my day-to-day work.

I'll caveat this with the fact that it's very particular to how I work and how I like my documentation to look, but if it sounds useful, here it is: https://github.com/davep/pydscheck.

There's still lots I could do with it. First off I should really properly package it up so it can be installed as a command line tool via pip. Other things that would be handy would be to allow some form of customisation of how it works. I'm sure there's other fun things I can do with it too.

That's part of the fun of having a pet project: you can tinker when you like and also get benefits from it as you use it.


pypath.el -- A little Emacs hack to help with Django

Posted on 2019-10-19 10:35 +0100 in Emacs • Tagged with Emacs, Python, Django, Lisp • 2 min read

One of the things I really like about coding with Emacs is how I can easily identify a repeated task and turn it into a command in my environment, saving me a load of work down the line.

pypath.el is one such example.

In my day job I write a lot of Django code. As part of that, I write a good number of unit tests too. Sometimes I'll write the tests as I'm writing the code they test, other times I'm writing them afterwards; it all really depends on where my head's at and how the code is flowing.

When I'm writing those tests, I often want to test them as I go. Given that starting up a test session can take a while, and given that running all the tests in the system can take a while, it's really handy if I can run that single test I'm working on.

This is easy enough with Django. In my work environment it's normally something like:

$ pipenv run ./manage.py test -v 2 app.test.some.sub.module.TestClass.test_method

Only... typing out the:

app.test.some.sub.module.TestClass.test_method

part is a bit of a pain. Sure, once you've typed it the once you can use your shell of choice (mine being fish and on occasion eshell) to recall it from history, but typing it out the first time is the annoying part.

So this was the point where I took 1/2 hour or so to code up pypath.el to solve the problem for me. It gives me two commands:

  • pypath: which simply places the dotted path of the current "defun", within the context of being part of a Django system, into the clipboard buffer.
  • pypath-django-test: which works similarly to the above but places the whole Django testing command into the clipboard.

With the above, I can work on a test, hit the latter command above, flip to my command line, paste and I'm running the test.
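The real thing is Emacs Lisp and works from wherever point is in the current buffer, but the core idea is simple enough to sketch in Python. The function names here are made up, and the sketch assumes it's handed a file path relative to the root of the Django project.

```python
from pathlib import PurePosixPath


def dotted_path(filename, *names):
    """Turn a file path relative to the project root into a dotted path."""
    # Drop the .py extension and join the directory parts with dots.
    module = ".".join(PurePosixPath(filename).with_suffix("").parts)
    return ".".join((module, *names))


def django_test_command(filename, *names):
    """Build the full command line for running a single test."""
    return f"pipenv run ./manage.py test -v 2 {dotted_path(filename, *names)}"


print(django_test_command("app/test/some/sub/module.py", "TestClass", "test_method"))
```

That prints the same command as shown earlier, ready to paste into a shell.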

Of course, I'm sure there's plenty of other handy ways to do this. Doubtless there's work environments where the test can be run right there, in the edit buffer, without flipping away, and which takes into account the fact that there's a pipenv-managed virtual environment involved, etc. If there is, that's great, but I don't think it'd work with how I work.

And that's one of the things I really love about Emacs, and why it's still my work environment after almost 25 years of on and off use: with very little work on my part I can create a couple of commands that work exactly how I need them to. While it's great to create generally-useful code for Emacs that lots of people benefit from, sometimes the real value is that you can code up your own particular quirk and just get on with stuff.

To conclude: this post isn't to show off pypath.el; really this post is to sing the praises of Emacs and why it still works so well for me after all these years.


A little speed issue with openpyxl

Posted on 2018-06-02 13:16 +0100 in Python • Tagged with Python, openpyxl • 4 min read

It's been very quiet on the blogging front, I'm afraid, mostly for the reasons I wrote about back in December last year. In that time I've been really very busy with work (in a good way, in a very good way) and there's not a whole lot of time to be toying with pet projects at home.

However, finding myself with a spare hour or so, I wanted to write about something I did run into as part of some development at work, and which I thought might be worth writing about in case it helps someone else.

Recently I've needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of openpyxl. The initial prototype code I wrote worked well and it soon grew into a full-blown library that'll be used in a couple of work-related projects.

But one thing kept niggling me... It just wasn't as fast as I'd expected. The workbooks I'm pulling data from aren't that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks... even the fan on the machine would ramp up.

It didn't seem right.

I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an xlsx file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.

Given that I had other code to be writing, and given that the workbook-parsing code was "good enough" for the moment, I moved on for a short while.

But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues and almost everything I found said that the common problem was failing to open the workbook in read_only mode. That can't have been my problem because I'd been doing that from the very start.

Eventually I came across a post somewhere (sorry, I've lost it for now -- I'll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion being that every time you pull a value from a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.

This seemed a little unlikely to me -- I'd have expected the code to cache the parsing results or something like that -- but it also would explain what I was seeing. So I decided to give it a test.

openpyxl-speed-issue is a version of the tests I wrote and ran and they absolutely show that there's a huge difference between cell-by-cell access vs generator access.

Code like this:

for row in range( 1, sheet.max_row + 1 ):
    for col in range( 0, sheet.max_column ):
        value = sheet[ row ][ col ].value

is far slower than something like this:

for row in wb[ "Test Sheet" ].rows:
    for cell in row:
        value = cell.value

Here's an example of the difference in time, as seen on my iMac:

$ make test
pipenv run time ./read-using-generators
        1.59 real         0.44 user         0.04 sys
pipenv run time ./read-using-peeking
       25.02 real        24.88 user         0.10 sys

As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.

In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never have noticed this. However, the nature of the workbooks I need to pull data from means I need to "peek ahead" to make decisions about what I'm doing, so a more traditional loop, with an index, made more sense.

I can easily "fix" this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want and benefit from using generators.
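As a rough sketch of that fix (not the code from my work library; the load_grid and peek names are made up, and it keeps the cell values rather than the cell objects), it could look something like this:

```python
def load_grid(rows):
    """Read every cell value in one pass, as a list of lists."""
    # `rows` is anything that yields rows of objects with a .value
    # attribute, such as the rows generator of an openpyxl worksheet.
    return [[cell.value for cell in row] for row in rows]


def peek(grid, row, col, default=None):
    """Look at any cell by zero-based index without re-reading the sheet."""
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return default
```

With a workbook opened via openpyxl's load_workbook in read_only mode, something like load_grid(wb["Test Sheet"].rows) pays the parsing cost once, after which peeking ahead is just a list lookup.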

In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment I saw that suggested it was this wasn't exactly easy to find, I thought I'd write it all down too and create a repository of some test code to illustrate the issue. Hopefully someone else will benefit from this in the future.