• A little speed issue with openpyxl

    It’s been very quiet on the blogging front, I’m afraid, mostly for the reasons I wrote about back in December last year. In that time I’ve been really very busy with work (in a good way, in a very good way) and there’s not a whole lot of time to be toying with pet projects at home.

    However, finding myself with a spare hour or so, I wanted to write about something I did run into as part of some development at work, and which I thought might be worth writing about in case it helps someone else.

    Recently I’ve needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of openpyxl. The initial prototype code I wrote worked well and it soon grew into a full-blown library that’ll be used in a couple of work-related projects.

    But one thing kept niggling me… It just wasn’t as fast as I’d expected. The workbooks I’m pulling data from aren’t that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks… even the fan on the machine would ramp up.

    It didn’t seem right.

    I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an xlsx file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.

    Given that I had other code to be writing, and given that the workbook-parsing code was “good enough” for the moment, I moved on for a short while.

    But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues and almost everything I found said that the common problem was failing to open the workbook in read_only mode. That can’t have been my problem because I’d being doing that from the very start.

    Eventually I came across a post somewhere (sorry, I’ve lost it for now – I’ll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion being that every time you pull a value form a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.

    This seemed a little unlikely to me – I’d have expected the code to cache the parsing results or something like that – but it also would explain what I was seeing. So I decided to give it a test.

    openpyxl-speed-issue is a version of the tests I wrote and ran and they absolutely show that there’s a huge difference between cell-by-cell access vs generator access.

    Code like this:

    for row in range( 1, sheet.max_row + 1 ):
        for col in range( 0, sheet.max_column ):
            value = sheet[ row ][ col ].value
    

    is far slower than something like this:

    for row in wb[ "Test Sheet" ].rows:
        for cell in row:
            value = cell.value
    

    Here’s an example of the difference in time, as seen on my iMac:

    $ make test
    pipenv run time ./read-using-generators
            1.59 real         0.44 user         0.04 sys
    pipenv run time ./read-using-peeking
           25.02 real        24.88 user         0.10 sys
    

    As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.

    In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never have noticed this. However, the nature of the workbooks I need to pull data from means I need to “peek ahead” to make decisions about what I’m doing, so a more traditional loop over, with an index, made more sense.

    I can easily “fix” this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want and benefit from using generators.

    In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment I saw that suggested it was this wasn’t exactly easy to find, I thought I’d write it all down too and create a repository of some test code to illustrate the issue. Hopefully someone else will benefit from this in the future.

  • On to something new

    Today is a fairly significant day for me. For the past 21 (and a bit) years I’ve worked for the same company. I know that, for lots of people in my industry, that’s a hell of a long time to be in one position. It was a setup that served me pretty well; the company was small (I’m not much of a big company person), I generally got to dictate the development tools and direction, and I also got to work from home.

    The working from home part was especially helpful as other life events involved me and happened around me. The job I’ve been in has seen a marriage come and go, and a move from Hampshire to Lincolnshire, and then from Lincolnshire to Midlothian. There’s been a few adventures along the way.

    But, three months ago, I got a call to say that I was to be made redundant. This sucked, obviously. It also appeared pretty stupid on the part of my employer: a company that is based around software had decided it no longer wanted an in-house software developer. A bold, and I think unwise, choice. From this point on they’ve decided to only go with contract developers and only contract developers from overseas (or so the story goes).

    As it goes, things have turned out okay. Today is my last day with my old employer and, as I’m sat here typing this out while having some lunch, I’m actually glad that it’s coming to a close and that I can get to move on.

    I have a new employer, and will be starting in my new position during the first week of January. It’ll be very different. It’ll all be different. Not only will I be back working in an office (one where jeans and t-shirts are the norm, thankfully!), I’m also going to be working in an industry (as a developer still, of course) that I have no background in and no real knowledge of. It’s going to be a seriously exciting challenge.

    New book

    Over the next couple of weeks I’ve got a fair amount of background reading (and video watching) to be doing. This isn’t so much that I can get on with the job of development (although there will be a good bit of that to do too – who doesn’t want to be trying to understand new options for development?), it’s more about understanding the language of the industry I’ll be in and so that I can understand the needs of my users.

    Goodbye fashion retail and all related things. Hello genetics and… well, I don’t know, I’m going to find out. :)

  • More revamping of my Emacs config

    I’ve been pretty quiet on here since I last wrote about how I’d done a further revamp of my Emacs config, so I thought that subject would be a good reason to write another blog post.

    It’ll be a mostly short one, and one to muse over something that’s been bugging me for a while now: my decision to lean heavily on customize to set all sorts of settings.

    Initially, when I nuked my original config over a year ago, it seemed to make a lot of sense. Let all the tweaks and set values “hide” in a file of their own and try and keep the hand-edited config files as small and as clean as possible. Recently though I’ve got to thinking that this obscures too much, hides too much detail, and removes the ability to actually document what I’m doing and why. It also does make it tricky to adapt some settings to different platforms or even environments on a single platform.

    Another problem I’ve run into is this: when I made the second round of changes and decided to lean heavily on use-package, I soon ran into the minor issue of some packages not making sense, or even being needed, on some platforms (stuff that’s useful on my macOS machines isn’t always useful on my Windows machines, that sort of thing). While use-package can handle this easily thanks to the :if keyword, I’m still left with the fact that package-selected-packages still gets populated.

    Having package-selected-packages contain a list of installed packages likely makes sense if you’re using just the Emacs package system and you’re not doing the installing with use-package and :ensure. But with use-package and :ensure I feel like I’ve got far more control over things and can adapt what gets installed when depending on which Emacs I’m running where.

    But, because I’m syncing my ~/.emacs.d/.custom.el to all my machines too, any use-package that has a :if to not bother using a package has little effect because the package still ends up being listed/loaded/seen as part of the installation.

    Ideally, I think, I’d like to be able to have package-selected-packages held in its own file, or I’d only ever use ~/.emacs.d/.custom.el for local stuff (and so stop syncing it).

    Starting today I’m going about a process of moving as much as I can out of ~/.emacs.d/.custom.el and into hand-edited files. In some respects I guess I am going back to how I used to manage Emacs configuration, but this time it’s not a massive monolithic file-of-Lisp, it’s neatly broken down into sensible sections and it’s also biased towards a “grab and config this package” approach.

    Meanwhile, I’ve not seen any good discussions online about customize vs “hand-edit”, which strikes me as a little odd as it feels like the perfect “religious issue” for people to have endless disagreements over. I guess, over the next couple or so weeks, I’ll find out if switching back was a good idea.

  • Another revamp of my emacs config

    Just under a year ago I decided to totally rewrite my GNU emacs config. As I wrote at the time, it’d been following me around all sorts of machines since the early 1990s, starting life on an OS/2 Warp machine and travelling via MS-DOS, GNU/Linux, Windows and, these days, macOS.

    The changes I made last year have served me really well, but there were two related issues with it that bothered me a little: the fact that I was maintaining a local library of elisp code in the repository and, worse still, I was storing the packages I’d installed from elpa and melpa in the repository as well.

    While this did mean it was pretty easy for me to start up a new installation of emacs on a machine – all I had to do was clone the repo and run up emacs – I wasn’t happy with the duplication involved. I didn’t like holding code in my .emacs.d repo that was already held in package archives.

    The solution I saw was in two parts:

    1. Get some of my code, that might be useful to others, into melpa.
    2. Somehow sort my own package archive for my personal code.

    Over the past week or so I’ve worked on this approach. It initially started with me tackling item 1 above: I tidied up and submitted obfusurl.el, protocols.el, services.el, thinks.el and uptimes.el. This was a really helpful process in that it allowed me to brush up on my elisp and emacs knowledge. It’s a good 15+ years since I last wrote any significant elisp code and things have moved on a little in that time.

    Having done that I’d managed to move a handful of my own packages out of my local library of code, and so out of my .emacs.d repo, but it left me with the problem of what to do with the rest of it.

    That’s when I discovered package-x and:

    ,----[ C-h f package-upload-buffer RET ]
    | package-upload-buffer is an interactive compiled Lisp function in
    | ‘package-x.el’.
    |
    | (package-upload-buffer)
    |
    | Upload the current buffer as a single-file Emacs Lisp package.
    | If ‘package-archive-upload-base’ does not specify a valid upload
    | destination, prompt for one.
    `----
    

    (plus package-upload-file too, of course). This meant I could, in effect, start my own personal package archive and look at tackling issue 2 above.

    This did give me one small problem though: how and where would I host the archive? I did consider hosting it on a DigitalOcean droplet, but that felt a little like overkill for something so simple. And then I realised: GitHub Pages! All I needed to do was keep the package archive in its own repo (which I would have done anyway) and then make the whole repo the source for a GitHub Pages site. A quick test later and… it worked!

    So, by this point, I’d farmed some of my code off to melpa, and now had the rest of it in “delpa” (which I’d called my personal archive). I could now use the emacs package management system to install third party packages and also my own.

    But I was still left with one issue: I was still holding the installed packages inside my .emacs.d repo by way of ensuring that all machines were in sync in terms of what was installed. Now I needed to work out how to solve that.

    Around this time, as luck would have it, @tarsius had suggested I look at a package called use-package by @jwiegley. This was the bit I was missing.

    With use-package I would be able to declare which packages I needed, how they’d be installed and, most important of all, it could be set to handle the fact that the package wasn’t even installed. If a package is requested and there is no local install use-package is smart enough to get the emacs package system to install it.

    So, given that, all I need to do was create a startup file that would declare the packages I use and I’d have a setup that should, once I’d cloned .emacs.d, self-install.

    Except… yeah, one more issue. use-package isn’t part of GNU emacs yet so I’d need a method of getting it to auto-install so it could then handle everything else. As it was that was as easy as adding this to the start of my init.el.

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;; Make sure the package system is up and running early on.
    (require 'package)
    (add-to-list 'package-archives '("melpa" . "http://melpa.org/packages/"))
    (add-to-list 'package-archives '("delpa" . "http://blog.davep.org/delpa/"))
    (package-initialize)
    
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;; Bootstrap `use-package'
    (unless (package-installed-p 'use-package)
      (package-refresh-contents)
      (package-install 'use-package))
    

    With that in place I was able to nuke all my config on a machine, clone a fresh copy of .emacs.d (having now ceased tracking and storing the installed packages in that repo), run up emacs, wait a few moments and then find that everything was installed and ready to use.

    Perfect!

    My .emacs.d is now a lot smaller than it was before and, I think, even easier to maintain. Right now I think I’m very close to the ideal emacs config that I wanted to create when I did the complete rewrite a year ago.

  • I want to like Gboard

    I want to like Gboard. On paper it looks really rather good. It’s a keyboard from Google, it ties in with your account, it syncs things, it has clever searching for emoji and gifs and the like… what’s not to like?

    Problem is, I’ve been a user of SwiftKey since around 2011 (I think it was). I’m very used to how SwiftKey works and it also contains a lot of handy things. I like that it has smart completion, that it learns how I type a bit skewed and that it takes this into account, that I can turn off the fancy swipe typing and instead make use of handy gestures like swipe-left to delete a word. I like some of the themes a lot.

    Into the mix comes my iPad, which I use on occasion. The standard Apple keyboard is horrible and, sadly, I find SwiftKey on iOS just as frustrating. It seems to lack enough key features there (especially the word deletion gesture, as far as I can tell) that it’s also a bit annoying. My dream of a consistent typing experience across all devices just wasn’t happening – until I found Gboard on iOS.

    That felt almost right. And from what I could tell it worked almost exactly the same on iOS and Android. So it felt like a good time to try and force myself to use Gboard on my Google Pixel and Nexus 7.

    Sadly, though, I’m just not getting on with it. It’s okay. It’s not bad. It’s just… not good. I’m finding that it lacks enough useful things that it’s a frustrating experience. Little things like: when I enter Google Search, there’s no word completion in the keyboard (SwiftKey has that); the word deletion gesture (swipe left from the backspace key) seems very hit-and-miss; the most obvious completion for a word sometimes appears in the middle slot but, other times, in the left slot. And so on.

    Nothing huge. Nothing that’s a show-stopper. But a handful of a little things that make me miss the comfortable home that is SwiftKey.

    Don’t get me wrong, it does have some very handy and cleaver features too. The searching for emoji – including showing them up as word completions – is rather clever. The gif-search thing is all kinds of fun too (mostly used to annoy the hell out of my son on twitter).

    None of those quite make up for the bits I miss from SwiftKey though.

    All that said, I’ve being making a point of pushing on with Gboard, thinking that most of my issues might just be because I’m too used to my “old home”. Mostly this was working well, until I noticed something this morning. While reading the description for Gboard I noticed this handy thing in the “Pro Tips” section:

    Sync your learned words across devices to improve suggestions (enable in Gboard Settings→ Dictionary → Sync learned words).

    Useful! I’d assumed that this was the case anyway – it’s Google after all – but it’s good to know I can ensure it’s turned on. So I went to turn it on. This is what I found:

    Gboard WTF

    What the hell Google? Sure, I do have a Gsuite account on my phone – as in various apps have access to a Gsuite account (Gmail, Drive, etc…) – but it’s not the primary account on my phone and it’s not the account I’d really want to be doing the dictionary sync with anyway. If I’ve got dictionary sync I want it tied to the keyboard no matter the app I’m in, and no matter the account I’m using in that app. I want the keyboard to be tied to a specific account when it comes to sync (just like SwiftKey does it).

    This, I think, is a show-stopper for me.

    I can overlook the other niggles, I can learn to cope with it not being quite so perfect in some situations; but the blanket inability to do something as simple as cloud-sync the predictions and learn from how I type – things that are, these days, central to what Google’s about – it’s frankly stupid.

    I guess I’m going to have to keep Gboard as a backup keyboard for those times when I need to find the perfect gif.

    Google WTF

subscribe via RSS