Python and macOS

Posted on 2022-11-05 08:49 +0000 in Python • Tagged with Python, macOS, coding • 5 min read

Introduction

On Twitter, a few weeks back, @itsBexli asked me how I go about setting up Python for development on macOS. It's a great question, and one that seems to crop up in various places; given that I first got into using macOS and subsequently got back into coding a lot in Python, it's absolutely an issue I ran into myself.

At my previous employer, while I was the only developer, I wasn't the only person writing code, and more than one other person ran into this issue, so I eventually wrote up my approach to solving it. That document lives on their internal GitLab, but I thought it worth writing my personal macOS/Python "rules" up again, just in case they're useful to anyone else.

I am, of course, not the first person to tackle this, to document this, to write down a good approach to it. Both before and after I settled on my approach I'd seen other people write about this. So... this post isn't here to try and replace those, it's simply to write down my own approach, so that if anyone asks again I can point them here. I feel it's important to stress: this isn't the only way, and these aren't the only thoughts on the issue; there are lots of others. Do go read them too, and then settle on an approach that works for you.

One other point to note, which may or may not make a difference (and may or may not affect how this changes with time): for the past few years I've been a heavy user of pipenv to manage my virtual environments. This is very likely to change from now on, but keep in mind that what follows was arrived at from the perspective of a pipenv user.

So... with that admin aside...

The Problem

When I first got back into writing Python it was on macOS and, really early on, I ran into all the usual issues: virtual environments breaking because they were based on the system Python and it got updated, virtual environments breaking because they were based on the Homebrew-installed Python and that got updated, and so on. Simply put: an occasional, annoying, non-show-stopping breakage of my development environment, which would distract me when I'd sat down to hack on some code, not do system admin!

My Solution

What's worked for me without a problem over the past few years, in short, is this:

  1. NEVER use the system version of Python. Just don't.
  2. NEVER use Homebrew's own version of Python (I'm not even sure this is an issue any more; but it used to be).
  3. NEVER use a version of Python installed with Homebrew (or, more to the point, never install Python with Homebrew).
  4. Manage everything with pyenv; which I do install from Homebrew.

The Detail

As mentioned earlier, what I'm writing here assumes that virtual environments are being managed with pipenv (something I still do for personal projects, for now, but this may change soon). Your choices and mileage may vary, etc... This is what works well for me.

The "one time" items

These are the items that need initially installing into a new macOS machine:

Homebrew

Unless it comes from the Mac App Store, I try and install everything via Homebrew. It's really handy for keeping track of what I've got installed, for recreating a development environment in general, and for keeping things up to date.

pyenv

With Homebrew installed the next step for me is to install pyenv. Doing so is as easy as:

$ brew install pyenv

Once installed, if the install process hasn't already done it for you, you may need to make some changes to your login profile. I'm a user of fish, so I have a line or two like the following in my setup. Simply put: they ask pyenv to set up my environment so that my calls to Python go via its setup.
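Something like this, that is -- although the exact incantation has changed over time, so treat what follows as the shape of the thing rather than gospel:

# In ~/.config/fish/config.fish: hand control of the Python
# environment over to pyenv.
pyenv init - | source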

Plenty of help on how to set up pyenv can be found in its README.

Once I've done this I tend to go on and install the Python versions I'm likely to need. For me this tends to be the most recent "active" stable versions (as of the time of writing, 3.7 through 3.10; although I need to now start throwing 3.11 into the mix too).

I use this command:

$ pyenv install --list

to see the available versions. If I want to see what's available for a specific version I'll pipe the output through fgrep:

$ pyenv install --list | fgrep "  3.9"
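The two leading spaces in the pattern are there to match pyenv's indented listing; the output looks something like this (trimmed here -- the exact versions you'll see depend on when you run it):

  3.9.0
  3.9.1
  ...
  3.9.15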

This is handy if I want to check what the very latest release of a specific version of Python is.

The "Global" Python

When I'm done with the above I then tend to use pyenv to set my "global" Python. This is the version I want to get when I run python and I'm not inside a virtual environment. Doing that is as simple as:

$ pyenv global 3.10.7

Of course, you'd swap the version for whatever works for you.
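If you want to sanity-check the result, pyenv and Python should now agree with each other (the version and path here are illustrative, of course):

$ pyenv version
3.10.7 (set by /Users/davep/.pyenv/version)
$ python --version
Python 3.10.7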

When Stuff Breaks

It seems rarer these days but, on occasion, I've had some update somewhere break my environment anyway. Now though I find that all it takes is a quick:

$ pyenv rehash

and everything is good again.

Setting Up A Repo

With all of the stuff above being mostly a one-off (or at least something I do once when I set up a new machine -- which isn't often), the real "work" here comes when I set up a fresh repository for a new project. Really it's no work at all. Again, as I've said before, I've tended to use pipenv for my own work, and still do for personal stuff (but do want to change that); your mileage may vary here depending on your tool of choice.

When I start a new project I think about which Python version I want to be working with, I ensure I have the latest version of it installed with pyenv, and then ask pipenv to create a new virtual environment with that:

$ pipenv --python 3.10.7

When you do this, you should see pipenv pulling the version of Python from the pyenv directories:

$ pipenv --python 3.10.7
Creating a virtualenv for this project...
Pipfile: /Users/davep/temp/cool-project/Pipfile
Using /Users/davep/.pyenv/versions/3.10.7/bin/python3 (3.10.7) to create virtualenv...
⠙ Creating virtual environment...created virtual environment CPython3.10.7.final.0-64 in 795ms
  creator CPython3Posix(dest=/Users/davep/temp/cool-project/.venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/davep/Library/Application Support/virtualenv)
    added seed packages: pip==22.2.2, setuptools==65.3.0, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
✔ Successfully created virtual environment!
Virtualenv location: /Users/davep/temp/cool-project/.venv
Creating a Pipfile for this project...

The key thing here is seeing that pipenv is pulling Python from ~/.pyenv/versions/. If it isn't, there's a good chance you have a Python earlier in your PATH than the pyenv one -- something you generally don't want. It will work, but it's more likely to break at some point in the future. That's the key thing I look for; if I see that, I know things are going to be okay.
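A quick way of checking which Python will win is to ask the shell for every python3 on the PATH; the pyenv shim should come first (the output here is illustrative):

$ which -a python3
/Users/davep/.pyenv/shims/python3
/usr/bin/python3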

Conclusion

Since I adopted these personal rules and approaches (and really, calling them "rules" is kind of grand -- there's almost nothing to this) I've found I've had near-zero issues with the stability of my Python virtual environments (and what issues I have run into tend to be trivial and of my own doing).

As I said at the start: there are, of course, other approaches to this, but this is mine and works well for me. Do feel free to comment with your own, I'm always happy to learn about new ways!


The PEP 8 hill I will die on

Posted on 2020-08-23 16:54 +0100 in Python • Tagged with Python • 3 min read

I first learnt Python back in the mid-to-late 90s, used it in place of Perl once I was comfortable with it, and then we sort of drifted apart when I first met Ruby. It's only in the last couple of years that I've got back into it, and in a huge way, thanks to my (not-quite-so-) new job. Despite the quirks and oddness (as I perceive them), I actually quite like Python, and it's one of those languages that just flows off my fingers. I'm sure you know the feeling, perhaps not with Python, but there will be languages that just flow for you, and those that take a bit more effort and concentration. Python... feels okay to me.

I also appreciate that there's been a long-standing style guide. I quite like PEP 8 as a read, and think there are a lot of good ideas in there; much of the content aligns with how I'd approach things if I were tasked with coming up with such a document. With this in mind, I'm a fairly heavy user of pylint, which in turn leans on PEP 8 (amongst other things), and I'm happy to accept most of its judgements. Not all of its judgements, but even when I disagree with it I try and keep track of how far I'm drifting.

But there is absolutely one hill I will happily die on when it comes to PEP 8: the concept of "extraneous whitespace" in lists and expressions. Just.... no! Oh gods no!

To borrow a line of code from the journey problem I dabbled with a while back, PEP 8 would have me write something like this:

def perform(commands: List[str], state: State) -> State:

Now, I'm sure plenty of people won't see a problem with this at all; but all I can see is an almost-claustrophobic parameter list. What's with the parameters being jammed up against the opening and closing parens? Why is the dinky little comma lost between two different things? Why does it have to look like one long stream of letters and punctuation? Why....

No.

Just no.

I can't.

Rightly or wrongly, I just need the code to breathe a bit. When I type this:

def perform( commands: List[ str ], state: State ) -> State:

suddenly it feels like there's fresh air in the code, like it flows gently out of my head, off my fingers, through the keyboard and into the buffer.

In my head, and to my eyes, the code is.... relaxed.

Do I have a rational reason for this? Nope. Then again I don't see one for doing it the other way either; I can't think of one and I don't see one in the source document. So, that's a warning I always turn off with pylint and it's a style I carry through all my Python code; and I think that's the important point here: anyone reading and working with my code should see the same style all the way through. It might differ from PEP 8 on this point, but at least it's the same all the way.

And, really, that's okay: PEP 8 is there to be ignored. ;-)

PS: This is a small part of another blog post I was meaning to write, and might still do, about my (still ongoing) experience of getting lsp-mode up and running in Emacs and having it play nice with Python projects. I have that working, but it was a bit of a learning curve and epic battle over a couple of days, and one that had me first encounter pycodestyle. I may still tell the tale...


git2gantt -- Simple tool to visualise coding runs

Posted on 2019-12-08 13:44 +0000 in Python • Tagged with Python, documentation • 3 min read

At the start of this year, as part of a much bigger process to review the work that had taken place over the previous 12 months, I was asked (at work) to provide some information about how much time I'd spent on various projects. Now, for me, there's really only one project, but there's lots of different tools and libraries that I've written to support the main work I do. All of these are split into different repositories in the company-internal instance of GitLab. This meant that getting a rough idea of what I was working on and when would be easy enough -- it's all there in the commit history.

Given that this information would make up a couple of slides at most during a far bigger presentation, I wanted something that would be snappy and easy for non-developers to follow and understand. I spent a bit of time pondering some options and decided that (ab)using a gantt chart layout would make sense.

That choice was made all the easier given that GitLab supports the use of mermaid charts within its Markdown. This meant I could quickly write some code that took the git log of each repository, turned it into mermaid code, and then rendered it (by hand -- this was all about getting things done quickly) via GitLab.

This sounded like it could be a fun personal project. The result was some Python code called git2gantt.

As mentioned above, the output isn't anything too clever; it's just code that can be used to create a plot via mermaid. For example, running git2gantt over itself:

gantt
  title git2gantt output
  dateFormat YYYY-MM-DD

  section git2gantt
  Development: devgit2gantt20190208, 2019-02-08, 2019-02-13
  Development: devgit2gantt20190214, 2019-02-14, 2019-02-15
  Development: devgit2gantt20190303, 2019-03-03, 2019-03-04
  Development: devgit2gantt20191203, 2019-12-03, 2019-12-04

Usage is pretty straightforward:

[Screenshot: git2gantt's command-line usage]

As you can see, it can be run over multiple repos at once, and there's also an option to have it consider every branch within each repository. Another handy option is the ability to limit the output to just one author -- perhaps you just want to document what you've done on a repo, not the contributions of other people.

Also especially handy, if you don't want to bore people with too much detail, is the "fuzz" option. This lets you tell git2gantt how relaxed you want it to be when it tries to decide how long a run of work on a repo lasted. So, perhaps, you're working on and off on a library that supports some other system you're documenting, but you might only be making changes every other day or so. With the correct fuzz value you can make it clear you were working on the library for a couple of weeks, despite there only being a commit every other day.
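For illustration, an invocation might look something like this -- although I'm writing the flag from memory here, so treat the exact name as an assumption and check the tool's help output:

$ git2gantt --fuzz 2 ~/develop/my-library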

An example of running the output over a handful of projects would look something like this:

[Screenshot: git2gantt output for a handful of projects]

This is one of those tools I knocked up quickly to get a job done, and haven't quite got round to finishing off fully. One thing I'd really like to do is add mermaid support directly within it, so that it actually has the option to emit plots, not just mermaid code (or, perhaps, drop the mermaid approach and use something else entirely).

Meanwhile though, if you're looking for something quick and dirty that will help you visualise what you've been working on and when for a good period of time... perhaps this will help.


Going on a journey

Posted on 2019-11-10 14:32 +0000 in Python • Tagged with Python • 3 min read

It's hardly a revelation to say that learning a new programming language, or even learning software development at all, is even more difficult if you don't have an actual problem to solve. I know I'm not alone in having pet projects that, when faced with a new environment, I'll code up a version of that project as a way to get familiar with and understand a language's idioms while implementing something I know well.

Personally, my two favourites are a puzzle called 5x5 (here, here, here, here, here, here and here), and writing a library or even a full application to read Norton Guide database files (here, here, here, here, here, here, here and here). Both are fun to work on, have practical uses, and both have the benefit of being solved problems (for me) that let me concentrate on the "how do I do X in this language/toolkit/environment/framework/etc?".

Even with those two as my goto projects, I'm always open to new small problems that might be fun to apply to languages I do know, or languages I want to get to know (internally at work we have a fun "league" of sorts, writing a particular hamming distance calculation tool in different languages, for example).

A few days ago, via this repo on GitHub, I discovered this fun little problem. Right away I could see the benefit in it. As a "go away and code up a solution" interview question it strikes me as near-perfect. It's obviously not hard to solve, but it touches on some basic but important aspects of software development and so will allow the developer to show off how they approach things.

There's so many different approaches to it too. Even in a single language, I could imagine having some fun writing the smallest code to solve the problem, the most idiomatic code to solve the problem, the most supportable and well-documented code to solve the problem, etc. And then there's the thing I talk about above: knowing the solution and knowing it's easy, you can then use it to learn the idiomatic way of solving the problem in new languages.

Even better, the README of the original repo links to solutions others have written. Knowing the problem, and knowing the solution, you can then go and read other people's code and learn something about different styles and different languages.

Over the next few weeks, as I get free time, I think I might just do this. Take the "Journeys" problem and write versions in different languages I work with, or know, and also use it to get to know languages I've yet to know or use heavily (I'm especially keen to try a version in Julia -- a language I really like the look of and want to find a reason to use).

Meanwhile, yesterday, I had a quick go at a first version in Python (aimed at Python 3.8 or higher): https://github.com/davep/journeys.py

I set out to try and write something that was fairly idiomatic Python, which uses tools that I tend to employ when working on Python projects (pipenv, make, etc), and which also used something I've never quite found a need for so far in my usual coding, but which I can see being useful and helpful.

I even threw in a couple of uses of PEP 572!
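If PEP 572 hasn't crossed your path yet, it's the proposal that gave Python assignment expressions -- the "walrus" operator. A tiny, made-up illustration of the idea (nothing to do with the journeys code itself):

# PEP 572: bind a name to a value as part of a larger expression.
import random

while (roll := random.randint(1, 6)) != 6:
    print(f"Rolled a {roll}; rolling again...")
print("Finally! A six.")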

I can see me tinkering with this some more over the next few days. I can even see me writing a very different implementation in Python, just for the fun of it.

I think that's what I like about this little problem. It's a good way to do a bit of programming exercise; it's like the perfect way to do the programming equivalent of going for a short run.


My Pylint shame

Posted on 2019-11-04 20:39 +0000 in Python • Tagged with Python, fish • 3 min read

I first got into Python in the mid-to-late 1990s. It's so far back that I think the copy of Programming Python that I have (sadly in storage at the moment) might be a first edition. I probably fell out of the habit of using Python some time in the early 2000s (that was when I met Ruby). It was only 22 months ago that I started using Python a lot thanks to a change of employer.

As you might imagine, much had changed in the 15+ years since I'd last written a line of Python in anger. So, early on, I made a point of making Pylint part of my development process. All my projects have a make lint make target. All of my projects lint the code when I push to master in the company GitLab instance. These days I even use flycheck to keep me honest as I write my code; mostly gone are the days where I don't know of problems until I do a make lint.

Leaning on Pylint in the early days of my new position made for a great Python refresher for me. Now, I still lean on it to make sure I don't make daft mistakes.

But...

Pylint and I don't always agree. And that's fine. For example, I really can't stand Pylint's approach to whitespace, and that is a hill I'll happily die on. Ditto the obsession with lines being no more than 80 characters wide (120 should be fine, thanks). As such, any project's .pylintrc has, as a bare minimum, this:

[FORMAT]
max-line-length=120

[MESSAGES CONTROL]
disable=bad-whitespace

Beyond that though, aside from one or two extras that pertain to particular projects, I'm happy with what Pylint complains about.

There are exceptions though. There are times, simply due to the nature of the code involved, that Pylint's insistence on code purity isn't going to work. That's where I use its inline block disabling feature. It's handy and helps keep things clean (I won't deploy code that doesn't pass 10/10), but there is always this nagging doubt: if I've disabled a warning in the code, am I ever going to come back and revisit it?
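To be clear about what I mean, here's a made-up example of the sort of inline disable this is all about (the function is obviously just a stand-in):

def risky_operation():
    """A stand-in for something that can fail in many ways."""
    raise ValueError("Something went wrong")

try:
    risky_operation()
except Exception:  # pylint:disable=broad-except
    print("Recovering as best we can")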

To help me think about coming back to such disables now and again, I thought it might be interesting to write a tool that'll show which warnings I disable most. It resulted in this fish abbr:

abbr -g pylintshame "rg --no-messages \"pylint:disable=\" | awk 'BEGIN{FS=\"disable=\";}{print \$2}' | tr \",\" \"\n\" | sort | uniq -c | sort -hr"

The idea here being that it produces a "Pylint hall of shame", something like this:

  12 wildcard-import
  12 unused-wildcard-import
   8 no-member
   6 invalid-name
   5 no-self-use
   4 import-outside-toplevel
   4 bare-except
   2 unused-argument
   2 too-many-public-methods
   2 too-many-instance-attributes
   2 not-callable
   2 broad-except
   1 wrong-import-position
   1 wrong-import-order
   1 unused-variable
   1 unexpected-keyword-arg
   1 too-many-locals
   1 arguments-differ

To break the pipeline down:

rg --no-messages "pylint:disable="

First off, I use ripgrep (if you don't, you might want to have a good look at it -- I find it amazingly handy) to find every line in the code, in and below the current directory, that contains a Pylint block disable (the --no-messages switch just silences any file I/O errors that might result from permission issues -- they're not interesting here). If you tend to format your disables differently, you'll need to tweak the regular expression, of course.

I then pipe it through awk:

awk 'BEGIN{FS="disable=";}{print $2}'

so I can lazily extract everything after the disable=.

Next up, because each disable can actually be a comma-separated list of things, I use tr:

tr "," "\n"

to turn any comma-separated list into multiple lines.

Having got to this point, I sort the list, uniq the result while prepending a count (-c), and then sort again, in reverse, with the numbers treated as a human would read them (-hr).

sort | uniq -c | sort -hr

It's short, sweet and hacky, but does the job quite nicely. From now on, any time I get curious about which disables I'm leaning on too much, I can use this to take stock.


pydscheck -- A quick hack that keeps slowly growing

Posted on 2019-10-26 13:19 +0100 in Python • Tagged with Python, documentation • 3 min read

Something I always try to do when I'm coding is be consistent. I feel this is important. While people's coding standards may differ, I think different approaches are easier to handle if someone has been consistent with their style across all of their code.

This also stands for documentation too.

In my current position, I do a lot of Python coding, and one of the things I like about Python (there are things I don't like too, but that's not for now) is that it has doc-strings (just like my favourite language). I use them extensively, ensuring every function and method has some form of documentation, and generally I use Sphinx to generate documentation from those doc-strings.

Early on I was bothered by the fact that, just by the simple act of making typos, I wasn't keeping the form of the doc-strings consistent. And in this case it was a really simple thing that was bugging me. Normally, if I'm writing a single-line doc-string, I'll write it like this:

def one_liner():
    """Here is a one-line doc-string."""

So far, so good. But, if the doc-string is a multi-liner, I prefer the ending quotes to be on a line of their own, like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line.
    """"

But, sometimes, by accident, I'd leave a doc-string like this:

def multi_liner():
    """Here is the first line.
    Here is another line.
    Here is the final line."""

While it's really not a big deal, it would bug me and every time I found one like this I'd "fix" it.

Eventually, it bugged me enough that I decided I was going to write a little tool to find all such instances in my code and report them. My first approach was to think "I could just do this with some regexp magic", which was really a bad idea. Then I thought: I know, I should use this as an excuse to play with Python's ast library.

That worked really well! I had the first version of the code up and running in no time. It was simple but did the job. It ran through Python code I threw at it and alerted me to both missing doc-strings, and doc-strings with the ending I didn't like.

That served me for a while, until one day I realised that it wasn't quite doing the job correctly; it was only really looking at top-level functions and top-level methods in classes. Sometimes, not often, but sometimes, I'll define functions within functions, and I feel they deserve documentation too. So then I modified the code to ensure it walked every part of the AST.
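To give a flavour of the approach, here's a stripped-down sketch of the idea -- this isn't pydscheck's actual code, just the core of what I've described, built on ast.walk:

"""A minimal sketch of an ast-based doc-string checker."""

import ast
import sys

def check(source: str, filename: str) -> None:
    """Walk every node, complaining about absent or badly-ended doc-strings."""
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docstring = ast.get_docstring(node, clean=False)
            if docstring is None:
                print(f"{filename}:{node.lineno}: {node.name} has no doc-string")
            elif "\n" in docstring and docstring.splitlines()[-1].strip():
                print(f"{filename}:{node.lineno}: {node.name} ends its doc-string inline")

for name in sys.argv[1:]:
    with open(name) as handle:
        check(handle.read(), name)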

Since then, when I've run into new things and had new ideas, pydscheck has grown and grown. I've added checks that all mentioned parameters have a type; I've added checks that any function/method that returns something actually documents the return value; I've added checks that any documentation of a returned value includes its type; I've added checks that any function or method that yields a value documents that fact; I've added checks that ensure that every parameter is documented in some way.

Each time I've done this it's helped uncover issues in my code's documentation that could be cleaner, and it's also given me a pet project to slowly better understand Python's AST.

It could be that there are better tools out there, I'd have thought that a good doc-string linting tool would be something someone had already written. But this time around I was happy to NIH it because I needed a fun learning exercise that would also have some benefits for my day-to-day work.

I'll caveat this with the fact that it's very particular to how I work and how I like my documentation to look, but if it sounds useful, here it is: https://github.com/davep/pydscheck.

There's still lots I could do with it. First off I should really properly package it up so it can be installed as a command line tool via pip. Other things that would be handy would be to allow some form of customisation of how it works. I'm sure there's other fun things I can do with it too.

That's part of the fun of having a pet project: you can tinker when you like and also get benefits from it as you use it.


A little speed issue with openpyxl

Posted on 2018-06-02 13:16 +0100 in Python • Tagged with Python, openpyxl • 4 min read

It's been very quiet on the blogging front, I'm afraid, mostly for the reasons I wrote about back in December last year. In that time I've been really very busy with work (in a good way, in a very good way) and there's not a whole lot of time to be toying with pet projects at home.

However, finding myself with a spare hour or so, I wanted to write about something I did run into as part of some development at work, and which I thought might be worth writing about in case it helps someone else.

Recently I've needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of openpyxl. The initial prototype code I wrote worked well and it soon grew into a full-blown library that'll be used in a couple of work-related projects.

But one thing kept niggling me... It just wasn't as fast as I'd expected. The workbooks I'm pulling data from aren't that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks... even the fan on the machine would ramp up.

It didn't seem right.

I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an xlsx file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.

Given that I had other code to be writing, and given that the workbook-parsing code was "good enough" for the moment, I moved on for a short while.

But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues, and almost everything I found said that the common problem was failing to open the workbook in read_only mode. That can't have been my problem because I'd been doing that from the very start.

Eventually I came across a post somewhere (sorry, I've lost it for now -- I'll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion was that every time you pull a value from a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.

This seemed a little unlikely to me -- I'd have expected the code to cache the parsing results or something like that -- but it would also explain what I was seeing. So I decided to give it a test.

openpyxl-speed-issue is a version of the tests I wrote and ran, and they absolutely show that there's a huge difference between cell-by-cell access and generator access.

Code like this:

for row in range( 1, sheet.max_row + 1 ):
    for col in range( 0, sheet.max_column ):
        value = sheet[ row ][ col ].value

is far slower than something like this:

for row in wb[ "Test Sheet" ].rows:
    for cell in row:
        value = cell.value

Here's an example of the difference in time, as seen on my iMac:

$ make test
pipenv run time ./read-using-generators
        1.59 real         0.44 user         0.04 sys
pipenv run time ./read-using-peeking
       25.02 real        24.88 user         0.10 sys

As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.

In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never noticed this. However, the nature of the workbooks I need to pull data from means I need to "peek ahead" to make decisions about what I'm doing, so a more traditional loop, with an index, made more sense.

I can easily "fix" this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want and benefit from using generators.

In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment suggesting the cause wasn't exactly easy to find, I thought I'd write it all down too and create a repository of some test code to illustrate the issue. Hopefully someone else will benefit from this in the future.