Posts tagged with "coding"

textual-canvas v0.2.0

1 min read

Demo of textual-canvas

Given that for a good chunk of this year I've been a bit lax about writing here, there's a couple or so coding projects I've not written about (well, not on here anyway -- I have spoken lots about them over on Fosstodon). One such project is textual-canvas.

As the name might suggest, it's a "canvas" for Textual applications, which provides a pretty basic interface for drawing pixels, lines and circles -- and of course any other shape you are able to build up from those basics.

I've just released a quick update after it was requested that I add a clear method to the Canvas widget; a request that makes perfect sense.

A new GitHub profile README

2 min read

My new GitHub banner

Ever since GitHub introduced the profile README1 I've had a massively low-effort one in place. I made the repo, quickly wrote the file, and then sort of forgot about it. Well, I didn't so much forget as just keep looking at it and thinking "I should do something better with that one day".

Thing is, while there are lots of fancy approaches out there, and lots of neat generator tools and the like... they just weren't for me.

Then yesterday, over my second morning coffee, after getting my blog environment up and going again, I had an idea. It could be cool to use Textual's screenshot facility to make something terminal-themed! I mean, while it's not all I am these days, so much of what I'm doing right now is aimed at the terminal.

So... what to do? Then I thought it could be cool to knock up some sort of login screen type thing; with a banner. One visit to an online large terminal text generator site later, I had some banner text. All that was left was to write a simple Textual application to create the "screen".

The main layout is simple enough:

def compose(self) -> ComposeResult:
    yield Label(NAME, classes="banner")
    yield Label(PRATTLE)
    yield Label("github.com/davep login: [reverse] [/]")

where NAME contains the banner and PRATTLE contains the "login message". With some Textual CSS sprinkled over it to give the exact layout and colour I wanted, all that was left was to make the snapshot. This was easy enough too.

While the whole thing isn't fully documented just yet, Textual does have a great tool for automatically running an application and interacting with it; that meant I could easily write a function to load up my app and save the screenshot:

async def make_banner() -> None:
    async with GitHubBannerApp().run_test() as pilot:
        pilot.app.save_screenshot("davep.svg")

Of course, that needs running async, but that's simple enough:

if __name__ == "__main__":
    asyncio.run(make_banner())

Throw in a Makefile so I don't forget what I'm supposed to run:

.PHONY: all
all:
    pipenv run python make_banner.py

and that's it! Job done!

From here onward I guess I could have some real fun with this. It would be simple enough I guess to modify the code so that it changes what's displayed over time; perhaps show a "last login" value that relates to recently activity or something; any number of things; and then run it in a cron job and update the repository.

For now though... I'll stick with keeping things nice and simple.


  1. It was actually kind of annoying when they introduced it because the repo it uses is named after your user name. I already had a davep repo: it was a private repo where I was slowly working on a (now abandoned, I'll start it again some day I'm sure) ground-up rewrite of my davep.org website. 

OIDIA

2 min read

Another little thing that's up on PyPi now, which is the final bit of fallout from the Textual dogfooding sessions, is a little project I'm calling OIDIA.

The application is a streak tracker. I'm quite the fan of streak trackers. I've used a few over the years, both to help keep me motivated and honest, and also to help me track that I've avoided unhelpful things too. Now, most of the apps I've used, and use now, tend to have reminders and counts and stats and are all about "DO NOT BREAK THE STREAK OR ELSE" and that's mostly fine, but...

To keep things simple and to purely concentrate on how to build Textual apps, I made this a "non-judgement" streak tracker. It's designed to be really simple: you add a streak, you bump up/down the number of times you did (or didn't do) the thing related to that streak, for each day, and that's it.

No totals. No stats. No reminders and bugging. No judgement.

Here it is in action:

When I started it, I wasn't quite sure how I wanted to store the data. Throwing it in a SQLite database held some appeal, but that also felt like a lot of faff for something so simple. Also, I wanted to make the data as easy to get at, to use elsewhere, and to hack on, as possible. So in the end I went with a simple JSON file.

On macOS and GNU/Linux streaks.json lives in ~/.local/share/oidia, on Windows it'll be in... I'm not sure off the top of my head actually; it'll be in whatever directory the handy xdg library has chosen. and because it's JSON that means that something like this:

OIDIA in action

ends up looking like this:

[
    {
        "title": "Hack some Python",
        "days": {
            "2022-12-02": 1,
            "2022-12-03": 1,
            "2022-12-04": 1,
            "2022-12-05": 1,
            "2022-12-06": 1,
            "2022-12-07": 1,
            "2022-12-08": 1,
            "2022-12-01": 1,
            "2022-11-30": 1,
            "2022-11-29": 1,
            "2022-11-28": 1
        }
    },
    {
        "title": "Brush my teeth",
        "days": {
            "2022-12-02": 2,
            "2022-12-03": 2,
            "2022-12-04": 2,
            "2022-12-05": 2,
            "2022-12-06": 2,
            "2022-12-07": 2,
            "2022-12-08": 1,
            "2022-12-01": 2,
            "2022-11-30": 2,
            "2022-11-29": 2,
            "2022-11-28": 2
        }
    },
    {
        "title": "Walk",
        "days": {
            "2022-12-02": 1,
            "2022-12-03": 1,
            "2022-12-04": 1,
            "2022-12-05": 1,
            "2022-12-06": 1,
            "2022-12-07": 1,
            "2022-12-08": 1,
            "2022-12-01": 1,
            "2022-11-30": 1,
            "2022-11-29": 1,
            "2022-11-28": 1
        }
    },
    {
        "title": "Run 5k",
        "days": {
            "2022-12-03": 2,
            "2022-12-05": 1,
            "2022-11-30": 1,
            "2022-11-28": 2
        }
    },
    {
        "title": "Run 10k",
        "days": {
            "2022-12-03": 1,
            "2022-11-28": 1
        }
    }
]

Of course, it remains to be seen how well that actually scales; possibly not so well over a long period of time, but this was written more as another way to explore Textual than anything else. Even then, it would be pretty trivial to update to something better for holding the data.

If this seems like your thing (and I will be supporting it and onward developing it) you can find it over on PyPi, which means it can be installed with pip or the ever-handy pipx:

$ pipx install oidia

New Things On PyPi

5 min read

An update

So, it's fast approaching 2 months now since I started the new thing and it's been a busy time. I've had to adjust to a quite a few new things, not least of which has been a longer and more involved commute. I'm actually mostly enjoying it too. While having to contend with busses isn't the best thing to be doing with my day, I do have a very fond spot for Edinburgh and it's nice to be in there most days of the week.

Part of the fallout from the new job has been that, in the last couple of weeks, I've thrown some more stuff up on PyPi. This comes about as part of a bit of a dog-fooding campaign we're on at the moment (you can read some background to this over on the company blog). While they have been, and will continue to be, mentioned on the Textualize blog, I thought I'd give a brief mention of them here on my own blog too given they are, essentially, personal projects.

gridinfo

This is one I'd like to tweak some more and improve on if possible. It is, in essence, a Python-coded terminal tool that does more or less the same as slstats.el. It started out as a rather silly quick hack, designed to do something different with rich-pixels.

Here's the finished version (as of the time of writing) being put through its paces:

Download from here, or install and play with it with a quick pipx install gridinfo.

unbored

The next experiment with Textual was to write a terminal-based client for the Bored-API. My initial plan for this was to just have a button or two that the user could mash on and they'd get an activity suggestion dropped into the middle of the terminal; but really that seemed a bit boring. Then I realised that it'd be a bit more silly and a bit more fun if I did it as a sort of TODO app. Bored? Run it up and use one of the activities you'd generated before. Don't like any of them? Ignore them and generate some more! Feeling bad that you've got such a backlog of reasons to not be bored? Delete a bunch!

And so on.

Here's a short video of it in action:

Download from here, or install and play with it with a quick pipx install unbored.

textual-qrcode

This one... this one I'm going to blame on the brain fog that followed flu and Covid jabs that happened the day before (and which are still kicking my arse 4 days later). Monday morning, at my desk, and I'm wondering what to next write to experiment with Textual, and I realised it would be interesting to write something that would show off that it's easy to make a third party widget library.

And... yeah, I don't know why, but I remembered qrencode.el and so textual-qrcode was born!

The most useless Textal widget yet

I think the most amusing part about this is that I did it in full knowledge that it would be useless; the idea being it would be a daft way of showing off how you could build a widget library as an add-on for Textual. But... more than one person actually ended up saying "yeah hold up there this could actually be handy!"

If you're one of those people... you'll find it here.

FivePyFive

While I was on a roll putting stuff up on PyPi, I also decided to tweak up my Textual-based 5x5 and throw that up too. This was my first app built with Textual, initially written before I'd even spoken to Will about the position here. At one point I even did a version in Lisp.

It's since gone on to become one of the example apps in Textual itself but I felt it deserved being made available to the world via an easy(ish) install. So, if you fancy trying to crack the puzzle in your terminal, just do a quick:

$ pipx install fivepyfive

and click away.

You can find it over here.

PISpy

Finally... for this week anyway, is a tool I've called PISpy. It's designed as a simple terminal client for looking up package information on PyPi. As of right now it's pretty straightforward, but I'd like to add more to it over time. Here's an example of it in action:

One small wrinkle with publishing it to PyPi was the fact that, once I'd chosen the name, I checked that it hadn't been used on PyPi (it hadn't) but when it came to publishing the package it got rejected because the name was too similar to another package! I don't know which, it wouldn't say, but that was a problem. So in the end the published name ended up having to be slightly different from the actual tool's name.

See over here for the package, and you can install it with a:

$ pipx install pispy-client

and then just run pispy in the terminal.

Conclusion

It's been a fun couple of weeks coming up with stuff to help exercise Textual, and there's more to come. Personally I've found the process really helpful in that it's help me learn more about the framework and also figure out my own approach to working with it. Each thing I've built so far has been a small step in evolution on from what I did in the previous thing. I doubt I've arrived at a plateau of understanding just yet.

Python and macOS

6 min read

Introduction

On Twitter, a few weeks back, @itsBexli asked me how I go about setting up Python for development on macOS. It's a great question and one that seems to crop up in various places, and since I first got into using macOS and then subsequently got back into coding lots in Python it's absolutely an issue I ran into.

With my previous employer, while I was the only developer, I wasn't the only person writing code and more than one other person had this issue so I eventually wrote up my approach to solving this problem. That document is on their internal GitLab, but I thought it worth me writing my personal macOS/Python "rules" up again, just in case they're useful to anyone else.

I am, of course, not the first person to tackle this, to document this, to write down a good approach to this. Before and after I settled on my approach I'd seen other people write about this. So... this post isn't here to try and replace those, it's simply to write down my own approach, so if anyone asks again I can point them here. I feel it's important to stress: this isn't the only way or thoughts on this issue, there are lots of others. Do go read them too and then settle on an approach that works for you.

One other point to note, which may or may not make a difference (and may or may not affect how this changes with time): for the past few years I've been a heavy user of pipenv to manage my virtual environments. This is very likely to change from now on, but keep in mind that what follows was arrived at from the perspective of a pipenv user.

So... with that admin aside...

The Problem

When I first got back into writing Python it was on macOS and, really early on, I ran into all the usual issues: virtual environments breaking because they were based on the system Python and it got updated, virtual environments based on the Homebrew-installed Python and it got updated, etc... Simply put, an occasional, annoying, non-show-stopping breaking of my development environment which would distract me when I'd sat down to just hack on some code, not do system admin!

My Solution

For me, what's worked for me without a problem over the past few years, in short, is this:

  1. NEVER use the system version of Python. Just don't.
  2. NEVER use the Homebrew's own version of Python (I'm not even sure this is an issue any more; but it used to be).
  3. NEVER use a version of Python installed with Homebrew (or, more to the point, never install Python with Homebrew).
  4. Manage everything with pyenv; which I do install from Homebrew.

The Detail

As mentioned earlier, what I'm writing here assumes that virtual environments are being managed with pipenv (something I still do for personal projects, for now, but this may change soon). Your choices and mileage may vary, etc... This is what works well for me.

The "one time" items

These are the items that need initially installing into a new macOS machine:

Homebrew

Unless it comes from the Mac App Store, I try and install everything via Homebrew. It's really handy for keeping track of what I've got installed, for recreating a development environment in general, and for keeping things up to date.

pyenv

With Homebrew installed the next step for me is to install pyenv. Doing so is as easy as:

$ brew install pyenv

Once installed, if it's not done it for you, you may need to make some changes to your login profile. I'm a user of fish so I have these lines in my setup. Simply put: it asks pyenv to set up my environment so my calls to Python go via its setup.

Plenty of help on how to set up pyenv can be found in its README.

Once I've done this I tend to go on and install the Python versions I'm likely to need. For me this tends to be the most recent "active" stable versions (as of the time of writing, 3.7 through 3.10; although I need to now start throwing 3.11 into the mix too).

I use this command:

$ pyenv install --list

to see the available versions. If I want to see what's available for a specific version I'll pipe through grep:

$ pyenv install --list | fgrep "  3.9"

This is handy if I want to check what the very latest release of a specific version of Python is.

The "Global" Python

When I'm done with the above I then tend to use pyenv to set my "global" Python. This is the version I want to get when I run python and I'm not inside a virtual environment. Doing that is as simple as:

$ pyenv global 3.10.7

Of course, you'd swap the version for whatever works for you.

When Stuff Breaks

It seems more rare these days, but on occasion I've had it such that some update somewhere still causes my environment to break. Now though I find that all it takes is a quick:

$ pyenv rehash

and everything is good again.

Setting Up A Repo

With all of the stuff above being mostly a one-off (or at least something I do once when I set up a new machine -- which isn't often), the real "work" here is when I set up a fresh repository when I start a new project. Really it's no work at all. Again, as I've said before, I've tended to use pipenv for my own work, and still do for personal stuff (but do want to change that), mileage may vary here depending on tool.

When I start a new project I think about which Python version I want to be working with, I ensure I have the latest version of it installed with pyenv, and then ask pipenv to create a new virtual environment with that:

$ pipenv --python 3.10.7

When you do this, you should see pipenv pulling the version of Python from the pyenv directories:

$ pipenv --python 3.10.7
Creating a virtualenv for this project...
Pipfile: /Users/davep/temp/cool-project/Pipfile
Using /Users/davep/.pyenv/versions/3.10.7/bin/python3 (3.10.7) to create virtualenv...
⠙ Creating virtual environment...created virtual environment CPython3.10.7.final.0-64 in 795ms
  creator CPython3Posix(dest=/Users/davep/temp/cool-project/.venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/davep/Library/Application Support/virtualenv)
    added seed packages: pip==22.2.2, setuptools==65.3.0, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
✔ Successfully created virtual environment!
Virtualenv location: /Users/davep/temp/cool-project/.venv
Creating a Pipfile for this project...

The key thing here is seeing that pipenv is pulling Python from ~/.pyenv/versions/. If it isn't there's a good chance you have a Python earlier in your PATH than the pyenv one -- something you generally don't want. It will work, but it's more likely to break at some point in the future. That's the key thing I look for; if I see that I know things are going to be okay.

Conclusion

Since I adopted these personal rules and approaches (and really, calling them "rules" is kind of grand -- there's almost nothing to this) I've found I've had near-zero issues with the stability of my Python virtual environments (and what issues I have run into tend to be trivial and of my own doing).

As I said at the start: there are, of course, other approaches to this, but this is mine and works well for me. Do feel free to comment with your own, I'm always happy to learn about new ways!

Build in public, even in private

5 min read

As mentioned yesterday, I'm about to start working at Textualize, and building Open-source software is important to the company. Will -- the CEO -- is all about building in public. If you follow him on Twitter you'll notice that his Python coding adventure tweets actually outnumber is cooking tweets!

As someone who has long been a supporter of and fan of Free Software and Open-source software, and has made some small contributions along the way, I've also always made a point of building my own tools in public. In most cases they're things that are likely only helpful to me, but some are more generally useful. The point being though: it's all there in case it's helpfull to someone else.

Which means that, as much as possible, when I'm writing code, I write it as if it's going to be visible in public and someone else is going to be reading it. I try and make the code tidy. I try and comment it well. I try (but don't always manage for personal projects) to fully document it. The important thing here being that someone coming to the code fresh should be able to follow what's going on.

Against that background, and having just gone through the process of handing off almost 5 years of work to someone else as a left an employer, I got to thinking: we should always "build in public", even if it's in private.

When I started with my previous employer, and even to the day I left, I was the only software developer there. I worked with a team who wrote code, but being software developers wasn't what they did. Bioinformaticians and machine learning scientists have other things to be doing. But, as I wrote my code, I wrote every line assuming they, or some other developer down the line, would be reading it. Pretty much every line was written for an audience I couldn't see and didn't fully know. This, as mentioned above, meant trying to keep the code clean, ensuring it was commented in helpful ways, ensuring the documentation was helpful, and so on.

But it wasn't just about the code. Any non-trivial system will have more to it than code. We had an internal instance of GitLab and I tracked all of my work on there. So, as I planned and worked on new features, or went on bug hunts, I'd document the process in the issue tracker. As much as possible I'd be really verbose about the process. Often I wouldn't just open an issue, go work on it, and then mark it closed; as I worked through the issue I'd add comment after comment under it, documenting my thinking, problems, solutions, cite sources when looking something up, that sort of thing.

The whole process was an act in having a conversation with current or future team members if they ever needed to look; with future me (really, that helped more than once -- we all have those "that the hell was I thinking?" moments); with any developer(s) who took over from me in the future.

I did all this as if I was broadcasting it in public on Twitter or on GitHub, etc. It was in private, of course, but I approached it as if it was in public.

There were always three main reasons for this, I felt:

  1. Accountability. At any moment someone who I worked with could review what I was doing and why I was doing it; it was an invitation to anyone curious enough to talk with me about what I was building and how I was building it.
  2. Continuity of support for unplanned reasons. Life happens, sometimes you may, unplanned, never be available at work again. I never wanted to leave my employer in a position where picking up from such an event was a nightmare.
  3. Continuity of support for planned reasons. It was possible, and it became inevitable, that I'd move on to something else. If that was to happen I wanted to be sure that whoever picked up after me would be able to do so without too much effort.

In the end item 3 seemed to really pay off. When it came time for me to hand over my work to someone else, as I left, the process was really smooth and trouble-free. I was able to point the developer at all the documentation and source code, at all the issues, and invite them to read through it all and then come back to me with questions. In terms of time actually spent talking about the main system I was handing over I'd say that 4 years of work was handed over with just a few hours of actual talking about it.

It remains to be seen if it really paid off, of course. If they get really stuck they do have an open invitation to ping me a question or two; I care enough about what I designed and built that I want it to carry on being useful for some time to come. But... I like to think that all of that building in public, in private, will ensure that this is an invitation that never needs to be called on. I like to think that, if something isn't clear, they'll be able to check the code, the documentation and the issue history and get to where they need to go.

On to something new (redux)

4 min read

Just over five years ago I got a message from my then employer to say I was going to be made redundant after 21 years working for them. After the 3 month notice period the final day came. Meanwhile, I found something new that looked terrifying but interesting. In the end it was less terrifying and way more interesting than I imagined it would be. It was fun too.

But... (there's always a but isn't there?)

In the four and change years I've been there the company got bought out, and then the result of that got bought up. As I've mentioned before I'm generally not a "big company" kind of person; in all my years I've found that I'm happier working in a smaller place. After a couple of buyouts my employer had gone from being 10s of people in size to 100s of people in size (and technically 10s of 1,000s of people in size depending on how you look at it).

This change in ownership and size meant the culture became... well, let's just say not as friendly as you tend to enjoy when it's a smaller group of folk. On top of that I was starting to notice that my efforts were making less of an impact as things got bigger, and I started to feel like my contributions weren't really relevant any more. There were some problematic things happening too: undermining of efforts, removal of responsibilities without consultation or communication, that sort of thing. Plus worse. There's little point in going into the detail, but it's fair to say that work wasn't as fun as it used to be.

That felt like a good time to start to look around. If work makes you feel unhappy and you can look around... look around.

Thing is, I wasn't sure what to look for. I was in the comfortable position of, unlike last time, not needing to find something, so I could take my time at least. Over the course of the last year I've spoken to many different companies and organisations, some big (yes, I know, I said I don't like big places -- sometimes what's on offer deserves a fair hearing), some small, but none of them quite said "this feels like me". In some cases the whole thing didn't have the right vibe, in others the industry either didn't interest me, or felt uncomfortable given my personal values. In one particular case a place looked interesting until I checked the CTO's socials and OMG NO NO NO AVOID AVOID (that was a fun one).

Then I saw Will McGugan saying he was hiring to expand Textualize. This caught my interest right away for two good reasons.

I can't remember how long I've been following Will on Twitter; I likely stumbled on him as I got back into Python in 2018 and I also remember noting that he was a Python hacker just up the road from me. We'd vaguely chatted on Twitter, briefly, in that "Twitter acquaintance" way we all often do (I remember one brief exchange about fungus on The Meadows), and he'd seemed like a good sort. A small company run by a "good sort" kinda person felt like a damn good reason.

The second reason was Textual itself. I'd been watching Will develop it, in open, with great interest. I had (and still have) a plan to write a brand new CHUI-based (okay fine TUI-based as the kids like to say these days!) Norton Guide reader, all in Python, and Textual looked like the perfect framework to do the UI in. The chance to be involved with it sounded awesome.

Now, I said two reasons, but there's also a third I guess: Will's pitch for applying to Textualize felt so damn accessible! I'm on the older end of the age range of this industry; for much of my working life as a developer I've worked in isolation from other developers; while I first touched Python in the 90s, I've only been using it in anger since 2018 and still feel like I've got a lot to learn. Despite all these things, and more, saying "aye Dave this is beyond you" I felt comfortable dropping Will a line.

Which resulted in a chat.

Which resulted in some code tinkering and chatting.

Which resulted in...

Something new.

So, yeah, as of 2022-10-10 I'm on yet another new adventure. Time for me to really work on my Python coding as I work with Will and the rest of the team as part of Textualize.

Or, as I put it on Twitter a few days ago: I'm going to be a Python impostor syndrome speedrunner!

Reading 2bit files (for fun) - the sequence

5 min read

Introduction

This post will cover the most important content of a 2bit file: the actual sequence data itself. In the first post I wrote about the format of the file's header, and in the second post I wrote about the content of the file's index.

At this point that's enough information to know what's in the file and where to find it. In other words we know the list of sequences that live in the file, and we know where each one is positioned within the file. So, assuming we have our index in memory (ideally some sort of key/value store of sequences names and their offsets in the file), given the name of a sequence we can know where to go in the file to load up the data.

So the next obvious question is, what will we find when we get there? Actual sequence data is stored like this:

ContentTypeSizeComments
DNA sizeInteger4 bytesCount of bases in the sequence
N block countInteger4 bytesCount of N blocks in the sequence
N block startsInteger Array4*count bytesPositions are zero-based
N block sizesInteger Array4*count bytes
Mask block countInteger4 bytesCount of mask blocks in the sequence
Mask block startsInteger Array4*count bytesPositions are zero-based
Mask block sizesInteger Array4*count bytes
ReservedInteger4 bytesShould always be 0
DNA dataByte ArraySee below

Breaking the above down:

N blocks

As mentioned in passing in the first post: technically it's necessary to encode 5 different characters for the bases in the sequences. As well as the usual T, C, A and G, there also needs to be an N, which means the base is unknown. Now, of course, you can't pack 5 states into two bits, so the 2bit file format solves this by having an array of block positions and sizes where any data in the actual DNA itself should be ignored and an N used in its place.

Mask blocks

This is where my ignorance of bioinformatics shows, and where it's made very obvious that I'm a software developer who likes to muck about with data and data structures, but who doesn't always understand why they're used. I'm actually not sure what purpose mask blocks serve in a 2bit file, but they do affect the output. If a base falls within a mask block the value that is output should be a lower-case letter, rather then upper-case.

The DNA data

So this is the fun bit, where the real data is stored. This should be viewed as a sequence of bytes, each of which contains 4 bases (except for the last byte, of course, which might contain 1, 2 or 3 depending on the size of the sequence).

Each byte should be viewed as an array of 2 bit values, with the values mapping like this:

BinaryDecimalBase
000T
011C
102A
113G

So, given a byte whose value is 27, you're looking at the sequence TCAG. This is because 27 in binary is 00011011, which breaks down as:

00011011
TCAG

How you pull that data out of the byte will depend on the language and what it makes available for bit-twiddling; those that don't have some form of bit field will probably provide the ability to bit shift and do a bitwise and (it's also likely that doing bitwise operations is better than using bit fields anyway). In the version I wrote in Emacs Lisp, it's simply a case of shifting the two bits I am interested in over to the right of the byte and then performing a bitwise and to get just its value. So, given an array called 2bit-bases whose content is this:

(defconst 2bit-bases ["T" "C" "A" "G"]
  "Vector of the bases.

Note that the positions of each base in the vector map to the 2bit decoding
for them.")

I use this bit of code to pull out the individual bases:

(aref 2bit-bases (logand (ash byte (- shift)) #b11))

Given code to unpack an individual byte, extracting all of the bases in a sequence then becomes the act of having two loops, the outer loop being over each byte in the file, the inner loop being over the positions within each individual byte.

In pseudo-code, assuming that start and end hold the base locations we're interested in and dna_pos is the location in the file where the DNA starts, the main loop for unpacking the data looks something like this:

# The bases.
bases = [ "T", "C", "A", "G" ]

# Calculate the first and last byte to pull data from.
start_byte = dna_pos + floor( start / 4 )
end_byte   = dna_pos + floor( ( end - 1 ) / 4 )

# Work out the starting position.
position = ( start_byte - dna_pos ) * 4

# Load up the bytes that contain the DNA.
buffer = read_n_bytes_from( start_byte, ( end_byte - start_byte ) + 1 )

# Get all the N blocks that intersect this sub-sequence.
n_blocks = relevant_n_blocks( start, end )

# Get all the mask blocks that interest this sub-sequence.
mask_blocks = relevant_mask_blocks( start, end )

# Start with an empty sequence.
sequence = ""

# Loop over every byte in the buffer.
for byte in buffer

  # Stepping down each pair of bits in the byte.
  for shift from 6 downto 0 by 2

    # If we're interested in this location.
    if ( position >= start ) and ( position < end )

      # If this position is in an N block, just collect an N.
      if within( position, n_blocks )
        sequence = sequence + "N"
      else

        # Not a N, so we should decode the base.
        base = bases[ ( byte >> shift ) & 0b11 ]

        # If we're in a mask block, go lower case.
        if within( position, mask_blocks )
          sequence = sequence + lower( base )
        else
          sequence = sequence + base
        end

      end

    end

    # Move along.
    position = position + 1

  end

end

Note that some of the detail is left out in the above, especially the business of loading up the relevant blocks; how that would be done will depend on language and the approach to writing the code. The Emacs Lisp code I've written has what I think is a fairly straightforward approach to it. There's a similar approach in the Common Lisp code I've written.

And that's pretty much it. There are a few other details that differ depending on how this is approached, the language used, and other considerations; one body of 2bit reader code that I've written attempts to optimise how it does things as much as possible because it's capable of reading the data locally or via ranged HTTP GETs from a web server; the Common Lisp version I wrote still needs some work because I was having fun getting back into Common Lisp; the Emacs Lisp version needs to try and keep data as small as possible because it's working with buffers, not direct file access.

Having got to know the format of 2bit files a fair bit, I'm adding this to my list of "fun to do a version of" problems when getting to know a new language, or even dabbling in a language I know.

Reading 2bit files (for fun) - the index

3 min read

As mentioned in the first post, once you've read in the header data for a 2bit file, the next step is to read the index. This is an index into all the different sequences held in the file. Reading the index itself is fairly straightforward.

The index comes right after the header -- so it starts on the 17th byte of the file. Each entry in the index contains three items of information:

ContentTypeSizeComments
Name lengthInteger1 byteHow many bytes long the name is
NameStringVariesLength given by previous field
OffsetInteger4 bytesLocation in the file of the sequence

So, in some sort of pseudo-code, you'd read in the index as follows:

index = dict()
for seq = 1 to seq_count // seq_count comes from the header
  name_len = (int) read_bytes( 1 )
  name     = (string) read_bytes( name_len )
  offset   = (int) read_bytes( 4 )
  index[ name ] = offset
end

Note, as mentioned in the first post, the index will need to be byte-swapped if the file is in an endian form other than the machine you're running your code on. How you'd go about this will, of course, vary from language to language, but the main idea is always going to be the same.

There's a fairly striking downside to this approach though: reading data can often be an expensive (in terms of time) operation -- this is especially true if the data is coming in from a remote machine, perhaps even one that's being accessed over the Internet. As such, it's best if you can make as few "trips" to the file as possible.

With this in mind, the best thing to do is to read the whole index into memory in one go and then process it from there -- the idea being that that's just one trip to the data source. The problem here, however, is that there's nothing in the header or the index that tells you how large the index actually is. What you can do though is work on the worst case scenario (assuming memory will allow). The worst case is fairly easy to handle: it's going to be 1 byte for the name length, plus 255 bytes for the name (the longest possible name), plus 4 bytes for the offset; multiply all that by the number of sequences in the index and you have the worst-case buffer size.

When reading this data in you might also want to ensure you're not going to run off the end of the file (perhaps the names are all quite small and so are the sequences).

Recently I've been working on a package for Emacs that can read data from 2bit files, so here's the core code for reading in the index:

(defun 2bit--read-index (source)
  "Read the sequence index from SOURCE.

As a side effect `2bit-data-pos' of SOURCE will move."
  (cl-loop
   ;; The index will be a hash of sequence names, with the values being the
   ;; offsets within the file.
   with index = (make-hash-table :test #'equal)
   ;; We could read each name/value pair one by one, but because we're doing
   ;; this within Emacs, which means making a temp buffer for every read,
   ;; that could get pretty expensive pretty fast. So instead we'll read the
   ;; index data in in one go. However, there is no easy-to-calculate size
   ;; for the index. The best we can do is calculate the worst case size. So
   ;; let's do that. The worst case size is the maximum size of the name of
   ;; a sequence (255), plus the size of the byte that tells us the name
   ;; (1), plus the size of the word that is the offset in the file (4).
   with buffer = (2bit--read source (* (2bit-data-sequence-count source) (+ 255 1 4)))
   ;; For every sequence in the file...
   for n from 1 to (2bit-data-sequence-count source)
   ;; Calculate the position within the buffer for this loop around. Note
   ;; that the skip is the last position plus 1 for the size byte plus the
   ;; size plus the length of the offset word.
   for pos = 0 then (+ pos 1 size 4)
   ;; Get the length of the name of the sequence.
   for size = (aref buffer pos)
   ;; Pull out the name itself.
   for name = (substring buffer (1+ pos) (+ pos 1 size))
   ;; Pull out the offset.
   for offset = (2bit--word-from-bytes source (substring buffer (+ pos 1 size) (+ pos 1 size 4)))
   ;; Collect the offset into the hash.
   do (setf (gethash name index) offset)
   ;; Once we're all done.... return the index.
   finally return index))

This code does what I mention above: it grabs enough data into a buffer in one go that I'll have the whole index in memory to pull apart, and then I work with the in-memory copy. The index is added to a hashing dictionary. Note that, in this case, I don't actually do the test for running off the end of the file because at the heart of the file reading code is insert-file-contents-literally and it doesn't error if you request too much.

With that done you'll have a list of all the sequences in the file. The next part, which will come in the next post, is the properly tricky part: the decoding of the sequence data itself.

Reading 2bit files (for fun)

7 min read

Introduction

I've written a bit before about the value of having simple but interesting "problems", that you know the solution to, as a way of exercising yourself in a new environment. Recently I've added another to the list I already have, and I used it as an excuse to get back into writing some Common Lisp; and then went on to use it as a reason to write yet another package for Emacs.

Having gone through the process of writing code to handle 2bit files twice in about a month, and in two very similar but slightly different languages, I thought it might be interesting for me to then use it to exercise my ability to write blog posts (something I always struggle with -- I find writing very hard on a number of levels) and especially posts that explain a particular problem and how I implemented code relating to that problem.

Also, because the initial version of this post rambled on a bit too much and I lost the ability to finish it, I'm starting again and will be breaking it up a number of posts spanning a number of days -- perhaps even weeks -- so that I don't feel too overwhelmed by the process of writing it. I will, of course, make sure every post links to the other posts.

Now, before I go on, I'll make the important point that everything here is written from the perspective of a software developer who happens to work as part of a bioinformatics team; I don't do bioinformatics, I don't claim to understand it, I just happen to sit with (well, used to sit with them -- hopefully we'll all make it back to the office one day!) and work with them, and develop software that supports their work. Anything you see in any of the posts that's wrong about that subject: that's just my ignorance being shown through the lens of a software developer (all corrections are welcome).

So, with all those disclaimers aside, I'm going to go on a slow wander through what a 2bit file is, how you'd go about reading it, and related issues. This isn't designed as a tutorial or anything like that, this is simply me taking what I've learnt and writing it down. Perhaps someone else will benefit one day, but the purpose is to simply enjoy cementing it in my own mind and to enjoy the process of putting it all in writing.

What is a "2bit file"?

So what's this new "problem" I've added to my list? It's code to read sequence data from 2bit format files. For anyone who doesn't know (bioinformatics people look away now; a software developer is going to explain one of your file formats), this is a file format that is intended to hold sequences in an efficient way. As I'm sure you know, DNA is made up of 4 bases, represented by the letters T, C, A and G. So, in the simplest case, we could just represent a genome using those four letters. Simple enough, right? Nice big text file with just those 4 letters in?

The thing is, something like the human genome is around 3 billion bases in length. That'd make for a petty big file to have to store and move around. So why not compress it down a bit? That's where the 2bit format comes in.

Given this problem I'm sure most developers would quickly notice that, given 4 different characters, you only need 2 bits to actually hold them all (two bits gets us 00, 01, 10 and 11, so four different states). This means with a little bit of coding you can store 4 bases in a single byte. Just like that you've pretty much squished the whole thing down to 1/4 of the original size. And that's more or less what the 2bit format does. If you take a look at the actual data for the human genome you'll see that hg38.2bit is roughly 1/4 of 3 billion bytes, ish, give or take.

There is a wrinkle, however. There are parts of a genome where you might not know what base is there. Generally an N is used for that. So, actually, we want to be able to store 5 different characters. But 5 isn't going to go into 2 bits... Damn! Well, it's okay, 2bit has a solution to that too, and I'll cover that later on.

How is a 2bit file formatted?

As you can see from the format information available online, a 2bit file is a binary file format that is split into 3 key parts:

  • A fixed size header with some key information
  • An index into the rest of the file
  • A series of records that contain actual sequence information

In this first post I'll cover the details of the header. Subsequent posts will cover the index and the actual sequence data records.

The header

The header of a 2bit file is fixed in size and contains some key information. It can be broken down as follows:

ContentTypeSizeComments
SignatureInteger4 bytesSee below for endian issues.
VersionInteger4 bytesAlways 0.
Sequence countInteger4 bytes
ReservedInteger4 bytesAlways ignored.

The signature value is used to test if what you're looking at is a 2bit file, but also tells you some vital information about how to read the file -- see below for more on that. The version value is always 0 -- as such another useful test would be to error out if you get a valid signature but get a version other than 0. The sequence count is, as you'd guess, the number of sequences that are held within the file -- this is important when loading in the index of the file (more on that in the next post).

The signature, big and little endianness, and byte swapping

The header mentioned above comprises of 4 32-bit word values. The very first word is important to how you read the rest of the file. This is the signature for the 2bit file and it should always be 0x1A412743. And this is where it gets interesting and fun right away. The 2bit file format allows for the fact that the file can be built in either a little-endian or a big-endian machine, and the 32-bit word values can be binary-written to the file in the local architecture's byte order. The effect of this is that, from reading the very first value in the file, you need to decide if every other numeric value you read needs to be byte-swapped in some way. The early logic being (in no particular language) something like:

if signature == 0x1A412743 then
  must_swap = False
else if byte_swap( signature ) == 0x1A412743 then
  must_swap = True
else
  raise "This isn't a valid 2bit file"
end

Simply put, to read the rest of the file you will need a function that byte-swaps a 32bit numeric value, and a flag of some sort to mark that you need to do this every time you read such a value. Of course, depending on your language of choice, you could do it in a number of different ways. In a language like JavaScript or Scheme, where you can easily throw around functions, I'd probably just assign the appropriate 32bit-word-reading function to a global function name and call that regardless throughout the rest of the code. In other languages I'd probably just check the flag each time and call the swapping function if needed. In something like Python I'd likely just use the signature to decide on which format to pass to struct.unpack. For example, some variation on:

# Assuming that 'header' is the whole header of the file read as a binary buffer.

word_fmt = ""

for test_fmt in ( "<I", ">I" ):
    if struct.unpack( test_fmt, header[ 0:4 ] )[ 0 ] == 0x1A412743
        word_fmt = test_fmt
        break

if not word_fmt:
    raise Exception( "This isn't a 2bit file" )

Now, the Python approach sort of hides the important detail here. With it we'd simply use struct.unpack's ability to handle different byte orders and not worry about the detail. Which isn't fun, right? So how might code to byte-swap a 32bit value look?

Assuming you've got the value as an actual numeric integer, it can be as simple as using a bit of bitwise anding and shifting. Here's the basic code I wrote in Common Lisp, for example:

(defun swap-long (value)
  "Swap the endianness of a long integer VALUE."
  (logior
   (logand (ash value -24) #xff)
   (logand (ash value -8) #xff00)
   (logand (ash value 8) #xff0000)
   (logand (ash value 24) #xff000000)))

JavaScript might be something like:

function swapLong( value ) {
    return ( ( value >> 24 ) & 0xff       ) |
           ( ( value >>  8 ) & 0xff00     ) |
           ( ( value <<  8 ) & 0xff0000   ) |
           ( ( value << 24 ) & 0xff000000 )
}

and other variations on that theme in different languages.

Up next

In the next post I'll write about how the sequence index is stored and how to load it, including some considerations about how to make the loading as efficient as possible.