Visual Selection

Posted on 2023-10-26 18:50 +0100 in Coding • Tagged with Python, evolution, biology, terminal, textual • 4 min read

Over the last few weeks I've had a couple of sessions of working on a library to wrap Plotext -- a popular terminal-based plotting library for Python -- so that it can easily be used in Textual apps; textual-plotext is the result.

I feel it's come together pretty well

But... I've been itching to find a reason to use it in a project of my own.

Meanwhile...

Back in the mid-2000s, when phpBB systems were still the fashion, I used to hang out on a site that was chiefly aimed at the atheist and secular humanist crowd. We'd get a good number of drive-by YEC types who'd want to argue (sorry... debate) and often talk nonsense about biology and the like.

Now, I'm no biologist, I'm no scientist, I'm just a hacker who likes to write code for fun and profit; so any time there was a chance to write some code to help illustrate an idea I'd jump at the chance. I forget the detail now -- this was back in 2008; 15 years ago as of the time of writing -- but one time I remember a conversation was taking place where someone was just flat out claiming that "random mutation" can only cause "loss of information" and could never lead to a "desired result", or some such thing.

If you've ever had, read or watched those debates, you'll know the sort of thing I mean.

So that got me thinking back then, could I write something that could give a simple illustration of how this doesn't quite make sense?

So I had a little hacking session and came up with some Ruby code1 that did what I felt was the job. You'd give it a phrase you wanted it to generate (a stand-in for the current "fitness landscape", in effect), it would then generate a totally random string of that length, and then would set about mutating it, finding mutations that were "fitter" than others (a stand in for selection), breed the best two so far (randomly copy one chunk from another to create a child), then repeat over and over.

When I first wrote it I wasn't sure what to expect; would it ever finish given a reasonably large target string?

It did.

It was fun to code.

It got posted to the BB and of course wasn't in any way persuasive to them (honestly I never expected it would be). I seem to recall it being hand-waved away with calls of there obviously being an intelligent designer involved2.

Anyway, the "meanwhile..." in this: a few times this year I've thought it could be fun to rework this in Python (it's really not that complex after all; just a string-chopping loop really) and use Textual to put a fun UI on it.

So, that's what I did, complete with textual-plotext plot:

Visual Selection in action

While, 15 years on, this isn't going to convince anyone of the underlying point, I think it does serve a good educational purpose. It shows that you can create a fun UI for the terminal, with Textual, with not a lot of code. It also shows off how you can easily create dynamic plots. Plus -- and I think this might be the really important one -- it shows you can write "traditional" tight-loop code in a Textual application and still have a responsive UI; all thanks to the worker API.

The heart of the code for this application is this:

environment = Environment("This is the target string we want to create!")
while not environment.best_fit_found:
    environment.shit_happens()

Sure, there's some detail in the Environment class, but you get the idea: while we've not hit the target, let life find a way. A loop like that would totally bog down an application with a UI without some other work taking place. With Textual and workers the resulting method in the application, complete with code to send updates to the UI, really doesn't look much different:

@work(thread=True, exclusive=True)
def run_world(self, target: str) -> None:
    worker = get_current_worker()
    environment = Environment(target)
    iterations = 0
    self.post_message(self.WorldUpdate(environment, iterations, *environment.best))
    while not worker.is_cancelled and not environment.best_fit_found:
        environment.shit_happens()
        iterations += 1
        if (iterations % 1000) == 0 or environment.best_fit_found:
            self.post_message(
                self.WorldUpdate(environment, iterations, *environment.best)
            )
    if environment.best_fit_found:
        self.post_message(self.Finished(iterations))

I honestly think the worker API is one of the coolest things added to Textual and I so often see people have real "woah!" moments when they get to grips with it.

Anyway... I've covered science, religion, and how Ruby is better than Python, so I'm sure I've annoyed almost everyone. Job done I guess. ;-)

If you want to check out the app itself there's a GitHub repo and it can also be installed from PyPi using pipx.

Expect it to be my tinker project of choice for a wee while; there's a couple of other things I'd like to add to it.


  1. Possibly unpopular opinion with some folk who will read this, but I've long been a fan of Ruby as a language and actually generally prefer it to Python. 

  2. Me, the coder. While utterly missing the point of a simple illustration, while apparently not understanding the concept of an analogy, I guess at least they felt I was intelligent? 


All green on GitHub

Posted on 2023-10-01 09:14 +0100 in Coding • Tagged with GitHub • 2 min read

In about a week's time I'll have had a GitHub account for 15 years! I can't even remember what motivated me to create one now, but back in October 2008 I grabbed the davep account...

Making my account

...and then made my first repo.

First repo made

My use of the site after that was very sporadic. It looks like I'd add or update something once or twice a year, but I wasn't a heavy user.

First few years

Then around the middle of 2015 I seem to have started using it a lot more.

The next few years

This very much shows that during those years I was working on personal stuff that I was making available in case anyone found it useful, but also leaning heavily on GitHub as a (a, not the) place to keep backups of code I cared about (or even no longer cared about). Quite a lot of that green will likely be me having a few periods of revamping my Emacs configuration.

The really fun part though starts about a year ago:

Working on FOSS full time

It's pretty obvious when I started working at Textualize, and working on a FOSS project full time. This is, without a doubt, the most green my contribution graph has looked. It looks like there's a couple of days this year where I haven't visited my desk at all, and I think this is a good thing (I try really hard to have a life outside of coding when it comes to weekends), but I'm also delighted to see just how busy this year looks.

I really hope this carries on for a while to come.

Apparently, as of the time of writing, I've made 12,588 contributions that are on GitHub. What's really fun is the fact that my first contribution pre-dates my GitHub account by 9 years!

My very first contribution

This one's pretty easy to explain: this is back from when I was involved with Harbour. Back then we were using SourceForge to manage the project (as was the fashion at the time), and at some point in the past whoever is maintaining the project has pulled the full history into GitHub.

My contribution history on GitHub is actually older than my adult son. I suspect it's older than at least one person I work with. :-/ 1


  1. I'm informed that this isn't the case2; apparently I'm either bad at estimating people's ages, or bad at remembering them; or both. 

  2. Although it's not too far off. :-/ 


Website: Miscellaneous Stuff moved

Posted on 2023-08-14 22:04 +0100 in Coding • Tagged with html, web • 2 min read

This evening I've spent more time working on the planned complete remake of my personal website, in this case "porting" over many of the files that made up the old "miscellaneous stuff" section of the site. If I'm honest, most if not all of the things in there are no longer relevant (like: who really needs a shell script to make gnuplot plots from files pulled off a 1990s-era Garmin handheld GPS unit?), but I thought I'd keep them kicking around "just in case".

One wee section I wasn't going to get rid of though was my scans of three pages from a UK magazine called Personal Computer News. These contain Grid Bike; a game I wrote for the VIC-20, all in BASIC, and got published. For my efforts I got a huge cheque for £40! If that doesn't seem like much to you, trust me, to 1983 me this was huge.

I bought a 16k RAM pack with the money.

Funnily enough, while trying to improve some of the links in the text, I decided to see if there was now an archive online somewhere and, sure enough, there is: in the obvious place. This means that my web site isn't the only copy of my program on the net. If you go to the December 21st 1983 edition and turn to around page 84, there I am!

The cover of PCN

At this point I'm almost tempted to try and get an emulator up and running and get the code going again. How much fun would it be to add a video to my YouTube channel, of me playing one of the very first games I wrote?


Website: Norton Guide information moved

Posted on 2023-08-13 10:02 +0100 in Coding • Tagged with html, web • 2 min read

This morning I've spent a wee bit of time tinkering with the configuration of the planned complete remake of my personal website. As part of this I made an effort to "port" over a section of the site. The choice for the first section to move was easy enough: Norton Guides.

Of all the parts of my old site, this is probably the most useful in terms of "contains information that isn't generally available out there on the web elsewhere and some folk might find it useful". I mean, at some point in the past, someone edited the Wikipedia page for Norton Guides and linked to mine as a source.

So getting that one back up and running as soon as possible made sense.

I've not added every bit of Norton Guide code to the main page, instead just pulling over and tidying up what was there before. On the other hand, just hacking on Markdown makes it all so much easier so I may expand on it a bit.

The really important part was moving over the file format details. This, I feel, is the information that people will be looking for, if anyone is ever looking.

So, proper start made; there's content beyond the landing page. There's still a lot to weed out and move over, and I think there's a lot of tweaking and the like with the configuration to do too. But the ball is rolling now. Ever time I get a spare hour and the desire to sit at my desk I can pick a section, look it over, decide if it deserves to come over, and act on that.

Heck, at this rate I might even end up with an actively-maintained website again!


The reboot begins

Posted on 2023-08-11 13:19 +0100 in Coding • Tagged with html, web • 2 min read

And I'm off! This morning I spent a good amount of time going through the sources for the old version of davep.org and removing everything that won't be needed any more, and also building up a rough TODO list of things I may want to recreate as content.

With that done, as mentioned earlier today, I started work on building the site around Pelican. Pretty quickly though I started to feel that that was going to be a bad choice. While Pelican felt like a perfect fit for this blog -- mainly because it seems to be very blog-oriented -- it was feeling a bit clunky for a general website that would have a handful of static pages at best; likely something I wouldn't be updating too often.

So I put it aside and went on with my morning, doing normal Friday domestic stuff like the weekly supermarket shop. It was while I was out doing that that I realised the obvious answer: use what we use for the Textual docs and what's been used for label.dev: Material for MkDocs!

I've just spent about 40 minutes after lunch kicking that off and it was really straightforward. Of course the result is horrifically cookie-cutter in terms of its look -- such is the way that mkdocs-material sites end up looking out of the box -- but I don't much care about that; what's important is that I've got a placeholder page in place, and I've quickly built a framework for writing and publishing the content.

So that's the plan: now that the welcome page is in place and there's something on my domain that looks like a working website again I can start to slowly drag in old content in a new format. Heck, if I'm careful I might even be able to retain some of the old URLs!

Longer-term plans might involve finally sorting out https support (yes, even today, my site is http-only), and perhaps adding some sort of RSS feed so there's a record of when changes are made.

After that... hopefully that'll be about it and perhaps the website will last another 22 years running on top of the same engine (actually that part should be easier because the "engine" is now local and it generates a static site).

The question then becomes who'll last longer, the site or me?


Admitting defeat on my website

Posted on 2023-08-11 08:44 +0100 in Coding • Tagged with php, html, web • 2 min read

I've had davep.org since very late 1999. Initially it started as a domain used just for email; while I did have a website, around that time it was still hosted on my then-ISPs hosting service, with a mirror on a friend's web server.

A year or so later I finally did a proper revamp of my website and finally settled on www.davep.org as the place to point people to. I think, when I made that move, that's when I decided to write my own website engine in php. It was fun. It worked. I didn't want to code backend stuff (I don't think the backend vs frontend distinction was even a thing we were talking about then) so hacking it together in an unholy mix of ruby to generate various static files that live in the filesystem and then php to turn them into actual HTML made sense.

And it worked.

The earliest version of davep.org I can find

I heavily maintained the site for many years; keeping the same engine, tweaking the styles, adding features and content. I figure some time around 2013 or 2014 I probably stopped being quite so active in messing with it, and then in the last 5 or 6 years I've pretty much neglected it.

The neglect shows.

Meanwhile... php has changed. Quiet a lot. It's one of those languages I used back in the day and pay no attention to. Then earlier this week I noticed that there must have been an update on the host and huge parts of my site broke, lots of content missing, pretty much useless and dead in the water.

I did briefly think about breaking out the latest and greatest php locally, setting things up to investigate what's going on, and seeing if I could breathe some more life into it; but really what's the point?

So after all these years I'm finally admitting defeat.

Right now, on the home page, I've just got a placeholder saying that bitrot finally ate my website and that I'm going to start again from scratch. That's my plan: given that I had a good experience moving this blog over to Pelican I think I'm going to build a new www.davep.org with Pelican. Where possible I'll try and drag some of the old content over, but I'm also going to use this opportunity to have a proper digital spring clean.

There's no planned timescale for this, but this morning I've spent an hour or so over coffee, branching the repo for the site and pruning out all the stuff I know I won't need and don't want.

I'll try and drop the odd update in here as things progress.


A new GitHub profile README

Posted on 2023-07-03 08:15 +0100 in Coding • Tagged with GitHub, Python, Textual • 2 min read

My new GitHub banner

Ever since GitHub introduced the profile README1 I've had a massively low-effort one in place. I made the repo, quickly wrote the file, and then sort of forgot about it. Well, I didn't so much forget as just keep looking at it and thinking "I should do something better with that one day".

Thing is, while there are lots of fancy approaches out there, and lots of neat generator tools and the like... they just weren't for me.

Then yesterday, over my second morning coffee, after getting my blog environment up and going again, I had an idea. It could be cool to use Textual's screenshot facility to make something terminal-themed! I mean, while it's not all I am these days, so much of what I'm doing right now is aimed at the terminal.

So... what to do? Then I thought it could be cool to knock up some sort of login screen type thing; with a banner. One visit to an online large terminal text generator site later, I had some banner text. All that was left was to write a simple Textual application to create the "screen".

The main layout is simple enough:

def compose(self) -> ComposeResult:
    yield Label(NAME, classes="banner")
    yield Label(PRATTLE)
    yield Label("github.com/davep login: [reverse] [/]")

where NAME contains the banner and PRATTLE contains the "login message". With some Textual CSS sprinkled over it to give the exact layout and colour I wanted, all that was left was to make the snapshot. This was easy enough too.

While the whole thing isn't fully documented just yet, Textual does have a great tool for automatically running an application and interacting with it; that meant I could easily write a function to load up my app and save the screenshot:

async def make_banner() -> None:
    async with GitHubBannerApp().run_test() as pilot:
        pilot.app.save_screenshot("davep.svg")

Of course, that needs running async, but that's simple enough:

if __name__ == "__main__":
    asyncio.run(make_banner())

Throw in a Makefile so I don't forget what I'm supposed to run:

.PHONY: all
all:
    pipenv run python make_banner.py

and that's it! Job done!

From here onward I guess I could have some real fun with this. It would be simple enough I guess to modify the code so that it changes what's displayed over time; perhaps show a "last login" value that relates to recently activity or something; any number of things; and then run it in a cron job and update the repository.

For now though... I'll stick with keeping things nice and simple.


  1. It was actually kind of annoying when they introduced it because the repo it uses is named after your user name. I already had a davep repo: it was a private repo where I was slowly working on a (now abandoned, I'll start it again some day I'm sure) ground-up rewrite of my davep.org website. 


Build in public, even in private

Posted on 2022-10-06 10:44 +0100 in Coding • Tagged with coding • 4 min read

As mentioned yesterday, I'm about to start working at Textualize, and building Open-source software is important to the company. Will -- the CEO -- is all about building in public. If you follow him on Twitter you'll notice that his Python coding adventure tweets actually outnumber is cooking tweets!

As someone who has long been a supporter of and fan of Free Software and Open-source software, and has made some small contributions along the way, I've also always made a point of building my own tools in public. In most cases they're things that are likely only helpful to me, but some are more generally useful. The point being though: it's all there in case it's helpfull to someone else.

Which means that, as much as possible, when I'm writing code, I write it as if it's going to be visible in public and someone else is going to be reading it. I try and make the code tidy. I try and comment it well. I try (but don't always manage for personal projects) to fully document it. The important thing here being that someone coming to the code fresh should be able to follow what's going on.

Against that background, and having just gone through the process of handing off almost 5 years of work to someone else as a left an employer, I got to thinking: we should always "build in public", even if it's in private.

When I started with my previous employer, and even to the day I left, I was the only software developer there. I worked with a team who wrote code, but being software developers wasn't what they did. Bioinformaticians and machine learning scientists have other things to be doing. But, as I wrote my code, I wrote every line assuming they, or some other developer down the line, would be reading it. Pretty much every line was written for an audience I couldn't see and didn't fully know. This, as mentioned above, meant trying to keep the code clean, ensuring it was commented in helpful ways, ensuring the documentation was helpful, and so on.

But it wasn't just about the code. Any non-trivial system will have more to it than code. We had an internal instance of GitLab and I tracked all of my work on there. So, as I planned and worked on new features, or went on bug hunts, I'd document the process in the issue tracker. As much as possible I'd be really verbose about the process. Often I wouldn't just open an issue, go work on it, and then mark it closed; as I worked through the issue I'd add comment after comment under it, documenting my thinking, problems, solutions, cite sources when looking something up, that sort of thing.

The whole process was an act in having a conversation with current or future team members if they ever needed to look; with future me (really, that helped more than once -- we all have those "that the hell was I thinking?" moments); with any developer(s) who took over from me in the future.

I did all this as if I was broadcasting it in public on Twitter or on GitHub, etc. It was in private, of course, but I approached it as if it was in public.

There were always three main reasons for this, I felt:

  1. Accountability. At any moment someone who I worked with could review what I was doing and why I was doing it; it was an invitation to anyone curious enough to talk with me about what I was building and how I was building it.
  2. Continuity of support for unplanned reasons. Life happens, sometimes you may, unplanned, never be available at work again. I never wanted to leave my employer in a position where picking up from such an event was a nightmare.
  3. Continuity of support for planned reasons. It was possible, and it became inevitable, that I'd move on to something else. If that was to happen I wanted to be sure that whoever picked up after me would be able to do so without too much effort.

In the end item 3 seemed to really pay off. When it came time for me to hand over my work to someone else, as I left, the process was really smooth and trouble-free. I was able to point the developer at all the documentation and source code, at all the issues, and invite them to read through it all and then come back to me with questions. In terms of time actually spent talking about the main system I was handing over I'd say that 4 years of work was handed over with just a few hours of actual talking about it.

It remains to be seen if it really paid off, of course. If they get really stuck they do have an open invitation to ping me a question or two; I care enough about what I designed and built that I want it to carry on being useful for some time to come. But... I like to think that all of that building in public, in private, will ensure that this is an invitation that never needs to be called on. I like to think that, if something isn't clear, they'll be able to check the code, the documentation and the issue history and get to where they need to go.


Reading 2bit files (for fun) - the sequence

Posted on 2020-09-26 15:57 +0100 in Coding • Tagged with Bioinformatics • 5 min read

Introduction

This post will cover the most important content of a 2bit file: the actual sequence data itself. In the first post I wrote about the format of the file's header, and in the second post I wrote about the content of the file's index.

At this point that's enough information to know what's in the file and where to find it. In other words we know the list of sequences that live in the file, and we know where each one is positioned within the file. So, assuming we have our index in memory (ideally some sort of key/value store of sequences names and their offsets in the file), given the name of a sequence we can know where to go in the file to load up the data.

So the next obvious question is, what will we find when we get there? Actual sequence data is stored like this:

Content Type Size Comments
DNA size Integer 4 bytes Count of bases in the sequence
N block count Integer 4 bytes Count of N blocks in the sequence
N block starts Integer Array 4*count bytes Positions are zero-based
N block sizes Integer Array 4*count bytes
Mask block count Integer 4 bytes Count of mask blocks in the sequence
Mask block starts Integer Array 4*count bytes Positions are zero-based
Mask block sizes Integer Array 4*count bytes
Reserved Integer 4 bytes Should always be 0
DNA data Byte Array See below

Breaking the above down:

N blocks

As mentioned in passing in the first post: technically it's necessary to encode 5 different characters for the bases in the sequences. As well as the usual T, C, A and G, there also needs to be an N, which means the base is unknown. Now, of course, you can't pack 5 states into two bits, so the 2bit file format solves this by having an array of block positions and sizes where any data in the actual DNA itself should be ignored and an N used in its place.

Mask blocks

This is where my ignorance of bioinformatics shows, and where it's made very obvious that I'm a software developer who likes to muck about with data and data structures, but who doesn't always understand why they're used. I'm actually not sure what purpose mask blocks serve in a 2bit file, but they do affect the output. If a base falls within a mask block the value that is output should be a lower-case letter, rather then upper-case.

The DNA data

So this is the fun bit, where the real data is stored. This should be viewed as a sequence of bytes, each of which contains 4 bases (except for the last byte, of course, which might contain 1, 2 or 3 depending on the size of the sequence).

Each byte should be viewed as an array of 2 bit values, with the values mapping like this:

Binary Decimal Base
00 0 T
01 1 C
10 2 A
11 3 G

So, given a byte whose value is 27, you're looking at the sequence TCAG. This is because 27 in binary is 00011011, which breaks down as:

00 01 10 11
T C A G

How you pull that data out of the byte will depend on the language and what it makes available for bit-twiddling; those that don't have some form of bit field will probably provide the ability to bit shift and do a bitwise and (it's also likely that doing bitwise operations is better than using bit fields anyway). In the version I wrote in Emacs Lisp, it's simply a case of shifting the two bits I am interested in over to the right of the byte and then performing a bitwise and to get just its value. So, given an array called 2bit-bases whose content is this:

(defconst 2bit-bases ["T" "C" "A" "G"]
  "Vector of the bases.

Note that the positions of each base in the vector map to the 2bit decoding
for them.")

I use this bit of code to pull out the individual bases:

(aref 2bit-bases (logand (ash byte (- shift)) #b11))

Given code to unpack an individual byte, extracting all of the bases in a sequence then becomes the act of having two loops, the outer loop being over each byte in the file, the inner loop being over the positions within each individual byte.

In pseudo-code, assuming that start and end hold the base locations we're interested in and dna_pos is the location in the file where the DNA starts, the main loop for unpacking the data looks something like this:

# The bases.
bases = [ "T", "C", "A", "G" ]

# Calculate the first and last byte to pull data from.
start_byte = dna_pos + floor( start / 4 )
end_byte   = dna_pos + floor( ( end - 1 ) / 4 )

# Work out the starting position.
position = ( start_byte - dna_pos ) * 4

# Load up the bytes that contain the DNA.
buffer = read_n_bytes_from( start_byte, ( end_byte - start_byte ) + 1 )

# Get all the N blocks that intersect this sub-sequence.
n_blocks = relevant_n_blocks( start, end )

# Get all the mask blocks that interest this sub-sequence.
mask_blocks = relevant_mask_blocks( start, end )

# Start with an empty sequence.
sequence = ""

# Loop over every byte in the buffer.
for byte in buffer

  # Stepping down each pair of bits in the byte.
  for shift from 6 downto 0 by 2

    # If we're interested in this location.
    if ( position >= start ) and ( position < end )

      # If this position is in an N block, just collect an N.
      if within( position, n_blocks )
        sequence = sequence + "N"
      else

        # Not a N, so we should decode the base.
        base = bases[ ( byte >> shift ) & 0b11 ]

        # If we're in a mask block, go lower case.
        if within( position, mask_blocks )
          sequence = sequence + lower( base )
        else
          sequence = sequence + base
        end

      end

    end

    # Move along.
    position = position + 1

  end

end

Note that some of the detail is left out in the above, especially the business of loading up the relevant blocks; how that would be done will depend on language and the approach to writing the code. The Emacs Lisp code I've written has what I think is a fairly straightforward approach to it. There's a similar approach in the Common Lisp code I've written.

And that's pretty much it. There are a few other details that differ depending on how this is approached, the language used, and other considerations; one body of 2bit reader code that I've written attempts to optimise how it does things as much as possible because it's capable of reading the data locally or via ranged HTTP GETs from a web server; the Common Lisp version I wrote still needs some work because I was having fun getting back into Common Lisp; the Emacs Lisp version needs to try and keep data as small as possible because it's working with buffers, not direct file access.

Having got to know the format of 2bit files a fair bit, I'm adding this to my list of "fun to do a version of" problems when getting to know a new language, or even dabbling in a language I know.


Reading 2bit files (for fun) - the index

Posted on 2020-09-05 10:59 +0100 in Coding • Tagged with Bioinformatics • 4 min read

As mentioned in the first post, once you've read in the header data for a 2bit file, the next step is to read the index. This is an index into all the different sequences held in the file. Reading the index itself is fairly straightforward.

The index comes right after the header -- so it starts on the 17th byte of the file. Each entry in the index contains three items of information:

Content Type Size Comments
Name length Integer 1 byte How many bytes long the name is
Name String Varies Length given by previous field
Offset Integer 4 bytes Location in the file of the sequence

So, in some sort of pseudo-code, you'd read in the index as follows:

index = dict()
for seq = 1 to seq_count // seq_count comes from the header
  name_len = (int) read_bytes( 1 )
  name     = (string) read_bytes( name_len )
  offset   = (int) read_bytes( 4 )
  index[ name ] = offset
end

Note, as mentioned in the first post, the index will need to be byte-swapped if the file is in an endian form other than the machine you're running your code on. How you'd go about this will, of course, vary from language to language, but the main idea is always going to be the same.

There's a fairly striking downside to this approach though: reading data can often be an expensive (in terms of time) operation -- this is especially true if the data is coming in from a remote machine, perhaps even one that's being accessed over the Internet. As such, it's best if you can make as few "trips" to the file as possible.

With this in mind, the best thing to do is to read the whole index into memory in one go and then process it from there -- the idea being that that's just one trip to the data source. The problem here, however, is that there's nothing in the header or the index that tells you how large the index actually is. What you can do though is work on the worst case scenario (assuming memory will allow). The worst case is fairly easy to handle: it's going to be 1 byte for the name length, plus 255 bytes for the name (the longest possible name), plus 4 bytes for the offset; multiply all that by the number of sequences in the index and you have the worst-case buffer size.

When reading this data in you might also want to ensure you're not going to run off the end of the file (perhaps the names are all quite small and so are the sequences).

Recently I've been working on a package for Emacs that can read data from 2bit files, so here's the core code for reading in the index:

(defun 2bit--read-index (source)
  "Read the sequence index from SOURCE.

As a side effect `2bit-data-pos' of SOURCE will move."
  (cl-loop
   ;; The index will be a hash of sequence names, with the values being the
   ;; offsets within the file.
   with index = (make-hash-table :test #'equal)
   ;; We could read each name/value pair one by one, but because we're doing
   ;; this within Emacs, which means making a temp buffer for every read,
   ;; that could get pretty expensive pretty fast. So instead we'll read the
   ;; index data in in one go. However, there is no easy-to-calculate size
   ;; for the index. The best we can do is calculate the worst case size. So
   ;; let's do that. The worst case size is the maximum size of the name of
   ;; a sequence (255), plus the size of the byte that tells us the name
   ;; (1), plus the size of the word that is the offset in the file (4).
   with buffer = (2bit--read source (* (2bit-data-sequence-count source) (+ 255 1 4)))
   ;; For every sequence in the file...
   for n from 1 to (2bit-data-sequence-count source)
   ;; Calculate the position within the buffer for this loop around. Note
   ;; that the skip is the last position plus 1 for the size byte plus the
   ;; size plus the length of the offset word.
   for pos = 0 then (+ pos 1 size 4)
   ;; Get the length of the name of the sequence.
   for size = (aref buffer pos)
   ;; Pull out the name itself.
   for name = (substring buffer (1+ pos) (+ pos 1 size))
   ;; Pull out the offset.
   for offset = (2bit--word-from-bytes source (substring buffer (+ pos 1 size) (+ pos 1 size 4)))
   ;; Collect the offset into the hash.
   do (setf (gethash name index) offset)
   ;; Once we're all done.... return the index.
   finally return index))

This code does what I mention above: it grabs enough data into a buffer in one go that I'll have the whole index in memory to pull apart, and then I work with the in-memory copy. The index is added to a hashing dictionary. Note that, in this case, I don't actually do the test for running off the end of the file because at the heart of the file reading code is insert-file-contents-literally and it doesn't error if you request too much.

With that done you'll have a list of all the sequences in the file. The next part, which will come in the next post, is the properly tricky part: the decoding of the sequence data itself.