• I'm back!

    I’m back! Almost. More or less. In more ways than one. First off, as often happens with blogs (we’ve all been there right?), I’ve been away from blogging for a while. I’ve still been online, still been waffling away on twitter, and have also stumbled into fosstodon as well. Doubtless plenty of other things.

    A big distraction for me, and one that is ongoing, is mucking about on YouTube. Since the last time I wrote anything on the blog I got myself a VR setup, and then a PCVR setup, and then finally fibre came to the village and I could stream, and… well, you can see how that would go.

    So, in short, that’s where I’ve been and that’s what’s been keeping me busy. Now that I’m paying some attention to blogging again (hopefully!) I imagine some of that will end up on here – I’d quite like to write about VR and gaming amongst other things.

    Now, I said I’d been away in more ways than one. Another way is explained by this post from back in 2019, where I said I was going to head over to Hashnode and carry on blogging there, obviously with an emphasis on development and just development.

    That kept me busy for a while and worked out well, mostly. But… well, see above in part; I sort of ran out of steam when it came to purely-development topics. But I still wanted to write, a bit, and wanted to write about more than just development.

    Also, something else was bothering me about being over on Hashnode. In the past year, in terms of what they promote themselves, especially blogs and posts they promote on their Twitter feed, they seem to have started to lean really hard into crypto and web3 and NFTs and all that stuff. This left me feeling like that was all a bit icky and it was time to put some distance between that platform and myself.

    So over the past couple of weeks, low-level and as a background task, I’ve been back-porting posts form over there back into this blog. Starting with this post all new blog content, be it about software development or anything else, will be on here. If I’m really sensible and don’t get distracted by new shiny… this should be how it remains now.

    Expect some changes over the next few weeks. While I’m aiming to stick with the core tech (Github pages, Markdown and Jekyll, Emacs to edit, etc), I’d like to tinker with the look and layout of the blog. The content will remain the same though.

    So, yeah, anyway, if you’re reading this… hey, it’s good to be back. :-)

  • Reading 2bit files (for fun) - the sequence

    Introduction

    This post will cover the most important content of a 2bit file: the actual sequence data itself. In the first post I wrote about the format of the file’s header, and in the second post I wrote about the content of the file’s index.

    At this point that’s enough information to know what’s in the file and where to find it. In other words we know the list of sequences that live in the file, and we know where each one is positioned within the file. So, assuming we have our index in memory (ideally some sort of key/value store of sequences names and their offsets in the file), given the name of a sequence we can know where to go in the file to load up the data.

    So the next obvious question is, what will we find when we get there? Actual sequence data is stored like this:

    Content Type Size Comments
    DNA size Integer 4 bytes Count of bases in the sequence
    N block count Integer 4 bytes Count of N blocks in the sequence
    N block starts Integer Array 4*count bytes Positions are zero-based
    N block sizes Integer Array 4*count bytes  
    Mask block count Integer 4 bytes Count of mask blocks in the sequence
    Mask block starts Integer Array 4*count bytes Positions are zero-based
    Mask block sizes Integer Array 4*count bytes  
    Reserved Integer 4 bytes Should always be 0
    DNA data Byte Array   See below

    Breaking the above down:

    N blocks

    As mentioned in passing in the first post: technically it’s necessary to encode 5 different characters for the bases in the sequences. As well as the usual T, C, A and G, there also needs to be an N, which means the base is unknown. Now, of course, you can’t pack 5 states into two bits, so the 2bit file format solves this by having an array of block positions and sizes where any data in the actual DNA itself should be ignored and an N used in its place.

    Mask blocks

    This is where my ignorance of bioinformatics shows, and where it’s made very obvious that I’m a software developer who likes to muck about with data and data structures, but who doesn’t always understand why they’re used. I’m actually not sure what purpose mask blocks serve in a 2bit file, but they do affect the output. If a base falls within a mask block the value that is output should be a lower-case letter, rather then upper-case.

    The DNA data

    So this is the fun bit, where the real data is stored. This should be viewed as a sequence of bytes, each of which contains 4 bases (except for the last byte, of course, which might contain 1, 2 or 3 depending on the size of the sequence).

    Each byte should be viewed as an array of 2 bit values, with the values mapping like this:

    Binary Decimal Base
    00 0 T
    01 1 C
    10 2 A
    11 3 G

    So, given a byte whose value is 27, you’re looking at the sequence TCAG. This is because 27 in binary is 00011011, which breaks down as:

    00 01 10 11
    T C A G

    How you pull that data out of the byte will depend on the language and what it makes available for bit-twiddling; those that don’t have some form of bit field will probably provide the ability to bit shift and do a bitwise and (it’s also likely that doing bitwise operations is better than using bit fields anyway). In the version I wrote in Emacs Lisp, it’s simply a case of shifting the two bits I am interested in over to the right of the byte and then performing a bitwise and to get just its value. So, given an array called 2bit-bases whose content is this:

    (defconst 2bit-bases ["T" "C" "A" "G"]
      "Vector of the bases.
    
    Note that the positions of each base in the vector map to the 2bit decoding
    for them.")
    

    I use this bit of code to pull out the individual bases:

    (aref 2bit-bases (logand (ash byte (- shift)) #b11))
    

    Given code to unpack an individual byte, extracting all of the bases in a sequence then becomes the act of having two loops, the outer loop being over each byte in the file, the inner loop being over the positions within each individual byte.

    In pseudo-code, assuming that start and end hold the base locations we’re interested in and dna_pos is the location in the file where the DNA starts, the main loop for unpacking the data looks something like this:

    # The bases.
    bases = [ "T", "C", "A", "G" ]
    
    # Calculate the first and last byte to pull data from.
    start_byte = dna_pos + floor( start / 4 )
    end_byte   = dna_pos + floor( ( end - 1 ) / 4 )
    
    # Work out the starting position.
    position = ( start_byte - dna_pos ) * 4
    
    # Load up the bytes that contain the DNA.
    buffer = read_n_bytes_from( start_byte, ( end_byte - start_byte ) + 1 )
    
    # Get all the N blocks that intersect this sub-sequence.
    n_blocks = relevant_n_blocks( start, end )
    
    # Get all the mask blocks that interest this sub-sequence.
    mask_blocks = relevant_mask_blocks( start, end )
    
    # Start with an empty sequence.
    sequence = ""
    
    # Loop over every byte in the buffer.
    for byte in buffer
    
      # Stepping down each pair of bits in the byte.
      for shift from 6 downto 0 by 2
    
        # If we're interested in this location.
        if ( position >= start ) and ( position < end )
    
          # If this position is in an N block, just collect an N.
          if within( position, n_blocks )
            sequence = sequence + "N"
          else
    
            # Not a N, so we should decode the base.
            base = bases[ ( byte >> shift ) & 0b11 ]
    
            # If we're in a mask block, go lower case.
            if within( position, mask_blocks )
              sequence = sequence + lower( base )
            else
              sequence = sequence + base
            end
    
          end
    
        end
    
        # Move along.
        position = position + 1
    
      end
    
    end
    

    Note that some of the detail is left out in the above, especially the business of loading up the relevant blocks; how that would be done will depend on language and the approach to writing the code. The Emacs Lisp code I’ve written has what I think is a fairly straightforward approach to it. There’s a similar approach in the Common Lisp code I’ve written.

    And that’s pretty much it. There are a few other details that differ depending on how this is approached, the language used, and other considerations; one body of 2bit reader code that I’ve written attempts to optimise how it does things as much as possible because it’s capable of reading the data locally or via ranged HTTP GETs from a web server; the Common Lisp version I wrote still needs some work because I was having fun getting back into Common Lisp; the Emacs Lisp version needs to try and keep data as small as possible because it’s working with buffers, not direct file access.

    Having got to know the format of 2bit files a fair bit, I’m adding this to my list of “fun to do a version of” problems when getting to know a new language, or even dabbling in a language I know.

  • Reading 2bit files (for fun) - the index

    As mentioned in the first post, once you’ve read in the header data for a 2bit file, the next step is to read the index. This is an index into all the different sequences held in the file. Reading the index itself is fairly straightforward.

    The index comes right after the header – so it starts on the 17th byte of the file. Each entry in the index contains three items of information:

    Content Type Size Comments
    Name length Integer 1 byte How many bytes long the name is
    Name String Varies Length given by previous field
    Offset Integer 4 bytes Location in the file of the sequence

    So, in some sort of pseudo-code, you’d read in the index as follows:

    index = dict()
    for seq = 1 to seq_count // seq_count comes from the header
      name_len = (int) read_bytes( 1 )
      name     = (string) read_bytes( name_len )
      offset   = (int) read_bytes( 4 )
      index[ name ] = offset
    end
    

    Note, as mentioned in the first post, the index will need to be byte-swapped if the file is in an endian form other than the machine you’re running your code on. How you’d go about this will, of course, vary from language to language, but the main idea is always going to be the same.

    There’s a fairly striking downside to this approach though: reading data can often be an expensive (in terms of time) operation – this is especially true if the data is coming in from a remote machine, perhaps even one that’s being accessed over the Internet. As such, it’s best if you can make as few “trips” to the file as possible.

    With this in mind, the best thing to do is to read the whole index into memory in one go and then process it from there – the idea being that that’s just one trip to the data source. The problem here, however, is that there’s nothing in the header or the index that tells you how large the index actually is. What you can do though is work on the worst case scenario (assuming memory will allow). The worst case is fairly easy to handle: it’s going to be 1 byte for the name length, plus 255 bytes for the name (the longest possible name), plus 4 bytes for the offset; multiply all that by the number of sequences in the index and you have the worst-case buffer size.

    When reading this data in you might also want to ensure you’re not going to run off the end of the file (perhaps the names are all quite small and so are the sequences).

    Recently I’ve been working on a package for Emacs that can read data from 2bit files, so here’s the core code for reading in the index:

    (defun 2bit--read-index (source)
      "Read the sequence index from SOURCE.
    
    As a side effect `2bit-data-pos' of SOURCE will move."
      (cl-loop
       ;; The index will be a hash of sequence names, with the values being the
       ;; offsets within the file.
       with index = (make-hash-table :test #'equal)
       ;; We could read each name/value pair one by one, but because we're doing
       ;; this within Emacs, which means making a temp buffer for every read,
       ;; that could get pretty expensive pretty fast. So instead we'll read the
       ;; index data in in one go. However, there is no easy-to-calculate size
       ;; for the index. The best we can do is calculate the worst case size. So
       ;; let's do that. The worst case size is the maximum size of the name of
       ;; a sequence (255), plus the size of the byte that tells us the name
       ;; (1), plus the size of the word that is the offset in the file (4).
       with buffer = (2bit--read source (* (2bit-data-sequence-count source) (+ 255 1 4)))
       ;; For every sequence in the file...
       for n from 1 to (2bit-data-sequence-count source)
       ;; Calculate the position within the buffer for this loop around. Note
       ;; that the skip is the last position plus 1 for the size byte plus the
       ;; size plus the length of the offset word.
       for pos = 0 then (+ pos 1 size 4)
       ;; Get the length of the name of the sequence.
       for size = (aref buffer pos)
       ;; Pull out the name itself.
       for name = (substring buffer (1+ pos) (+ pos 1 size))
       ;; Pull out the offset.
       for offset = (2bit--word-from-bytes source (substring buffer (+ pos 1 size) (+ pos 1 size 4)))
       ;; Collect the offset into the hash.
       do (setf (gethash name index) offset)
       ;; Once we're all done.... return the index.
       finally return index))
    

    This code does what I mention above: it grabs enough data into a buffer in one go that I’ll have the whole index in memory to pull apart, and then I work with the in-memory copy. The index is added to a hashing dictionary. Note that, in this case, I don’t actually do the test for running off the end of the file because at the heart of the file reading code is insert-file-contents-literally and it doesn’t error if you request too much.

    With that done you’ll have a list of all the sequences in the file. The next part, which will come in the next post, is the properly tricky part: the decoding of the sequence data itself.

  • Reading 2bit files (for fun)

    Introduction

    I’ve written a bit before about the value of having simple but interesting “problems”, that you know the solution to, as a way of exercising yourself in a new environment. Recently I’ve added another to the list I already have, and I used it as an excuse to get back into writing some Common Lisp; and then went on to use it as a reason to write yet another package for Emacs.

    Having gone through the process of writing code to handle 2bit files twice in about a month, and in two very similar but slightly different languages, I thought it might be interesting for me to then use it to exercise my ability to write blog posts (something I always struggle with – I find writing very hard on a number of levels) and especially posts that explain a particular problem and how I implemented code relating to that problem.

    Also, because the initial version of this post rambled on a bit too much and I lost the ability to finish it, I’m starting again and will be breaking it up a number of posts spanning a number of days – perhaps even weeks – so that I don’t feel too overwhelmed by the process of writing it. I will, of course, make sure every post links to the other posts.

    Now, before I go on, I’ll make the important point that everything here is written from the perspective of a software developer who happens to work as part of a bioinformatics team; I don’t do bioinformatics, I don’t claim to understand it, I just happen to sit with (well, used to sit with them – hopefully we’ll all make it back to the office one day!) and work with them, and develop software that supports their work. Anything you see in any of the posts that’s wrong about that subject: that’s just my ignorance being shown through the lens of a software developer (all corrections are welcome).

    So, with all those disclaimers aside, I’m going to go on a slow wander through what a 2bit file is, how you’d go about reading it, and related issues. This isn’t designed as a tutorial or anything like that, this is simply me taking what I’ve learnt and writing it down. Perhaps someone else will benefit one day, but the purpose is to simply enjoy cementing it in my own mind and to enjoy the process of putting it all in writing.

    What is a “2bit file”?

    So what’s this new “problem” I’ve added to my list? It’s code to read sequence data from 2bit format files. For anyone who doesn’t know (bioinformatics people look away now; a software developer is going to explain one of your file formats), this is a file format that is intended to hold sequences in an efficient way. As I’m sure you know, DNA is made up of 4 bases, represented by the letters T, C, A and G. So, in the simplest case, we could just represent a genome using those four letters. Simple enough, right? Nice big text file with just those 4 letters in?

    The thing is, something like the human genome is around 3 billion bases in length. That’d make for a petty big file to have to store and move around. So why not compress it down a bit? That’s where the 2bit format comes in.

    Given this problem I’m sure most developers would quickly notice that, given 4 different characters, you only need 2 bits to actually hold them all (two bits gets us 00, 01, 10 and 11, so four different states). This means with a little bit of coding you can store 4 bases in a single byte. Just like that you’ve pretty much squished the whole thing down to 1/4 of the original size. And that’s more or less what the 2bit format does. If you take a look at the actual data for the human genome you’ll see that hg38.2bit is roughly 1/4 of 3 billion bytes, ish, give or take.

    There is a wrinkle, however. There are parts of a genome where you might not know what base is there. Generally an N is used for that. So, actually, we want to be able to store 5 different characters. But 5 isn’t going to go into 2 bits… Damn! Well, it’s okay, 2bit has a solution to that too, and I’ll cover that later on.

    How is a 2bit file formatted?

    As you can see from the format information available online, a 2bit file is a binary file format that is split into 3 key parts:

    • A fixed size header with some key information
    • An index into the rest of the file
    • A series of records that contain actual sequence information

    In this first post I’ll cover the details of the header. Subsequent posts will cover the index and the actual sequence data records.

    The header

    The header of a 2bit file is fixed in size and contains some key information. It can be broken down as follows:

    Content Type Size Comments
    Signature Integer 4 bytes See below for endian issues.
    Version Integer 4 bytes Always 0.
    Sequence count Integer 4 bytes  
    Reserved Integer 4 bytes Always ignored.

    The signature value is used to test if what you’re looking at is a 2bit file, but also tells you some vital information about how to read the file – see below for more on that. The version value is always 0 – as such another useful test would be to error out if you get a valid signature but get a version other than 0. The sequence count is, as you’d guess, the number of sequences that are held within the file – this is important when loading in the index of the file (more on that in the next post).

    The signature, big and little endianness, and byte swapping

    The header mentioned above comprises of 4 32-bit word values. The very first word is important to how you read the rest of the file. This is the signature for the 2bit file and it should always be 0x1A412743. And this is where it gets interesting and fun right away. The 2bit file format allows for the fact that the file can be built in either a little-endian or a big-endian machine, and the 32-bit word values can be binary-written to the file in the local architecture’s byte order. The effect of this is that, from reading the very first value in the file, you need to decide if every other numeric value you read needs to be byte-swapped in some way. The early logic being (in no particular language) something like:

    if signature == 0x1A412743 then
      must_swap = False
    else if byte_swap( signature ) == 0x1A412743 then
      must_swap = True
    else
      raise "This isn't a valid 2bit file"
    end
    

    Simply put, to read the rest of the file you will need a function that byte-swaps a 32bit numeric value, and a flag of some sort to mark that you need to do this every time you read such a value. Of course, depending on your language of choice, you could do it in a number of different ways. In a language like JavaScript or Scheme, where you can easily throw around functions, I’d probably just assign the appropriate 32bit-word-reading function to a global function name and call that regardless throughout the rest of the code. In other languages I’d probably just check the flag each time and call the swapping function if needed. In something like Python I’d likely just use the signature to decide on which format to pass to struct.unpack. For example, some variation on:

    # Assuming that 'header' is the whole header of the file read as a binary buffer.
    
    word_fmt = ""
    
    for test_fmt in ( "<I", ">I" ):
        if struct.unpack( test_fmt, header[ 0:4 ] )[ 0 ] == 0x1A412743
            word_fmt = test_fmt
            break
    
    if not word_fmt:
        raise Exception( "This isn't a 2bit file" )
    

    Now, the Python approach sort of hides the important detail here. With it we’d simply use struct.unpack’s ability to handle different byte orders and not worry about the detail. Which isn’t fun, right? So how might code to byte-swap a 32bit value look?

    Assuming you’ve got the value as an actual numeric integer, it can be as simple as using a bit of bitwise anding and shifting. Here’s the basic code I wrote in Common Lisp, for example:

    (defun swap-long (value)
      "Swap the endianness of a long integer VALUE."
      (logior
       (logand (ash value -24) #xff)
       (logand (ash value -8) #xff00)
       (logand (ash value 8) #xff0000)
       (logand (ash value 24) #xff000000)))
    

    JavaScript might be something like:

    function swapLong( value ) {
        return ( ( value >> 24 ) & 0xff       ) |
               ( ( value >>  8 ) & 0xff00     ) |
               ( ( value <<  8 ) & 0xff0000   ) |
               ( ( value << 24 ) & 0xff000000 )
    }
    

    and other variations on that theme in different languages.

    Up next

    In the next post I’ll write about how the sequence index is stored and how to load it, including some considerations about how to make the loading as efficient as possible.

  • The PEP 8 hill I will die on

    I first learnt Python back in the mid-to-late 90s, used it in place of Perl once I was comfortable with it, and then we sort of drifted apart when I first met Ruby. It’s only in the last couple of years that I’ve got back into it, and in a huge way, thanks to my (not-quite-so-) new job. Despite the quirks and oddness (as I perceive them), I actually quite like Python and it’s one of those languages that just flows off my fingers. I’m sure you know the same thing, perhaps not with Python, but there will be languages that just flow for you, and those that take a bit more effort and concentration. Python… feels okay to me.

    I also appreciate that there’s been a long-standing style guide. I quite like PEP 8 as a read, and think there’s a lot of good ideas in there; much of the content sits with how I’d approach things if I was tasked to come up with such a document. With this in mind, I’m a fairly heavy user of pylint and it in turn leans on PEP 8 (amongst other things) and I’m happy to accept most of its judgements. Not all of its judgements, but even when I disagree with it I try and keep track of how far I’m drifting.

    But there is absolutely one hill I will happily die on when it comes to PEP 8: the concept of “extraneous whitespace” in lists and expressions. Just…. no! Oh gods no!

    To borrow a line of code from the journey problem I dabbled with a while back, PEP 8 would have me write something like this:

    def perform(commands: List[str],state: State) -> State:
    

    Now, I’m sure plenty of people won’t see a problem with this at all; but all I can see is an almost-claustrophobic parameter list. What’s with the parameters being jammed up against the opening and closing parens? Why have the dinky little comma lost between two different things? Why have it look like a long stream of letters and punctuation? Why….

    No.

    Just no.

    I can’t.

    Rightly or wrongly, I just need to the code to breathe a bit. When I type this:

    def perform( commands: List[ str ], state: State ) -> State:
    

    suddenly if feels like there’s fresh air in the code, like it flows gently out of my head, off my fingers, through the keyboard and into the buffer.

    In my head, and to my eyes, the code is…. relaxed.

    Do I have a rational reason for this? Nope. Then again I don’t see one for doing it the other way either; I can’t think of one and I don’t see one in the source document. So, that’s a warning I always turn off with pylint and it’s a style I carry through all my Python code; and I think that’s the important point here: anyone reading and working with my code should see the same style all the way through. It might differ from PEP 8 on this point, but at least it’s the same all the way.

    And, really, that’s okay: PEP 8 is there to be ignored. ;-)

    PS: This is a small part of another blog post I was meaning to write, and might still do, about my (still ongoing) experience of getting lsp-mode up and running in Emacs and having it play nice with Python projects. I have that working, but it was a bit of a learning curve and epic battle over a couple of days, and one that had me first encounter pycodestyle. I may still tell the tale…

subscribe via RSS