Base65536: The (not very serious) Successor of Base64

Base64 is an encoding scheme, which allows the safe transmission of files between ascii- and EBCDIC-based computer systems, which were popular in the fifties and sixties. Although no computer today uses the EBCDIC character set, base64 is still used for encoding binaries on the internet. Email attachments are encoded this way, and sometimes you can find small images, which are directly embedded in the source code of the web page.

So what is base65536?

base65536 encodes data in a similar fashion to base64, but its alphabet, instead of being 64 characters long, is 65536 characters long. This means, one can map 16 bits of data into a single unicode codepoint.
It is of course terribly inefficient, if you were to count the outputted bytes (especially when UTF-8 encoded), but if you count just the number of unicode characters, as for example Twitter does for it’s length limit, you can fit double the data per character.

This was GitHub user ferno’s intention when creating this ‘standard’: to fit more data into a single tweet.

Existing implementations are available for javascript, python, ruby, go and php. But what i wanted was a drop-in replacement / alternative to linux’ / gnu’s base64 tool.
My implementation does exactly that! it is written in c, the command line options are compatible (for now, only short options are implemented) and i finally found an excuse to implement a complete binary search tree in a project.

As always, i put my code up on GitHub, if anyone wants to check it out.

Details of the implementation

If you have a look at GNU’s base64, you will see, it is quite dependent on the GNU C Libary. I wanted a minimalistic program, which should compile without dependencies on GNU specific stuff. Therefore, I have opted to use getopt() from the POSIX C Library instead of getopt_long(), which is GNU-specific. Because of that, only the single letter arguments are supported.

On the encoding part of the program, not much is happening: each byte is read in (either from stdin or the specified file) and encoded. The second byte is used to find the block, and the first one will be added to it. This order allows for the second byte to be EOF (-1), without having to pad the last byte and use a trunctuation marker, as base64 does with its =. For the EOF to be correctly passed through all the functions involved, each byte is internally represented as an int.

Decoding is much more interesting*. I deviate a little from base64 in the part where I handle non-Base65536-codepoints. as Base65536 does not assign meaning to any single-byte codepoint, those can be placed anywhere inside the encoded string, either for formatting (whitespace, which can also be added automatically by the encoding function) or inserting cleartext messages. Even when having the abort-on-decoding-error enabled, base65536 will not stop when encountering an ASCII (7bit) character. This is meant as a feature.

On to the binary search tree.
All the existing implementations have one thing in common: They were all written in languages that natively have a hash map data structure implemented. This allows for a simple (albeit not really performant) way of rethrieving the beginning of a block. As I have begun studying CompSci this year, I wanted to challenge myself and implement the most suitable data structure for this task: a binary search tree. When sorting the list by block-start, it is in the same order as when sorting by block-id. This means, the BST can be traversed by looking at the ids or the start at will. As i wanted the lookup to be as efficient as possible (yes, this is premature optimisation – deal with it), a linked list was not the way to go with this implementation (declaring that would be pretty ugly, and loading it dynamically would require malloc()ing and balancing – too much hassle). I therefore went with the implementation in form of an array. This brings a new chalenge: the tree has 257 elements, and is therefore not completely full (total). Most simple algorithms (including mine) one can use to transform a sorted array into BST form are capable of balancing the tree (which is necessary for a good performance), but won’t left-align the the last row. This means, a large portion of extra memory would be necessary to put the last
element in the middle of the last row. The simplest solution was, to remove the two smallest blocks (ids EOF and 1), which leaves us with a total tree containing 255 elements. appending the previously removed elements (EOF, 1) to the end of the tree will make it left-aligned.
A small program, generate_struct.c, will automatically transform the array into the required form. It has the data hardcoded in, taken from ferno’s javascript implementation and massaged into a structure using vi macros, which worked like a charm. One could have just printed the array in the specific order required, but where is the fun in that? Instead, a linked-list-representation of the tree is generated, and then read out in breadth first order, formatted as a C array, so it can be pasted right into base65536.c. For that, I had to implement a queue (again – this is an overengineed CompSci student project). The last line in the main function reads:

//NOTE: memory occupied by the tree-as-linked-list is not freed. deal with it. 

This program is intended to be run once during programming, so not a lot is gained spending time there.

Final Words

Wow - you are still reading this? Anyway, I had a lot of fun, actually using some of the algorithms and data structures I otherwise only used in boring proseminar exercises. base65536 is by no means even remotly something that advances the human race, but it was a fun exercise and I am generally into quirky, not very serious programming. I'll gladly point you to the asciiplayer; a player that renders porn as ASCII characters and displays it in your terminal. Another project (which is not on the internet, as I found more robust implementations afterwards) is displaying images in the terminal, either using the Braille Character Set or half-block characters from Unicode and coloring them in. Those provide minimal real-life use cases and stem from my because-I-can attitude to recreational programming.

* relatively. if you are into this kind of stuff 😉