On these pages you will find the projects and blog posts of Carsten Ringe.

Commons Base64OutputStream - Principle of least surprise?

23 October 2013

Yesterday I had to write base64 encoded content mixed with non-base64 encoded content. I have to deal with potentially large files (hundreds of MBs), so I wanted to work with streams to avoid memory issues. I am also bound to the API of an external product, which hands an InputStream into my code and expects an InputStream back. (As a side note, did you know about PipedInputStream and PipedOutputStream?)
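For readers who have not met the piped streams yet, here is a minimal sketch of the "InputStream in, InputStream out" shape using only the JDK. The class name, the `wrap` helper and the `<begin>`/`<end>` markers are made up for illustration and are not taken from the external product's API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PipedTransformDemo {

    /** Returns a stream that yields the source bytes between two hypothetical markers. */
    static InputStream wrap(InputStream source) throws IOException {
        PipedInputStream result = new PipedInputStream();
        PipedOutputStream pipe = new PipedOutputStream(result);
        // A piped pair needs separate threads for writing and reading,
        // otherwise the writer blocks as soon as the pipe buffer is full.
        Thread writer = new Thread(() -> {
            try (PipedOutputStream out = pipe) {
                out.write("<begin>".getBytes(StandardCharsets.US_ASCII));
                byte[] buffer = new byte[8192];
                int n;
                while ((n = source.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                out.write("<end>".getBytes(StandardCharsets.US_ASCII));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.setDaemon(true);
        writer.start();
        return result;
    }

    /** Drains a stream into an ASCII string; only meant for small demo inputs. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            collected.write(buffer, 0, n);
        }
        return new String(collected.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The caller only ever sees the returned InputStream, so memory use stays bounded by the pipe buffer regardless of the file size.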

I am already using the commons-codec Base64 encoder for small pieces of data, so I quickly discovered Base64OutputStream for writing base64 encoded data in a streaming manner. I wrote a quick JUnit test to verify that my usage of this class produces the correct result and was somewhat surprised that 4 characters of the base64 encoded data were missing…

See the full git repository of the code at https://gist.github.com/MoriTanosuke/7129937.

As the test in the gist shows, a Base64OutputStream that is not closed does not produce the complete base64 encoded data. Only when I call #close() after writing are the missing bytes encoded and written into the stream. Because I wrapped the Base64OutputStream around a PipedOutputStream and have to write additional non-base64 encoded data into it afterwards, I cannot close the stream: closing the wrapper would also close the underlying PipedOutputStream.
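The effect can be reproduced without commons-codec. The sketch below uses the JDK's own streaming encoder, `Base64.getEncoder().wrap(...)`, as a stand-in; it is not the code from the gist, and the class and method names are made up for illustration. The JDK encoder shows the same pattern: the trailing group of fewer than 3 bytes is only encoded on close.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64CloseDemo {

    /** Writes text through a streaming encoder and returns what reached the sink. */
    static String encode(String text, boolean close) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream base64 = Base64.getEncoder().wrap(sink);
        base64.write(text.getBytes(StandardCharsets.US_ASCII));
        if (close) {
            // Encodes the leftover bytes, appends padding and closes the sink too.
            base64.close();
        }
        return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode("abcd", false)); // prints "YWJj"
        System.out.println(encode("abcd", true));  // prints "YWJjZA=="
    }
}
```

Base64 turns every 3 input bytes into 4 output characters, so the fourth byte of "abcd" stays buffered until the encoder knows that no more input will follow.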

The solution for me was to switch to Base64InputStream and have commons-codec do the encoding when I read from my original InputStream source. That way I get valid base64 data for all my tested inputs. I also tried calling #flush() instead of #close(), but that does not write the missing bytes into the stream. A quick peek into the source code revealed that the code in question is only executed in the method #close().
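The idea of encoding on the read side, so that nothing depends on closing the target, can be sketched with the JDK alone. This is not the Base64InputStream approach from the post but a stand-in under the same constraint; the class name, `encodeTo`, `fill` and the chunk size are assumptions for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.Base64;

class ChunkedBase64Demo {

    /** Copies source to target as base64 and leaves target open for further writes. */
    static void encodeTo(InputStream source, OutputStream target) throws IOException {
        Base64.Encoder encoder = Base64.getEncoder();
        // The chunk size is a multiple of 3, so only the final chunk can need padding.
        byte[] chunk = new byte[3 * 1024];
        int filled;
        while ((filled = fill(source, chunk)) > 0) {
            target.write(encoder.encode(Arrays.copyOf(chunk, filled)));
        }
    }

    /** Reads until the buffer is full or the stream ends; returns the bytes read. */
    static int fill(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

After `encodeTo` returns, the base64 data is complete and the caller can keep writing non-base64 content to the same target.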

This was kind of surprising, because the official Javadoc of OutputStream#flush says:

Flushes this output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.

It took me only a couple of minutes to switch from Base64OutputStream to Base64InputStream, but if the test string in my unit test had been just one character different, I would not have caught this prior to integration testing. That is always a headache when you are trying to make systems talk to each other and this kind of unexpected behavior sits somewhere deep down in the guts of your code.
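Why a single character can decide whether a test catches this: base64 works on groups of 3 bytes, so an input whose length is a multiple of 3 leaves nothing buffered. The sketch below counts the characters that are still missing after flush() for several input lengths. It again uses the JDK encoder as a stand-in for commons-codec, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Base64;

class PaddingLengthDemo {

    /** Base64 characters that only appear once the encoder stream is closed. */
    static int missingChars(int inputLength) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream base64 = Base64.getEncoder().wrap(sink);
        base64.write(new byte[inputLength]);
        base64.flush(); // flushes the sink, but not the encoder's leftover bytes
        int beforeClose = sink.size();
        base64.close();
        return sink.size() - beforeClose;
    }

    public static void main(String[] args) throws IOException {
        for (int length = 3; length <= 6; length++) {
            System.out.println(length + " bytes: " + missingChars(length) + " chars missing");
        }
    }
}
```

With 3 or 6 input bytes nothing is missing, while 4 or 5 bytes each leave 4 characters unwritten, so a test should cover lengths that are not a multiple of 3.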