Software

I'm afraid that I am addicted to computers, so I have wasted hundreds of hours writing software. Some of this stuff might even be useful to someone.

Java String Encoding Internals

I have written about my experiments with Java string encoding performance before. However, someone asked me some questions about why my code can do better than the JDK. I did not have a good answer, so I dug into the JDK source code to find out. If you want to know what happens when you call String.getBytes("UTF-8"), read on. The short version for UTF-8 is: ... (521 words)

Read more ...

[ 2010-May-31 16:15 | Permanent Link ]

Java String Encoding Performance

When writing Java String objects to the external world, either to a file or over the network, they must be converted to bytes in some specific encoding. Personally, I recommend UTF-8, but that is another issue. No matter what encoding is used, some API must be called to do the conversion. The simplest is String.getBytes(), which returns a new byte[] array with the data in a given character set. Unfortunately, when buffering writes for high performance, this typically results in an extra copy and an array that promptly needs to be garbage collected, which is wasteful. CharsetEncoder provides a mechanism to serialize into a ByteBuffer, which can be reused. This is more complicated, but should avoid extra copies and memory allocations. Interestingly, it doesn't improve performance when used in a straightforward fashion. It turns out that if you are only encoding a few strings, use String.getBytes(). If you need to encode many strings into a separate buffer, copy the string into a char[] array and use that as the input to a CharsetEncoder. This article describes the details of these approaches, and shows some performance numbers. [Update: For more detail, read how the JDK encodes strings]. ... (633 words)

Read more ...

[ 2009-November-23 09:10 | Permanent Link ]

Incomplete Index of Serialization Formats

This is a partial list of data interchange formats that attempt to make it easy to exchange data between applications that may be written in different languages. The most common use is to build RPC-like systems, so one application can submit a request over the network, and receive some response. Unfortunately, there are tons of these formats. If you think you should invent your own, please think again. It would help our industry if there was broad support for a small number of interchange formats. As you can see from this list, there are tons that exist already. Please don't invent a new one. Other people agree with me on this. I recommend you pick one from the "somewhat broadly used" list. If anyone has any comments or updates, let me know and I'll keep this up to date. ... (563 words)

Read more ...

[ 2009-November-01 11:52 | Permanent Link ]

Efficient Java I/O: byte[], ByteBuffers, and OutputStreams

Java provides two APIs for Network I/O: The original java.net.Socket class (via InputStream and OutputStream), and the newer java.nio.SocketChannel. Both of these eventually call the read and write operating system calls to actually send and receive data. However, I was curious about how exactly data gets to the operating system, since copying data multiple times is an easy way to decrease performance. I ended up spelunking through the Java source code to discover the answers. This article presents a brief walk through of the write path (the read path is basically identical), and some small benchmarks to discover the actual performance differences. The short summary is that NIO with direct ByteBuffers should be the most efficient form of I/O, but it depends on how your data gets into the buffers. ... (887 words)

Read more ...

[ 2009-October-21 11:36 | Permanent Link ]

Java/C++ Networking Microbenchmark

For a project I am working on at school, we wanted to compare using a thread-per-connection server against using a single threaded event driven server. This is a subject that has been studied to death (e.g.: [1, 2, 3, 4, 5]), but it is still controversial. Somewhat recently I came across an online discussion claiming that threads are faster than Java NIO (second source). However, the benchmark that is discussed there is primarily throughput limited. Our application processes many tiny messages, so I expect the results could be different. Hence, this benchmark tests a simple echo server. The server reads a client request (a length prefixed blob of bytes) then echos that request back. We wrote four versions: C++ and Java, epoll and threads. ... (368 words)

Read more ...

[ 2008-October-24 08:24 | Permanent Link ]

JSamp: A sampling profiler for Java

When investigating performance issues, profilers are the first tool I use. When working with C++ on Linux, I use Google's C++ profiler. Recently, I've been working on a Java project where performance was an important priority. While VisualVM and HProf are both useful tools, they both have important limitations. VisualVM is an instrumenting profiler, which means that when running a program with it, the program becomes significantly slower. Worse, since it instruments every method call, tiny methods which are normally very cheap will become expensive. Hence, I don't trust the results. HProf is a sampling profiler, which impacts the performance less and produces more realistic results. However, it starts profiling at the beginning of the Java program, and finishes at the end. This was inappropriate for my application, which has large start up overhead. Hence, I wrote JSamp, with help from my co-worker (co-student?) Yang Zhang. ... (198 words)

Read more ...

[ 2008-September-15 21:19 | Permanent Link ]

Extracting Text from Wikipedia

One of the greatest things about Wikipedia is that it is a completely open project. The software used to run it is open source, and the data is freely available. For a natural language processing course, I processed some text from Wikipedia. It was considerably harder than I expected. One of the biggest problems is that there is no well-defined parser for the wiki text that is used to write the articles. The parser is a mess of regular expressions, and users frequently add fragments of arbitrary HTML. Here is how I managed to wade through this and get something useful out the other end, including the software and the resulting data. ... (632 words)

Read more ...

[ 2008-April-13 16:50 | Permanent Link ]

Easy Stack Traces

I like writing unit tests and using lots of assert() statements while writing C++ code. This means I spend a lot of time looking at stack traces to figure out where my tests are failing. I got annoyed with having to run my programs from GDB, when all I really want is the stack trace, with the functions, files and line numbers listed. ... (340 words)

Read more ...

[ 2008-April-03 20:42 | Permanent Link ]

van Emde Boas Priority Queue

I am taking an Advanced Algorithms course at the moment, and it covered van Emde Boas priority queues. It is a priority queue or an associative array that maps sorted integer keys to values. It performs all operations in at most O(log log n) time, as opposed to O(log n) time for the traditional binary heap priority queue. The catch is that the keys must be limited to a certain range, typically between 0 and some power of 2. As usual, the Wikipedia article is very informative. I decided to implement the van Emde Boas queue in Python to actually understand how it works. It is designed to be executable pseudocode, and not an industrial strength implementation. However, I'm pretty sure it is correct thanks to its fairly extensive tests. It supports insert, delete, find, max, min and predecessor operations. Adding a successor operation is left as an exercise for the reader.

[ 2007-November-08 07:19 | Permanent Link ]

Tools for ns2 Simulations

The ns2 network simulator is a powerful tool, but unfortunately it does not provide any tools for getting results. It simply simulates a network, and provides a trace file of all the events that occured. This means that everyone goes out and writes their own tools to parse through these files. This is my attempt at creating a collection of tools for working with ns2. These tools are written in Python, and require Python 2.3. Many of them are command line tools, but all of them also expose a Python API. None of them are supported, so you'll need to figure it out yourself. Good luck! [Updated 2007-09-28: New revision and more documentation]. ... (320 words)

Read more ...

[ 2007-September-28 07:53 | Permanent Link ]