Evan Jones' Scratch Pad

I am a Ph. D. student at MIT. I am interested in low-level systems software generally, and at the moment, distributed systems more specifically. This is my poorly organized pile of stuff.

Weak Isolation in Relational Databases

The ACID model is a key feature of traditional relational database systems. The I stands for Isolation, meaning that transactions cannot see intermediate results from other transactions. The traditional definition of isolation is serializability, where the results of processing a set of transactions is equivalent to executing them one at a time in some order. Effectively, each transaction can pretend to be executing by itself, when in reality multiple transactions can run concurrently. This is very useful, since applications do not need to worry about concurrency, making it much easier to write correct code. However, most database systems do not actually provide this model. Many systems, notably Postgres and Oracle, provide Snapshot Isolation, which is weaker. The primary difference is that snapshot isolation only checks for write/write conflicts, while serializability enforces read/write conflicts. Even worse, MySQL with InnoDB by default uses what it calls the REPEATABLE READ isolation level, but it has reads which are not repeatable. Similarly, Postgres defaults to the READ COMMITTED isolation level. The good news is that if you explicitly tell MySQL to use the SERIALIZABLE isolation level, it really is serializable. This article will provide a few concrete examples of where these isolation levels differ. ... (1171 words)

Read more ...

[ 2009-December-10 09:18 | Permanent Link ]

Java String Encoding Performance

When writing Java String objects to the external world, either to a file or over the network, they must be converted to bytes in some specific encoding. Personally, I recommend UTF-8, but that is another issue. No matter what encoding is used, some API must be called to do the conversion. The simplest is String.getBytes(), which returns a new byte[] array with the data in a given character set. Unfortunately, when buffering writes for high performance, this typically results in an extra copy and an array that promptly needs to be garbage collected, which is wasteful. CharsetEncoder provides a mechanism to serialize into a ByteBuffer, which can be reused. This is more complicated, but should avoid extra copies and memory allocations. Interestingly, it doesn't improve performance when used in a straightforward fashion. It turns out that if you are only encoding a few strings, use String.getBytes(). If you need to encode many strings into a separate buffer, copy the string into a char[] array and use that as the input to a CharsetEncoder. This article describes the details of these approaches, and shows some performance numbers. ... (623 words)

Read more ...

[ 2009-November-23 09:10 | Permanent Link ]

Incomplete Index of Serialization Formats

This is a partial list of data interchange formats that attempt to make it easy to exchange data between applications that may be written in different languages. The most common use is to build RPC-like systems, so one application can submit a request over the network, and receive some response. Unfortunately, there are tons of these formats. If you think you should invent your own, please think again. It would help our industry if there was broad support for a small number of interchange formats. As you can see from this list, there are tons that exist already. Please don't invent a new one. Other people agree with me on this. I recommend you pick one from the "somewhat broadly used" list. If anyone has any comments or updates, let me know and I'll keep this up to date. ... (563 words)

Read more ...

[ 2009-November-01 11:52 | Permanent Link ]

Efficient Java I/O: byte[], ByteBuffers, and OutputStreams

Java provides two APIs for Network I/O: The original java.net.Socket class (via InputStream and OutputStream), and the newer java.nio.SocketChannel. Both of these eventually call the read and write operating system calls to actually send and receive data. However, I was curious about how exactly data gets to the operating system, since copying data multiple times is an easy way to decrease performance. I ended up spelunking through the Java source code to discover the answers. This article presents a brief walk through of the write path (the read path is basically identical), and some small benchmarks to discover the actual performance differences. The short summary is that NIO with direct ByteBuffers should be the most efficient form of I/O, but it depends on how your data gets into the buffers. ... (887 words)

Read more ...

[ 2009-October-21 11:36 | Permanent Link ]

Building Reliable Storage on Virtual Infrastructure

I have had a few discussions recently about storing data reliably on so-called "Infrastructure as a Service" (IaaS) platforms, such as Amazon's EC2 or Rackspace Cloud. These services provide a virtual machine hosted in some data center somewhere. You can use the local disk on this virtual machine to store data. The catch is that for some types of failures, the data stored on that disk will be lost. This makes life easier for the infrastructure provider, since they don't need to worry about attempting to recover data from failed disks, or moving disks from failed machines. The problem is that users panic when they hear that they can't trust the data on the local disks. However, from my perspective, this is not a new problem. ... (622 words)

Read more ...

[ 2009-August-07 17:33 | Permanent Link ]

Open Cirrus: Open Source Cloud Infrastructure

I just stumbled upon the web site for a research project called Open Cirrus, aimed at developing open source "cloud computing" infrastructure. It is a collaboration between HP, Intel, Yahoo!, and a number of academic institutions. The web site is terrible, but there is a set of interesting presentations given at a recent workshop. The most interesting is one from Andrew Chein, the VP of research at Intel, titled "Seizing the Open Source Cloud Stack Opportunity." The presentation seems to argue that we really need some common consensus on the software infrastructure (a "unified" stack). The presentation points out that there are a huge number of projects all duplicating the same functionality, and we really need only one or two pieces of software for each piece of the stack (monitoring, job scheduling, etc.). This is music to my ears, since I have been frustrated by this same problem in my own work. For the moment, Open Cirrus seems to be using the following stack: Tashi for virtual machine job management, Ganglia for monitoring, then a variety of other layers on top, primarily Hadoop related. At this point, I don't really care what people choose, as long as the resulting "unified stack" is widely adopted.

[ 2009-August-02 14:56 | Permanent Link ]

State of Toronto Espresso

I just spent a week in Toronto, and as an espresso addict I spent some of my time in search of good coffee. While I didn't get to try all the cafés that have good buzz on the Internet, in my limited experience Toronto's espresso needs work. The cafés I visited typically have great spaces and great looking baked goods, but on my personal bad/acceptable/good scale, their espresso was only acceptable. Mind you, I am picky, so acceptable is better than 90% of the establishments out there, but this does not compare to New York or Vancouver, where I have had great espresso from multiple cafés. Maybe this is why Espresso Map, which I find very reliable, only lists one location in Toronto, which I did not get the opportunity to try. Toronto shouldn't feel too bad, because it is not alone. In Boston, a city of roughly similar population and size, I only know two places that serve what I consider to be good or even occasionally great espresso. I blame Toronto's problem on Tim Horton's. Ontarians think Tim Horton's is the be all and end all of coffee. Actually, that might explain the problem in Boston as well: New Englanders love their Dunkin' Donuts.

[ 2009-July-09 13:10 | Permanent Link ]

Deploying Distributed Applications

Today, many important applications run on multiple machines, for both fault-tolerance and to handle additional load. The most common example are web applications, which typically involve load balancers, web servers, application servers, and database servers. Sadly, there is no standard infrastructure for starting, stopping and updating these applications across multiple machines. In many cases, deploying the application requires manually configuring multiple servers, which is tedious and error prone. There are many tools which attempt to address these problems, but as a developer of distributed applications, none of them seem ideal. I want a system where I can declaratively state "here is my application and the resources it needs. Start x instances with this command line." The system should find available servers, distribute the application and start it. It should also then provide some basic tools for monitoring the health of the services, such as restarting the instance when it fails. This article is a survey of the tools I am aware of, and why I find them deficient. If I am wrong about anything here, or there are tools I should be aware of, please let me know. ... (897 words)

Read more ...

[ 2009-May-25 16:05 | Permanent Link ]

The Datacenter as a Computer

This is a 120 page document describing the design of state of the art, large scale computing facilities, such as those run by the big Internet companies. It discusses everything from facilities issues through the computing hardware through to the software infrastructure. This is an excellent design guide about how everyone should be designing data centers of all sizes, not just huge facilities. Don't be intimidated by its length: it is very easy to read. Just browse the table of contents and pick and choose the sections that interest you. I particularly enjoyed Chapter 5: Energy and Power Efficiency. ... (130 words)

Read more ...

[ 2009-May-23 18:06 | Permanent Link ]

Model Checking a Paxos Implementation

I find distributed consensus algorithms, such as Viewstamped Replication or Paxos, to be somewhat magical. Given a set of processes communicating over a network, they will either all eventually agree on the same value, or they will fail to reach agreement if the network is failing in arbitrary ways. These algorithms are used to build reliable distributed systems using replication. However, they are notoriously tricky to implement correctly. Subtle bugs that only occur in rare situations with unusual sequences of failures are easy to introduce. In order to build a replicated distributed system that can survive failures, the replication protocol effectively needs to be bug free. Thus, it is worth spending considerable effort to verify that the implementation is correct. ... (1185 words)

Read more ...

[ 2009-April-15 13:57 | Permanent Link ]