Tracing a Python performance bug on App Engine

about | archive


[ 2016-October-30 09:27 ]

I work at Bluecore where, like any growing application, we periodically run into interesting performance problems. Most of our service runs on App Engine, which has pretty good automatic scaling. That causes most performance issues to show up as large bills, rather than production failures. This is good news for our customers, but bad news for Bluecore, so we keep a close eye on our costs. While investigating a cost increase, we discovered that the Datastore was taking nearly a second to return 20 objects, instead of the normal 100 ms. In this article, I'll talk about how we used traces and logging to find the problem, then monkey patched Google's code to fix it. I think there are some generally applicable lessons, but the details may only be interesting to the unfortunately small number of people who use App Engine. TL;DR: App Engine's older Python Datastore library is very slow when serializing big objects. Traces helped make the problems visible, but logs pinpointed the issue.

We store a lot of data in the Google Cloud Datastore, which is a slightly weird key/value store with lots of features. For the purposes of this discussion, all that matters is that you can look up objects by key. We had implemented a new feature that read a batch of 20 objects using their key, which we expected to take no more than around 100 ms. Unfortunately, our monitoring showed this new feature was very slow, sometimes taking more than a second. App Engine traces a fraction of all requests, so we found a slow request and looked at the trace. We saw something like the following:

Request trace without logs

This shows a backend task taking around two seconds to return a result. The selected Datastore get took 13 ms, and the details on the right shows it returned 10 entities. Around 300 ms later the application then does a Datastore put. In the worst cases, we saw a gap after a get of up to a second. So what the heck was happening? To get more detail, the trace tool can also show us our application logs:

Request trace without logs

The gap between the Datastore get and the log message is about 250 ms, so we still are missing a lot of time. However, the requests combined with the log messages told us approximately what chunk of code was responsible. To track it down further, we logged the execution time of small pieces of the suspicious code. It turns out that the db.get API call was taking basically all the time. This was confusing at first: the trace showed the Datastore get returning after 100 ms, but our logs showed the function call taking a second. I suspected this meant that the Datastore was giving our application the data in the ~100 ms reported in the trace, but whatever the Python library was doing after that was slow.

While App Engine does have some private implementation pieces, the bulk of the code appears to be what they provide with the development kit. I walked through the code to see if I could find anything obviously slow. It turns out the older db library we use is built using an undocumented API called datastore.GetAsync. This returns an Entity object, that the db library then converts to the application's custom Model class. Since we were now down to a single function call, I wrote a microbenchmark which thankfully demonstrated the problem, and showed that datastore.GetAsync was about 2X faster. We tried deploying that fix.

While this did run significantly faster, there were still cases that were extremely slow. We logged the object keys for slow requests, then investigated using the microbenchmark. We found that the slow objects had more than 100 attributes, and that more attributes made deserialization slower. Looking more carefully at the implementation, it became clear why. The code converts all attributes from bytes to Python objects. However, for our feature, we only needed about four or five attributes, so this was wasted effort. Unfortunately, there was no easy API that skipped this step.

You can't change the core libraries on App Engine, since Google provides them for you. However, this is Python and everything is mutable. Eventually we found a place where we could monkey patch their internal API to skip the conversion, and just return the "raw" data from the Datastore. The microbenchmark showed that solved the performance problem. To get the attributes we needed, we wrote a wrapper that converted them lazily. This was around 4-5X faster than the original API, and allowed us to reach our performance target.

Lessons

Performance notes for App Engine users

If you use App Engine with Python, I've summarized some benchmark results from my microbenchmark. If you want to re-run these tests yourself, I've published the microbenchmark, as well as the LazyEntity class and monkey patch that does lazy deserialization. The worst part is that after I did all this work, I found that Matt Faus found the same problem in December 2014. His solution is similar, so you may want to evaluate both if you run into this issue.

For even more details, see the notes on my datastore performance microbenchmark.