Tuesday, November 26, 2013

Unicast enumerators let you debug in production.

Have you ever had to debug something broken in production? Have you ever had to grep through mountains of logs to find a stack trace? What if I told you that you could get log data back in your API call's response? But wait, there's more! What if you could run jobs via an API endpoint, allowing them to be scheduled from other services?

The ghost of Billy Mays here, and before I'm off to haunt that dick bag sham wow guy, I just wanted to tell you about this amazing opportunity to use Play Framework's unicast enumerators.

It's no secret that Klout fucks with iteratees. We use them all over the place: our collectors, our targeting framework, our HBase and ElasticSearch libraries. Recently, we discovered one weird trick to debug code in production using this framework. Bugs hate us!

Iteratees

If you don't know what an enumerator or iteratee is, usher yourself to the Play website to have a 20-megaton knowledge bomb dropped on your dome piece. If you're too lazy, let me give you a mediocre 10,000-foot overview: enumerators push out inputs, enumeratees transform those inputs, and iteratees are sinks that consume those inputs.

For example, in the following code snippet:

List(1,2,3).map(_ + 1).foldLeft(0)(_ + _)

The list is the enumerator, the mapping is the enumeratee, and the folding is the iteratee. Of course, in the iteratee world everything is done piecemeal and asynchronously. The 1 gets pushed to the _ + 1 enumeratee, which then pushes a 2 to the fold iteratee, which adds it to its running total.
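Translated into Play's iteratee library, that same pipeline would look something like this (sketched against the Play 2.1-era API):

import play.api.libs.iteratee.{Enumerator, Enumeratee, Iteratee}
import play.api.libs.concurrent.Execution.Implicits.defaultContext

val enumerator = Enumerator(1, 2, 3)                   // pushes inputs
val enumeratee = Enumeratee.map[Int](_ + 1)            // transforms each input
val iteratee   = Iteratee.fold[Int, Int](0)(_ + _)     // consumes inputs into a sum

val futureSum = enumerator &> enumeratee |>>> iteratee // eventually 9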

Play Framework

Interestingly enough, Play's HTTP handling is actually done by iteratees. Your controller actions return byte array enumerators, which get consumed by various iteratees. Everything else is just a bunch of sugar on top of this low-level interface.
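To make that concrete, here's roughly what an action looks like with the sugar peeled off (a sketch against the Play 2.1 API, controller name made up):

import play.api.mvc._
import play.api.libs.iteratee.Enumerator

object RawController extends Controller {
  // A Result is basically a status header plus an enumerator for the body,
  // which Play's iteratees consume and write out to the socket.
  def raw = Action {
    SimpleResult(
      header = ResponseHeader(200, Map("Content-Type" -> "text/plain")),
      body = Enumerator("pushed straight out of an enumerator")
    )
  }
}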

Also, in Play 2.1 there is a unicast enumerator, which allows you to imperatively push inputs via a channel. Yeah, yeah, we all know that imperative code is uglier than Steve Buscemi and Paul Scheer's unholy love child, but pushing inputs allows you to do some interesting things. If you haven't solved the fucking mystery yet, and are still as confused as you were by the ending to Prometheus, let me give you a giant fucking hint: you can push debug statements via the channel, which get spat out by the enumerator. This enumerator then gets consumed by one of Play's iteratees.
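The mechanics are dead simple. A bare-bones sketch:

import play.api.libs.iteratee.{Concurrent, Iteratee}
import play.api.libs.concurrent.Execution.Implicits.defaultContext

// You get a channel, you imperatively push whatever you want through it, and
// the resulting enumerator spits those pushes out to whoever consumes it.
val debugLog = Concurrent.unicast[String] { channel =>
  channel.push("got this far")
  channel.push("still alive")
  channel.eofAndEnd()
}

val done = debugLog |>>> Iteratee.foreach[String](println)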

So here's my implementation of this pattern:

The call starts with a controller wrapper, intended to be used like this:

The event stream then gets passed down to the service method as an implicit.
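Concretely, the usage looks something like this. The names here (StreamingAction, EventStream, BlahService) are hypothetical stand-ins, and the wrapper itself is sketched a bit further down:

import play.api.mvc._

object BlahController extends Controller {
  def blah = StreamingAction { implicit events =>
    BlahService.fetchBlah()              // returns the JSON body as a String
  }
}

object BlahService {
  def fetchBlah()(implicit events: EventStream): String = {
    events.info("Fetching something from HBase")
    """{"billymays" : "badass"}"""
  }
}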

When you curl the endpoint with a special header, the log pushes will be returned as a chunked HTTP response AS THEY ARE PUSHED.

curl -H 'X-Klout-Use-Stream:debug' 'http://rip.billy.com/api/blah.json'

[info] [2013-11-26 09:57:50.931] Fetching something from HBase
[info] [2013-11-26 09:57:51.932] HBase returned in 1s. Durrgh.
[complete] [2013-11-26 23:25:32.499] {"billymays" : "badass"}
200

Here's the wrapper code itself. The implementation of the EventStream case class is left as an exercise for the reader.

Basically what's going on here is I look through the headers for my special header and see if it maps to a predefined event level. If it doesn't, I do the usual shit. If it does, I call Concurrent.unicast. This factory takes in a function that uses a channel of type T and returns an enumerator of type T. I take this channel and put it into an EventStream case class. This case class gets passed down to the business logic, which typically uses it as an implicit.
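A minimal sketch of that flow, assuming a plain String payload and a single "debug" level (the real EventStream carries more than this), might look like:

import play.api.mvc._
import play.api.libs.iteratee.Concurrent
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import scala.concurrent.Future
import scala.util.{Success, Failure}

case class EventStream(channel: Option[Concurrent.Channel[String]]) {
  def info(msg: String): Unit = channel.foreach(_.push(s"[info] $msg\n"))
}

object StreamingAction {
  def apply(block: EventStream => String): Action[AnyContent] = Action { request =>
    request.headers.get("X-Klout-Use-Stream") match {
      case Some("debug") =>
        // Special header present: hand the business logic a channel-backed
        // EventStream and stream its pushes back as a chunked response.
        val enumerator = Concurrent.unicast[String] { channel =>
          Future(block(EventStream(Some(channel)))).onComplete {
            case Success(body) =>
              channel.push(s"[complete] $body\n")
              channel.eofAndEnd()
            case Failure(e) =>
              channel.push(s"[error] ${e.getMessage}\n")
              channel.end(e)
          }
        }
        Results.Ok.chunked(enumerator)
      case _ =>
        // No header: behave like a boring, regular action.
        Results.Ok(block(EventStream(None)))
    }
  }
}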

Word.

Application

In addition to being able to stream logs by request in production, you can by extension stream logs from a long-running request and call it a job. This means you can use external services like Jenkins to run an API endpoint like a job.

Yay! Also, Deltron:

Saturday, June 15, 2013

Protip: layer cake

This post is about Scala's cake pattern. If you don't know what it is, educate yourself here. The tl;dr: the cake pattern is the canonical way of doing type-safe dependency injection in Scala.

An unfortunate problem with the cake pattern is registry bloat. The Klout API (surprise surprise) happens to be a pretty large application. Our registry got to be over 100 lines long. It's not a huge deal, but it's ugly as fuck. And if there's one thing we love at Klout more than cursing, it's damn hell ass beautiful code.

That is not beautiful.

To shrink down the registry and just organize our code better in general, we've started using what we're tentatively calling the layer cake pattern. And much like Daniel Craig's character, we're all in it to be good middlemen.

Instead of directly adding components to the application cake, we build a sub-cake of related components. For example, our perks system is composed of 6 components that talk to each other. We then add this sub-cake (called a feature) to the application cake. External dependencies are handled as self-types on the sub-cake. This is an extra step, but I'd argue it's more than compensated for by the time you save when you add components to a feature: you don't have to touch the application cakes, since they already include the feature.
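For concreteness, here's a toy version of the shape of it. The component names are made up, not our actual perks code:

trait HBaseComponent {
  def hbase: HBase
  trait HBase { def get(id: Long): Option[String] }
}

trait PerkRepoComponent {
  def perkRepo: PerkRepo
  trait PerkRepo { def find(id: Long): Option[String] }
}

trait PerkServiceComponent {
  def perkService: PerkService
  trait PerkService { def describe(id: Long): String }
}

// The sub-cake ("feature"): related components wired together once, with the
// external dependency expressed as a self-type instead of a mix-in.
trait PerksFeature extends PerkRepoComponent with PerkServiceComponent {
  self: HBaseComponent =>

  lazy val perkRepo = new PerkRepo {
    def find(id: Long) = hbase.get(id)
  }
  lazy val perkService = new PerkService {
    def describe(id: Long) = perkRepo.find(id).getOrElse("no perk")
  }
}

// The application cake mixes in one feature instead of every component.
object Registry extends PerksFeature with HBaseComponent {
  lazy val hbase = new HBase {
    def get(id: Long) = Some("perk-" + id)
  }
}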

We haven't completely switched over to this pattern yet (as we are also adherents of suffering-oriented programming), but we are definitely feeling the code organization benefits.

Saturday, June 8, 2013

Tokamak: this datastore idea I had and will probably never build

ElasticSearch is the premier second generation NoSQL datastore. It took all the lessons learned from building BigTable and Dynamo style datastores and applied them to creating something that hadn't really existed yet: a datastore that lets you slice and dice your data across any dimension.

At Klout, I've been tasked with a few projects that sounded technically challenging on paper. ElasticSearch has made these laughably easy. Like a couple hours of off and on hacking while sipping a bloody mary easy.

Fuck "bonsai cool." ElasticSearch is sexy. Ryan Gosling with a calculator watch sexy.

Unfortunately...

Unfortunately, there is one feature that ElasticSearch is missing. There is no multi-dimensional routing.

Every NoSQL datastore does some type of sharding (not to be confused with sharting.) Riak has vnodes, HBase has regions, etc. This partitions your data horizontally so that each node in your cluster owns some segment of your data. Which shard a particular document goes to is usually determined by a hashing function on the document id.

In ElasticSearch, you can also specify a routing value, which allows you to control which shard a document goes to. This helps promote data locality. For example, suppose you have an index of notifications. Well, you can route these notifications by the user id of the notification instead of the notification id itself. That means when a user comes to the site and wants to search through their notifications, you can search just one shard instead of the entire index. One box is active instead of dozens. This is great since you never really search across users for notifications anyway.
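Routing is just a parameter on the index and search calls. A hypothetical example with made-up host, index, and ids:

curl -XPUT 'http://localhost:9200/notifications/notification/n123?routing=user42' -d '{"user_id": "user42", "text": "someone gave you +K"}'

curl -XGET 'http://localhost:9200/notifications/_search?routing=user42' -d '{"query": {"term": {"user_id": "user42"}}}'

Index and search with the same routing value and you only ever touch that user's shard.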

Taking this one dimension further, suppose I have advertising impression data that has a media id (the id of the ad shown), a website id (the id of the website the ad was shown on), and a timestamp. I want to find the distribution of websites that a particular media id was shown on within a date range. Okay, so let's route by media id. No big deal. Two weeks later, product wants to find the distribution of media shown on a particular website. Well, fuck. Nothing to do now except search across the entire index, which is so expensive that it brings down the entire cluster and kills thousands of kittens. This is all your fault, Shay.

What would be ideal is a way to route across both dimensions. As opposed to single dimensional routing, where you search a particular shard, here you would search across a slice of shards. Observe...

Tokamak

If we model a keyspace as a ring, a two-dimensional keyspace would be a torus. Give it a bit of thought here. You create a ring by wrapping a line segment so that the ends touch. Extending that to two dimensions, you take a sheet of paper, wrap one end to make a cylinder, and then wrap the other end to make the torus.

A shard here would be a slice on the surface of the torus (we are modeling it as a manifold instead of an actual solid). Now when we search by media id, we're going across the torus in one direction, and when we search by website id, we're going across the torus in the other direction. Simple, right? I don't think this even needs to be a new datastore. It could just be a fork of ElasticSearch with this one feature.
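To make the routing math concrete, here's a toy sketch of the idea (nothing to do with how ElasticSearch actually shards internally):

// Every document lands on exactly one shard in the grid, but a query that
// fixes only one dimension fans out to a single row or column of shards.
case class TorusRouter(mediaShards: Int, websiteShards: Int) {
  private def bucket(key: String, n: Int) = ((key.hashCode % n) + n) % n

  def shardFor(mediaId: String, websiteId: String): (Int, Int) =
    (bucket(mediaId, mediaShards), bucket(websiteId, websiteShards))

  def shardsForMedia(mediaId: String): Seq[(Int, Int)] =
    (0 until websiteShards).map(w => (bucket(mediaId, mediaShards), w))

  def shardsForWebsite(websiteId: String): Seq[(Int, Int)] =
    (0 until mediaShards).map(m => (m, bucket(websiteId, websiteShards)))
}

TorusRouter(8, 8).shardsForMedia("ad-42") touches 8 of the 64 shards, and shardsForWebsite("site-7") touches a different 8, along the other direction of the torus.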

There are a few implications to increasing the number of dimensions we can route on. You are restricted to shard counts of x^n, where n is the number of dimensions. You also really have to plan out your data beforehand. ElasticSearch is a datastore that encourages you to plan your writes so that you don't have to plan your reads. Tokamak would force you to establish your whole schema beforehand, since the sharding of your index depends on it.

In terms of applications, advertising data is the most obvious. That industry not only drives 99% of datastore development, but also employs 99% of all DBAs to run their shit-show MySQL and Oracle clusters.

Well, hope you enjoyed this post, just like I also hope to find the time to work on this, in between writing my second rap album and learning Turkish. If only I could convince someone to pay me money to program computers.

Friday, May 3, 2013

Interesting to see ElasticSearch's decision to give up on partition tolerance validated

ElasticSearch gives up on partition tolerance in the CAP tradeoff, which is unique among NoSQL stores. Most NoSQL datastores readily give up on consistency to achieve availability and partition tolerance. The Basically Available, Soft state, Eventually consistent (BASE) model, which was discussed in the context of banking on High Scalability, is nice, but gives up so much to achieve partition tolerance. While ATMs can be partitioned often, the fact is that most consumer applications do not need such strict partition tolerance.

Looks like a few peeps are coming around to the idea: Dynamo Sure Works Hard

For more, check out Shay Banon's mailing list response on why ElasticSearch gives up on partition tolerance

Sunday, April 28, 2013

The right way to think about memory

Computational Neuroscience is a great class. If you are at all interested in how the mind works, definitely check it out, especially the introduction. One thing it's forever changed is how I view memory.

Culturally, we think of memory as a lockbox where we store things, similar to a computer hard-drive. When learning, I'm storing things into this box and at a later date, I will open the box and pull these things out. We have long accepted the fact that the lockbox is lossy, but the lossiness is confusing as it's unpredictable yet not random. I know your first reaction is to blame the gypsies, but please allow me to first propose an alternative explanation.

The lockbox metaphor is very flawed and leads to poor conclusions when approaching the subject of memory. The physical basis of memory, as far as we know, is Hebbian plasticity, where the synapses between neurons that repeatedly fire together are strengthened. The classic example is a three-neuron system where the first neuron fires when a growl is heard, the second neuron fires when a tiger appears, and the third neuron fires to get you to run like a motherfucker. The first time you hear a growl you'll think nothing of it because you don't associate it with the tiger. However, the tiger then appears, you run, and the synapse between the first and third neurons is strengthened. It is actually strengthened substantially because of the emotional nature of running for your life.

Extend this to the scale of trillions of synapses and the scope of many years. You can quickly see how memory is not about storing things, but linking things together and keeping them linked through experiences.

The way I think of memory now is as a beach. You can write words and draw pictures in the sand, but waves will slowly wash these away bit by bit. You have to repeatedly write and draw the things you care about and with effort, you can carve the words in more so that they are harder to wash away. Life is all about figuring out what you want your beach to look like.

Hence, life's a beach.

Wednesday, April 24, 2013

Lemme gloat for a second

I recently started taking Coursera's Computational Neuroscience class and I wanted to point out how close I got, in this old post, to how brains actually work without actually knowing anything.

The one thing I got wrong was that dendrite configurations are important when in fact it is permanent changes in the neurotransmitter receptors that result in learning and memory.

Also interesting and equally incoherent is this post.

Monday, April 2, 2012

Routes lookup in Play Framework 2.0



Apologies for the gap in blog posts. I got sick after getting back from Ultra and crypto-class has been beating me about the face and neck.

This will be another post about getting around some Play 2.0 jankiness, namely that you can't get the controller and action name from the request object anymore.

This solution is going to make you cringe. It might even make you vomit. I've looked around the framework a decent amount and I just don't see a better way of doing this.

Ready?

Yes, the solution is to populate your own list of compiled routes and test every single one until you get the right route.



Yeah you read that right.

There is no runtime lookup because the routes file is compiled to Scala code and then to bytecode. There is no way to look up anything. The one piece of information you can get from the request about where you came from is the URI, but since URIs map to routes dynamically (for example, you can specify /pics/wangs/{wangid}), you still have to test each route.

If you haven't closed your browser in disgust yet, I'll show you how this is done.

Populating the list




Notice how I got "routes" as a resource stream instead of just opening the "conf/routes" file. You have to be careful here since when you are deploying with "play dist," there is no conf folder.
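Getting at the file is just a classpath lookup, something like this (the real code then feeds the contents to the routes parser):

import scala.io.Source

// "routes" sits at the classpath root once the app is packaged, so read it
// from there instead of assuming a conf/ folder exists on disk.
val routesContent = Option(getClass.getResourceAsStream("/routes"))
  .map(Source.fromInputStream(_).mkString)
  .getOrElse(sys.error("no routes file on the classpath"))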

This code should be pretty straightforward. I use the RouteFileParser in RoutesCompiler to extract some information I care about from the routes file. Then I use that to generate a list of routesFuncs, which can be "unapplied" to a request.
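Since the interesting part is just the shape of the data, here's a dumbed-down stand-in for what that step produces. The real version keeps the compiled Route objects around instead of hand-rolled regexes:

// Each route boiled down to a verb, a path pattern, and where it points.
case class ParsedRoute(verb: String, pattern: scala.util.matching.Regex,
                       controller: String, action: String)

object RouteTable {
  // e.g. "/pics/wangs/{wangid}" becomes ^/pics/wangs/([^/]+)$
  def toRegex(path: String) =
    ("^" + path.replaceAll("""\{[^/}]+\}""", "([^/]+)") + "$").r

  val routes = Seq(
    ParsedRoute("GET", toRegex("/pics/wangs/{wangid}"), "controllers.Pics", "wang"),
    ParsedRoute("GET", toRegex("/api/blah.json"), "controllers.Api", "blah")
  )
}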

Doing lookups




Here we run through the entire list, unapplying each Route to the request you're testing. You can of course stop at the first matching route instead of testing every single one. This is left as an exercise to the reader.
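In terms of the toy table above, the whole lookup is just a linear scan:

import play.api.mvc.RequestHeader

// Test every route against the request; the first match wins.
def lookup(request: RequestHeader): Option[ParsedRoute] =
  RouteTable.routes.find { r =>
    r.verb == request.method && r.pattern.findFirstIn(request.path).isDefined
  }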

I used the reverse of this method to fix routes parsing in swagger-play2. You can see that code here.


While the reverse lookup solution is decently fast since we can just use maps, the forward lookup solution is quite inefficient. The cost is low enough (2ms for a lookup on 100 routes) that we don't care too much, especially since we're doing HBase lookups and all kinds of IO-bound crap for our API.

If you do find a better solution, please let me know. The performance isn't an issue, but the fact that this algorithmic atrocity is out there just makes my blood boil.



Bonus level!



Oh you're still here. Well, let me drop another knowledge bomb on your dome that may prove useful.

At Klout, we have a library for dealing with Play 2.0, also written as a Play 2.0 application. The problem is that when an application depends on this library, the application's routes and application.conf files are overridden by the library's. We have a sample app in the library for testing and it's annoying to constantly move conf/routes to conf/routes2 before publishing.

So instead of getting mad, I choked myself out with a Glad bag. Then I threw this in my Build.scala:



The application.conf part should be straightforward. The other stuff... maybe not. But as I mentioned before, conf/routes is compiled into a whole slew of class files for performance reasons. You can't just exclude conf/routes; you have to exclude all those class files, which is why I have those disgusting-looking regexes.
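The general idea, in sbt terms. The exact patterns below are guesses, since the generated class names depend on your Play version and your controllers:

// Filter the library's conf files and the classes generated from conf/routes
// out of the packaged jar.
mappings in (Compile, packageBin) ~= { ms =>
  val excluded = Seq(
    """application\.conf$""".r,
    """^routes$""".r,
    """Routes(\$.*)?\.class$""".r,
    """routes_(routing|reverseRouting)(\$.*)?\.class$""".r
  )
  ms filterNot { case (_, jarPath) =>
    excluded.exists(_.findFirstIn(jarPath).isDefined)
  }
}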

Don't worry your pretty little head about it though. Just throw this in the PlayProject().settings(...) section of the Build.scala file of your Play 2.0 library.

Enjoy!