Saturday, June 8, 2013

Tokamak: this datastore idea I had and will probably never build

ElasticSearch is the premier second-generation NoSQL datastore. It took all the lessons learned from building BigTable- and Dynamo-style datastores and applied them to create something that hadn't really existed yet: a datastore that lets you slice and dice your data across any dimension.

At Klout, I've been tasked with a few projects that sounded technically challenging on paper. ElasticSearch has made these laughably easy. Like a couple hours of off and on hacking while sipping a bloody mary easy.

Fuck "bonsai cool." ElasticSearch is sexy. Ryan Gosling with a calculator watch sexy.

Unfortunately...

Unfortunately, there is one feature that ElasticSearch is missing. There is no multi-dimensional routing.

Every NoSQL datastore does some type of sharding (not to be confused with sharting). Riak has vnodes, HBase has regions, etc. This partitions your data horizontally so that each node in your cluster owns some segment of your data. Which shard a particular document goes to is usually determined by a hashing function on the document id.
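
To make that concrete, here's the usual scheme sketched in shell (the id and shard count are made up):

# Hash the document id, mod by the number of shards -- that's your shard.
doc_id="notification-9001"
num_shards=12
echo -n "$doc_id" | cksum | awk -v n="$num_shards" '{ print "goes to shard", $1 % n }'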

In ElasticSearch, you can also specify a routing value, which lets you control which shard a document goes to. This helps promote data locality. For example, suppose you have an index of notifications. Well, you can route these notifications by the user id instead of the notification id itself. That means when a user comes to the site and wants to search through their notifications, you can search just one shard instead of the entire index. One box is active instead of dozens. This is great, since you never really search across users for notifications anyways.
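
A rough sketch of what that looks like with curl (the index, type, and field names here are made up, and it assumes an ElasticSearch node on localhost:9200):

# Index a notification, routed by the user id rather than the document id.
curl -XPUT 'http://localhost:9200/notifications/notification/1?routing=user42' -d '
{ "user_id": "user42", "message": "you gained +K in burritos" }'

# Pass the same routing value at search time and only that shard gets hit.
curl -XGET 'http://localhost:9200/notifications/_search?routing=user42' -d '
{ "query": { "term": { "user_id": "user42" } } }'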

Taking this one dimension further, suppose I have advertising impression data that has a media id (the id of the ad shown), a website id (the id of the website the ad was shown on), and a timestamp. I want to find the distribution of websites that a particular media id was shown on within a date range. Okay, so let's route by media id. No big deal. Two weeks later, product wants to find the distribution of media shown on a particular website. Well fuck. Nothing to do now except search across the entire index, which is so expensive that it brings down the entire cluster and kills thousands of kittens. This is all your fault, Shay.

What would be ideal is a way to route across both dimensions. As opposed to single dimensional routing, where you search a particular shard, here you would search across a slice of shards. Observe...

Tokamak

If we model a keyspace as a ring, a two-dimensional keyspace would be a torus. Give it a bit of thought here. You create a ring by wrapping a line segment so that the ends touch. Extruding that to two dimensions, you take a sheet of paper, wrap one pair of edges together to make a cylinder, and then wrap the other pair to make the torus.

A shard here would be a slice on the surface of the torus (we are modeling it as a manifold instead of an actual solid). Now when we search by media id, we're going across the torus in one direction, and when we search by website id, we're going across the torus in the other direction. Simple, right? I don't think this even needs to be a new datastore. It could just be a fork of ElasticSearch with this one feature.
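
Here's a back-of-the-napkin sketch in shell of how placement on that torus could work, assuming a made-up 4x4 grid of shards (x = 4 per dimension, 16 total):

# Hash each routing value independently to pick a cell in the grid.
media_id="ad-123"
website_id="site-456"
x=4   # shards per dimension
row=$(( $(echo -n "$media_id" | cksum | cut -d' ' -f1) % x ))
col=$(( $(echo -n "$website_id" | cksum | cut -d' ' -f1) % x ))
echo "document lands on shard ($row, $col)"
# A media id query fans out to the x shards in row $row; a website id
# query fans out to the x shards in column $col. Either way you touch
# x shards, not all x*x of them.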

There are a few implications to increasing the number of dimensions we can route on. You are restricted to shard counts of x^n, where x is the number of shards per dimension and n is the number of dimensions. You also really have to plan out your data beforehand. ElasticSearch is a datastore that encourages you to plan your writes so that you don't have to plan your reads. Tokamak would force you to establish your whole schema beforehand, since the sharding of your index depends on it.

In terms of applications, advertising data is the most obvious. That industry not only drives 99% of datastore development, but also employs 99% of all DBAs to run their shit-show MySQL and Oracle clusters.

Well, hope you enjoyed this post, just like I also hope to find the time to work on this, in between writing my second rap album and learning Turkish. If only I could convince someone to pay me money to program computers.

Friday, May 3, 2013

Interesting to see ElasticSearch's decision to give up on partition tolerance validated

ElasticSearch gives up on partition tolerance in the CAP tradeoff, which is unique among NoSQL stores. Most NoSQL datastores readily give up consistency to achieve availability and partition tolerance. The Basically Available, soft state, eventually consistent (BASE) model, which was discussed in the context of banking on High Scalability, is nice, but it gives up a lot to achieve partition tolerance. ATMs may get partitioned from the network all the time, but most consumer applications don't need that kind of partition tolerance.

Looks like a few peeps are coming around to the idea: Dynamo Sure Works Hard

For more, check out Shay Banon's mailing list response on why ElasticSearch gives up on partition tolerance.

Thursday, August 26, 2010

CouchDB BAMF Installation

Well I promised that I would teach you how to install CouchDB "with nothing but a magnetized needle and a steady hand." So let's do that.

Most people can install CouchDB with apt-get install. There are occasions when that's not the case. In fact, I'm assuming the worst-case scenario:


  1. You have no package management system.

  2. You don't have root access.

  3. You have really nothing except make, gmake, tar, wget, and a generous helping of badass.


SpiderMonkey 1.7.0



wget http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz
tar xvzf js-1.7.0.tar.gz
cd js/src/
make -f Makefile.ref # add BUILD_OPT=1 for an optimized (non-debug) build
JS_DIST=[where you want to install spidermonkey] make -f Makefile.ref export


ICU 4.2.1



wget http://download.icu-project.org/files/icu4c/4.2.1/icu4c-4_2_1-src.tgz
tar xvzf icu4c-4_2_1-src.tgz
cd icu/source
./runConfigureICU Linux --prefix=[where you want to install icu]
gmake install
cd [where you installed icu]


Curl 7.21.0


CouchDB requires a fairly recent version of curl (>= 7.18.0), which you might not have. You can check with curl --version.

wget http://curl.haxx.se/download/curl-7.21.0.tar.gz
tar xvzf curl-7.21.0.tar.gz
cd curl-7.21.0
./configure --prefix=[where you want to install curl]
make && make install
cd [where you installed curl]


Erlang R13B02



wget http://www.erlang.org/download/otp_src_R13B02.tar.gz
tar xzvf otp_src_R13B02.tar.gz
cd otp_src_R13B02
./configure --prefix=[where you want to install erlang]
make && make install
cd [where you installed erlang]


Fix $PATH


You need a directory on your $PATH that you control; /home/[user]/bin works fine. Then put symbolic links (ln -s) to the ICU, curl, and Erlang binaries in it. You can also just copy the binaries from those install directories into your $PATH directory. That's not recommended for obvious reasons, but it does work.
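
Something like this, where every path is a stand-in for wherever you actually installed things:

mkdir -p /home/[user]/bin
export PATH=/home/[user]/bin:$PATH   # stick this in your .bashrc too
# Symlink the binaries you just built into the directory you control.
ln -s [where you installed erlang]/bin/erl /home/[user]/bin/erl
ln -s [where you installed erlang]/bin/erlc /home/[user]/bin/erlc
ln -s [where you installed curl]/bin/curl /home/[user]/bin/curl
ln -s [where you installed icu]/bin/icu-config /home/[user]/bin/icu-config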

CouchDB!



wget http://www.takeyellow.com/apachemirror/couchdb/0.11.0/apache-couchdb-0.11.0.tar.gz
tar xzvf apache-couchdb-0.11.0.tar.gz
cd apache-couchdb-0.11.0
./configure --with-js-lib=[your spidermonkey install]/lib64 \
            --with-js-include=[your spidermonkey install]/include \
            --with-erlang=[your erlang install]/lib/erlang/usr/include/ \
            --prefix=[where you want to install couchdb]
make && make install
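
If everything went well, a quick smoke test should greet you (0.11.0 should answer with the welcome JSON below):

[where you installed couchdb]/bin/couchdb &   # start CouchDB in the background
curl http://127.0.0.1:5984/
# {"couchdb":"Welcome","version":"0.11.0"}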