
Saturday, June 15, 2013

Protip: layer cake

This post is about Scala's cake pattern. If you don't know what it is, educate yourself here. The tl;dr: the cake pattern is the canonical way of doing type-safe dependency injection in Scala.

An unfortunate problem with the cake pattern is registry bloat. The Klout API (surprise surprise) happens to be a pretty large application. Our registry got to be over 100 lines long. It's not a huge deal, but it's ugly as fuck. And if there's one thing we love at Klout more than cursing, it's damn hell ass beautiful code.

That is not beautiful.

To shrink down the registry and just organize our code better in general, we've started using what we're tentatively calling the layer cake pattern. And much like Daniel Craig's character, we're all in it to be good middlemen.

Instead of directly adding components to the application cake, we build a sub-cake of related components. For example, our perks system is composed of 6 components that talk to each other. We then add this sub-cake (called a feature) to the application cake. External dependencies are handled as self-types on the sub-cake. This is an extra step, but I would argue that it's paid for by the time you save when you add components to a feature. You don't have to add those components to the application cake, since it already includes the feature.
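Here's roughly what that looks like, with made-up component names (a minimal sketch, not our actual perks code):

trait PerkService
trait PerkStore
trait UserService

trait PerkServiceComponent { def perkService: PerkService }
trait PerkStoreComponent   { def perkStore: PerkStore }
trait UserServiceComponent { def userService: UserService }

// the sub-cake: related components bundled into a single "feature" trait,
// with external dependencies declared as a self-type
trait PerksFeature extends PerkServiceComponent with PerkStoreComponent {
  this: UserServiceComponent =>
}

// the application cake mixes in the feature, not its individual components
object Registry extends PerksFeature with UserServiceComponent {
  val perkService = new PerkService {}
  val perkStore   = new PerkStore {}
  val userService = new UserService {}
}

Add a seventh component to the feature and the registry doesn't change.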

We haven't completely switched over to this pattern yet (as we are also adherents of suffering oriented programming,) but are definitely feeling the code organization benefits.

Saturday, June 8, 2013

Tokamak: this datastore idea I had and will probably never build

ElasticSearch is the premier second generation NoSQL datastore. It took all the lessons learned from building BigTable and Dynamo style datastores and applied them to creating something that hadn't really existed yet: a datastore that lets you slice and dice your data across any dimension.

At Klout, I've been tasked with a few projects that sounded technically challenging on paper. ElasticSearch has made these laughably easy. Like a couple hours of off and on hacking while sipping a bloody mary easy.

Fuck "bonsai cool." ElasticSearch is sexy. Ryan Gosling with a calculator watch sexy.

Unfortunately...

Unfortunately, there is one feature that ElasticSearch is missing. There is no multi-dimensional routing.

Every NoSQL datastore does some type of sharding (not to be confused with sharting.) Riak has vnodes, HBase has regions, etc. This partitions your data horizontally so that each node in your cluster owns some segment of your data. Which shard a particular document goes to is usually determined by a hashing function on the document id.

In ElasticSearch, you can also specify a routing value, which allows you to control which shard a document goes to. This helps promote data locality. For example, suppose you have an index of notifications. You can route these notifications by the user id of the notification instead of the notification id itself. That means when a user comes to the site and wants to search through their notifications, you can search on just one shard instead of the entire index. One box is active instead of dozens. This is great since you never really search across users for notifications anyways.
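To make the shard math concrete (a sketch; ElasticSearch uses its own hash function, the point is just which value feeds it):

// default: the document id picks the shard
def defaultShard(notificationId: String, numShards: Int): Int =
  (notificationId.hashCode & Int.MaxValue) % numShards

// routed: all of a user's notifications land on that user's shard, so a
// search over one user's notifications hits exactly one shard
def routedShard(userId: String, numShards: Int): Int =
  (userId.hashCode & Int.MaxValue) % numShards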

Taking this one dimension further, suppose I have advertising impression data that has a media id (the id of the ad shown,) a website id (the id of the website the ad was shown on,) and a timestamp. I want to find the distribution of websites that a particular media id was shown on within a date range. Okay, so let's route by media id. No big deal. Two weeks later, product wants to find the distribution of media shown on a particular website. Well fuck. Nothing to do now except search across the entire index, which is so expensive that it brings down the entire cluster and kills thousands of kittens. This is all your fault, Shay.

What would be ideal is a way to route across both dimensions. As opposed to single dimensional routing, where you search a particular shard, here you would search across a slice of shards. Observe...

Tokamak

If we model a keyspace as a ring, a two dimensional keyspace would be a torus. Give it a bit of thought here. You create a ring by wrapping a line segment so that the ends touch. Extruding that to two dimensions, you take a sheet of paper and wrap one end to make a cylinder and then the other end to make the torus.

A shard here would be a slice on the surface of the torus (we are modeling it as a manifold instead of an actual solid.) Now when we search by media id, we're going across the torus in one direction, and when we search by website id, we're going across the torus in the other direction. Simple, right? I don't think this even needs to be a new datastore. It could be a fork of ElasticSearch with just this one feature.
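Concretely, here's a sketch of two-dimensional routing over a k x k grid of shards (made-up hashing; the slice behavior is the point):

val k = 8  // shards per dimension, so k*k = 64 shards total

def hashDim(id: Long): Int = (((id % k) + k) % k).toInt  // stand-in hash

// writes: both ids pick the shard
def shardFor(mediaId: Long, websiteId: Long): Int =
  hashDim(mediaId) * k + hashDim(websiteId)

// search by media id: one "row" of the torus, k shards instead of k*k
def shardsForMedia(mediaId: Long): Seq[Int] =
  (0 until k).map(col => hashDim(mediaId) * k + col)

// search by website id: one "column"
def shardsForWebsite(websiteId: Long): Seq[Int] =
  (0 until k).map(row => row * k + hashDim(websiteId))

Either way, a one-dimensional query touches k of the k*k shards instead of all of them.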

There are a few implications to increasing the number of dimensions we can route on. You are restricted to shard counts of x^n, where x is the number of shards per dimension and n is the number of dimensions. You also really have to plan out your data beforehand. ElasticSearch is a datastore that encourages you to plan your writes so that you don't have to plan your reads. Tokamak would force you to establish your whole schema beforehand since the sharding of your index depends on it.

In terms of applications, advertising data is the most obvious. That industry not only drives 99% of datastore development, but also employs 99% of all DBAs to run their shit-show MySQL and Oracle clusters.

Well, hope you enjoyed this post, just like I also hope to find the time to work on this, in between writing my second rap album and learning Turkish. If only I could convince someone to pay me money to program computers.

Monday, April 2, 2012

Routes lookup in Play Framework 2.0



Apologies for the gap in blog posts. I got sick after getting back from Ultra and crypto-class has been beating me about the face and neck.

This will be another post about getting around some Play 2.0 jankiness, namely how you can't get controller and action name from the request object any more.

This solution is going to make you cringe. It might even make you vomit. I've looked around the framework a decent amount and I just don't see a better way of doing this.

Ready?

Yes, the solution is to populate your own list of compiled routes and test every single one until you get the right route.



Yeah you read that right.

There is no runtime lookup because the routes file is compiled to Scala code and then to byte code. There is no way to look up anything. The one piece of information you can get from the request about where you came from is the URI, but since URIs map to routes dynamically (for example, you can specify /pics/wangs/{wangid},) you still have to test each route.

If you haven't closed your browser in disgust yet, I'll show you how this is done.

Populating the list
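Something in this spirit (a self-contained sketch: the real version uses the RouteFileParser in RoutesCompiler as described below, but this one fakes the parse with a regex so you can see the shape):

import play.api.Play.current

// each entry: (verb, compiled path pattern, "controller.action")
lazy val routesFuncs: List[(String, scala.util.matching.Regex, String)] = {
  // grab "routes" off the classpath; with "play dist" there is no conf folder
  val stream = current.classloader.getResourceAsStream("routes")
  scala.io.Source.fromInputStream(stream).getLines()
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map(_.split("\\s+", 3))
    .collect { case Array(verb, path, action) =>
      // turn dynamic parts like /pics/wangs/{wangid} into wildcards
      val pattern = path.replaceAll(""":\w+|\{\w+\}""", "[^/]+")
      (verb, ("^" + pattern + "$").r, action)
    }
    .toList
}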




Notice how I got "routes" as a resource stream instead of just opening the "conf/routes" file. You have to be careful here since when you are deploying with "play dist," there is no conf folder.

This code should be pretty straightforward. I use the RouteFileParser in RoutesCompiler to extract some information I care about from the routes file. Then I use that to generate a list of routesFuncs, which can be "unapplied" to a request.

Doing lookups




Here we run through the entire list, unapplying each Route to the request you're testing. You can of course bail out at the first matching route instead of testing all of them. This is left as an exercise to the reader.
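Against the sketch above, the walk looks like this (the real code unapplies the compiled Route objects instead of matching regexes):

import play.api.mvc.RequestHeader

// test every route; first match wins
def lookup(request: RequestHeader): Option[String] =
  routesFuncs.collectFirst {
    case (verb, pattern, action)
        if verb == request.method && pattern.findFirstIn(request.path).isDefined =>
      action
  }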

I used the reverse of this method to fix routes parsing in swagger-play2. You can see that code here.


While the reverse lookup solution is decently fast since we can just use maps, the forward lookup solution is quite inefficient. The cost is low enough (2ms for a lookup on 100 routes) that we don't care too much, especially since we're doing HBase lookups and all kinds of IO-bound crap for our API.

If you do find a better solution, please let me know. The performance isn't an issue, but the fact that this algorithmic atrocity is out there just makes my blood boil.



Bonus level!



Oh you're still here. Well, let me drop another knowledge bomb on your dome that may prove useful.

At Klout, we have a library for dealing with Play 2.0, also written as a Play 2.0 application. The problem is that when an application depends on this library, the application's routes and application.conf files are overridden by the library's. We have a sample app in the library for testing and it's annoying to constantly move conf/routes to conf/routes2 before publishing.

So instead of getting mad, I choked myself out with a Glad bag. Then I threw this in my Build.scala:
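It looks something like this (sbt 0.11-era settings; treat the regexes as guesses to tune against whatever class files your routes actually compile to):

// in the project's settings sequence
mappings in (Compile, packageBin) ~= { ms =>
  ms.filterNot { case (_, path) =>
    path == "application.conf" ||
    path == "routes" ||
    path.matches("""Routes(\$.*)?\.class""") ||        // compiled routes
    path.matches("""controllers/routes.*\.class""") || // reverse routing
    path.matches(""".*/Reverse\w+(\$.*)?\.class""")    // reverse controllers
  }
}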



The application.conf part should be straightforward. The other stuff... maybe not. But as I mentioned before, conf/routes is compiled into a whole slew of class files for performance reasons. You can't just exclude conf/routes; you have to exclude all those class files, which is why I have those disgusting looking regexes.

Don't worry your pretty little head about it though. Just throw this in the PlayProject().Settings section of the Build.scala file of your Play 2.0 library.

Enjoy!

Wednesday, March 21, 2012

Framework ids in Play 2.0





The problem



Pretend you're a sysops guy (for some of you, this will not require much imagination.) You have an application that needs to be deployed on multiple environments (dev, stage, qa, prod, etc.,) each with different component dependencies. prod might use the sql1.prod.root mysql server while dev would use sql-dev.stage.root.






config.xml: The technical term is "cluster-fuck"


Back in the day, you would just partition these configs into different files. You would have a site.conf.prod file, a site.conf.dev file, etc. and would rename these to site.conf as needed. As you can imagine, even with deployment automation, this really, really sucks as you can have two environments that might need to share a large part of their configuration. While SQL server configuration can be exclusive between the environments, you have things like pool sizes and connection timeouts that should be roughly the same across many environments.

Play 1.2 had framework ids, which were designed to solve this problem. You could pass them in from the command line like play run --%prod or you could specify them directly using play id. When specifying --%prod, all keys prefixed with %prod would override the base keys. For example: %prod.http.port will override http.port. This allows you to partition your environments however you like, while always having a fallback. Suppose you always want to have a connection timeout of 200ms. Well now you can, without needing to specify this individually for prod,qa, etc.

Unfortunately, Play 2.0 did away with framework ids and there does not appear to be any replacement in sight.

The solution





Thankfully, Play 2.0 allows you to supply your own configuration by overriding the Global object. If you override the configuration method of the Global object (an undocumented feature,) you can pass back any configuration you want at load time. Unfortunately, you don't get access to the default implementation of configuration or any configuration it would have created. You have to re-build the whole thing yourself. To do this, I used the same ConfigFactory library that Play's default config parsing uses.

When you use this library to parse "application.conf", it also inserts a few useful parameters. The one we care about is "sun.java.command," which is the full command that was used to launch your app. This includes any environment parameters you might have sent in. Unfortunately, this does not work with "play start". Play start forks a Netty server from your sbt launch process, so sun.java.command just shows the Netty command, with no command line args. I got around this by creating an application.tag config variable that the app can fall back on in the absence of command line args. At Klout, we use Capistrano to deploy our APIs to both staging and production environments. This allows me to inject an "application.tag=%stage" or "application.tag=%prod" line into application.conf during deployment so that deployment is still a one step process. It's a bit janky, but it works well.
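The skeleton looks something like this (a sketch: the configuration override is the undocumented hook mentioned above, so the exact signature may differ in your Play version, and the %env key folding assumes the % prefix survives HOCON parsing):

import scala.collection.JavaConverters._
import com.typesafe.config.ConfigFactory
import play.api.{Configuration, GlobalSettings}

object Global extends GlobalSettings {
  override def configuration: Configuration = {
    val base = ConfigFactory.load()
    // sun.java.command shows up as a system property; with "play start" it's
    // useless, so fall back to the application.tag we inject on deploy
    val cmd = if (base.hasPath("sun.java.command")) base.getString("sun.java.command") else ""
    val env = """--%(\w+)""".r.findFirstMatchIn(cmd).map(_.group(1))
      .orElse(if (base.hasPath("application.tag"))
                Some(base.getString("application.tag").stripPrefix("%"))
              else None)
    val config = env match {
      case Some(e) =>
        // fold %env.foo.bar keys down over foo.bar, Play 1.2 style
        val overrides = base.entrySet.asScala.collect {
          case entry if entry.getKey.startsWith("%" + e + ".") =>
            entry.getKey.stripPrefix("%" + e + ".") -> entry.getValue.unwrapped
        }.toMap
        ConfigFactory.parseMap(overrides.asJava).withFallback(base)
      case None => base
    }
    Configuration(config)
  }
}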

We're in the process of switching to running Play 2.0 distributions created with "play dist." This gives us back sun.java.command since you can start the Netty server straight from the shell and not from sbt.

Be vewy vewy careful


I found out that some config variables, notably the akka ones, are read directly from the application.conf file. If you try to do a %prod.play.akka._____, it will do nothing. There is a pull request from my colleague David Ross to fix that.

Sunday, March 18, 2012

Rap on BASH

I'm in talks with the Supreme Council of Hip-Hop to switch over the default platform API for rap from English to BASH. Here are some initial performance tests:

Jay-Z



find ./ -name '*Shawn*Carter.*' -xattrname ballstoohard | /usr/bin/fine.sh

The NBA: all about those wildcards.

Big L



awk '{ebonicsmap[$0]=$1; print $0; print $1;}' ./ebonics.txt

Soulja Boy would’ve written a single-machine MapReduce.

Kanye



curl -XPOST http://www.thisbitch.com/photos -d@mydick.jpg

thisbitch.com redirects to prostiflutes.xxx. Awesome site if you're looking for pictures of dicks.

Nas



sudo apt-get install one-mic.lib; gcc my_cool_rap.c; chmod 755 a.out; ./a.out

Nas gets close to the metal, all up in that grill.

Method Man



cat fatbagofskunk.dat whiteowlblunts.dat | methodman.sh; assert $LIFTED

I don't think it matters which side that assert is on.

2Pac



while true; do echo "hennessy"; echo "enemies"; echo "remember me"; echo "penitentiary"; done

And they’ll never find the guy who Ctrl-C’ed him.

Ghostface Killah



trap 'Strawberry Kiwi' SIGHIGHSPEED

Props if you get this one. SIGHIGHSPEED is defined in RFC-SuqMadiq.

Notorious BIG



if [ $((RANDOM%2)) -eq 1 ]; then hypnotize.sh -f words.txt -u me; fi

His code only works sometimes.

Slim Thug



awk '{print $0 " ...like a boss"}' ~/TODO.txt

Fix race condition bug ...like a boss.

Saturday, March 17, 2012

I'm back, motherfuckers

It's been over a year since my last post. Just wanted to let you know that I'm not dead or in jail. I got some good shit coming down the pipe, quite possibly today. Regrettable word choice, yes, but fuck it, it's Irish Mardi Gras!

For a quick update on my life situation and a preview of posts to come:

I started working at Klout on Monday. So far, shit is pretty awesome. I've moved to SOMA, which is not quite as fun or as safe or as awesome in general as the Mission, but my place is a block from work so yeah.

On the lifting side, I've been holding pretty steady at 180, which seems right about where I should be. I need to start training my ankles for snowboarding, which I picked up this season.

I made a song with my friend Marco (I rapped, he produced.) You can find it here. This is pretty much the limit of my abilities in terms of performance. I put all I had into recording this song, with only a few punch-ins, and I hope it shows through.

I'm going to be talking more about Scala and the Play Framework, since that's what I use at work. I'm also using that stack on the pet project I've been working on with my homeboy Sam. Release date is whenever Half Life 3 comes out.

In the meantime, enjoy my latest bullshit ventures. I think they truly redefine the meaning of bullshit.
I'll see if I can deliver some more posts today. It'll be hard because I'm also going to be delivering some green booze to my face.

Saturday, January 15, 2011

Clojure on Google App Engine

Deploying Clojure on GAE has typically been a pain in the ass, by which I mean more of a pain in the ass than deploying the Python stuff. That is, until I found this:

AppEngine Magic

This is hands down the best Google App Engine library for Clojure.

Here's a sampling of the features:
1. It integrates with Leiningen.
2. It includes a dev environment AND packages up your code automatically for deployment to GAE.
3. It is very well documented. Compojure devs, you might learn something here.
4. It doesn't force you to use any particular web framework. I was able to successfully hook in a vanilla Java HTTPServlet.
5. It has libraries for pretty much anything worth using in GAE.

I'm usually not impressed with Clojure libraries since (let's be honest here) they're almost always just wrappers around Java libraries with some semantic sugar. Appengine-magic is just so complete and useful that it has overwhelmed my prejudices.

Just look at how easy it is to deploy FROM SCRATCH:
0. I assume you have a [project] account created in Google App Engine
1. lein new [project]
2. cd [project]
3. Edit project.clj to include [appengine-magic "0.3.2"] in :dev-dependencies
4. lein deps
5. lein appengine-new
[optional step: writing code]
6. lein appengine-prepare
7. Assuming your appengine SDK is in the directory above your project, ../appengine-java-sdk-1.4.0/bin/appcfg.sh update resources
8. Go to [project].appspot.com

This is definitely easier than vanilla Java and arguably easier than the Python SDK. I won't go into running development servers and deploying code changes, because the most excellent documentation covers all that.

Don't be a jabroni. If you're using the Java GAE API, you should be using Clojure and this library.


In other news...


To the one or two people who read my blog, sorry for the complete lack of updates over the past few months. I just moved to the Bay Area and things have been extremely hectic at work (iTeleport,) with my side project (BeBannered,) and with life in general.

Yes, I absolutely love it here. I always thought that there was a comparable CAP theorem for people: the TAN (talented, ambitious, nice) theorem. This theorem does not apply here. In addition to the awesome people, the weather is amazing (and I'm saying this in January, the worst month in the area weather-wise,) the food selection kicks ass, the nightlife rocks, and did I mention soup dumplings?

Thursday, November 4, 2010

RC4 for jackasses like me

The RC4 stream cipher is the simplest thing ever, but all the articles I found assumed that you had some knowledge of crypto. So what I'm trying to provide here is a beeline to a rough conceptual understanding of RC4 and stream ciphers in general, without any prerequisite knowledge of cryptography.

Got it? Good. And since you understood...1

The purpose of a cipher is to provide an opaque pipe between two parties so that the two of you can exchange messages without anyone else understanding them, even if third parties intercept those messages.

A cipher needs two components:

  1. An encryption algorithm that translates your message (plaintext/cleartext) into gibberish (ciphertext.)

  2. A decryption algorithm that translates that gibberish into the original message.

Say I'm White Power Mike and I want to send a secret message to Aryan Jim, without allowing it to be intercepted by the horde of foreigners. Being a racist, White Power Mike doesn't do the logic thing very well and uses a decoder ring. This decoder ring maps a letter in the original message to a letter in the coded message. The message is decrypted by doing the inverse.

Unfortunately, the horde includes some number of excellent cryptanalysts. They use frequency analysis (finding how frequently every letter in the ciphertext occurs and comparing it to the average frequency of the letter in various English texts) to figure out what White Power Mike is saying.

Qué lástima, bitches.

There are two big problems with White Power Mike's cipher implementation:

  1. Securely achieving a mutual understanding of the decoding/encoding scheme is difficult.

  2. The plaintext and ciphertext have a static bijective mapping between them, which makes the cipher vulnerable to all sorts of cryptanalysis.

Modern ciphers, like RC4, try to fix these problems.

RC4 is a synchronous stream cipher. You have a pseudo-random keystream that you combine with your message bit by bit (or byte by byte since text messages are in byte-sized chunks) using a symmetric operation (almost always xor.) Your encryption keystream should always be synchronized with the other party's decryption keystream and vice versa. That means if you consume a bit from the keystream to encrypt a message, the receiving party must consume a bit to decrypt the message. If any message is lost, the whole cipher is toast.
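In code, the whole cipher is one xor (a minimal sketch; any symmetric byte-wise op works, xor is just the one everyone uses):

// encryption and decryption are the same call, as long as both
// keystreams stay synchronized
def crypt(msg: Array[Byte], keystream: Iterator[Byte]): Array[Byte] =
  msg.map(b => (b ^ keystream.next()).toByte)

Run crypt over the ciphertext with an identical keystream and the xors cancel out, giving back the plaintext.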

Let's go through an encryption/decryption cycle with a hypothetical synchronous stream cipher.

Message:
Curse those handsome foreign devils!

Bytes:2
0x43 0x75 0x72 0x73 0x65 0x20 0x74 0x68 0x6f 0x73 0x65 0x20 0x68 0x61 0x6e 0x64 0x73 0x6f 0x6d 0x65 0x20 0x66 0x6f 0x72 0x65 0x69 0x67 0x6e 0x20 0x64 0x65 0x76 0x69 0x6c 0x73 0x21

To encrypt 'C', White Power Mike asks his encryption keystream function for a byte. Let's say he gets 0x20. 0x43 xor 0x20 is 0x63.

Then he'll go and encrypt 'u'. Suppose his keystream returns 0x51. 0x75 xor 0x51 is 0x24.

Eventually, White Power Mike will have an encrypted message that he'll pass along to Aryan Jim.

Aryan Jim gets an encrypted message from White Power Mike:

0x63 0x24 .......

He'll ask for a byte from his keystream, which should return 0x20 because it's synchronized. 0x63 xor 0x20 is 0x43, 0x24 xor 0x51 is 0x75, and so on.

Suppose White Power Mike sends another message and Aryan Jim misses it. Their keystreams are no longer synchronized, which means that future messages from White Power Mike will not be decrypted properly.

Because the keystreams must be synchronized, the generation of new bits can't be fully random. You will have to maintain a state array of bytes and generate a keystream from this state. To make things extra crazy I mean secure, you will also want to mutate the state array deterministically every time you make a new key byte. Therefore, your keystream function has two jobs (toy sketch after the list):

  1. To generate keys based on the key state

  2. To mutate the keystream state
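Here's a toy version of those two jobs (emphatically not RC4: the state is one lousy integer instead of a 256-byte array):

def toyKeystream(seed: Int): Iterator[Byte] = {
  var state = seed
  Iterator.continually {
    state = (state * 1103515245 + 12345) & 0x7fffffff // job 2: mutate the state
    (state >>> 16).toByte                             // job 1: key byte from the state
  }
}

Both parties seed it identically and stay synchronized as long as nobody drops a message.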

RC4 is a specification that defines how you initialize the keystream state, how key bytes are generated from the keystream state, and how the keystream state is mutated for every key byte generated. I won't go into the implementation details here because that's what Wikipedia is for.

Hope this helps clear up any confusion about RC4.




1 Would you stop scheming and looking hard? I got a great big bodyguard.
2 Assuming ASCII... obviously, racists can't unicode. Also, here's the Clojure function I used: (fn [text] (map #(Integer/toHexString (int %)) text)). If Python was the new Perl, Clojure is the new Python.

Friday, August 27, 2010

FFT with MapReduce

This uses the Cooley-Tukey radix-2 decimation by time algorithm.

Let's start with the definition of a DFT:

X(k) = Σ_{n=0}^{N-1} x(n)·e^(-2πi·k·n/N), where k = 0, 1, 2, ..., N-1

So imagine a square matrix NxN large with k’s as the rows and n’s as the columns.

By applying the Danielson - Lanczos lemma, we can divide the problem into 2 of size N/2:

X(k) = Σ_{m=0}^{N/2-1} x(2m)·e^(-2πi·k·m/(N/2)) + e^(-2πi·k/N)·Σ_{m=0}^{N/2-1} x(2m+1)·e^(-2πi·k·m/(N/2))

This is essentially the DFT of the even indexed terms plus the product of the DFT of the odd indexed terms and a "twiddle factor":

X(k) = E(k) + W^k·O(k), where the twiddle factor W^k = e^(-2πi·k/N)

We can recursively apply the lemma, eventually getting a balanced binary tree of terms. I'm using a DFT of size 11 as an example here:



You can see that the twiddle factor builds up (denoted by the T's.)

We can already achieve data parallelism at the k granularity. However, we want to achieve data parallelism at the k/partition granularity. In order to achieve this, we need to be able to calculate the effective twiddle factor of each partition and assign each a unique partition id.

Well first, you can see that for a DFT of size N, you will be able to get partitions of sizes 1 and 2 with exactly floor(log2(N)) splits. We'll call this the recursion depth (R). Since this is a balanced binary tree, the number of partitions (i.e. number of leaves) will be 2^R. Splitting by evens and odds recursively is the same as splitting by residue modulo the number of partitions. Therefore, we can calculate a unique partition id: n mod 2^R where R = floor(log2(N)).

The partition id is actually not arbitrary since we will use it to calculate the effective twiddle factor. From the terrible drawing above you can see that the number of twiddle factor applications is the number of right traversals in the path from the root of the tree to your node/partition. Since branching in this graph represents a division by two, a right traversal represents an incremental bit of difference. Therefore, we can calculate the number of twiddle factor applications by counting the number of set bits in the partition id.

Now, since all the calculations necessary are only dependent on the x(n)'s in the partition, n itself, and k, we have data parallelism to the k/partition granularity. This is more than enough granularity to apply map reduce:

For the first map step, we map
f(n, xn):
    for k in range(0, N):
        emit([k, partition(n)], [xn, n])
where partition(n) is n mod 2^R and R = floor(log2(N)).

In the first reduce step, we solve the size 1 and size 2 DFTs using the naive algorithm (just apply the definition.)
f([k,partition],[vals]):
    emit([k,partition], DFT(k,[vals]))

For the second map step, we then apply the twiddle factor vector to the results of the DFT
f([k,partition],dft):
    emit(k, twiddle(partition,k)*dft)
where twiddle(partition,k) = e^(-2πi·k·partition/N), the per-level twiddle factors for every set bit in the partition id folded into one factor.
In the final reduce step, we add the adjusted DFT results together
f(k,[vals]):
    emit(k, sum([vals]))
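If you want to convince yourself the bookkeeping works, here's the whole pipeline collapsed onto one machine (a sketch; assumes N is divisible by 2^R so every partition is the same size):

import scala.math.{Pi, cos, sin}

case class Cx(re: Double, im: Double) {
  def +(o: Cx) = Cx(re + o.re, im + o.im)
  def *(o: Cx) = Cx(re * o.re - im * o.im, re * o.im + im * o.re)
}
// e^(-2*pi*i*num/den)
def w(num: Double, den: Double) = Cx(cos(2 * Pi * num / den), -sin(2 * Pi * num / den))

def fftByPartitions(x: Array[Double], r: Int): Array[Cx] = {
  val n = x.length
  val m = 1 << r                                   // number of partitions, 2^R
  Array.tabulate(n) { k =>
    (0 until m).map { p =>
      // first reduce: naive DFT of partition p's subsequence at frequency k
      val sub = (p until n by m).zipWithIndex
        .map { case (idx, j) => Cx(x(idx), 0) * w(k.toDouble * j, n.toDouble / m) }
        .foldLeft(Cx(0, 0))(_ + _)
      // second map: effective twiddle e^(-2*pi*i*k*p/N), i.e. the per-level
      // twiddles for every set bit of the partition id folded together
      w(k.toDouble * p, n.toDouble) * sub
    }.foldLeft(Cx(0, 0))(_ + _)                    // final reduce: sum over partitions
  }
}

For any N divisible by 2^R, this matches the naive DFT of the same input term for term.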

Thursday, August 26, 2010

CouchDB BAMF Installation

Well I promised that I would teach you how to install CouchDB "with nothing but a magnetized needle and a steady hand." So let's do that.

Most people can install CouchDB with apt-get install. There are occasions when that's not the case. In fact, I'm assuming the worst case scenario:


  1. You have no package management system.

  2. You don't have root access

  3. You have really nothing except make, gmake, tar, wget, and a generous helping of badass


SpiderMonkey 1.7.0



wget http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz
tar xvzf js-1.7.0.tar.gz
cd js/src/
make -f Makefile.ref # (add BUILD_OPT=1 for non-debug build?)
JS_DIST=[where you want to install spidermonkey] make -f Makefile.ref export


ICU 4.2.1



wget http://download.icu-project.org/files/icu4c/4.2.1/icu4c-4_2_1-src.tgz
tar xvzf icu4c-4_2_1-src.tgz
cd icu/source
./runConfigureICU Linux --prefix=[where you want to install icu]
gmake install
cd [where you installed icu]


Curl 7.21.0


CouchDB requires a fairly recent version of curl (>=7.18.0), which you might not have. You can check with curl --version

wget http://curl.haxx.se/download/curl-7.21.0.zip
unzip curl-7.21.0.zip
cd curl-7.21.0
./configure --prefix=[where you want to install curl]
make && make install
cd [where you installed curl]


Erlang R13B02



wget http://www.erlang.org/download/otp_src_R13B02.tar.gz
tar xzvf otp_src_R13B02.tar.gz
cd otp_src_R13B02
./configure --prefix=[where you want to install erlang]
make && make install
cd [where you installed erlang]


Fix $PATH


You need a directory on your $PATH that you can control; /home/[user]/bin works fine. Then put symbolic links (ln -s) to the ICU, curl, and Erlang binaries in it. You can also just copy the binaries from those directories into your $PATH folder. This is not recommended for obvious reasons, but it does work.

CouchDB!



wget http://www.takeyellow.com/apachemirror/couchdb/0.11.0/apache-couchdb-0.11.0.tar.gz
tar xzvf apache-couchdb-0.11.0.tar.gz
cd apache-couchdb-0.11.0
./configure --with-js-lib=[your spidermonkey install]/lib64 --with-js-include=[your spidermonkey install]/include \
    --with-erlang=[your erlang install]/lib/erlang/usr/include/ --prefix=[where you want to install couchdb]
make && make install

Tuesday, August 17, 2010

Clojure JIT

Finally got off my ass and put my JIT gen-class implementation on github.


I need to fix things up a bit, but I'm having trouble getting the JIT compiled classes to play well with Hadoop, which was the whole point of writing this library anyways. So I'm going to solve that problem first.

Anyways, enjoy!

Saturday, August 14, 2010

ClojureCLR

Installation (Getting to the REPL)



First off, you need Visual Studio. I’m doing this with 2008, but I’m pretty sure it would also work on 2010 and the relevant Express versions. I have doubts as to how well it would work with 2005 and am clearly not going to test it.

Download ClojureCLR 1.1.0 from the github page.

Follow the instructions here.


If you have trouble with that, use these instructions. They worked for me:

  1. Download ClojureCLR and unzip it somewhere

  2. Download the Dynamic Language Runtime: http://dlr.codeplex.com/

  3. Copy the unzipped folder into your Clojure parent directory if you want the Clojure solution to work out of the box. Rename it DLR_Main. Just to make this step clear, if your Clojure solution file is C:\dev\gukjoon-clojure\Clojure\ClojureCLR.sln, then you need to have DLR in C:\Dev\DLR_Main. The subfolders for that should be C:\Dev\DLR_Main\src and C:\Dev\DLR_Main\Samples.

  4. Open the Clojure solution in the Clojure folder.

  5. For now, unload Clojure.Test from the solution.

  6. I had trouble with the post-build in Clojure.Main so I just took that out. If you do this, you will also have to copy the “clojure” folder from Clojure.Source into the output directory (“bin/Debug”) of Clojure.Main.

  7. Run Clojure!


Making a package to build external solutions


Installation is kind of a bitch with ClojureCLR, especially compared to Java Clojure. To help smooth over acceptance, I created a Clojure package that had all the Clojure binaries in it and two batch scripts that would use nant to build from target solutions and start a REPL with the target solutions loaded.

First copy your bin/Debug directory into a folder somewhere. Then, get nant and put that in the same directory.

Bootstrap.build
You’ll have to customize this nant build file for your solutions. You set a base directory for your solution (project.dir) and then can set up targets for each solution you have. I’m not very good with Nant so if you have recommendations for improving this buildfile, definitely let me know.

BuildClojure.bat
BuildClojure.bat runs the nant build file. It takes in an argument that it passes along to the nant build as the target. So depending on how you set up the nant build file, you could build specific solutions or all of them.

RunClojure.bat
RunClojure.bat is kind of worthless. It’s just one line: .\Clojure.Main.exe .\Startup.clj. I mostly have it because the Windows command prompt is a usability nightmare. I would much rather click to run Clojure.

Startup.clj
RunClojure.bat starts clojure with a startup script that loads all the assemblies you need. .NET classloading doesn’t work quite as well as Java so here I am loading the assemblies manually. You will want to change “Adc.*.dll” to something else, unless you also happen to work at AOL Advertising.

Generics


A big problem I came across was the lack of generics support in vanilla ClojureCLR. You actually need to pass in parameter types in .NET for generic types and methods and our legacy code was littered with generic types and methods. I didn’t really even need to create generics, just use them.

It’s fairly easy to hack your way around this using reflection. However, I would advise against doing it the way I did:

Clojure/CljCompiler/GenGeneric.cs
Clojure/CljCompiler/GenGenericMethod.cs
core-clr.clj additions

I created two new classes to do the actual reflection, when you can totally do the whole thing in Clojure with interop. You probably should also put your clojure code in a separate file from the other core-clr functions. This is left as an exercise to the reader.

If you choose to use this code, just add it into your solution in the appropriate places and recompile Clojure.Main and Clojure.Source (make sure to copy over core-clr.clj.dll from the clojure directory in the Clojure.Source build.) gen-generic takes in a generic type, a vector of types and then arguments to the constructor. It will return an instance of the type. gen-generic-method takes in your callsite (either a type or an object,) the method as a String, a vector of types and arguments to the method.

Since you still need to pass in a generic type to the gen-generic function, you will have to use Type/getType to get the generic type from its string representation. In .NET the number of type parameters is reflected by a ` and then the number. For example, a list would be “System.Collections.Generic.List`1” and a dictionary would be “System.Collections.Generic.Dictionary`2.”

How I used it:


The problem I found with debugging any sort of object oriented code is that your control tends to be limited to an external interface. While the Visual Studio debugger is very helpful in narrowing down where your code is broken, the stubby fingers that C# (and Java, too) force on you make pinpointing irregular behavior harder than it needs to be. I believe that as a developer, no component of your system should be a black box if you don’t want it to be.

My debugging usually works this way:
1) Duplicate the bug in my dev environment, as described by QA
2) Walk through the callstack to pinpoint the exact point where irregular behavior occurs
3) Consistently duplicate this irregular behavior
4) Figure out how to fix it.

Any seasoned developer will tell you that 1-3 are far more difficult and time consuming than 4. Fixing shit is easy. Figuring out what to fix is hard. Furthermore, straight up exceptions tend to be easy to pinpoint using the Visual Studio debugger. The hard bugs to fix are ones that cause bad behavior without any overt exceptions.

ClojureCLR is a tool for helping with 2-3 (sorry, you're still shit out of luck for non-deterministic bugs that are impossible to duplicate) in finding "soft" bugs. You want to stand up individual components, test these components with inputs and see if you get the expected outputs. It's far easier to do this on the fly in Clojure than writing a separate program, compiling it, and running it.


Anyways, long story short: don’t use ClojureCLR unless you have to, but it is useful if you have to deal with .NET.

Monday, August 24, 2009

Solution!

I made some adjustments to the CLSQL low-level APIs to do what I want to do. I only created the database-result-fields method for MySQL and Oracle. All other SQL databases can go screw themselves.... a.k.a. they are handled by this generic function:

db-interface.lisp

(defgeneric database-result-fields (result-set database)
  (:method (result-set (database t))
    (declare (ignore result-set))
    (signal-no-database-error database))
  (:method (result-set (database database))
    (declare (ignore result-set))
    (warn "database-result-fields not implemented for database type ~A."
          (database-type database)))
  (:documentation "Returns list of fieldnames and types"))


Here's the MySQL adjustment:

mysql-sql.lisp


(defmethod database-result-fields (result-set (database mysql-database))
  (let ((fields '())
        (field-vec (mysql-fetch-fields (mysql-result-set-res-ptr result-set))))
    (dotimes (i (mysql-result-set-num-fields result-set))
      (declare (fixnum i))
      (let* ((field (uffi:deref-array field-vec '(:array mysql-field) i))
             (name (uffi:convert-from-foreign-string
                    (uffi:get-slot-value field 'mysql-field 'mysql::name)))
             (type (uffi:get-slot-value field 'mysql-field 'type))
             (type2 (uffi:get-slot-value field 'mysql-field 'mysql::flags)))
        (push (list (read-from-string name) (convert-mysql-type type type2)) fields)))
    (nreverse fields)))

(defun convert-mysql-type (mysqltype mysqlflags)
  (let ((unsigned (plusp (logand mysqlflags 32))))
    (case mysqltype
      ((#.mysql-field-types#tiny
        #.mysql-field-types#short
        #.mysql-field-types#int24
        #.mysql-field-types#long
        #.mysql-field-types#longlong)
       'integer)
      ((#.mysql-field-types#double
        #.mysql-field-types#float
        #.mysql-field-types#decimal)
       'float)
      (otherwise
       'string))))


Surprisingly, the MySQL side is the weirder one. There's no actual LISP result set object; it's more just a struct that holds all sorts of foreign-ass C bullshit (i.e. the MySQL API types.) The method is pretty simple. It takes the MySQL result set, calls the foreign function mysql-fetch-fields (which yields a C array of mysql-fields,) and then goes through and processes each one of those. The convert function matches the raw type codes (and flags) to LISP data types. Somewhat messy, but not that bad.


Oracle on the other hand....

oracle-sql.lisp

(defmethod database-result-fields ((cursor oracle-result-set) (database oracle-database))
  (loop for cd across (qc-cds cursor)
        collect (list
                 (read-from-string (cd-name cd))
                 (convert-oracle-type (cd-oci-data-type cd) (cd-sizeof cd)))))


(defun convert-oracle-type (oracletype sizeof)
  (case oracletype
    ((2   ;; number
      3   ;; integer
      68) ;; unsigned int
     'integer)
    ((4   ;; float
      21  ;; native float
      22) ;; native double
     'float)
    ((5   ;; null-terminated string
      97) ;; charz
     `(string ,(- sizeof 1)))
    (otherwise
     `(string ,sizeof))))



...is pretty simple and elegant. Whoever built the OCI portion of CLSQL built the data structures so that they would be further removed from the UFFI crap. I like it. I also really need to learn how to use LISP loops properly.

In addition to the database-result-fields methods, I made some changes to the create-table function to allow it to handle the base clsql:sql type. I was getting pretty pissed with the constant "fell through etypecase" errors.

expressions.lisp

(defmethod output-sql ((stmt sql-create-table) database)
  (flet ((output-column (column-spec)
           (destructuring-bind (name type &optional db-type &rest constraints)
               column-spec
             (let ((type (listify type)))
               (output-sql name database)
               (write-char #\Space *sql-stream*)
               (write-string
                (if (stringp db-type) db-type ; override definition
                    (database-get-type-specifier (car type) (cdr type) database
                                                 (database-underlying-type database)))
                *sql-stream*)
               (let ((constraints (database-constraint-statement
                                   (if (and db-type (symbolp db-type))
                                       (cons db-type constraints)
                                       constraints)
                                   database)))
                 (when constraints
                   (write-string " " *sql-stream*)
                   (write-string constraints *sql-stream*)))))))
    (with-slots (name columns modifiers transactions)
        stmt
      (write-string "CREATE TABLE " *sql-stream*)
      (etypecase name
        (string (format *sql-stream* "~s" (sql-escape name)))
        (symbol (write-string (sql-escape name) *sql-stream*))
        (sql-ident (output-sql name database))
        (sql (output-sql name database)))
      (write-string " (" *sql-stream*)
      (do ((column columns (cdr column)))
          ((null (cdr column))
           (output-column (car column)))
        (output-column (car column))
        (write-string ", " *sql-stream*))
      (when modifiers
        (do ((modifier (listify modifiers) (cdr modifier)))
            ((null modifier))
          (write-string ", " *sql-stream*)
          (write-string (car modifier) *sql-stream*)))
      (write-char #\) *sql-stream*)
      (when (and (eq :mysql (database-underlying-type database))
                 transactions
                 (db-type-transaction-capable? :mysql database))
        (write-string " Type=InnoDB" *sql-stream*))))
  t)



Anyways, I think there were some valuable lessons learned. I wouldn't say that I know clsql-sys like the back of my hand, but I am significantly more familiar with it.

Happy hacking!

Thursday, August 20, 2009

CLSQL......blah?

So I've been doing some stuff with CLSQL at work and while I appreciate the abstraction of the SQL functions, I would really also like access to more low-level stuff to build my own abstractions, esp. when the whole point of using LISP over other languages is to build your own abstractions. I had to go digging in the clsql-sys package (which is basically not documented at all) and even that didn't have what I was looking for (namely getting field names/types from a result set object.) I'm probably just not looking in the right places, but this is frustrating to say the least.

This is where LISP suffers. Poor documentation resulting from a lack of users. If you can't figure out how to do something very specific in PHP or Javascript in like an hour or two of Google searching, you should probably just give up on life. All I want to do is get some pretty simple information from a SQL result and I don't want to rebuild all the SQL APIs in CLSQL to accommodate this. Grrrr. FML.

On the bright side, <3 Caroline Polacheck

Monday, July 27, 2009

I have issues

Wired magazine is awesome, significantly more so now that I read it. I described it once as Maxim for nerds, but it's more like Maxim is Wired for sexually charged toddlers (Lolita 2: Lolitita.) Wired has consistently thought-provoking articles (New Socialism and Formula that Killed Wall Street come to mind) and the writers don't dumb technical shit down more than they have to. My first and probably only print magazine subscription.

Back to the issues thing. I've been getting dragged into GUI programming, but not kicking and screaming as I had expected. There's a certain appeal to the subjectivity of interface requirements (build what the user wants, not what the user says he/she wants) and I'm not going to lie... I really like pretty things. Also... Javascript programming is fun as hell. It's a great language (first class functions ftw) and your development cycles are like minutes long (write, reload, check if shit works.) Coding seems to become a chore whenever you move away from that iterative paradigm. Writing huge, complex systems in one sitting is definitely a recipe for disaster.

So I've been settling on two approaches to interface design recently. One is... present the user with a playground. With lots of visual and/or tactile feedback, the user is able to learn quickly and the interface becomes incredibly intuitive over time. There's an investment for the user, but it pays off. The interface here needs to be minimally stateful, i.e. the actions the user can possibly perform need to stay fairly constant.

The second is... present the user with as few actions as possible. I try to use a lot of fade-away menus here. This is where you want to funnel the user down a certain process; you want to assert some level of control over the user's actions so that they don't fuck shit up or generally do something you don't want them to do. An example of this would be a landing page for an internet advertisement: you want to prevent the user from surfing off into the sunset... you want the user to BUY YOUR SHIT.

The trouble is you often need to make a compromise between these two approaches. A simple design example would be, say, a process-tracking application. You want to enforce a reasonably strict process with data input. Data needs to be accurate. But the reporting needs to be more or less a sandbox. The process of converting data to information requires user freedom.

There are probably two areas of computer programming that require this level of balance and dare I say Zen. Interface design is one. The other is programming language design. These diametrically opposed aspects of HCI illustrate the two types of computer users, their differences and similarities. The similarities, in my opinion, are more interesting.

In both, there's this yin-yang conflict with user freedom vs. system integrity. With memory management: "C programmers think memory management is too important to be left to the computer. Lisp programmers think memory management is too important to be left to the user." Computers are excellent at dealing with well-defined rules, but I think it's less than clear if memory management can be defined by a set of simple or even complex rules. I think it's less than clear in both interfaces and programming languages if users want control or simplicity. Ideally they could have both, but having your cake and eating it too is improbable, even in Schroedinger's box. C and C++ have diminished significantly in popularity, esp. in mainstream programming. I understand that C++ still plays a huge role in game development and some server-side programming, but let's be honest here: Java, PHP, and Python won with the Joe Schmoes.

You also have the trend towards sandbox environments and "tactile" feedback. As I mentioned before, iterative development is best because humans are built to process feedback. Here we have this trend towards languages that are simpler and "safer" to screw around with. At the extreme end you have LISP, Python, Ruby that have REPL environments. But you also have PHP and Java, which are safe enough to change, run, and debug without worrying about a dangling pointer that will guide you to a blue screen of death. Basically... garbage collection ftw.

Sorry for rambling here... but I've been trying to write more. It's a good habit. Also, take everything I say with a grain of salt. I'm a n00b and I actually kind of like C++. Wrangling with memory issues builds a lot of character, something these limp-wristed (well... more limp-wristed than the typical computer programmer) Python hacking kids these days really need. :-p

Monday, June 29, 2009

today was a good day

I made an interesting foray into desktop development today. So what had happened was... our COO, one of the lead developers of AIM back in the day, says that he can build a desktop widget for this alerts system that I've been working on with Adam and my manager Carrie. In a day. Clearly, I'm not going to take that kind of smack talk, even if it's coming from a programming stud. I did some research into GUI development and hacked out this widget in a day. I was going to challenge the dude to a race, but I'm assuming he has better shit to do than participate in code dick measuring contests with some random analyst. Moral of the story though... don't fucking call me out like that, especially when it comes to programming and drinking contests.

C++ is not as bad as everyone makes it out to be, but largely because Qt is so freaking AWESOME. The signals and slots system it uses is very intuitive and it makes hacking the Windows desktop total cake. It also has extensions to basically every data type in C++. Instead of string you use QString. Instead of lists you have QLists and its bastard offspring. It's almost like writing Java except you have the C++ level of control over memory. And finally it has a surprising number of modules for... well... anything you want to do. SQL, HTTP, OpenGL, XML etc. etc. Very sweet. I'll definitely be trying some more things in Qt and C++. Yeah I know that there's Python and LISP integration. I'm finally following Saffer's advice to learn a programming language that doesn't hold your hand with memory management.

Also, I got my iPhone 3GS today. After a week of complete isolation, this is quite the relief. What probably didn't help was deactivating my Facebook account. Terrible idea.

Sunday, June 7, 2009

SBCL core image dump

HOLY FUCKING SHIT.
How did I not get the memo on this?

A brief explanation of my sicedness:
When you exit out of SBCL (Steel Bank Common LISP,) you have the option of dumping an image of the process to disk that you can restart at any time. By restart I really mean start because you can run an arbitrary number of copies of the image.

So uh... who cares, right? Well loading a core image is reasonably fast. Faster than rebuilding your environment for sure. If you use LISP for CGI shit, you probably want to be loading a core image instead of loading/compiling all your shit from scratch. And using LISP for web development generally requires a good amount of non-native shit. CLSQL comes to mind. So yeah... images can be used as templates.

Presumably... you can also keep an image around for each user. So instead of the query/response, stop and go computation that's typical of web development, you can have an actual persistent environment. This can get expensive since core images are not small. The one I use is 29 megs and I'm sure they can get bigger. Maybe not the most practical but damn it makes me wet just thinking about it.

Anyways... LISP rules

Tuesday, June 2, 2009

Yay

My first programmingpraxis submission. Check out my 1337 use of mapping. But not reducing. *sigh* not reducing.

vowellist = "aeiou"  # assumed definition; the post never shows it

def e2pword(engword):
    if engword[0] in vowellist:
        return engword + "-way"
    else:
        return engword[1:] + "-" + engword[0] + "ay"

def e2p(engstr):
    return " ".join([e2pword(word) for word in engstr.split(' ')])

def p2eword(pigword):
    pigsplit = pigword.split('-')
    if pigsplit[1][0] == 'w':
        return pigsplit[0]
    else:
        return pigsplit[1][0] + pigsplit[0]

def p2e(pigstr):
    return " ".join([p2eword(word) for word in pigstr.split(' ')])

Saturday, March 14, 2009

(map 'list (lambda (n) (solve n)) 99-problems)

I got 99 problems and a bitch LISP ain't one!

I had a mild headbutting (no jerk, of course) with Saffer yesterday about me being spoiled by LISP. What bothered me wasn't the fact that I'm spoiled by LISP (I know I am... first-class everything FTW) but the fact that I couldn't defend LISP properly since I don't fully understand how some of its sweetest features work...namely macros, the type system, and the exception system.

So I'm going to correct that.

Also on the plate... I'm thinking of learning C. "It'll make a man out of you," Saffer said. Yeah... but I've found hot sex and Hurricane malt liquor to be a more pleasurable solution to that problem. And a language without an interpreter environment? *shudder*

Update (7:12PM):
Adam showed me this site: http://projecteuler.net/

Pretty cool.

Tuesday, July 8, 2008

Legos

"Have you ever noticed that Lego plays a far more important role in the lives of computer people than in the general population? To a one, computer technicians spent huge portions of their youth heavily steeped in Lego and its highly focused, solitude-promoting culture. Lego was their common denominator toy.

Now, I think it is safe to say that Lego is a potent three-dimensional modeling tool and a language in itself. And prolonged exposure to any language, either visual or verbal, undoubtedly alters the way a child perceives its universe. Examine the toy briefly...

First, Lego is ontologically not unlike computers. This is to say that a computer by itself is, well...nothing. Computers only become something when given a specific application. Ditto Lego. To use an Excel spreadsheet or to build a racing car - this is why we have computers and Lego. A PC or a Lego brick by itself is inert and pointless: a doorstop; litter. Made of acrylonitrile butadiene styrene (ABS) plastic, Lego's discrete modular bricks are indestructible and fully intended to be nothing except themselves.

Second, Lego is 'binary' - a yes/no structure; that is to say, the little nubblies atop any given Lego block are either connected to another unit of Lego or they are not. Analog relationships do not exist...

Third, Lego anticipates a future of pixelated ideas. It is digital. The charm and fun of Lego derives from reducing the organic to the modular: a zebra built of little cubes; Cape Cod houses digitized through the Hard Copy TV lens that pixelates the victim's face into little squares of color."
-Bug
Microserfs