Thursday, June 6, 2013

Map and Reduce - Conceptual Differences Between Clojure and Hadoop

In this article I will explain the differences and the similarities in the concepts of map and reduce between two very popular platforms. This is not a comparison of Clojure and Hadoop; the two are largely incomparable, since one is a programming language and the other a data processing framework. It is also not a performance benchmark, so if you are looking for tables with statistics on arbitrarily chosen operations, you won't find them here. This is merely an article on what the words map and reduce mean within the scope of these two technologies.

First, the similarities:

In both technologies, input data is typically divided into a large number of smaller data units; we can simply call them records for the purpose of this article. The map operation is responsible for transforming each record individually. It only ever needs to know one item at a time, which is a very powerful assumption because it makes parallelization very easy. Reduce, on the other hand, works by applying the transformation to records against each other; in other words, it derives information from multiple items at a time. This makes parallelization a bit more difficult, but it may still be possible depending on the way the data is organized. So, roughly speaking, in both cases map performs an element-wise transformation on the input sequence and reduce aggregates it.

Now, the differences:

Hadoop map seems to be a more general case of Clojure map, specifically with regard to argument cardinality.

In Clojure, map applied to a single collection always produces a sequence of the same length as the one it received. For example, we can increase every number in the input sequence by exactly ten:



The same can be written in Hadoop as this:



It becomes obvious just by looking at the code above that the number of output records for each input record depends solely on the number of times the collect method has been called. For example, we can decide to completely ignore input records with values in some specific range. We can achieve this by making a small change in the code:



This is something our Clojure map function cannot do. Admittedly it can transform unwanted items to nil and leave it to the caller function to remove them, but that is not exactly the same thing. The function we need here is filter:



Following the same train of thought, we can notice that the number of output items can also be larger:
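As a rough, self-contained sketch in plain Java (not real Hadoop code), the three cases discussed so far can be put side by side. The `RecordMapper` interface and its `collect` callback are hypothetical stand-ins for a Hadoop mapper and its output collector, and the skipped range 3 to 5 is chosen arbitrarily for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a Hadoop-style mapper: it receives one record and
// a "collect" callback, and may call that callback any number of times.
class CollectSketch {

    interface RecordMapper {
        void map(int record, Consumer<Integer> collect);
    }

    // Exactly one output per input: add ten.
    static final RecordMapper ADD_TEN =
            (record, collect) -> collect.accept(record + 10);

    // Zero or one output per input: ignore records in the range 3..5.
    static final RecordMapper SKIP_RANGE = (record, collect) -> {
        if (record < 3 || record > 5) {
            collect.accept(record + 10);
        }
    };

    // Two outputs per input: the record itself and the record plus ten.
    static final RecordMapper EMIT_TWICE = (record, collect) -> {
        collect.accept(record);
        collect.accept(record + 10);
    };

    // Feeds every record to the mapper and gathers whatever it collects.
    static List<Integer> run(RecordMapper mapper, List<Integer> input) {
        List<Integer> output = new ArrayList<>();
        for (int record : input) {
            mapper.map(record, output::add);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(run(ADD_TEN, input));    // six outputs
        System.out.println(run(SKIP_RANGE, input)); // three outputs
        System.out.println(run(EMIT_TWICE, input)); // twelve outputs
    }
}
```

The output size is decided entirely by how often each mapper calls `collect`, which is the point the text makes about Hadoop's map.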



All by itself, Clojure map cannot produce more output items than it has input items, but we can use some trickery to achieve this. Instead of producing multiple separate items, we will produce sequences and then flatten the final sequence of sequences:



Or, a bit more elegantly:



That was about Map - the differences are mainly with regard to the input and output argument cardinality. Now we will focus on the Reduce part.
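The same three cardinalities also show up as separate operations in Java's Stream API, which may serve as a loose analogue for readers who know Java better than Clojure. This is only a sketch; the class and method names are made up for illustration, and the correspondence to Clojure's map, filter and flattening is approximate:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: one stream operation per output cardinality.
class CardinalitySketch {

    // One output per input, like a plain map.
    static List<Integer> addTen(List<Integer> xs) {
        return xs.stream()
                 .map(x -> x + 10)
                 .collect(Collectors.toList());
    }

    // Zero or one output per input, like filter.
    static List<Integer> dropRange(List<Integer> xs, int low, int high) {
        return xs.stream()
                 .filter(x -> x < low || x > high)
                 .collect(Collectors.toList());
    }

    // Several outputs per input: map to small streams, then flatten them.
    static List<Integer> withPlusTen(List<Integer> xs) {
        return xs.stream()
                 .flatMap(x -> Stream.of(x, x + 10))
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(addTen(xs));
        System.out.println(dropRange(xs, 3, 5));
        System.out.println(withPlusTen(xs));
    }
}
```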

Clojure reduce aggregates the result by applying the given function to the first two items, then applying the same function to that result and the next item, and so on, until it runs out of input items. In the next example we will find the minimum element in a sequence:



What happens here:

  1. Function min is applied on 4 and 2, the result is 2
  2. Now min is applied to the previous result which is 2 and the next element which is 1, the result is 1.
  3. min(1,5) = 1
  4. min(1,3) = 1
  5. We have no more elements, so the result is 1
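The steps above can be replayed with a small left fold written in Java. This is a sketch rather than Clojure's actual implementation: `ReduceTrace` is a made-up name, the input 4, 2, 1, 5, 3 is taken from the walkthrough, and an empty input is simply rejected, which is a simplification:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// A sequential left fold that records every intermediate application.
class ReduceTrace {

    static <T> T reduce(BinaryOperator<T> f, List<T> items, List<String> trace) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("empty input");
        }
        T acc = items.get(0); // start with the first item
        for (T item : items.subList(1, items.size())) {
            T next = f.apply(acc, item); // combine the result so far with the next item
            trace.add("f(" + acc + ", " + item + ") = " + next);
            acc = next;
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        Integer result = reduce(Integer::min, List.of(4, 2, 1, 5, 3), trace);
        trace.forEach(System.out::println);
        System.out.println("result = " + result);
    }
}
```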

On the other hand, an equivalent Hadoop reduce method would look something like this:



Again, it seems that Hadoop reduce is a bit more general case than its Clojure equivalent, since the order in which the input items will be processed isn't fixed, even if the order of the input items is.
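If the processing order is not fixed, the reducing function has to give the same answer whatever order it sees the items in. That holds for associative and commutative operations such as min. The sketch below checks this in plain Java, using a parallel stream and shuffled copies of the input as a stand-in for a framework that picks its own order; it does not use Hadoop, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Checks that reducing with min does not depend on processing order.
class OrderIndependentReduce {

    // A parallel stream is free to group partial results as it sees fit.
    static int minOf(List<Integer> values) {
        return values.parallelStream()
                     .reduce(Integer::min)
                     .orElseThrow(() -> new IllegalArgumentException("empty input"));
    }

    // Shuffles the input several times and compares each result with the first.
    static boolean sameForShuffles(List<Integer> values, int rounds, long seed) {
        Random random = new Random(seed);
        int expected = minOf(values);
        List<Integer> copy = new ArrayList<>(values);
        for (int i = 0; i < rounds; i++) {
            Collections.shuffle(copy, random);
            if (minOf(copy) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameForShuffles(List.of(4, 2, 1, 5, 3), 20, 42L));
    }
}
```

An operation such as subtraction would not pass this check, which is why it would be a poor fit for a reduce step whose input order is not guaranteed.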

However, if we focus on the part that matters, which is the way we think about our programs, all the differences in terminology between the two technologies fade. Even if the same keywords do not map exactly one-to-one, the simple and most important similarity remains: Map processes records individually and Reduce combines them. Both are necessary steps in processing huge piles of data, even more so since this way of dividing the operations also allows the process to be parallelized to some extent.

Sunday, January 8, 2012

Clojure Extension for Chrome

Do you like Clojure? Do you often read blogs and articles about it? I know I do. And every time I see an interesting Clojure snippet I rush to open my REPL to try it out. To do that I have to execute the following steps:
1. Press Logo key to open my console.
2. cd to any project
3. lein repl
4. Copy the code snippet from the webpage and paste it in the REPL.
5. Wow, cool!
Your own steps to run the code may be different and probably more efficient. However, in any case, they probably involve opening a REPL and copy pasting code from the page into it.

What if you do this way too often to repeat all the steps? What if you are just plain lazy? All of us Clojurians know that lazy is good, so I decided to jump on the problem and try to make it easier. What if, I asked myself, I could just select a piece of Clojure code right in my browser, right click it and choose something like Eval to get my result back? For that functionality I would need to write a browser plugin, specifically a Chrome extension.

As it turned out, writing Chrome extensions is a pretty easy task. It is not that much different from writing a regular web application - write some HTML files, put some JavaScript in them to do the work and add an Ajax call to communicate with the backend service. The only significant difference is in the various security rules which the extension has to satisfy. Anyway, the official documentation is pretty good and the examples are even better.
As for the backend service, I used the source of the popular online REPL tryclojure written by Anthony Grimes and modified it slightly to replace the Noir framework with an old version of Compojure which I had already used in some of my older apps. Then I unmodified it because Noir is way better.

Anyway, after a few afternoons of hacking, the results of which can be seen on goranjovic/chromeclojure on GitHub if anyone cares, I published the app. The Clojure backend is hosted on Heroku on a free plan and the Chrome extension itself is where all Chrome extensions live - on Google's Chrome Web Store. Click on the link, install it and try it out on this very blog post. I included some Clojure snippets below just for that purpose.

A simple snippet that calculates the meaning of life:


Go on, select the snippet, right click it and choose Eval as Clojure. If everything is installed and working properly you should see a Chrome notification containing the result 42. If you, for whatever reason, don't really like notifications and find them annoying, you can change the extension to use plain old alerts to show the result. To do this:
1. Open Wrench Menu > Tools > Extensions
2. Find Clojure Extension
3. Click the Options link
4. Choose between offered response methods and click Save. Currently it is just notification and alert, but more options may come in future releases.

A somewhat less simple snippet which converts the number 12345 into a sequence of its digits in base 12:



Again, use the extension to eval the code. It should return (7 1 8 9).
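For readers who want to check the arithmetic without the extension, the conversion can be sketched in Java by repeatedly dividing by the base and collecting remainders. This is not the Clojure snippet from the post; `DigitsSketch` and `digits` are names invented for this illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Splits a non-negative number into its digits in the given base.
class DigitsSketch {

    static List<Integer> digits(int number, int base) {
        if (number < 0 || base < 2) {
            throw new IllegalArgumentException("need number >= 0 and base >= 2");
        }
        Deque<Integer> stack = new ArrayDeque<>();
        do {
            stack.push(number % base); // least significant digit comes out first
            number /= base;
        } while (number > 0);
        // push() adds at the front, so iteration starts with the most significant digit
        return new ArrayList<>(stack);
    }

    public static void main(String[] args) {
        System.out.println(digits(12345, 12)); // prints [7, 1, 8, 9]
    }
}
```

As a cross-check, 7 * 1728 + 1 * 144 + 8 * 12 + 9 = 12345.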

It is also possible to evaluate several forms sequentially, which is useful if you have one or several def-ed functions and one call which evaluates what is needed. This seems useful for the previous example. Select both Clojure forms and evaluate them as before.



Of course, you don't have to evaluate them both at once. You could have evaluated the function definition first and then the function call in a different request. Each user has a session and can use it for defs within the time limit.

Naturally, not everything can be evaluated this way. For example if you try to evaluate the following snippet you will get an error, since require isn't allowed for security reasons and the jar is probably unavailable to the backend service.



So, unfortunately, it is not possible to use this feature to evaluate snippets like the ones on the Noir website. But then again, sometimes it's better not to be lazy and to fire up Leiningen.

I hope you like my extension and find it useful in your browsing. If you find any bugs, feel free to report them on the project's Issues page. If you have any ideas on how to extend the plugin's functionality, feel free to comment, or fork my repo and add them yourself.

Happy hacking!

Saturday, July 30, 2011

Asimov's Laws of Robotics and Web 2.0 - Part 2

In my previous post I mentioned a possible use of Asimov's laws of robotics in modern web applications (and other software). In this post I will elaborate on how specifically that could be done, and on some of the challenges involved that some people might not be willing to take on.

First, let's analyze the laws one by one:

1. A software robot may not injure a human being or, through inaction, allow a human being to come to harm.

There are some ways a web application can 'injure' a human being. For example, exposing someone's private data to the wrong people can cause all sorts of trouble, ranging all the way from being dumped because of some photos from a party to having your house robbed while you are on vacation by burglars who had your address and read your status message 'We are on vacation!'. Other ways of hurting people or allowing them to get hurt may include child pornography, cyber stalking (possibly in combination with real stalking), allowing them to get robbed, etc.

2. A software robot must obey any orders within its scope given to it by authorized human beings, except where such orders would conflict with the First Law.

This one is pretty straightforward, since it encompasses a requirement common to all software - authentication and authorization. I would also put user friendliness and overall user experience in this category because it allows an easier way for humans to communicate their orders to software. Maybe it's a stretch, but even performance metrics like response time could fit nicely here.

3. A software robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

As I said in my previous post, this one should be easiest to implement, since it basically implies that software should be stable enough not to crash or delete its own data.

As you may have noticed, software requirements imposed by this interpretation of Asimov's laws are not in any way new. Respect of privacy, user friendliness, fast response time and overall stability have always been important aspects of software design. What is important about Laws of Robotics isn't in any one particular law - it's in their order of importance, and that is the main point of this article.

In order to be Asimov compliant, software not only has to meet all these requirements, it also has to prioritize them in the correct order.

Example 1: Let's say that some hacker tries to steal personal data from an online service. Our software detects the unauthorized intrusion, but has no way to stop it other than to shut down the server or otherwise cause downtime. Now, 'Asimovian' software has to protect its own existence in order to comply with the Third law. However, the First law applies too, since human beings may be harmed if their personal information leaks to shady people. The First law is more important than the Third law, so the software has to shut down and cause downtime rather than expose its users to risk.

The problems with such an approach are quite evident:
1. Downtime = loss of money.
2. If a company shuts down their server to prevent information theft, they are practically admitting they have been hacked. Much better to just pretend it didn't happen and that no one would find out anyway.

Guess what, people generally do find out when their credit card gets drained. And they hate it much, much more than when they see a 404 page!

Saturday, July 2, 2011

Asimov's Laws of Robotics and Web 2.0 - Part 1

As any science fiction geek knows, Isaac Asimov wrote a series of short stories and novels about robots. He was more or less the first author to approach the topic seriously and he even coined the word Robotics, nowadays used for both the scientific discipline and the industry. The best-known aspect of his robots is the fact that they always have to respect the famous Three Laws of Robotics:

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

It is the year 2011 and a lot of our everyday reality depends on robots, even if we don't see or hear about them every day. Our cars are built on assembly lines consisting almost entirely of robots, and we can buy a robot pet or watch one of those Japanese robots dancing or playing a violin. Heck, even this very blog post will be read, indexed and catalogued by a search engine (ro)bot.

In a way Asimov was right: unlike other science fiction predictions (flying automobiles, anyone?), we really do have useful robots. All right, they don't have positronic brains, but they do have semiconductor microchips. And they don't have the Three Laws built in. So, why don't our robots respect them, like Asimov's robots did? Wouldn't they be more useful or at least safer if they did? Unfortunately, at this time it is an impossible task. For example, how would we formally define the concept of "injure"? Also, not all injuries have to be inflicted willingly. A modern robot may crush a human by tripping over an edge and falling on him. Most modern robots can barely walk and hold objects. Making sure that they don't injure people by accidentally hitting them, let alone covering all possible meanings of injuring, really is impossible at this point in our technological development.

Ok, but I also mentioned a software robot browsing the web. With software robots there isn't even a remote possibility that they might physically harm people. A simple fact - if a robot is incorporeal, it can't crush anyone's bones. This allows us to avoid the fundamental problem with implementing the First law. The Second law can be easily implemented if we allow the robot to decide whether or not to obey an order based on the human's authorization and permissions. Also, a robot should only be expected to do what it is programmed to do, so an order to do something outside its scope should also be politely refused. The Third law is much easier to implement, since all it would require is that the software bot does its best not to crash and lose all its data.

So, we might try to adapt Asimov's laws into Three Laws of Software Robotics (changes emphasized):

1. A software robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A software robot must obey any orders within its scope given to it by authorized human beings, except where such orders would conflict with the First Law.
3. A software robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

I'll just conclude this article with an important note. Some of you may ask: "But, why bother? Why would anyone want to waste time on trying to make online software behave like robots from science fiction stories?". It's not about trying to make the present (formerly known as 'the future') like what we expected it to be when we were kids reading futuristic magazines. Ok, actually it is, but that is not the main point. :)

The thing is that, the way the Web is developing lately, with all the privacy concerns, corporate conspiracy data miners, stalkers, pedophiles, etc., it may be necessary to address all these issues and try to stop web services from being harmful. Maybe we need all that software to be more Asimov compliant.

Saturday, November 13, 2010

JavaBean Property Change Listener With Dynamic Proxy Wrapper

Recently, I was working on a project where my software would receive some Java beans, do something with them and pass them around. The trick was that, besides all the usual stuff, the system had to react to bean property changes. For example, if we had a Person object and its property "name" was changed, the framework would have to execute some code, sync some GUI components, etc.

So, I was quite excited when I found the PropertyChangeSupport class in the Java API specification. However, the functionality provided by this class requires changes in the bean class code. I would have to modify the Person class, add code for a PropertyChangeSupport object, and add a firePropertyChange method call to each setter method in the bean class (or at least to the setter methods of those properties my system cared about). This wouldn't be a big problem if all bean classes were known in advance and their source code was available and open for editing; in my case, however, it wasn't so.
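To make those required changes concrete, here is a minimal sketch of a bean that has been modified by hand to support listeners. The class name `ObservablePerson` is invented for this illustration and is not the Person class from the project; only the `java.beans` calls are standard JDK API:

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// A bean edited by hand so that its "name" property is observable.
class ObservablePerson {

    // Extra field the bean needs just to support listeners.
    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);

    private String name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        String oldName = this.name;
        this.name = name;
        // Every tracked setter needs a call like this one.
        changes.firePropertyChange("name", oldName, name);
    }

    public void addPropertyChangeListener(PropertyChangeListener listener) {
        changes.addPropertyChangeListener(listener);
    }

    public void removePropertyChangeListener(PropertyChangeListener listener) {
        changes.removePropertyChangeListener(listener);
    }

    public static void main(String[] args) {
        ObservablePerson person = new ObservablePerson();
        person.addPropertyChangeListener(event ->
                System.out.println(event.getPropertyName() + ": "
                        + event.getOldValue() + " -> " + event.getNewValue()));
        person.setName("Alice");
    }
}
```

Everything except the `name` field and its plain getter and setter is boilerplate that has to be repeated in each bean, which is exactly what cannot be done when the bean's source is not available.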

After several failed experiments, including but not limited to AspectJ, I found the solution which you can see in this article - the Javassist framework for runtime bytecode manipulation. I used it to generate dynamic proxy wrappers around any Java bean, with firePropertyChange called in each of the chosen setter methods. Yeah, I know it's heavy thinking already, so let's get to an example.

We have a Java class:



And our generated proxy class would look like this:



And here is how to make a proxy object:



The list contains the names of all properties to be bound, so that we can let some property changes be ignored. As simple as that! Now, whenever we change any bound property on the returned object, the assigned listener will do its job. Exactly what the listener class should look like depends on the particular system. For the purpose of this tutorial, I created a dummy implementation:



And at any time, if we want our original untracked bean we can do this:



Ok, that's how you use it. Now, how it's made. We had several problems here. The first was to create a runtime wrapper proxy around a bean object, the second to make our object "trackable" by adding PropertyChangeSupport support (no pun intended). From another point of view, we first had to generate a runtime class, then instantiate it around our original bean.

So, the solution contains a WrapperProxyFactory class with a concrete method createWrapperProxy and abstract methods adjustProxyClass and adjustProxyObject. The concrete method handles all the proxy-wrap-and-delegate stuff and leaves the purpose of the proxy itself to a subclass implementing the abstract methods. In this case the subclass is named BoundBeanWrapperProxyFactory. Its adjustProxyClass method alters all bound setters to fire an event on each call, and adjustProxyObject adds listeners to wrapper objects. We also have the interfaces WrapperProxy and BoundBeanWrapperProxy, which expose methods that are added to our proxies at runtime, such as retrieveOriginal.
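The wrap-and-delegate idea can be sketched without Javassist by using the JDK's own `java.lang.reflect.Proxy`. This is a deliberately swapped-in technique, not the code from the project: JDK proxies only work for interfaces, so the sketch assumes a hypothetical `PersonBean` interface, whereas the Javassist approach described above wraps plain bean classes. All names in the block are illustrative.

```java
import java.beans.Introspector;
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Set;

class BoundProxySketch {

    // Hypothetical bean interface; JDK proxies require one.
    interface PersonBean {
        String getName();
        void setName(String name);
    }

    // A plain bean that knows nothing about listeners.
    static class SimplePerson implements PersonBean {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // Wraps the original bean; setters of bound properties fire change events.
    static <T> T wrap(Class<T> type, T original, Set<String> boundProperties,
                      PropertyChangeListener listener) {
        PropertyChangeSupport support = new PropertyChangeSupport(original);
        support.addPropertyChangeListener(listener);
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] { type },
                (self, method, args) -> {
                    String property = propertyOfSetter(method);
                    boolean bound = property != null && boundProperties.contains(property);
                    Object oldValue = bound ? readProperty(type, original, property) : null;
                    Object result;
                    try {
                        result = method.invoke(original, args); // delegate to the real bean
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow what the bean itself threw
                    }
                    if (bound) {
                        support.firePropertyChange(property, oldValue, args[0]);
                    }
                    return result;
                });
        return type.cast(proxy);
    }

    // Returns the property name for a one-argument setXxx method, else null.
    private static String propertyOfSetter(Method method) {
        String name = method.getName();
        if (name.startsWith("set") && name.length() > 3 && method.getParameterCount() == 1) {
            return Introspector.decapitalize(name.substring(3));
        }
        return null;
    }

    // Reads the current value through the matching getter, or null if there is none.
    private static Object readProperty(Class<?> type, Object bean, String property) {
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            return type.getMethod(getter).invoke(bean);
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        PersonBean tracked = wrap(PersonBean.class, new SimplePerson(), Set.of("name"),
                event -> System.out.println(event.getPropertyName() + " changed to " + event.getNewValue()));
        tracked.setName("Alice");
    }
}
```

In terms of the factory described above, the invocation handler plays roughly the role of the altered setters, and attaching the listener corresponds to what adjustProxyObject is said to do.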

Separation of these factories did complicate things a little, however a proxy that wraps around a bean and does something else may be useful, too. Then all we have to do is write a different factory subclass.

Anyway, you can download the code from the link here or dig it out of my hg repository on Google Code. Some code changes (e.g. package names) may be required for use.

Tuesday, September 21, 2010

Developing From Console

In my previous post I promised to explain how I developed an application while (mostly) avoiding bloated development tools, so here it comes...

First, a brief explanation of the app itself: it is a genetic algorithm application implemented in Clojure, whose goal is to solve a mathematical game. The game solver (my program) gets one target number and six more numbers. The goal is to use these six numbers to create an expression which evaluates to the target number. In this article I will focus on the development environment rather than on the domain problem or the program itself, so you can see more details about the game and get the source on the project page: http://code.google.com/p/genetic-my-number (In case you got to this blog from there, just continue reading, and see how it was made :))

Ok, I briefly explained the problem which my application solved, now to get to the post title. Yeah, most of the tools I used to develop it are command line based. Well, actually all but two. Unix-like environments offer a great deal of useful command line tools, so it was a piece of cake to find and install all of them, however most are (as far as I know) available on Windows platform as well.

So, let's see those tools:

Editor - Vim, with limited Clojure plugin meaning that I only used syntax colouring feature and rainbow parens, not the nailgun or any other automatic runner. (I also could've used an easier editor like nano, I just like vim more)

Source Versioning - since I hosted my source on Mercurial, the best and simplest client turned out to be default console based Mercurial client hg.

Build Management and Running - Maven2, default console based (I'm sure leiningen is great, and sure will try it out, but I really don't see anything wrong with maven either). Clojure plugin for Maven offers a clojure:repl command, which creates a REPL with all the maven dependencies set up on the classpath without trouble.

Diff tool - Meld (a GUI app!) - simple, lightweight, intuitive, and looks nice... Much easier to work with than default diff, vimdiff and (personal opinion) aesthetically more appealing than kdiff. It also serves the purpose to show that choosing console tools over graphical ones shouldn't be just for geekness sake, but based on their actual usefulness.

Web Browser - Chrome and Firefox, obvious choice for both testing web applications and reading online documentation (also add whatever browser you wish to test against)...

Music Player (C'mon, who doesn't occasionally need music while programming :)) - Moc, Music On Console (the geekest of them all) a command line based music player that does exactly what it says, plays music - no lyrics plugin, no funky visualisations, no music store support - it just plays the damn music... :-)

These tools (and the command line terminal itself) were the only toolset I used while writing the genetic algorithm application, and I wrote it in a remarkably short time. Why is that? First, we shouldn't forget the choice of programming language and platform. Clojure is a modern Lisp, which means that it gives huge flexibility and the possibility to write great functionality in a short time and in a small number of lines of code. Second, the tools I chose are all lightweight: they load almost instantly, save a file just as quickly, and saving a file doesn't trigger a rebuild... so I just got rid of all those annoying seconds spent waiting for Eclipse to finish some automatic task, not to mention the time needed to set up all the plugins. There were many times when I had more trouble integrating a tool with Eclipse than setting up the tool itself, e.g. trying to use Google Web Toolkit with both the Maven GWT plugin and the Eclipse GWT plugin. So, my solution is quite simple - use an editor for editing. Period. Use a build tool for building, a version tool for versioning, etc... "Write (or in my case: Use) programs that do one thing and do it well". This principle isn't anything new, it's been known as a key part of the Unix philosophy, so it isn't really a surprise that most development tools that satisfy it come from the Unix world.

Also, as I mentioned in my previous post, there is a tendency in IDE design to encompass the functionality of the system's window manager within the IDE window, so basically we have to switch between different files, the REPL, the web browser etc. both within the IDE and within the system window manager. With the toolkit I used, the only switching I had to do was between terminal windows and a browser. Even the Meld tool would be launched from the console and pop up, and after viewing the diff and merging changes I would simply close it.


Of course, there are drawbacks to this approach. Some technologies, especially proprietary ones or those that have an enterprisey approach (or both!), simply demand some specific environment. But whenever I have the luxury of choosing what to work with, it will be whatever is comfortable and makes me most productive.

Thursday, September 16, 2010

Are IDEs bloated by design?

By IDE, I obviously mean Integrated Development Environment, a tool every programmer out there knows of and most of us use. But first, before I start with the topic, I just want to make one thing clear - I am not starting another flame war about which IDE is less bloated or more powerful or whatever. The point I am trying to make is that all Integrated Development Environments inevitably become bloated at some point in their lifecycle, principally because of the 'Integrated' part.


Ok, it really helps to have some basic functionalities needed for programming in one place, but the definition of 'some basic' may vary for different people and sometimes for different projects. The core functionalities include code editing (with color coding, folding and other fanciness), automatic compiling in background or at least easy compiling with a single click, and easy running of the written program. Also, core functionality may include linking with libraries, debugging and accessing library documentation.


But then, when we program we don't just write code, run and debug continuously. We also use source versioning systems, build managers, testing tools and frameworks. So, IDEs then get features (usually as plugins) for all these functionalities.


And that is not all. Many projects contain files with different syntaxes or in different programming languages, e.g. a Java based web application may contain java files, jsp, properties, javascript, css, and plenty of xml. So, our IDE of choice gets support for all these as well.


And at some point, when we finish working on that project and start another one using a different technology stack, we can either reinstall the IDE and repeat the process of setting up all the plugins, this time for another set of technologies, or we can add new plugins to the existing environment setup. So, now our IDE contains support for the newest buzzword-compliant SOA based web services, an ages old database system and an obscure scripting language with support for closures and duck typing.


Ok, I admit that using buzzword software as an example is a bit of cheating, because it is not really that hard to convince programmers against leveraging state of the art game changer software, but you do get my point about where the existing trend is leading us.


Next thing, we have an embedded web browser in an IDE, so that you can test your webpages without opening a browser, and a system console within an IDE console, which... well, sends commands to the real shell console and displays the output back in the IDE... Some people are apparently trying to create an Uber Development Environment, where programmers would never have to minimize The Editor when programming.


It reminds me a bit of those huge fancy hotels where you are being encouraged to spend your entire vacation within hotel compound, swim in the hotel pool, eat only in hotel restaurant, buy souvenirs in specialized gift shops with mass-produced hand-made local products. Well, I'm simply not that kind of a guy. Seriously, I'm always much more in the mood of going to a beach of my choice, eating wherever I want etc. Ok, I'm offtopic in my own topic! :) My main concern was the efficiency and (effective) ease of use, not the matter of convenience vs. choice. But I couldn't help it, I really dislike these hotels :)


Back to the topic.. There already is an article on thedailywtf, describing a phenomenon which I believe is the main factor that causes bloated IDEs in the first place. They called it The Inner-Platform Effect and the key part of the definition is "a result of designing a system to be so customizable that it ends becoming a poor replica of the platform it was designed with". This effect may occur in literally any application that offers support for plugins or similar extensions. Of course, IDE without plugins may become bloated too, but then all the bloating would have to be done by the IDE developers themselves, rather than "general public".


Anyone correct me if I'm wrong, but it does seem to me that the main IDE window is becoming a replacement for our system's window manager. It contains many tabs or small windows and other widgets for organizing code and other stuff in a project. Switching between windows/tabs in Eclipse, Visual Studio, NetBeans, etc. is the new Alt-Tabbing. I'm just not sure what was wrong with the first one. And that is the least of the problems bloatware creates for us. For example, any modern editor with all the plugins takes a lifetime to load, and may occasionally hang while doing some validation of HTML against the W3C standards that it wasn't even asked to do, and other whatnots..


Ok, but so far I have just ranted against bloated IDEs and haven't offered any solution to the problem. As a matter of fact, many of you probably don't even see a problem there. If so, I'm happy for you, folks :) Also, it's possible that you do see the same problem, but may see my solution as afflicted with even worse problems.


Anyway, if you visit my blog again, see my next post where I will explain how I used a really simple and quite low tech programming environment to develop a cool app.

Clockwork Fig is unleashed!

I proudly present my personal blog, where I will post about various different things which I find interesting, from programming stuff, tutorials, my personal project, to all other things that are interesting enough to be written about.


I hope you will enjoy reading my posts (or at least find them useful :))