“Open Source Software” isn’t about source code
While thinking about the slides for my upcoming talk at Mongo Boston, I had an interesting thought. It’s nothing earth-shattering, but I wanted to write it down and see where it took me.
Let’s take a look at the open source ecosystem surrounding MongoDB, and for this example we’ll focus on the Ruby space and some of the stuff I’ve played with at Punchbowl. You have MongoDB itself, which is written in C++. There’s MongoMapper, an ORM. There’s Rack-GridFS, a Rack middleware for directly accessing files stored in GridFS. There’s OpenIDAuthentication, a library for doing OpenID auth in MongoDB. There’s Roachclip, a plugin I wrote for MongoMapper which combines the fun of Thoughtbot’s Paperclip image processing with the ability to store all the assets in GridFS through Joint. There are literally hundreds of open source software projects out there that anyone can pick up and use in a hobby or in a business.
Some of this software has documentation. Some doesn’t. It’s all open source, though. So you can download it, crack it open in a text editor, and just go figure it out. See a problem or notice a lack in functionality? Most of this software is using version control (hi Github); just fork and fix it. What used to be:
http://github.com/original_author/sweet_repo
now becomes:
http://github.com/you/sweet_repo
Maybe your contributions will get pulled into the mainline, maybe your version will remain a fork and start to get used by others, or maybe nobody else will ever use it but you.
Such is the life of open source source code. In my opinion, it’s pretty damn cool. But this coolness is NOTHING compared to what open source is really about.
What I’ve started to realize over the past year is that “Open Source Software” isn’t about the last part of the Github URI: the repository name. Open Source Software is really about the second to last part: The Author.
Open Source Software is about the people. MongoDB isn’t just a C++ repository; it’s Kristina, Mike, Kyle, Dwight & Elliot (among others). MongoMapper, OpenIDAuthentication, Rack-GridFS, Roachclip & Paperclip aren’t just Ruby libraries; they are John Nunemaker, Brandon Keepers, Blake Carlson, yours truly & Jon Yurek (again, among others).
There is obviously value in the code: it does stuff. The code can also teach you stuff. The code can show you how to properly build shareable, modular and properly tested pieces of software. The code can teach you the black arts of meta-programming and the Ruby object model.
But I assure you that no matter how much awesomeness is in the code, there is at least 7 times more awesomeness in the brains of the people who built it. Those people who first had a need for it. The people who struggled through the bugs. The people who had a dozen false starts that might not get reflected by the current HEAD. That experience is PRICELESS. And it’s all available if you just ask.
So, ask. Say hi.
Every single one of those people, and countless more, are “Open Source”. You can get at them on mailing lists, IRC, and Github Issues. You can email them directly or tweet at them. Most of the time, they’ll answer (and don’t get pissed if they don’t answer — sometimes people get busy).
So if you are using Open Source Software just for the source code, you’re missing out. Use these people. Ask questions. Get a conversation going. That’s the real meaning of Open Source Software.
3 reasons to use MongoDB
Note: This is a precursor post to my talk at Mongo Boston on September 20th. It’s gonna be at Microsoft’s NERD (which I hear is COMPLETELY AWESOME). If you haven’t signed up yet, stop being lame. It’s gonna be awesome. http://www.10gen.com/conferences/mongoboston2010
People have asked me why to use MongoDB. I used to answer with “it’s SO FREAKING AMAZING!!”, and talk about how “new” and “cool” and “hawt” it felt. I would wax on about how it felt like the first time I used Rails and so forth.
That was cheating. It’s a crappy answer of no value. So today, I am going to give you three reasons to use MongoDB over MySQL, Postgres, Tokyo Cabinet, or CouchDB.
1. Simple queries
MongoDB is a document store with no transactions and no joins. When an application warrants using this type of database[1], the result is that your queries become much simpler. They are easier to write. They are easier to tune. They make it easier for developers to do their job. In Punchbowl land, ‘users’ have ‘events.’ There is a table for each, with a user_id on the events table. Let’s say I want to get all the users who have published an event.
In an SQL database, I have two tables: users and events. I could write this query like so:
SELECT `users`.* FROM `users` INNER JOIN `events` ON `events`.`user_id` = `users`.`id` WHERE `events`.`published_at` IS NOT NULL GROUP BY `users`.`id`;
Analogously, in a MongoDB database, let’s say I have just one collection: users. Each user document has an attribute called ‘events’, which is a list of embedded documents. It looks something like this in JSON:
{
  "name" : "Ryan Angilly",
  "events" : [
    {
      "title" : "First one!"
    },
    {
      "title" : "Whoa!"
    },
    {
      "title" : "Oh hi",
      "published_at" : true
    }
  ]
}
To perform the same query in MongoDB query syntax:
db.users.find( { events: { $elemMatch: { published_at: { $ne: null } } } } )
Simpler. Simpler to read. Simpler to write. I glossed over the fact that we are drastically changing how we store our data, but that’s the whole point. And you can clearly see how it makes things easier to understand.
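If it helps to see what that query is asking for, here is a rough sketch in plain JavaScript, with no database involved. The sample users other than the one above, and the function name usersWithPublishedEvents, are made up for illustration. It keeps every user who has at least one event with published_at set, which is the intent of the query, not how MongoDB evaluates it internally.

```javascript
// Hypothetical in-memory data shaped like the example document above.
const users = [
  {
    name: "Ryan Angilly",
    events: [
      { title: "First one!" },
      { title: "Whoa!" },
      { title: "Oh hi", published_at: true },
    ],
  },
  { name: "Nobody Published", events: [{ title: "Draft" }] },
  { name: "No Events", events: [] },
];

// Keep users with at least one event whose published_at is set.
// The loose `!= null` check is false for both null and undefined (missing field).
function usersWithPublishedEvents(list) {
  return list.filter((user) =>
    (user.events || []).some((event) => event.published_at != null)
  );
}

const publishers = usersWithPublishedEvents(users).map((u) => u.name);
console.log(publishers);
```

The point is that the "join" disappears: each user already carries its events, so the check is a walk over one document.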
2. Sharding
Sharding is a simple concept. If you have lots of data and you are getting disk-bound and/or running out of space, take your data and split it across several machines. You get more disk throughput and more storage. In a perfect world, as your storage and performance needs grow, just add more shards.
MongoDB is pretty close to this perfect world. If you have a mongod process running, and you want to set up sharding, you:
1) Bring up a new machine
2) Start a new mongod process to act as a member of your shard cluster
3) Start a new mongod process to act as a separate ‘config’ database for maintaining configuration information about which data are in which shard
4) Start a mongos process & tell it how to find the current db, the new shard member, the config database
5) Enter ~5 commands to enable sharding on whichever databases and collections you want
6) Modify your apps to connect to the mongos process instead of the old mongod process
7) Profit.
All interprocess communication is done over IP, so the configuration mongod process and mongos process can run on their own machines or run on the same machine as one of your shard members. This can be done with no downtime, guys. And you don’t have to have an eye towards sharding when you start. You can take a regular old mongod process and it will “just work.”
There are solutions to this problem in MySQL [2], but they require massaging data at a layer above the database. The database itself does not support this feature. Also, you don’t have to think about sharding until you need it. You don’t have to pre-optimize. When you don’t need sharding, just start up a mongod process and go. When you do need sharding, fire up a few more machines, and issue a few commands. No downtime.
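To make the moving parts a little more concrete, here is a toy sketch of the routing idea: a table that says which range of shard-key values lives on which shard (the job of the config database), and a lookup that sends each document to the right place (the job of mongos). This is not MongoDB's real data structure or API. The shard names, the ranges, and the choice of email as the shard key are all assumptions for illustration.

```javascript
// Toy model only: not MongoDB's real data structures or API.
// Each chunk covers shard-key values in [min, max) and lives on one shard.
const chunks = [
  { min: "", max: "h", shard: "shard-a" },
  { min: "h", max: "q", shard: "shard-b" },
  { min: "q", max: "\uffff", shard: "shard-a" },
];

// Find the shard whose chunk range contains the given shard-key value.
function routeByShardKey(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  if (!chunk) throw new Error(`no chunk covers key: ${key}`);
  return chunk.shard;
}

// Group a batch of documents by destination shard,
// using `email` as the (hypothetical) shard key.
function planWrites(docs) {
  const plan = {};
  for (const doc of docs) {
    const shard = routeByShardKey(doc.email);
    (plan[shard] = plan[shard] || []).push(doc.email);
  }
  return plan;
}

const plan = planWrites([
  { email: "alice@example.com" },
  { email: "mike@example.com" },
  { email: "zed@example.com" },
]);
console.log(plan);
```

Your app never sees any of this. It talks to one process and the routing happens behind it, which is why adding a shard doesn't mean touching application code.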
A common quip that I’ve heard and read is something to the effect of: “How many people reading this post actually have enough data to worry about the need to shard? Not many.” My response to this is simple. Most people who use MySQL w/ master/slave replication probably don’t need that either. Lots of apps could get away with SQLite and a cronjob that backed the file up every hour. But MySQL & master/slave replication is the status quo, so we all do it. Now, think about HD video, geolocation, realtime messaging, augmented reality, closer-to-realtime-satellite imagery. Think about all that data, and how much faster people will want it (and mashups & derivatives of it), 5 years from now. Then think about what database you want to start using right now.
3. GridFS
For reasons that I’m not experienced enough to talk about, you don’t store files in MySQL. Let’s say you have an application where a user can upload a profile pic. The standard practice is to store the file on the filesystem (a shared filesystem if you have multiple app servers) or on S3, and to store the path to that file in the database. If you use a filesystem, some kind of backup is usually performed as well.
With GridFS, you store files in the database. MongoDB was built to do this. Why is this a “reason to use MongoDB”? Because MongoDB has replication and sharding of collections built in. And guess what? You can apply that stuff seamlessly to GridFS collections as well. When you store assets in MongoDB, you get all the replication and sharding capability for free. Want to back up your user assets? Just replicate the GridFS collections. Running out of space on your NFS share? Have fun dealing with IT. Running out of space on your MongoDB GridFS installation? Bring up another machine and shard that collection.
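The reason this works is that a GridFS file is just documents: one metadata document in a files collection, plus a series of numbered chunk documents in a chunks collection. Here is a toy sketch of that layout using plain objects and a Node Buffer. The function names, the tiny 8-byte chunk size, and the file contents are made up for illustration, and a real driver does this for you.

```javascript
// Toy model only: plain objects standing in for the fs.files and fs.chunks collections.
function splitIntoChunks(fileId, filename, data, chunkSize) {
  // One metadata document per file.
  const fileDoc = { _id: fileId, filename, length: data.length, chunkSize };
  // One chunk document per slice, numbered from 0 in field n.
  const chunkDocs = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n++) {
    chunkDocs.push({
      files_id: fileId,
      n,
      data: data.subarray(offset, offset + chunkSize),
    });
  }
  return { fileDoc, chunkDocs };
}

// Reassemble a file by sorting its chunks on n and concatenating the bytes.
function reassemble(chunkDocs) {
  const ordered = [...chunkDocs].sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((c) => c.data));
}

const original = Buffer.from("pretend this is a profile picture");
const { fileDoc, chunkDocs } = splitIntoChunks(1, "avatar.png", original, 8);
const roundTrip = reassemble(chunkDocs);
console.log(fileDoc.length, chunkDocs.length, roundTrip.equals(original));
```

Since the chunks are ordinary documents in an ordinary collection, replicating or sharding them is the same operation as for any other collection.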
Storing assets in a database is the way we should be doing it from now on.
FINI
So there you have it. MongoDB is teh awesome because of a simple query syntax, the ability to shard data across machines easily, and the ability to store files in GridFS while taking advantage of replication & sharding.
If you can make it to Mongo Boston, sign up and come say hi. I’ll be the guy getting lynched by the MySQL and CouchDB fanboys.
[1] “Well when the crap is that?!” you may be asking. Look for my next post about when you should be using MongoDB.
