3 reasons to use MongoDB
Note: This is a precursor post to my talk at Mongo Boston on September 20th. It’s gonna be at Microsoft’s NERD (which I hear is COMPLETELY AWESOME). If you haven’t signed up yet, stop being lame. It’s gonna be awesome. http://www.10gen.com/conferences/mongoboston2010
People have asked me why to use MongoDB. I used to answer with “it’s SO FREAKING AMAZING!!”, and talk about how “new” and “cool” and “hawt” it felt. I would wax on about how it felt like the first time I used Rails and so forth.
That was cheating. It’s a crappy answer of no value. So today, I am going to give you three reasons to use MongoDB over MySQL, Postgres, Tokyo Cabinet, or CouchDB.
1. Simple queries
MongoDB is a document store with no transactions and no joins. When an application warrants using this type of database[1], the result is that your queries become much simpler. They are easier to write. They are easier to tune. They make it easier for developers to do their job. In Punchbowl land, ‘users’ have ‘events.’ There is a table for each, with a user_id on the events table. Let’s say I want to get all the users who have published an event.
In an SQL database, I have two tables: users and events. I could write this query like so:
SELECT `users`.* FROM `users` INNER JOIN `events` ON `events`.`user_id` = `users`.`id` WHERE `events`.`published_at` IS NOT NULL GROUP BY `users`.`id`;
Analogously, in a MongoDB database, let’s say I have just one collection: users. Each user document has an attribute called ‘events’, which is a list of embedded documents. It looks something like this in JSON:
{
  "name" : "Ryan Angilly",
  "events" : [
    {
      "title" : "First one!"
    },
    {
      "title" : "Whoa!"
    },
    {
      "title" : "Oh hi",
      "published_at" : true
    }
  ]
}
To perform the same query in MongoDB query syntax:
db.users.find({ events: { $elemMatch: { published_at: { $ne: null } } } })
Simpler. Simpler to read. Simpler to write. I glossed over the fact that we are drastically changing how we store our data, but that’s the whole point. And you can clearly see how it makes things easier to understand.
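If you want to see the intent of that query without a database handy, here is a minimal sketch in plain JavaScript. It is not MongoDB itself: the array stands in for the users collection, the extra sample users are made up, and hasPublishedEvent is a hypothetical helper that just checks whether any embedded event has a published_at that is neither null nor missing.

```javascript
// In-memory stand-in for the users collection (sample data is made up).
const users = [
  {
    name: "Ryan Angilly",
    events: [{ title: "First one!" }, { title: "Whoa!" }, { title: "Oh hi", published_at: true }],
  },
  { name: "Draft Only", events: [{ title: "Not yet" }] },
  { name: "No Events", events: [] },
];

// Hypothetical helper: true when at least one embedded event has a
// published_at value that is neither null nor missing.
function hasPublishedEvent(user) {
  return (user.events || []).some((event) => event.published_at != null);
}

const publishers = users.filter(hasPublishedEvent);
console.log(publishers.map((user) => user.name)); // prints [ 'Ryan Angilly' ]
```

Everything needed to answer the question lives inside each user document, which is why no join shows up anywhere.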
2. Sharding
Sharding is a simple concept. If you have lots of data and you are getting disk-bound and/or running out of space, take your data and split across several machines. You get more disk throughput and more storage. In a perfect world, as your storage and performance needs grow, just add more shards.
MongoDB is pretty close to this perfect world. If you have a mongod process running, and you want to set up sharding, you:
1) Bring up a new machine
2) Start a new mongod process to act as a member of your shard cluster
3) Start a new mongod process to act as a separate ‘config’ database for maintaining configuration information about which data are in which shard
4) Start a mongos process & tell it how to find the current db, the new shard member, the config database
5) Enter ~5 commands to enable sharding on whichever databases and collections you want
6) Modify your apps to connect to the mongos process instead of the old mongod process
7) Profit.
All interprocess communication is done over IP, so the configuration mongod process and mongos process can run on their own machines or run on the same machine as one of your shard members. This can be done with no downtime, guys. And you don’t have to have an eye towards sharding when you start. You can take a regular old mongod process and it will “just work.”
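To make the roles of the ‘config’ database and mongos a little more concrete, here is a toy model in plain JavaScript. It is only a sketch of the idea, not how MongoDB implements it: the chunk ranges, the shard names, and the routeToShard function are all made up for illustration, and the real thing also splits and moves chunks around as data grows.

```javascript
// Toy model of the metadata a config database keeps: each chunk covers a
// range of shard-key values [min, max) and lives on exactly one shard.
// Shard names and ranges are hypothetical.
const chunks = [
  { min: "", max: "h", shard: "shard0" },
  { min: "h", max: "p", shard: "shard1" },
  { min: "p", max: "\uffff", shard: "shard2" },
];

// Hypothetical router (the job mongos does for your app): find the shard
// whose chunk range contains the given shard-key value.
function routeToShard(shardKeyValue) {
  const chunk = chunks.find((c) => shardKeyValue >= c.min && shardKeyValue < c.max);
  if (!chunk) throw new Error("no chunk covers key: " + shardKeyValue);
  return chunk.shard;
}

console.log(routeToShard("alice@example.com")); // prints shard0
console.log(routeToShard("mike@example.com")); // prints shard1
console.log(routeToShard("ryan@angilly.com")); // prints shard2
```

Your app only ever talks to the router, so adding a shard means changing the chunk table, not the app.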
There are solutions to this problem in MySQL [2], but they require massaging data at a layer above the database. The database itself does not support this feature. Also, you don’t have to think about sharding until you need it. You don’t have to pre-optimize. When you don’t need sharding, just start up a mongod process and go. When you do need sharding, fire up a few more machines, and issue a few commands. No downtime.
A common quip that I’ve heard and read is something to the effect of: “How many people reading this post actually have enough data to worry about the need to shard? Not many.” My response to this is simple. Most people who use MySQL w/ master/slave replication probably don’t need that either. Lots of apps could get away with SQLite and a cronjob that backed the file up every hour. But MySQL & master/slave replication is the status quo, so we all do it. Now, think about HD video, geolocation, realtime messaging, augmented reality, closer-to-realtime satellite imagery. Think about all that data, and how much faster people will want it (and mashups & derivatives of it), 5 years from now. Then think about what database you want to start using right now.
3. GridFS
For reasons that I’m not experienced enough to talk about, you don’t store files in MySQL. Let’s say you have an application where a user can upload a profile pic. The standard practice is to store the file on the filesystem or on S3, and store the path to that file in the database. If you use a filesystem, some kind of backup is usually performed as well, and if you have multiple app servers, it has to be a shared filesystem.
With GridFS, you store files in the database. MongoDB was built to do this. Why is this a “reason to use MongoDB”? Because MongoDB has replication and sharding of collections built in. And guess what? You can apply that stuff seamlessly to GridFS collections as well. When you store assets in MongoDB, you get all the replication and sharding capability for free. Want to back up your user assets? Just replicate the GridFS collections. Running out of space on your NFS share? Have fun dealing with IT. Running out of space on your MongoDB GridFS installation? Bring up another machine and shard that collection.
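The reason this works is that GridFS stores a file as ordinary documents: one ‘files’ document with the metadata, plus a series of ‘chunks’ documents holding the bytes. Here is a rough sketch of that layout in plain JavaScript (Node.js). The function names are hypothetical and the 4-byte chunk size is an assumption chosen to keep the example readable; real GridFS chunks are far larger, and a real driver does this for you.

```javascript
// Tiny chunk size so the example is easy to follow; not the real default.
const CHUNK_SIZE = 4; // bytes (illustrative assumption)

// Hypothetical helper: split a file into one metadata document plus
// fixed-size chunk documents linked back to it by files_id and ordered by n.
function toGridFsDocs(fileId, filename, data) {
  const fileDoc = { _id: fileId, filename, length: data.length, chunkSize: CHUNK_SIZE };
  const chunkDocs = [];
  for (let offset = 0, n = 0; offset < data.length; offset += CHUNK_SIZE, n++) {
    chunkDocs.push({ files_id: fileId, n, data: data.subarray(offset, offset + CHUNK_SIZE) });
  }
  return { fileDoc, chunkDocs };
}

// Hypothetical helper: reassemble the file by sorting chunks on n and
// concatenating their data.
function fromGridFsDocs(chunkDocs) {
  const ordered = [...chunkDocs].sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map((chunk) => chunk.data));
}

const original = Buffer.from("profile-pic-bytes");
const { fileDoc, chunkDocs } = toGridFsDocs(1, "avatar.png", original);
console.log(fileDoc.length, chunkDocs.length); // prints 17 5
console.log(fromGridFsDocs(chunkDocs).equals(original)); // prints true
```

Since the chunks are just documents in a collection, anything that replicates or shards a collection applies to them too.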
Storing assets in a database is the way we should be doing it from now on.
FINI
So there you have it. MongoDB is teh awesome because of a simple query syntax, the ability to shard data across machines easily, and the ability to store files in GridFS while taking advantage of replication & sharding.
If you can make it to Mongo Boston, sign up and come say hi. I’ll be the guy getting lynched by the MySQL and CouchDB fanboys.
[1] “Well when the crap is that?!” you may be asking. Look for my next post about when you should be using MongoDB.
What is MongoDB?
A lot of people come up to me and ask about MongoDB. Here’s a 101 for those of you still totally in the dark.
MongoDB is a database
It’s just like MySQL in the sense that you run a daemon, that daemon creates files on a filesystem, and you access it over a network via a client. A single mongod process runs on one machine, and can have many databases. A database can have many collections (“tables” in MySQL-speak). You can write to it & you can query based on attributes of records. Out of the box, it comes with support for replication and sharding. It has support for atomic operations. There are clients for it written in pretty much every popular language.
MongoDB is “schema-less”
In MySQL, you create a table w/ a pre-defined set of typed attributes (Create a ‘users’ table w/ name:string, email:string). When you write a new record to a table in MySQL, you specify attribute/value pairs: name = ryan. If you don’t specify a certain attribute (email, in this case) that’s usually ok. That record will just get a default value for that attribute. This default value could be null or an empty string or a predefined string, but it’s still there, and it has a value. Every record in the table will always have the same set of attributes. If you try to write a record to a table and include an attribute not in that table’s definition (wicked_attractive = true) you’ll get a nasty error.
In MongoDB, you create a “collection” with no pre-defined attributes (Create a ‘users’ collection). When you write a new document to a collection in MongoDB, you also specify attribute/value pairs: name = ryan. But in this case, there are no default values. There is no email attribute on that document. If you try to write a document to a collection with an attribute that no other document has (wicked_attractive = true), MongoDB will be ok with it. This is a key point: documents within the same collection can have different sets of attributes.
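A quick way to picture this is a plain JavaScript array of objects. This is only an analogy, not MongoDB: usersCollection is a hypothetical in-memory stand-in for the ‘users’ collection.

```javascript
// In-memory stand-in for a 'users' collection: just an array of objects.
const usersCollection = [];

// Two documents in the same collection with different sets of attributes.
usersCollection.push({ name: "ryan" });
usersCollection.push({ name: "ryan", wicked_attractive: true });

// No default value gets filled in: the email key is simply absent.
console.log("email" in usersCollection[0]); // prints false
console.log(Object.keys(usersCollection[1])); // prints [ 'name', 'wicked_attractive' ]
```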
MongoDB is a “document-store”
This is closely related to being “schema-less.” In MySQL, you define a set of attributes for a table. Rows get inserted into tables, and the rows are 1 dimensional. What I mean by 1 dimensional is that all of the pieces of data in a row are first class citizens. The number of pieces of information in a row equals the number of attributes defined for that table.
MongoDB lets you store arbitrarily complex documents (think JSON). The following document can be stored in the users collection:
{
  name: 'Ryan',
  email: 'ryan@angilly.com',
  likes: ['mongodb', 'skiing', 'Red Sox', 'Boulder chicks'],
  dislikes: ['humidity', 'Sarah Palin', 'bigotry', 'The Yankees'],
  current_outfit: {
    pants: 'blue shorts',
    shirt: false,
    shoes: 'flip-flops',
    undies: "wouldn't you like to know"
  }
}
In this case, there are 5 “top level” attributes, but 14 “pieces of data.”
Along with standard DB types (string, integer, float, datetime, boolean), MongoDB also has arrays and hashes as native types. In this ‘users’ document, you have an embedded ‘current_outfit’ document, but ‘current_outfit’ isn’t a collection. It’s just an embedded document inside of this particular user document. You also have lists of likes and dislikes. The elements in a list do not have to be the same type.
You can put indexes on “deep” attributes. In MySQL, you can put an index on `users`.`email` to speed up queries on that attribute. In MongoDB, you can put indexes on any attribute in the document. In our previous example, for… example…, you can put an index on users.current_outfit.shirt and quickly query to see who is topless. If you put an index on an array type (users.likes), you’d be able to quickly query for any user who ‘liked’ ‘twitter’.
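Here is a rough sketch, in plain JavaScript, of what indexing a “deep” attribute or an array attribute means. It is a toy, not MongoDB’s implementation (which uses B-trees, not a Map): getPath and buildIndex are hypothetical helpers, and the second user is made up. The point is that a dotted path gets resolved inside each document, and an array contributes one index entry per element.

```javascript
// Resolve a dotted path such as "current_outfit.shirt" against a document.
function getPath(doc, path) {
  return path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Build a toy index: value -> list of document positions. An array value
// gets one entry per element, which is the idea behind indexing users.likes.
function buildIndex(docs, path) {
  const index = new Map();
  docs.forEach((doc, position) => {
    const value = getPath(doc, path);
    const keys = Array.isArray(value) ? value : [value];
    for (const key of keys) {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(position);
    }
  });
  return index;
}

const docs = [
  { name: "Ryan", likes: ["mongodb", "skiing"], current_outfit: { shirt: false } },
  { name: "Sam", likes: ["twitter", "skiing"], current_outfit: { shirt: "t-shirt" } },
];

const byShirt = buildIndex(docs, "current_outfit.shirt");
const byLikes = buildIndex(docs, "likes");
console.log(byShirt.get(false)); // prints [ 0 ], the topless user
console.log(byLikes.get("twitter")); // prints [ 1 ]
console.log(byLikes.get("skiing")); // prints [ 0, 1 ]
```

A lookup is then a single Map read instead of a scan over every document.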
MongoDB is “NoSQL”
To query MySQL, you use (surprise, surprise) SQL:
SELECT * FROM `users` WHERE `users`.`email` = 'ryan@angilly.com' LIMIT 1;
SQL is a very powerful language, where different types of joins give you the power to issue a single query that effectively spans multiple tables, and can return a result set with data from multiple tables.
There is no SQL in MongoDB. You query a MongoDB database by issuing a query command and passing along a hash. In the mongo shell console (which is JavaScript) we use JSON:
db.users.find({email: 'ryan@angilly.com'}).limit(1);
There are $ operators for doing different types of inequalities, lat/long distance calculations, regex matches, etc.
When you issue a query to a MongoDB database, you cannot ask for stuff from two collections at once. There are no joins. However, keeping with our last example, if you query for a user with the email ‘ryan@angilly.com’ you would get back the entire user document we stored — with likes, dislikes, and current_outfit included. This is what people mean when they say “not having joins is ok because you don’t need them.” You can embed arbitrarily complex data inside a document, and get it all at once.
MongoDB is different (has downsides)
MongoDB is different. And anytime something is different, it has downsides compared to what you’re used to. Out of the box, MongoDB will acknowledge a write has completed before it’s on disk (although this is tunable on a write-by-write basis). MongoDB does not have transaction support (but after designing an app from the ground up with documents, you find you rarely need them). MongoDB will not make you more attractive to the opposite sex (although I hear they are working on it for 1.8).
I hope this has given you some insight into what MongoDB is. If you can make it to Mongo Boston, come say hi. I’ll be the wicked attractive topless guy. http://www.10gen.com/conferences/mongoboston2010
Some mornings MySQL really gets on my nerves
mysql> alter table some_table modify column priority decimal;
Query OK, 1170663 rows affected (2 min 10.26 sec)
Records: 1170663  Duplicates: 0  Warnings: 0
Table was write locked the entire time. GO FASTER.
