I am fairly new to MongoDB, even though I have some knowledge of NoSQL database concepts. Over the last two days I decided to give MongoDB a try and work with it in an actual environment. To see how Mongo handles larger data, I set out to import a semicolon-separated (SSV) file containing 76,000 lines and run some experiments. The file is a quote dataset with three fields, quote, person, and category, separated by semicolons (;). You can download the dataset from this link.
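Before importing anything, it pays to sanity-check the file's shape. Here is a quick sketch using a two-line stand-in file (the lines below are made up for illustration; the real quotes_all.csv has 76,000 lines):

```shell
# Hypothetical stand-in for quotes_all.csv (made-up lines, not the real data)
printf '%s\n' \
  'Stay hungry, stay foolish;Steve Jobs;inspiration' \
  'Simplicity is the ultimate sophistication;Leonardo da Vinci;design' > quotes_all.csv

# Every line should split into exactly 3 fields on ';'
awk -F';' '{print NF}' quotes_all.csv | sort -u

# Line count (76,000 for the real file)
wc -l < quotes_all.csv
```

If `sort -u` prints anything other than a single `3`, some lines have stray or missing semicolons and will break the import later.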
Unfortunately, I ran into two major issues and one minor one during the first phase of the experiment.
The first challenge was installing MongoDB on Ubuntu 15.10. Why is that a challenge? Since Ubuntu 15.04, upstart (the former default init system) has been replaced by systemd, which causes a lot of headaches. I am not sure about others, but I personally gave up on installing MongoDB on Ubuntu. Instead, I used an OpenShift instance as my playground.
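For the record, the workaround people usually suggest on Ubuntu 15.04+ is to write a systemd unit for mongod by hand, since the MongoDB packages of that era only shipped an upstart script. A minimal sketch, which I have not verified myself, and whose paths assume the stock mongodb-org package layout (/usr/bin/mongod, /etc/mongod.conf):

```ini
# /lib/systemd/system/mongod.service -- hand-rolled unit, assuming the
# mongodb-org package layout; adjust paths and user to your install
[Unit]
Description=MongoDB database server
After=network.target

[Service]
User=mongodb
ExecStart=/usr/bin/mongod --quiet --config /etc/mongod.conf

[Install]
WantedBy=multi-user.target
```

followed by `sudo systemctl daemon-reload` and `sudo systemctl start mongod`. Your mileage may vary, which is exactly why I moved to OpenShift instead.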
Installing MongoDB on OpenShift is very easy: you just click and install it via the console. I created a database called
quotionary and a collection (the equivalent of a [schemaless] RDBMS table) called quotes.
Now, let’s discuss the second challenge. This roadblock was transforming the file from semicolon-separated to tab-separated (TSV). Why tab? mongoimport only accepts JSON, CSV, and TSV, and since our file has commas (,) as part of its content, the handiest solution is to convert it to tabs. To do so, use this command:
$ sed 's/;/\t/g' quotes_all.csv > quote.tsv
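To convince yourself that the commas inside quotes survive and only the semicolons turn into tabs, try the same sed on a made-up line first (the quote below is just a stand-in):

```shell
# Hypothetical sample line -- the comma in the quote must stay,
# while the semicolons become tabs
printf 'I think, therefore I am;Rene Descartes;philosophy\n' > sample.csv
sed 's/;/\t/g' sample.csv > sample.tsv

# With GNU cat, -A shows tabs as ^I and line ends as $
cat -A sample.tsv
```

Note that the `\t` escape in the sed expression is a GNU sed extension; on BSD sed you would insert a literal tab instead.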
The last hurdle was getting the file into the OpenShift data directory in a clean way, without any bleeding. Why the clean way? Overall, there are two common approaches to get a file onto an OpenShift instance: one is
scp, which is neat and clean. The other (a quick hack) is to commit the file to the git repository. I came up with a third, easy-peasy-lemon-squeezy solution, described below.
To upload a file to OpenShift, first I uploaded it to an Amazon S3 bucket and grabbed its public URL. Then
ssh into the instance with this command:
$ rhc ssh Test
cd to the persistent data directory with this command:
$ cd $OPENSHIFT_DATA_DIR
Finally, download the file using wget:
$ wget [link of S3 bucket]
Some may wonder where to find an S3 bucket and whether there is any replacement for it. The answer is simple: if you don’t have access to S3, just create a Dropbox or Google Drive account, upload the file there, and get its public link.
Before jumping into the import, we need to create the collection. To do that, I use the
mongo interactive console, like this:
$ mongo
> use quotionary
> db.createCollection("quotes")
After all these steps, we can import the file into MongoDB:
$ mongoimport --host $OPENSHIFT_MONGODB_DB_HOST --db quotionary --collection quotes --type tsv --file quote.tsv -u admin -p --fields "quote,name,category"
Note that --fields and --headerline are mutually exclusive in mongoimport; since my file has no header row, I name the fields explicitly with --fields. If your file does have a header line, drop --fields and pass --headerline instead.
The final step of the phase one experiment is to verify the data by querying it:
$ mongo
> use quotionary
> db.quotes.find().pretty()
If anything goes wrong and you need to repopulate the data, just drop the collection with db.quotes.drop() and run the import again (or pass --drop to mongoimport, which drops the collection before importing).
Phase one is now complete, and the TSV file has been successfully imported into MongoDB. However, this is just the beginning: in the next phases, I plan to do some data processing on the dataset.