Choose Your Database Wisely and my IMDB project

I am currently learning a lot about Python. Python is one of a suite of tools that are free that can be used for Data Analysis. In courses they teach you how to interact with a database and they show you the least effort possible. In the courses they hype the fact that SQLite is native and built-in to the core libraries of Python.

In Python, the database of the least possible effort is SQLite. So, armed with my new knowledge I charged forward using Python on my desktop against a SQLite database for millions of records.

I was so “in love” with the idea of all these free tools, that I forgot about the free tools for the desktop that I had used before – Microsoft SQL Server Express. As I did estimates of how long it would take to process 20.4 million records I was a little disappointed in SQLite, but unsure if that was causing the performance problems. Perhaps it was the disk use (100%) of the program.

There really wasn’t much to the program, though, so if the problem was the file access throttling the hard drive I was fairly stuck.

Eventually, even after processing millions of records I decided to install my old standby – Microsoft SQL Server Express – which like SQLite is free. These are two vastly different databases. SQLite is what it says – lite – and can be used on a wide variety of portable devices. SQL Server has a specific product for smaller or compact devices, but SQL Server Express is basically full SQL Server minus some features.

The people at Microsoft have done thousands of hours of programming to make the SQL Server engine perform. Quickly, I was able to rip through a million records to the SQL Server database, using virtually the same code I had been using against SQLite. Python was not the problem.

Falling in love with a way of doing things – in the tool to do things, was the problem. Now, instead of multiple computers with multiple instances of the program running I have one program running on one computer. It has been a relatively short period of time (I can be more precise later – with the datetime stamps) it has already processed 4.5 million records.

Now, instead of considering long lengths of time just to process the actors.list.txt file, I can consider getting through all the files in a relatively short period of time and focus on analyzing the data and figuring out master data structures.

Just to give a picture, against SQLite it took 1:25 seconds to process 1000 records from the actors.list.txt file to the SQLite database. It is too hard to even time 1000 records processing against Microsoft SQL Server Express – it takes 10 seconds to process 10,000 records.

I get no money from Microsoft or anyone else (no ads on my site at present). This is just the cold hard facts. Free tools are great and I would use them to teach people. It is the path of least resistance to getting someone started on a database. Microsoft SQL Server express involves installs and it take a serious commitment to get through the install process; however, the time invested in that is returned nearly immediately given the speed of storing information in the database.

Don’t fall in love with your tool!

One thought on “Choose Your Database Wisely and my IMDB project

  1. Pallox January 12, 2017 / 10:34 am

    Thank you for contributing the information. It’s nice to see that other people are also messing around with the IMDb-Files – as those files got a really messy format.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s