Choose Your Database Wisely and My IMDB Project

I am currently learning a lot about Python, one of a suite of free tools that can be used for data analysis. Courses teach you how to interact with a database with the least effort possible, and they hype the fact that SQLite support is built into Python's standard library.

In Python, the database of least effort is SQLite. So, armed with my new knowledge, I charged forward using Python on my desktop against a SQLite database holding millions of records.
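For a picture of what that looked like, here is a minimal sketch of the kind of line-by-line loader I mean. The file and database names are placeholders; RAW_data is the staging table I describe in part 2 below.

```python
import sqlite3

# Minimal sketch of the loader (names are placeholders).
conn = sqlite3.connect("imdb_raw.db")
conn.execute("CREATE TABLE IF NOT EXISTS RAW_data (line TEXT)")

# IMDB list files are plain text; latin-1 handles the accented names.
with open("actors.list.txt", encoding="latin-1", errors="replace") as f:
    for line in f:
        # Store each raw line for later normalization.
        conn.execute("INSERT INTO RAW_data (line) VALUES (?)", (line.rstrip("\n"),))

conn.commit()
conn.close()
```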

I was so “in love” with the idea of all these free tools that I forgot about the free desktop tool I had used before: Microsoft SQL Server Express. As I estimated how long it would take to process 20.4 million records, I was a little disappointed in SQLite, but unsure whether it was causing the performance problems. Perhaps it was the program's disk usage (100%).

There really wasn’t much to the program, though, so if the problem was file access throttling the hard drive, I was fairly stuck.

Eventually, even after processing millions of records, I decided to install my old standby – Microsoft SQL Server Express – which, like SQLite, is free. These are two vastly different databases. SQLite is what it says – lite – and can be used on a wide variety of portable devices. SQL Server has a specific product for smaller or compact devices, but SQL Server Express is basically full SQL Server minus some features.

The people at Microsoft have done thousands of hours of programming to make the SQL Server engine perform. Quickly, I was able to rip through a million records into the SQL Server database using virtually the same code I had been running against SQLite. Python was not the problem.
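It really was nearly the same code. Here is a sketch of the same loop pointed at SQL Server Express instead, assuming the pyodbc driver; the server, database, and driver names are placeholders for whatever your install uses.

```python
import pyodbc

# Sketch: same loader, different connection. Server/database names
# are placeholders for a local SQL Server Express instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost\\SQLEXPRESS;DATABASE=imdb;Trusted_Connection=yes;"
)
cursor = conn.cursor()

with open("actors.list.txt", encoding="latin-1", errors="replace") as f:
    for line in f:
        cursor.execute("INSERT INTO RAW_data (line) VALUES (?)", line.rstrip("\n"))

conn.commit()
conn.close()
```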

Falling in love with a way of doing things – with the tool used to do things – was the problem. Now, instead of multiple computers with multiple instances of the program running, I have one program running on one computer. In a relatively short period of time (I can be more precise later, using the datetime stamps) it has already processed 4.5 million records.

Now, instead of facing long stretches of time just to process the actors.list.txt file, I can expect to get through all the files in a relatively short period and focus on analyzing the data and figuring out master data structures.

Just to give a picture: against SQLite it took 1:25 (a minute and 25 seconds) to process 1,000 records from the actors.list.txt file into the database. Against Microsoft SQL Server Express it is hard to even time 1,000 records – it takes about 10 seconds to process 10,000.
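If you want to reproduce the comparison, the timing itself is only a few lines. A sketch against the SQLite side, with placeholder names:

```python
import sqlite3
import time

# Sketch: time a batch of 1,000 inserts against SQLite (names are placeholders).
conn = sqlite3.connect("imdb_raw.db")
conn.execute("CREATE TABLE IF NOT EXISTS RAW_data (line TEXT)")
rows = [("sample line %d" % i,) for i in range(1000)]

start = time.perf_counter()
conn.executemany("INSERT INTO RAW_data (line) VALUES (?)", rows)
conn.commit()
print("1,000 inserts: %.2f seconds" % (time.perf_counter() - start))
```

One caveat: SQLite insert speed is very sensitive to how often you commit – one commit per batch, as above, behaves very differently from one commit per row – so your numbers will depend on how the loader is structured.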

I get no money from Microsoft or anyone else (no ads on my site at present). These are just the cold, hard facts. Free tools are great and I would use them to teach people; SQLite is the path of least resistance for getting someone started on a database. Microsoft SQL Server Express involves an installer, and it takes a serious commitment to get through the install process; however, the time invested is returned nearly immediately given the speed of storing information in the database.

Don’t fall in love with your tool!

IMDB Data Search Project Part 2 (project update – no code)

I have done some preliminary analysis of the files and written a small amount of code to transfer them from the original text into a SQLite database. My preference is SQL Server – even the free version – but I’m trying new things, and, well, SQLite is free.

There do seem to be some limitations with SQLite compared to what I am used to, even with the Microsoft SQL Server Express versions. Typically, I can have one database in SQL Server Express, connect to it from multiple custom-written clients, and have them all write to the same database.

Now, I haven’t put a lot of energy into this with SQLite; however, at present it seems I can only have one writing client per database file.
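You can see the limitation for yourself with two connections to the same file. A sketch (names are placeholders): SQLite allows many readers, but only one writer at a time, and a second writer gets “database is locked” once the timeout expires.

```python
import sqlite3

# Sketch: demonstrate SQLite's single-writer behavior.
writer = sqlite3.connect("imdb_raw.db", timeout=1.0, isolation_level=None)
other = sqlite3.connect("imdb_raw.db", timeout=1.0, isolation_level=None)

writer.execute("CREATE TABLE IF NOT EXISTS RAW_data (line TEXT)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
writer.execute("INSERT INTO RAW_data VALUES ('held by the first writer')")

try:
    other.execute("INSERT INTO RAW_data VALUES ('second writer')")
except sqlite3.OperationalError as exc:
    print(exc)  # "database is locked" after the timeout expires

writer.execute("COMMIT")
```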

At first I began processing the files with a program that would rip through each file, inserting the records into a RAW_data table for later processing with Python into separate, normalized tables.

I started on my laptop, processing the first file – the actors file – and it was working well. It was chunking through the data and inserting it into the SQLite database. The problem was that I didn’t know how long it was going to take. The actors file is 1.5 GB; however, since I am processing the files line by line, the file's size in gigabytes isn’t much of an indicator of how long it will take to process.

So I wrote a second program – again, a very simple one – that ripped through all the files and gave me the number of lines in each.
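The counting program really can be that simple. A sketch, assuming the .list.txt files sit in the current folder:

```python
import glob

# Sketch: count the lines in every IMDB list file in the current folder.
for path in sorted(glob.glob("*.list.txt")):
    with open(path, encoding="latin-1", errors="replace") as f:
        count = sum(1 for _ in f)
    print("%10d  %s" % (count, path))
```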

It turns out that by the point my patience was running low and I wrote this other program to get a progress report, I had processed over 400,000 records. For some reason my “gut” said it wouldn’t take long to finish.

Well, the “gut” was wrong. Here is a list of the files and their lengths in terms of number of lines:

  20494819  actors.list.txt
  12310518  actresses.list.txt
   2232725  aka-names.list.txt
   1028301  aka-titles.list.txt
  11774007  biographies.list.txt
   2315726  business.list.txt
    688418  certificates.list.txt
   1651352  cinematographers.list.txt
   1914881  color-info.list.txt
     86222  complete-cast.list.txt
     49427  complete-crew.list.txt
   1378534  composers.list.txt
    476732  costume-designers.list.txt
   1968883  countries.list.txt
     75481  crazy-credits.list.txt
   3103707  directors.list.txt
   1845204  distributors.list.txt
   2121689  editors.list.txt
   2384782  genres.list.txt
     43686  german-aka-titles.list.txt
   1390862  goofs.list.txt
      2573  iso-aka-titles.list.txt
     55635  italian-aka-titles.list.txt
   6834098  keywords.list.txt
   1938923  language.list.txt
    357895  laserdisc.list.txt
    428031  literature.list.txt
   1086096  locations.list.txt
   1113667  miscellaneous-companies.list.txt
   7531399  miscellaneous.list.txt
   2906536  movie-links.list.txt
   4031167  movies.list.txt
     72460  mpaa-ratings-reasons.list.txt
   7771660  plot.list.txt
   7585527  producers.list.txt
   2571705  production-companies.list.txt
    546929  production-designers.list.txt
   5085511  quotes.list.txt
    691804  ratings.list.txt
   4678389  release-dates.list.txt
   1339115  running-times.list.txt
    631133  sound-mix.list.txt
   2700214  soundtracks.list.txt
     76752  special-effects-companies.list.txt
    500669  taglines.list.txt
   1796072  technical.list.txt
   3660623  trivia.list.txt
   4934220  writers.list.txt

As you can see, progress of 400,000 records is not impressive when the file contains over 20 million. So I moved the program, the files, and the database file to two of my desktop computers. Then, using a “poor man’s” multithreading, I copied the Python program, ripped out the logic that moved from file to file, and set it up to start at a specific record. I divided the file up by different start points and launched multiple copies of the program on both computers.
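Each copy just skips ahead to its starting record and stops where the next worker begins. A sketch of one such worker, with the start/stop numbers as hypothetical command-line arguments:

```python
import sqlite3
import sys

# Sketch of one "poor man's multithreading" worker: each copy of the
# program handles one slice of the file and writes to its own database
# file (names and arguments are placeholders).
start, stop = int(sys.argv[1]), int(sys.argv[2])

conn = sqlite3.connect("imdb_raw_%d.db" % start)  # one database per worker
conn.execute("CREATE TABLE IF NOT EXISTS RAW_data (line TEXT)")

with open("actors.list.txt", encoding="latin-1", errors="replace") as f:
    for lineno, line in enumerate(f):
        if lineno < start:
            continue  # skip ahead to this worker's slice
        if lineno >= stop:
            break
        conn.execute("INSERT INTO RAW_data (line) VALUES (?)", (line.rstrip("\n"),))

conn.commit()
conn.close()
```

Launched as, say, `python worker.py 0 3000000` and `python worker.py 3000000 6000000`, each copy covers its own slice of the file.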

This is how I found out that databases can be locked, and that SQLite might not be as handy as SQL Server Express. However, it is very easy to make new SQLite databases, so this was a very minor hindrance. In addition, with a little research, merging separate SQLite databases appears very easy – at least the instructions look easy. In a later session I’ll describe the process of merging them.
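The merge technique those instructions describe is SQLite's ATTACH: open one database, attach another, and copy the rows across. A sketch with placeholder file names:

```python
import sqlite3

# Sketch: merge a worker database into a main one using ATTACH.
conn = sqlite3.connect("imdb_main.db")
conn.execute("CREATE TABLE IF NOT EXISTS RAW_data (line TEXT)")
conn.commit()

conn.execute("ATTACH DATABASE 'imdb_raw_3000000.db' AS worker")
conn.execute("INSERT INTO RAW_data SELECT * FROM worker.RAW_data")
conn.commit()
conn.execute("DETACH DATABASE worker")
conn.close()
```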

At present, with seven instances of the program running across the two computers, I have moved 5.7 million records to SQLite databases. Earlier I had estimated it would take 49 days for the laptop to transfer just the actors file to SQLite. Unacceptable!

Now, however, it shouldn’t take much longer than a week. Alongside the desktops, I had my laptop working against the smaller files, and a number of those are already completed and ready for me to start processing into normalized data in the database.

Some minor irritations in the files:

  1. The first 100 to 200 or more lines are the people from IMDB talking about how you can’t make money from the files, can’t sue them for bad data in the files, and blah blah blah. I appreciate that there are things to say and licensing and stuff like that. Just don’t do it in the damned dirty data file.
  2. Some brainiac figured they could make smaller files by only including the key data (the actor’s name) in the first record. Subsequent records lack this piece of information until you hit a blank line; then the next actor starts (see the parser sketch after this list).
  3. The blank lines. Really. Why would you do this? All you need is an actor name in the line to know you have started the next actor. You could use basic page-break logic, as in printing, to process it without the extra lines.
  4. There is a separate “actors” file and “actresses” file. Universally, I refer to anyone performing in a theatrical production as an actor. Why there are two files for essentially the same data – except that the actors in one file are male and the actors in the other are female – is just crazy. There could have been one file with the actor’s sex as a column of data. However, this does give us an opportunity for some ballpark analysis. The actors file has 20.4 million lines; the actresses file has 12.3 million. We can already see a disparity (at least by total line count) between male and female actors having credited roles in theatrical productions. I look forward to getting this data into a database where I can do a more accurate analysis.
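Here is the parser sketch mentioned in item 2 – a hedged guess at the logic, since I am simplifying the field splitting (the real files separate the name from the credit with tab characters):

```python
# Sketch of the carry-forward logic from items 2 and 3: the actor's name
# appears only on the first line of a block, continuation lines omit it,
# and a blank line ends the block.
def parse_actor_blocks(path):
    current_name = None
    with open(path, encoding="latin-1", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                current_name = None  # blank line: the next actor starts
                continue
            if current_name is None:
                # First line of a block: name, tab(s), then the first credit.
                current_name, _, credit = line.partition("\t")
            else:
                credit = line  # continuation line: the name is implied
            yield current_name, credit.strip()

for name, credit in parse_actor_blocks("actors.list.txt"):
    print(name, "|", credit)
```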

So the line counts overestimate the data – there aren’t that many lines of actual information in the files – but we have no way to know just how many lines of real data there are, because an actor can be in one movie, show, or episode, or in hundreds, and there is no real way to estimate this. It isn’t a big deal – just a minor irritation. Once I have the whole file loaded I can run a simple query to find out how many blank lines were in the file.
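That query is a one-liner once the raw lines are in the table. A sketch against the RAW_data staging table:

```python
import sqlite3

# Sketch: once the file is loaded, count the blank separator lines.
conn = sqlite3.connect("imdb_raw.db")
(blank_count,) = conn.execute(
    "SELECT COUNT(*) FROM RAW_data WHERE TRIM(line) = ''"
).fetchone()
print("blank lines:", blank_count)
```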

While on the topic of criticizing the files and their contents… Spaces do take up space in a file. Not having the names on each line may save a little space; however, this could have been done in another way that would have saved far more space.

As a database developer, I know one of the principles of databases is not to repeat data. You end up with a file of “actors” (or a table in your database) and a number representing each of those actors. An integer (4 bytes in SQL Server) covers values just over 2 billion, and there aren’t 2 billion actors. Judging by the mixed master data file of actors (actors and the shows they are in) being only 20.4 million records (including those stupid blank lines), it will be centuries before we have 2 billion actors.

Then, in a separate file relating actors to theatrical productions, each line would hold the integer plus the production name. Unless an actor’s name is very short, the integer will always be smaller than repeating the name (and its padding spaces) on every line, as in the present form.

Now, if we extend this outward, there could be a theatrical production name table with integer ID numbers. Again, since we are far from having over 2 billion theatrical productions, a 4-byte integer is no problem as an identifier for all of them.

Then this file I am processing – the “actors” file – would simply be two integer values separated by a delimiter of some sort. More than likely, as I go through processing the data, I will create an actors table with integer ID numbers, a theatrical production master data table, and a cross-reference (crosswalk) table to replace what is presently the “actors” data.
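Sketched in DDL, the layout I have in mind looks something like this. Table and column names are placeholders, and the same schema translates readily to SQL Server Express (INTEGER PRIMARY KEY becomes INT IDENTITY, and so on):

```python
import sqlite3

# Sketch of the normalized layout: an actor master table, a production
# master table, and a crosswalk between them (names are placeholders).
conn = sqlite3.connect("imdb_normalized.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS actor (
    actor_id INTEGER PRIMARY KEY,  -- 4-byte int: room for 2 billion actors
    name     TEXT NOT NULL,
    sex      TEXT                  -- one file, with sex as a column
);
CREATE TABLE IF NOT EXISTS production (
    production_id INTEGER PRIMARY KEY,
    title         TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS actor_production (  -- the crosswalk table
    actor_id      INTEGER REFERENCES actor(actor_id),
    production_id INTEGER REFERENCES production(production_id)
);
""")
conn.commit()
conn.close()
```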

I will publish these tables, mostly with license limitations like IMDB’s – you can’t sue me for inaccuracy and you can’t make money using the tables.

Ok, well, just a little status on the smaller files:

 86222  complete-cast.list.txt – completed 11/29/2016
 49427  complete-crew.list.txt – completed 11/29/2016
 75481  crazy-credits.list.txt – completed 12/02/2016
 43686  german-aka-titles.list.txt – completed 11/29/2016
  2573  iso-aka-titles.list.txt – completed 11/30/2016
 55635  italian-aka-titles.list.txt – completed 11/30/2016
 72460  mpaa-ratings-reasons.list.txt – completed 11/29/2016
 76752  special-effects-companies.list.txt – completed 11/29/2016

So things are proceeding – and while the actors file goes through its initial processing, I’ll be able to research these files, get them into better formats, and load them into the database with final row counts.