I have done some preliminary analysis of the files and written a small amount of code to transfer them from the original text files into a SQLite database. My preference is SQL Server – even the free version – but I’m trying new things, and, well, SQLite is free.
There do seem to be some limitations with SQLite compared to what I am used to, even with the Microsoft SQL Express versions. Typically, I can have one database in Microsoft SQL Express and connect to it from multiple custom-written clients, all writing to the same database.
Now, I haven’t put a lot of energy into this with SQLite; however, at present it seems I can only have one client per database file.
At first I began processing the files with a program that would rip through each file, inserting the records into a RAW_data table – for later processing with Python into separate tables with data normalization applied.
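A minimal sketch of that loader – the three-column RAW_data layout here is my own naming for illustration, not necessarily what the real staging table looks like:

```python
import sqlite3

# A throwaway in-memory database for the sketch; the real run writes to a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS RAW_data (source_file TEXT, line_no INTEGER, line TEXT)"
)

def load_file(path):
    """Rip through one file, inserting every line into RAW_data as-is."""
    with open(path, encoding="latin-1", errors="replace") as f:
        rows = ((path, n, line.rstrip("\n")) for n, line in enumerate(f, start=1))
        conn.executemany(
            "INSERT INTO RAW_data (source_file, line_no, line) VALUES (?, ?, ?)",
            rows,
        )
    conn.commit()
```

Keeping the line number alongside the raw text makes the later normalization passes much easier to debug.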
I started on my laptop, processing the first file – the actors file – and it was working well, chunking through the data and inserting it into the SQLite database. The problem was that I didn’t know how long it was going to take. The actors file is 1.5 GB; however, since I am processing the files line by line, the size of the file in GB isn’t much of an indicator of how long it will take to process.
So, I wrote a second program – again a very simple one – that ripped through all the files and gave me the number of lines in each.
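That counter really is simple – something along these lines, assuming the `.list.txt` files sit in the working directory:

```python
import glob

def count_lines(path):
    """Count lines the same way the loader will see them."""
    with open(path, encoding="latin-1", errors="replace") as f:
        return sum(1 for _ in f)

# Print a count for every IMDB list file in the current directory.
for path in sorted(glob.glob("*.list.txt")):
    print(count_lines(path), path)
```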
It turns out that by the point my patience was running low and I wrote this other program to get a progress report, I had processed over 400,000 records. For some reason my “gut” said it wouldn’t take long to process.
Well, the “gut” was wrong. Here is a list of the files and their lengths in terms of number of lines:
20494819  actors.list.txt
12310518  actresses.list.txt
2232725   aka-names.list.txt
1028301   aka-titles.list.txt
11774007  biographies.list.txt
2315726   business.list.txt
688418    certificates.list.txt
1651352   cinematographers.list.txt
1914881   color-info.list.txt
86222     complete-cast.list.txt
49427     complete-crew.list.txt
1378534   composers.list.txt
476732    costume-designers.list.txt
1968883   countries.list.txt
75481     crazy-credits.list.txt
3103707   directors.list.txt
1845204   distributors.list.txt
2121689   editors.list.txt
2384782   genres.list.txt
43686     german-aka-titles.list.txt
1390862   goofs.list.txt
2573      iso-aka-titles.list.txt
55635     italian-aka-titles.list.txt
6834098   keywords.list.txt
1938923   language.list.txt
357895    laserdisc.list.txt
428031    literature.list.txt
1086096   locations.list.txt
1113667   miscellaneous-companies.list.txt
7531399   miscellaneous.list.txt
2906536   movie-links.list.txt
4031167   movies.list.txt
72460     mpaa-ratings-reasons.list.txt
7771660   plot.list.txt
7585527   producers.list.txt
2571705   production-companies.list.txt
546929    production-designers.list.txt
5085511   quotes.list.txt
691804    ratings.list.txt
4678389   release-dates.list.txt
1339115   running-times.list.txt
631133    sound-mix.list.txt
2700214   soundtracks.list.txt
76752     special-effects-companies.list.txt
500669    taglines.list.txt
1796072   technical.list.txt
3660623   trivia.list.txt
4934220   writers.list.txt
As you can see, progress of 400,000 records is not impressive when the file contains over 20 million. So, I moved the program, files, and database file to two of my desktop computers. Then, using “poor man’s” multithreading, I copied the Python program, ripped out the logic that moved from file to file, and set it up to start on a specific record. Then I divided the file up by different start points and launched multiple copies of the program on both computers.
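Each launched copy is essentially the original loader with a start/stop window bolted on. A sketch, reusing the same hypothetical RAW_data staging table – each copy gets its own slice and (as it turned out, necessarily) its own database file:

```python
import sqlite3

def load_slice(path, start, stop, db_path):
    """Load lines [start, stop) of one file into its own database file.
    Each launched copy gets a different slice and a different db_path."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS RAW_data (source_file TEXT, line_no INTEGER, line TEXT)"
    )
    inserted = 0
    with open(path, encoding="latin-1", errors="replace") as f:
        for n, line in enumerate(f, start=1):
            if n < start:
                continue          # skip ahead to this copy's starting record
            if n >= stop:
                break             # another copy handles the rest
            conn.execute(
                "INSERT INTO RAW_data VALUES (?, ?, ?)", (path, n, line.rstrip("\n"))
            )
            inserted += 1
    conn.commit()
    conn.close()
    return inserted
```

Launching, say, four copies with start points at 1, 5,000,001, 10,000,001, and 15,000,001 divides the actors file into roughly equal chunks.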
This is how I found out that databases can be locked and that SQLite might not be as handy as SQL Server Express. However, it is very easy to make new SQLite databases, so this was a very minor hindrance. In addition, with a little research it turns out to be very easy to merge separate SQLite databases – at least the instructions look easy. In a later session I’ll describe the process of merging the SQLite databases.
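The instructions I found boil down to SQLite’s ATTACH DATABASE statement. A sketch of what I expect the merge to look like, assuming both databases carry the same RAW_data table:

```python
import sqlite3

def merge_into(main_db, other_db):
    """Append all RAW_data rows from other_db into main_db."""
    conn = sqlite3.connect(main_db)
    # Attach the second database file under the alias "other".
    conn.execute("ATTACH DATABASE ? AS other", (other_db,))
    conn.execute("INSERT INTO RAW_data SELECT * FROM other.RAW_data")
    conn.commit()
    conn.execute("DETACH DATABASE other")
    conn.close()
```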
At present, with 7 instances of the program running across the two computers, I have completed moving 5.7 million records to SQLite databases. Earlier I had estimated that it would take the laptop 49 days to transfer just the records from the actors file to SQLite. Unacceptable!
Now, however, it shouldn’t take much longer than a week. In the meantime, I had the laptop working against the smaller files – and a number are already completed and ready for me to start processing into normalized data in the database.
Some minor irritations in the files:
- The first 100 to 200 or more lines are the people from IMDB talking about how you can’t make money from the files, can’t sue them for bad data in the files, and blah blah blah. I appreciate that there are things to say about licensing and the like. Just don’t do it in the damned dirty data file.
- Some brainiac figured they could make smaller files by only including the key data (the actor’s name) in the first record. Subsequent records lack this piece of information until you hit a blank line; then the next actor starts.
- The blank lines. Really. Why would you do this? All you need is an actor name in the line to know you have started the next actor. You could use basic page-break logic, like printing uses, to process it without adding the extra lines.
- There is a separate “actors” file and “actresses” file. Universally, I refer to anyone performing in a theatrical production as an actor. Why there are two files for essentially the same data – except that the actors in one file are male and the actors in the other file are female – is just crazy. You could have had one file with the actor’s sex as a column of data. However, this does give us an opportunity for some ball-park analysis of the data. The number of lines in the actors file is 20.4 million. The number of lines in the actresses file is 12.3 million. We can already see that there is a disparity (at least based on total number of lines) between male and female actors having credited roles in theatrical productions. I look forward to getting this data into a database where I can do a more accurate analysis.
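Parsing that carry-forward format comes down to remembering the last name seen and resetting it on a blank line. A sketch – I am assuming the first line of each block is the name, a tab, then the first credit (the real files use runs of tabs, so the split may need adjusting):

```python
def parse_name_blocks(lines):
    """Yield (name, credit) pairs from lines where the name appears only
    on the first record of a block and a blank line ends the block."""
    name = None
    for line in lines:
        if not line.strip():
            name = None          # blank line: the next non-blank line starts a new actor
            continue
        if name is None:
            # First line of a block: name, then the first credit (assumed layout).
            name, _, credit = line.partition("\t")
        else:
            credit = line
        yield name, credit.strip()
```

A quick usage example: feeding it `["Smith, John\tMovie A (2001)", "\t\t\tMovie B (2003)", "", "Doe, Jane\tMovie C (1999)"]` yields three pairs, with “Smith, John” carried onto the second credit.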
So, the line estimates are off – there aren’t that many lines of actual information in the files – but we have no way to know just how many lines of real data there are, because an actor can be in 1 movie, show, episode, whatever, or in hundreds, and there is no real way to estimate this. It isn’t a big deal, just a minor irritation. Once I have the whole file loaded I can run a simple query to find out how many blank lines were in the file.
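That blank-line query is trivial once everything is in the staging table – something like this, again assuming the hypothetical RAW_data layout from earlier:

```python
import sqlite3

def blank_line_count(db_path, source_file):
    """Count rows from one source file whose stored line is empty or whitespace-only."""
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM RAW_data WHERE source_file = ? AND TRIM(line) = ''",
        (source_file,),
    ).fetchone()
    conn.close()
    return count
```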
While on the topic of criticizing the files and their contents… spaces do take up space in a file. Not repeating the names on each line may save a little space; however, this could have been done in another way that would have saved far more.
Since I am a database developer – one of the principles of databases is not to repeat data. You end up with a file of “actors” (or a table in your database) and a number representing each of those actors. An integer (4 bytes in SQL Server) covers values just over 2 billion, and there aren’t 2 billion actors. Judging by the mixed master data file of actors (actors and the shows they are in) being only 20.4 million records (including those stupid blank lines), it will be centuries before we have 2 billion actors.
Then, on every line of a separate file relating actors to theatrical productions, the integer plus the theatrical production name will almost always be smaller than the line in its present form – unless the name is very small, even counting the spaces.
Now, if we extend this outward, there could be a theatrical production name table with integer ID numbers. Again, since at this point we are far from having over 2 billion theatrical productions, a 4-byte integer is no problem as an identifier for all of them.
Then this file that I am processing – the “actors” file – would simply be two integer values separated by a delimiter of some sort. More than likely, as I go through processing the data, I will be creating an actors table with integer ID numbers and a theatrical production master data table, and then what is presently the “actors” table will become a cross-reference or crosswalk table.
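In SQLite terms, the normalized scheme I have in mind would look roughly like this – table and column names are my guesses at this stage, not the final design:

```python
import sqlite3

schema = """
CREATE TABLE actor (
    actor_id INTEGER PRIMARY KEY,  -- well under the 2-billion ceiling for centuries
    name     TEXT NOT NULL,
    sex      TEXT                  -- one actors table; sex becomes a column
);
CREATE TABLE production (
    production_id INTEGER PRIMARY KEY,
    title         TEXT NOT NULL
);
-- The crosswalk: what is presently the "actors" file, as two integers per row.
CREATE TABLE actor_production (
    actor_id      INTEGER NOT NULL REFERENCES actor(actor_id),
    production_id INTEGER NOT NULL REFERENCES production(production_id),
    PRIMARY KEY (actor_id, production_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```

Note that SQLite stores integers in a variable-length encoding rather than a fixed 4 bytes, but the same “small ID instead of a repeated name” argument holds.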
I will publish these tables, mostly with limitations like the IMDB license – you can’t sue me for inaccuracy and you can’t make money using the tables.
Ok, well, just a little status on the smaller files:
—-> 86222 complete-cast.list.txt <—- Completed 11/29/2016
—-> 49427 complete-crew.list.txt <—- Completed 11/29/2016
—-> 75481 crazy-credits.list.txt <—- Completed 12/02/2016
—-> 43686 german-aka-titles.list.txt <—- Completed 11/29/2016
—-> 2573 iso-aka-titles.list.txt <—- Completed 11/30/2016
—-> 55635 italian-aka-titles.list.txt <—- Completed 11/30/2016
—-> 72460 mpaa-ratings-reasons.list.txt <—- Completed 11/29/2016
—-> 76752 special-effects-companies.list.txt <—- Completed 11/29/2016
So, things are proceeding – and while the actors file does its initial processing, I’ll be able to research these files, get them into better formats, and get them up into the database with final row counts.