There are some things that you do almost every day that are really annoying. So you are watching show A. You see an actor. You know that face. You know that voice.
Finally, it occurs to you which show you have seen that person in. It was show B.
So, you hop on to IMDB (Internet Movie Database) and you look up show B and go to the cast – the whole cast, because the actor isn’t one of the top names. You still don’t know their name so you start clicking on cast members and eventually – aha – you have found them. Then you go scan their filmography to see if they are in show A.
It takes a bunch of time. I know, it isn’t exactly a world crushing problem. I do watch shows on TV. And for some reason it really bothers me when I know I’ve seen an actor’s face before and I can’t figure it out. Sometimes days later I’ll figure it out – because I didn’t look it up or whatever.
Despite the name – IMDB isn’t all that great to search. It is good at single searches – show me this show, show me everything this actor was in, etc. But if you have to do something equivalent to a SQL Join to find out about x and y at the same time – it fails you.
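To make the "SQL Join" idea concrete, here is a minimal sketch of the kind of query I have in mind, run against an in-memory SQLite database. The table name credits, its two columns, and the rows are all made up for illustration; the real IMDB data is not shaped like this, and loading it is a later step.

```python
import sqlite3

def actors_in_both(conn, show_a, show_b):
    """Return actors credited in both shows, using a self-join on the actor."""
    sql = """
        SELECT DISTINCT a.actor
        FROM credits AS a
        JOIN credits AS b ON a.actor = b.actor
        WHERE a.show_title = ? AND b.show_title = ?
        ORDER BY a.actor
    """
    return [row[0] for row in conn.execute(sql, (show_a, show_b))]

# Hypothetical placeholder rows, not real IMDB data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credits (show_title TEXT, actor TEXT)")
conn.executemany(
    "INSERT INTO credits VALUES (?, ?)",
    [("Show A", "Actor 1"), ("Show A", "Actor 2"),
     ("Show B", "Actor 2"), ("Show B", "Actor 3")],
)
print(actors_in_both(conn, "Show A", "Show B"))
```

The join matches each credit in show A against credits in show B with the same actor, which is exactly the "x and y at the same time" question that takes so many clicks on the website.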
How to fix this? I don’t necessarily want to replace IMDB. What I would like to do is be able to find out information quickly and easily with specialized queries. IMDB isn’t really optimized for searching, either. There are lots of graphics and other stuff that all takes time to load. I know “The Internet is fast”, but it isn’t always. And even when it is, it still takes time to load all the images they are in love with at IMDB. Google’s page, by contrast, is a page optimized for querying.
So, that’s my charter. Nice and simple. Make IMDB searchable and really useful for finding out information. And maybe do some data analysis on the side.
What are the steps to this project?
- Set up an account at IMDB
- Find the data (if available – and fortunately for us it is available)
- Download the data – using account information as necessary
- Expand the Data – the data sources are compressed .gz files
- Data cleansing – and unfortunately the data isn’t in that great a form
- Load the data into a SQL Database – I’m a huge fan of SQL for data manipulation
- Develop an understanding of the data and the types of questions people ask – for example – who are the people that overlap show A and show B.
- Develop queries to deliver the information
- Develop web pages to run the queries and a web page to guide the users to the right place
- Done – well, easier said than done, certainly.
This article will cover 1, 2, 3, and 4. It involves a surprisingly small number of lines of code.
Now, of course, everything changes. In this era things change quickly, so some things that I have in this article might not be exactly as I have them here, but if you are adept you can figure your way through it.
#1 Set up an account on IMDB.
Now you can use Facebook and other things, but I recommend setting up the IMDB account – this way you can type in the login data if the FTP site requests it.
The circled area on the top right of the screenshot above of the IMDB site is the login area. Click on “Other Sign in Options”.
Click the circled button, “Create a New Account”.
I think at this point in time – we all know what to do on a create account page like this – even if we are not programmers!
When you have entered your data, press the “Create your IMDb account” button and it should return you to the start page for IMDB. You do not need a “Pro” membership and you don’t need to spend any money.
#2 Find the Data
Now, in this case I did some Google searches before I located the data. Some people will write web crawlers and rip through the pages of the website. In this case, however, the IMDB organization does not permit you to do this. They can’t necessarily stop you easily; however, you should respect the rules of your data source. In any case, they may not allow you to rip through all their web pages and gather data (which is a pain in the butt), but they do provide you with the data.
Go to this link. Technically this is an FTP site, and it brings you here:
#3 Download the data
Now, if there were hundreds of files perhaps we would develop a program to pull the files down locally, but given the relatively small number – around 48 files – a little manual downloading isn’t going to hurt. Even repeated once a month it isn’t much of an ordeal, and we can always create an automation program to download the files later. Perhaps a later update will download the files directly from the FTP site; however, this could present some difficulties. As I manually downloaded the files, the site would periodically ask for credentials, but it wouldn’t ask every time. That makes it hard to code. At this point, hard to code and providing little benefit means there is no reason to code this.
Once the files are in your downloads folder, collect them in a folder called IMDB. My program assumes C:\IMDB.
I’m in Windows 10 – so the screen shots and code may be somewhat specific to Windows. It shouldn’t be too bad. Below are the original .gz compressed files in File Explorer and cmd.
(NOTE: Only a partial listing in the Windows File Explorer.)
(NOTE: The flexibility of using the command line in Windows. The dir /w command inside of a directory/folder allows you to see more files at one time. You can do this in Windows File Explorer as well; however, it isn’t as space efficient (shown below))
In order to get to the command line in Windows 10, Press the Windows button (yes, it is really just the start button renamed) and type cmd in the “Ask me anything” entry. Then navigate to the C:\IMDB directory by typing cd.. at the command line until you are at the C:\ and then type cd IMDB
This isn’t really intended to teach the command line. I’m writing a couple of things because it is handy – and we will need it later in order to run our Python program. So, how to navigate on the Command Line and Execute a Python program will be upcoming.
#4 Expand the Data
Now we come to the heart of this posting – the programming. The language I have chosen is Python. I’m sure there are lots of reasons to choose Python, but I’m choosing it for one main reason – compact and easy code.
I’m using Python 2.7 for this article. This code may work in Python 3 and it may not. I’m not familiar with Python 3 at present. You can easily find how to install Python 2.7 on the web. This link is a good place to start. When you install it – you must add it to your path if you are in Windows. In my experience the install program doesn’t seem to add Python’s install directory to the path properly and I have had to manually update it. Adding a path is a relatively simple task, so I’ll leave it to you. Remember, frustration is often a part of the application development process.
The development platform is also a question. In Windows 10 I am suggesting you choose Notepad++. In a course that I took in Coursera this was the recommendation, and I fought it for a while by using Editplus 3, but in the end Notepad++ has worked better for me.
You can easily search for Notepad++ on the web and install it – it is a free application. Here is a link to help you get going – if it still works by the time you use this article.
NOTES ABOUT PYTHON CODE (when copy pasting or editing):
- indenting by spaces indicates that the code is part of a coding block – or scope
- # sign is a comment character
- In your Notepad++ editor – make sure indents are set to a number of spaces or your scope will not work properly
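To see the indenting and comment rules in action before the real program, here is a tiny sketch. The function count_gz and the file names passed to it are made up for illustration and are not part of the IMDB program.

```python
# A line starting with the number sign is a comment and is ignored.
def count_gz(names):
    total = 0                      # 4 spaces: inside the function
    for name in names:
        if name.endswith(".gz"):   # 8 spaces: inside the for loop
            total += 1             # 12 spaces: inside the if
    return total                   # back to 4 spaces: runs after the loop

print(count_gz(["a.gz", "notes.txt", "b.gz"]))
```

If the return line were indented to 8 spaces instead of 4 it would sit inside the loop and the function would return after looking at only the first name, which is why the editor's indent settings matter so much.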
```python
#Build a Better IMDB - AJ 20161126
#need to process files - original .gz files from IMDB
# 0) May make code to automate downloading of the files and do a difference
# 1) Get list of files in the IMDB directory
# 2) Create a Date Folder, Original File Folder under the Date Folder, Unzipped File Folder
# 3) Main Process
#    a) Unzip File - and save copy in the Unzipped Folder
#    b) Copy Original File to Original File Folder
#    c) Loop to next file
import gzip
import os
from os import listdir
from os import path
from os import makedirs
from os.path import isfile, join
from shutil import copyfile

def getfiles(homefolder):
    files = [f for f in listdir(homefolder) if isfile(join(homefolder, f))]
    return files

def decompress(filein, fileout, fileend):
    try:
        with gzip.open(filein, 'rb') as f:
            file_content = f.read()
            #print file_content
            f.close()
            print fileout
            outf = open(fileout, 'w')
            outf.write(file_content)
            outf.close()
            print "filein: " + filein
            print "fileout: " + fileend
            copyfile(filein, fileend)
            os.remove(filein)
    except OSError, e:
        print ("Error %s - %s" % (e.filename, e.strerror))
    except:
        pass
#main program logic/data
infolder = "C:\IMDB\\"
zipfolder = "C:\IMDB\Unzip\\"
archivefolder = "C:\IMDB\Original\\"
fs = getfiles(infolder)
#setup the folders
if not path.exists(zipfolder): makedirs(zipfolder)
if not path.exists(archivefolder): makedirs(archivefolder)
# Loop through files and dump to screen
for file in fs:
    decompress(infolder + file, zipfolder + file.replace(".gz",".txt"), archivefolder + file)
```
In the save file dialog, click the create folder icon (circled in red above on the upper right corner of the save file dialog). A new folder will show up – type in IMDBCode (or whatever you want to call it). I’m using the root of the C drive for both my IMDB data folder and the IMDBCode folder.
Once you have your folder created click in to it – and name the code you copied in as IMDB_InitialDataProcess.py . It is very important to include the file extension – as the Notepad++ editor will recognize the file as Python code and properly color the text as required (which is always nice).
OK, now I’m going to do a couple of screen shots to show you how the code should look before going in to detail about the code. This is because of the whole indent thing indicating scope. You could type the code (or copy it from above) and it might look perfect, and it won’t work.
OK. So, the code above goes from the relatively easy to understand to something that is going to take a bit of work. First off, the green lines are comments – with the leading #. It isn’t a pound sign. It isn’t sharp. It is just a number sign – shift+3.
Line 10 starts a block of code of imports. If you are familiar with .NET programming languages or Java then you understand what an import does. If not, well, here is a brief explanation.
All code depends on libraries. We write code on the shoulders of giants, and we use the giants’ code in our own code. Basically, someone writes a whole suite of functionality and then we use it. There is something that I would call base functionality, for which you don’t have to write any import statement. Other libraries extend the functionality of the language for a specific area, and those you have to import.
So, we import gzip – to gain the functionality to compress and uncompress files. Now, we can be more precise in what we import. In line 12 we write “from os import listdir” – this allows us to use listdir directly without preceding it with the library and sublibrary, etc. This makes for concise lines of code.
On line 15 we have “from os.path import isfile, join” – which is taking two pieces of functionality of the library at once. Of course, this means we could have reduced lines 12, 13, and 14 to a single line. We could even have potentially reduced the import lines even more, since line 15 imports additional pieces of the os library.
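As a quick sketch of that consolidation, the three separate "from os import" lines can be written as one. This is only an illustration of the import syntax, not a change you need to make to the program.

```python
import os
# One line replaces the three separate "from os import ..." lines.
from os import listdir, makedirs, path
# isfile and join live in os.path, so they are imported from there.
from os.path import isfile, join

# The short names are the very same objects as the fully qualified ones.
print(listdir is os.listdir)
print(join is os.path.join)
print(path is os.path)
```

Since path is already imported, path.isfile and path.join would also work without the second import line; the short names just save typing.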
Libraries can make programming languages extremely versatile.
The next segment of code starts on line 18. “def getfiles(homefolder):” – This is the definition of a function. The function takes in 1 parameter – homefolder. It returns a list of files in the folder.
But, sadly, the code I found on the web, while beautiful and concise, is, well, not very easy to read.
files = [f for f in listdir(homefolder) if isfile(join(homefolder, f))]
(the above is a single line of code with line wrap)
What the above line says is: Give me the directory listing for this folder. Then check if these are files and return only files – not directories. This is much more compressed code than I would write. I’ll get to this level after a while – it just takes comfort and more experience writing code in Python. I basically use this code as a black box as I know what it does and I know it works. Below is how I would have written it:
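The longer listing is not reproduced in this text, so what follows is a minimal sketch reconstructed from the step-by-step description of it. The function name getfiles_long and the variable names are assumptions, not necessarily the ones originally used.

```python
from os import listdir
from os.path import isfile, join

def getfiles_long(homefolder):
    entries = listdir(homefolder)            # every object in the folder
    files = []                               # start with an empty list
    for entry in entries:
        if isfile(join(homefolder, entry)):  # keep files, skip directories
            files.append(entry)
    return files
```

It should return the same list as the one-line version for any folder; the only difference is that each step gets its own line.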
You can see there are a whole lot more lines of code in this function, and every additional line a programmer writes is another chance to introduce an error (more so the more complex the line). In the above code, the first line (after the function definition) says – get me the list of objects in the directory parameter. Then I define an empty list. For every object in the directory list – check if it is a file. If it is a file – add it to the files list. Finally, return the files list.
Now, the nature of Python is that you define a lot of functionality in a program before you use it. Previously we defined a function to return all the files in a directory. The next function, decompress, is more complex. It performs these tasks:
- Unzip the file
- Read the unzipped file and close the unzipped file
- Write the unzipped file to the output folder and close it
- Copy the original file to the processed folder
- Delete the file from the source folder
These pieces of functionality are wrapped in a try..except clause. When you deal with files – errors can happen. If you don’t catch these errors your program will crash. In this case we are just printing an error statement – or just passing. If somehow I downloaded a file (or a program loaded a file) that wasn’t a zipped file then we would just skip it – and we would skip moving the file and deleting it – so that we could see what files failed to process. The with statement is great for file processing – as when the block of code ends – it automatically closes the file. In this case, the gzip operation opens the zipped file and makes the stream available to the program as the variable “f”.
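Here is a small sketch of the two ideas in that paragraph: the with statement closing the file for us, and the except clause letting the program skip a file that is not really gzipped. The function read_gzip_or_skip and the two demo files are hypothetical, and the sketch is written in Python 3 style, so it is not a drop-in replacement for the 2.7 code above.

```python
import gzip
import os
import tempfile

def read_gzip_or_skip(filename):
    """Return the uncompressed bytes, or None if the file can't be read as gzip."""
    try:
        with gzip.open(filename, "rb") as f:
            data = f.read()
        # Leaving the with block has already closed the file at this point.
        return data
    except (IOError, OSError) as e:
        print("Skipping %s - %s" % (filename, e))
        return None

# Hypothetical demo files in a throwaway folder.
folder = tempfile.mkdtemp()
good = os.path.join(folder, "good.gz")
bad = os.path.join(folder, "bad.gz")
with gzip.open(good, "wb") as f:
    f.write(b"hello")
with open(bad, "wb") as f:
    f.write(b"this is not gzip data")

print(read_gzip_or_skip(good) == b"hello")
print(read_gzip_or_skip(bad) is None)
```

Naming the exception types, rather than using a bare except, means a typo elsewhere in the function would still show up as a crash instead of being silently passed over.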
On line 25 we read the entire (uncompressed) file contents in to a variable. Line 27 closes the compressed file. Now, even though we have the with statement, we want to copy the original file and then delete it in the original location – and we can’t do that if the file is open.
There are a few print statements to print information for those times you are watching the program to make sure that it is operating correctly. Line 29 opens a new file for writing, then on line 30 we write the uncompressed file contents to the new file, and finally on line 31 we close this new file. Line 34 copies the original file to the end or processed folder. Finally, line 35 deletes the file that was processed.
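Since the unzipped files are large, one variation worth knowing about is copying the data across in chunks instead of reading a whole file into a variable. The sketch below is an alternative under that assumption, not the function used in this article: decompress_streaming is a hypothetical name, it uses shutil.copyfileobj and shutil.move from the standard library, and it is written in Python 3 style.

```python
import gzip
import os
import shutil

def decompress_streaming(filein, fileout, fileend):
    """Unzip filein to fileout, then move the original to fileend.

    Returns True on success and False if the file was skipped.
    """
    try:
        # Copy in chunks so the whole file is never held in memory at once.
        with gzip.open(filein, "rb") as src, open(fileout, "wb") as dst:
            shutil.copyfileobj(src, dst)
    except (IOError, OSError) as e:
        print("Error %s - %s" % (filein, e))
        # Remove any partial output so a failed file leaves nothing behind.
        if os.path.exists(fileout):
            os.remove(fileout)
        return False
    # Both files are closed by now, so the original can be moved safely.
    shutil.move(filein, fileend)
    return True
```

The move replaces the separate copy and delete, and a file that fails stays where it was in the source folder, which keeps the "anything left over failed to process" behaviour.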
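Above is our final section of code – if I were in Java I would call this the main function. Lines 41, 42, and 43 set up the folders that we are going to use. An interesting note here is that we are only doubling up the last backslash, where in a language like Java we would have to double up all of them. This works in Python 2.7 because \I, \U, and \O are not escape sequences in its ordinary strings, so those backslashes are kept as typed; only the final one has to be doubled so that it doesn’t escape the closing quote. It reduces the length of the line and makes it clearer, so I appreciate the lack of additional escape characters, but be aware that in Python 3 \U starts a Unicode escape, so the zipfolder line would fail to compile there.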
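If you would rather not depend on which letters happen to follow a backslash, here is a small sketch of spellings of the same folder path that don't rely on it. The variable names are just for illustration, and the forward-slash form assumes the Windows file functions accept either kind of slash, which is generally the case but worth checking on your own machine.

```python
# Three spellings of the same Windows folder path.
doubled = "C:\\IMDB\\Unzip\\"      # every backslash escaped
raw = r"C:\IMDB\Unzip" + "\\"      # raw string; it can't end in a lone backslash
forward = "C:/IMDB/Unzip/"         # forward slashes need no escaping at all

print(doubled == raw)
print(forward.replace("/", "\\") == doubled)
print(len("\\"))                   # an escaped backslash is a single character
```

All three describe the same folder; the doubled and raw forms produce exactly the same string.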
Line 44 – calls the function defined earlier called getfiles – and sends it the infolder parameter.
The next statement checks if the zipfolder exists – and if it doesn’t then it creates it.
The statement after that is virtually identical, except that it does it for the archivefolder – where we will store the files after they have been processed.
Finally, we have a standard for loop – for every element in the fs (list of files) it runs the decompress function (which we discussed earlier).
Then, well, we are done with the program. It has served the purpose for which it was designed. This code could be adapted in many ways for many different situations – for ETL tool type functionality.
That concludes the first four steps. Step 5 will be repeated over and over again for 48 files, so it may be covered in many posts. Still, if there is standardization in the file formatting (even if the standard is not pleasant), after solving the first file there should be a lot of code reuse.
Update and final run instructions
To run the code you need to save your file and open the Windows command prompt. Click on “Ask me anything” on your Windows 10 task bar and type cmd – you can either press [enter] or double-click the top of your results.
When you open the command prompt (as shown above) it starts you in the directory of the user currently logged in to the computer. Type in cd.. [enter] to move back one level in the command prompt. This moves you to the Users area. Type in cd.. [enter] again and this places you at the root directory. Next we will move to the directory where we saved our code. Type cd IMDBCode [enter].
Now, if the path is set correctly in your computer for Python 2.7 – you can run the Python code by typing in the file name – IMDB_InitialDataProcess.py [enter] and the program will begin to run. Assuming the files are in the C:\IMDB folder – they will be uncompressed, saved as text files and moved/deleted.
Just to show the expected output:
This folder (IMDB) used to contain 48 files – now it contains two directories. All the files have been deleted. If there were any files remaining those files would be files that failed to be processed.
Above – we can see the contents of the “Original” folder under IMDB – contains 48 files – with a disk space of 1.71 GB.
Now, finally, we can see that the “Unzip” folder contains 48 files – with a disk space of 6.89GB.
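If you would rather check those counts and sizes from Python than from File Explorer, here is a minimal sketch. The function names folder_stats and report are made up for this example, and the size is reported in units of 1024 × 1024 × 1024 bytes on the assumption that this is what File Explorer means by GB.

```python
import os

def folder_stats(folder):
    """Return (number of files, total size in bytes) for one folder, no sub-folders."""
    count = 0
    total_bytes = 0
    for name in os.listdir(folder):
        full = os.path.join(folder, name)
        if os.path.isfile(full):
            count += 1
            total_bytes += os.path.getsize(full)
    return count, total_bytes

def report(folder):
    count, total_bytes = folder_stats(folder)
    print("%s: %d files, %.2f GB" % (folder, count, total_bytes / float(1024 ** 3)))

# Example calls, assuming the folders used in this article exist:
# report("C:\\IMDB\\Original\\")
# report("C:\\IMDB\\Unzip\\")
```

Running report on the source folder should show zero files if everything was processed.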