Happy Christmas From Your AI Overlords

Happy Christmas!

Merry Christmas!

Happy Holidays!

Your AI overlords are here!

And they are coming for IT jobs, too.

Some people hypothesize that creative jobs will survive the AI onslaught, and that people should therefore concentrate on those fields. This is factually incorrect. Music, painting, and the ability to make inferences from human-created evidence (Google’s AI Drawing Game) show that AI is not only able to be creative, but that it can be inferential on a level equivalent to humans.

However, being able to create art does not make an overlord – unless you are Hitler.

The first link is in regard to a company that is transforming daily business activities by basically replacing middle management with an AI ratings system. In essence, it allows employees to rate and critique each other, and it stores and analyzes that data. This emphasizes the role of people as “cogs” in a machine or a clock – like in the movie “The Incredibles“.

Bridgewater even reports that one-fifth of its hires cannot handle a year at the company, and those who do survive are often found crying in the bathrooms.

This statement in the article seems to indicate that crying in the bathroom of an employer is a result of AI management. I have seen management make people cry in the workplace – with no AI involvement at all. I have seen management bully people into working excessive hours with no AI assistance.

I’m not sure I see a difference between a human manager making a person cry at work or an AI manager making a person cry at work.

As far as the creative arts are concerned – well, I think it would be an easy computer algorithm to “assist” artists in making better art. I’m envisioning the “Black Mirror” episode where people’s like ratings directly impacted their daily lives. Why not set up Facebook pages – and you’ll know whether the next piece of art you create is better than the last by the number of likes you get? Instagram and a little data analysis would work even better.

So, say hello to your AI overlords – they are already here – a bit earlier than expected.

Don’t Thank God, Thank AI

There is a lot of strife out there – when things go well medically, people generally like to thank god. When things go badly, we sue the doctor, the hospital, the insurance company, or any company even remotely related to the procedures involved.

Now, people might try to sue IBM’s Watson, or Enlitic’s software for diagnosing lung cancer. I suspect, though, that they will go after these pieces of AI software less often than they currently pursue malpractice lawsuits. Early detection is the best method for treating lung cancer – and if Enlitic’s software can detect it better than humans can, then more people have a chance at surviving lung cancer.

Maybe, just maybe, we’ll start thanking AI for saving our lives rather than god. At least AI might have more of a personal hand in saving your life. We don’t tend to thank tools for saving us – nor the operators of those tools, such as ultrasound devices and ultrasound techs. So, I won’t hold my breath, but I’ll be happy if AI or an ultrasound tech saves my life.

In the future, there may also be a question of whether you should thank AI, if it attains enough intelligence to be treated as a sentient being.

IMDB Project Update – SQL Server

Well, that happened.

My database with the raw data for the IMDB project exceeded 10GB – the size limit of the free license for SQL Server Express 2016.

This isn’t a big deal – although it stopped me for a bit. I found a place that was selling developer editions; however, it turns out that you can join a free Microsoft membership program to get the developer edition.

Since I am basically doing development, this is what I need – and what you need to do if you are doing something like this as well (at least locally, on your computer – you could instead use Azure or SQL Server on a website, which tends to be free on GoDaddy, although I’m not sure whether GoDaddy imposes SQL Server database size limitations).

You need a Microsoft sign-in (which I already have because I own a license to Microsoft Office 365) and then add “Visual Studio Dev Essentials“ to your account. This allows you to install many Microsoft development tools (including SQL Server 2016 Developer Edition) for free.

In the end, though, I didn’t actually need to install or download anything new. Restart the installer you used to install SQL Server Express -> go to Maintenance -> Edition Upgrade -> select “Developer” from the drop-down, and this works.

Well, it works after a couple of restarts. Then you can go to the database in SSMS (SQL Server Management Studio), right-click on your database -> click Properties -> click Files – and update the size of your database there. If your database is set to autogrow, you may not even have to do this – it should work appropriately now.
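If you want to confirm the upgrade took from code rather than SSMS, a quick query of the server edition does it. This is a minimal sketch – I’m assuming the pyodbc driver and a local default instance with Windows authentication, so adjust the connection string to your setup:

#Sketch: confirm the Developer edition upgrade from Python.
#Assumes pyodbc and a local default instance with Windows authentication.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=master;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("SELECT SERVERPROPERTY('Edition')")
print cursor.fetchone()[0]  #expect something like 'Developer Edition (64-bit)'
conn.close()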

I did get some funky errors and had to reboot a couple of times because the SQL Server service would not stop. Annoying, but not piss-you-off level annoying.

IMDB – Project Update

I’ve read the entire actors.list.txt file. Just a quick second to bitch about the data.

When you have a data file, all it should contain is the data. At the top of this file (and, I assume, all of their files) are licensing terms and instructions for reading the file. These should be in a separate actors.readme file.

But then…. Then. I see the end of the file. It has a whole lot of instructions on how to send them updates to the files and add new information to the files, in exhaustive detail.

I appreciate the instructions; however, the file is gigabytes in size and most normal people cannot even open it. Again, this information should be in an actors.readme file. Now, the program has to watch out for non-data information at the top of the file as well as at the bottom.

Which is OK. I like to bitch about it. But since I’ve loaded the entire file as raw data into a database, I can quickly locate these kinds of problems and just delete them from the raw data. It is frustrating, however, to have this added task – and it is probably in all the files as well.
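To give a picture of what that cleanup looks like – a sketch, assuming a RAW_data table with a line_id identity column that preserves file order, and boundary line numbers you have eyeballed yourself (the values below are placeholders, not the real ones):

#Sketch: delete the licensing header and the update-instructions footer
#from the raw-load table. RAW_data, line_id, and the boundary values are
#assumptions - eyeball your own load before running deletes.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=IMDB;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("DELETE FROM RAW_data WHERE line_id < ?", 239)       #header junk
cursor.execute("DELETE FROM RAW_data WHERE line_id > ?", 20100000)  #footer junk
conn.commit()
conn.close()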

Choose Your Database Wisely and my IMDB project

I am currently learning a lot about Python. Python is one of a suite of free tools that can be used for data analysis. Courses teach you how to interact with a database with the least effort possible, and they hype the fact that SQLite support is native and built in to Python’s core libraries.

In Python, the database of least effort is SQLite. So, armed with my new knowledge, I charged forward using Python on my desktop against a SQLite database holding millions of records.

I was so “in love” with the idea of all these free tools that I forgot about the free desktop tool I had used before – Microsoft SQL Server Express. As I estimated how long it would take to process 20.4 million records, I was a little disappointed in SQLite, but unsure whether it was causing the performance problems. Perhaps it was the program’s disk use (100%).

There really wasn’t much to the program, though, so if the problem was file access throttling the hard drive, I was fairly stuck.

Eventually, even after processing millions of records, I decided to install my old standby – Microsoft SQL Server Express – which, like SQLite, is free. These are two vastly different databases. SQLite is what it says – lite – and can be used on a wide variety of portable devices. SQL Server has a specific product for smaller or compact devices, but SQL Server Express is basically full SQL Server minus some features.

The people at Microsoft have put thousands of hours of programming into making the SQL Server engine perform. I was quickly able to rip through a million records into the SQL Server database, using virtually the same code I had been using against SQLite. Python was not the problem.
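“Virtually the same code” is close to literal, because both drivers follow the Python DB-API. A sketch of the insert loop (the raw_lines table name and the pyodbc connection string are my assumptions, not IMDB’s):

#Sketch: the same DB-API insert loop works against SQLite and SQL Server.
import sqlite3
#import pyodbc

conn = sqlite3.connect("imdb_raw.db")
#Swapping engines is one line - everything below stays the same:
#conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
#                      "DATABASE=IMDB;Trusted_Connection=yes")
cursor = conn.cursor()
#SQLite-flavored create; the SQL Server equivalent differs slightly
cursor.execute("CREATE TABLE IF NOT EXISTS raw_lines (line TEXT)")
with open("actors.list.txt") as f:
    for line in f:
        cursor.execute("INSERT INTO raw_lines (line) VALUES (?)",
                       (line.rstrip("\n"),))
conn.commit()
conn.close()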

Falling in love with a way of doing things – with the tool for doing things – was the problem. Now, instead of multiple computers with multiple instances of the program running, I have one program running on one computer. In a relatively short period of time (I can be more precise later, with the datetime stamps) it has already processed 4.5 million records.

Now, instead of considering long lengths of time just to process the actors.list.txt file, I can consider getting through all the files in a relatively short period of time and focus on analyzing the data and figuring out master data structures.

Just to give a picture: against SQLite it took 1 minute 25 seconds to process 1,000 records from the actors.list.txt file into the SQLite database. It is hard to even time 1,000 records against Microsoft SQL Server Express – it takes 10 seconds to process 10,000 records.

I get no money from Microsoft or anyone else (no ads on my site at present). These are just the cold hard facts. Free tools are great and I would use them to teach people – SQLite is the path of least resistance to getting someone started on a database. Microsoft SQL Server Express involves installs, and it takes a serious commitment to get through the install process; however, the time invested is returned nearly immediately given the speed of storing information in the database.

Don’t fall in love with your tool!

IMDB Data Search Project Part 2 (project update – no code)

I have done some preliminary analysis of the files and written a small amount of code to transfer them from the original text to a SQLite database. My preference is SQL Server – even the free version – but I’m trying new things, and, well, SQLite is free.

There do seem to be some limitations with SQLite compared to what I am used to, even with the Microsoft SQL Express versions. Typically, I can have one database in Microsoft SQL Express, connect to it from multiple custom-written clients, and have them all write to the same database.

Now, I haven’t put a lot of energy into this with SQLite; however, at present it seems I can only have one client per database file.

At first I began processing the files with a program that would rip through each file, inserting the records into a RAW_data table – for later processing with Python into separate tables with data normalization applied.

I started on my laptop, processing the first file – the actors file – and it was working well. It was chunking through the data and inserting it into the SQLite database. The problem was that I didn’t know how long it was going to take. The actors file is 1.5 GB; however, since I am processing the files line by line, the file’s size in GB isn’t much of an indicator of how long it will take to process.

So, I wrote a second program – again, a very simple program. It ripped through all the files and gave me the number of lines in each file.
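A sketch of that second program – the folder path matches my setup, so adjust as needed:

#Sketch: count the lines in every file in the unzipped folder.
from os import listdir
from os.path import isfile, join

folder = "C:\IMDB\Unzip\\"
for name in sorted(listdir(folder)):
    full = join(folder, name)
    if isfile(full):
        count = 0
        with open(full) as f:
            for line in f:
                count = count + 1
        print count, name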

It turns out that by the point my patience was running low and I wrote this other program to get a progress report, I had processed over 400,000 records. For some reason my “gut” said it wouldn’t take long to process.

Well, the “gut” was wrong. Here is a list of the files and their lengths in number of lines:

20494819  actors.list.txt
12310518  actresses.list.txt
 2232725  aka-names.list.txt
 1028301  aka-titles.list.txt
11774007  biographies.list.txt
 2315726  business.list.txt
  688418  certificates.list.txt
 1651352  cinematographers.list.txt
 1914881  color-info.list.txt
   86222  complete-cast.list.txt
   49427  complete-crew.list.txt
 1378534  composers.list.txt
  476732  costume-designers.list.txt
 1968883  countries.list.txt
   75481  crazy-credits.list.txt
 3103707  directors.list.txt
 1845204  distributors.list.txt
 2121689  editors.list.txt
 2384782  genres.list.txt
   43686  german-aka-titles.list.txt
 1390862  goofs.list.txt
    2573  iso-aka-titles.list.txt
   55635  italian-aka-titles.list.txt
 6834098  keywords.list.txt
 1938923  language.list.txt
  357895  laserdisc.list.txt
  428031  literature.list.txt
 1086096  locations.list.txt
 1113667  miscellaneous-companies.list.txt
 7531399  miscellaneous.list.txt
 2906536  movie-links.list.txt
 4031167  movies.list.txt
   72460  mpaa-ratings-reasons.list.txt
 7771660  plot.list.txt
 7585527  producers.list.txt
 2571705  production-companies.list.txt
  546929  production-designers.list.txt
 5085511  quotes.list.txt
  691804  ratings.list.txt
 4678389  release-dates.list.txt
 1339115  running-times.list.txt
  631133  sound-mix.list.txt
 2700214  soundtracks.list.txt
   76752  special-effects-companies.list.txt
  500669  taglines.list.txt
 1796072  technical.list.txt
 3660623  trivia.list.txt
 4934220  writers.list.txt

As you can see, progress of 400,000 records is not impressive when the file contains over 20 million. So, I moved the program, files, and database file to two of my desktop computers. Then, using a “poor man’s” multithreading, I copied the Python program, ripped out the logic that moves from file to file, and set it up to start on a specific record. Then I divided up the file by different start points and launched multiple copies of the program on both computers.
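The start-on-a-specific-record change is just skipping lines until the start point. A sketch of the idea – START, STOP, and the process stub are placeholders for each copy’s real values and insert logic:

#Sketch: "poor man's" multithreading - each copy of the program works its
#own slice of the big file. Edit START and STOP per copy before launching.
START = 5000000    #first line this instance handles (placeholder value)
STOP = 10000000    #stop before this line - the next instance's START

def process(line):
    pass           #stand-in: the real program inserts the line into SQLite

with open("actors.list.txt") as f:
    for num, line in enumerate(f):
        if num < START:
            continue    #this line belongs to another instance
        if num >= STOP:
            break
        process(line)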

This is how I found out that databases can be locked, and that SQLite might not be as handy as SQL Server Express. However, it is very easy to make new SQLite databases, so this was a very minor hindrance. In addition, with a little research, it appears to be very easy to merge separate SQLite databases. At least the instructions look easy. In a later session I’ll describe the process of merging the SQLite databases.

At present, across the two computers and 7 instances of the program running, I have moved 5.7 million records to SQLite databases. Earlier I had estimated that it would take 49 days for the laptop to transfer just the records from the actors file to SQLite. Unacceptable!

Now, however, it shouldn’t take much longer than a week. In combination, I had my laptop working against the smaller files – and a number are already completed and ready for me to start processing into normalized data in the database.

Some minor irritations in the files:

  1. The first 100 to 200 or more lines are the people from IMDB talking about how you can’t make money from the files, can’t sue them for bad data in the files, and blah blah blah. I appreciate that there are things to say about licensing and such. Just don’t say it in the damned dirty data file.
  2. Some brainiac figured they could make smaller files by only including the key data (the actor’s name) in the first record. Subsequent records lack this piece of information until you hit a blank line; then the next actor starts. (See the parsing sketch after this list.)
  3. The blank lines. Really. Why would you do this? All you need is an actor name on the line to know you have started the next actor. You could use basic page-type logic, as used for printing, to process it without the extra lines.
  4. There are separate “actors” and “actresses” files. Universally, I refer to anyone performing in a theatrical production as an actor. Why there are two files for essentially the same data – except that the actors in one file are male and the actors in the other are female – is just crazy. You could have one file with the actor’s sex as a column of data. However, this does give us an opportunity for some ballpark analysis. The actors file has 20.4 million lines; the actresses file has 12.3 million. We can already see a disparity (at least based on total number of lines) between male and female actors having credited roles in theatrical productions. I look forward to getting this data into a database where I can do a more accurate analysis.
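Handling irritations 2 and 3 in code isn’t hard once you see the pattern. Here is a sketch – it assumes the name and titles are tab-separated and that the licensing header and footer have already been stripped, so verify against your own copy of the file:

#Sketch: walk the actors file, carrying the actor name forward across
#continuation lines. Tab-separated layout is an assumption - verify it.
current_actor = None
with open("actors.list.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        if line == "":
            current_actor = None        #blank line: this actor is done
            continue
        if not line.startswith("\t"):
            if "\t" not in line:
                continue                #junk line - skip it
            #new actor: name first, then the first title after the tab
            current_actor, title = line.split("\t", 1)
        else:
            title = line                #continuation line: title only
        print current_actor, "|", title.strip()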

So, the line estimates are off – there aren’t that many lines of actual information in the files, and we have no way to know how many lines of real data there are, because an actor can be in 1 movie, show, or episode – or in hundreds – and there is no real way to estimate this. It isn’t a big deal, just a minor irritation. Once I have the whole file loaded, I can run a simple query to learn how many blank lines were in the file.
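Something like this – RAW_data and its line column are my assumed names:

#Sketch: count the blank lines that made it into the raw table.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=IMDB;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM RAW_data WHERE line = ''")
print cursor.fetchone()[0]
conn.close()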

While on the topic of criticizing the files and their contents… spaces do take up space in a file. Not repeating the names on each line may save a little space; however, this could have been done in another way that would have saved far more.

Since I am a database developer – one of the principles of databases is not to repeat data. You end up with a file of “actors” (or a table in your database) and a number representing each of those actors. An integer (4 bytes in SQL Server) covers values just over 2 billion, and there aren’t 2 billion actors. Judging by the mixed master data file of actors (actors and the shows they are in) being only 20.4 million records (including those stupid blank lines), it will be centuries before we have 2 billion actors.

Then, in every line of a separate file relating actors to theatrical productions, the integer plus the theatrical production name will almost always be smaller than each line in its present form – unless the name, including spaces, is very short.

Now, if we extend this outward, there could be a theatrical production name table with integer ID numbers. Again, since we are far from having over 2 billion theatrical productions, a 4-byte integer is no problem as an identifier for all theatrical productions.

Then this file that I am processing – the “actors” file – would simply be two integer values separated by a delimiter of some sort. More than likely, as I go through processing the data, I will be creating an actors table with integer ID numbers, a theatrical production master data table, and then what is presently the “actors” file will become a cross-reference or crosswalk table.
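Concretely, the target structure would look something like this – all table and column names are just my working assumptions at this point:

#Sketch: the normalized target schema - an actors table, a productions
#table, and a crosswalk relating them. All names are working assumptions.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=IMDB;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE Actors (
        actor_id INT IDENTITY(1,1) PRIMARY KEY,
        actor_name NVARCHAR(255) NOT NULL,
        sex CHAR(1) NOT NULL          -- one file, not two: sex as a column
    )""")
cursor.execute("""
    CREATE TABLE Productions (
        production_id INT IDENTITY(1,1) PRIMARY KEY,
        production_name NVARCHAR(400) NOT NULL
    )""")
cursor.execute("""
    CREATE TABLE ActorProduction (    -- the crosswalk: two 4-byte integers
        actor_id INT NOT NULL REFERENCES Actors(actor_id),
        production_id INT NOT NULL REFERENCES Productions(production_id),
        PRIMARY KEY (actor_id, production_id)
    )""")
conn.commit()
conn.close()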

I will publish these tables, most likely with license limitations like IMDB’s – you can’t sue me for inaccuracy and you can’t make money using the tables.

Ok, well, just a little status on the smaller files:

86222  complete-cast.list.txt – Completed 11/29/2016
49427  complete-crew.list.txt – Completed 11/29/2016
75481  crazy-credits.list.txt – Completed 12/02/2016
43686  german-aka-titles.list.txt – Completed 11/29/2016
 2573  iso-aka-titles.list.txt – Completed 11/30/2016
55635  italian-aka-titles.list.txt – Completed 11/30/2016
72460  mpaa-ratings-reasons.list.txt – Completed 11/29/2016
76752  special-effects-companies.list.txt – Completed 11/29/2016

So, things are proceeding – and while the actors file does its initial processing, I’ll be able to research these files, get them into better formats, and load them up into the database with final row counts.

IMDB Data Search Project

There are some things that you do almost every day that are really annoying. So you are watching show A. You see an actor. You know that face. You know that voice.

Finally, it occurs to you that you know what show you have seen that person in. You know show B.

So, you hop onto IMDB (Internet Movie Database), look up show B, and go to the cast – the whole cast, because the actor isn’t one of the top names. You still don’t know their name, so you start clicking on cast members and eventually – aha – you have found them. Then you scan their filmography to see if they are in show A.

It takes a bunch of time. I know, it isn’t exactly a world-crushing problem. But I do watch shows on TV, and for some reason it really bothers me when I know I’ve seen an actor’s face before and I can’t figure it out. Sometimes I’ll figure it out days later – because I didn’t look it up, or whatever.

Despite the name, IMDB isn’t all that great to search. It is good at single searches – show me this show, show me everything this actor was in, etc. But if you have to do something equivalent to a SQL join to find out about X and Y at the same time, it fails you.

How to fix this? I don’t necessarily want to replace IMDB. What I would like is to find information quickly and easily with specialized queries. IMDB isn’t really optimized for searching, either. There are lots of graphics and other stuff that all take time to load. I know “the Internet is fast”, but it isn’t always – and even when it is, it still takes time to load all the images they are in love with at IMDB. Google’s page, by contrast, is optimized for querying.

So, that’s my charter. Nice and simple. Make IMDB searchable and really useful for finding out information. And maybe do some data analysis on the side.

What are the steps to this project?

  1. Set up an account at IMDB
  2. Find the data (if available – and fortunately for us it is available)
  3. Download the data – using account information as necessary
  4. Expand the Data – the data sources are compressed .gz files
  5. Data cleansing – and unfortunately the data isn’t in that great a form
  6. Load the data into a SQL Database – I’m a huge fan of SQL for data manipulation
  7. Develop an understanding of the data and the types of questions people ask – for example, who are the people that overlap show A and show B (see the sketch after this list)
  8. Develop queries to deliver the information
  9. Develop web pages to run the queries and a web page to guide the users to the right place
  10. Done – well, easier said than done, certainly.
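To make step 7 concrete, here is the kind of query I mean, written against a normalized actors/productions/crosswalk structure – all table and column names here are assumptions about where I’ll end up, not anything IMDB provides:

#Sketch: "who overlaps show A and show B" - the SQL-join question that
#IMDB's own search can't answer. Table and column names are assumptions.
import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=IMDB;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("""
    SELECT a.actor_name
    FROM Actors a
    JOIN ActorProduction ap1 ON ap1.actor_id = a.actor_id
    JOIN Productions p1 ON p1.production_id = ap1.production_id
    JOIN ActorProduction ap2 ON ap2.actor_id = a.actor_id
    JOIN Productions p2 ON p2.production_id = ap2.production_id
    WHERE p1.production_name = ? AND p2.production_name = ?""",
    ("Show A", "Show B"))
for row in cursor.fetchall():
    print row.actor_name
conn.close()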

This article will cover steps 1, 2, 3, and 4. It involves a surprisingly small number of lines of code.

Now, of course, everything changes. In this era things change quickly, so some things in this article might not be exactly as I have them here, but if you are adept you can figure your way through it.

#1 Set up an account on IMDB.

Now, you can sign in with Facebook and other services, but I recommend setting up a proper IMDB account – that way you can type in the login data if the FTP site requests it.

[Screenshot: imdb_initial_page – the IMDB home page with the login area circled]

The circled area in the top right of the screenshot above of the IMDB site is the login area. Click on “Other Sign in Options”.

[Screenshot: imdb_createaccount – the sign-in page with the “Create a New Account” button circled]

Click the circled “Create a New Account” button.

[Screenshot: imdb_createaccount2 – the create-account form]

I think at this point in time – we all know what to do on a create account page like this – even if we are not programmers!

When you have entered your data, press the “Create your IMDb account” button and it should return you to the IMDB start page. You do not need a “Pro” membership and you don’t need to spend any money.

#2 Find the Data

Now, in this case I did some Google searches before I located the data. Some people will write web crawlers and rip through the pages of a website. In this case, however, the IMDB organization does not permit you to do this. They can’t necessarily stop you easily, but you should respect the rules of your data source. They may not allow you to rip through all their web pages to gather data (which is a pain in the butt anyway), but they do provide you with the data.

Go to this link (technically an FTP site), which brings you here:

[Screenshot: imdb_ftp – the FTP directory listing]

#3 Download the data

Now, if there were hundreds of files, perhaps we would develop a program to pull them down automatically; but given the relatively small number – around 48 files – a little manual downloading isn’t going to hurt. Even repeated once a month, it isn’t much of an ordeal, and we can always create an automated download program later. Perhaps a later update will download the files directly from the FTP site; however, this could present some difficulties: as I manually downloaded the files, the site would periodically ask for credentials, but it wouldn’t ask every time. That makes it hard to code. At this point, hard to code and providing little benefit means there is no reason to code this.

Once downloaded locally to your Downloads folder, collect these files into a folder called IMDB – in the case of my program’s assumptions, C:\IMDB.

I’m on Windows 10, so the screenshots and code may be somewhat specific to Windows. It shouldn’t be too bad. Below are the original .gz compressed files in File Explorer and cmd.

[Screenshot: imdb_foldercontents – the .gz files in Windows File Explorer]

(NOTE: Only a partial listing in the Windows File Explorer.)

[Screenshot: imdb_cmd_directory – the .gz files listed in cmd with dir /w]

(NOTE: The flexibility of using the command line in Windows. The dir /w command inside a directory/folder allows you to see more files at one time. You can do this in Windows File Explorer as well; however, it isn’t as space-efficient (shown below).)

[Screenshot: imdb_foldercontents2 – the same listing in Windows File Explorer]

To get to the command line in Windows 10, press the Windows button (yes, it is really just the Start button renamed) and type cmd in the “Ask me anything” entry. Then navigate to the C:\IMDB directory by typing cd.. at the command line until you are at C:\ and then typing cd IMDB.

This isn’t really intended to teach the command line. I’m writing a couple of things because it is handy – and we will need it later in order to run our Python program. So, how to navigate the command line and execute a Python program is upcoming.

#4 Expand the Data

Now we come to the heart of this posting – the programming. The language I have chosen is Python. I’m sure there are lots of reasons to choose Python, but I’m choosing it for one main reason – compact and easy code.

I’m using Python 2.7 for this article. This code may work in Python 3 and it may not – I’m not familiar with Python 3 at present. You can easily find how to install Python 2.7 on the web; this link is a good place to start. When you install it, you must add it to your path if you are on Windows. In my experience the install program doesn’t seem to add Python’s install directory to the path properly, and I have had to update it manually. Adding a path is a relatively simple task, so I’ll leave it to you. Remember, frustration is often a part of the application development process.

The development platform is also a question. On Windows 10 I suggest you choose Notepad++. In a course that I took on Coursera this was the recommendation, and I fought it for a while by using EditPlus 3, but in the end Notepad++ has worked better for me.

You can easily search for Notepad++ on the web and install it – it is a free application. Here is a link to help you get going – if it still works by the time you read this article.

Once you have Notepad++ installed, open it, create a new file (the new-file icon on the toolbar), and then add the following source code (the source code will be explained):

NOTES ABOUT PYTHON CODE (when copy pasting or editing):

  1. Indenting by spaces indicates that the code is part of a coding block – or scope
  2. The # sign is a comment character
  3. In your Notepad++ editor, make sure indents are set to a number of spaces or your scoping will not work properly
#Build a Better IMDB - AJ 20161126
#need to process files - original .gz files from IMDB
# 0) May make code to automate downloading of the files and do a difference
# 1) Get list of files in the IMDB director
# 2) Create a Date Folder, Original File Folder under the Date Folder, Unzipped File Folder
# 3) Main Process
# a) Unzip File - and save copy in the Unzipped Folder
# b) Copy Original File to Original File Folder
# c) Loop to next file
import gzip
import os
from os import listdir
from os import path
from os import makedirs
from os.path import isfile, join
from shutil import copyfile

def getfiles(homefolder):
    files = [f for f in listdir(homefolder) if isfile(join(homefolder, f))]
    return files

def decompress(filein, fileout, fileend):
    try:
        with gzip.open(filein, 'rb') as f:
            file_content = f.read()
        #print file_content
        f.close()
        print fileout
        outf = open(fileout, 'wb') #'wb': write the bytes exactly as read
        outf.write(file_content)
        outf.close()
        print "filein: " + filein
        print "fileout: " + fileend
        copyfile(filein, fileend)
        os.remove(filein)
    except OSError, e:
        print ("Error %s - %s" % (e.filename, e.strerror))
    except:
        pass #skip anything else (e.g. a non-gzip file) and leave it in place

#main program logic/data
infolder = "C:\IMDB\\"
zipfolder = "C:\IMDB\Unzip\\"
archivefolder = "C:\IMDB\Original\\"
fs = getfiles(infolder)

#setup the folders
if not path.exists(zipfolder): makedirs(zipfolder)
if not path.exists(archivefolder): makedirs(archivefolder)
# Loop through files and dump to screen
for file in fs:
    decompress(infolder + file, zipfolder + file.replace(".gz",".txt"), archivefolder + file)

Save the file in Notepad++ using the save-file icon.

[Screenshot: notepadplusplussavefiledialog – the Notepad++ save-file dialog]

In the save-file dialog, click the create-folder icon (circled in red above, in the upper right corner of the dialog). A new folder will show up – type in IMDBCode (or whatever you want to call it). I’m using the root of the C drive for both my IMDB data folder and the IMDBCode folder.

[Screenshot: notepadplusplussavefiledialog2 – the save-file dialog with the new IMDBCode folder]

Once you have your folder created, click into it and name the code you copied in IMDB_InitialDataProcess.py. It is very important to include the file extension – the Notepad++ editor will then recognize the file as Python code and properly color the text (which is always nice).

OK, now I’m going to show a couple of screenshots of how the code should look before going into detail about the code. This is because of the whole indent-indicating-scope thing. You could type the code (or copy it from above) and it might look perfect, and it still won’t work.


[Screenshot: imdb_initialdata_sourcecode1 – the imports and the getfiles function]

OK. So, the code above starts with the relatively easy to understand and moves to something that is going to take a bit of work. First off, the green lines are comments – with the leading #. It isn’t a pound sign. It isn’t a sharp. It is just a number sign – shift+3.

Line 10 starts a block of import statements. If you are familiar with .NET programming languages or Java, then you understand what an import does. If not, here is a brief explanation.

All code depends on libraries. We write code on the shoulders of giants, and we use the giants’ code in our own. Basically, someone writes a whole suite of functionality and then we use it. There is something I would call base functionality – functionality you get without any import statement. Other libraries extend the functionality of the language for a specific area.

So, we import gzip to gain the functionality to compress and uncompress files. Now, we can be more precise in what we import. On line 12 we write “from os import listdir” – this allows us to use listdir directly, without preceding it with the library and sub-library names. This makes for concise lines of code.

On line 15 we have “from os.path import isfile, join” – which takes two pieces of the library’s functionality at once. Of course, this means we could have reduced lines 12, 13, and 14 to a single line. We could potentially have reduced the import lines even further, since line 15 imports additional pieces of the os library.

Libraries can make programming languages extremely versatile.

The next segment of code starts on line 18: “def getfiles(homefolder):”. This is the definition of a function. The function takes one parameter – homefolder – and returns a list of the files in that folder.

But, sadly, the code I found on the web, while beautiful and concise, is, well, not very easy to read.


files = [f for f in listdir(homefolder) if isfile(join(homefolder, f))]

(The above is a single line of code with line wrap.)

What the above line says is: give me the directory listing for this folder; then check whether each entry is a file and return only the files – not the directories. This is much more compressed code than I would write. I’ll get to this level after a while – it just takes comfort and more experience writing Python. I basically use this code as a black box, since I know what it does and I know it works. Below is how I would have written it:

[Screenshot: imdb_getfiles2 – the expanded getfiles function]

You can see there are a whole lot more lines of code in this version of the function. For every line of code a programmer writes, there is typically at least one error (depending on the complexity of the line). In the code above, the first line (after the function def-inition) gets the list of objects in the directory parameter. Then I define an empty list. For every object in the directory list, I check whether it is a file; if it is, I add it to the files list. Finally, I return the files list.
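Spelled out in code, that longer version looks roughly like this (reconstructed from the description above):

#The longer, step-by-step version of getfiles described above.
from os import listdir
from os.path import isfile, join

def getfiles(homefolder):
    objects = listdir(homefolder)          #every entry: files and folders
    files = []                             #start with an empty list
    for obj in objects:
        if isfile(join(homefolder, obj)):  #keep only the files
            files.append(obj)
    return files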

[Screenshot: imdb_initialdata_sourcecode2 – the decompress function]

Now, the nature of Python is that you define a lot of functionality in a program before you use it. Previously we defined a function to return all the files in a directory. This next function is more complex. It performs these tasks:

  1. Unzip the file
  2. Read the unzipped file and close the unzipped file
  3. Write the unzipped file to the output folder and close it
  4. Copy the original file to the processed folder
  5. Delete the file from the source folder

These pieces of functionality are wrapped in a try..except clause. When you deal with files, errors can happen, and if you don’t catch them your program will crash. In this case we are just printing an error statement – or just passing. If I somehow downloaded a file (or a program loaded a file) that wasn’t a zipped file, we would skip it – and skip moving and deleting it – so that we could see which files failed to process. The with statement is great for file processing: when the block of code ends, it automatically closes the file. In this case, the gzip operation opens the zipped file and makes the stream available to the program as the variable “f”.

On line 25 we read the entire (uncompressed) file contents into a variable. Line 27 explicitly closes the compressed file – the with statement closes it when its block ends, but we want to be certain it is closed, because we later copy the original file and then delete it from its original location, and we can’t do that if the file is open.

There are a few print statements to display information for those times you are watching the program to make sure it is operating correctly. Line 29 opens a new file for writing; on line 30 we write the uncompressed file contents to the new file; and on line 31 we close this new file. Line 34 copies the original file to the processed folder. Finally, line 35 deletes the file that was processed.

[Screenshot: imdb_initialdata_sourcecode3 – the main program logic]

Above is our final section of code – if I were in Java I would call this the main function. Lines 41, 42, and 43 set up the folder paths that we are going to use. An interesting note here is that we only double up the last backslash (the other single backslashes don’t form valid escape sequences, so Python leaves them alone). It reduces the length of the line and makes it clearer, so I appreciate the lack of additional escape characters.

Line 44 calls the function defined earlier, getfiles, and passes it the infolder parameter.

Line 45 checks whether the zipfolder exists – and creates it if it doesn’t.

Line 46 is virtually identical to line 45, except that it does the same for the archivefolder – where we will store the files after they have been processed.

Finally, we have a standard for loop: for every element in fs (the list of files), it runs the decompress function (which we discussed earlier).

Then, well, we are done with the program. It has served the purpose for which it was designed. This code could be adapted in many ways for many different situations – for ETL-tool-type functionality.

That concludes the first four steps. Step 5 will be repeated over and over for 48 files, so it may be covered in many posts. Still, if there is standardization in the file formatting (even if the standard is not pleasant), then after solving the first file there should be a lot of code reuse.

Update and final run instructions

To run the code you need the Windows command prompt. Click on “Ask me anything” on your Windows 10 task bar and type cmd – then either press [enter] or double-click the top result.

[Screenshot: commandpromptopen – opening cmd from the task bar]

[Screenshot: cmd_runcodeimdb – running the program in cmd]

When you open the command prompt (as shown above), it starts you in the directory of the user currently logged in to the computer. Type cd.. [enter] to move back one level, to the Users area. Type cd.. [enter] again and you are at the root directory. Next we will move to the directory where we saved our code: type cd IMDBCode [enter].

Now, if the path is set correctly on your computer for Python 2.7, you can run the Python code by typing in the file name – IMDB_InitialDataProcess.py [enter] – and the program will begin to run. Assuming the files are in the C:\IMDB folder, they will be uncompressed, saved as text files, and moved/deleted.

Just to show the expected output:

[Screenshot: dataprocess_endproduct1 – the IMDB folder after processing]

This folder (IMDB) used to contain 48 files – now it contains two directories. All the files have been deleted (after being copied to the Original folder). If any files remained, those would be files that failed to process.

[Screenshot: dataprocess_endproduct2 – the contents of the Original folder]

Above, we can see that the “Original” folder under IMDB contains 48 files, taking 1.71 GB of disk space.

[Screenshot: dataprocess_endproduct3 – the contents of the Unzip folder]

Now, finally, we can see that the “Unzip” folder contains 48 files, taking 6.89 GB of disk space.