IMDB Project Update – SQL Server

Well, that happened.

My database with the raw data for the IMDB project exceeded 10GB – the maximum database size allowed under the free SQL Server Express 2016 license.

This isn’t a big deal – although it has stopped me for a bit. I found a place selling Developer Edition licenses; however, it turns out that a free Microsoft membership gets you the Developer Edition.

Since I am basically doing development, this is what I need to do – and what you need to do if you are doing something like this as well, at least locally on your computer. (You could instead use Azure, or SQL Server on a website host – it tends to be free on GoDaddy, although I’m not sure whether GoDaddy imposes SQL Server database size limits.)

You need a sign-in at Microsoft (which I already have because I own a license to Microsoft Office 365) and then add “Visual Studio Dev Essentials” to your account. This will allow you to install many Microsoft development tools (including SQL Server 2016 Developer Edition) for free.

In the end, though, I didn’t actually need to install or download anything new. If you rerun the installer you used for SQL Server Express -> go to Maintenance -> Edition Upgrade -> select “Developer” from the drop-down, this works.

Well, it works after a couple of restarts. Then you can go to the database in SSMS (SQL Server Management Studio), right-click on your database -> click Properties -> click Files – and update the size of your database there (the T-SQL equivalent is ALTER DATABASE … MODIFY FILE). If your database is set to autogrow, you may not even have to do this – it should just work now.

I did get some funky errors and had to reboot a couple of times because the SQL Server service would not stop. Annoying – just not piss-you-off-level annoying.

IMDB – Project Update

I’ve read the entire actors.list.txt file. Just a quick second to bitch about data.

When you have a data file – all it should have in it is the data. At the top of this file (and I assume all of their files) are licensing and instructions to read the file. This should be in a separate actors.readme file.

But then… Then I see the end of the file. It has a whole lot of instructions, in exhaustive detail, on how to send them updates and add new information to the files.

I appreciate the instructions; however, the file is GBs in size and most normal people cannot even open it. Again, this information should be in an actors.readme file. Now, the program has to watch out for non-data information at the top of the file as well as at the bottom.

Which is ok. I like to bitch about it. But since I’ve loaded the entire file as raw data into a database, I can quickly locate these kinds of problems and just delete them from the raw data. It is frustrating, however, to have this added task – and it is probably in all the other files as well.
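
Since I’m cleaning things up in the database anyway I haven’t bothered, but if you wanted to skip the junk at read time, the logic is simple. A sketch – the marker strings here are placeholders, since each IMDB list file has its own banner text, so you would set them per file:

```python
def data_lines(lines, start_marker, end_marker):
    """Yield only the lines between start_marker and end_marker.

    The markers are per-file placeholders: the licensing preamble
    runs until start_marker, and the submission instructions begin
    at end_marker.
    """
    in_data = False
    for line in lines:
        if not in_data:
            if start_marker in line:
                in_data = True   # data begins on the NEXT line
            continue
        if end_marker in line:
            break                # everything after is non-data
        yield line
```

With that wrapper in place, the loader just iterates `data_lines(open(path), ...)` instead of the raw file.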

Choose Your Database Wisely and my IMDB project

I am currently learning a lot about Python. Python is one of a suite of free tools that can be used for data analysis. Courses teach you how to interact with a database using the least effort possible, and they hype the fact that SQLite is native and built in to the core libraries of Python.

In Python, the database of least possible effort is SQLite. So, armed with my new knowledge, I charged forward using Python on my desktop against a SQLite database for millions of records.
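
The “least effort” path the courses teach really is short – a minimal sketch (the table and column names here are mine, not from any course or from my project code):

```python
import sqlite3

# In-memory database for illustration; the real project used a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (line TEXT)")

# Each row is a raw line from the source file.
rows = [
    ("Some Actor\tSome Movie (1999)",),
    ("Another Actor\tAnother Movie (2005)",),
]
conn.executemany("INSERT INTO raw_data (line) VALUES (?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0]
print(count)  # 2
```

No install, no server, no connection string – which is exactly why the courses lead with it.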

I was so “in love” with the idea of all these free tools that I forgot about the free desktop tool I had used before – Microsoft SQL Server Express. As I estimated how long it would take to process 20.4 million records, I was a little disappointed in SQLite, but unsure whether it was causing the performance problems. Perhaps it was the program’s disk usage (100%).

There really wasn’t much to the program, though, so if the problem was file access throttling the hard drive, I was fairly stuck.

Eventually, even after processing millions of records I decided to install my old standby – Microsoft SQL Server Express – which like SQLite is free. These are two vastly different databases. SQLite is what it says – lite – and can be used on a wide variety of portable devices. SQL Server has a specific product for smaller or compact devices, but SQL Server Express is basically full SQL Server minus some features.

The people at Microsoft have done thousands of hours of programming to make the SQL Server engine perform. Quickly, I was able to rip through a million records to the SQL Server database, using virtually the same code I had been using against SQLite. Python was not the problem.

Falling in love with a way of doing things – with the tool for doing things – was the problem. Now, instead of multiple computers with multiple instances of the program running, I have one program running on one computer. In a relatively short period of time (I can be more precise later, with the datetime stamps) it has already processed 4.5 million records.

Now, instead of considering long lengths of time just to process the actors.list.txt file, I can consider getting through all the files in a relatively short period of time and focus on analyzing the data and figuring out master data structures.

Just to give a picture: against SQLite it took 1 minute 25 seconds to process 1,000 records from the actors.list.txt file. Against Microsoft SQL Server Express it is too hard to even time 1,000 records – it takes about 10 seconds to process 10,000.
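
I can’t reproduce my original timings here, and I don’t know for certain this is what bit me, but one classic SQLite gotcha is committing after every row – each commit forces its own disk write. A sketch of how you would time the two insert styles (in-memory database, made-up rows):

```python
import sqlite3
import time

def time_inserts(batched):
    """Time 1,000 inserts, either one transaction or one commit per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_data (line TEXT)")
    rows = [("row %d" % i,) for i in range(1000)]
    start = time.perf_counter()
    if batched:
        # One transaction covering all 1,000 rows.
        with conn:
            conn.executemany("INSERT INTO raw_data (line) VALUES (?)", rows)
    else:
        # Commit after every row -- each commit is its own write.
        for row in rows:
            conn.execute("INSERT INTO raw_data (line) VALUES (?)", row)
            conn.commit()
    return time.perf_counter() - start

print("per-row commits:    %.4fs" % time_inserts(batched=False))
print("single transaction: %.4fs" % time_inserts(batched=True))
```

On a real on-disk SQLite file the gap between the two styles is usually dramatic; in memory it is much smaller, which is the point of measuring rather than guessing.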

I get no money from Microsoft or anyone else (no ads on my site at present). These are just the cold hard facts. Free tools are great and I would use them to teach people – they are the path of least resistance to getting someone started on a database. Microsoft SQL Server Express involves installs, and it takes a serious commitment to get through the install process; however, the time invested is returned nearly immediately given the speed of storing information in the database.

Don’t fall in love with your tool!

IMDB Data Search Project Part 2 (project update – no code)

I have done some preliminary analysis of the files and written a small amount of code to transfer them from the original text into a SQLite database. My preference is SQL Server – even the free version – but I’m trying new things and, well, SQLite is free too.

There do seem to be some limitations with SQLite compared to what I am used to, even with the Microsoft SQL Server Express versions. Typically, I can have one database in SQL Server Express and connect to it from multiple custom-written clients, all writing to the same database.

Now, I haven’t put a lot of energy into this with SQLite; however, at present it seems I can only have one client per database file.

At first I began processing the files with a program that would rip through each file, inserting the records into a RAW_data table – for later processing with Python into separate, normalized tables.

I started on my laptop, processing the first file – the actors file – and it was working well, chunking through the data and inserting it into the SQLite database. The problem was that I didn’t know how long it was going to take. The actors file is 1.5 GB; however, since I am processing line by line, the file’s size in GB isn’t much of an indicator of how long it will take.

So, I wrote a second program – again a very simple program. And it ripped through all the files and gave me the number of lines in each file.
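
That second program really was simple – something along these lines (my original isn’t shown here; this is a reconstruction):

```python
def count_lines(path):
    """Count lines without loading the whole file into memory."""
    total = 0
    with open(path, "rb") as f:   # binary mode: no decoding overhead
        for _ in f:
            total += 1
    return total
```

Run it over every *.list.txt file in the directory and you get the table below.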

It turns out that by the point my patience was running low and I wrote this other program to get a progress report, I had processed over 400,000 records. For some reason my “gut” said it wouldn’t take long.

Well, the “gut” was wrong. Here is a list of the files and their lengths in terms of number of lines:

20494819 actors.list.txt
12310518 actresses.list.txt
2232725 aka-names.list.txt
1028301 aka-titles.list.txt
11774007 biographies.list.txt
2315726 business.list.txt
688418 certificates.list.txt
1651352 cinematographers.list.txt
1914881 color-info.list.txt
86222 complete-cast.list.txt
49427 complete-crew.list.txt
1378534 composers.list.txt
476732 costume-designers.list.txt
1968883 countries.list.txt
75481 crazy-credits.list.txt
3103707 directors.list.txt
1845204 distributors.list.txt
2121689 editors.list.txt
2384782 genres.list.txt
43686 german-aka-titles.list.txt
1390862 goofs.list.txt
2573 iso-aka-titles.list.txt
55635 italian-aka-titles.list.txt
6834098 keywords.list.txt
1938923 language.list.txt
357895 laserdisc.list.txt
428031 literature.list.txt
1086096 locations.list.txt
1113667 miscellaneous-companies.list.txt
7531399 miscellaneous.list.txt
2906536 movie-links.list.txt
4031167 movies.list.txt
72460 mpaa-ratings-reasons.list.txt
7771660 plot.list.txt
7585527 producers.list.txt
2571705 production-companies.list.txt
546929 production-designers.list.txt
5085511 quotes.list.txt
691804 ratings.list.txt
4678389 release-dates.list.txt
1339115 running-times.list.txt
631133 sound-mix.list.txt
2700214 soundtracks.list.txt
76752 special-effects-companies.list.txt
500669 taglines.list.txt
1796072 technical.list.txt
3660623 trivia.list.txt
4934220 writers.list.txt

As you can see, progress of 400,000 records is not impressive when the file contains over 20 million. So, I moved the program, the files, and the database file to two of my desktop computers. Then, using “poor man’s” multithreading, I copied the Python program, ripped out the logic that moves from file to file, and set it up to start at a specific record. I divided the file up by different start points and launched multiple copies of the program on both computers.
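
The “start at a specific record” logic can be sketched like this. The encoding is an assumption (the IMDB list files appear to be Latin-1), and `handle` stands in for whatever insert routine each copy of the program runs:

```python
from itertools import islice

def process_chunk(path, start, count, handle):
    """Process `count` lines beginning at line `start` (0-based).

    Each copy of the program gets a different `start` -- that is the
    "poor man's multithreading" described above.
    """
    with open(path, encoding="latin-1", errors="replace") as f:
        for line in islice(f, start, start + count):
            handle(line)
```

Note that `islice` still reads (and discards) the lines before `start`, so copies with a high start point spend some time skipping forward before they begin inserting.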

This is how I found out that SQLite databases can be locked – SQLite locks the whole database file during a write – and that SQLite might not be as handy as SQL Server Express. However, it is very easy to make new SQLite databases, so this was a very minor hindrance. In addition, with a little research, merging separate SQLite databases looks very easy – at least the instructions do. In a later session I’ll describe the process of merging them.
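
For the curious, those easy-looking merge instructions boil down to ATTACH plus INSERT…SELECT. A sketch, assuming every database file has the same raw-data table (the function and table names here are mine):

```python
import sqlite3

def merge_into(main_path, other_path, table="raw_data"):
    """Append every row of `table` in other_path into the same table in main_path."""
    conn = sqlite3.connect(main_path)
    # Make the second database visible under the schema name "other".
    conn.execute("ATTACH DATABASE ? AS other", (other_path,))
    conn.execute("INSERT INTO %s SELECT * FROM other.%s" % (table, table))
    conn.commit()
    conn.execute("DETACH DATABASE other")
    conn.close()
```

Run it once per extra database file and everything ends up in the main one.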

At present, across the two computers – 7 instances of the program running – I have moved 5.7 million records into SQLite databases. Earlier I had estimated it would take 49 days for the laptop to transfer just the actors file records. Unacceptable!

Now, however, it shouldn’t take much longer than a week. I also had my laptop working against the smaller files – and a number of them are already completed and ready for me to start processing into normalized data in the database.

Some minor irritations in the files.

  1. The first 100 to 200 or more lines are the people from IMDB talking about how you can’t make money from the files, can’t sue them for bad data in the files, and blah blah blah. I appreciate that there are licensing terms and things to say. Just don’t put them in the damned dirty data file.
  2. Some brainiac figured they could make smaller files by only including the key data (the actor’s name) in the first record. Subsequent records lack this piece of information until you hit a blank line; then the next actor starts.
  3. The blank lines. Really. Why would you do this? All you need is an actor name in the line to know the next actor has started. You could use basic page-break logic, as in printing, to process it without the extra lines.
  4. There is a separate “actors” file and an “actresses” file. Universally, I refer to anyone performing in a theatrical production as an actor. Having two files for essentially the same data – except that the actors in one file are male and in the other female – is just crazy. You could have one file with the actor’s sex as a column of data. However, this does give us an opportunity for some ball-park analysis. The actors file has 20.4 million lines; the actresses file has 12.3 million. We can already see a disparity (at least by total line count) between male and female actors having credited roles in theatrical productions. I look forward to getting this data into a database where I can do a more accurate analysis.
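
Points 2 and 3 mean the parser has to carry the actor’s name forward until a blank line. A sketch of that carry-forward logic, assuming tab-separated fields (which is what the files appear to use):

```python
def parse_credits(lines):
    """Yield (actor, title) pairs from the layout described above.

    Assumes the actor name appears only on the first line of a block,
    continuation lines start with whitespace, and a blank line ends
    the block.
    """
    actor = None
    for line in lines:
        if not line.strip():
            actor = None             # blank line: next actor starts fresh
            continue
        if not line[0].isspace():    # new actor: name, tab(s), first title
            actor, title = line.split("\t", 1)
            yield actor, title.strip()
        elif actor is not None:      # continuation line: title only
            yield actor, line.strip()
```

The output is one fully-keyed (actor, title) pair per credit, which is the shape the database actually wants.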

So, the line counts overstate things – there aren’t that many lines of actual data in the files, and we have no way to know how many there are, because an actor can be in one movie, show, or episode – or in hundreds – and there is no real way to estimate that. It isn’t a big deal, just a minor irritation. Once I have the whole file loaded, I can run a simple query to count how many blank lines were in the file.
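
Once the raw file is loaded, counting the blank lines really is one query. A sketch against SQLite – the raw-data table name is assumed, and the in-memory rows here just stand in for the real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real raw-data database
conn.execute("CREATE TABLE raw_data (line TEXT)")
conn.executemany(
    "INSERT INTO raw_data VALUES (?)",
    [("Actor\tTitle",), ("",), ("\t\t",), ("Actor2\tTitle2",)],
)

# A line is "blank" if stripping spaces and tabs leaves nothing.
blank = conn.execute(
    "SELECT COUNT(*) FROM raw_data WHERE TRIM(line, ' \t') = ''"
).fetchone()[0]
print(blank)  # 2
```

Subtract that count from the total row count and you have the number of real data lines.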

While on the topic of criticizing the files and their contents… Spaces do take up space in a file. Not repeating the names on each line may save a little space; however, it could have been done in another way that would have saved far more.

Since I am a database developer: one of the principles of databases is not to repeat data. You end up with a file of “actors” (or a table in your database) and a number representing each actor. An integer (4 bytes in SQL Server) covers values up to just over 2 billion. There aren’t 2 billion actors – judging by the mixed master data file of actors and the shows they are in being only 20.4 million records (including those stupid blank lines), it will be centuries before we have 2 billion actors.

Then, in a separate file relating actors to theatrical productions, every line would be the integer plus the theatrical production name – and unless the actor’s name is very short (spaces included), that will always be smaller than each line in its present form.

Now, extend this outward: there could be a theatrical production name table with integer ID numbers. Again, since we are far from having over 2 billion theatrical productions, a 4-byte integer is no problem as an identifier for all of them.

Then this file I am processing, the “actors” file, would simply be two integer values separated by a delimiter of some sort. More than likely, as I go through processing the data, I will create an actors table with integer ID numbers, a theatrical production master data table, and then what is presently the “actors” file will become a cross-reference (crosswalk) table.
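
That crosswalk design can be sketched in a few lines of SQL driven from Python – all table and column names here are my own, not final:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE production (production_id INTEGER PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE actor_production (     -- the crosswalk table
        actor_id INTEGER REFERENCES actor,
        production_id INTEGER REFERENCES production
    );
""")

def add_credit(name, title):
    """Record one credit, reusing existing actor/production rows."""
    conn.execute("INSERT OR IGNORE INTO actor (name) VALUES (?)", (name,))
    conn.execute("INSERT OR IGNORE INTO production (title) VALUES (?)", (title,))
    conn.execute(
        """INSERT INTO actor_production
           SELECT a.actor_id, p.production_id
           FROM actor a, production p
           WHERE a.name = ? AND p.title = ?""",
        (name, title),
    )

add_credit("Astaire, Fred", "Top Hat (1935)")
add_credit("Astaire, Fred", "Swing Time (1936)")
```

After those two calls there is one actor row, two production rows, and two crosswalk rows – the name “Astaire, Fred” is stored exactly once.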

I will publish these tables, most likely with limitations like the IMDB license – you can’t sue me for inaccuracy and you can’t make money using the tables.

Ok, well, just a little status on the smaller files:

—-> 86222 complete-cast.list.txt <—- Completed 11/29/2016
—-> 49427 complete-crew.list.txt <—- Completed 11/29/2016
—-> 75481 crazy-credits.list.txt <—- Completed 12/02/2016
—-> 43686 german-aka-titles.list.txt <—- Completed 11/29/2016
—-> 2573 iso-aka-titles.list.txt <—- Completed 11/30/2016
—-> 55635 italian-aka-titles.list.txt <—- Completed 11/30/2016
—-> 72460 mpaa-ratings-reasons.list.txt <—- Completed 11/29/2016
—-> 76752 special-effects-companies.list.txt <—- Completed 11/29/2016

So, things are proceeding – and while the actors file does its initial processing, I’ll be able to research these files, get them into better formats, and load them into the database with final row counts.

How I achieved my MCP certification

My philosophy is that as a Technical Project Manager in IT – I need to know as much as I can about the technologies that I am managing.

My background is that I’ve been a programmer, a project manager, and what I would call a technology-to-business liaison for many years.  When I meet someone for the first time – how do I convince them that I am knowledgeable about technology, project management, and business process improvement?  Get certifications.

So, unlike many people working toward a certification, the certification I chose at Microsoft was on a topic in which I had over a decade of experience.

Even with that experience I may have been a bit cocky taking the test the first time – and I failed it. 😦  It happens. I lost the $150 cost of the test, but I prepared better the second time around.

MCP stands for Microsoft Certified Professional – the certification you get for passing any of the technical tests offered by Microsoft.  The test I chose was 70-461, “Querying Microsoft SQL Server 2012”.

The first thing I did was purchase the book “70-461 Querying Microsoft SQL Server 2012 Training Kit” from Microsoft Press.  I made it to page 342 out of 675 pages.  It is rather unfortunately a very dry book.

Let me put it another way.  If you have trouble sleeping in bright sunlight with children playing around you – this will help.

The book isn’t entirely useless.  It just isn’t very good.  No, that’s not right.

The content of the book is good.  The enthusiasm and personality of the text is zero.

The practice exam questions at the back of the book are very poor.  No, that’s not right either.  They are accurate questions, but entirely too easy.  If you aced them and went into the exam thinking you were prepared, you would be entirely wrong and fail horribly.

So, how exactly did I study for the exam?  A better question perhaps is, what did I study for the exam?

The best thing was to watch the content at Microsoft Virtual Academy.  We live in an incredible era where a lot of the knowledge necessary to do things is out there on the internet.

There were two specific courses at MVA that contributed greatly to my passing the 70-461 exam.

Querying Microsoft SQL Server 2012 Databases Jump Start

Boost Your T-SQL with the APPLY Operator

Both courses offer a vast amount of information and should not be watched like TV, but as if you are in a classroom of one where you can stop the professor every other word if necessary – taking notes and reproducing what they did in class in your own instance of SQL Server.

Go through both courses once taking notes, review your notes, then watch them through a second time, revisiting concepts that are still unfamiliar after the first pass.

I would also advise taking another MVA course, “Database Fundamentals”.  I know, you already know all of this stuff, right?  The problem is that we ‘knew’ all we needed to drive a car and pass the driver’s test at age 17.  How much have you forgotten about the basics of databases?  Even a little could come out and bite you during the exam.

Another valuable resource was mindhub.com’s “MeasureUp Microsoft Certification Practice Exam: 30 day”.  This was very valuable, as these questions were as difficult as the ones in the actual Microsoft exam – some a little harder.  It helped that I was able to combine paying for my second attempt with a discount on the MeasureUp practice exam.

Please be aware, if you purchase the MeasureUp practice exam, that after the purchase you have to select the exam you are studying for.  Otherwise you may well waste your money and time preparing for the wrong certification.

The day before the exam I went through every practice exam question from the book (useless) and from MeasureUp (good).

A combination of all these things – plus tightening up on my XML outputs (something I’ve never used in practice) – and I was able to pass the exam the second time through.

Will I get another Microsoft certification?  I don’t really know.  My primary career right now is as a project manager, but an MCSA and MCSE should only make me more valuable on the employment marketplace, no?

Installing Microsoft SQL Server Express

The idea behind the programming and educational portion of this blog is to have very brief sessions that convey a lot of information.  Bite size, five minute, educational blurbs.  Edurbs?  Blureds?

The platform I work on is Microsoft Windows and Microsoft SQL Server.  The reasoning is that most people have Microsoft Windows and are familiar with the environment, and Microsoft SQL Server Express is a free product that is easy to work with – except for the install.

But first…  You have to install the free version of Microsoft SQL Server.

Fortunately, many people have covered the install of Microsoft SQL Server:

This is a good install guide for SQL Server Express.

This is another good install site, with a lot of information about the limits of SQL Server.

I think our 5 minutes are up.  It will more than likely take more than 5 minutes to install SQL Server Express Edition.

Next time, I will go over how to look at the databases on a database server (your SQL Server Install), how to create databases, delete databases and how to detach a database and a brief explanation of why you would do those things!

Programming and Ignorance

Ignorance does not have to be good or bad.  It just is.  No matter how much any single individual knows, they are ignorant of some other topics.

In other words, there are no Renaissance men anymore.

Even within our chosen professions we can all suffer from ignorance.  Many years ago I was at a company where they asked a programmer to sort data retrieved from SQL Server.

Everyone connected back up sometime later to find out what was taking so long.  The programmer was writing a bubble sort and said it would be some more time before it could be finished.

We all had a big laugh and then taught the programmer the ORDER BY clause which sorts output from a query.

I had been programming in T-SQL under Microsoft SQL Server for many years before I decided to get a certification in Querying Microsoft SQL Server 2012.  I began to study, and there were queries from my past that I had always wondered about – why didn’t they work?

Queries using an expression twice in the SELECT clause, or using an aliased column name in the WHERE clause.

It all came down to ignorance.  The order you write a SQL statement in is different from the order in which it is processed.

Select Dept, Count(*) as CNT
From Employees
Where Sex = 'F'
Group By Dept, Country
Having Country = 'USA'
Order by CNT;

The above query is most frequently written top-down, just as you see it.  However, when it is processed by SQL Server, it is processed in this order:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY

So, using the alias as above works, since the ORDER BY clause is processed after the SELECT clause.  The query below fails because the WHERE clause references an alias created in the SELECT clause, which has not been processed yet.

Select Dept, Count(*) as CNT
From Employees
Where CNT > 1000
Group By Dept, Country
Having Country = 'USA'
Order by CNT;

In addition, SQL Server processes each clause ‘all at once’, so you can’t even use the CNT alias in another part of the same SELECT clause.  This type of error is shown below with the CNT + 1 in the SELECT clause.  (The standard workaround is to repeat the expression, or to wrap the query in a derived table or CTE so the alias exists before the outer query runs.)

Select Dept, Count(*) as CNT, CNT + 1
From Employees
Where Sex = 'F'
Group By Dept, Country
Having Country = 'USA'
Order by CNT;