Processing Data Files with Hive and MapReduce
In the Processing Big Data course at NYU, we recently dug into Netflix and IMDb datasets to understand whether Netflix produces good shows. Unfortunately, we could not arrive at any meaningful conclusions. We did, however, encounter several interesting challenges during the data cleaning phase of our research.
Datasets
We used two datasets from Kaggle. The Netflix dataset lists the titles available on Netflix along with attributes such as directors, cast, and more. The IMDb dataset comprises multiple tables containing detailed information about TV shows, movies, documentaries, and more.
Cleaning data challenge 1: parsing CSV rows
We quickly encountered the challenge of determining where each column starts and ends within a CSV row. CSV stands for "comma-separated values," which leads one to believe that obtaining each column's content is as simple as locating every comma: each column's content is then the string between two consecutive commas. This assumption breaks down when a column's content itself contains a comma.
For example, a column whose content contains commas may be written as:
"A Star Is Born, starring Lady Gaga and Bradley Cooper, is a movie about…", Bradley Cooper
Here, the commas inside the quotes are part of the first column's content and must be ignored when splitting the row.
The character of interest to the algorithm is the quote symbol, ". Note that CSV makes no distinction between opening and closing quote symbols, so we treat the quote symbol as a single entity. A CSV file wraps a column's content in a pair of quote symbols whenever that content contains a comma, as in the example above.
Recognizing balanced pairs of quote symbols is central to parsing CSV columns. A quote symbol marks the beginning of a quoted column's content, and that content does not end until the matching quote symbol is encountered; any comma seen in between belongs to the content rather than to the row structure.
We formalize the logic in the Python function below:
def csv_row_split(row: str) -> list[int]:
    """Return the indices of the commas that separate columns,
    ignoring any comma that falls inside a quoted field."""
    col_ends = []
    quotes_balanced = True  # True while we are outside a quoted field
    for index, char in enumerate(row):
        if char == '"':  # quote detected: toggles us into or out of a quoted field
            quotes_balanced = not quotes_balanced
        elif char == ',' and quotes_balanced:  # comma outside quotes separates columns
            col_ends.append(index)
    return col_ends
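Applied to the example row above, only the final comma, the one outside the quotes, is reported as a separator. The separator indices can then be used to slice out each column; a minimal sketch:

row = '"A Star Is Born, starring Lady Gaga and Bradley Cooper, is a movie about…", Bradley Cooper'
separators = csv_row_split(row)  # only the comma outside the quotes is returned
# Slice the row at the separator commas to recover the individual columns.
bounds = [-1] + separators + [len(row)]
columns = [row[start + 1:end].strip() for start, end in zip(bounds, bounds[1:])]
# columns[0] is the quoted synopsis; columns[1] is 'Bradley Cooper'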
This approach dramatically improved the reliability of extracting each column's content at its correct index.
Cleaning data challenge 2: joining different datasets
Because the Netflix dataset and the IMDb dataset used different schemas and ID systems, joining these tables became a challenge. It was particularly important for us to combine the Netflix and the IMDb datasets, because this was the only way for us to obtain Netflix titles’ rating information.
The joined table should include every title that is both listed on Netflix and has an IMDb rating. In theory, the titles listed on Netflix should be a subset of the IMDb titles, but the lack of a universal ID scheme means some titles cannot be readily matched across the datasets.
We opted to match on each title's release name, that is, the name of the TV show or movie. Using Hive, we joined the Netflix dataset against a consolidated IMDb table that we had first assembled by joining the multiple IMDb tables. This technique produced a reasonably clean result: 87% of the titles listed on Netflix were matched to an IMDb counterpart and received its IMDb rating.
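The join itself was expressed in Hive; purely to illustrate the matching logic, here is a minimal Python sketch of an inner join on normalized release names. The column names (title, rating) are hypothetical, and real datasets would be joined in Hive rather than in memory:

def normalize(title: str) -> str:
    """Match release titles case-insensitively, ignoring surrounding whitespace."""
    return title.strip().lower()

def join_on_title(netflix_rows: list[dict], imdb_rows: list[dict]) -> list[dict]:
    """Inner-join two row lists on the normalized release title,
    attaching each matched title's IMDb rating."""
    ratings = {normalize(row["title"]): row["rating"] for row in imdb_rows}
    joined = []
    for row in netflix_rows:
        key = normalize(row["title"])
        if key in ratings:  # keep only Netflix titles with an IMDb match
            joined.append({**row, "imdb_rating": ratings[key]})
    return joined

Any title whose name differs across the two datasets (alternate spellings, subtitles, and so on) fails to match under a scheme like this, which is one reason the match rate stops short of 100%.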
Analytic results
We were able to arrive at the following statistics:
- Average IMDb rating of Netflix titles: 6.769 out of 10
- Average IMDb rating across all IMDb titles: 6.886 out of 10
- Standard deviation of IMDb title ratings: 1.4
We were also able to catch a glimpse of where Netflix titles come from by looking at their production locations. Below are two visualizations of the same data: the distribution of Netflix titles' production locations.
Concluding thoughts
While we understand that the lack of Netflix ratings data has limited the scope of our analysis, we still found the data cleaning challenges in our research quite interesting.
We have open-sourced the code written for our research here.