Analyzing Netflix Datasets
Netflix is one of the fastest-growing entertainment hubs, offering movies, anime, documentaries and TV shows. A lot of us turn to Netflix when we are bored, stressed or in the mood to relax and have fun, or as we would say, “Netflix and Chill”.
According to Wikipedia, Netflix Inc. is an American over-the-top content platform and production company. Netflix adds a lot of movies and TV shows every year. As a Netflix user, I got interested in collecting data on its movies and TV shows across the years. As a Data Analyst, I enjoyed cleaning and analyzing this dataset. In this article I walk through the steps I took while analyzing it and point out some key movie stars and directors. The data is available on Kaggle through this link.
Data Loading
1. Importing Important Modules
The main module I used when analyzing this dataset is pandas. Pandas is a module used for loading and working with data from files (e.g. CSV, Excel, JSON, etc.). To import pandas and load a CSV file, you use the following syntax:
import pandas as pd
nf = pd.read_csv("netflix_titles.csv")
As seen in the code above, our Netflix dataset, which is a CSV (Comma-Separated Values) file, is loaded with the read_csv function. To view the first 5 rows of the dataset, we call the head() function.
nf.head()
# returns the first 5 rows of the dataset
The result of this is provided below:
To get the last 5 rows of the same dataset, we call the tail function.
nf.tail(5)
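The loading and preview steps above can be tried without downloading the real file. The sketch below is only an illustration: the CSV text is made up, and it is read from memory with io.StringIO instead of from netflix_titles.csv.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file (illustrative values only).
csv_text = """title,release_year
A,2019
B,2020
C,2021
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # first 2 rows; head() with no argument returns 5
last_one = sample.tail(1)   # last row; tail() with no argument returns 5
print(first_two)
print(last_one)
```

Passing a number to head or tail is optional; both default to 5 rows.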
Next, to get a short summary of the whole dataset, we call the info() function. To get the shape of the dataset, we use the shape attribute.
nf.info()
nf.shape
The info() function shows the number of non-null entries in each column (which reveals the columns that contain nulls), the total number of entries in the dataset, the data type of each column and the memory used (730.2 KB). The shape attribute gives the dimensions of the dataset as (rows, columns); in this case, (7787, 12).
Because the data is not ordered, we sort it in ascending order by release year, from the earliest year to the latest. To do this, we use the sort_values function.
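To see exactly how many entries are missing in each column, rather than reading them off the summary, the nulls can be counted directly. This is a minimal sketch on a small made-up DataFrame; the column names and values are assumptions for illustration, not the real Netflix data.

```python
import pandas as pd

# Tiny made-up frame standing in for the Netflix data (illustrative only).
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "release_year": [2019, 2020, 2021],
})

# Number of null entries in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of rows that contain at least one null entry.
rows_with_missing = int(sample.isnull().any(axis=1).sum())
print(rows_with_missing)
```

The same two lines should work on nf once the real file is loaded.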
sorted_nf = nf.sort_values('release_year', ascending = True, kind= 'heapsort', na_position= 'last')
sorted_nf.head()
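Many titles share the same release year, so it can matter how ties are ordered. The sketch below uses made-up data and swaps in kind="mergesort", a stable sort, in place of the heapsort used above, so that rows with the same year keep their original relative order. This is an alternative for illustration, not the approach taken in this article.

```python
import pandas as pd

# Made-up titles with repeated release years (illustrative only).
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2020, 1995, 2020, 1995],
})

# mergesort is stable: ties stay in their original relative order.
by_year = sample.sort_values(
    "release_year", ascending=True, kind="mergesort", na_position="last"
)
print(by_year)
```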
This then leads to the cleaning of missing values and other useless data.
Data Cleaning
As the name implies, data cleaning involves removing wrong data, empty data, wrongly formatted data and duplicates. The summary above showed that some entries in the dataset were empty, so I decided to get rid of the missing data using the dropna function, which drops rows containing null values. Because dropna returns a new DataFrame, we assign the result to a new variable that holds only the rows with no null values.
new_nf = sorted_nf.dropna()
new_nf.head()
We then check the summary and shape of our new dataset, using the info function and shape attribute we used before.
new_nf.info()
new_nf.shape
From the above result, we can see there are no remaining missing entries, and the columns consist of one integer type and 11 object types. In total, 2979 rows contained missing values (7787 - 4808), which would have led to inaccurate analysis. The shape of our new dataset is (4808, 12).
Next, we filter out any duplicates present in our data in order to avoid unnecessary repetition. To do this, we use the drop_duplicates function.
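Calling dropna with no arguments removes every row that has a null in any column, which can discard a large share of the data. The sketch below, on a made-up DataFrame with assumed column names, counts how many rows are removed and shows one possible alternative: passing subset so that only the columns needed for the analysis must be non-null. The alternative is an assumption for illustration, not what was done in this article.

```python
import pandas as pd

# Made-up frame with nulls scattered across columns (illustrative only).
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "country": ["US", "US", None, "IN"],
})

# Default behaviour: drop rows with a null in any column.
cleaned = sample.dropna()
rows_removed = len(sample) - len(cleaned)
print(rows_removed)

# Alternative: only require a value in the columns the analysis uses.
cleaned_subset = sample.dropna(subset=["country"])
print(len(cleaned_subset))
```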
new_nf = new_nf.drop_duplicates()
new_nf.shape
In this dataset there are no duplicates, as the shape of the dataset remains the same after dropping duplicates.
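Another way to check for duplicates is to count them directly before dropping anything. This is a minimal sketch on made-up data with one repeated row; the column names are assumptions for illustration.

```python
import pandas as pd

# Made-up frame where the third row repeats the first (illustrative only).
sample = pd.DataFrame({
    "title": ["A", "B", "A"],
    "release_year": [2019, 2020, 2019],
})

# duplicated() marks each row that repeats an earlier row.
duplicate_count = int(sample.duplicated().sum())
print(duplicate_count)

# drop_duplicates() returns a new DataFrame, so the result must be assigned.
deduplicated = sample.drop_duplicates()
print(deduplicated.shape)
```

A count of 0 on the real data would confirm that there is nothing to drop.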
Important Facts from Analyzing the Data
In conclusion, according to my analysis of this data, I found that over 4670 movies and over 130 TV shows in the cleaned dataset were released between 1944 and 2021.
Most of the movies and TV shows were released after 2010. Movies and TV shows rated for Mature Audiences (TV-MA) were the most common over the years. The United States has the highest number of titles during this period.
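Counts like these can be obtained with value_counts. The sketch below uses a small made-up DataFrame; the column names type, rating and country are assumptions about how the Kaggle file is laid out (only release_year appears in the code above), and the values are illustrative, not the real results.

```python
import pandas as pd

# Made-up rows; column names other than release_year are assumed.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "TV-MA"],
    "country": ["United States", "India", "United States", "United States"],
    "release_year": [2015, 2008, 2019, 2012],
})

# How many movies versus TV shows.
type_counts = sample["type"].value_counts()
print(type_counts)

# Most frequent rating and country.
top_rating = sample["rating"].value_counts().idxmax()
top_country = sample["country"].value_counts().idxmax()
print(top_rating, top_country)

# Number of titles released after 2010.
after_2010 = int((sample["release_year"] > 2010).sum())
print(after_2010)
```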
The top directors by number of titles during this period were Raul Campos and Jan Suter, with 18 movies and TV shows to their name. A top actor during this period was Samuel West.
The most common categories, by number of titles, are International Movies, followed by Dramas, Comedies, Documentaries, and Action and Adventure, in that order.
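A title can belong to several categories at once, so the categories have to be separated before they are counted. The sketch below assumes the categories sit in a column named listed_in as a comma-separated string; both that column name and the rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the listed_in column name and format are assumed.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, International Movies",
        "Dramas",
    ],
})

# Split each string into a list, then give every category its own row.
genres = sample["listed_in"].str.split(", ").explode()

# Count how many titles fall under each category.
genre_counts = genres.value_counts()
print(genre_counts)
```

Counting this way measures how many titles carry each category, not how often they are watched.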
The full code used in this article is available through this GitHub link.