How Long is Too Long?

Kai Graham
7 min readMay 29, 2020

--

We’ve all been there. You throw some popcorn in the microwave, plop down on the couch with some friends, and start scrolling through Netflix, trying to decide what movie you’ll watch that evening. One friend wants to watch an action movie, another friend wants to watch a comedy, and a third just wants to watch a movie starring Denzel Washington. You may be different, but one factor that almost never influences my movie selection process is runtime. As the late film critic, Roger Ebert, once said: “No good movie is too long and no bad movie is short enough”. But what does the data say? Do longer movies generate more revenue than shorter movies? Using The Movie DB’s API, I set out to answer this question. Below you will find a walkthrough of my process for pulling data from TMDb and my findings.

First, the data. To start, I used TMDb’s “Discover” API to pull in movies with release dates between a start and end date. Following API documentation provided by TMDb, the first step is to register for an API Key, use that API Key to generate an Access Token, and then use that Access Token in the headers of API requests sent to TMDb.

#Discover API url and headers
url = 'https://api.themoviedb.org/3/discover/movie'
headers = {'Authorization': 'Bearer {}'.format(access_token),
'Content-Type': 'application/json;charset=utf-8'}

It is important to note that TMDb limits the number of results returned by the API to 20 results per page and a maximum of 500 pages (despite documentation stating 1000). As a result, retrieving movie data for 2000 through YTD 2020 in a single API call is not possible. I found that requesting more than ~3 months of data at a time tends to exceed these limitations. Furthermore, the API only returns one page at a time. To handle this, I created a get_num_pages function that returns the number of pages of results from TMDb for the date range provided.

def get_num_pages(url, headers, start_date, end_date):
"""
Takes as input an API url, headers containing
authentication information, a start date, and end date.
Returns the number of pages of results returned by the
API call as an int.
"""
params = {'release_date.gte': start_date,
'release_date.lte': end_date}
returned_movies = requests.get(url=url, headers=headers,
params=params).json()
return returned_movies['total_pages']

I then created a function to loop through every page of results returned for that specific start and end date.

def get_movies_data(start_date, end_date, url, headers):
"""
Takes a start date, end date, API url, and headers with
authentication information.
Uses get_num_pages function to check the number of pages
returned by the API.
Loops through all pages, requesting data from API, concatenating
results to a dataframe.
Returns dataframe of movie information between start and end
date.
"""
df = pd.DataFrame()
num_pages = get_num_pages(url, headers, start_date, end_date)
for i in range(1, num_pages+1):
parameters = {'release_date.gte': start_date,
'release_data.lte': end_date,
'page': i}
request = requests.get(url, headers=headers,
params=parameters).json()
df = pd.concat([df, pd.DataFrame(request['results'])],
sort=False)
return df

Now that we have a way of getting the number of pages of results returned and a way of pulling in movie data for a given start and end date, we can create lists of start and end dates to loop through and call get_movies_data. I decided to look at movies with release dates between 1/1/2000 and 12/31/2020 (TMDb includes upcoming releases for future movies). Due to API limitations mentioned above, I also created a function to split a year into quarters, returning a list of start dates and a list of end dates.

def create_quarter_date_list(year):
"""
Returns a list of quarterly start dates and a list of
quarterly end dates from int input representing year.
"""
start_dates = [f'{year}-01-01', f'{year}-04-01',
f'{year}-07-01', f'{year}-10-01']
end_dates = [f'{year}-03-31', f'{year}-06-30',
f'{year}-09-30', f'{year}-12-31']
return start_dates, end_dates

Use the function to create start and end dates for all years in our range (2000–2020).

#create start and end dates for 2000 to 2020
start_dates = []
end_dates = []
for year in range(2000, 2021):
start_date, end_date = create_quarter_date_list(year)
start_dates += start_date
end_dates += end_date

Now loop through all start and end dates, calling get_movies_data for each set of start and end dates, and concatenate the results to a Pandas DataFrame.

#get movie data and concat to current dataframe
df = pd.DataFrame()
url = 'https://api.themoviedb.org/3/discover/movie'
#loop through all start and end dates and make an API call for each date range
#append results to df
for i, start_date in enumerate(start_dates):
temp_df = get_movies_data(start_date=start_date,
end_date=end_dates[i],
url=url,
headers=headers)
df = pd.concat([df, temp_df], sort=False)

You will note that this DataFrame does not include a runtime or revenue or ratings attribute.

DataFrame showing results of first API calls

To pull in the relevant attributes, an additional API call is needed to TMDb’s Movies API. Our headers object will not change, and the URL for this API is found here:

url = 'https://api.themoviedb.org/3/movie'

Using the “id” column from the DataFrame we just created, loop through all movie IDs and for each call TMDb’s Movies API. While TMDb documentation provides instructions for passing in movie ID to the API call via path parameters, I found these path parameters would not work. To get around this, I opted to use f string to append the movie ID to the API URL. Append all results to a list (I am using movie_details).

movie_details = []#loop through all movie IDs, request details, and append to movie_details
for index, movie_id in enumerate(movies['id']):
response = requests.get(f'{url}/{movie_id}',
headers=headers).json()
movie_details.append(response)
update_progress(index / len(movies['id']))

This ran for close to an hour before all results were pulled in. Once the loop is finished, create a DataFrame to begin analyzing the data.

movies_df = pd.DataFrame(movie_details)

As shown below, we now have all the attributes we need for our runtime analysis.

For this analysis we will mainly be focusing on the “runtime” and “revenue” columns. Next, I cleaned the data by removing duplicate IDs, removing rows with missing data, and dropped the following columns that are not relevant to this analysis (backdrop_path, belongs_to_collection, homepage, and poster_path) as these are primarily links to things like the movie’s poster image and backdrop. Looking at the cleaned DataFrame, there are still missing runtime and imdb_id values:

movies_df.info()

Given the amount of missing runtime data as a percentage of all results (~1.3%), I decided to fill the missing values with the median runtime. I decided to fill missing imdb_id cells with the string “missing”.

movies_df['imdb_id'].fillna('missing', inplace=True)
movies_df['runtime'].fillna(movies_df['runtime'].median(),
inplace=True)

With the dataset cleaned, analysis can begin. To start, I saved a copy of movies_df into a runtime_df variable, and plotted a box plot of runtimes to get a sense of the distribution.

As can be seen, there are a number of outliers. The median runtime is 93 minutes and the average runtime is 89.3 minutes. Standard deviation is 86.5 minutes. Given the similarity between the mean and median, I decided to remove any runtimes that were more than 2 standard deviations above the mean, or movies with runtimes greater than 4.4 hours. The other issue is that there are movies with runtimes of 0. Conceptually this does not make sense. 0 runtime movies comprise ~6% of the data. I decided to fill these with the median. Additionally, 75% of the dataset contains revenue amounts of 0. It is possible for a movie to do so poorly that it generates 0 revenue, but more likely, these 0’s are placeholders for missing data. Because we are concerned about what runtime movies produce the most revenue, I removed all movies with revenues of 0.

runtime_df = runtime_df.loc[runtime_df['revenue'] > 0]

With outliers handled, and placeholders taken care of, we can plot a scatterplot showing the relationship between revenue and runtime.

It seems the majority of high rated movies tend to have runtimes that fall between 100 minutes and 180 minutes. As runtime increases, revenue seems to increase, but only to a certain point.

To get a better sense of how movies with certain runtimes tend to perform, I created subsets of the data, broken out by runtime length.

movies_lt_30 = runtime_df.loc[runtime_df['runtime'] <= 30]
movies_lt_60 = runtime_df.loc[(runtime_df['runtime'] > 30) &
(runtime_df['runtime'] <= 60)]
movies_lt_90 = runtime_df.loc[(runtime_df['runtime'] > 60) &
(runtime_df['runtime'] <= 90)]
movies_lt_120 = runtime_df.loc[(runtime_df['runtime'] > 90) &
(runtime_df['runtime'] <= 120)]
movies_lt_150 = runtime_df.loc[(runtime_df['runtime'] > 120) &
(runtime_df['runtime'] <= 150)]
movies_lt_180 = runtime_df.loc[(runtime_df['runtime'] > 150) &
(runtime_df['runtime'] <= 180)]
movies_gt_180 = runtime_df.loc[(runtime_df['runtime'] > 180)]

Then, I calculated the median revenue for each subset:

And created a bar chart to visualize the data.

The sweet spot falls between 2 hours and 3 hours, with median revenue falling off for movies with runtimes greater than 3 hours. Conceptually, this makes sense. Consumers likely want to see movies that are long enough to keep them entertained (and differentiate from television shows/series), but not so long that a large chunk of their day is taken up by the movie (3+ hours).

--

--