Analysis of James Bond franchise movies#
This is a small notebook where I do some light analysis of the movies from the James Bond franchise. I am using ratings data from the MovieLens data set recommended for education and development, data hand-copied from The Numbers Movie Franchises page for budgets, release dates, and total earnings, and the IMDb ratings data set for comparison against MovieLens.
This was prompted by an attempt to analyze different franchises: I noticed that horror movie franchises are remarkably profitable, which made me curious about how other franchises perform. I then saw that many of the early James Bond movies were very profitable (high return on investment), so I started to dig a bit deeper, and it turned into this notebook, where I try to find a reason using the data available to me.
I hope you enjoy going through this notebook as much as I enjoyed creating it and learning more about how pandas can be used for data manipulation and how dataframes are joined together much like tables in SQL.
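As a tiny illustration of that pandas-to-SQL correspondence (a generic sketch with made-up rows, not data from this notebook):

import pandas as pd
movies = pd.DataFrame({'movieId': [1, 2], 'title': ['Dr. No', 'Goldfinger']})
ratings = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [4.0, 3.5, 5.0]})
# Equivalent to: SELECT * FROM ratings JOIN movies USING (movieId)
joined = ratings.merge(movies, on='movieId', how='inner')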
Imports and function definitions#
[1]:
from matplotlib.lines import Line2D
import ipywidgets as widgets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
[2]:
def get_roi(data):
'''
Get the return on investment for the data set
Args:
data (pandas.DataFrame): Data frame with the budget
and earnings data.
Returns:
df (pandas.DataFrame): Data frame copied from the
original data with two new columns 'profit' and
'roi'.
'''
df = data.copy()
df['profit'] = df['earningsTotal'] - df['budgetTotal']
df['roi'] = df['profit'] / df['budgetTotal'] * 100
return df
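# Quick sanity check (hypothetical one-row frame; the numbers match Dr. No in
# the ROI table further below): profit = 58,567,035 and roi ~= 5856.7 %.
_demo = pd.DataFrame({'budgetTotal': [1000000], 'earningsTotal': [59567035]})
assert round(get_roi(_demo)['roi'].iloc[0], 1) == 5856.7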
def align_y_axis(ax1, ax2):
'''
Align two matplotlib.axes.Axes objects such that the
0's on the y-axis of the plots are aligned correctly.
This function works such that the plots are zoomed out
by a ratio that will force the 0's to align.
Args:
ax1 (matplotlib.axes.Axes): Axes object from the
plot.
ax2 (matplotlib.axes.Axes): Axes object from the
plot. Typically a twinx object.
'''
axes = (ax1, ax2)
extrema = [ax.get_ylim() for ax in axes]
tops = [extr[1] / (extr[1] - extr[0]) for extr in extrema]
if tops[0] > tops[1]:
axes, extrema, tops = [list(reversed(l)) for l in (axes, extrema, tops)]
tot_span = tops[1] + 1 - tops[0]
b_new_t = extrema[0][0] + tot_span * (extrema[0][1] - extrema[0][0])
t_new_b = extrema[1][1] - tot_span * (extrema[1][1] - extrema[1][0])
axes[0].set_ylim(extrema[0][0], b_new_t)
axes[1].set_ylim(t_new_b, extrema[1][1])
def get_name(df, column):
''' Return the single unique value of a column, raising if it is not constant '''
name = df[column].unique()
if name.shape[0] > 1:
raise ValueError("Failed determining name for column {}".format(column))
else:
name = name[0]
return name
def get_actorName(actorId):
actors = data_df['franchise_tags'][['actor', 'actorId']].drop_duplicates() \
.set_index('actorId').T.to_dict(orient='list')
try:
actorName = actors[actorId][0]
except KeyError:
raise KeyError('There is no actor with an actorId value of {}'.format(actorId))
return actorName
def get_movieName(movieId):
movies = jb_ratings[['title', 'movieId']].drop_duplicates() \
.set_index('movieId').T.to_dict(orient='list')
try:
movieName = movies[movieId][0]
except KeyError:
raise KeyError('There is no movie with the movieId value of {}'.format(movieId))
return movieName
def get_movieDate(movieId):
movies = jb_ratings[['date', 'movieId']].drop_duplicates() \
.set_index('movieId').T.to_dict(orient='list')
try:
movieDate = movies[movieId][0]
except KeyError:
raise KeyError('There is no movie with the movieId value of {}'.format(movieId))
return movieDate
def get_actorId(actorName):
actors = data_df['franchise_tags'][['actor', 'actorId']].drop_duplicates() \
.set_index('actor').T.to_dict(orient='list')
try:
actorId = actors[actorName][0]
except KeyError:
raise KeyError('There is no actor with the name {}'.format(actorName))
return actorId
def generate_ratings(actorName, df, fstart):
'''
Generate a plot figure for the ratings of the specified
actor. The average rating is hard-coded to be computed
over yearly periods.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Ratings data with 'actorId',
'movieId', 'title', 'date', and 'reviewDate' columns.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
label = '{:s} ({:d})'
data = df.groupby('actorId').get_group(actorId)
fig = plt.figure(fstart+actorId, figsize=(10,5), dpi=75)
ax = fig.add_subplot(111)
for movieId, sub_data in data.groupby('movieId'):
movieName = get_name(sub_data, 'title')
releaseDate = get_name(sub_data, 'date')
tmp = sub_data.groupby(pd.Grouper(key='reviewDate', freq='1Y'))['rating'].mean()
ax.plot(tmp.index, tmp, label=label.format(movieName, releaseDate.year))
ax.set_title(actorName, fontsize=18)
ax.legend(bbox_to_anchor=(1.01,1), loc='upper left',
fontsize=14)
fig.tight_layout()
return fig
def generate_hist(actorName, df, fstart):
'''
Generate a histogram plot of the ratings for each movie
of the specified actor. Will display the mean, standard
deviation, median, and total votes in each subplot title.
We also overlay a normal probability density built from
the mean and standard deviation of the ratings. The
histograms show the probability density instead of the
raw number of votes per bin.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Ratings data with 'actorId',
'movieId', 'title', 'date', and 'rating' columns.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
data = df.groupby('actorId').get_group(actorId)
label = '{name:s} ({date:d})\navg: {avg:.2f} | std: {std:.2f}' \
+'\nmedian: {med:.2f} | votes: {nvotes:d}'
nrows = int(np.ceil(data['movieId'].unique().shape[0]/2))
fig = plt.figure(fstart+actorId, figsize=(10,nrows*4), dpi=75)
for idx, (movieId, sub_data) in enumerate(data.groupby('movieId')):
ax = fig.add_subplot(nrows,2,idx+1)
movieName = get_name(sub_data, 'title')
releaseDate = get_name(sub_data, 'date')
all_time_avg = sub_data['rating'].mean()
all_time_median = sub_data['rating'].median()
nvotes = sub_data.shape[0]
n, bins, patches = ax.hist(sub_data['rating'], bins=10, range=[0,5],
width=0.4, align='mid', density=True)
# add a 'best fit' line
sigma = sub_data['rating'].std()
mu = all_time_avg
y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
ax.plot(bins, y, '--')
ax.axvline(all_time_avg, color='k', linestyle='-')
ax.axvline(all_time_median, color='k', linestyle=(0, (5, 6)))
kwargs = dict(name=movieName, date=releaseDate.year, avg=all_time_avg,
med=all_time_median, std=sigma, nvotes=nvotes)
ax.set_title(label.format(**kwargs), fontsize=18)
fig.suptitle('Histogram of ratings with {} as lead actor'.format(actorName), fontsize=24)
fig.tight_layout(rect=(0,0,1,0.99))
return fig
def generate_lineplot(actorName, df, fstart):
'''
Generate per-movie line plots comparing the average, median,
and standard deviation of the ratings across the viewer
groups, with the values for the full ('original') data set
drawn as dashed horizontal lines.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Per-group summary statistics as
built in the 'Putting it all together' section.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
data = df.groupby('actorId').get_group(actorId)
nrows = int(np.ceil(data['movieId'].unique().shape[0]/2))
fig = plt.figure(fstart+actorId, figsize=(10,nrows*4), dpi=75)
for idx, (movieId, sub_data) in enumerate(data.groupby('movieId')):
ax = fig.add_subplot(nrows,2,idx+1)
ax1 = ax.twinx()
orig = sub_data.groupby('type').get_group('original')
sub_data = sub_data.drop(orig.index)  # avoid an inplace drop on a groupby slice
movieName = get_movieName(movieId)
movieDate = get_movieDate(movieId)
xticks = [x+' ('+str(y)+')' for x, y in zip(sub_data['type'], sub_data['count'])]
ax.plot(xticks, sub_data['average'], label='Average', color='tab:blue',
marker='o')
ax.plot(xticks, sub_data['median'], label='Median', color='tab:orange',
marker='o')
ax1.plot(xticks, sub_data['std'], color='tab:green', label='STD',
marker='o')
ax.axhline(orig['average'].values[0], color='tab:blue', linestyle=(0, (5, 6)))
ax.axhline(orig['median'].values[0], color='tab:orange', linestyle=(0, (5, 6)))
ax1.axhline(orig['std'].values[0], color='tab:green', linestyle=(0, (5, 6)))
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax1.get_legend_handles_labels()
custom = [Line2D([0], [0], color='k', linestyle='--')]
handles = h1+h2+custom
labels = l1+l2+['Original']
ax.legend(handles, labels, bbox_to_anchor=(1.15,1), loc='upper left')
ax.set_ylabel('Rating / out of 5')
ax1.set_ylabel('STD')
ax.set_title('{:s} ({:d})'.format(movieName, movieDate.year))
for label in ax.get_xticklabels():
label.set_rotation(40)
label.set_horizontalalignment('right')
fig.tight_layout()
return fig
def gen_tab_object(func, df, fstart):
    '''
    Build an ipywidgets Tab with one Output tab per lead actor,
    each holding the figure produced by `func` for that actor.
    '''
    names = ['Daniel Craig', 'Pierce Brosnan', 'Roger Moore', 'Sean Connery',
             'Timothy Dalton', 'George Lazenby', 'David Niven']
    outputs = []
    for name in names:
        out = widgets.Output()
        with out:
            plt.show(func(name, df=df, fstart=fstart))
        outputs.append(out)
    tabs = widgets.Tab(children=outputs)
    for i, name in enumerate(names):
        tabs.set_title(i, name)
    return tabs
Parse the data#
[3]:
dfs = {}
dirs = dict(large='ml-latest')
files = ['franchise_movies.csv', 'franchise_tags.csv', 'links.csv', 'ratings.csv']
for key, parent in dirs.items():
dfs[key] = {}
for fn in os.listdir(parent):
if not fn.endswith('.csv'): continue
if fn not in files: continue
sub_key = fn.replace('.csv', '')
dtypes = dict(userId=int, movieId=int,
imdbId=str, tmdbId=str)
filename = os.path.join(parent, fn)
dfs[key][sub_key] = pd.read_csv(filename, dtype=dtypes)
all_data = dfs
[4]:
main = 'imdb-data'
parent = 'title-ratings'
dtypes = dict(tconst=str, numVotes=int)
fn = os.path.join(main, parent, 'data.tsv')
df = pd.read_csv(fn, sep='\t', dtype=dtypes)
# IMDb IDs look like 'tt0123456'; strip the 'tt' prefix so they match
# the imdbId strings from the MovieLens links table
df['tconst'] = df['tconst'].apply(lambda x: x[2:])
df = df.merge(all_data['large']['links'], left_on='tconst', right_on='imdbId', how='right')
all_data['imdb-ratings'] = df
Convert timestamps to human-readable dates#
[5]:
data_df = all_data['large'].copy()
data_df['franchise_movies']['date'] = pd.to_datetime(data_df['franchise_movies']['releaseDate'], unit='s')
Analysis of return on investment#
[6]:
indv_roi = get_roi(data_df['franchise_movies'])
[7]:
jb_movies = indv_roi.groupby('seriesId').get_group(10).copy()
jb_movies['date'] = pd.to_datetime(jb_movies['releaseDate'], unit='s')
jb_movies.sort_values(by=['roi'], ascending=False)[['title', 'date', 'budgetTotal', 'profit', 'roi']].reset_index(drop=True)
[7]:
 | title | date | budgetTotal | profit | roi |
---|---|---|---|---|---|
0 | Dr. No | 1963-05-08 | 1000000 | 58567035 | 5856.703500 |
1 | Goldfinger | 1964-12-22 | 3000000 | 121900000 | 4063.333333 |
2 | From Russia with Love | 1964-04-08 | 2000000 | 76900000 | 3845.000000 |
3 | Live and Let Die | 1973-06-27 | 7000000 | 154800000 | 2211.428571 |
4 | Diamonds Are Forever | 1971-12-17 | 7200000 | 108799985 | 1511.110903 |
5 | Thunderball | 1965-12-29 | 9000000 | 132200000 | 1468.888889 |
6 | The Man with the Golden Gun | 1974-12-20 | 7000000 | 90600000 | 1294.285714 |
7 | The Spy Who Loved Me | 1977-07-13 | 14000000 | 171400000 | 1224.285714 |
8 | You Only Live Twice | 1967-06-13 | 9500000 | 102100000 | 1074.736842 |
9 | On Her Majesty's Secret Service | 1969-12-18 | 8000000 | 74000000 | 925.000000 |
10 | For Your Eyes Only | 1981-06-26 | 28000000 | 167300000 | 597.500000 |
11 | Octopussy | 1983-06-10 | 27500000 | 160000000 | 581.818182 |
12 | Moonraker | 1979-06-29 | 31000000 | 179300000 | 578.387097 |
13 | Goldeneye | 1995-11-17 | 60000000 | 296429933 | 494.049888 |
14 | Casino Royale | 2006-11-17 | 102000000 | 492420216 | 482.764918 |
15 | Skyfall | 2012-11-08 | 200000000 | 910526981 | 455.263490 |
16 | A View to a Kill | 1985-05-24 | 30000000 | 122627960 | 408.759867 |
17 | The Living Daylights | 1987-07-31 | 40000000 | 151199996 | 377.999990 |
18 | Never Say Never Again | 1983-10-07 | 36000000 | 124000000 | 344.444444 |
19 | Licence to Kill | 1989-07-14 | 42000000 | 114167015 | 271.826226 |
20 | Casino Royale | 1967-04-28 | 12000000 | 29744718 | 247.872650 |
21 | Tomorrow Never Dies | 1997-12-19 | 110000000 | 229504276 | 208.640251 |
22 | Die Another Day | 2002-11-22 | 142000000 | 289942139 | 204.184605 |
23 | No Time to Die | 2021-10-08 | 250000000 | 509959662 | 203.983865 |
24 | Spectre | 2015-11-06 | 300000000 | 579077344 | 193.025781 |
25 | The World is Not Enough | 1999-11-19 | 135000000 | 226730660 | 167.948637 |
26 | Quantum of Solace | 2008-11-14 | 230000000 | 361692078 | 157.257425 |
What we can see is that many of the older James Bond movies have a much higher return on investment than the newer ones, with Quantum of Solace performing the worst. Dr. No, for example, turned a budget of $1 million into a profit of roughly $58.6 million, an ROI of about 5,857%. Could this stem from modern action movies being more expensive to produce?
[8]:
fig = plt.figure(1, figsize=(6,6), dpi=75)
ax = fig.add_subplot(111)
ax2 = ax.twinx()
ax.axhline(0, color='k', linewidth=0.7)
ax.plot(jb_movies['date'], jb_movies['roi'], 'o', label='Return on Investment')
ax2.plot(jb_movies['date'], jb_movies['profit']/1e6, '^', color='tab:red', label='Profit')
ax.set_xlabel('Date')
ax.set_ylabel('Return on investment / %')
ax2.set_ylabel('Profit (not adjusted for inflation) / 10$^6$ USD')
ax.set_title('Return on investment for James Bond movies')
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
handles = h1 + h2
labels = l1 + l2
ax.legend(handles, labels, bbox_to_anchor=(1.01,1), loc='upper left')
align_y_axis(ax, ax2)

Here we see a plot of the return on investment for the James Bond series in blue and the profit in red. While the ROI trends downward, the profit (not adjusted for inflation) goes up significantly. So let's take a look at how the ratings for the movies have changed over time.
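As an aside, one could roughly deflate the profits to a common-year basis before comparing eras. Here is a minimal sketch; the cpi_multiplier values are illustrative placeholders, not real CPI figures:

# Deflate nominal profit to a rough common-year basis.
# The multipliers below are made-up placeholders, NOT real CPI data.
cpi_multiplier = {1964: 9.8, 2012: 1.3}
nominal_profit = {1964: 121900000, 2012: 910526981}  # Goldfinger, Skyfall (from the table above)
for year, profit in nominal_profit.items():
    print(year, round(profit * cpi_multiplier[year] / 1e6), 'M USD, adjusted')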
Ratings per lead actor#
[9]:
df1 = jb_movies.copy()
df1.reset_index(drop=True, inplace=True)
tmp = data_df['ratings'].merge(df1[['movieId', 'title', 'releaseDate', 'date', 'roi']], on='movieId')
jb_ratings = tmp.merge(data_df['franchise_tags'][['movieId', 'actor', 'actorId']], on='movieId')
jb_ratings['reviewDate'] = pd.to_datetime(jb_ratings['timestamp'], unit='s')
[10]:
tabs = gen_tab_object(generate_ratings, jb_ratings, 2)
tabs
[10]:
Unfortunately, the data set we got from MovieLens is incomplete for all but the James Bond movies starring Daniel Craig and Pierce Brosnan, since the collected reviews only begin in the mid-1990s and the older movies were released long before that. Overall, however, there does not seem to be a consistent trend in the ratings that would explain either the decrease in return on investment or the increase in profit.
A quick look through Wikipedia (this in particular) shows a steady increase in the salaries of the lead actors. If we factor in the increased use of VFX and stunts, that might be enough to explain why the movies no longer have as big a return on investment, while the advancement in the realism of VFX could be a contributing factor to the increase in profits over time.
One last thing that I wanted to look at is the stability of the data, i.e. how the data used above compares with other data sets. I will compare the average ratings to those from IMDb and check how far the average sits from the median.
Comparing to IMDb data#
[11]:
grouped = jb_ratings.groupby('movieId')
count = grouped['rating'].count().rename('count')
all_time_avg = grouped['rating'].mean().rename('average')
all_time_median = grouped['rating'].median().rename('median')
all_time_std = grouped['rating'].std().rename('std')
df = pd.concat([all_time_avg, all_time_median, all_time_std, count], axis=1)
df = df.merge(all_data['imdb-ratings'][['averageRating', 'numVotes', 'movieId']], on='movieId')
df = df.merge(jb_movies[['title', 'movieId', 'date']], on='movieId')
df['title'] = df['title'] + df['date'].apply(lambda x: ' ('+str(x.year)+')')
df['averageRating'] /= 2  # IMDb ratings are out of 10; halve to match the 5-star MovieLens scale
df['numVotes'] = df['numVotes'].astype(int)
rcols = dict(averageRating='imdbRating', numVotes='imdbVotes')
df.rename(columns=rcols).set_index('title').sort_values(by=['date']).drop(['movieId', 'date'], axis=1)
[11]:
title | average | median | std | count | imdbRating | imdbVotes |
---|---|---|---|---|---|---|
Dr. No (1963) | 3.665154 | 4.00 | 0.835024 | 9694 | 3.60 | 175470 |
From Russia with Love (1964) | 3.688400 | 4.00 | 0.849023 | 9586 | 3.65 | 141772 |
Goldfinger (1964) | 3.729858 | 4.00 | 0.865900 | 15527 | 3.85 | 198327 |
Thunderball (1965) | 3.599912 | 3.50 | 0.846492 | 5690 | 3.45 | 124284 |
Casino Royale (1967) | 2.877232 | 3.00 | 1.054743 | 1120 | 2.50 | 31603 |
You Only Live Twice (1967) | 3.593695 | 3.50 | 0.802609 | 3394 | 3.40 | 114921 |
On Her Majesty's Secret Service (1969) | 3.428017 | 3.50 | 0.932231 | 3605 | 3.35 | 96895 |
Diamonds Are Forever (1971) | 3.495077 | 3.50 | 0.852021 | 5992 | 3.25 | 111678 |
Live and Let Die (1973) | 3.488243 | 3.50 | 0.857393 | 6294 | 3.35 | 112948 |
The Man with the Golden Gun (1974) | 3.464959 | 3.50 | 0.852398 | 5037 | 3.35 | 110691 |
The Spy Who Loved Me (1977) | 3.531111 | 3.50 | 0.840140 | 5609 | 3.50 | 113778 |
Moonraker (1979) | 3.158547 | 3.00 | 0.939335 | 6014 | 3.10 | 106341 |
For Your Eyes Only (1981) | 3.437548 | 3.50 | 0.868478 | 5212 | 3.35 | 106002 |
Octopussy (1983) | 3.310463 | 3.50 | 0.881729 | 2332 | 3.25 | 110735 |
Never Say Never Again (1983) | 3.266388 | 3.50 | 0.931870 | 1556 | 3.05 | 71465 |
A View to a Kill (1985) | 3.182722 | 3.00 | 0.920832 | 4526 | 3.15 | 102587 |
The Living Daylights (1987) | 3.290004 | 3.00 | 0.890285 | 2731 | 3.35 | 103645 |
Licence to Kill (1989) | 2.625000 | 2.75 | 1.187735 | 8 | 2.75 | 1028 |
Goldeneye (1995) | 3.434835 | 3.50 | 0.878003 | 34942 | 3.60 | 265904 |
Tomorrow Never Dies (1997) | 3.230008 | 3.00 | 0.935325 | 15919 | 3.25 | 201493 |
The World is Not Enough (1999) | 3.204092 | 3.00 | 0.952916 | 11583 | 3.20 | 206889 |
Die Another Day (2002) | 3.086181 | 3.00 | 0.999052 | 8720 | 3.05 | 225752 |
Casino Royale (2006) | 3.837693 | 4.00 | 0.862241 | 28517 | 4.00 | 680473 |
Quantum of Solace (2008) | 3.300951 | 3.50 | 0.935187 | 10196 | 3.30 | 463211 |
Skyfall (2012) | 3.741676 | 4.00 | 0.890015 | 15918 | 3.90 | 718361 |
Spectre (2015) | 3.392451 | 3.50 | 0.936909 | 5630 | 3.40 | 457116 |
No Time to Die (2021) | 3.484932 | 3.50 | 0.884875 | 1626 | 3.65 | 430310 |
The overall conclusion is that the MovieLens data does not seem to have a large skew, albeit with a smaller sample size than IMDb. The exception is Licence to Kill (1989), which has only 8 votes, and that is reflected in its larger standard deviation compared to the rest. It would be interesting to see how the standard deviation and median compare for the IMDb data, but unfortunately I don't have access to that data.
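As a quick numeric check of that skew claim, here is a minimal sketch reusing the summary frame df built in the cell above:

# Crude skew indicator: how far each movie's mean rating sits from its median.
skew_proxy = (df['average'] - df['median']).rename('mean_minus_median')
print(skew_proxy.abs().max())  # the largest mean/median gap across all movies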
See how normalized the ratings are#
Something else that we could look at is how the individual ratings are distributed around the average. So we will plot the votes as a histogram and see how closely it follows a bell curve.
[12]:
tabs = gen_tab_object(generate_hist, jb_ratings, 9)
tabs
[12]:
Note: Average is represented by the solid black line and the median is the dashed black line.
What we can see here is that for most of the plots the maximum of the probability density function lines up pretty well with the bins holding the highest number of votes. For the movies starring Daniel Craig the distribution is fairly even, whereas for movies starring someone like Pierce Brosnan there are peaks and valleys in the data, for example in Goldeneye (1995), which has the highest number of votes of any James Bond film here. This does not necessarily point to any issues with the data set, but I wonder if there might be some kind of bias towards whole numbers among the voters. The same trend shows up in some of the movies with Roger Moore and Sean Connery as James Bond.
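One quick way to probe that suspicion is to count how many ratings land exactly on a whole star. A minimal sketch on the jb_ratings frame built earlier:

# Fraction of ratings that are a whole number of stars, per lead actor.
is_whole = jb_ratings['rating'] % 1 == 0
print(is_whole.groupby(jb_ratings['actor']).mean().sort_values(ascending=False))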
Comparison of voting with different kinds of viewers#
Here I just want to look at differences in voting between casual watchers and fans. I am using the following definitions (see the pd.cut sketch after the next cell for a one-pass alternative):
casual: people that have rated at most 2 movies
story fans: people that have rated at least 3 but fewer than 8 movies
super story fans: people that have rated at least 8 but fewer than 16 movies
godly fans: people that have rated at least 16 movies
super godly fans: people that have rated at least 22 movies
[13]:
df = jb_ratings.copy()
count = df.groupby('userId')['rating'].count().rename('count')
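For reference, the same user groups could be assigned in a single pass with pd.cut. A sketch using the count series from the cell above (here godly and super godly are made disjoint for illustration, whereas the cells below let them overlap):

# Assign each user to one viewer group based on how many movies they rated.
bins = [0, 2, 7, 15, 21, float('inf')]
labels = ['casual', 'story', 'super story', 'godly', 'super godly']
user_group = pd.cut(count, bins=bins, labels=labels)
print(user_group.value_counts())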
Casual group#
[14]:
ids = count[count <= 2]
casual_df = df.merge(ids, on='userId', how='right')
[15]:
tabs = gen_tab_object(generate_hist, casual_df, 9)
tabs
[15]:
Story fans group#
[16]:
ids = count[count.between(3, 7)]
stfans_df = df.merge(ids, on='userId', how='right')
[17]:
tabs = gen_tab_object(generate_hist, stfans_df, 9)
tabs
[17]:
Super story fans group#
[18]:
ids = count[count.between(8, 15)]
spstfans_df = df.merge(ids, on='userId', how='right')
[19]:
tabs = gen_tab_object(generate_hist, spstfans_df, 9)
tabs
[19]:
Godly fans group#
[20]:
ids = count[count >= 16]
godfans_df = df.merge(ids, on='userId', how='right')
[21]:
tabs = gen_tab_object(generate_hist, godfans_df, 9)
tabs
[21]:
Super Godly fans group#
[22]:
ids = count[count >= 22]
spgodfans_df = df.merge(ids, on='userId', how='right')
[23]:
tabs = gen_tab_object(generate_hist, spgodfans_df, 9)
tabs
[23]:
Based on what we see here, the distributions become a bit more normal-looking and no longer seem to favor whole numbers for the ratings. This does not necessarily mean that the data is better than before, because there may now be a bias: some of the groups, like super godly, are made up of people that have watched many James Bond movies and may no longer represent the general public.
Putting it all together#
Here I will compile all of the data from above and try to make a solid judgement.
[24]:
dfs = {'original': jb_ratings, 'casual': casual_df, 'story': stfans_df,
'super story': spstfans_df, 'godly': godfans_df,
'super godly': spgodfans_df}
arr = []
for key, val in dfs.items():
grouped = val.groupby(['actorId', 'movieId'])
avg = grouped['rating'].mean().rename('average')
med = grouped['rating'].median().rename('median')
std = grouped['rating'].std().rename('std')
cnt = grouped['rating'].count().rename('count')
df = pd.concat([avg, med, std, cnt], axis=1)
df['type'] = key
df.reset_index(inplace=True)
arr.append(df)
df = pd.concat(arr, ignore_index=True)
combined_df = df
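Before plotting, here is a quick numeric look at the whole-star tendency per viewer group, as a minimal sketch over the dfs dict defined above:

# Fraction of whole-star ratings in each viewer group.
for key, val in dfs.items():
    frac_whole = (val['rating'] % 1 == 0).mean()
    print('{:>12s}: {:.1%} whole-star ratings'.format(key, frac_whole))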
[25]:
tabs = gen_tab_object(generate_lineplot, combined_df, 9)
tabs
[25]:
Here we see that, for the most part, limiting the voters to the criteria I outlined before does not have as big an effect as I originally thought. We do see that the standard deviation decreases for some movies, while for others there is no improvement. This is not to say that people who have watched more James Bond movies make "better" reviewers; it is just that users who have watched more of the movies may look for different things in a movie, such as the story, visual effects, action sequences, or villains, to name a few, than those who watch them less often.
Conclusion#
I am happy to claim that the data set I am using, while not necessarily the largest (there are only 8 votes for Licence to Kill (1989)), has ratings similar to those from IMDb, with median values not far from the averages. In addition, the distribution of the votes looks reasonably normal. Finally, we checked how the number of James Bond movies a user has rated affects the averages, medians, and standard deviations, and found no strong dependence, although the distributions become a bit more normal-looking and stop favoring whole-number ratings. Grouping the users like this, however, may no longer describe the entire population well, as it adds a bias to the results. It is still an interesting trend to see.
So, based on the data I have, I cannot conclude what caused the huge decrease in the return on investment for James Bond movies over the years. Only when I went to Wikipedia and did some manual digging did I come across something that could be a factor: the salary of the lead actors.
Thank you for coming along with me on this fun journey!!