Analysis of James Bond franchise movies#
This is a small notebook where I do some light analysis of the movies from the James Bond franchise. I am using ratings data from the MovieLens data set recommended for education and development, data hand-copied from The Numbers Movie Franchises page for budgets, release dates, and total earnings, and the IMDb ratings data set for comparison against MovieLens.
This was prompted by an attempt to analyze different franchises: I noticed that horror movie franchises are remarkably profitable, which made me curious about how other franchises perform. I then saw that many of the early James Bond movies were very profitable (high return on investment), so I started to dig a bit deeper, and it turned into this notebook, where I try to find a reason using the data available to me.
I hope you enjoy going through this notebook as much as I enjoyed creating it and learning more about how pandas can be used for data manipulation and how dataframes are joined together much like tables in SQL.
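As a tiny illustration of that pandas-to-SQL correspondence (a generic sketch with made-up rows, not data from this notebook):

import pandas as pd
movies = pd.DataFrame({'movieId': [1, 2], 'title': ['Dr. No', 'Goldfinger']})
ratings = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [4.0, 3.5, 5.0]})
# Equivalent to: SELECT * FROM ratings JOIN movies USING (movieId)
joined = ratings.merge(movies, on='movieId', how='inner')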
Imports and function definitions#
[1]:
from matplotlib.lines import Line2D
import ipywidgets as widgets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
[2]:
def get_roi(data):
'''
Get the return on investment for the data set
Args:
data (pandas.DataFrame): Data frame with the budget
and earnings data.
Returns:
df (pandas.DataFrame): Data frame copied from the
original data with two new columns 'profit' and
'roi'.
'''
df = data.copy()
df['profit'] = df['earningsTotal'] - df['budgetTotal']
df['roi'] = df['profit'] / df['budgetTotal'] * 100
return df
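# Quick sanity check (hypothetical one-row frame; the numbers match Dr. No in
# the ROI table further below): profit = 58,567,035 and roi ~= 5856.7 %.
_demo = pd.DataFrame({'budgetTotal': [1000000], 'earningsTotal': [59567035]})
assert round(get_roi(_demo)['roi'].iloc[0], 1) == 5856.7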
def align_y_axis(ax1, ax2):
'''
Align two matplotlib.axes.Axes objects such that the
0's on the y-axis of the plots are aligned correctly.
This function works such that the plots are zoomed out
by a ratio that will force the 0's to align.
Args:
ax1 (matplotlib.axes.Axes): Axes object from the
plot.
ax2 (matplotlib.axes.Axes): Axes object from the
plot. Typically a twinx object.
'''
axes = (ax1, ax2)
extrema = [ax.get_ylim() for ax in axes]
tops = [extr[1] / (extr[1] - extr[0]) for extr in extrema]
if tops[0] > tops[1]:
axes, extrema, tops = [list(reversed(l)) for l in (axes, extrema, tops)]
tot_span = tops[1] + 1 - tops[0]
b_new_t = extrema[0][0] + tot_span * (extrema[0][1] - extrema[0][0])
t_new_b = extrema[1][1] - tot_span * (extrema[1][1] - extrema[1][0])
axes[0].set_ylim(extrema[0][0], b_new_t)
axes[1].set_ylim(t_new_b, extrema[1][1])
def get_name(df, column):
''' Return the single unique value of a column, raising if it is not constant '''
name = df[column].unique()
if name.shape[0] > 1:
raise ValueError("Failed determining name for column {}".format(column))
else:
name = name[0]
return name
def get_actorName(actorId):
actors = data_df['franchise_tags'][['actor', 'actorId']].drop_duplicates() \
.set_index('actorId').T.to_dict(orient='list')
try:
actorName = actors[actorId][0]
except KeyError:
raise KeyError('There is no actor with an actorId value of {}'.format(actorId))
return actorName
def get_movieName(movieId):
movies = jb_ratings[['title', 'movieId']].drop_duplicates() \
.set_index('movieId').T.to_dict(orient='list')
try:
movieName = movies[movieId][0]
except KeyError:
raise KeyError('There is no movie with the movieId value of {}'.format(movieId))
return movieName
def get_movieDate(movieId):
movies = jb_ratings[['date', 'movieId']].drop_duplicates() \
.set_index('movieId').T.to_dict(orient='list')
try:
movieDate = movies[movieId][0]
except KeyError:
raise KeyError('There is no movie with the movieId value of {}'.format(movieId))
return movieDate
def get_actorId(actorName):
actors = data_df['franchise_tags'][['actor', 'actorId']].drop_duplicates() \
.set_index('actor').T.to_dict(orient='list')
try:
actorId = actors[actorName][0]
except KeyError:
raise KeyError('There is no actor with the name {}'.format(actorName))
return actorId
def generate_ratings(actorName, df, fstart):
'''
Generate a plot figure for the ratings of the specified
actor. The average rating is hard-coded to be computed
over yearly periods.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Ratings data with 'actorId',
'movieId', 'title', 'date', and 'reviewDate' columns.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
label = '{:s} ({:d})'
data = df.groupby('actorId').get_group(actorId)
fig = plt.figure(fstart+actorId, figsize=(10,5), dpi=75)
ax = fig.add_subplot(111)
for movieId, sub_data in data.groupby('movieId'):
movieName = get_name(sub_data, 'title')
releaseDate = get_name(sub_data, 'date')
tmp = sub_data.groupby(pd.Grouper(key='reviewDate', freq='1Y'))['rating'].mean()
ax.plot(tmp.index, tmp, label=label.format(movieName, releaseDate.year))
ax.set_title(actorName, fontsize=18)
ax.legend(bbox_to_anchor=(1.01,1), loc='upper left',
fontsize=14)
fig.tight_layout()
return fig
def generate_hist(actorName, df, fstart):
'''
Generate a histogram plot of the ratings for each movie
of the specified actor. Will display the mean, standard
deviation, median, and total votes in each subplot title.
We also overlay a normal probability density built from
the mean and standard deviation of the ratings. The
histograms show the probability density instead of the
raw number of votes per bin.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Ratings data with 'actorId',
'movieId', 'title', 'date', and 'rating' columns.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
data = df.groupby('actorId').get_group(actorId)
label = '{name:s} ({date:d})\navg: {avg:.2f} | std: {std:.2f}' \
+'\nmedian: {med:.2f} | votes: {nvotes:d}'
nrows = int(np.ceil(data['movieId'].unique().shape[0]/2))
fig = plt.figure(fstart+actorId, figsize=(10,nrows*4), dpi=75)
for idx, (movieId, sub_data) in enumerate(data.groupby('movieId')):
ax = fig.add_subplot(nrows,2,idx+1)
movieName = get_name(sub_data, 'title')
releaseDate = get_name(sub_data, 'date')
all_time_avg = sub_data['rating'].mean()
all_time_median = sub_data['rating'].median()
nvotes = sub_data.shape[0]
n, bins, patches = ax.hist(sub_data['rating'], bins=10, range=[0,5],
width=0.4, align='mid', density=True)
# add a 'best fit' line
sigma = sub_data['rating'].std()
mu = all_time_avg
y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
ax.plot(bins, y, '--')
ax.axvline(all_time_avg, color='k', linestyle='-')
ax.axvline(all_time_median, color='k', linestyle=(0, (5, 6)))
kwargs = dict(name=movieName, date=releaseDate.year, avg=all_time_avg,
med=all_time_median, std=sigma, nvotes=nvotes)
ax.set_title(label.format(**kwargs), fontsize=18)
fig.suptitle('Histogram of ratings with {} as lead actor'.format(actorName), fontsize=24)
fig.tight_layout(rect=(0,0,1,0.99))
return fig
def generate_lineplot(actorName, df, fstart):
'''
Generate per-movie line plots comparing the average, median,
and standard deviation of the ratings across the viewer
groups, with the values for the full ('original') data set
drawn as dashed horizontal lines.
Args:
actorName (str): Name of the actor for which you
want to generate the plot.
df (pandas.DataFrame): Per-group summary statistics as
built in the 'Putting it all together' section.
fstart (int): Offset added to the actorId to get a
unique figure number.
Returns:
fig (matplotlib.pyplot.figure): Matplotlib figure
object with the plotted data.
'''
actorId = get_actorId(actorName)
data = df.groupby('actorId').get_group(actorId)
nrows = int(np.ceil(data['movieId'].unique().shape[0]/2))
fig = plt.figure(fstart+actorId, figsize=(10,nrows*4), dpi=75)
for idx, (movieId, sub_data) in enumerate(data.groupby('movieId')):
ax = fig.add_subplot(nrows,2,idx+1)
ax1 = ax.twinx()
orig = sub_data.groupby('type').get_group('original')
sub_data = sub_data.drop(orig.index)  # avoid an inplace drop on a groupby slice
movieName = get_movieName(movieId)
movieDate = get_movieDate(movieId)
xticks = [x+' ('+str(y)+')' for x, y in zip(sub_data['type'], sub_data['count'])]
ax.plot(xticks, sub_data['average'], label='Average', color='tab:blue',
marker='o')
ax.plot(xticks, sub_data['median'], label='Median', color='tab:orange',
marker='o')
ax1.plot(xticks, sub_data['std'], color='tab:green', label='STD',
marker='o')
ax.axhline(orig['average'].values[0], color='tab:blue', linestyle=(0, (5, 6)))
ax.axhline(orig['median'].values[0], color='tab:orange', linestyle=(0, (5, 6)))
ax1.axhline(orig['std'].values[0], color='tab:green', linestyle=(0, (5, 6)))
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax1.get_legend_handles_labels()
custom = [Line2D([0], [0], color='k', linestyle='--')]
handles = h1+h2+custom
labels = l1+l2+['Original']
ax.legend(handles, labels, bbox_to_anchor=(1.15,1), loc='upper left')
ax.set_ylabel('Rating / out of 5')
ax1.set_ylabel('STD')
ax.set_title('{:s} ({:d})'.format(movieName, movieDate.year))
for label in ax.get_xticklabels():
label.set_rotation(40)
label.set_horizontalalignment('right')
fig.tight_layout()
return fig
def gen_tab_object(func, df, fstart):
    '''
    Build an ipywidgets Tab with one Output tab per lead actor,
    each holding the figure produced by `func` for that actor.
    '''
    names = ['Daniel Craig', 'Pierce Brosnan', 'Roger Moore', 'Sean Connery',
             'Timothy Dalton', 'George Lazenby', 'David Niven']
    outputs = []
    for name in names:
        out = widgets.Output()
        with out:
            plt.show(func(name, df=df, fstart=fstart))
        outputs.append(out)
    tabs = widgets.Tab(children=outputs)
    for i, name in enumerate(names):
        tabs.set_title(i, name)
    return tabs
Parse the data#
[3]:
dfs = {}
dirs = dict(large='ml-latest')
files = ['franchise_movies.csv', 'franchise_tags.csv', 'links.csv', 'ratings.csv']
for key, parent in dirs.items():
dfs[key] = {}
for fn in os.listdir(parent):
if not fn.endswith('.csv'): continue
if fn not in files: continue
sub_key = fn.replace('.csv', '')
dtypes = dict(userId=int, movieId=int,
imdbId=str, tmdbId=str)
filename = os.path.join(parent, fn)
dfs[key][sub_key] = pd.read_csv(filename, dtype=dtypes)
all_data = dfs
[4]:
main = 'imdb-data'
parent = 'title-ratings'
dtypes = dict(tconst=str, numVotes=int)
fn = os.path.join(main, parent, 'data.tsv')
df = pd.read_csv(fn, sep='\t', dtype=dtypes)
# IMDb IDs look like 'tt0123456'; strip the 'tt' prefix so they match
# the imdbId strings from the MovieLens links table
df['tconst'] = df['tconst'].apply(lambda x: x[2:])
df = df.merge(all_data['large']['links'], left_on='tconst', right_on='imdbId', how='right')
all_data['imdb-ratings'] = df
Convert timestamps to human-readable dates#
[5]:
data_df = all_data['large'].copy()
data_df['franchise_movies']['date'] = pd.to_datetime(data_df['franchise_movies']['releaseDate'], unit='s')
Analysis of return on investment#
[6]:
indv_roi = get_roi(data_df['franchise_movies'])
[7]:
jb_movies = indv_roi.groupby('seriesId').get_group(10).copy()
jb_movies['date'] = pd.to_datetime(jb_movies['releaseDate'], unit='s')
jb_movies.sort_values(by=['roi'], ascending=False)[['title', 'date', 'budgetTotal', 'profit', 'roi']].reset_index(drop=True)
[7]:
 | title | date | budgetTotal | profit | roi |
---|---|---|---|---|---|
0 | Dr. No | 1963-05-08 | 1000000 | 58567035 | 5856.703500 |
1 | Goldfinger | 1964-12-22 | 3000000 | 121900000 | 4063.333333 |
2 | From Russia with Love | 1964-04-08 | 2000000 | 76900000 | 3845.000000 |
3 | Live and Let Die | 1973-06-27 | 7000000 | 154800000 | 2211.428571 |
4 | Diamonds Are Forever | 1971-12-17 | 7200000 | 108799985 | 1511.110903 |
5 | Thunderball | 1965-12-29 | 9000000 | 132200000 | 1468.888889 |
6 | The Man with the Golden Gun | 1974-12-20 | 7000000 | 90600000 | 1294.285714 |
7 | The Spy Who Loved Me | 1977-07-13 | 14000000 | 171400000 | 1224.285714 |
8 | You Only Live Twice | 1967-06-13 | 9500000 | 102100000 | 1074.736842 |
9 | On Her Majesty's Secret Service | 1969-12-18 | 8000000 | 74000000 | 925.000000 |
10 | For Your Eyes Only | 1981-06-26 | 28000000 | 167300000 | 597.500000 |
11 | Octopussy | 1983-06-10 | 27500000 | 160000000 | 581.818182 |
12 | Moonraker | 1979-06-29 | 31000000 | 179300000 | 578.387097 |
13 | Goldeneye | 1995-11-17 | 60000000 | 296429933 | 494.049888 |
14 | Casino Royale | 2006-11-17 | 102000000 | 492420216 | 482.764918 |
15 | Skyfall | 2012-11-08 | 200000000 | 910526981 | 455.263490 |
16 | A View to a Kill | 1985-05-24 | 30000000 | 122627960 | 408.759867 |
17 | The Living Daylights | 1987-07-31 | 40000000 | 151199996 | 377.999990 |
18 | Never Say Never Again | 1983-10-07 | 36000000 | 124000000 | 344.444444 |
19 | Licence to Kill | 1989-07-14 | 42000000 | 114167015 | 271.826226 |
20 | Casino Royale | 1967-04-28 | 12000000 | 29744718 | 247.872650 |
21 | Tomorrow Never Dies | 1997-12-19 | 110000000 | 229504276 | 208.640251 |
22 | Die Another Day | 2002-11-22 | 142000000 | 289942139 | 204.184605 |
23 | No Time to Die | 2021-10-08 | 250000000 | 509959662 | 203.983865 |
24 | Spectre | 2015-11-06 | 300000000 | 579077344 | 193.025781 |
25 | The World is Not Enough | 1999-11-19 | 135000000 | 226730660 | 167.948637 |
26 | Quantum of Solace | 2008-11-14 | 230000000 | 361692078 | 157.257425 |
What we can see is that many of the older James Bond movies have a much higher return on investment than the newer ones, with Quantum of Solace performing the worst. Dr. No, for example, turned a budget of $1 million into a profit of roughly $58.6 million, an ROI of about 5,857%. Could this stem from modern action movies being more expensive to produce?
[8]:
fig = plt.figure(1, figsize=(6,6), dpi=75)
ax = fig.add_subplot(111)
ax2 = ax.twinx()
ax.axhline(0, color='k', linewidth=0.7)
ax.plot(jb_movies['date'], jb_movies['roi'], 'o', label='Return on Investment')
ax2.plot(jb_movies['date'], jb_movies['profit']/1e6, '^', color='tab:red', label='Profit')
ax.set_xlabel('Date')
ax.set_ylabel('Return on investment / %')
ax2.set_ylabel('Profit (not adjusted for inflation) / 10$^6$ USD')
ax.set_title('Return on investment for James Bond movies')
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
handles = h1 + h2
labels = l1 + l2
ax.legend(handles, labels, bbox_to_anchor=(1.01,1), loc='upper left')
align_y_axis(ax, ax2)

Here we see a plot of the return on investment for the James Bond series in blue and the profit in red. While the ROI trends downward, the profit (not adjusted for inflation) goes up significantly. So let's take a look at how the ratings for the movies have changed over time.
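As an aside, one could roughly deflate the profits to a common-year basis before comparing eras. Here is a minimal sketch; the cpi_multiplier values are illustrative placeholders, not real CPI figures:

# Deflate nominal profit to a rough common-year basis.
# The multipliers below are made-up placeholders, NOT real CPI data.
cpi_multiplier = {1964: 9.8, 2012: 1.3}
nominal_profit = {1964: 121900000, 2012: 910526981}  # Goldfinger, Skyfall (from the table above)
for year, profit in nominal_profit.items():
    print(year, round(profit * cpi_multiplier[year] / 1e6), 'M USD, adjusted')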
Ratings per lead actor#
[9]:
df1 = jb_movies.copy()
df1.reset_index(drop=True, inplace=True)
tmp = data_df['ratings'].merge(df1[['movieId', 'title', 'releaseDate', 'date', 'roi']], on='movieId')
jb_ratings = tmp.merge(data_df['franchise_tags'][['movieId', 'actor', 'actorId']], on='movieId')
jb_ratings['reviewDate'] = pd.to_datetime(jb_ratings['timestamp'], unit='s')
[10]:
tabs = gen_tab_object(generate_ratings, jb_ratings, 2)
tabs
[10]:
Unfortunately, the data set we got from MovieLens is incomplete for all but the James Bond movies starring Daniel Craig and Pierce Brosnan, since the collected reviews only begin in the mid-1990s and the older movies were released long before that. Overall, however, there does not seem to be a consistent trend in the ratings that would explain either the decrease in return on investment or the increase in profit.
A quick look through Wikipedia (this in particular) shows a steady increase in the salaries of the lead actors. If we factor in the increased use of VFX and stunts, that might be enough to explain why the movies no longer have as big a return on investment, while the advancement in the realism of VFX could be a contributing factor to the increase in profits over time.
One last thing that I wanted to look at is the stability of the data, i.e. how the data used above compares with other data sets. I will compare the average ratings to those from IMDb and check how far the average sits from the median.
Comparing to IMDb data#
[11]:
grouped = jb_ratings.groupby('movieId')
count = grouped['rating'].count().rename('count')
all_time_avg = grouped['rating'].mean().rename('average')
all_time_median = grouped['rating'].median().rename('median')
all_time_std = grouped['rating'].std().rename('std')
df = pd.concat([all_time_avg, all_time_median, all_time_std, count], axis=1)
df = df.merge(all_data['imdb-ratings'][['averageRating', 'numVotes', 'movieId']], on='movieId')
df = df.merge(jb_movies[['title', 'movieId', 'date']], on='movieId')
df['title'] = df['title'] + df['date'].apply(lambda x: ' ('+str(x.year)+')')
df['averageRating'] /= 2  # IMDb ratings are out of 10; halve to match the 5-star MovieLens scale
df['numVotes'] = df['numVotes'].astype(int)
rcols = dict(averageRating='imdbRating', numVotes='imdbVotes')
df.rename(columns=rcols).set_index('title').sort_values(by=['date']).drop(['movieId', 'date'], axis=1)
[11]:
title | average | median | std | count | imdbRating | imdbVotes |
---|---|---|---|---|---|---|
Dr. No (1963) | 3.665154 | 4.00 | 0.835024 | 9694 | 3.60 | 175470 |
From Russia with Love (1964) | 3.688400 | 4.00 | 0.849023 | 9586 | 3.65 | 141772 |
Goldfinger (1964) | 3.729858 | 4.00 | 0.865900 | 15527 | 3.85 | 198327 |
Thunderball (1965) | 3.599912 | 3.50 | 0.846492 | 5690 | 3.45 | 124284 |
Casino Royale (1967) | 2.877232 | 3.00 | 1.054743 | 1120 | 2.50 | 31603 |
You Only Live Twice (1967) | 3.593695 | 3.50 | 0.802609 | 3394 | 3.40 | 114921 |
On Her Majesty's Secret Service (1969) | 3.428017 | 3.50 | 0.932231 | 3605 | 3.35 | 96895 |
Diamonds Are Forever (1971) | 3.495077 | 3.50 | 0.852021 | 5992 | 3.25 | 111678 |
Live and Let Die (1973) | 3.488243 | 3.50 | 0.857393 | 6294 | 3.35 | 112948 |
The Man with the Golden Gun (1974) | 3.464959 | 3.50 | 0.852398 | 5037 | 3.35 | 110691 |
The Spy Who Loved Me (1977) | 3.531111 | 3.50 | 0.840140 | 5609 | 3.50 | 113778 |
Moonraker (1979) | 3.158547 | 3.00 | 0.939335 | 6014 | 3.10 | 106341 |
For Your Eyes Only (1981) | 3.437548 | 3.50 | 0.868478 | 5212 | 3.35 | 106002 |
Octopussy (1983) | 3.310463 | 3.50 | 0.881729 | 2332 | 3.25 | 110735 |
Never Say Never Again (1983) | 3.266388 | 3.50 | 0.931870 | 1556 | 3.05 | 71465 |
A View to a Kill (1985) | 3.182722 | 3.00 | 0.920832 | 4526 | 3.15 | 102587 |
The Living Daylights (1987) | 3.290004 | 3.00 | 0.890285 | 2731 | 3.35 | 103645 |
Licence to Kill (1989) | 2.625000 | 2.75 | 1.187735 | 8 | 2.75 | 1028 |
Goldeneye (1995) | 3.434835 | 3.50 | 0.878003 | 34942 | 3.60 | 265904 |
Tomorrow Never Dies (1997) | 3.230008 | 3.00 | 0.935325 | 15919 | 3.25 | 201493 |
The World is Not Enough (1999) | 3.204092 | 3.00 | 0.952916 | 11583 | 3.20 | 206889 |
Die Another Day (2002) | 3.086181 | 3.00 | 0.999052 | 8720 | 3.05 | 225752 |
Casino Royale (2006) | 3.837693 | 4.00 | 0.862241 | 28517 | 4.00 | 680473 |
Quantum of Solace (2008) | 3.300951 | 3.50 | 0.935187 | 10196 | 3.30 | 463211 |
Skyfall (2012) | 3.741676 | 4.00 | 0.890015 | 15918 | 3.90 | 718361 |
Spectre (2015) | 3.392451 | 3.50 | 0.936909 | 5630 | 3.40 | 457116 |
No Time to Die (2021) | 3.484932 | 3.50 | 0.884875 | 1626 | 3.65 | 430310 |
The overall conclusion is that the MovieLens data does not seem to have a large skew, albeit with a smaller sample size than IMDb. The exception is Licence to Kill (1989), which has only 8 votes, and that is reflected in its larger standard deviation compared to the rest. It would be interesting to see how the standard deviation and median compare for the IMDb data, but unfortunately I don't have access to that data.
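As a quick numeric check of that skew claim, here is a minimal sketch reusing the summary frame df built in the cell above:

# Crude skew indicator: how far each movie's mean rating sits from its median.
skew_proxy = (df['average'] - df['median']).rename('mean_minus_median')
print(skew_proxy.abs().max())  # the largest mean/median gap across all movies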
See how normalized the ratings are#
Something else that we could look at is how the individual ratings are distributed around the average. So we will plot the votes as a histogram and see how closely it follows a bell curve.
[12]:
tabs = gen_tab_object(generate_hist, jb_ratings, 9)
tabs
[12]:
Note: Average is represented by the solid black line and the median is the dashed black line.
What we can see here is that for most of the plots the maximum of the probability density function lines up pretty well with the bins holding the highest number of votes. For the movies starring Daniel Craig the distribution is fairly even, whereas for movies starring someone like Pierce Brosnan there are peaks and valleys in the data, for example in Goldeneye (1995), which has the highest number of votes of any James Bond film here. This does not necessarily point to any issues with the data set, but I wonder if there might be some kind of bias towards whole numbers among the voters. The same trend shows up in some of the movies with Roger Moore and Sean Connery as James Bond.
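One quick way to probe that suspicion is to count how many ratings land exactly on a whole star. A minimal sketch on the jb_ratings frame built earlier:

# Fraction of ratings that are a whole number of stars, per lead actor.
is_whole = jb_ratings['rating'] % 1 == 0
print(is_whole.groupby(jb_ratings['actor']).mean().sort_values(ascending=False))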
Comparison of voting with different kinds of viewers#
Here I just want to look at differences in voting between casual watchers and fans. I am using the following definitions (see the pd.cut sketch after the next cell for a one-pass alternative):
casual: people that have rated at most 2 movies
story fans: people that have rated at least 3 but fewer than 8 movies
super story fans: people that have rated at least 8 but fewer than 16 movies
godly fans: people that have rated at least 16 movies
super godly fans: people that have rated at least 22 movies
[13]:
df = jb_ratings.copy()
count = df.groupby('userId')['rating'].count().rename('count')
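For reference, the same user groups could be assigned in a single pass with pd.cut. A sketch using the count series from the cell above (here godly and super godly are made disjoint for illustration, whereas the cells below let them overlap):

# Assign each user to one viewer group based on how many movies they rated.
bins = [0, 2, 7, 15, 21, float('inf')]
labels = ['casual', 'story', 'super story', 'godly', 'super godly']
user_group = pd.cut(count, bins=bins, labels=labels)
print(user_group.value_counts())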
Casual group#
[14]:
ids = count[count <= 2]
casual_df = df.merge(ids, on='userId', how='right')
[15]:
tabs = gen_tab_object(generate_hist, casual_df, 9)
tabs
[15]:
Story fans group#
[16]:
ids = count[count.between(3, 7)]
stfans_df = df.merge(ids, on='userId', how='right')
[17]:
tabs = gen_tab_object(generate_hist, stfans_df, 9)
tabs
[17]:
Super story fans group#
[18]:
ids = count[count.between(8, 15)]
spstfans_df = df.merge(ids, on='userId', how='right')
[19]:
tabs = gen_tab_object(generate_hist, spstfans_df, 9)
tabs
[19]:
Godly fans group#
[20]:
ids = count[count >= 16]
godfans_df = df.merge(ids, on='userId', how='right')
[21]:
tabs = gen_tab_object(generate_hist, godfans_df, 9)
tabs
[21]:
Super Godly fans group#
[22]:
ids = count[count >= 22]
spgodfans_df = df.merge(ids, on='userId', how='right')
[23]:
tabs = gen_tab_object(generate_hist, spgodfans_df, 9)
tabs
[23]:
Based on what we see here, the distributions become a bit more normal-looking and no longer seem to favor whole numbers for the ratings. This does not necessarily mean that the data is better than before, because there may now be a bias: some of the groups, like super godly, are made up of people that have watched many James Bond movies and may no longer represent the general public.
Putting it all together#
Here I will compile all of the data from above and try to make a solid judgement.
[24]:
dfs = {'original': jb_ratings, 'casual': casual_df, 'story': stfans_df,
'super story': spstfans_df, 'godly': godfans_df,
'super godly': spgodfans_df}
arr = []
for key, val in dfs.items():
grouped = val.groupby(['actorId', 'movieId'])
avg = grouped['rating'].mean().rename('average')
med = grouped['rating'].median().rename('median')
std = grouped['rating'].std().rename('std')
cnt = grouped['rating'].count().rename('count')
df = pd.concat([avg, med, std, cnt], axis=1)
df['type'] = key
df.reset_index(inplace=True)
arr.append(df)
df = pd.concat(arr, ignore_index=True)
combined_df = df
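Before plotting, here is a quick numeric look at the whole-star tendency per viewer group, as a minimal sketch over the dfs dict defined above:

# Fraction of whole-star ratings in each viewer group.
for key, val in dfs.items():
    frac_whole = (val['rating'] % 1 == 0).mean()
    print('{:>12s}: {:.1%} whole-star ratings'.format(key, frac_whole))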
[25]:
tabs = gen_tab_object(generate_lineplot, combined_df, 9)
tabs
[25]:
Here we see that, for the most part, limiting the voters to the criteria I outlined before does not have as big an effect as I originally thought. We do see that the standard deviation decreases for some movies, while for others there is no improvement. This is not to say that people who have watched more James Bond movies make "better" reviewers; it is just that users who have watched more of the movies may look for different things in a movie, such as the story, visual effects, action sequences, or villains, to name a few, than those who watch them less often.
Conclusion#
I am happy to claim that the data set I am using, while not necessarily the largest (there are only 8 votes for Licence to Kill (1989)), has ratings similar to those from IMDb, with median values not far from the averages. In addition, the distribution of the votes looks reasonably normal. Finally, we checked how the number of James Bond movies a user has rated affects the averages, medians, and standard deviations, and found no strong dependence, although the distributions become a bit more normal-looking and stop favoring whole-number ratings. Grouping the users like this, however, may no longer describe the entire population well, as it adds a bias to the results. It is still an interesting trend to see.
So, based on the data I have, I cannot conclude what caused the huge decrease in the return on investment for James Bond movies over the years. Only when I went to Wikipedia and did some manual digging did I come across something that could be a factor: the salary of the lead actors.
Thank you for coming along with me on this fun journey!!