Data Science

LEGO Data Science

technic_car.jpgWhen not working with data I can often be found building something with LEGO.  How pleased was I to find a Kaggle dataset with a complete inventory of LEGO sets and parts going back to 1950!  The other cool thing with Kaggle is that you can analyze with Python or R, right in the browser.  So let’s take it for a spin and do some data wrangling!

Data work in Kaggle is done in Kernels, where you can run scripts interactively and see results immediately.  Upload your data or pick an existing Kaggle dataset, then choose from R, Python or Julia for coding, and get to it!  It’s all in the cloud and your browser, so no painful install process or library dependency issues to go through.  Under the hood Kernels are actually Jupyter notebooks, so it’s all standard stuff.

I am particularly fond of LEGO Technic, which is aimed at building mechanized models of things like cars or construction vehicles.  Technic parts are more specialized than your typical LEGO bricks, and more than a few are specific to a handful of sets. I wanted to write a Kernel for analyzing the distribution of Technic parts across LEGO sets.

To get right to the Kernel go here, else continue below where I cover some key things regarding the Python code and process.

To start with we will be using the Pandas and NumPy libraries in Python to do our data wrangling.  These get imported into the Kernel as follows:

import numpy as np
import pandas as pd

In Pandas we use a data structure called a DataFrame, which comes with a long list of methods to work with data.  You can think of DataFrames like a SQL table or spreadsheet.  To get a DataFrame going, we can load it up right from a CSV file:

parts = pd.read_csv('../input/parts.csv')

Note that the file path is actually the default location inside the Kernel where the data files are placed.  Commands similar to the above load the other files we need:

sets = pd.read_csv('../input/sets.csv')
inventories = pd.read_csv('../input/inventories.csv')
inv_parts = pd.read_csv('../input/inventory_parts.csv')

It is often helpful to view DataFrames to verify what’s in there.  This command displays the column names and the first five lines:

sets.head()

sets

Next we filter the LEGO sets DataFrame to select only the ones having a Technic theme ID and that were released in recent years:

technic_sets = sets.loc[sets['year'].isin([2015,2016,2017]) & 
sets['theme_id'].isin([1,11,12])]

Python and R notebooks in Kaggle are pre-wired for visualizations.  Here we use matplotlib to do a simple bar chart:

import matplotlib.pyplot
technic_sets.plot(kind="bar")

technic_sets_plot

To perform useful analysis, two Pandas DataFrames can be combined with a join, similar to SQL table joins.  Here the Technic sets and LEGO inventories DataFrames are joined, using the set number as the key column.  The resulting new DataFrame is then joined to the set inventories (which is a parts list).  The end product of these joins is essentially a flattened table of LEGO Technic sets with all of their parts and quantities.

technic_inv = technic_sets.set_index('set_num').join(inventories[['id','set_num']].set_index('set_num'))
# join Technic inventories to parts lists
technic_inv_parts = technic_inv.set_index('id').join(inv_parts.set_index('inventory_id'))
technic_inv_parts.head()

technic_inv_parts

The describe method of a DataFrame provides a convenient summary of basic statistics.  Since this method attempts to perform aggregations on any numeric columns, including IDs, not every result will make sense and can simply be ignored.  In this case, num_parts and quantity are the columns we want, for the total number of parts in each set and the individual part counts, respectively.  Here we get fun facts such as the mean, min, max, and percentiles.  We can readily see that for 75% of the Technic parts, we can expect to find a half dozen or less of each per set.

technic_inv_parts.describe()

describe_inv_parts

To have more control over aggregation we can use the DataFrame groupby method.  Here we total up the quantities for each part from all of the sets.

part_sums = technic_inv_parts.groupby('part_num')['quantity'].sum()

The next set of statements performs a lot of work.  We create another DataFrame using a groupby, this time to get a count of sets each part appears in.  We then join this to the part count tallies from earlier.  Note that the column rename is required or a conflict will occur and cause an error.  We also want to have the part names and filter out any non-Technic pieces from the list.  We need to join to the parts DataFrame for both of those things.

parts_sets = technic_inv_parts.groupby('part_num')['name'].count()
parts_sets = pd.DataFrame(parts_sets)
parts_sets = parts_sets.rename(columns={'name': 'setsfoundin'})
parts_matrix = parts_sets.join(part_sums)
include_category = [12,52,40,29,26,25,22,55,51,53,30,46,54,8,45,44]
part_matrix_names = parts_matrix.join(parts.set_index('part_num'))
part_matrix_names = part_matrix_names.loc[part_matrix_names['part_cat_id'].isin(include_category)]
part_matrix_names.head()

part_matrix

What we have been doing is creating, joining, and filtering DataFrames, using the result from each operation as input for the next step in the process.  This process can be referred to as a pipeline or a workflow.  By using Kaggle, or any platform that supports Jupyter notebooks (which could even be set up on your local machine), you can have this repeatable and easily editable workflow for experimental analysis on your data.

When you are done you will no doubt want to save your result.  When a command to save a DataFrame is in your notebook, you can access the file and download it from the Output tab after running your Kernel.

part_matrix_names.to_csv('part_matrix_names.csv')

One final thing, we must have some kind of visualization for our result!  This data will appear too cluttered in a bar chart, so we will use a scatter plot.  We also filter the DataFrame to “zoom in” on a range of values for part quantity to cut down on overplotting.  This aims to clarify the relationship between totals of LEGO parts across all sets and the number of Technic sets the parts appear in.  Note the use of the Seaborn visualization library, a variant of Matplotlib.

import seaborn as sea
partscatter = parts_matrix.loc[(parts_matrix['quantity'] > 15) & 
(parts_matrix['quantity'] < 160)] 
sea.regplot(x=partscatter["quantity"], y=partscatter["setsfoundin"])

scatterplot

View and run all the code here.