Commentary, Data Science

Any Major Data Dude Will Tell You

Apologies to Steely Dan.  I am referencing a song of theirs which mentions a mythical creature called a squonk.  (And yes, Genesis has a Squonk song as well.)  Anyway, the beast is so ashamed of its appearance that it wallows in constant sorrow.  Kind of like the way many long-time database professionals feel these days.  We SQL folks didn’t get asked to dance, let alone be crowned king or queen, at the data science homecoming this year.  Looks like we’ll get to stay home on prom night as well, while the people with the Sexiest Job of the 21st Century party on.

Note, this article is filled with nothing but opinions and some things that might generously be called observations. Almost no facts were gathered, and everything stated has been greatly exaggerated.  Oh, and some folks might get really put off by some things I’ll say here.  Be warned.

While the frozen wind whips off The Lake in this hinterland, where I remain in exile from all things bleeding edge, I contemplate my future as a data professional.  Thought leadership can come from a very dark place sometimes, so brace yourself for the next few sentences.  We are in a Data Science Bubble.  It’s gonna burst and leave behind scattered remains of algorithms and shattered dreams carrying advanced degrees.  This is not a new idea, as noted in this warning from Oracle back in 2014 that so far seems to have fallen on deaf ears.  Nobody got the memo and the hype goes on unabated.  

If you come from a background similar to mine, with DBA, data warehousing, data analyst, or even data architect sprinkled all over your resume, you might see data science as the next logical career progression.  Between this much-hyped new kind of data wrangler, and the rise of full-stack ninjas with NoSQL chops, today’s DBA is in a squeeze.  Better find new work right quick.  

Check out this choice quote from Alok Kumar, in his rather poignant article:

“Being a Data Analyst and also a DBA for sometime now, I realized that I have already been around petabytes of data till date, so the transition shouldn’t be that difficult with my expertise in SQL, Excel and reporting tools like PowerBI, SAP BO, etc.”

Data scientists in the corporate world certainly appear to bask in job perks and respect that have long since been stripped away from other data professionals.  Not to mention you can turn off your phone at night and on weekends, since there’s no on-call.  Surely the years of SQL wizardry, savvy data modelling, and many terabytes of data wrangled will mean something when aspiring to this fabulous new career.  Just need to grab some Python and get hip to a few machine learning libraries, and we’re riding the gravy train!

Whoa, hold on a minute there.  Where’s your PhD in statistics, or better yet, physics?  Never mind that, look at all those people blogging, with the pretty charts!  They got all of that just from a handful of massively open night classes, and so can you!  Yes, you and the 29,000 other people this month enrolled in the exact same five-week course.  And that’s just one out of about 129 other programs being offered simultaneously.  Every week, all year long.  OK, I admit I just pulled those numbers out of nowhere, but have you ever looked at the size of student bodies at actual universities?  Have you accounted for the fact these populations are spread out over every major from Accounting to Zoology?  And yet every week it seems an entire university-sized crowd “graduates” in this one niche specialization, which in turn is a subset of a relatively narrow field. 

If you don’t find this believable, don’t look at me.  It’s just an off-the-cuff hunch I have.  Go get the data yourself and run the numbers.  You’re the budding data wrangler after all, aren’t you?  The point is, that’s a whole lot of newly minted data scientists soon to hit the streets to collect on their golden meal ticket.  Sure, some will wash out after discovering they can’t spell “algorithm” or “regression”.  Others will get lost in the labyrinth of installing all those libraries and frameworks and other shiny things, only to never come back.  Dealing with dirty, messy data all the time will certainly spur on a few retirements.  But many will make it (or fake it) and be extremely anxious to show the world who the boss is now.

Some of this new stuff is hard to learn, which may seem perplexing to those who spent countless years working on relational databases.  SQL folks used to their old ways may have a tough row to hoe in this transition.  Again I refer to the brilliant insight of Alok Kumar, the one who tells it like it is:

“This was extremely depressing for me because instead of coming closer to my king-size lifestyle and fancy perks, I was getting even more distant from it.”

You can’t be on LinkedIn or Medium without being confronted by a deluge of thought leadership pieces and tutorials appearing to report from the front lines of data science.  The bandwagon is getting pretty full and about to tip over.  At first it seemed like data science would be the answer to everything.  By now, though, I’m sure at least a few companies with buyer’s remorse will be glad to finally take their ivory tower bit bucket wizards down a peg or two.  Paying big salaries and professional respect to a mere geek is anathema to execs and shareholders alike, once they are done obtaining whatever competitive advantage they were after.  Off to the open office boiler room with you, where we put all those Java guys and Oracle DBAs years ago.

Commoditization and consolidation always follow every big hype trend requiring an initially scarce skill set.  Usually it hits right after the poor saps trying to avoid layoffs in another functional area, or who are simply chasing easy money, hop on to ride the trend.

Well, we’ve beaten the supply side of data science workers to death; what about the demand?  First off, forget the supposedly great need driven by Big Data.  Overwhelmingly large data sets are the exception rather than the rule, even within the data science community.  Beyond that, much of what a company wants, unless it is a Google or an Amazon, could easily still be served with Business Intelligence tools.  All of the established big vendors in the BI space, and some small ones, are upping their game to package data science capabilities into their tool sets.  The functionality will be automatic and the plumbing neatly hidden.  Savvy business users will be able to leverage these tools to achieve insights without needing a resident data scientist.  Traditional DBAs and the like will be needed to keep these tools fed and happy with clean data on fast, reliable systems.

Moreover, organizations face many hurdles that are decidedly non-technical when attempting to leverage their data science assets.  Management might not know what to actually do with this new-found talent to create something productive.  There are politics, budget constraints, poorly trained support staff, or inadequate data infrastructure that all serve to impede data science efforts.  Things like this will pile up and eventually create a backlash from the top brass who have yet to see any results.  In the end they will stop trying to build it, and just buy it.

This coming home to roost will inevitably be perfectly timed with the arrival of a massive horde trained for, and wanting to do nothing but, data science.  Right around this time the blogging focused on this area will peak at an all-time high.  Only this will stem from all of the underemployed people with extra time on their hands, desperately harping away on social media to market themselves.  The vendors in this segment clogging the market with their wares will undoubtedly be doing their part to keep the hype machine fed.  All of this buzz will make the casual observer believe things are better than OK, when in fact the crash is well under way.

One last word from our good friend Alok:

“…because I chose the wrong side, as the only thing I (unconsciously) focused on was the Data Scientist Salary, associated perks, recognition and a lavish life ahead.”

So let this serve as a warning before you abandon your years of hard-won knowledge and experience to chase a new trend.  By all means pick up a little Python, R, stats, or machine learning.  Such things are fun and ultimately will serve you well.  But remember what your core skills and strengths are. SQL and traditional databases are not going away anytime soon.

Chances are if you got in and stayed this long as a DBA or data warehouse person or whatever, it’s because you love working with data and have a lot of tenacity.  Turns out these will always be among the most important qualities for staying power in this field, whatever it ends up turning into. This goes back to when it was COBOL, for those who remember the 80s, and is certainly true today in data science.  It’s going to remain true when this iteration of hype dies down and a brand new trend in the world of data emerges.

Companies will always have a need for people with meat-and-potatoes, down-to-earth data skills like SQL or BI to keep their operations going and gain insights for growth.  VenaData exists in large part to fulfill this exact need, for clients who seek tangible value from data.

Data Visualization

Putting US Tax Reform on the Map

Major income tax changes in the US proposed by Republicans have been big news recently.  The impact will be significant for nearly every taxpayer, creating winners and losers in varying degrees. The complexity of the Federal tax system and numerous variables when filing returns result in no one-size-fits-all formula to assess the impact on any given wallet.

We will start with the most basic taxation scenario, simplify it some more, and present some findings on a map like this one:


The savings are highest in certain Northeast and Mid-Atlantic states, hovering around $600 per year.  This correlates with higher median wages in the region.  In the South and Midwest, it’s less. The savings are under $300 in Mississippi, where the median annual pay of $24,000 is about $11,000 less than in the high-wage states.  But you should ask, who are the people on this map anyway?

Each is a composite person between the ages of 18 and 65, unmarried with no dependents, earning the median wage for the state they live in.  They also rent their home and get all of their income as an employee earning a wage.  State income taxes don’t come into play at any of the median wages used here.  All players on the board take the standard deduction under both existing and proposed rules.
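Under those assumptions, the per-state comparison boils down to a small progressive tax calculation run twice, once per rule set.  Here is a minimal sketch; the brackets, rates, and deduction amounts below are purely illustrative placeholders, not the actual existing or proposed figures:

```python
def tax_owed(income, standard_deduction, brackets):
    """Progressive tax on income after subtracting the standard deduction.

    brackets is a list of (upper_limit, rate) pairs in ascending order.
    """
    taxable = max(income - standard_deduction, 0)
    tax, lower = 0.0, 0.0
    for upper, rate in brackets:
        if taxable <= lower:
            break
        tax += (min(taxable, upper) - lower) * rate
        lower = upper
    return tax

# illustrative numbers only -- not the actual current or proposed rules
old_rules = (6350, [(9325, 0.10), (37950, 0.15), (float('inf'), 0.25)])
new_rules = (12000, [(45000, 0.12), (float('inf'), 0.25)])

median_wage = 35000  # stand-in for one state's median wage
savings = tax_owed(median_wage, *old_rules) - tax_owed(median_wage, *new_rules)
```

Running this for each state’s median wage and joining the results to a geographic identifier is all the map needs.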

This is about as simple as it gets, short of not needing to file taxes at all.  I should also note that incomes below $10,650 were excluded from the dataset before calculating the median, since income under that threshold is not taxable.  All of this data was distilled from the 2013 American Community Survey, courtesy of the US Census Bureau.  About 261,000 respondents all told.  Real wages haven’t exactly skyrocketed in recent years, so 2013 vintage income data is good enough for this purpose.


Things change when homeownership enters the picture.  For this scenario we will deduct 30% of the median incomes for mortgage interest and property taxes, as well as factor in state income or sales taxes (which apply in every state except Alaska).  Now the tax savings are greatly reduced under the new rules, and in fact some people would pay more under the new system.  Some of this impact on homeowners can be attributed to the elimination of state tax deductions under the proposed system.  Also, unlike for non-homeowners, the tax changes are less favorable in the Northeast and Mid-Atlantic high-wage states than elsewhere.  The loss is as much as $226 per year in Massachusetts at the median wage level.


This is done with Python code in Kaggle, where there are also interactive versions of the above maps.  The homeowner map was generated here.  Details about the source data and how it was processed can be found in these links as well.  I am hoping this will inspire others who enjoy analyzing data to try it out with other tax scenarios.  Such examples could be a married couple, adding dependent children, and using different income brackets.  I believe more maps of this sort will arm taxpayers in the US with a better understanding of any new tax law proposals.


Data Science, Data Visualization

Big Data Or Big Hype?

We have certainly heard a lot about Big Data in recent years, especially with regards to data science and machine learning.  Just how large of a data set constitutes Big Data?  What amount of data science and machine learning work involves truly stratospheric volumes of bits and bytes?  There’s a survey for that, courtesy of Kaggle.

Several thousand data scientists responded to a variety of questions covering many facets of their work.  Just over 7000 respondents gave the size of data sets used in their training models.  They were also asked about the number of employees and industry segment for their organization.  Kaggle provides these results in their entirety as CSV files, and a means to analyze and visualize them within their web-based Kernel environment.  I went with Python, as is my usual custom when I do a Kaggle Kernel run.
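As a rough sketch of the kind of tally behind the chart below (the column name and sample answers here are made up, not the survey’s actual schema), the whole analysis is essentially a value_counts call on the dataset-size question:

```python
import pandas as pd

# made-up sample of survey answers; the real survey CSV uses different
# column names and answer buckets
responses = pd.DataFrame({'TrainingDataSize': ['<1GB', '1-10GB', '<1GB',
                                               '10-100GB', '>100TB', '1-10GB']})

counts = responses['TrainingDataSize'].value_counts()
share = counts / counts.sum() * 100  # percentage of respondents per size bucket
print(share)
```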

Let’s cut to the chase and get to the main takeaway.  Behold, the chart:


It seems that the vast majority of data science and machine learning action happens below 10 GB.  Granted, the question was with regards to model training, so one might suppose that live production data sets are significantly larger.  Less than 2 percent of respondents claimed to be wrangling data sets in excess of 100 TB.


Among the top five industries that employ data science practitioners, we can compare usage of large and moderately-sized data sets.  Unsurprisingly, tech companies are the leading industry segment.



Another viewpoint is by employer size.  Very large organizations comprise the largest segment, followed closely by those with 20 to 500 employees.  One might suspect that startups are a good percentage of the latter group.


This is merely the tip of the iceberg for how we can slice, dice, and visualize with this data alone.  There are dozens more answers in the survey, covering job titles, tools and technologies used, salary, and much more collected from these data scientists.  Check out my Kaggle Kernel to explore further, or contact me to inquire about a customized analysis of the 2017 Machine Learning and Data Science Survey results tailored to your needs.

Stay tuned for more updates from the world of VenaData by following me on Twitter.

Data Science

LEGO Data Science

When not working with data I can often be found building something with LEGO.  How pleased was I to find a Kaggle dataset with a complete inventory of LEGO sets and parts going back to 1950!  The other cool thing about Kaggle is that you can analyze it with Python or R, right in the browser.  So let’s take it for a spin and do some data wrangling!

Data work in Kaggle is done in Kernels, where you can run scripts interactively and see results immediately.  Upload your data or pick an existing Kaggle dataset, then choose from R, Python or Julia for coding, and get to it!  It’s all in the cloud and your browser, so no painful install process or library dependency issues to go through.  Under the hood Kernels are actually Jupyter notebooks, so it’s all standard stuff.

I am particularly fond of LEGO Technic, which is aimed at building mechanized models of things like cars or construction vehicles.  Technic parts are more specialized than your typical LEGO bricks, and more than a few are specific to a handful of sets. I wanted to write a Kernel for analyzing the distribution of Technic parts across LEGO sets.

To get right to the Kernel go here, else continue below where I cover some key things regarding the Python code and process.

To start with we will be using the Pandas and NumPy libraries in Python to do our data wrangling.  These get imported into the Kernel as follows:

import numpy as np
import pandas as pd

In Pandas we use a data structure called a DataFrame, which comes with a long list of methods to work with data.  You can think of DataFrames like a SQL table or spreadsheet.  To get a DataFrame going, we can load it up right from a CSV file:

parts = pd.read_csv('../input/parts.csv')

Note that the file path is actually the default location inside the Kernel where the data files are placed.  Commands similar to the above load the other files we need:

sets = pd.read_csv('../input/sets.csv')
inventories = pd.read_csv('../input/inventories.csv')
inv_parts = pd.read_csv('../input/inventory_parts.csv')

It is often helpful to view DataFrames to verify what’s in there.  This command displays the column names and the first five lines:
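In Pandas this is the DataFrame’s head method, shown here with a small stand-in for the parts data:

```python
import pandas as pd

# small stand-in for the parts.csv data loaded above
parts = pd.DataFrame({'part_num': ['3001', '3002', '3003'],
                      'name': ['Brick 2 x 4', 'Brick 2 x 3', 'Brick 2 x 2'],
                      'part_cat_id': [11, 11, 11]})

parts.head()  # shows the columns and up to the first five rows
```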



Next we filter the LEGO sets DataFrame to select only the ones having a Technic theme ID and that were released in recent years:

technic_sets = sets.loc[sets['year'].isin([2015,2016,2017]) &
                        (sets['theme_id'] == 1)]  # assuming theme ID 1 is Technic in this dataset

Python and R notebooks in Kaggle are pre-wired for visualizations.  Here we use matplotlib to do a simple bar chart:

import matplotlib.pyplot as plt
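The plotting call itself might look like the following minimal sketch, using made-up set numbers and part counts in place of the real technic_sets DataFrame:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# illustrative stand-in for the filtered technic_sets DataFrame
technic_sets = pd.DataFrame({'set_num': ['42056-1', '42065-1', '42070-1'],
                             'num_parts': [2704, 370, 1862]})

fig, ax = plt.subplots()
ax.bar(technic_sets['set_num'], technic_sets['num_parts'])
ax.set_ylabel('Number of parts')
ax.set_title('Parts per recent Technic set')
```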


To perform useful analysis, two Pandas DataFrames can be combined with a join, similar to SQL table joins.  Here the Technic sets and LEGO inventories DataFrames are joined, using the set number as the key column.  The resulting new DataFrame is then joined to the set inventories (which is a parts list).  The end product of these joins is essentially a flattened table of LEGO Technic sets with all of their parts and quantities.

# join Technic sets to their inventories on set number
technic_inv = technic_sets.set_index('set_num').join(inventories[['id','set_num']].set_index('set_num'))
# join Technic inventories to parts lists
technic_inv_parts = technic_inv.set_index('id').join(inv_parts.set_index('inventory_id'))


The describe method of a DataFrame provides a convenient summary of basic statistics.  Since this method attempts aggregations on every numeric column, including IDs, not every result will make sense; those can simply be ignored.  In this case, num_parts and quantity are the columns we want, for the total number of parts in each set and the individual part counts, respectively.  Here we get fun facts such as the mean, min, max, and percentiles.  We can readily see that for 75% of the Technic parts, we can expect to find a half dozen or fewer of each per set.
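As a sketch, with a tiny stand-in for the joined DataFrame, the call looks like this:

```python
import pandas as pd

# small stand-in for the joined technic_inv_parts DataFrame
technic_inv_parts = pd.DataFrame({'num_parts': [2704, 2704, 370, 370],
                                  'quantity': [1, 6, 2, 12]})

# count, mean, std, min, percentiles, and max for the columns of interest
summary = technic_inv_parts[['num_parts', 'quantity']].describe()
print(summary)
```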



To have more control over aggregation we can use the DataFrame groupby method.  Here we total up the quantities for each part from all of the sets.

part_sums = technic_inv_parts.groupby('part_num')['quantity'].sum()

The next set of statements performs a lot of work.  We create another DataFrame using a groupby, this time to get a count of sets each part appears in.  We then join this to the part count tallies from earlier.  Note that the column rename is required or a conflict will occur and cause an error.  We also want to have the part names and filter out any non-Technic pieces from the list.  We need to join to the parts DataFrame for both of those things.

parts_sets = technic_inv_parts.groupby('part_num')['name'].count()
parts_sets = pd.DataFrame(parts_sets)
parts_sets = parts_sets.rename(columns={'name': 'setsfoundin'})
parts_matrix = parts_sets.join(part_sums)
include_category = [12,52,40,29,26,25,22,55,51,53,30,46,54,8,45,44]
part_matrix_names = parts_matrix.join(parts.set_index('part_num'))
part_matrix_names = part_matrix_names.loc[part_matrix_names['part_cat_id'].isin(include_category)]


What we have been doing is creating, joining, and filtering DataFrames, using the result from each operation as input for the next step in the process.  This process can be referred to as a pipeline or a workflow.  By using Kaggle, or any platform that supports Jupyter notebooks (which could even be set up on your local machine), you can have this repeatable and easily editable workflow for experimental analysis on your data.

When you are done you will no doubt want to save your result.  When a command to save a DataFrame is in your notebook, you can access the file and download it from the Output tab after running your Kernel.
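For example, a to_csv call like the one below (the file name and the stand-in DataFrame are just illustrations) produces a file you can grab from the Output tab:

```python
import pandas as pd

# stand-in for the final parts matrix built above
part_matrix_names = pd.DataFrame({'part_num': ['32524', '3713'],
                                  'setsfoundin': [12, 30],
                                  'quantity': [140, 220]})

# index=False keeps the DataFrame's row index out of the CSV
part_matrix_names.to_csv('technic_part_matrix.csv', index=False)
```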


One final thing: we must have some kind of visualization for our result!  This data will appear too cluttered in a bar chart, so we will use a scatter plot.  We also filter the DataFrame to “zoom in” on a range of part quantity values to cut down on overplotting.  This aims to clarify the relationship between totals of LEGO parts across all sets and the number of Technic sets the parts appear in.  Note the use of the Seaborn visualization library, which is built on top of Matplotlib.

import seaborn as sea
partscatter = parts_matrix.loc[(parts_matrix['quantity'] > 15) & 
(parts_matrix['quantity'] < 160)] 
sea.regplot(x=partscatter["quantity"], y=partscatter["setsfoundin"])


View and run all the code here.