Data Visualization

Putting US Tax Reform on the Map

Major income tax changes in the US proposed by Republicans have been big news recently.  The impact will be significant for nearly every taxpayer, creating winners and losers in varying degrees.  The complexity of the federal tax system and the many variables involved in filing a return mean there is no one-size-fits-all formula for assessing the impact on any given wallet.

We will start with the most basic taxation scenario, simplify it some more, and present some findings on a map like this one:


The savings are highest in certain Northeast and Mid-Atlantic states, hovering around $600 per year.  This correlates with higher median wages in the region.  Savings are lower in the South and Midwest.  They fall under $300 in Mississippi, where the median annual pay of $24,000 is about $11,000 less than in the high-wage states.  But you should ask, who are the people on this map anyway?

Each one is a composite person between the ages of 18 and 65, unmarried with no dependents, earning the median wage for the state they live in.  They also rent their home, and all of their income comes from wages as an employee.  State income taxes don’t come into play at any of the median wages used here.  All players on the board take the standard deduction under both existing and proposed rules.

This is about as simple as it gets, other than not needing to file taxes at all.  I should also note that incomes below $10,650 were excluded from the dataset before calculating the median, since income under that amount is not taxable.  All of this data was distilled from the 2013 American Community Survey, courtesy of the US Census Bureau, with about 261,000 respondents all told.  Real wages haven’t exactly skyrocketed in recent years, so 2013 vintage income data is good enough for this purpose.
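
For readers who want to try the filtering and median step themselves, here is a minimal sketch.  It is not the code from my Kaggle notebook: the `(state, wage)` record layout, the function name, and the sample rows are all made up for illustration.  Only the $10,650 cutoff comes from the description above.

```python
from collections import defaultdict
from statistics import median

TAXABLE_FLOOR = 10_650  # incomes below this are excluded, per the text


def median_wage_by_state(records, floor=TAXABLE_FLOOR):
    """Return {state: median wage} after dropping incomes below `floor`.

    `records` is an iterable of (state, wage) pairs; this layout is an
    assumption, not the actual survey file format.
    """
    wages = defaultdict(list)
    for state, wage in records:
        if wage >= floor:  # keep only incomes at or above the cutoff
            wages[state].append(wage)
    return {state: median(values) for state, values in wages.items()}


# Made-up rows for illustration only, not survey data.
sample = [
    ("MS", 9_000), ("MS", 20_000), ("MS", 24_000), ("MS", 30_000),
    ("MA", 5_000), ("MA", 30_000), ("MA", 40_000),
]
medians = median_wage_by_state(sample)
```

Dropping the low incomes before taking the median matters: with them left in, the median would drift toward people who owe no tax under either set of rules.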


Things change when homeownership enters the picture.  For this scenario we will deduct 30% of the median income for mortgage interest and property taxes, as well as factor in state income or sales taxes (which apply in every state except Alaska).  Now the tax savings are greatly reduced under the new rules, and in fact some would pay more under the new system.  Some of this impact on homeowners can be attributed to the proposed system eliminating the deduction for state taxes.  Also, unlike for non-homeowners, the tax changes are less favorable in the high-wage Northeast and Mid-Atlantic states than elsewhere.  The loss is as much as $226 per year in Massachusetts at the median wage level.
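
The shape of that comparison can be sketched roughly as follows.  This is an illustration under loud assumptions, not the notebook code: the two rule sets use invented round numbers rather than real tax law, the function and variable names are hypothetical, and state taxes and personal exemptions are left out entirely.  The only figure taken from the scenario above is the 30% itemized share.

```python
def tax_owed(income, deduction, brackets):
    """Progressive tax on income after subtracting a deduction.

    `brackets` is a list of (lower_bound, rate) pairs sorted by lower_bound.
    """
    taxable = max(0.0, income - deduction)
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        # The bracket ends where the next one starts (or never, for the last).
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if taxable > lower:
            tax += (min(taxable, upper) - lower) * rate
    return tax


HOMEOWNER_SHARE = 0.30  # from the text: 30% of median income


def homeowner_savings(income, current, proposed):
    """Tax under `current` minus tax under `proposed`.

    Positive means the homeowner saves; negative means they pay more.
    """
    itemized = HOMEOWNER_SHARE * income
    taxes = []
    for rules in (current, proposed):
        # Take whichever deduction is larger under this rule set.
        deduction = max(rules["standard_deduction"], itemized)
        taxes.append(tax_owed(income, deduction, rules["brackets"]))
    return taxes[0] - taxes[1]


# PLACEHOLDER rule sets: invented round numbers, NOT real tax law.
RULES_A = {"standard_deduction": 10_000, "brackets": [(0, 0.10), (20_000, 0.20)]}
RULES_B = {"standard_deduction": 20_000, "brackets": [(0, 0.12)]}

savings = homeowner_savings(50_000, RULES_A, RULES_B)
```

Swapping in actual bracket tables and deduction amounts for each rule set, and a real median wage per state, is what would turn this toy into the numbers behind a map.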


This is done with Python code in Kaggle, where there are also interactive versions of the above maps.  The homeowner map was generated here.  Details about the source data and how it was processed can be found in these links as well.  I am hoping this will inspire others who enjoy analyzing data to try it out with other tax scenarios.  Examples include a married couple, added dependent children, and different income brackets.  I believe more maps of this sort will arm taxpayers in the US with a better understanding of any new tax law proposals.


Data Science, Data Visualization

Big Data Or Big Hype?

We have certainly heard a lot about Big Data in recent years, especially with regard to data science and machine learning.  Just how large a data set constitutes Big Data?  How much data science and machine learning work involves truly stratospheric volumes of bits and bytes?  There’s a survey for that, courtesy of Kaggle.

Several thousand data scientists responded to a variety of questions covering many facets of their work.  Just over 7000 respondents gave the size of data sets used in their training models.  They were also asked about the number of employees and industry segment for their organization.  Kaggle provides these results in their entirety as CSV files, and a means to analyze and visualize them within their web-based Kernel environment.  I went with Python, as is my usual custom when I do a Kaggle Kernel run.

Let’s cut to the chase and get to the main takeaway.  Behold, the chart:


It seems that the vast majority of data science and machine learning action happens below 10 GB.  Granted, the question concerned model training, so one might suppose that live production data sets are significantly larger.  Less than 2 percent of respondents claimed to be wrangling data sets in excess of 100 TB.
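
Percentages like these come from a simple tally of answers per size category.  Here is a minimal sketch of that tally; the category labels, the function name, and the sample answers are made up for illustration and are not the survey's actual wording or results.

```python
from collections import Counter


def size_shares(responses):
    """Return {category: percent of non-blank answers}."""
    # Skip blank or missing answers so they don't dilute the shares.
    counts = Counter(r for r in responses if r)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up answers for illustration only, not survey data.
sample = [
    "<1GB", "<1GB", "<1GB",
    "1-10GB", "1-10GB",
    "10-100GB", "10-100GB",
    ">100TB",
    "", None,  # respondents who skipped the question
]
shares = size_shares(sample)
```

Excluding the skipped answers from the denominator is a choice worth stating whenever you report a share, since it changes every percentage on the chart.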


Among the top five industries that employ data science practitioners, we can compare usage of large and moderately-sized data sets.  Unsurprisingly, tech companies are the leading industry segment.



Another viewpoint is by employer size.  Very large organizations comprise the largest segment, followed closely by those with 20 to 500 employees.  One might suspect that startups are a good percentage of the latter group.


This is merely the tip of the iceberg for how we can slice, dice, and visualize with this data alone.  There are dozens more answers in the survey, covering job titles, tools and technologies used, salary, and much more collected from these data scientists.  Check out my Kaggle Kernel to explore further, or contact me to inquire about a customized analysis of the 2017 Machine Learning and Data Science Survey results tailored to your needs.

Stay tuned for more updates from the world of VenaData by following me on Twitter.