We have certainly heard a lot about Big Data in recent years, especially with regard to data science and machine learning. Just how large a data set constitutes Big Data? How much data science and machine learning work involves truly stratospheric volumes of bits and bytes? There’s a survey for that, courtesy of Kaggle.
Several thousand data scientists responded to a variety of questions covering many facets of their work. Just over 7000 respondents gave the size of the data sets they use to train their models. They were also asked about the number of employees and the industry segment of their organization. Kaggle provides these results in their entirety as CSV files, along with a means to analyze and visualize them within its web-based Kernel environment. I went with Python, as is my usual custom when I do a Kaggle Kernel run.
Let’s cut to the chase and get to the main takeaway. Behold, the chart:
It seems that the vast majority of data science and machine learning action happens below 10 GB. Granted, the question concerned model training, so one might suppose that live production data sets are significantly larger. Fewer than 2 percent of respondents claimed to be wrangling data sets in excess of 100 TB.
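A tally like this takes only a few lines of Python. What follows is a minimal sketch, not the actual Kernel code: the column names `WorkDatasetSize` and `EmployerIndustry`, the bucket labels, and the handful of sample rows are all assumptions made for illustration, so check them against the real CSV header before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the survey CSV. The column names and bucket
# labels are assumptions; the real file has thousands of rows.
SAMPLE_CSV = """EmployerIndustry,WorkDatasetSize
Technology,1GB
Technology,10GB
Financial,1GB
Academic,
Technology,100TB
Financial,<1MB
"""


def size_shares(csv_file, column="WorkDatasetSize"):
    """Return the fraction of answering respondents in each size bucket."""
    reader = csv.DictReader(csv_file)
    # Skip respondents who left the question blank.
    answers = [(row.get(column) or "").strip() for row in reader]
    counts = Counter(a for a in answers if a)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


shares = size_shares(io.StringIO(SAMPLE_CSV))
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:>6}: {share:.0%}")
```

To run it against the real data, replace the `io.StringIO` object with an open file handle on the downloaded CSV.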
Among the top five industries that employ data science practitioners, we can compare usage of large and moderately sized data sets. Unsurprisingly, tech companies are the leading industry segment.
Another viewpoint is by employer size. Very large organizations make up the largest segment, followed closely by those with 20 to 500 employees. One might suspect that startups account for a good percentage of the latter group.
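The industry and employer-size breakdowns both come down to grouping respondents and asking what share of each group reports large data sets. Here is a minimal sketch under stated assumptions: the column names `EmployerSize` and `WorkDatasetSize`, the group labels, the sample rows, and the choice of which buckets count as "large" are all hypothetical placeholders, not values taken from the survey.

```python
from collections import defaultdict


def large_share_by_group(rows, group_col, size_col, large_labels):
    """For each group, return the fraction of respondents whose
    data set size falls in one of the buckets named in large_labels."""
    totals = defaultdict(int)
    large = defaultdict(int)
    for row in rows:
        group = (row.get(group_col) or "").strip()
        size = (row.get(size_col) or "").strip()
        if not group or not size:
            continue  # ignore respondents who skipped either question
        totals[group] += 1
        if size in large_labels:
            large[group] += 1
    return {group: large[group] / totals[group] for group in totals}


# Hypothetical rows shaped like the output of csv.DictReader.
sample_rows = [
    {"EmployerSize": "10,000 or more employees", "WorkDatasetSize": "100TB"},
    {"EmployerSize": "10,000 or more employees", "WorkDatasetSize": "1GB"},
    {"EmployerSize": "20 to 499 employees", "WorkDatasetSize": "1GB"},
    {"EmployerSize": "20 to 499 employees", "WorkDatasetSize": "10GB"},
    {"EmployerSize": "20 to 499 employees", "WorkDatasetSize": ""},
]

by_size = large_share_by_group(
    sample_rows, "EmployerSize", "WorkDatasetSize", large_labels={"100TB"}
)
for group, share in by_size.items():
    print(f"{group}: {share:.0%} report large data sets")
```

Passing a different `group_col`, such as an industry column, gives the per-industry comparison with the same function.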
This is merely the tip of the iceberg for how we can slice, dice, and visualize this data. There are dozens more answers in the survey, covering job titles, tools and technologies used, salary, and much more collected from these data scientists. Check out my Kaggle Kernel to explore further, or contact me to inquire about a customized analysis of the 2017 Machine Learning and Data Science Survey results tailored to your needs.
Stay tuned for more updates from the world of VenaData by following me on Twitter.