Commentary, Data Science

Any Major Data Dude Will Tell You

Apologies to Steely Dan.  I am referencing a song of theirs which mentions a mythical creature called a squonk.  (And yes, Genesis has a Squonk song as well.)  Anyway, the beast is so ashamed of its appearance that it wallows in constant sorrow.  Kind of like the way many long-time database professionals feel these days.  We SQL folks didn’t get asked to dance, let alone be crowned king or queen, at the data science homecoming this year.  Looks like we’ll get to stay home on prom night as well, while the people with the Sexiest Job of the 21st Century party on.

Note, this article is filled with nothing but opinions and some things that might generously be called observations. Almost no facts were gathered, and everything stated has been greatly exaggerated.  Oh, and some folks might get really put off by some things I’ll say here.  Be warned.

While the frozen wind whips off The Lake in this hinterland, where I remain in exile from all things bleeding edge, I contemplate my future as a data professional.  Thought leadership can come from a very dark place sometimes, so brace yourself for the next few sentences.  We are in a Data Science Bubble.  It’s gonna burst and leave behind scattered remains of algorithms and shattered dreams carrying advanced degrees.  This is not a new idea, as noted in this warning from Oracle back in 2014 that so far seems to have fallen on deaf ears.  Nobody got the memo and the hype goes on unabated.  

If you come from a background similar to mine, with DBA, data warehousing, data analyst, or even data architect sprinkled all over your resume, you might see data science as the next logical career progression.  Between this much-hyped new kind of data wrangler, and the rise of full-stack ninjas with NoSQL chops, today’s DBA is in a squeeze.  Better find new work right quick.  

Check out this choice quote from Alok Kumar, in his rather poignant article:

“Being a Data Analyst and also a DBA for sometime now, I realized that I have already been around petabytes of data till date, so the transition shouldn’t be that difficult with my expertise in SQL, Excel and reporting tools like PowerBI, SAP BO, etc.”

Data scientists in the corporate world certainly appear to bask in job perks and respect that have long since been stripped away from other data professionals.  Not to mention you can turn off your phone at night and on weekends, since there’s no on-call.  Surely the years of SQL wizardry, savvy data modelling, and many terabytes of data wrangled will mean something when aspiring to this fabulous new career.  Just need to grab some Python and get hip to a few machine learning libraries, and we’re riding the gravy train!

Whoa, hold on a minute there.  Where’s your PhD in statistics, or better yet, physics?  Never mind that, look at all those people blogging, with the pretty charts!  They got all of that just from a handful of massively open night classes, and so can you!  Yes, you and the 29,000 other people this month enrolled in the exact same five-week course.  And that’s just one out of about 129 other programs being offered simultaneously.  Every week, all year long.  OK, I admit I just pulled those numbers out of nowhere, but have you ever looked at the size of student bodies at actual universities?  Have you accounted for the fact these populations are spread out over every major from Accounting to Zoology?  And yet every week it seems an entire university-sized crowd “graduates” in this one niche specialization, which in turn is a subset of a relatively narrow field. 

If you don’t find this believable, don’t look at me.  It’s just an off-the-cuff hunch I have.  Go get the data yourself and run the numbers.  You’re the budding data wrangler after all, aren’t you?  The point is, that’s a whole lot of newly minted data scientists soon to hit the streets to collect on their golden meal ticket.  Sure, some will wash out after discovering they can’t spell “algorithm” or “regression”.  Others will get lost in the labyrinth of installing all those libraries and frameworks and other shiny things, only to never come back.  Dealing with dirty, messy data all the time will certainly spur on a few retirements.  But many will make it (or fake it) and be extremely anxious to show the world who the boss is now.

Some of this new stuff is hard to learn, which may seem perplexing to those who spent countless years working on relational databases.  SQL folks used to their old ways may have a tough row to hoe in this transition.  Again I refer to the brilliant insight of Alok Kumar, the one who tells it like it is:

“This was extremely depressing for me because instead of coming closer to my king-size lifestyle and fancy perks, I was getting even more distant from it.”

You can’t be on LinkedIn or Medium without getting confronted with a deluge of thought leadership pieces and tutorials appearing to report from the front lines of data science.  The bandwagon is getting pretty full and is about to tip over.  At first it seemed like data science would be the answer to everything.  By now, though, I’m sure at least a few companies with buyer’s remorse will be glad to finally take their ivory tower bit bucket wizards down a peg or two.  Paying big salaries and professional respect to a mere geek is anathema to execs and shareholders alike, once they are done obtaining whatever competitive advantage they were after.  Off to the open office boiler room with you, where we put all those Java guys and Oracle DBAs years ago.

Commoditization and consolidation always follow every big hype trend requiring an initially scarce skill set.  Usually it hits right after the poor saps trying to avoid layoffs in another functional area, or who are simply chasing easy money, hop on to ride the trend.

Well, we’ve beaten the supply side of data science workers to death.  What about the demand?  First off, forget the great need regarding Big Data.  Overwhelmingly large data sets are the exception rather than the rule, even within the data science community.  Beyond that, much of what a company wants, unless it is a Google or an Amazon, could easily still be served with Business Intelligence tools.  All of the established big vendors in the BI space, and some small ones, are upping their game to package data science capabilities into their tool sets.  The functionality will be automatic and the plumbing neatly hidden.  Savvy business users will be able to leverage these tools to achieve insights without needing a resident data scientist.  Traditional DBAs and the like will be needed to keep these tools fed and happy with clean data on fast, reliable systems.

Moreover, organizations face many hurdles that are decidedly non-technical when attempting to leverage their data science assets.  Management might not know what to actually do with this new-found talent to create something productive.  Politics, budget constraints, poorly trained support staff, and inadequate data infrastructure all serve to impede data science efforts.  Things like this will pile up and eventually create a backlash from the top brass who have yet to see any results.  In the end they will stop trying to build it, and just buy it.

This coming home to roost will inevitably be perfectly timed with the arrival of a massive horde trained for, and wanting to do nothing but, data science.  Right around this time the blogging focused on this area will peak at an all-time high.  Only this will stem from all of the underemployed people with extra time on their hands, desperately harping away on social media to market themselves.  The vendors in this segment clogging the market with their wares will undoubtedly be doing their part to keep the hype machine fed.  All of this buzz will make the casual observer believe things are better than OK, when in fact the crash is well under way.

One last word from our good friend Alok:

“because I chose the wrong side, as the only thing I (unconsciously)focused on was the Data Scientist Salary, associated perks,recognition and a lavish life ahead.”

So let this serve as a warning before you abandon your years of hard-won knowledge and experience to chase a new trend.  By all means pick up a little Python, R, stats, or machine learning.  Such things are fun and ultimately will serve you well.  But remember what your core skills and strengths are. SQL and traditional databases are not going away anytime soon.

Chances are if you got in and stayed this long as a DBA or data warehouse person or whatever, it’s because you love working with data and have a lot of tenacity.  Turns out these will always be among the most important qualities for staying power in this field, whatever it ends up turning into. This goes back to when it was COBOL, for those who remember the 80s, and is certainly true today in data science.  It’s going to remain true when this iteration of hype dies down and a brand new trend in the world of data emerges.

Companies will always have a need for people with meat-and-potatoes, down-to-earth data skills like SQL or BI to keep their operations going and gain insights for growth.  VenaData exists in large part to fulfill this exact need, for clients who seek tangible value from data.

performance tuning, SQL Server

Fix SQL Server Error 1919 During Index Create

While working on a Microsoft SQL Server database tuning project with a client, I encountered a puzzling error message. Based on performance metrics and an analysis of query execution plans, I had recommended a couple of new indexes. The index creation scripts looked similar to the following (I redacted the actual object names here):

CREATE NONCLUSTERED INDEX [IX_tblA_colX] ON [dbo].[tblA] 
    ([ColX] ASC) ON [PRIMARY]
GO

The first index went in just fine, while the second failed with an error message similar to this one:

Msg 1919, Level 16, State 1, Line 1
Column 'ColX' in table 'dbo.tblA' is of a type that is invalid for use as a key column in an index.

This error commonly occurs when the key column for the index is of type NVARCHAR(MAX), which is not supported, or if the index key exceeds the maximum key size (900 bytes for a clustered index; 1,700 bytes for a non-clustered index on SQL Server 2016 and later). Neither of these conditions was true, as the column was defined with something on the order of NVARCHAR(150), well within the constraints for a non-clustered index key.
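A quick way to confirm what SQL Server currently records for the column is to query the catalog views. The following is a minimal sketch that reuses the redacted names from above (dbo.tblA and ColX are placeholders, not the client's real objects):

```sql
-- Show the declared type and byte length for the column.
-- max_length is reported in bytes: -1 means a (MAX) type,
-- while NVARCHAR(150) shows 300 (2 bytes per character).
SELECT c.name       AS column_name,
       t.name       AS type_name,
       c.max_length AS max_length_bytes,
       c.is_nullable
FROM sys.columns AS c
JOIN sys.types   AS t
  ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.tblA')
  AND c.name = N'ColX';
```

If this reports nvarchar with a max_length of 300, the catalog already reflects the new definition, which points the investigation elsewhere.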

One fact I was already aware of is that this column had actually started out as an NVARCHAR(MAX). In fact there were many columns like this across the entire database, which I suspect was a result of using Microsoft Entity Framework to develop the application. On my advice the client had converted all of these due to all of the performance issues they were experiencing.

Converting the column type was straightforward using this SQL statement:

ALTER TABLE [dbo].[tblA]
    ALTER COLUMN [ColX] [nvarchar](150) NOT NULL
GO

So how is it that this error message was encountered and we could not create the index? This one nearly had me stumped. I was fairly certain this had something to do with the data type conversion somehow appearing incomplete from the viewpoint of the index creation statement.

I suggested to my client that they run two T-SQL maintenance commands similar to the following:

UPDATE STATISTICS dbo.tblA WITH FULLSCAN

Followed by:

DBCC UPDATEUSAGE (mydb, 'dbo.tblA')

Once these commands were finished, creating the index succeeded.

Evidently what had happened is that either the table column statistics or space allocation had not updated to reflect the column data type change. By default SQL Server maintains this information automatically, but on large, busy databases this should be taken care of proactively. This is typically done with a weekly or even nightly SQL Server Agent job. I subsequently learned that my client up to this point had no such maintenance job in place. They were in startup mode on a project that had gone live fairly recently. As is commonly the case, they did not have an in-house DBA and relied on application developers to run the database.

Typically, out-of-date statistics result in less-than-optimal query performance. Inaccurate space allocation usually shows up when reporting the size of your tables. Getting an error message when creating an index due to these conditions is extremely rare, and did not turn up in my online search. Fortunately I was able to draw upon my many years of experience as a SQL Server DBA to make an educated guess and root out the problem.
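If you suspect stale statistics on a table, one way to check is to look at when each statistics object was last updated. This is only a sketch, again using the placeholder name dbo.tblA:

```sql
-- List every statistics object on the table with its last update time.
-- A NULL last_updated means the statistics have never been updated.
SELECT s.name                              AS stats_name,
       s.auto_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.tblA')
ORDER BY last_updated;
```

Dates that are weeks or months old on a busy table are a hint that no regular maintenance is running.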

The best defense against this type of issue is to always implement a good scheduled maintenance regimen in your database environment. Automatic backups go without saying regardless of the kind of database you use (SQL or NoSQL). In the case of MS SQL Server this regularly scheduled maintenance also includes index defragmentation and statistics updates. An excellent script for doing this can be found here.
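As a rough illustration of what such a job might run for a single table, here is a bare-bones sketch. It is not the script referenced above, and a production routine would normally check fragmentation levels and loop over all tables. The table name dbo.tblA is a placeholder.

```sql
-- Defragment all indexes on the table in place.
-- REORGANIZE is an online operation and can be interrupted safely.
ALTER INDEX ALL ON dbo.tblA REORGANIZE;

-- Refresh all statistics on the table by reading every row.
UPDATE STATISTICS dbo.tblA WITH FULLSCAN;

-- Alternative for the whole database: update only the statistics
-- that SQL Server considers out of date, using default sampling.
-- EXEC sp_updatestats;
```

Statements like these would typically be scheduled as a SQL Server Agent job step during a quiet period, since a full scan on a large table can take a while.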

Rick Ciesla owns and operates VenaData, a provider of database administration solutions, services and consulting. Contact Rick for more information.
