The Problem with Big Data and “Data Scientists”

I’m sure that simply appending the term “scientist” to someone who deals with data makes them feel important, given the almost priestly aura scientists are accorded in this society. The incomes are priestly too, for the most part: one would think many scientists had taken the same vow of poverty as Roman Catholic clerics. Perhaps this feeling of importance helps in rationalizing the poverty-level salaries many pure scientists are forced to subsist on.

However, as most people are beginning to realize, not only is “big data” difficult to define (other than that it’s data, and it’s big, which is more than obvious but says remarkably little), but “data scientist” is as good as meaningless.

The positive sciences begin with a posit: physics begins with the posit of motion, biology with that of life, etc. Within the scope of study determined by that posit, scientists of whatever stripe analyse data relevant to that study. But to call someone a “data scientist” as opposed to, say, a physicist, removes the scope within which the data the scientist studies is meaningful in any way.

Scientists, of whatever stripe, analyse “things”. Only insofar as a thing is determined to be such is the analysis analysing anything of any relevance. Yet we have trouble understanding what a thing is as such, and the equivocations we make regarding things are one of the greatest causes of confusion and bogus arguments in human interaction. First, outside the biological mode of experience, things are not equivalent to “bodies”. A body is something that would be real to an animal. As animals, we also have an interest in bodies, so much so that, whatever our mode of experience, we want the things of that context to be “really real”, which for the most part means that they correspond to a body. While it’s obvious that political things such as public opinion don’t correspond to a body, to an “out there now real” that we can touch, move around, etc., it can fail to be as obvious that the same is true of intellectual, analytic things, such as subatomic particles. The result is the kind of nonsense where particle physicists try to convince men of common sense that their things of common sense are “really” aggregations of subatomic particles.

Physicists are trained to understand and recognize things in physics. Mathematicians, likewise, recognize mathematical things. Politicians recognize political things and people of common sense recognize common sense things, albeit not as well as one might like. We all, as well as the higher animals, recognize biological things, or bodies, with the difference that we recognize them as such, i.e. we recognize milk as milk, whereas the kitten simply recognizes it and is drawn to ingest it. Yet for the most part we remain unaware that we do so, and even when we do think about it, we are most often at a loss to understand precisely how we do it. The technical term for determining a thing as such is ontology. “On” in Greek refers to being, and metaphysics takes this as the whatness of a thing, its essence.

We may seem to have taken a major detour from big data and the question of what a data scientist is and therefore what skills might be helpful for someone in that field, if indeed there is such a field, but bear with me for a bit longer.

I said earlier that intellectual things, as opposed to bodies, are unities grasped in data. Part of the reason things, even those of intellectual analysis, are so easily conflated with bodies is that, before there were technological instruments capable of gathering data inaccessible to an unaided human being, the unity that is grasped was already there initially: the thing was grasped as a unity prior to any data being gathered, grasped precisely as a body.

Big data is, by and large, bad data. What differentiates the sciences from one another (indeed makes different sciences necessary) is that each science works at a different generic level of comprehensiveness. Physics is the lowest level of comprehensiveness in terms of its description of reality. What can only appear as random interactions of particles at that level prove to be systemic at the level of chemistry, what appear to be random molecular interactions often prove to be systemic in the more comprehensive view of molecular biology, etc. Big data is big, with a few exceptions, because we have not been able to determine an appropriately more comprehensive view of it. This is largely because software engineers are not particularly good at finding unities in data. This hasn’t been such a problem in the past simply due to the significantly lower volume of data, but today it’s becoming a crucial lack. The majority of software engineers are primarily coders. Coders are very good at tactically dealing with data in a local fashion, but dealing with data at the more comprehensive levels, even in terms of instance state data for a single object, is largely beyond most coders. Hence the popularity of frameworks that do everything possible to avoid maintaining state (REST, anyone?), even though this contradicts the very definition of a computer program, which is a state machine. The usual result of sending a coder, rebadged as a data scientist, into a “data lake”, is a drowned coder.

So, if computer engineers, by and large, are unlikely to come up with reasonable solutions, because they have the wrong mindset to do so, who is? Here we bring in the notion of the “data scientist”, but that is no more than a scientist without a science.

The point of most projects dealing with data, whether fast data, big data, or plain old normalized relational data, is analysis (at least once the transaction that generates the data is complete). For something to become an object for analysis, first it has to be determined as a thing insofar as that, what and how it is. As I noted above, this ability is known as ontology. Further, “big data” needs to be ontologically modeled in an appropriate manner in order to yield appropriate and relevant things for analysis, which involves building more and more comprehensive views of data, omitting much of the detail (and also therefore the size) in the process, without omitting relevant detail. Ontology isn’t exactly a massively popular subject at post-secondary facilities these days, yet it is still taught. It also has another name, the name its originator, a Greek by the name of Aristotle, gave it: first philosophy. Ontologists are the only people educated and skilled in grasping unities in data without that data being already specific to a determinate field of study, such as physics, politics, or carpentry.

Not by accident, its originator also defined the posits that 2300 years later still largely define the sciences and their specific differences from one another, from physics to economics. Ontologists today include critical thinkers, phenomenologists and various others involved in the individual and social studies of philosophy within the humanities.
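The earlier point about building more comprehensive views of data, dropping detail (and size) without dropping what is relevant, can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sensor readings, the `summarize` function, and the choice of mean, maximum, and count as the “relevant detail” are all hypothetical, and deciding what actually counts as relevant is exactly the ontological work the code cannot do for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (sensor_id, minute, temperature_celsius).
readings = [
    ("s1", 0, 20.0), ("s1", 1, 20.5), ("s1", 2, 21.0),
    ("s2", 0, 35.0), ("s2", 1, 35.5), ("s2", 2, 90.0),
]

def summarize(rows):
    """Collapse per-minute rows into one record per sensor.

    The "thing" here is assumed to be the sensor, not the individual
    reading; the per-minute detail is deliberately discarded.
    """
    by_sensor = defaultdict(list)
    for sensor_id, _minute, temp in rows:
        by_sensor[sensor_id].append(temp)
    return {
        sensor_id: {"mean": mean(temps), "max": max(temps), "count": len(temps)}
        for sensor_id, temps in by_sensor.items()
    }

# Six raw rows become two records, one per sensor.
summary = summarize(readings)
```

The summary is smaller than the raw data, but whether it is a better view depends entirely on having chosen the right unity to begin with: if the thing of interest were the minute-by-minute spike rather than the sensor, this roll-up would have thrown away the relevant detail.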

