Tuesday, May 7, 2013

Meaning of NoSQL and Big Data for Software Engineers


The term NoSQL, read as "No SQL", doesn't convey the real meaning of the concept behind it. After all, SQL stands for Structured Query Language. It is a language for querying relations, based on relational algebra and relational calculus. A query language has no bearing on the kind of processing that NoSQL implies. In fact, you can use SQL to query your "NoSQL/Big Data", as with Hive and its SQL-like query language.

The term NoSQL read as "Not Only SQL" is closer to the intended meaning, but it still doesn't really convey what NoSQL is all about. It is not about what language you use to query your data.

"Big Data" is not a very meaningful term either. Sure, the data might be large, but you can have a NoSQL problem with small data (at least in today's relative terms).

IMHO, NoSQL intends to say that your data is not ACID. ACID, as in Atomicity, Consistency, Isolation, and Durability, has been the cornerstone of transactional databases. In "NoSQL" you are dealing with persisted data without strict guarantees on its atomicity, consistency, isolation, and/or durability. In other words, you have noisy data with duplication, inconsistency, and loss. The goal of NoSQL is to develop software that can work with such data. Even if your data is small but noisy, the standard algorithms that work on ACID data will not yield useful results. To them, non-ACID data is garbage, and you end up with a garbage-in-garbage-out dilemma.
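To make the garbage-in-garbage-out point concrete, here is a minimal sketch in Python. The event log, the record layout, and the function names `naive_total` and `tolerant_total` are all made up for illustration; the point is only that code written for clean data breaks on a duplicated or incomplete record, while code that expects noise can still produce an answer.

```python
# Hypothetical event log of (event_id, amount) pairs.
# Event 2 was delivered twice and event 4 lost its amount,
# the kind of noise you may see without ACID guarantees.
records = [(1, 10.0), (2, 20.0), (2, 20.0), (3, 30.0), (4, None)]


def naive_total(rows):
    """Sum amounts, assuming every row is unique and complete."""
    return sum(amount for _, amount in rows)


def tolerant_total(rows):
    """Sum amounts, keeping one row per id and skipping missing amounts."""
    first_seen = {}
    for event_id, amount in rows:
        if amount is not None:
            # Keep only the first complete copy of each event.
            first_seen.setdefault(event_id, amount)
    return sum(first_seen.values())


try:
    print("naive:", naive_total(records))
except TypeError as err:
    # The missing amount makes the "clean data" version fail outright.
    print("naive failed:", err)

print("tolerant:", tolerant_total(records))
```

Note that the tolerant version silently ignores event 4. Whether that is acceptable depends on the application, which is exactly why the noise has to be reasoned about rather than just filtered away.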

A better way to think of NoSQL is as the problem of inferring the underlying model in the data, predicting future data, and/or making decisions, all using noisy data (of any size). That is the problem the statistics community has addressed with Bayesian analysis. The challenge for software developers tackling NoSQL/Big Data problems is to understand statistical analysis and incorporate it into their applications. A good place to start is the excellent encyclopedic write-up by Professor David Draper, Bayesian Statistics, or his priceless in-depth lectures on the topic, Bayesian Modeling, Inference, Prediction and Decision-Making.
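As a small taste of what this looks like in code, here is a sketch of the simplest Bayesian model I know, the Beta-Binomial: estimating a success rate from noisy yes/no observations. This is my own toy example, not taken from Professor Draper's material. The observations, the uniform Beta(1, 1) prior, and the function names are assumptions for illustration, and dropping the missing values assumes they are missing at random.

```python
# Hypothetical noisy observations: True/False outcomes, None where the
# record was lost. Seven successes, three failures, one missing value.
observations = [True, True, False, True, None, True,
                False, True, True, False, True]


def beta_posterior(data, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior for the success rate.

    Starts from a Beta(prior_a, prior_b) prior and adds the observed
    counts. Missing values (None) are skipped.
    """
    successes = sum(1 for x in data if x is True)
    failures = sum(1 for x in data if x is False)
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Posterior mean of the rate, which is also the predictive
    probability that the next observation is a success."""
    return a / (a + b)


def beta_variance(a, b):
    """Posterior variance: how uncertain we still are about the rate."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


a, b = beta_posterior(observations)
print("posterior: Beta(%g, %g)" % (a, b))
print("estimated rate: %.3f" % beta_mean(a, b))
print("variance: %.4f" % beta_variance(a, b))
```

The useful part is that the answer is a distribution rather than a single number: the variance tells you how much to trust the estimate, and it shrinks as more data arrives, however noisy.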
