
Database performance benchmarks are useless

Can you name the fastest data analytics (databases, log analytics, etc.) product?

That’s a trick question. If you start Googling, you will inevitably stumble upon various companies, each claiming their product is the fastest. Sometimes you can even grab popcorn and enjoy two vendors fighting each other in the Internet courts, defending their claims. But marketing teams sometimes forget that there are no winners in the Internet courts. Or perhaps they know it very well, and that’s the point (to get more website hits)?

So, what should you do if you are looking for a fast analytics product?
That’s another trick question. The speed of a query should never be the only criterion for picking an analytics product. Don’t choose a product just because it is fast. Evaluate it holistically:

  • How easy is it to get data in? Do you need to create tables and manage schema? Or is the product schema-less?
  • What does the query experience look like? Does it provide a SQL interface? What if you need to query nested JSON? What about JSON arrays?
  • What does it take to operationalize it?
    • What’s the backup and restore experience like?
    • What metrics does it expose that you can use for optimizing queries/schema?
  • What’s the RBAC story?

Those are just a few of the questions to ask when selecting a product.
Performance benchmarks are useless for another vital reason: they don’t reflect performance on your datasets. Many standardized datasets are available, from NYC taxi to TPC, but I can’t recommend enough that when you benchmark, you use your own dataset. Much like every living being has a unique personality, every dataset has one too. It is your data, so the benchmarking should be against your data.
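
If you want a starting point, here is a minimal sketch of what "benchmark against your own data" can look like in Python. Everything specific in it is an assumption made for illustration: SQLite from the standard library stands in for whichever product you are evaluating, the `events` table and its generated rows are placeholders for a real sample of your data, and the two queries are hypothetical examples of the kind you would lift from your actual workload.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=5):
    """Run `sql` several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the full query cost is measured
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Placeholder data (assumption): replace this with a load of a real sample of your data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, status TEXT, latency_ms REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 100, "ok" if i % 7 else "error", float(i % 250)) for i in range(10_000)],
)

# Hypothetical queries: use the ones your team actually runs, not a generic suite.
queries = {
    "errors_per_user": (
        "SELECT user_id, COUNT(*) FROM events "
        "WHERE status = 'error' GROUP BY user_id"
    ),
    "avg_latency": "SELECT AVG(latency_ms) FROM events",
}

results = {name: time_query(conn, sql) for name, sql in queries.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.2f} ms (median of 5 runs)")
```

The median of several runs is used rather than a single timing so that one slow or cached run does not skew the result. The numbers mean nothing on their own; they only become useful once the same queries run against the same data on each product you are comparing.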