The word analytics is among the most discussed subjects in the IT industry, and its sibling term Big Data invariably comes along with it. Every time we meet customers who have a BI or analytics implementation, we hear about performance issues as well. Everything starts very well, but as days go by, the dashboards become slow. This draws attention because senior management actively uses those dashboards.
There are three top items one must take care of while implementing data analytics.
Comparing OLTP infrastructure investment with analytics infrastructure investment
Management constantly compares the OLTP-side infrastructure investment with the analytics infrastructure investment. On the OLTP side, one will not have a long tail of historical data, but for analytics, historical data is important. Moreover, if you use machine learning packages, you definitely need historical data. Hence the hardware needs are higher than those of the OLTP system.
Management must understand one thing: a single vital insight out of a terabyte of data can be the holy grail. To find it, extract it, and use it, you must invest. Data is your biggest asset. The valuation of Uber is more than that of GM or Ford, yet Uber does not make cars, does not own cars, does not employ drivers, and does not own maps either! The only hook they have is consumer data, and that drives up their value. So invest in good infrastructure for your analytics work; otherwise it will become very slow and the project will end up on hold.
Miscalculation of initial data volume and incremental data volume
Most consultants pilot analytics projects with just gigabytes of data and then try to take the same design to production. We have seen just two motors generate a million performance records per day through IoT devices. The time window within which incremental data must be ingested is also a critical factor; in many cases, incremental loads must complete within minutes. Miscalculating these has ripple effects on disk sizing, data distribution logic, table partitioning logic, data retrieval logic, and so on. So ensure you calculate both the initial and incremental volumes correctly, and test with that much dummy data before you go live.
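A back-of-the-envelope sizing calculation like the one above can be sketched in a few lines. The figures below (record size, retention period) are hypothetical assumptions for illustration, not measurements from any real system:

```python
# Hypothetical sizing sketch: estimate daily and total storage from device
# count, record rate, record size, and retention. All inputs are assumed
# values for illustration only.

def estimate_volume(devices, records_per_device_per_day, bytes_per_record,
                    retention_days):
    """Return (daily_gb, total_gb) for the given workload."""
    daily_bytes = devices * records_per_device_per_day * bytes_per_record
    total_bytes = daily_bytes * retention_days
    gb = 1024 ** 3
    return daily_bytes / gb, total_bytes / gb

# Two motors, one million records each per day (as in the example above),
# ~200 bytes per record, kept for three years (hypothetical figures).
daily_gb, total_gb = estimate_volume(
    devices=2,
    records_per_device_per_day=1_000_000,
    bytes_per_record=200,
    retention_days=3 * 365,
)
print(f"daily: {daily_gb:.2f} GB, total: {total_gb:.0f} GB")
```

Even this tiny two-device example accumulates hundreds of gigabytes over three years, which is exactly the kind of growth a gigabyte-scale pilot never exposes.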
Old mindset of using a regular RDBMS for analytics as well
Familiarity with the RDBMS leads teams to continue using it for analytics as well. The moment data lakes are created, most teams fail to properly run ETL on that data and fix data issues. An RDBMS needs indexes, and building an index over hundreds of millions of records is a time-consuming process. So one must definitely unlearn RDBMS habits and move to columnar data stores.
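The row-versus-column distinction behind this advice can be illustrated with a toy layout comparison. This is a minimal sketch using only the Python standard library; real columnar engines add compression, vectorized execution, and predicate pushdown on top of the same idea:

```python
# Toy contrast between a row-oriented layout (each record kept together,
# as an RDBMS page would) and a column-oriented layout (one compact array
# per column). Data values are synthetic, for illustration only.
from array import array

rows = [{"motor_id": i % 2, "rpm": 1000 + i % 50, "temp_c": 40 + i % 10}
        for i in range(100_000)]

columns = {
    "motor_id": array("i", (r["motor_id"] for r in rows)),
    "rpm":      array("i", (r["rpm"] for r in rows)),
    "temp_c":   array("i", (r["temp_c"] for r in rows)),
}

# A typical analytics aggregate touches one column. The row layout must
# walk every record and pull one field out of each; the columnar layout
# scans a single contiguous array, which is far friendlier to caches and
# needs no index to be fast.
avg_rpm_rows = sum(r["rpm"] for r in rows) / len(rows)
avg_rpm_cols = sum(columns["rpm"]) / len(columns["rpm"])
assert avg_rpm_rows == avg_rpm_cols
```

The same aggregate gives the same answer either way; the difference is how much data each layout must read to produce it, which is why column stores dominate scan-heavy analytics workloads.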
If you ignore the above three points, you will hear about performance issues from your customers regardless of the hardware you use. Guaranteed.
Perform Faster.