Big Data in the Maturity Stage

Posted By Jakub Nowacki, 15 April 2016

For business there is no valueless data, only data that has not been analysed appropriately and therefore has not been turned into information. The demand for analysts, and for effective tools to analyse vast amounts of data, is growing every year. In this respect, 2016 may bring solutions that make it easier to manage information and to classify and analyse it properly.

Less is more: less MapReduce, more SQL.

It has been 12 years since Dean et al. published the MapReduce model. While many of the ideas in that publication were great, translating algorithms into the MapReduce "language" is often highly complicated. Another source of the rift with the analytics environment was the need for multi-dimensional programming, which is usually not the analysts' domain. Moreover, MapReduce is a fairly low-level approach to data exploration, which in turn moves the person analysing the data away from the real business problem.
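
To see the friction, consider the canonical word count. Even in Spark's low-level RDD API, which is already friendlier than raw Hadoop MapReduce, the map and reduce phases must be spelled out by hand. A minimal sketch in Scala (the input path is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountRDD {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("wordcount").setMaster("local[*]"))
        // Map phase: split each line into (word, 1) pairs
        val pairs = sc.textFile("input.txt") // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
        // Reduce phase: sum the counts per word
        val counts = pairs.reduceByKey(_ + _)
        counts.collect().foreach(println)
        sc.stop()
      }
    }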

The need to solve increasingly complicated business problems without worrying about implementation details drove the creation and rapid development of projects such as Apache Hive and Spark SQL. While the trend of moving to a higher level of abstraction is not new, this year more and more companies will be investing in Big Data and demanding this higher-level approach.
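
By contrast, the same kind of aggregation in Spark SQL is a declarative query, and the engine decides how to execute it. A sketch against the 1.x API, current at the time of writing; the file, table and column names are illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)          // sc: an existing SparkContext
    val df = sqlContext.read.json("events.json") // hypothetical input
    df.registerTempTable("events")

    // The whole "job" is one query; the optimiser plans the execution
    val topUsers = sqlContext.sql(
      """SELECT user, COUNT(*) AS cnt
         FROM events
         GROUP BY user
         ORDER BY cnt DESC
         LIMIT 10""")
    topUsers.show()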

While the familiar SQL has found new uses in Big Data, what about relational databases themselves? A change from relational technology (SQL) to a non-relational one (NoSQL) is still a traumatic experience for many companies. These are not merely primal fears: with the loss of relations comes an upheaval of the whole model of database use. Relational databases, however, have also changed.

The first change is more effective use of growing disk capacity: these days a server with 1 TB of disk, even SSD, is nothing unusual. Given that many companies do not have petabyte-scale data sets, using a classical relational database in a master-slave configuration is by all means acceptable. The main problem with this solution is the lack of scalability and single-threaded query processing, which means that both full scans and high-speed writes may become a problem. Database developers have addressed these cases too, in projects such as Postgres-XL or MySQL Cluster. If the trend holds, both projects will certainly be actively developed this year and we will see a growing number of deployments.

Stream processing: here, now!

The need to process streaming data is increasing in many companies due to universal access to digital document workflows and to web and mobile services. Highly effective monitoring systems therefore require certain capabilities from the data-processing stack, such as the ability to analyse time series, to operate in real time, to pre-aggregate data so that only the relevant data is saved, and to process data arriving at high speed. Apart from technologies well known in Big Data circles, such as Apache Storm and Spark Streaming, tools for building highly reactive applications, such as Akka or Vert.x, are taking on particular importance. Their growing popularity comes from the possibility of focusing on the business problem without having to build the whole streaming platform first.
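
For illustration, a minimal Spark Streaming sketch that pre-aggregates incoming text into per-batch word counts, so that only aggregates are passed downstream; the socket source and port are assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("streaming-counts").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
        // Hypothetical source: text lines arriving on a local socket
        val lines = ssc.socketTextStream("localhost", 9999)
        // Pre-aggregation: only per-batch word counts travel further, not raw events
        val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }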

The rise of Data Science: giving shape to disordered data.

The rise of the importance of Data Science has two aspects. Firstly, a Data Scientist tells the story of the data rather than merely presenting dry facts. This trend results from business needs: more complicated analytics requires a broader picture, analysis and appropriate presentation, most frequently in graphical form, not just the generation of a few indicators. Moreover, analysis results go not only to strategy or sales departments but are also used in the design of systems, software and user interfaces. The other aspect is that more and more companies are enterprises built on data and, so as not to be left behind, they need Data Science experts to analyse it.

Machine learning

This year machine learning algorithms will become increasingly important, allowing IT systems to forecast events on their own, to generate multi-dimensional analyses, and to turn ordinary data into information, or even knowledge. Moreover, many observers call 2016 the year of Deep Learning, that is, of deep neural networks. Deep networks allow modelling at a higher level of abstraction, closer to the way humans describe and analyse data. This can be observed on the leaderboards of competitions such as Kaggle, where models based on deep neural networks win more and more often.
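
As a flavour of how approachable this has become, a few lines of Spark MLlib's pipeline API (Spark 1.x) are enough to fit a classifier; the four-row inline data set is purely illustrative:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.mllib.linalg.Vectors

    // Toy labelled data: (label, feature vector); real data would come from a table
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fitting is a single estimator call; the model can then score new data
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    println(s"Coefficients: ${model.coefficients}")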

Functional programming: tradition and innovation.

Among all the tools mentioned above, which are not new but are on a rising trend, functional programming is the oldest: it has been used in IT since the 1950s. However, with the rise of analytical tools such as Apache Spark and reactive platforms such as Akka, elements of functional programming have become part of traditional languages such as Java. Two factors have driven this change. The first is that a well-written functional algorithm can essentially be scaled without change to many threads, or even to processes on many machines, which is needed more and more as data volumes grow. The second is the possibility of processing data as so-called lazy streams, which is very useful when analysing large amounts of data of unknown size, since elements are processed one at a time, allowing memory consumption to be controlled easily. Nevertheless, purely functional languages will likely not see wide use in the industry, in contrast to imperative-functional languages such as Java 8 or Scala.
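
Both properties are easy to demonstrate in plain Scala; in this sketch the log file name is invented, everything else is the standard library:

    // Lazy stream: lines are pulled one at a time, so memory use stays flat
    // no matter how large the file is; reading stops after ten matches
    val source = scala.io.Source.fromFile("events.log") // hypothetical file
    val longLines = source.getLines() // a lazy Iterator[String]
      .map(_.trim)
      .filter(_.length > 80)
      .take(10)
      .toList
    source.close()

    // The same pure transformation scales to many threads unchanged:
    // switching to the parallel collection is the only edit
    val sumSquares    = (1 to 1000000).map(x => x.toLong * x).sum
    val sumSquaresPar = (1 to 1000000).par.map(x => x.toLong * x).sum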

However appealing a vision functional programming may be for ambitious developers, one should ask how costly it can be for an organisation. One often comes across the opinion among developers that Scala is quite difficult to learn compared with, for instance, Java. Doubts therefore arise, since nobody knows what will happen to systems written in Scala once, in a few years' time, they come to be seen as legacy. The cost of maintenance and development (as well as of finding specialists) may turn out to be too much of a burden. We can, however, hypothesise that interest in functional programming languages will keep growing.
