Blog Posts

StackOverflow tags with Dask
Posted By Jakub Nowacki, 05 January 2018

In the previous Dask post we looked into basic data extraction using Dask. In this post we'll follow up on the problem and show how to perform more complex tasks with Dask, in a similar way as we would in Pandas, but on a larger data set.

Dask - a new elephant in the room
Posted By Jakub Nowacki, 20 December 2017

While Big Data has been with us for a while, long enough to become almost a cliché, its world was long dominated by Java and related tools and languages. This became an entry barrier for many people not familiar with these technologies, which were mostly engineering-centric. The first glimpse of hope came with Hive's SQL and Pig's Latin. Later, Spark came along with Python and R support, an SQL interface, and DataFrames. Now we see a rise of many new and useful Big Data processing technologies, often SQL-based, which make working with big data sets much easier.

Scala (and Java) Spark UDFs in Python
Posted By Jakub Nowacki, 30 October 2017

Many systems based on SQL, including Apache Spark, support User-Defined Functions (UDFs). While it is possible to create UDFs directly in Python, doing so brings a substantial penalty to the efficiency of computations. This is because Spark's internals are written in Java and Scala and thus run in the JVM; see the figure from PySpark's Confluence page for details.

Using language-detector aka large not serializable objects in Spark
Posted By Jakub Nowacki, 25 September 2017

Often we find ourselves in a situation where we need to incorporate an external library to do some extra work with our data. This is especially true when we work with natural language or need to apply machine learning. These libraries can create large objects and may not be suitable for use in Big Data clusters out of the box. The majority of distributed computing libraries, Apache Spark being no exception, require the objects they use to be serializable. On the other hand, not all libraries and their classes implement the Serializable interface. Below I discuss how to deal with that in Scala and use these objects efficiently in Spark.

Text analysis in Pandas with some TF-IDF (again)
Posted By Jakub Nowacki, 18 September 2017

Pandas is a great tool for the analysis of tabular data via its DataFrame interface. Slightly less known are its capabilities for working with text data. In this post I'll present them with some simple examples. As a comparison, I'll use my previous post about TF-IDF in Spark.
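As a small taste of those string methods, here is a minimal sketch (the toy corpus and the `text` column name are made up for illustration; the post itself goes further):

```python
import pandas as pd

# A toy corpus; "text" is just an illustrative column name.
df = pd.DataFrame({"text": ["the cat sat", "the dog ran", "the cat ran"]})

# The .str accessor applies string operations element-wise;
# explode() turns each list of tokens into one word per row.
words = df["text"].str.split().explode()

# Term frequencies across the whole corpus.
tf = words.value_counts()
```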

Why SQL? Notes from (big) data analysis practice
Posted By Jakub Nowacki, 07 September 2017

Structured Query Language (SQL) has been around since the 1980s and is still going strong. For many years it has been the lingua franca of many areas of data processing, such as databases, data warehouses, business intelligence etc. Since it's a natural language for data handling, it's now making its way into more modern systems processing big data sets.

Word count in Spark SQL with a pinch of TF-IDF (continued)
Posted By Jakub Nowacki, 31 August 2017

In this post we continue with the example introduced last week to calculate TF-IDF measures and find the most characteristic words for each of the analysed books.
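For reference, the common tf·idf weighting can be sketched in plain Python (the toy "books" below are made up, and the posts may use a slightly different variant of the formula):

```python
import math

# Toy document collection; tf-idf here is the common tf * log(N / df) variant.
docs = {
    "book1": ["whale", "sea", "ship"],
    "book2": ["love", "sea", "letter"],
}
N = len(docs)

# Document frequency: in how many documents each word appears.
df = {}
for words in docs.values():
    for w in set(words):
        df[w] = df.get(w, 0) + 1

# TF-IDF per document; words unique to one book score highest,
# words appearing everywhere score zero.
tfidf = {
    name: {w: words.count(w) * math.log(N / df[w]) for w in set(words)}
    for name, words in docs.items()
}
```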

Word count in Spark SQL with a pinch of TF-IDF
Posted By Jakub Nowacki, 25 August 2017

So you've probably already done the hello world of distributed computing, which is word count. On the other hand, a lot of tutorials about Spark SQL (and SQL in general) deal mostly with structured data in tabular format. But not all data has structure right away; we sometimes need to create it. This is especially true with all forms of text documents.
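For comparison, that hello world in plain, single-machine Python (toy input; a distributed version would partition the text across workers):

```python
from collections import Counter

text = "to be or not to be"

# Word count: split into tokens, then tally occurrences.
counts = Counter(text.split())
```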

Docker for Data Science
Posted By Jakub Nowacki, 18 August 2017

Docker is a very useful tool for packaging software builds and distributing them. In fact, it's becoming the standard for application packaging, especially for web services. There are a lot of Docker images available on Docker Hub. In general, Docker is useful for development, testing and production, but in this tutorial we'll show how to use Docker for Data Science and Apache Spark. For this, we'll get a prebuilt version of Spark with other tools bundled in the package.

How to install PySpark locally
Posted By Jakub Nowacki, 11 August 2017

For both our training and our analysis and development work at SigDelta, we often use Apache Spark's Python API, aka PySpark. Despite the fact that Python has been present in Apache Spark almost from the beginning of the project (version 0.7.0 to be exact), the installation was not exactly the pip-install type of setup the Python community is used to.

Power BI - Self-service Business Intelligence tool
Posted By Dawid Detko, 21 April 2017

Power BI belongs to the class of self-service BI solutions, which means that it is aimed at end-users and gives them the opportunity to build their own analyses.

Drools far beyond Hello World
Posted By Jakub Koperwas, 04 July 2016

This is the initial post of an article series about Drools as a business rules engine. Instead of focusing on the complete syntax, API, features etc., in this series I would like to raise some other topics that are not mentioned often, yet are crucial to the subject matter.

Big Data in the Maturity Stage
Posted By Jakub Nowacki, 15 April 2016

For business there is no valueless data, only data which has not been analysed appropriately and therefore is not presented as information. The demand for analysts and effective tools for analysing vast amounts of data is increasing every year. The year 2016 may bring solutions facilitating the management of information as well as its proper classification and analysis.
