StackOverflow tags with Dask
05 January 2018
In the previous Dask post we looked into basic
data extraction using Dask. In this post we follow
up on that problem and show how to perform more complex tasks with Dask in a way
similar to Pandas, but on a larger data set.
Dask - a new elephant in the room
20 December 2017
While Big Data has been with us for a while, long enough to become almost a cliché,
its world was long dominated by Java and related tools and languages. This
became an entry barrier for many people not familiar with these technologies,
which were mostly engineering-centric. The first glimpse of hope came with
Hive’s SQL and Pig’s Pig Latin. Later, Spark arrived
with Python and R support, an SQL interface, and DataFrames along
the way. Now we see a rise of many new and useful Big Data processing
technologies, often SQL-based, which have made working with big data sets much
easier.
Scala (and Java) Spark UDFs in Python
30 October 2017
Many systems based on SQL, including Apache Spark, support User-Defined Functions
(UDFs). While it is possible to create UDFs directly in
Python, doing so imposes a substantial efficiency cost on computations. This is
because Spark’s internals are written in Java and Scala and thus run in the JVM; see
the figure from PySpark’s Confluence page.
Using language-detector aka large not serializable objects in Spark
25 September 2017
Often we find ourselves in a situation where we need to incorporate an external library to do some extra work with our data. This is especially true when we work with natural language or need to apply machine learning. Such libraries can create large objects and may not be suitable for use in Big Data clusters out of the box. Most distributed computing libraries, Apache Spark being no exception, require the objects they use to be serializable. On the other hand, not all libraries and their classes implement the Serializable interface. Below I discuss how to deal with that in Scala and use these objects efficiently in Spark.
Text analysis in Pandas with some TF-IDF (again)
18 September 2017
Pandas is a great tool for the analysis of tabular
data via its DataFrame interface. Slightly less known are its capabilities for
working with text data. In this post I’ll present them through some simple examples.
As a comparison, I’ll use my previous post about TF-IDF in Spark.
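The vectorised string methods the post covers can be previewed with a minimal sketch (the toy corpus and derived column names below are made up for illustration):

```python
import pandas as pd

# A tiny corpus to demonstrate Pandas' vectorised string methods.
df = pd.DataFrame({
    "text": ["Dask scales Pandas", "Spark SQL with TF-IDF", "pandas str methods"],
})

# Lower-case, tokenise, and test for substrings -- all via the .str accessor.
df["lower"] = df["text"].str.lower()
df["n_words"] = df["text"].str.split().str.len()
df["mentions_pandas"] = df["lower"].str.contains("pandas")
```

Each `.str` call applies the operation element-wise over the column, much like Python’s own string methods but without an explicit loop.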
Why SQL? Notes from (big) data analysis practice
07 September 2017
Structured Query Language (SQL) has been around
since the 1980s and it is still going strong. For many years it has been a lingua
franca of many areas of data processing, such as databases, data warehouses,
business intelligence, etc. Since it’s such a natural language for data handling, it’s
now making its way into more modern systems processing big data sets.
Word count in Spark SQL with a pinch of TF-IDF (continued)
31 August 2017
In this post we continue with the example introduced last week to calculate TF-IDF measures and find the most characteristic words for each of the analysed books.
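The TF-IDF measure the post computes can be sketched in plain Python (a simplified illustration, not the Spark SQL code from the article; the toy documents and the unsmoothed formula idf = log(N / document frequency) are assumptions here):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenised documents.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "to be or not to be".split(),
    "the whale the sea".split(),
]
scores = tf_idf(docs)
```

Terms that are frequent within a document but rare across the corpus get the highest scores, which is exactly what makes them “characteristic” of that document.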
Word count in Spark SQL with a pinch of TF-IDF
25 August 2017
So you’ve probably already done the hello world of distributed computing, which
is word count. On the other hand, a lot of tutorials about Spark SQL (and SQL in
general) deal mostly with structured data in tabular format. But not all data
has structure right away; sometimes we need to create it. This is especially
true with all forms of text documents.
Docker for Data Science
18 August 2017
Docker is a very useful tool for packaging software builds and distributing them onwards. In fact, it’s becoming the standard for application packaging, especially for web services. There are a lot of Docker images available on Docker Hub. In general, Docker is very useful for development, testing, and production, but in this tutorial we’ll show how to use Docker for Data Science and Apache Spark. For this, we’ll get a prebuilt version of Spark with other tools bundled in the package.
How to install PySpark locally
11 August 2017
For both our training as well as analysis and development at SigDelta, we often use Apache Spark’s Python API, aka PySpark. Despite the fact that Python has been present in Apache Spark almost from the beginning of the project (since version 0.7.0, to be exact), the installation was not exactly the pip-install type of setup the Python community is used to.
Power BI - Self-service Business Intelligence tool
21 April 2017
Power BI belongs to the class of self-service BI solutions, which means that it is directed at end-users and gives them the opportunity to build their own analyses.
Drools far beyond Hello World
04 July 2016
This is the initial post of an article series about Drools as a business rules engine. Instead of focusing on the complete syntax, API, features, etc., in this series I would like to raise some topics that are not often mentioned, yet are crucial to the subject matter.
Big Data in the Maturity Stage
15 April 2016
For business there is no valueless data, only data that has not been analysed appropriately and therefore has not been turned into information. The demand for analysts and for effective tools to analyse vast amounts of data is increasing every year. The year 2016 may bring solutions that facilitate the management of information, as well as its proper classification and analysis.