Text analysis in Pandas with some TF-IDF (again)
18 September 2017
Pandas is a great tool for analysing tabular
data via its DataFrame interface. Slightly less known are its capabilities for
working with text data. In this post I'll illustrate them with a few simple examples.
For comparison, I'll refer to my previous post about TF-IDF in Spark.
Why SQL? Notes from (big) data analysis practice
07 September 2017
Structured Query Language (SQL) has been around
since the 1980s and is still going strong. For many years it has been the lingua
franca of many areas of data processing, such as databases, data warehouses and
business intelligence. Since it's a natural language for data handling, it's
now making its way into more modern systems that process big data sets.
Word count in Spark SQL with a pinch of TF-IDF (continued)
31 August 2017
In this post we continue with the example introduced last week to calculate TF-IDF measures and find the most characteristic words for each of the analysed books.
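The TF-IDF measure the post computes can be sketched in plain Python (a toy illustration, not the Spark SQL code from the post; the two-document corpus and the plain-log IDF weighting here are my assumptions):

```python
import math
from collections import Counter

# Toy corpus: a hypothetical stand-in for the analysed books.
docs = {
    "book_a": "spark makes big data simple data data".split(),
    "book_b": "python makes spark pleasant".split(),
}

# Term frequency: how often each word occurs within a document.
tf = {doc: Counter(words) for doc, words in docs.items()}

# Document frequency: in how many documents each word appears.
df = Counter(word for words in docs.values() for word in set(words))

n_docs = len(docs)

def tfidf(doc, word):
    """TF-IDF with a plain log inverse-document-frequency weighting."""
    return tf[doc][word] * math.log(n_docs / df[word])
```

Words appearing in every document (like "spark" here) score zero, while words frequent in only one book score highest, which is how the most characteristic words per book get picked out.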
Word count in Spark SQL with a pinch of TF-IDF
25 August 2017
So you’ve probably already done the hello-world of distributed computing, which
is word count. On the other hand, a lot of tutorials about Spark SQL (and SQL in
general) deal mostly with structured data in tabular format. But not all data
has structure right away; sometimes we need to create it. This is especially
true for all forms of text documents.
Docker for Data Science
18 August 2017
Docker is a very useful tool for packaging software builds and distributing them. In fact, it’s becoming the standard for application packaging, especially for web services, and there are a lot of Docker images available on Docker Hub. Docker is useful for development, testing and production alike, but in this tutorial we’ll show how to use it for data science with Apache Spark, using a prebuilt image that bundles Spark with other tools.
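Pulling and running a prebuilt image is a one-liner; a minimal sketch, assuming the community-maintained `jupyter/pyspark-notebook` image from Docker Hub (one of several images bundling Spark with Python tooling — the post may use a different one):

```shell
# Fetch the image and start a container exposing the Jupyter port.
# A notebook server with PySpark preinstalled becomes reachable on localhost:8888.
docker run -p 8888:8888 jupyter/pyspark-notebook
```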
How to install PySpark locally
11 August 2017
For both our training sessions and our analysis and development work at SigDelta, we often use Apache Spark’s Python API, aka PySpark. Despite the fact that Python has been present in Apache Spark almost from the beginning of the project (since version 0.7.0, to be exact), the installation was not exactly the pip-install kind of setup the Python community is used to.
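This has changed with Spark 2.2, which publishes PySpark on PyPI, so a local setup can be sketched as below (a Java runtime is still assumed to be installed; the smoke-test snippet is my own illustration, not from the post):

```shell
# Install PySpark from PyPI (published there since Spark 2.2).
pip install pyspark

# Quick smoke test: start a local SparkSession and count a tiny dataset.
python -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()
print(spark.range(10).count())
spark.stop()
"
```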
Power BI - Self-service Business Intelligence tool
21 April 2017
Power BI belongs to the class of self-service BI solutions, which means that it is aimed at end users and gives them the opportunity to build their own analyses.
Drools far beyond Hello World
04 July 2016
This is the initial post in an article series about Drools as a business rules engine. Instead of focusing on the complete syntax, API, features and so on, in this series I would like to raise some topics that are rarely mentioned, yet crucial to the subject matter.
Big Data in the Maturity Stage
15 April 2016
For business there is no valueless data, only data that has not been analysed appropriately and therefore has not been turned into information. The demand for analysts, and for effective tools for analysing vast amounts of data, is increasing every year. The year 2016 may bring solutions that facilitate the management of information as well as its proper classification and analysis.