
MEET US IN BERLIN. INTRODUCTION TO BIG DATA AND APACHE SPARK: FEBRUARY 2, 2018

Services

We Crunch Data

As a team of Data Scientists and Big Data specialists, we are ready to meet your needs in data analysis, machine learning and big data.

Data Analysis and Visualization

It's nice to collect lots of data, but data is worthless without context and interpretation. We can help you understand your data and convert it to actionable knowledge.

Modeling and Prediction

We specialize in data mining and machine learning solutions, in domains such as Natural Language Processing, Image Processing and Security.

Big Data Solutions

We have extensive experience processing Big Data and architecting solutions based on technologies such as Apache Hadoop and Apache Spark.

Training and Coaching

We can help you or your team gain the knowledge necessary to undertake projects involving data-related problems.

Research

We Work with Academia

We believe in applicable research, conducted through partnerships between academia and business.

Together with our team members, we initiate and engage in research projects in domains such as Natural Language Processing, Image Processing and Systems Security.

Solve your research problem

With an extensive network of experts at top research institutions, we can help you solve even the most difficult problems in your project.

Turn your research idea into a product

We can help you assess the business value of your research idea or project and bring it to the commercial market.

Training

We Teach

Is your team in need of data scientists or big data specialists? Do you want to jumpstart your individual career?

Pick one of our training courses, or contact us for customized training tailored to your team's needs.

Workshops

Meet us at one of these events, organized regularly across Europe.

Blog

Recent Blog Posts

  • Using language-detector aka large not serializable objects in Spark
    25 September 2017

    Often we find ourselves in a situation where we need to incorporate an external library to do some extra work with our data. This is especially true when we work with natural language or need to apply machine learning. These libraries can create large objects and may not be suitable for use in Big Data clusters out of the box. Most distributed computing frameworks, Apache Spark being no exception, require the objects they use to be serializable. On the other hand, not all libraries and their classes implement the Serializable interface. Below I discuss how to deal with that in Scala and use these objects efficiently in Spark.
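    The post works in Scala; the same core idea, building the heavy object lazily once per worker instead of capturing it in the task closure, can be sketched in Python terms. All names below (`build_heavy_model`, the detector API) are hypothetical stand-ins, not the real language-detector library:

```python
# Lazily build one heavy, non-serializable object per worker process
# instead of shipping it with the closure. Illustrative sketch only.
_detector = None  # module-level cache: one instance per Python worker


class _FakeDetector:
    """Stand-in for an expensive, non-serializable model."""

    def detect(self, text):
        return "en" if "the" in text.split() else "unknown"


def build_heavy_model():
    # In real code this would load a large model from disk.
    return _FakeDetector()


def get_detector():
    global _detector
    if _detector is None:          # constructed at most once per process
        _detector = build_heavy_model()
    return _detector


def detect_languages(texts):
    # Suitable for rdd.mapPartitions(detect_languages): the detector is
    # created inside the worker, so Spark never tries to pickle it.
    det = get_detector()
    for t in texts:
        yield det.detect(t)
```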

  • Text analysis in Pandas with some TF-IDF (again)
    18 September 2017

    Pandas is a great tool for the analysis of tabular data via its DataFrame interface. Slightly less known are its capabilities for working with text data. In this post I'll present them with some simple examples. As a comparison, I'll use my previous post about TF-IDF in Spark.
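    A minimal sketch of the idea (the two toy documents are made up for illustration): the vectorized `.str` accessor handles the text wrangling, and TF-IDF falls out of a couple of groupbys.

```python
import numpy as np
import pandas as pd

docs = pd.Series([
    "spark makes big data simple",
    "pandas makes data analysis simple",
])

# Tokenize with the vectorized .str accessor; one row per (doc, term).
terms = docs.str.lower().str.split().explode().rename("term")

# Term frequency per document (normalized counts).
tf = terms.groupby(level=0).value_counts(normalize=True)

# Document frequency: in how many documents does each term appear?
docfreq = terms.reset_index().drop_duplicates().groupby("term").size()
idf = np.log(len(docs) / docfreq)

# Broadcast idf across the "term" level of tf's MultiIndex.
tfidf = tf.mul(idf, level="term")
```

    Terms that occur in every document (here "makes", "data", "simple") get an IDF of zero, so only the distinguishing words keep a positive score.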

  • Why SQL? Notes from (big) data analysis practice
    07 September 2017

    Structured Query Language (SQL) has been around since the 1980s and is still going strong. For many years it has been the lingua franca of many areas of data processing, such as databases, data warehouses and business intelligence. Since it is a natural language for data handling, it is now making its way into more modern systems that process big data sets.

  • Word count in Spark SQL with a pinch of TF-IDF (continued)
    31 August 2017

    In this post we continue with the example introduced last week, calculating TF-IDF measures to find the most characteristic words of each of the analysed books.

  • Word count in Spark SQL with a pinch of TF-IDF
    25 August 2017

    So you've probably already done the hello-world of distributed computing, which is word count. On the other hand, a lot of tutorials about Spark SQL (and SQL in general) deal mostly with structured data in tabular format. But not all data has structure right away; sometimes we need to create it. This is especially true of all forms of text documents.
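    As a plain-Python reference point for what the post then builds in Spark SQL, here is that hello-world: structure is imposed on raw text by tokenizing it, and the counting follows (the sample sentence is just an illustration):

```python
import re
from collections import Counter

text = """It is a truth universally acknowledged, that a single man
in possession of a good fortune, must be in want of a wife."""

# Impose structure on the raw text: lowercase it, then split into tokens.
words = re.findall(r"[a-z']+", text.lower())

# Count occurrences of each token.
counts = Counter(words)

print(counts.most_common(3))
```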

  • Docker for Data Science
    18 August 2017

    Docker is a very useful tool for packaging software builds and distributing them onwards. In fact, it is becoming the standard for application packaging, especially for web services. There are a lot of Docker images available on Docker Hub. In general, Docker is very useful for development, testing and production, but in this tutorial we'll show how to use Docker for Data Science and Apache Spark. For this, we'll get a prebuilt version of Spark with other tools bundled in the package.
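    A sketch of that setup, assuming the community `jupyter/pyspark-notebook` image (a widely used prebuilt bundle of Spark and Jupyter; the exact image used in the post may differ):

```shell
# Pull a prebuilt image that bundles Apache Spark with Jupyter
docker pull jupyter/pyspark-notebook

# Run it, exposing the Jupyter port on localhost
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
```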

  • How to install PySpark locally
    11 August 2017

    For both our training and our analysis and development work at SigDelta, we often use Apache Spark's Python API, aka PySpark. Despite the fact that Python has been present in Apache Spark almost from the beginning of the project (version 0.7.0, to be exact), the installation was not exactly the pip-install type of setup the Python community is used to.
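    That has since changed: around Spark 2.2 (mid-2017), PySpark started being published to PyPI, so a local install can be as simple as the following sketch (exact versions will vary by environment):

```shell
# Install PySpark from PyPI
pip install pyspark

# Verify the installation by printing the bundled Spark version
python -c "import pyspark; print(pyspark.__version__)"
```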

  • Power BI - Self-service Business Intelligence tool
    21 April 2017

    Power BI belongs to the class of self-service BI solutions, which means that it is aimed at end users and gives them the opportunity to build their own analyses.

  • Drools far beyond Hello World
    04 July 2016

    This is the initial post of an article series about Drools as a business rules engine. Instead of focusing on the complete syntax, API, features etc., in this series I would like to raise some topics that are rarely mentioned, yet crucial to the subject matter.

  • Big Data in the Maturity Stage
    15 April 2016

    For business there is no valueless data, only data that has not been analysed appropriately and therefore is not presented as information. The demand for analysts, and for effective tools for analysing huge amounts of data, is increasing every year. 2016 may bring solutions that facilitate the management of information as well as its proper classification and analysis.

Our clients

BLStream
ING
CapGemini

Our Partners

Sages
Stacja IT

Contact

How can we help?

Drop us a line and we'll respond as soon as possible.
