In May 2017, Gerard Toonstra and Matthias Schuurmans attended the Strata Data conference (https://conferences.oreilly.com/strata/strata-eu) which was held in London. This conference is all about data, with four main topics:
- Data security and privacy (hello GDPR!)
- Management of (big-)data teams and organisations
- Data science and deep learning
- Data engineering and platforming
The conference spanned four days in total, of which the first two were devoted to tutorials and workshops and the last two featured keynotes, presentations and networking events. We attended the last two days only, which already gave us a world of inspiration and know-how on the possibilities and challenges concerning data. People from Google, O’Reilly, Microsoft, Kaggle, Uber and the other big names revealed their views on the future of data, and it promises to be quite the future for us data engineers and data scientists indeed. In particular I recommend anyone take a look at the first four minutes of Tim O’Reilly’s keynote speech here.
Some of the more interesting presentations were about the following:
- GDPR and changes in EU legislation on the storage of personal data
- Why big-data project teams fail and how those projects can be better prepared
- The challenges of running data science projects in production
- Tips and tricks by Anthony Goldbloom, founder and CEO of Kaggle
Besides all the presentations, you meet people and vendors who work in the field. This is where the conference really shines, because you can discuss your personal challenges and get into some details over a beer or two. Some of the more interesting offerings from these vendors:
- Attunity: a data replication tool that replicates on-premises data to the cloud, capturing all DDL changes in change tables
- DataRobot: a data science modelling tool that automatically runs a bunch of models on custom data and reports which model performs best
- Microsoft: a classifier that estimates your gender and age from a real-time webcam feed
- Kinetica: a visualisation tool backed by a GPU-accelerated database, showing the locations of people using Twitter in real time

The main lessons we took home:
- GDPR is going to pose some tough challenges for data engineers: it will restrict the propagation of personal data and thus incur engineering cost at a time when we’re already busy working on many other cool things.
- Big-data projects have the same failure rate that large “enterprise” projects used to have. The most common reasons for failure are a lack of knowledge and training on cloud environments, a lack of data veterans on the team, not having multi-disciplinary teams, too much ambition, or simply not having a good use case for the cloud in the first place. The worst decision you can make is to push the data first and only then decide what to do with it. Get your use case figured out!
- Data science in production is more than just playing with models: you need a strong foundation for scheduling, preparing data, creating features, delivering the results to the correct destination, managing credentials and maintaining database schemas, all in an automated and stable way.
- The value of data science comes after the MVP (the first model in production), because from then on your foundation works and you can focus on tweaking and experimenting. This is different from normal software development. Your model also needs maintenance, otherwise it loses value over time as new data becomes available.
- Building trust in your data science models is very important, but also very difficult: many people come away from slick presentations thinking it is easy, it is hard to quantify how well a model actually performs, and some models are black boxes that offer little insight into how they produce their output.
- Deep learning is here to stay for some problems, especially those involving unstructured data. It’s not feasible to use in forecasting for a number of reasons, but it does have many other interesting use cases.
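The point about production foundations can be made concrete with a deliberately minimal sketch. The stage names, the toy model and the environment-variable credentials below are all illustrative assumptions, not Coolblue’s actual stack; the idea is simply that extraction, feature creation, scoring, credential handling and delivery are explicit, separate, automatable steps:

```python
import os
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical stages of a minimal production scoring pipeline.
# Everything here is a stand-in: real pipelines would query a database,
# pull secrets from a vault and write results to a destination table.

@dataclass
class PipelineResult:
    run_at: str
    n_rows: int
    predictions: list = field(default_factory=list)

def load_credentials() -> dict:
    # Credentials are managed centrally, never hard-coded in the model code.
    return {"db_user": os.getenv("DB_USER", "readonly"),
            "db_pass": os.getenv("DB_PASS", "")}

def extract_rows() -> list:
    # Stand-in for a scheduled database extract.
    return [{"orders_last_30d": 3, "returns_last_30d": 0},
            {"orders_last_30d": 1, "returns_last_30d": 1}]

def build_features(rows: list) -> list:
    # Feature engineering kept explicit so it can be versioned and tested.
    return [[r["orders_last_30d"], r["returns_last_30d"],
             r["orders_last_30d"] - r["returns_last_30d"]] for r in rows]

def score(features: list) -> list:
    # Toy linear model standing in for the real trained model.
    weights = [0.5, -0.8, 0.2]
    return [sum(w * x for w, x in zip(weights, f)) for f in features]

def run_pipeline() -> PipelineResult:
    _creds = load_credentials()   # fetched per run, not baked in
    rows = extract_rows()
    features = build_features(rows)
    preds = score(features)
    # A real pipeline would now deliver preds to their destination.
    return PipelineResult(run_at=datetime.now(timezone.utc).isoformat(),
                          n_rows=len(rows), predictions=preds)

result = run_pipeline()
```

A scheduler (cron, Airflow, or similar) would then call `run_pipeline` on a fixed cadence; the point is that every stage is a named, testable function rather than one notebook cell.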
As you go to all the presentations and talk with people, you recognise new possibilities for Coolblue as well. Some of the ideas we got are:
- Improving our product info, pictures and videos using NLP and deep learning. Customers frequently call us with questions, so we could use our call centre data to label our product master data and then train models that predict how many customer calls we would get based on the product information displayed on the website.
- Predict NPS feedback from voice interactions. Using audio and deep learning, we could in theory predict in real time what score a customer would give at the end of a conversation, based on speech inflections and vocal cues, and provide this as continuous feedback to a customer service agent.
- Classify visitors into personas based on visiting behavior
- Utilize a semantic prediction model for ad keyword performance to optimize the money you spend on Adwords campaigns
- Classify social media interactions on urgency of response
- (Near-) real-time forecasts by looking at the ongoing number of orders on the website so we can optimize the resourcing in the warehouses as the day progresses.
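The last idea can be sketched as a naive baseline: scale the orders observed so far today by the historical cumulative share of daily orders usually placed by that hour. The function name and the uniform hourly profile below are our own illustrative assumptions, not real Coolblue data or the model we would actually deploy:

```python
# Hedged sketch: a naive intraday order forecast by profile scaling.

def forecast_total_orders(orders_so_far: int, hour: int,
                          cumulative_profile: list) -> float:
    """Estimate the end-of-day order total.

    cumulative_profile[h] is the historical fraction of a full day's
    orders that have typically been placed by the end of hour h (0-23).
    """
    frac = cumulative_profile[hour]
    if frac <= 0:
        raise ValueError("no historical signal for this hour yet")
    return orders_so_far / frac

# Illustrative profile: orders assumed to arrive uniformly over the day.
cumulative_profile = [(h + 1) / 24 for h in range(24)]

# By the end of hour 11 (noon), half the day's orders are usually in,
# so 400 orders so far suggests roughly 800 for the whole day.
estimate = forecast_total_orders(400, 11, cumulative_profile)
```

Early in the morning the denominator is tiny and the estimate is very noisy, which is exactly why a real version would blend this with a proper forecasting model; but even this baseline would let warehouse resourcing react as the day progresses.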