Monday, August 26, 2013

Bad Data Handbook Review

Bad Data Handbook from O'Reilly is a collection of essays and articles by different authors with data as their common theme, or “bad” data to be precise. The “badness” of the data in this case is more a perceived quality than an inherent one. Arguably, data can be surprising, unpredictable, defective or deficient, but it is rarely thoroughly bad.

The different chapters are generally well written and they can be read in any order. The book contains a wide range of interesting situations, from machine learning war stories, to data quality issues, to modelling and processing concerns. To be clear, this book is not a programming guide but it is full of practical advice and recommendations.

Some of the chapters I particularly enjoyed are:

Crouching Table, Hidden Network by Bobby Norton
In some situations where a relational database is used without much thought, a graph database is a much better fit. This chapter takes such a case, built around a plausible business requirement, and explores the rationale for and the impact of moving from a relational to a graph model. A clever analogy with fractals is used to illustrate the point and the chapter ends with database-agnostic example queries using Gremlin.

How to Feed and Care for Your Machine-Learning Experts by Pete Warden
A culinary title hides the intriguing story of how Jetpac devised a machine-learning solution to analyse and classify massive volumes of holiday photos to produce meaningful visual recommendations. The story starts with building an initial manual solution and goes all the way to organising a successful data science competition on Kaggle. There are a number of generally interesting points to be learnt from this story, the counterintuitive but successful use of meta-data being a striking example.

Detecting Liars and the Confused in Contradictory Online Reviews by Jacob Perkins
Effective sentiment analysis is essential for any business trying to produce relevant recommendations for its users. Weotta does so by automatically analysing reviews of local businesses. Behind a deceptively simple requirement, there are a number of data quality challenges to overcome, such as polarised language, double negatives and humorous tone. The chapter details the process of developing and training a classifier built on top of Python’s NLTK. Very interesting.

Needless to say, with such a varied book it is hard to please everyone, and some chapters were slightly less aligned with my interests. Overall though, the book is a fun and worthwhile read if you are interested in data science and engineering.