A comprehensive guide for data scientists to master effective data cleaning tools and techniques
Book Description:
In data science, data analysis, or machine learning, most of the effort needed to achieve your actual purpose lies in cleaning your data. Using Python, R, and command-line tools, you will learn the essential cleaning steps performed in every production data science or data analysis pipeline. This book teaches you not only how to prepare data but also what questions you should ask of it.
The book dives into the practical application of tools and techniques needed for data ingestion, anomaly detection, value imputation, and feature engineering. It also offers long-form exercises at the end of each chapter to practice the skills acquired.
You will begin with data ingestion across a range of data formats. Moving on, you will impute missing values, detect unreliable data and statistical anomalies, and generate the synthetic features needed for successful data analysis and visualization.
By the end of this book, you will have acquired a firm understanding of the data cleaning process necessary to perform real-world data science and machine learning tasks.
Who this book is for:
This book is designed to benefit software developers, data scientists, aspiring data scientists, and students who are interested in data analysis or scientific computing. Basic familiarity with statistics, general concepts in machine learning, knowledge of a programming language (Python or R), and some exposure to data science are helpful. The text will also be helpful to intermediate and advanced data scientists who want to improve their rigor in data hygiene and wish for a refresher on data preparation issues.
Move Beyond Python Code That Mostly Works to Code That Is Expressive, Robust, and Efficient
Python is arguably the most-used programming language in the world, with applications ranging from primary school education to workaday web development to the most advanced scientific research institutes. While there are many ways to perform a task in Python, some are wrong, inelegant, or inefficient. Better Python Code is a guide to Pythonic programming, a collection of best practices, ways of working, and nuances that are easy to miss, especially when ingrained habits are borrowed from other programming languages.
Author David Mertz presents concrete and concise examples of various misunderstandings, pitfalls, and bad habits in action. He explains why some practices are better than others, based on his 25+ years of experience as an acclaimed contributor to the Python community. Each chapter thoroughly covers related clusters of concepts, with chapters sequenced in ascending order of sophistication.
Whether you are starting out with Python or are an experienced developer pushing through the limitations of your Python code, this book is for all who aspire to write better, more Pythonic code.
"My high expectations for this engaging Python book have been exceeded: it offers a great deal of insight for intermediate or advanced programmers to improve their Python skills, includes copious sharing of precious experience practicing and teaching the language, yet remains concise, easy to read, and conversational."
--From the Foreword by Alex Martelli
Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.