The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, and even how much we pay for health insurance. This volume in the MIT Press Essential Knowledge series offers a concise introduction to the emerging field of data science, explaining its evolution, current uses, data infrastructure issues, and ethical challenges.
It has never been easier for organizations to gather, store, and process data. Use of data science is driven by the rise of big data and social media, the development of high-performance computing, and the emergence of such powerful methods for data analysis and modeling as deep learning. Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large datasets. It is closely related to the fields of data mining and machine learning, but broader in scope. This book offers a brief history of the field, introduces fundamental data concepts, and describes the stages in a data science project. It considers data infrastructure and the challenges posed by integrating data from multiple sources, introduces the basics of machine learning, and discusses how to link machine learning expertise with real-world problems. The book also reviews ethical and legal issues, developments in data regulation, and computational approaches to preserving privacy. Finally, it considers the future impact of data science and offers principles for success in data science projects.
Web developers--this is your all-in-one guide to HTML and CSS! Learn to use HTML to format text and structure web pages. Understand the HTML document skeleton before creating forms, referencing hyperlinks, embedding active content, and more. Then style your pages with CSS: Create consistent designs with selectors, the box model, the cascade algorithm, and inheritance. Round out your client-side development experience by getting to know JavaScript. With detailed code examples, you'll master HTML and CSS in no time!
a. HTML for Formatting and Structure Master HTML syntax and document structure. Work with tags, elements, and attributes to build HTML documents. Create tables and forms, embed images, configure links, and develop interactive HTML elements.
b. CSS for Style and Design Develop simple and complex web layouts with CSS rules, selectors, declarations, combinators, pseudoclasses, and pseudoelements. Understand the principles of cascading, specificity, and inheritance. Learn to use the CSS box model, the position property, and more.
c. JavaScript Fundamentals Expand your knowledge with an introduction to JavaScript. See how to use variables, statements, functions, arrays, and objects to write and run simple programs. Explore the basics of Ajax for interactive web application design.
Highlights:
These days, businesses are collecting massive amounts of data. But this data isn't valuable until someone analyzes it to gain insights that can be used to make decisions. That's why the US Bureau of Labor Statistics (BLS) predicts that the demand for data analysts will continue to grow for the rest of the decade.
Now, with Murach's Python for Data Science as your guide, you'll learn how to use Python libraries to get, clean, prepare, and analyze data at a professional level. To start, you'll learn how to use Pandas for data analysis and Seaborn for data visualization. Then, you'll learn how to use Scikit-learn to create regression models that you can use to make predictions.
To tie everything together, this book contains four realistic analyses that use real-world data. That's because studying analyses like these is critical to the learning process.
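As a rough illustration of the Pandas, Seaborn, and Scikit-learn workflow described above -- not an example from the book, and using a tiny synthetic dataset in place of the book's real-world data -- a simple regression analysis might look like this:

```python
# Minimal sketch of a get/clean/visualize/model workflow (synthetic data, not from the book)
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the book's real-world datasets
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60, 70, 80],
    "sales":    [25, 48, 70, 95, 118, 140, 165, 188],
})

# Clean/prepare: drop rows with missing values (none here, shown for completeness)
df = df.dropna()

# Visualize the relationship with Seaborn
sns.scatterplot(data=df, x="ad_spend", y="sales")

# Fit a simple regression model and score it on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    df[["ad_spend"]], df["sales"], test_size=0.25, random_state=1)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))
print("Prediction at ad_spend=90:", model.predict(pd.DataFrame({"ad_spend": [90]})))
```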
Data is bigger, arrives faster, and comes in a variety of formats--and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.
Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:
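For a sense of the structured DataFrame API the book centers on, here is a minimal PySpark sketch; it is not taken from the book, and the file name and column names are placeholder assumptions:

```python
# Minimal sketch of a structured (DataFrame) analysis in PySpark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-example").getOrCreate()

# Read a hypothetical CSV of events and express the analysis declaratively;
# Spark plans and distributes the actual execution.
events = spark.read.option("header", True).csv("events.csv")
daily_counts = (events
                .groupBy("event_date", "event_type")
                .agg(F.count("*").alias("n"))
                .orderBy("event_date"))
daily_counts.show()

spark.stop()
```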
Minimize AI hallucinations and build accurate, custom generative AI pipelines with RAG using embedded vector databases and integrated human feedback
Purchase of the print or Kindle book includes a free eBook in PDF format
Key Features:
- Implement RAG's traceable outputs, linking each response to its source document to build reliable multimodal conversational agents
- Deliver accurate generative AI models in pipelines integrating RAG, real-time human feedback improvements, and knowledge graphs
- Balance cost and performance between dynamic retrieval datasets and fine-tuning static data
Book Description:
RAG-Driven Generative AI provides a roadmap for building effective LLM, computer vision, and generative AI systems that balance performance and costs.
This book offers a detailed exploration of RAG and how to design, manage, and control multimodal AI pipelines. By connecting outputs to traceable source documents, RAG improves output accuracy and contextual relevance, offering a dynamic approach to managing large volumes of information. This AI book shows you how to build a RAG framework, providing practical knowledge on vector stores, chunking, indexing, and ranking. You'll discover techniques to optimize your project's performance and better understand your data, including using adaptive RAG and human feedback to refine retrieval accuracy, balancing RAG with fine-tuning, implementing dynamic RAG to enhance real-time decision-making, and visualizing complex data with knowledge graphs.
You'll be exposed to a hands-on blend of frameworks like LlamaIndex and Deep Lake, vector databases such as Pinecone and Chroma, and models from Hugging Face and OpenAI. By the end of this book, you will have acquired the skills to implement intelligent solutions, keeping you competitive in fields from production to customer service across any project.
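To make the chunk/index/retrieve/rank flow concrete, here is a small, self-contained sketch of the retrieve-then-generate pattern. It is only an illustration: TF-IDF stands in for the learned embeddings (OpenAI, Hugging Face), an in-memory matrix stands in for a vector store such as Pinecone, Chroma, or Deep Lake, and the documents and query are invented.

```python
# Minimal retrieval-augmented generation sketch (stand-ins, not the book's pipeline)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = {                       # hypothetical source documents
    "policy.md": "Refunds are issued within 14 days of purchase.",
    "faq.md": "Shipping normally takes 3 to 5 business days.",
}

# 1. Chunk: here each document is one chunk; real pipelines split documents further.
chunk_ids, chunks = zip(*documents.items())

# 2. Index: embed every chunk and keep the vectors alongside their source ids.
vectorizer = TfidfVectorizer().fit(chunks)
index = vectorizer.transform(chunks)

# 3. Retrieve and rank: embed the query and take the closest chunks.
query = "How long do refunds take?"
scores = (index @ vectorizer.transform([query]).T).toarray().ravel()
top = np.argsort(scores)[::-1][:1]

# 4. Augment: ground the prompt in the retrieved, traceable sources.
context = "\n".join(f"[{chunk_ids[i]}] {chunks[i]}" for i in top)
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
print(prompt)   # this prompt would then be sent to the generator model
```

Because each retrieved chunk carries its source id, the final answer can cite the exact document it came from, which is the traceability the book emphasizes.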
What You Will Learn:
- Scale RAG pipelines to handle large datasets efficiently
- Employ techniques that minimize hallucinations and ensure accurate responses
- Implement indexing techniques to improve AI accuracy with traceable and transparent outputs
- Customize and scale RAG-driven generative AI systems across domains
- Find out how to use Deep Lake and Pinecone for efficient and fast data retrieval
- Control and build robust generative AI systems grounded in real-world data
- Combine text and image data for richer, more informative AI responses
Who this book is for:
This book is ideal for data scientists, AI engineers, machine learning engineers, and MLOps engineers. If you are a solutions architect, software developer, product manager, or project manager looking to enhance the decision-making process of building RAG applications, then you'll find this book useful.
Table of Contents
- Why Retrieval Augmented Generation (RAG)?
- RAG Embeddings Vector Stores with Activeloop and OpenAI
- Index-based RAG with LlamaIndex and LangChain
- Multimodal Modular RAG with Pinecone
- Boosting RAG Performance with Expert Human Feedback
- All in One with Meta RAG
- Organizing RAG with LlamaIndex Knowledge Graphs
- Exploring the Scaling Limits of RAG
- Empowering AI Models: Fine-tuning RAG Data and Human Feedback
- Building the RAG Pipeline from Data Collection to Generative AI
Implement a basic Enterprise Architecture from start to finish using a four-stage, wheel-based approach. Aided by real-world examples, this book shows what elements are needed for the initial implementation of a fundamental Enterprise Architecture.
The book's pragmatic approach keeps existing architecture frameworks and methodologies in mind while providing instructions that are readable and applicable to all. The Enterprise Architecture Implementation Wheel builds on the methodology of existing architecture frameworks and allows you to apply the theory more pragmatically and closer to the reality that an architect encounters in daily practice.
While the main focus of the book is on the actual steps taken to design an Enterprise Architecture, other important topics include architecture origin, definition, domains, visualization, and roles. Getting Started with Enterprise Architecture is the ideal handbook for the architect who is asked to implement an Enterprise Architecture in an existing organization.
What You'll Learn
Who This Book Is For
Executing Data Quality Projects, Second Edition presents a structured yet flexible approach for creating, improving, sustaining and managing the quality of data and information within any organization.
Studies show that data quality problems are costing businesses billions of dollars each year, with poor data linked to waste and inefficiency, damaged credibility among customers and suppliers, and an organizational inability to make sound decisions. Help is here! This book describes a proven Ten Steps approach that combines a conceptual framework for understanding information quality with techniques, tools, and instructions for practically putting the approach to work - with the end result of high-quality trusted data and information, so critical to today's data-dependent organizations.
The Ten Steps approach applies to all types of data and all types of organizations - for-profit in any industry, non-profit, government, education, healthcare, science, research, and medicine. This book includes numerous templates, detailed examples, and practical advice for executing every step. At the same time, readers are advised on how to select relevant steps and apply them in different ways to best address the many situations they will face. The layout allows for quick reference with an easy-to-use format highlighting key concepts and definitions, important checkpoints, communication activities, best practices, and warnings. The experiences of actual clients and users of the Ten Steps provide real examples of outputs for the steps, plus highlighted sidebar case studies called Ten Steps in Action.
This book uses projects as the vehicle for data quality work and uses the word "project" broadly to include: 1) focused data quality improvement projects, such as improving data used in supply chain management; 2) data quality activities in other projects, such as building new applications, migrating data from legacy systems, integrating data because of mergers and acquisitions, or untangling data due to organizational breakups; and 3) ad hoc use of data quality steps, techniques, or activities in the course of daily work. The Ten Steps approach can also be used to enrich an organization's standard SDLC (whether sequential or Agile), and it complements general improvement methodologies such as Six Sigma or Lean. No two data quality projects are the same, but the flexible nature of the Ten Steps means the methodology can be applied to all.
The new Second Edition highlights topics such as artificial intelligence and machine learning, Internet of Things, security and privacy, analytics, legal and regulatory requirements, data science, big data, data lakes, and cloud computing, among others, to show their dependence on data and information and why data quality is more relevant and critical now than ever before.
Are you ready to begin an enlightening journey through the dynamic world of data businesses? From the historical evolution of data collection in ancient civilizations to the cutting-edge integration of AI in modern data analytics, Data-Minded offers an unparalleled exploration of how data has become the cornerstone of contemporary business strategies. Each chapter unfolds a new facet of the data industry, delving into the intricacies of data productization, the art of data marketing and sales, and the craft of building a compelling business case for data enterprises. This book is a blueprint for navigating the distinct terrain of data companies. It's an essential read for entrepreneurs, data scientists, marketers, and business strategists eager to harness the power of data. This book illuminates the path for creating value from data, ensuring ethical practices, and staying ahead in an ever-evolving industry.
Reference text in top universities like Stanford and Cambridge
Sold in over 85 countries, translated into more than 5 languages
Data and information are the fuel of this new age, where powerful analytics algorithms burn this fuel to generate decisions that are expected to create a smarter and more efficient world for all of us to live in. This new area of technology has been defined as Big Data Science and Analytics, and the industrial and academic communities are recognizing it as a competitive technology that can generate significant new wealth and opportunity.
Big data is defined as collections of datasets whose volume, velocity, or variety is so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data processing tools. Big data science and analytics deals with the collection, storage, processing, and analysis of massive-scale data. Industry surveys by Gartner and e-Skills, for instance, predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a 150 percent year-over-year rate.
We have written this textbook, as part of our expanding A Hands-On Approach(TM) series, to meet this need at colleges and universities, and also for big data service providers who may be interested in offering a broader perspective of this emerging field to accompany their customer and developer training programs. The typical reader is expected to have completed a couple of college-level courses in programming using traditional high-level languages, and is either a senior or a beginning graduate student in one of the science, technology, engineering, or mathematics (STEM) fields. An accompanying website for this book contains additional support for instruction and learning (www.big-data-analytics-book.com).
The book is organized into three main parts, comprising a total of twelve chapters. Part I provides an introduction to big data, applications of big data, and big data science and analytics patterns and architectures. A novel data science and analytics application system design methodology is proposed, and its realization through the use of open-source big data frameworks is described. This methodology describes big data analytics applications as realizations of the proposed Alpha, Beta, Gamma, and Delta models, which comprise tools and frameworks for collecting and ingesting data from various sources into the big data analytics infrastructure, incorporating distributed filesystems and non-relational (NoSQL) databases for data storage, and processing frameworks for batch and real-time analytics. This new methodology forms the pedagogical foundation of this book.
Part II introduces the reader to various tools and frameworks for big data analytics, and the architectural and programming aspects of these frameworks, with examples in Python. We describe Publish-Subscribe messaging frameworks (Kafka & Kinesis), Source-Sink connectors (Flume), Database Connectors (Sqoop), Messaging Queues (RabbitMQ, ZeroMQ, RestMQ, Amazon SQS) and custom REST, WebSocket and MQTT-based connectors. The reader is introduced to data storage, batch and real-time analysis, and interactive querying frameworks including HDFS, Hadoop, MapReduce, YARN, Pig, Oozie, Spark, Solr, HBase, Storm, Spark Streaming, Spark SQL, Hive, Amazon Redshift and Google BigQuery. Also described are serving databases (MySQL, Amazon DynamoDB, Cassandra, MongoDB) and the Django Python web framework.
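To give a flavor of the publish-subscribe messaging covered in Part II, here is a minimal sketch using the kafka-python client. It is not taken from the book; the broker address, topic name, and message payload are placeholder assumptions, and a Kafka broker must already be running.

```python
# Minimal publish-subscribe sketch with kafka-python (requires a broker at localhost:9092)
from kafka import KafkaProducer, KafkaConsumer

# Produce a message to a hypothetical topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": 1, "temp_c": 22.5}')
producer.flush()

# Consume messages from the same topic, starting from the earliest offset
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```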
Part III introduces the reader to various machine learning algorithms with examples using the Spark MLlib and H2O frameworks, and visualizations using frameworks such as Lightning, Pygal and Seaborn.
Get started with Apache Flink, the open source framework that powers some of the world's largest stream processing applications. With this practical book, you'll explore the fundamental concepts of parallel stream processing and discover how this technology differs from traditional batch data processing.
Longtime Apache Flink committers Fabian Hueske and Vasia Kalavri show you how to implement scalable streaming applications with Flink's DataStream API and continuously run and maintain these applications in operational environments. Stream processing is ideal for many use cases, including low-latency ETL, streaming analytics, and real-time dashboards as well as fraud detection, anomaly detection, and alerting. You can process continuous data of any kind, including user interactions, financial transactions, and IoT data, as soon as it is generated.
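For a sense of the source-transform-sink shape of a DataStream program, here is a minimal sketch. The book itself works in the Java and Scala DataStream APIs; this stand-in uses PyFlink, and the sensor readings and threshold are invented for illustration.

```python
# Minimal PyFlink DataStream sketch: source -> filter -> map -> sink
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical sensor readings standing in for a continuous source
readings = env.from_collection([
    ("sensor_1", 35.2),
    ("sensor_2", 17.8),
    ("sensor_1", 40.1),
])

# Keep only high readings and tag them -- a tiny stand-in for an alerting job
alerts = (readings
          .filter(lambda r: r[1] > 30.0)
          .map(lambda r: f"ALERT {r[0]}: {r[1]}"))

alerts.print()
env.execute("high-temperature-alerts")
```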
Highly accessible and easy to read, introducing concepts in discrete mathematics without requiring a university-level background in mathematics
Ideally structured for classroom-use and self-study, with modular chapters following ACM curriculum recommendations
Contains examples and exercises throughout the text, and highlights the most important concepts in each section
This book provides the trading system developer with a powerful set of statistical tools for measuring vital aspects of performance that are ignored by most developers.
All algorithms include intuitive justification, basic theory, all relevant equations, and highly commented C++ code for complete programs that run in a Windows Command Console.
Reprogramming them in other languages should be easy, given the detailed explanations of each algorithm.
The following topics are covered:
Testing for overfitting at the earliest possible stage
Evaluating the luckiness-versus-skill of a fully developed system before deploying it
Testing the effectiveness and reliability of a trading system factory
Removing selection bias when screening a large number of indicators
Probability bounds for future mean returns (see the sketch after this list)
Bounding typical and catastrophic future drawdowns
Is the best indicator or model in a competition truly the best, or just the luckiest?
Which markets provide truly superior profits for your trading system?
What holding time for your system provides the best risk/return performance?
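As one example of the kind of question addressed above, a probability bound on future mean returns can be estimated by bootstrap resampling of the observed return series. The sketch below is a generic illustration with simulated data, not a reproduction of the book's C++ algorithms.

```python
# Generic bootstrap sketch: approximate lower bound on future mean return
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(loc=0.001, scale=0.01, size=250)   # hypothetical daily returns

# Resample the observed returns many times and record each resample's mean
boot_means = np.array([
    rng.choice(returns, size=returns.size, replace=True).mean()
    for _ in range(10_000)
])

# The 5th percentile of the bootstrap means gives an approximate 95% lower bound
lower_bound = np.percentile(boot_means, 5)
print(f"Observed mean return: {returns.mean():.5f}")
print(f"Approx. 95% lower bound on mean return: {lower_bound:.5f}")
```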
Your data is your business. Is it supporting your strategy?
Algorithms, machine learning, generative AI. With the latest advancements in technology, data is taking center stage in organizations. But without a clear plan for how you're storing, integrating, and gaining value from that information, your company will be at risk--and fall behind.
If you read nothing else on setting a data strategy in your organization, read these 10 articles. We've combed through hundreds of Harvard Business Review articles and selected the most important ones to help you improve your analytic and algorithmic capabilities, navigate digital transformation, and achieve competitive advantage in our hyperconnected world.
This book will inspire you to:
HBR's 10 Must Reads paperback series is the definitive collection of books for new and experienced leaders alike. Leaders looking for the inspiration that big ideas provide, both to accelerate their own growth and that of their companies, should look no further. HBR's 10 Must Reads series focuses on the core topics that every ambitious manager needs to know: leadership, strategy, change, managing people, and managing yourself. Harvard Business Review has sorted through hundreds of articles and selected only the most essential reading on each topic. Each title includes timeless advice that will be relevant regardless of an ever‐changing business environment.
Ingest, transform, manipulate, and visualize your data beyond Power BI's capabilities.
Purchase of the print or Kindle book includes a free eBook in PDF format.
Key Features:
The latest edition of this book delves deeper into advanced analytics, focusing on enhancing Python and R proficiency within Power BI. New chapters cover optimizing Python and R settings, utilizing Intel's Math Kernel Library (MKL) for performance boosts, and addressing integration challenges. Techniques for managing large datasets beyond laptop RAM, employing the Parquet data format, and advanced fuzzy matching algorithms are explored. Additionally, it discusses leveraging SQL Server External Languages to overcome traditional Python and R limitations in Power BI. It also helps in crafting sophisticated visualizations using the grammar of graphics in both R and Python.
This Power BI book will help you master data validation with regular expressions, import data from diverse sources, and apply advanced algorithms for transformation. Next, you'll learn to safeguard personal data in Power BI with techniques like pseudonymization, anonymization, and data masking. You'll also get to grips with the key statistical features of datasets by plotting multiple visual graphs in the process of building a machine learning model. The book will also guide you through utilizing external APIs for enrichment, enhancing I/O performance, and leveraging Python and R for analysis.
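To give a flavor of the data-protection techniques mentioned above, here is a minimal pandas sketch of regular-expression validation plus hash-based pseudonymization. The column names and sample rows are invented; in Power BI's Python script editor the input table arrives as a pandas DataFrame named dataset, which the hard-coded frame below merely stands in for.

```python
# Minimal sketch: validate a column with a regex, then pseudonymize it by hashing
import hashlib
import pandas as pd

# Stand-in for the table Power BI hands to a Python script as `dataset`
dataset = pd.DataFrame({
    "email": ["ana@example.com", "not-an-email", "li@example.org"],
    "spend": [120.0, 80.5, 42.0],
})

# Validate with a regular expression, then pseudonymize the identifier so it
# stays consistent across refreshes but is no longer directly identifying.
dataset["email_valid"] = dataset["email"].str.match(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
dataset["email_pseudo"] = dataset["email"].map(
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:16])
dataset = dataset.drop(columns=["email"])

print(dataset)
```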
You'll also be able to reinforce learning with questions at the end of each chapter.
What you will learn
Who this book is for:
This book is for business analysts, business intelligence professionals, and data scientists who already use Microsoft Power BI and want to add more value to their analysis using Python and R. Working knowledge of Power BI is required to make the most of this book. Basic knowledge of Python and R will also be helpful.
Table of Contents
The global data market is estimated to be worth $64 billion, making it a more valuable resource than oil. But data is useless without the analysis, interpretation, and innovations of data scientists.
With Confident Data Science, learn the essential skills and build your confidence in this sector through key insights and practical tools for success. In this book, you will discover all of the skills you need to understand this discipline, from primers on the key analytic and visualization tools to tips for pitching to and working with clients. Adam Ross Nelson draws upon his expertise as a data science consultant, and as someone who moved into the industry late in his career, to provide an overview of data science, including its key concepts, its history, and the knowledge required to become a successful data scientist. Whether you are considering a career in this industry or simply looking to expand your knowledge, Confident Data Science is the essential guide to the world of data science.
Explore a variety of Excel features, functions, and productivity tips for various aspects of financial modeling.
Key Features:
Financial modeling is a core skill required by anyone who wants to build a career in finance. Hands-On Financial Modeling with Excel for Microsoft 365 explores financial modeling terminology with the help of Excel.
Starting with the key concepts of Excel, such as formulas and functions, this updated second edition will help you to learn all about referencing frameworks and other advanced components for building financial models. As you proceed, you'll explore the advantages of Power Query, learn how to prepare a 3-statement model, inspect your financial projects, build assumptions, and analyze historical data to develop data-driven models and functional growth drivers. Next, you'll learn how to deal with iterations and provide graphical representations of ratios, before covering best practices for effective model testing. Later, you'll discover how to build a model to extract a statement of comprehensive income and financial position, and understand capital budgeting with the help of end-to-end case studies.
By the end of this financial modeling Excel book, you'll have examined data from various use cases and have developed the skills you need to build financial models to extract the information required to make informed business decisions.
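As a small illustration of the capital-budgeting arithmetic the book develops in Excel, the sketch below computes a net present value and a simple payback period in Python for a hypothetical project; the cash flows and discount rate are invented, and this is only the underlying calculation, not the book's Excel workflow.

```python
# Tiny capital-budgeting sketch: NPV and simple payback for a hypothetical project
cash_flows = [-100_000, 30_000, 35_000, 40_000, 45_000]   # year 0 outlay, then inflows
discount_rate = 0.10

# Net present value: discount each year's cash flow back to today
npv = sum(cf / (1 + discount_rate) ** t for t, cf in enumerate(cash_flows))
print(f"NPV at {discount_rate:.0%}: {npv:,.2f}")

# Simple (undiscounted) payback: first year where cumulative cash flow turns positive
cumulative = 0.0
for year, cf in enumerate(cash_flows):
    cumulative += cf
    if cumulative >= 0:
        print(f"Payback within year {year}")
        break
```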
What you will learn
Who this book is for:
This book is for data professionals, analysts, traders, business owners, and students who want to develop and implement in-demand financial modeling skills in their finance, analysis, trading, and valuation work. Even if you don't have any experience in data and statistics, this book will help you get started with building financial models. Working knowledge of Excel is a prerequisite.
Table of Contents
Get going with tidymodels, a collection of R packages for modeling and machine learning. Whether you're just starting out or have years of experience with modeling, this practical introduction shows data analysts, business analysts, and data scientists how the tidymodels framework offers a consistent, flexible approach for your work.
RStudio engineers Max Kuhn and Julia Silge demonstrate ways to create models by focusing on an R dialect called the tidyverse. Software that adopts tidyverse principles shares both a high-level design philosophy and low-level grammar and data structures, so learning one piece of the ecosystem makes it easier to learn the next. You'll understand why the tidymodels framework has been built to be used by a broad range of people.
With this book, you will: