Data Science and Machine Learning

Books

Data Science and Analytics with Python, CRC Press, Taylor & Francis Group, ISBN: 9781138043176 (2017)

Essential Matlab and Octave, CRC Press, Taylor & Francis Group, ISBN: 9781482234633 (2014)

Advanced Data Science and Analytics with Python, CRC Press, Taylor & Francis Group, ISBN: 978-0429446610 (2020)

Statistics and Data Visualisation with Python, CRC Press, Taylor & Francis, ISBN: 9780367744519 (2022)

Stats and Data Viz - Book Cover

I am very excited to see the approved cover of my most recent book "Statistics and Data Visualisation with Python". The use of a Jackalope continues to be a theme, and in this case it is a figure used in Chapter 8, where we explore the creation of a number of useful charts with various Python modules.

The book is set to be published by December 14th and it has already started being listed at my publisher's website, marking the book as "forthcoming".

I look forward to the actual publication and to telling you all about the book and its contents.

You can take a look at some info about the book here.

Click here to read more...

Statistics and Data Visualisation with Python - Final Draft

I am very pleased to announce that the final draft of my book "Statistics and Data Visualisation with Python" has been completed. It has been a pleasure to write and to use examples referencing some of my favourite sci-fi characters, from Star Trek to Battlestar Galactica and more.

The book builds from the ground up the basis for statistical analysis underpinning a number of applications and algorithms in business analytics, machine learning and applied machine learning. It starts with the basics of programming in Python as well as data analysis to build a solid background in statistical methods and hypothesis testing useful in a variety of modern applications.

Table of Contents

  1. Data, Stats and Stories - An Introduction
  2. Python Programming Primer
  3. Snakes, Bears & Other Numerical Beasts: NumPy, SciPy and Pandas
  4. The measure of all things - Statistics
  5. Definitely Maybe: Probability and Distributions
  6. Alluring Arguments and Ugly Facts - Statistical Modelling and Hypothesis Testing
  7. Delightful Details - Data Visualisation
  8. Dazzling Data Designs - Creating Charts

I will let you know how the revisions go and I hope it will be available soon!

Click here to read more...

Data Science and Analytics with Python - Now in Chinese

I received a notification that my "Data Science and Analytics with Python" book is now available in Chinese. Great news and 谢谢!

You can take a look at the English versions (here and here)

The Chinese version is available here.

Click here to read more...

Using MATLAB for Data Science and Machine Learning

Python and R are not the only programming languages used in data science or machine learning applications. In a recent post in the Domino Data Blog I argue for the usefulness of MATLAB.

Check the post here.

While you are doing that also check my "Essential Matlab and Octave" book.

Click here to read more...

Getting Data with Beautiful Soup

Always wanted to get some data from the web in a programmatic way? Well, check out my recent post in the Domino Data Blog where I discuss how to get data with the help of Beautiful Soup.

The aim is to show how we can create a script that grabs the pages we are interested in and obtains the information we are after. In the post I cover how to complete these steps:

  1. Identify the webpage with the information we need
  2. Download the source code
  3. Identify the elements of the page that hold the information we need
  4. Extract and clean the information
  5. Format and save the data for further analysis
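To give a flavour of what those steps look like in code, here is a minimal sketch. It is not the script from the Domino post: the HTML snippet, the table id "books" and the helper names extract_rows and to_csv are all made up for illustration, and the download step is replaced by an inline string so the example runs offline.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-2 (stand-in): in a real script this string would be the page
# source downloaded from the site of interest. The markup is hypothetical.
HTML = """
<html><body>
<table id="books">
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td> Essential MATLAB and Octave </td><td>2014</td></tr>
  <tr><td>Data Science and Analytics with Python</td><td> 2017 </td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Steps 3-4: locate the table, then extract and clean each row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="books")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # the header row only has <th> cells
            continue
        rows.append({
            "title": cells[0].get_text(strip=True),
            "year": int(cells[1].get_text(strip=True)),
        })
    return rows


def to_csv(rows):
    """Step 5: format the cleaned rows as CSV text for later analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(HTML)
csv_text = to_csv(rows)
print(csv_text)
```

In practice most of the effort goes into step 3: inspecting the page to find which tags and attributes reliably identify the data you want.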

Click here to read more...

Data Exploration with Pandas Profiler and D-Tale

Are you interested in exploring data using Python? If so, take a look at this blog post of mine, where I talk about using Pandas Profiler and D-Tale to carry out data exploration.

The post covers helpful steps to:

  • Detect erroneous data.
  • Determine how much missing data there is.
  • Understand the structure of the data.
  • Identify important variables in the data.
  • Sense-check the validity of the data.

I use the Mammographic Mass Data Set from the UCI Machine Learning Repository. Information about this dataset can be obtained here.
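The profiling tools automate checks like the ones listed above. As a rough idea of what happens under the hood, here is a plain pandas sketch. The rows, the column names and the use of "?" as a missing-value marker are assumptions made up for this example; they are not taken from the real dataset or from the post.

```python
import io

import pandas as pd

# Hypothetical toy data: "?" marks a missing value and 740 is a typo-like age.
RAW = """age,density,severity
67,3,1
43,?,1
58,3,0
28,3,0
740,2,1
"""

df = pd.read_csv(io.StringIO(RAW), na_values="?")

# Understand the structure of the data
structure = df.dtypes

# Determine how much missing data there is
missing_per_column = df.isna().sum()

# Detect erroneous data with a simple plausibility rule (assumed range)
suspicious_age = df[(df["age"] < 0) | (df["age"] > 120)]

# Sense-check the validity of the data with summary statistics
summary = df.describe()

print(structure, missing_per_column, suspicious_age, summary, sep="\n\n")
```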

Read the full blog post in the Domino Data Blog here.

Click here to read more...

Statistics and Data Visualisation with Python - First Chapter Done

As you know, I am writing a new book. This time it is a book about statistics and data visualisation using Python as the main language to analyse data. I was thinking that I was a bit behind with my plan for the book, but I managed to surprise myself by being bang on time completing the first chapter.

This is the introductory chapter where we cover some background on the importance of statistics, a bit of history and the personalities behind some concepts widely used in stats and data visualisation. We then cover some background in formulating questions to be answered with data and how to communicate our results.

On to the next chapter! 🐍📊📖

Click here to read more...

Advanced Data Science and Analytics with Python - Video

Hello again! This is a video I recorded for my publisher about my book “Advanced Data Science and Analytics with Python”. You can get the book here and more about the book here.

https://vimeo.com/516105510

This companion to "Data Science and Analytics with Python" is the result of arguments with myself about writing something to cover a few of the areas that were not included in that first volume, largely due to space/time constraints. Like the previous book, this one exists thanks to the discussions, stand-ups, brainstorms and eventual implementations of algorithms and data science projects carried out with many colleagues and friends.

As the title suggests, this book continues to use Python as a tool to train, test and implement machine learning models and algorithms. The book is aimed at data scientists who would like to continue developing their skills and apply them in business and academic settings.

The subjects discussed in this book are complementary and a follow-up to the ones covered in Volume 1, "Data Science and Analytics with Python". The intended audience for this book is still composed of data analysts and early-career data scientists with some experience in programming and with a background in statistical modelling. In this case, however, the expectation is that they have already covered some areas of machine learning and data analytics. Although there are some references to the previous book, this volume is written to be read independently.

I have tried to keep the same tone as in the first book, peppering the pages with some bits and bobs of popular culture, science fiction and indeed Monty Python puns. The aim is still to focus on showing the concepts and ideas behind popular algorithms and their use.

In summary, each of the topics addressed in "Advanced Data Science and Analytics with Python" tackles the data science workflow from a practical perspective, concentrating on the process and results obtained. The material covers machine learning and pattern recognition algorithms, including time series analysis, natural language processing, topic modelling, social network analysis, neural networks and deep learning. The book discusses the need to develop data products and addresses the subject of bringing models to their intended audiences: in this case, literally to the users’ fingertips in the form of an iPhone app.

I hope you enjoy it and if you want to know more about my other books, please check the related videos here:

Click here to read more...

Data Science and Analytics with Python - Video

This is a video I made for my publisher about my book “Data Science and Analytics with Python”. You can get the book here and more about the book here.

https://vimeo.com/512245277

The book provides an introduction to some of the most used algorithms in data science and analytics. This book is the result of very interesting discussions, debates and dialogues with a large number of people at various levels of seniority, working at startups as well as long-established businesses, and in a variety of industries, from science to media to finance.

“Data Science and Analytics with Python” is intended to be a companion to data analysts and budding data scientists that have some working experience with both programming and statistical modelling, but who have not necessarily delved into the wonders of data analytics and machine learning. The book uses Python as a tool to implement and exploit some of the most common algorithms used in data science and data analytics today.

Python is a popular and versatile scripting and object-oriented language; it is easy to use and has a large, active community of developers and enthusiasts, not to mention the richness of its libraries, all of this helped by the versatility of the iPython/Jupyter Notebook.

In the book I address the balance between the knowledge required by a data scientist, such as mathematics and computer science, and the need for a good business background. To tackle the prevailing image of a unicorn data scientist, I am convinced that the use of a new symbol is needed. And a silly one at that! There is an allegory I usually propose to colleagues and those that talk about the data science Unicorn. It seems to me to be a more appropriate one than the existing image: it is still another mythical creature, less common perhaps than the unicorn, but more importantly with some faint fact about its actual existence: a Jackalope. You will have to read the book to find out more!

The main purpose of the book is to present the reader with some of the main concepts used in data science and analytics using tools developed in Python such as Scikit-learn, Pandas, Numpy and others. The book is intended to be a bridge to the data science and analytics world for programmers and developers, as well as graduates in scientific areas such as mathematics, physics, computational biology and engineering, to name a few.

The material covered includes machine learning and pattern recognition, various regression techniques, classification algorithms, decision trees and hierarchical clustering, and dimensionality reduction. This text is not recommended for those just getting started with computer programming.

There are a number of topics that were not covered in this book. If you are interested in more advanced topics, take a look at my book called “Advanced Data Science and Analytics with Python”. There is a follow-up video for that one! Keep an eye out for that!

Related Content: Please take a look at other videos about my books:

Click here to read more...

Essential MATLAB and Octave - Video

This is a video I made for my publisher about my book “Essential MATLAB and Octave”. You can get the book here and more about the book here.

The book is a primer for programming in Matlab and Octave within the context of numerical simulations for a variety of applications. Matlab and Octave are powerful programming languages widely used by scientists and engineers. They provide excellent capabilities for data analysis, visualisation and more.

https://vimeo.com/509514561

The book started as lecture notes for a course on Computational Physics - later turning into a wider encompassing syllabus covering aspects of computational finance, optimisation and even biology and economics.

The aim of the book is to help readers learn and apply programming in Matlab and Octave using straightforward explanations and examples from different areas in mathematics, engineering, finance, and physics.

Essential MATLAB and Octave explains how MATLAB and Octave are powerful tools applicable to a variety of problems. This text provides an introduction that reveals basic structures and syntax, demonstrates the use of functions and procedures, outlines availability in various platforms, and highlights the most important elements for both programs.

The book can be considered as a companion for programmers (new and experienced) that require the use of computers to solve numerical problems.

Code is presented in individual boxes and explanations are added as margin notes. Although both Matlab and Octave share a large number of features, they are not strictly the same. In cases where code is specific to one of the languages the margin notes provide clarity.

This text requires no prior knowledge and it is self-contained, allowing the reader to use the material whenever needed rather than follow a particular order.

Compatible with both languages, the book material incorporates commands and structures that allow the reader to gain a greater awareness of MATLAB and Octave, write their own code, and implement their scripts and programs within a variety of applicable fields.

It is always made clear when particular examples apply only to MATLAB or only to Octave, allowing the book to be used flexibly depending on readers’ requirements.

Click here to read more...

Statistics and Data Visualisation with Python

Right!!! It is early December and this post has been in the inkwell for a few months now. Earlier in the year I received the comments and suggestions from reviewers and the final approval from the excellent team at CRC Press for my 4th book.

After a few weeks of frank procrastination and a few more on structuring the thoughts proposed a bit more, I have got a clear head to start writing. So I am pleased to announce that I am officially starting to write “Statistics and Data Visualisation with #Python”.

"Statistics and Data Visualisation with Python" builds from the ground up the basis for statistical analysis underpinning a number of applications and algorithms in business analytics, machine learning and applied machine learning. The book will cover the basics of programming in Python as well as data analysis to build a solid background in statistical methods and hypothesis testing useful in a variety of modern applications.

Stay tuned!

Click here to read more...

"Advanced Data Science and Analytics with Python" - Arrived

I was not expecting this today, but I am very pleased to see that my first physical copies of "Advanced Data Science and Analytics" have arrived. I was working under the assumption that these would not be sent until after lockdowns were lifted, but that was not the case.

I am very happy to see the actual book and hold it in my hands!

I also hear that individual copies have started arriving to their new owners. If you ordered yours, let me know when it arrives. I will post your pictures!

Click here to read more...

Advanced Data Science and Analytics with Python - Discount

I am reaching out as volume 2 of my data science book will be out for publication in May and my publisher has made it possible for me to offer 20% off. You can order the book here.

This follows from "Data Science and Analytics with Python" and both books are intended for practitioners in data science and data analytics in both academic and business environments.

The new book aims to present the reader with concepts in data science and analytics that were deemed to be more advanced or simply out of scope in the author's first book, and are used in data analytics using tools developed in Python such as SciKit Learn, Pandas, Numpy, etc. The use of Python is of particular benefit given its recent popularity in the data science community. The book is therefore a reference to be used by seasoned programmers and newcomers alike, and the key benefit is the practical approach presented throughout the book.

More information about the first book can be found here.

Click here to read more...

Advanced Data Science and Analytics with Python - Final Corrections

Well, these are the final corrections for my latest book "Advanced Data Science and Analytics with Python". Next stop publication!

 

via Instagram https://ift.tt/2UrJ4oj

Click here to read more...

Advanced Data Science and Analytics with Python - Proofreading

Super excited to have received the proofread version of Advanced Data Science and Analytics with Python. They all seem to be very straightforward corrections: a few missing commas, some italics here and there and capitalisation bits and bobs.

I hope to be able to finish the corrections before my deadline of March 25th, and then enter the last phase before publication in May 2020.

Click here to read more...

Cover Draft for “Advanced Data Science and Analytics with Python”

I have received the latest information about the status of my book “Advanced Data Science and Analytics with Python”. This time I am reviewing the latest cover drafts for the book.

This is currently my favourite one.

I am now awaiting the proofreading comments, and I hope to update you about that soon.

Click here to read more...

Pandas 1.0 is out

If you are interested in #DataScience you have surely heard of #pandas, and you will be pleased to hear that version 1.0 is finally out, with better integration with NumPy and improvements with Numba, among others. Take a look!
— Read on www.anaconda.com/pandas-1-0-is-here/

Click here to read more...

Data Science Talk at University of Hertfordshire

It was great to be invited to give the joint Physics, Astronomy and Maths + Computer Science research seminar today at the University of Hertfordshire. I had a good opportunity to meet old colleagues and meet new faculty. There were also many students, and they came with many questions.

I was glad to hear they are thinking about offering more data science courses and even a dedicated programme. I would definitely be interested to hear more about that.

Click here to read more...

Advanced Data Science and Analytics with Python - Submitted!

There you go, the first checkpoint is completed: I have officially submitted the completed version of "Advanced Data Science and Analytics with Python".

The book has been some time in the making (and in the thinking...). It is a follow-up to my previous book, imaginatively called "Data Science and Analytics with Python". The book covers aspects that were necessarily left out in the previous volume; however, the readers in mind are still technical people interested in moving into the data science and analytics world. I have tried to keep the same tone as in the first book, peppering the pages with some bits and bobs of popular culture, science fiction and indeed Monty Python puns.

Advanced Data Science and Analytics with Python enables data scientists to continue developing their skills and apply them in business as well as academic settings. The subjects discussed in this book are complementary and a follow-up to the topics discussed in Data Science and Analytics with Python. The aim is to cover important advanced areas in data science using tools developed in Python such as SciKit-learn, Pandas, Numpy, Beautiful Soup, NLTK, NetworkX and others. The development is also supported by the use of frameworks such as Keras, TensorFlow and Core ML, as well as Swift for the development of iOS and macOS applications.

The book can be read independently from the previous volume, and each of the chapters in this volume is sufficiently independent from the others, providing flexibility for the reader. Each of the topics addressed in the book tackles the data science workflow from a practical perspective, concentrating on the process and results obtained. The implementation and deployment of trained models are central to the book.

Time series analysis, natural language processing, topic modelling, social network analysis, neural networks and deep learning are comprehensively covered in the book. The book discusses the need to develop data products and tackles the subject of bringing models to their intended audiences, in this case literally to the users' fingertips in the form of an iPhone app.

While the book is still in the oven, you may want to take a look at the first volume. You can get your copy here:

Furthermore you can see my Author profile here.

Click here to read more...

ODSC Europe 2019

It was a pleasure to come to the opening day of ODSC Europe 2019. This time round I was the first speaker of the first session, and it was very apt as the talk was effectively an introduction to Data Science.

The next 4 days will be very hectic for the attendees, and if the quality is similar to the previous editions we are going to have a great time.

Click here to read more...

Natural Language Processing - Talk

Last October I had the great opportunity to come and give a talk at the Facultad de Ciencias Políticas, UAEM, México. The main audience were students of the qualitative analysis methods course, but there were also people from informatics and systems engineering.

It was an opportunity to showcase some of the advances that natural language processing offers to social scientists interested in analysing discourse, from politics through to social interactions.

The talk covered an introduction to and brief history of the field. We went through the different stages of the analysis, from reading the data, obtaining tokens and labelling their part of speech (POS), to syntactic and semantic analysis.
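For readers who have not seen these stages before, the first two (reading the text and obtaining tokens) can be sketched with nothing but the standard library. This is only an illustration and not the code from the talk: the sample sentence and the tokenise helper are made up, and POS labelling is left out because it needs a trained tagger from a library such as NLTK.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a speech or a novel
TEXT = "The talk covered the history of the field. The field moves fast!"


def tokenise(text):
    """Lower-case the text and split it into runs of letters.

    The pattern [^\\W\\d_]+ matches Unicode letters only, so accented
    Spanish words are kept whole while digits and punctuation are dropped.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


tokens = tokenise(TEXT)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

Word counts like these are usually the first thing to look at before moving on to tagging and the syntactic and semantic stages.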

We finished the session with a couple of demos. One looking at speeches of Clinton and Trump during their presidential campaigns; the other one was a simple analysis of a novel in Spanish.

Thanks for the invite.

Click here to read more...

"Advanced Data Science And Analytics" is finished!

It has been a few months of writing, testing, re-writing and starting again, and I am pleased to say that the first complete draft of "Advanced Data Science and Analytics with Python" is ready. Last chapter is done and starting revisions now. Yay!

Click here to read more...

Adding new conda environment kernel to Jupyter and nteract

I know there are a ton of posts out there covering this very topic. I am writing this post more for my own benefit, so that I have a reliable place to check the commands I need to add a new conda environment to my Jupyter and nteract IDEs.

First to create an environment that contains, say TensorFlow, Pillow, Keras and pandas we need to type the following in the command line:

$ conda create -n tensorflow_env tensorflow pillow keras pandas jupyter ipykernel nb_conda

Now, to add this to the list of available environments in either Jupyter or nteract, we type the following:

$ conda activate tensorflow_env

$ python -m ipykernel install --name tensorflow_env

$ conda deactivate

Et voilà, you should now see the environment in the dropdown menu!
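A quick sanity check I find handy once the kernel is selected: run the cell below inside the notebook to confirm which Python installation the kernel is actually using. This is a small sketch using only the standard library; the helper name interpreter_info is made up, and reading the environment name off the end of the prefix path assumes the usual conda layout (something like .../envs/tensorflow_env).

```python
import os
import sys


def interpreter_info():
    """Report which Python installation is running this cell."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Assumption: for a conda environment the prefix normally ends
        # with the environment name, e.g. .../envs/tensorflow_env
        "env_name_guess": os.path.basename(sys.prefix),
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the prefix points at your base installation rather than the new environment, the notebook is still running on a different kernel.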

Click here to read more...

Data Science and Analytics with Python - Social Network Analysis

Using the time wisely during the Bank Holiday weekend. As my dad would say, "resting while making bricks"... Currently reviewing/editing/correcting Chapter 3 of "Advanced Data Science and Analytics with Python". Yes, that is volume 2 of "Data Science and Analytics with Python".


Click here to read more...

Social Network Analysis and Star Wars

On my way back to London and making the most of the time in the train to work on my Data Science and Analytics Vol 2 book. Working with #StarWars data to explain Social Network Analysis #datascience #geek

Click here to read more...

ODSC - Introduction to Data Science – A Practical Viewpoint


Very pleased to have the opportunity to share some thoughts with the keen audience attending ODSC Europe 2018.


My talk is not a technical presentation, as many of the other ones in the conference have been. Instead I wanted to present a workshop-style session that gives us the opportunity to interact with each other, share experiences and learn best practice in data science. The audience in mind is varied, from newcomers to the field to experienced practitioners.  You can find a handout of the slides in the link below:



[gview file="https://jrogel.com/wp-content/uploads/2018/09/JRogel_ODSC_An_Intro_To_DataScience.pdf"]

Click here to read more...

nteract - a great Notebook experience

I am a supporter of using Jupyter Notebooks for data exploration and code prototyping. It is a great way to start writing code and immediately get interactive feedback. Not only can you document your code there using markdown, but also you can embed images, plots, links and bring your work to life.

Nonetheless, there are some little annoyances that I have, for instance the fact that I need to launch a Kernel to open a file and having to do that "the long way" - i.e. I cannot double-click on the file that I am interested in seeing. Some ways to overcome this include looking at GitHub versions of my code, as the notebooks are rendered automatically, or even saving HTML or PDF versions of the notebooks. I am sure some of you may have similar solutions for this.

Last week, while looking for entries on something completely different, I stumbled upon a post that suggested using nteract. It sounded promising and I took a look. It turned out to be related to the Hydrogen package available for Atom, something I have used in the past and loved. nteract was different, though, as it offered a desktop version and other goodies such as in-app support for publishing, a terminal-free experience, sticky cells, input and output hiding... Bring it on!

I just started using it, and so far so good. You may want to give it a try, and maybe even contribute to the git repo.


Click here to read more...

Intro to Data Science Talk

Full room and great audience at General Assembly this evening. Lots of thoughtful questions and good discussion.

Click here to read more...

Now... presenting at ODSC Europe

Data science is definitely on everyone’s lips and this time I had the opportunity to showcase some of my thoughts, practices and interests at the Open Data Science Conference in London.

The event was very well attended by data scientists, engineers and developers at all levels of seniority, as well as business stakeholders. I had the great opportunity to present the landscape that newcomers and seasoned practitioners must be familiar with to be able to make a successful transition into this exciting field.

It was also a great opportunity to showcase “Data Science and Analytics with Python” and to get to meet new people including some that know other members of my family too.

-j

Click here to read more...

Data Science and Analytics with Python - New York Team

Earlier this week I received this picture of the team in New York. As you can see they have recently all received a copy of my "Data Science and Analytics with Python" book.

Thanks guys!


Click here to read more...

Another "Data Science and Analytics with Python" Delivered

Another "Data Science and Analytics with Python" Delivered. Thanks for sharing the picture Dave Groves.

Click here to read more...

Data Science and Analytics - In the hands of readers!

I’m very pleased to see that my “Data Science and Analytics” book is arriving in the hands of readers.

Here’s a picture that my colleague and friend Rob Hickling sent earlier today:

 

Click here to read more...


"Essential Matlab and Octave" in the CERN Document Server

A friend pinged me this screenshot after seeing "Essential MATLAB and Octave" included in the CERN Document Server!

Chuffed!

 

Click here to read more...

Data Science and Analytics with Python - Proofread Manuscript

I have now received comments and corrections for the proofreading of my “Data Science and Analytics with Python” book.

Two weeks and counting to return corrections and comments back to the editor and project manager.

 

Click here to read more...


"Data Science and Analytics with Python" enters production


I am very pleased to tell you about some news I received a couple of weeks ago from my editor: my book "Data Science and Analytics with Python" has been transferred to the production department so that they can begin the publication process!

UPDATE: The book is available here.

The book has been assigned a Project Editor who will handle the proofreading and all other aspects of the production process. This was after clearing the review process I told you about some time ago. The review was lengthy but it was very positive, and the comments of the reviewers have definitely improved the manuscript.

As a result of the review, the table of contents has changed a bit since the last update I posted. Here is the revised table:

  1. The Trials and Tribulations of a Data Scientist
  2. Python: For Something Completely Different!
  3. The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
  4. The Relationship Conundrum: Regression
  5. Jackalopes and Hares: Clustering
  6. Unicorns and Horses: Classification
  7. Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
  8. Less is More: Dimensionality Reduction
  9. Kernel Trick Under the Sleeve: Support Vector Machines

Each of the chapters is intended to be sufficiently self-contained. There are some occasions where reference to other sections is needed, and I am confident that it is a good thing for the reader. Chapter 1 is effectively a discussion of what data science and analytics are, paying particular attention to the data exploration process and munging. It also offers my perspective as to what skills and roles are required to get a successful data science function.

Chapter 2 is a quick reminder of some of the most important features of Python. In Chapter 3 we then move into the core machine learning concepts that are used in the rest of the book. Chapter 4 covers regression, from ordinary least squares to LASSO and ridge regression. Chapter 5 covers clustering (k-means, for example) and Chapter 6 covers classification algorithms such as Logistic Regression and Naïve Bayes.
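To give a flavour of the regression material: ridge regression is ordinary least squares plus a penalty on the size of the coefficients, and it has a closed-form solution, beta = (X'X + lambda I)^(-1) X'y, with lambda = 0 recovering ordinary least squares. LASSO swaps the squared penalty for an absolute-value one and has no closed form, so it is left out here. The sketch below is not code from the book; the toy data, the penalty value and the function name ridge_fit are made up for illustration, and no intercept is fitted.

```python
import numpy as np

# Hypothetical toy data: 50 observations, 2 features, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_beta = np.array([2.0, -1.0])
y = X @ true_beta


def ridge_fit(X, y, lam):
    """Solve (X'X + lam * I) beta = X'y; lam = 0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


beta_ols = ridge_fit(X, y, lam=0.0)
beta_ridge = ridge_fit(X, y, lam=10.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With noise-free data the least squares fit recovers the true coefficients, while the ridge fit is pulled towards zero; that shrinkage is the price paid for more stable estimates on noisy or collinear data.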

In Chapter 7 we introduce the use of hierarchical clustering, decision trees and talk about ensemble techniques such as bagging and boosting.

Dimensionality reduction techniques such as Principal Component Analysis are discussed in Chapter 8, and Chapter 9 covers the support vector machine algorithm and the all-important kernel trick in applications such as regression and classification.
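As a taste of the dimensionality reduction chapter, Principal Component Analysis can be sketched in a few lines of NumPy via the singular value decomposition of the centred data. Again, this is an illustrative sketch rather than the book's code: the toy data and the pca helper are made up.

```python
import numpy as np

# Hypothetical toy data: 100 points in 3 dimensions, stretched along axis 0
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.2])


def pca(X, n_components):
    """Project X onto its first n_components principal directions."""
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    return X_centred @ components.T, components, ratio[:n_components]


X_reduced, components, ratio = pca(X, n_components=2)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", ratio)
```

The explained variance ratio tells you how much of the spread in the data each retained component accounts for, which is the usual guide for deciding how many components to keep.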

The book contains 55 figures and 18 tables, plus plenty of bits and pieces of Python code to play with.

I guess I will have to sit and wait for the proofreading to be completed and then start the arduous process of going through the comments and suggestions. As ever, I will keep you posted as to how things go.

Ah! By the way, I will start a mailing list to tell people when the book is ready, so if you are interested, please let me know!

Keep in touch!

PS. The table of contents is also now available at CRC Press here.

Click here to read more...

Artificial Intelligence, Revealed

A few weeks ago I was invited by General Assembly to give a short intro to Data Science to a group of interested (and interesting) students. They all had different backgrounds, but they all shared an interest for technology and related subjects.

While I was explaining some of the differences between supervised and unsupervised machine learning, I used my example of an alien life form trying to cluster (and eventually classify) cats and dogs. If you are interested to know more about this, you will probably have to wait for the publication of my "Data Science and Analytics with Python" book... I digress...

So, Ed Shipley - one of the admissions managers at GA London - asked me and the students if we had seen the videos that Facebook had produced to explain machine learning... He was reminded of them as they use an example about a machine distinguishing between dogs and cars... (see what they did there?...). If you haven't seen the videos, here you go:

Intro to AI

Machine Learning

Convolutional Neural Nets


Data Analytics Python

First full draft of "Data Science and Analytics with Python"

It has been nearly 12 months in development, almost to the day, and I am very pleased to tell you that the first full draft of my new book entitled "Data Science and Analytics with Python" is ready.


The book is aimed at data enthusiasts and professionals with some knowledge of programming principles, as well as developers and business people interested in learning more about data science and analytics. The proposed table of contents is as follows:

  1. The Trials and Tribulations of a Data Scientist
  2. Firsts Slithers with Python
  3. The Machine that Goes “Ping”: Machine Learning and Pattern Recognition
  4. The Relationship Conundrum: Regression
  5. Jackalopes and Hares, Unicorns and Horses: Clustering and Classification
  6. Decisions, Decisions: Hierarchical Clustering, Decision Trees and Ensemble Techniques
  7. Dimensionality Reduction and Support Vector Machines

At the moment the book contains 53 figures and 18 tables, plus plenty of bits and pieces of code ready to be tried.

The next step is to start the re-reading, re-drafting and revisions in preparation for the final version and submission to my publisher CRC Press later in the year. I will keep you posted on how things go.

Keep in touch!

 


Markup for Fast Data Science Publication - Reblog

I am an avid user of Markdown via Mou and R Markdown (with RStudio). The facility that the iPython Notebook offers in combining code and text to be rendered in an interactive webpage is the choice for a number of things, including the 11-week Data Science course I teach at General Assembly.

As for LaTeX, well, I could not have survived my PhD without it and I still use it heavily. I have even created some videos about how to use LaTeX; you can take a look at them.

My book "Essential Matlab and Octave" was written and formatted in its entirety using LaTeX. My new book "Data Science and Analytics with Python" is having the same treatment.


I was very pleased to see the following blog post by Benjamin Bengfort. This is a reblog of that post and the original can be found here.

Markup for Fast Data Science Publication
Benjamin Bengfort

A central lesson of science is that to understand complex issues (or even simple ones), we must try to free our minds of dogma and to guarantee the freedom to publish, to contradict, and to experiment. — Carl Sagan in Billions & Billions: Thoughts on Life and Death at the Brink of the Millennium

As data scientists, it's easy to get bogged down in the details. We're busy implementing Python and R code to extract valuable insights from data, train effective machine learning models, or put a distributed computation system together. Many of these tasks, especially those relating to data ingestion or wrangling, are time-consuming but are the bread and butter of the data scientist's daily grind. What we often forget, however, is that we must not only be data engineers, but also contributors to the data science corpus of knowledge.

If a data product derives its value from data and generates more data in return, then a data scientist derives their value from previously published works and should generate more publications in return. Indeed, one of the reasons that Machine Learning has grown ubiquitous (see the many Python-tagged questions related to ML on Stack Overflow) is thanks to meticulous blog posts and tools from scientific research (e.g. Scikit-Learn) that enable the rapid implementation of a variety of algorithms. Google in particular has driven the growth of data products by publishing systems papers about their methodologies, enabling the creation of open source tools like Hadoop and Word2Vec.

By building on a firm base for both software and for modeling, we are able to achieve greater results, faster. Exploration, discussion, criticism, and experimentation all enable us to have new ideas, write better code, and implement better systems by tapping into the collective genius of a data community. Publishing is vitally important to keeping this data science gravy train on the tracks for the foreseeable future.

In academia, the phrase "publish or perish" describes the pressure to establish legitimacy through publications. Clearly, we don't want to take our role as authors that far, but the question remains, "How can we effectively build publishing into our workflow?" The answer is through markup languages - simple, streamlined markup that we can add to plain text documents that build into a publishing layout or format. For example, the following markup languages/platforms build into the accompanying publishable formats:

  • Markdown → HTML
  • iPython Notebook (JSON + Markdown) → Interactive Code
  • reStructuredText + Sphinx → Python Documentation, ReadTheDocs.org
  • AsciiDoc → ePub, Mobi, DocBook, PDF
  • LaTeX → PDF

The great thing about markup languages is that they can be managed inline with your code workflow in the same software versioning repository. Github goes even further as to automatically render Markdown files! In this post, we'll get you started with several markup and publication styles so that you can find what best fits into your workflow and deployment methodology.

Markdown

Markdown is the most ubiquitous of the markup languages we'll describe in this post, and its simplicity means that it is often chosen for a variety of domains and applications, not just publishing. Markdown, originally created by John Gruber, is a text-to-HTML processor, where lightweight syntactic elements are used instead of the more heavyweight HTML tags. Markdown is intended for folks writing for the web, not designing for the web, and in some CMS systems, it is simply the way that you write, no fancy text editor required.

Markdown has seen special growth thanks to Github, which has an extended version of Markdown, usually referred to as "Github-Flavored Markdown." This style of Markdown extends the basics of the original Markdown to include tables, syntax highlighting, and other inline formatting elements. If you create a Markdown file in Github, it is automatically rendered when viewing files on the web, and if you include a README.md in a directory, that file is rendered below the directory contents when browsing code. Github Issues are also expected to be in Markdown, further extended with tools like checkbox lists.

Markdown is used for so many applications it is difficult to name them all. Below are a select few that might prove useful to your publishing tasks.

  • Jekyll allows you to create static websites that are built from posts and pages written in Markdown.
  • Github Pages allows you to quickly publish Jekyll-generated static sites from a Github repository for free.
  • Silvrback is a lightweight blogging platform that allows you to write in Markdown (this blog is hosted on Silvrback).
  • Day One is a simple journaling app that allows you to write journal entries in Markdown.
  • iPython Notebook expects Markdown to describe blocks of code.
  • Stack Overflow expects questions, answers, and comments to be written in Markdown.
  • MkDocs is a tool for building software documentation from Markdown files; the result can be hosted on ReadTheDocs.org.
  • GitBook is a toolchain for publishing books written in Markdown to the web or as an eBook.

There is also a wide variety of editors, browser plugins, viewers, and tools available for Markdown. Both Sublime Text and Atom support Markdown and automatic preview, as do most text editors you'll use for coding. Mou is a desktop Markdown editor for Mac OSX and iA Writer is a distraction-free writing tool for Markdown for iOS. (Please comment your favorite tools for Windows and Android.) For Chrome, extensions like Markdown Here make it easy to compose emails in Gmail via Markdown, and Markdown Preview lets you view Markdown documents directly in the browser.

Clearly, Markdown enjoys a broad ecosystem and diverse usage. If you're still writing HTML for anything other than templates, you're definitely doing it wrong at this point! It's also worth including Markdown rendering for your own projects if you have user submitted text (also great for text-processing).

Rendering Markdown can be accomplished with the Python Markdown library, usually combined with the Bleach library for sanitizing bad HTML and linkifying raw text. A simple demo of this is as follows:

First install markdown and bleach using pip:

$ pip install markdown bleach

Then create a markdown parsing function as follows:

import bleach
from markdown import markdown

def htmlize(text):
    """
    This helper method renders Markdown then uses Bleach to sanitize it as
    well as converting all links in text to actual anchor tags.
    """
    text = bleach.clean(text, strip=True)  # Clean the text by stripping bad HTML tags
    text = markdown(text)                  # Convert the markdown to HTML
    text = bleach.linkify(text)            # Add links from the text and add nofollow to existing links

    return text

Given a markdown file test.md whose contents are as follows:

# My Markdown Document

For more information, search on [Google](http://www.google.com).

_Grocery List:_

1. Apples
2. Bananas
3. Oranges

The following code:

>>> with open('test.md', 'r') as f:
...     print(htmlize(f.read()))

Will produce the following HTML output:

<h1>My Markdown Document</h1>
For more information, search on <a href="http://www.google.com" rel="nofollow">Google</a>.

<em>Grocery List:</em>
<ol>
    <li>Apples</li>
    <li>Bananas</li>
    <li>Oranges</li>
</ol>

Hopefully this brief example has also served as a demonstration of how Markdown and other markup languages work to render much simpler text with lightweight markup constructs into a larger publishing framework. Markdown itself is most often used for web publishing, so if you need to write HTML, then this is the choice for you!
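To make the idea of lightweight markup being rendered into a publishing format a little more concrete, here is a minimal sketch of a converter for a tiny Markdown-like subset, using only the Python standard library. The function names (render, render_inline) and the supported subset (headings, ordered lists, links, underscore emphasis) are assumptions chosen for illustration. This is not how the Python Markdown library is implemented, and it is not meant for untrusted input.

```python
import html
import re

# Patterns for the tiny subset this sketch understands (illustrative only).
HEADING = re.compile(r"(#{1,6})\s+(.*)")
ORDERED_ITEM = re.compile(r"\d+\.\s+(.*)")
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
EMPHASIS = re.compile(r"_([^_]+)_")


def render_inline(text):
    """Escape HTML special characters, then convert links and _emphasis_."""
    text = html.escape(text)
    text = LINK.sub(r'<a href="\2">\1</a>', text)
    # Limitation of this sketch: underscores inside URLs are not protected.
    return EMPHASIS.sub(r"<em>\1</em>", text)


def render(text):
    """Convert blank-line-separated blocks into <h1>-<h6>, <ol> or <p>."""
    out = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines()]
        heading = HEADING.fullmatch(lines[0]) if len(lines) == 1 else None
        items = [ORDERED_ITEM.fullmatch(line) for line in lines]
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{render_inline(heading.group(2))}</h{level}>")
        elif all(items):
            body = "".join(f"<li>{render_inline(m.group(1))}</li>" for m in items)
            out.append(f"<ol>{body}</ol>")
        else:
            out.append(f"<p>{render_inline(' '.join(lines))}</p>")
    return "\n".join(out)


sample = """# My Markdown Document

For more information, search on [Google](http://www.google.com).

_Grocery List:_

1. Apples
2. Bananas
3. Oranges
"""
print(render(sample))
```

A real Markdown processor handles many more constructs (nested lists, code blocks, reference links, and so on), which is exactly why it is worth using an established library rather than rolling your own.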

To learn more about Markdown syntax, please see Markdown Basics.

iPython Notebook

iPython Notebook is a web-based, interactive environment that combines Python code execution, text (marked up with Markdown), mathematics, graphs, and media into a single document. The motivation for iPython Notebook was purely scientific: How do you demonstrate or present your results in a repeatable fashion where others can understand the work you've done? By creating an interactive environment where code, graphics, mathematical formulas, and rich text are unified and executable, iPython Notebook gives a presentation layer to otherwise unreadable or inscrutable code. Although Markdown is a big part of iPython Notebook, it deserves a special mention because of how critical it is to the data science community.

iPython Notebook is interesting because it combines both the presentation layer and the markup layer. When run as a server, usually locally, the notebook is editable, explorable (a tree view will present multiple notebook files), and executable - any code written in Python in the notebook can be evaluated and run using an interactive kernel in the background. Math formulas written in LaTeX are rendered using MathJax. To enhance the delivery and shareability of these notebooks, the NBViewer allows you to share static notebooks from a Github repository.

iPython Notebook comes with most scientific distributions of Python like Anaconda or Canopy, but it is also easy to install iPython with pip:

$ pip install ipython

iPython itself is an enhanced interactive Python shell or REPL that extends the basic Python REPL with many advanced features, primarily allowing for a decoupled two-process model that enables the notebook. This process model essentially runs Python as a background kernel that receives execution instructions from clients and returns responses back to them.
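The kernel/client split described above can be sketched with nothing but the standard library. What follows is a toy illustration and not iPython's actual protocol, which is considerably richer: a background Python process evaluates expressions sent to it as JSON lines over a pipe and sends a reply back for each one. The names KERNEL_SOURCE, start_kernel, execute and stop_kernel are hypothetical and exist only for this sketch.

```python
import json
import subprocess
import sys

# Source of the toy "kernel": read one JSON request per line, evaluate the
# expression it carries, and write one JSON reply per line.
KERNEL_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    try:
        reply = {"status": "ok", "result": repr(eval(request["code"], {}))}
    except Exception as exc:
        reply = {"status": "error", "result": repr(exc)}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''


def start_kernel():
    """Launch the toy kernel as a separate Python process."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", KERNEL_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def execute(kernel, code):
    """Client side: send an expression to the kernel and wait for its reply."""
    kernel.stdin.write(json.dumps({"code": code}) + "\n")
    kernel.stdin.flush()
    return json.loads(kernel.stdout.readline())


def stop_kernel(kernel):
    """Closing stdin ends the kernel's read loop, so the process exits."""
    kernel.stdin.close()
    kernel.wait()
    kernel.stdout.close()


kernel = start_kernel()
try:
    print(execute(kernel, "sum(range(10))"))
    print(execute(kernel, "1 / 0"))
finally:
    stop_kernel(kernel)
```

The point of the design is that the client (here a function, in the notebook a browser page) never runs user code itself; it only exchanges messages with the process that does.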

To start an iPython notebook execute the following command:

$ ipython notebook

This will start a local server at

http://127.0.0.1:8888

and automatically open your default browser to it. You'll start in the "dashboard view", which shows all of the notebooks available in the current working directory. Here you can create new notebooks and start to edit them. Notebooks are saved as .ipynb files in the local directory, a format called "Jupyter" that is simple JSON with a specific structure for representing each cell in the notebook. The Jupyter notebook files are easily versioned via Git and Github since they are also plain text.
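Since a notebook file is just JSON, it can be built and inspected with the standard json module. The sketch below assembles a two-cell notebook by hand and reads it back. The top-level keys and cell fields follow version 4 of the notebook format as I understand it; the helper names (markdown_cell, code_cell, cell_summary) are hypothetical. In practice you would normally let the notebook server, or a dedicated library, write these files for you.

```python
import json


def markdown_cell(text):
    """A Markdown cell: the source is stored as a list of lines."""
    return {
        "cell_type": "markdown",
        "metadata": {},
        "source": text.splitlines(keepends=True),
    }


def code_cell(code):
    """A code cell that has not been executed yet (no outputs)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": code.splitlines(keepends=True),
    }


notebook = {
    "cells": [markdown_cell("# Title"), code_cell("print(1 + 1)")],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# This string is what would be written to a file such as example.ipynb.
serialized = json.dumps(notebook, indent=1)


def cell_summary(serialized_notebook):
    """Return (cell_type, source) pairs for every cell in a notebook string."""
    loaded = json.loads(serialized_notebook)
    return [(cell["cell_type"], "".join(cell["source"])) for cell in loaded["cells"]]


print(cell_summary(serialized))
```

Because the file is plain text, a diff between two commits shows exactly which cells changed, although stored outputs can make those diffs noisy.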

To learn more about iPython Notebook, please see the iPython Notebook documentation.

reStructuredText

reStructuredText is an easy-to-read plaintext markup syntax specifically designed for use in Python docstrings or to generate Python documentation. In fact, the reStructuredText parser is a component of Docutils, an open-source text processing system that is used by Sphinx to generate intelligent and beautiful software documentation, in particular the native Python documentation.
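As a small illustration of reStructuredText inside a docstring, the function below documents its parameters and return value with the field-list syntax (:param:, :type:, :returns:, :rtype:) that Sphinx recognises. The function itself (scale) is a made-up example.

```python
def scale(values, factor=2.0):
    """Multiply every number in *values* by *factor*.

    :param values: Numbers to scale.
    :type values: list[float]
    :param factor: Multiplier applied to each number.
    :type factor: float
    :returns: A new list with the scaled numbers.
    :rtype: list[float]
    """
    return [value * factor for value in values]


print(scale([1.0, 2.5]))
print(scale.__doc__)
```

The docstring stays perfectly readable as plain text in the source file, which is the whole point of a lightweight markup.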

Python software has a long history of good documentation, particularly because of the idea that batteries should come included. And documentation is a very strong battery! PyPI, the Python Package Index, ensures that third-party packages provide documentation, and that the documentation can be easily hosted online through Python Hosted. Because of the ease of use and ubiquity of the tools, Python programmers are known for having very consistently documented code; sometimes it's hard to tell the standard library from third-party modules!

In How to Develop Quality Python Code, I mentioned that you should use Sphinx to generate documentation for your apps and libraries in a docs directory at the top-level. Generating reStructuredText documentation in a docs directory is fairly easy:

$ mkdir docs
$ cd docs
$ sphinx-quickstart

The quickstart utility will ask you many questions to configure your documentation. Aside from the project name, author, and version (which you have to type in yourself), the defaults are fine. However, I do like to change a few things:

...
> todo: write "todo" entries that can be shown or hidden on build (y/n) [n]: y
> coverage: checks for documentation coverage (y/n) [n]: y
...
> mathjax: include math, rendered in the browser by MathJax (y/n) [n]: y

Similar to iPython Notebook, reStructuredText can render mathematical formulas written in LaTeX syntax. This utility will create a Makefile for you; to generate HTML documentation, simply run the following command in the docs directory:

$ make html

The output will be built in the folder _build/html where you can open the index.html in your browser.
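The answers given to sphinx-quickstart end up in the generated docs/conf.py, which is an ordinary Python file. Assuming the three options above were answered with "y", the relevant part of that file would look roughly like the fragment below. The exact contents depend on your Sphinx version, so treat this as a sketch and not as the literal output of the tool.

```python
# docs/conf.py (fragment) - extensions enabled by the quickstart answers above
extensions = [
    "sphinx.ext.todo",      # "todo" entries that can be shown or hidden on build
    "sphinx.ext.coverage",  # checks for documentation coverage
    "sphinx.ext.mathjax",   # math rendered in the browser by MathJax
]

# Show todo entries in the built documentation; set to False to hide them.
todo_include_todos = True
```

Editing this file later is how you switch extensions on or off without re-running the quickstart utility.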

While hosting documentation on Python Hosted is a good choice, a better choice might be Read the Docs, a website that allows you to create, host, and browse documentation. One great part of Read the Docs is the stylesheet that they use; it's more readable than older ones. Additionally, Read the Docs allows you to connect a Github repository so that whenever you push new code (and new documentation), it is automatically built and updated on the website. Read the Docs can even maintain different versions of documentation for different releases.

Note that even if you aren't interested in the overhead of learning reStructuredText, you should use your newly found Markdown skills to ensure that you have good documentation hosted on Read the Docs. See MkDocs for document generation in Markdown that Read the Docs will render.

To learn more about reStructuredText syntax, please see the reStructuredText Primer.

AsciiDoc

When writing longer publications, you'll need a more expressive tool that is just as lightweight as Markdown but able to handle constructs that go beyond simple HTML, for example cross-references, chapter compilation, or multi-document build chains. Longer publications should also move beyond the web and be renderable as an eBook (ePub or Mobi formats) or for print layout, e.g. PDF. These requirements add more overhead, but simplify workflows for larger media publication.

Writing for O'Reilly, I discovered that I really enjoyed working in AsciiDoc - a lightweight markup syntax, very similar to Markdown, which renders to HTML or DocBook. DocBook is very important, because it can be post-processed into other presentation formats such as HTML, PDF, EPUB, DVI, MOBI, and more, making AsciiDoc an effective tool not only for web publishing but also print and book publishing. Most text editors have an AsciiDoc grammar for syntax highlighting, in particular sublime-asciidoc and Atom AsciiDoc Preview, which make writing AsciiDoc as easy as Markdown.

AsciiDoctor is an AsciiDoc-specific toolchain for building books and websites from AsciiDoc. The project connects the various AsciiDoc tools and allows a simple command-line interface as well as preview tools. AsciiDoctor is primarily used for HTML and eBook formats, but at the time of this writing there is a PDF renderer, which is in beta. Another interesting project of O'Reilly's is Atlas, a system for push-button publishing that manages AsciiDoc using a Git repository and wraps editorial build processes, comments, and automatic editing in a web platform. I'd be remiss not to mention GitBook which provides a similar toolchain for publishing larger books, though with Markdown.

Editor's Note: GitBook does support AsciiDoc.

To learn more about AsciiDoc markup see AsciiDoc 101.

LaTeX

If you've done any graduate work in the STEM degrees then you are probably already familiar with LaTeX to write and publish articles, reports, conference and journal papers, and books. LaTeX is not a simple markup language, to say the least, but it is effective. It is able to handle almost any publishing scenario you can throw at it, including (and in particular) rendering complex mathematical formulas correctly from a text markup language. Most data scientists still use LaTeX, using MathJax or the Daum Equation Editor, if only for the math.

If you're going to be writing PDFs or reports, I can provide two primary tips for working with LaTeX. First consider cloud-based editing with Overleaf or ShareLaTeX, which allows you to collaborate and edit LaTeX documents similarly to Google Docs. Both of these systems have many of the classes and stylesheets already so that you don't have to worry too much about the formatting, and instead just get down to writing. Additionally, they aggregate other tools like LaTeX templates and provide templates of their own for most document types.

My personal favorite workflow, however, is to use the Atom editor with the LaTeX package and the LaTeX grammar. When using Atom, you get very nice Git and Github integration - perfect for collaboration on larger documents. If you have a TeX distribution installed (and you will need to do that on your local system, no matter what), then you can automatically build your documents within Atom and view them in PDF preview.

A complete tutorial for learning LaTeX can be found at Text Formatting with LaTeX.

Conclusion

Software developers agree that testing and documentation are vital to the successful creation and deployment of applications. However, although Agile workflows are designed to ensure that documentation and testing are included in the software development lifecycle, too often testing and documentation are left to last, or forgotten. When managing a development project, team leads need to ensure that documentation and testing are part of the "definition of done."

In the same way, writing is vital to the successful creation and deployment of data products, and is similarly left to last or forgotten. Through publication of our work and ideas, we open ourselves up to criticism, an effective methodology for testing ideas and discovering new ones. Similarly, by explicitly sharing our methods, we make it easier for others to build systems rapidly, and in return, write tutorials that help us better build our systems. And if we translate scientific papers into practical guides, we help to push science along as well.

Don't get bogged down in the details of writing, however. Use simple, lightweight markup languages to include documentation alongside your projects. Collaborate with other authors and your team using version control systems, and use free tools to make your work widely available. All of this is possible because of lightweight markup languages, and the more proficient you are at including writing in your workflow, the easier it will be to share your ideas.

Helpful Links

This post is particularly link-heavy with many references to tools and languages. For reference, here are my preferred guides for each of the Markup languages discussed:

Books to Read

Special thanks to Rebecca Bilbro for editing and contributing to this post. Without her, this would certainly have been much less readable!

As always, please follow @DistrictDataLab on Twitter and subscribe to this blog by clicking the Subscribe button on the blog home page.

Benjamin Bengfort

 


Essential MATLAB and Octave - Closer to publication date

The publication date of "Essential MATLAB and Octave" is getting closer and closer. I would like to use this as an opportunity to share yet another endorsement, this time from Dr Hiram Luna-Munguia from the Department of Neurology at the University of Michigan:

This well-written book is a must-have for those people starting to solve numerical problems in Matlab or Octave. Since the beginning the reader will appreciate that the book's major goal is to describe the essential aspects of both software without discrediting or highlighting the use of any of them. Page by page you will find clear explanations describing the way you should communicate with each software. The set of homework problems given at the end of each chapter makes the book even more dynamic.

Students and experts will warmly welcome Essential Matlab and Octave: A Beginner's Handbook into their libraries. I highly recommend it as an excellent reference tool.


 
