Structured Documents in LaTeX

This is a video I made a few years ago to encourage my students to use better tools to write dissertations, theses and reports that include the use of mathematics. The principles stand, although the tools may have moved on since then. I am reposting it as requested by a colleague of mine, Dr Catarina Carvalho, who I hope will still find this useful.

In this video we continue explaining how to use LaTeX. Here we will see how to use a master document in order to build a thesis or dissertation.
We assume that you have already had a look at the tutorial entitled “LaTeX for writing mathematics – An introduction”.


LaTeX for writing mathematics – An introduction

This is a video I made a few years ago to encourage my students to use better tools to write dissertations, theses and reports that include the use of mathematics. The principles stand, although the tools may have moved on since then. I am reposting it as requested by a colleague of mine, Dr Catarina Carvalho, who I hope will still find this useful.

In this video we explore the LaTeX document preparation system. We start by explaining an example document. We have used TeXmaker as an editor, given its flexibility and the fact that it is available for different platforms.


MacTeX updates for El Capitan

El Capitan! Great! The new version of the OS X operating system. New features, new fonts, new problems… I knew that updating was going to bring some unexpected problems with my applications, but I wanted to update… And sure enough, as soon as I tried to take a look under the hood for a couple of things I realised that a fresh installation of Homebrew was going to be needed.

More importantly, with my new book on data science (aka “Data Science and Analytics with Python”), LaTeX is probably one of the most used things on my computer. So, I wanted to check that things were fine. I could still compile (currently trying to finish Chapter 3, in case you are wondering), but there were some issues here and there: for example, TeX Live thought I was using version 0 (yes, zero!) and it could not find some files.

It turns out that El Capitan does not let us write to /usr. The 2015 TeX distribution creates a symbolic link at /usr/texbin, and under El Capitan that link is removed (if it was there from a previous OS version) and cannot be installed again. If a GUI application looks at that location by default, it will sadly no longer find the TeX binaries. That is why the terminal was not affected! (Phew!)

The solution is to tell the broken applications to look at /Library/TeX/texbin instead. This sits inside /Library/TeX, which is “owned” by MacTeX and so is allowed by El Capitan. To fix TeX Live Utility, do the following:

  • Open the TeX Live Utility Preferences and click on Choose…
  • That opens a file chooser. Type Shift-Cmd-G, enter /Library/TeX into the dialog box and then press Return.
  • Finally, double-click on texbin.
  • Et voilà!

 

For more info see this link.


Markup for Fast Data Science Publication – Reblog

I am an avid user of Markdown via Mou and R Markdown (with RStudio). The ease with which the iPython Notebook combines code and text into an interactive web page makes it my choice for a number of things, including the 11-week Data Science course I teach at General Assembly.

As for LaTeX, well, I could not have survived my PhD without it and I still use it heavily. I have even created some videos about how to use LaTeX; you can take a look at them.

My book “Essential Matlab and Octave” was written and formatted in its entirety using LaTeX. My new book “Data Science and Analytics with Python” is having the same treatment.


I was very pleased to see the following blog post by Benjamin Bengfort. This is a reblog of that post and the original can be found here.

Markup for Fast Data Science Publication
Benjamin Bengfort

A central lesson of science is that to understand complex issues (or even simple ones), we must try to free our minds of dogma and to guarantee the freedom to publish, to contradict, and to experiment. — Carl Sagan in Billions & Billions: Thoughts on Life and Death at the Brink of the Millennium

As data scientists, it’s easy to get bogged down in the details. We’re busy implementing Python and R code to extract valuable insights from data, train effective machine learning models, or put a distributed computation system together. Many of these tasks, especially those relating to data ingestion or wrangling, are time-consuming but are the bread and butter of the data scientist’s daily grind. What we often forget, however, is that we must not only be data engineers, but also contributors to the data science corpus of knowledge.

If a data product derives its value from data and generates more data in return, then a data scientist derives their value from previously published works and should generate more publications in return. Indeed, one of the reasons that Machine Learning has grown ubiquitous (see the many Python-tagged questions related to ML on Stack Overflow) is thanks to meticulous blog posts and tools from scientific research (e.g. Scikit-Learn) that enable the rapid implementation of a variety of algorithms. Google in particular has driven the growth of data products by publishing systems papers about their methodologies, enabling the creation of open source tools like Hadoop and Word2Vec.

By building on a firm base for both software and for modeling, we are able to achieve greater results, faster. Exploration, discussion, criticism, and experimentation all enable us to have new ideas, write better code, and implement better systems by tapping into the collective genius of a data community. Publishing is vitally important to keeping this data science gravy train on the tracks for the foreseeable future.

In academia, the phrase “publish or perish” describes the pressure to establish legitimacy through publications. Clearly, we don’t want to take our role as authors that far, but the question remains, “How can we effectively build publishing into our workflow?” The answer is through markup languages – simple, streamlined markup that we can add to plain text documents that build into a publishing layout or format. For example, the following markup languages/platforms build into the accompanying publishable formats:

  • Markdown → HTML
  • iPython Notebook (JSON + Markdown) → Interactive Code
  • reStructuredText + Sphinx → Python Documentation, ReadTheDocs.org
  • AsciiDoc → ePub, Mobi, DocBook, PDF
  • LaTeX → PDF

The great thing about markup languages is that they can be managed inline with your code workflow in the same software versioning repository. Github goes even further as to automatically render Markdown files! In this post, we’ll get you started with several markup and publication styles so that you can find what best fits into your workflow and deployment methodology.

Markdown

Markdown is the most ubiquitous of the markup languages we’ll describe in this post, and its simplicity means that it is often chosen for a variety of domains and applications, not just publishing. Markdown, originally created by John Gruber, is a text-to-HTML processor, where lightweight syntactic elements are used instead of the more heavyweight HTML tags. Markdown is intended for folks writing for the web, not designing for the web, and in some CMS systems, it is simply the way that you write, no fancy text editor required.

Markdown has seen special growth thanks to Github, which has an extended version of Markdown, usually referred to as “Github-Flavored Markdown.” This style of Markdown extends the basics of the original Markdown to include tables, syntax highlighting, and other inline formatting elements. If you create a Markdown file in Github, it is automatically rendered when viewing files on the web, and if you include a README.md in a directory, that file is rendered below the directory contents when browsing code. Github Issues are also expected to be in Markdown, further extended with tools like checkbox lists.

Markdown is used for so many applications it is difficult to name them all. Below are a select few that might prove useful to your publishing tasks.

  • Jekyll allows you to create static websites that are built from posts and pages written in Markdown.
  • Github Pages allows you to quickly publish Jekyll-generated static sites from a Github repository for free.
  • Silvrback is a lightweight blogging platform that allows you to write in Markdown (this blog is hosted on Silvrback).
  • Day One is a simple journaling app that allows you to write journal entries in Markdown.
  • iPython Notebook expects Markdown to describe blocks of code.
  • Stack Overflow expects questions, answers, and comments to be written in Markdown.
  • MkDocs is a software documentation tool written in Markdown that can be hosted on ReadTheDocs.org.
  • GitBook is a toolchain for publishing books written in Markdown to the web or as an eBook.

There are also a wide variety of editors, browser plugins, viewers, and tools available for Markdown. Both Sublime Text and Atom support Markdown and automatic preview, as well as most text editors you’ll use for coding. Mou is a desktop Markdown editor for Mac OSX and iA Writer is a distraction-free writing tool for Markdown for iOS. (Please comment your favorite tools for Windows and Android). For Chrome, extensions like Markdown Here make it easy to compose emails in Gmail via Markdown or Markdown Preview to view Markdown documents directly in the browser.

Clearly, Markdown enjoys a broad ecosystem and diverse usage. If you’re still writing HTML for anything other than templates, you’re definitely doing it wrong at this point! It’s also worth including Markdown rendering for your own projects if you have user submitted text (also great for text-processing).

Rendering Markdown can be accomplished with the Python Markdown library, usually combined with the Bleach library for sanitizing bad HTML and linkifying raw text. A simple demo of this is as follows:

First install markdown and bleach using pip:

$ pip install markdown bleach

Then create a markdown parsing function as follows:

import bleach
from markdown import markdown

def htmlize(text):
    """
    This helper method renders Markdown then uses Bleach to sanitize it as
    well as converting all links in text to actual anchor tags.
    """
    text = bleach.clean(text, strip=True)  # Clean the text by stripping bad HTML tags
    text = markdown(text)                  # Convert the markdown to HTML
    text = bleach.linkify(text)            # Add links from the text and add nofollow to existing links

    return text

Given a markdown file test.md whose contents are as follows:

# My Markdown Document

For more information, search on [Google](http://www.google.com).

_Grocery List:_

1. Apples
2. Bananas
3. Oranges

The following code:

>>> with open('test.md', 'r') as f:
...     print(htmlize(f.read()))

Will produce the following HTML output:

<h1>My Markdown Document</h1>
For more information, search on <a href="http://www.google.com" rel="nofollow">Google</a>.

<em>Grocery List:</em>
<ol>
    <li>Apples</li>
    <li>Bananas</li>
    <li>Oranges</li>
</ol>

Hopefully this brief example has also served as a demonstration of how Markdown and other markup languages work to render much simpler text with lightweight markup constructs into a larger publishing framework. Markdown itself is most often used for web publishing, so if you need to write HTML, then this is the choice for you!

To learn more about Markdown syntax, please see Markdown Basics.

iPython Notebook

iPython Notebook is a web-based, interactive environment that combines Python code execution, text (marked up with Markdown), mathematics, graphs, and media into a single document. The motivation for iPython Notebook was purely scientific: How do you demonstrate or present your results in a repeatable fashion where others can understand the work you’ve done? By creating an interactive environment where code, graphics, mathematical formulas, and rich text are unified and executable, iPython Notebook gives a presentation layer to otherwise unreadable or inscrutable code. Although Markdown is a big part of iPython Notebook, it deserves a special mention because of how critical it is to the data science community.

iPython Notebook is interesting because it combines both the presentation layer and the markup layer. When run as a server, usually locally, the notebook is editable, explorable (a tree view will present multiple notebook files), and executable – any code written in Python in the notebook can be evaluated and run using an interactive kernel in the background. Mathematical formulas written in LaTeX are rendered using MathJax. To enhance the delivery and shareability of these notebooks, the NBViewer allows you to share static notebooks from a Github repository.
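As a small illustration of the LaTeX that MathJax renders, a Markdown cell could contain a display formula between double dollar signs, such as the one below. The formula (the ordinary least squares estimator) is chosen here purely for illustration and does not come from the original post.

```latex
$$
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
$$
```

Inline formulas use single dollar signs instead, for example `$e^{i\pi} + 1 = 0$`.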

iPython Notebook comes with most scientific distributions of Python like Anaconda or Canopy, but it is also easy to install iPython with pip:

$ pip install ipython

iPython itself is an enhanced interactive Python shell or REPL that extends the basic Python REPL with many advanced features, primarily allowing for a decoupled two-process model that enables the notebook. This process model essentially runs Python as a background kernel that receives execution instructions from clients and returns responses back to them.

To start an iPython notebook execute the following command:

$ ipython notebook

This will start a local server at

http://127.0.0.1:8888

and automatically open your default browser to it. You’ll start in the “dashboard view”, which shows all of the notebooks available in the current working directory. Here you can create new notebooks and start to edit them. Notebooks are saved as .ipynb files in the local directory, in the Jupyter notebook format: simple JSON with a specific structure for representing each cell in the notebook. Since they are plain text, the Jupyter notebook files are easily versioned via Git and Github.

To learn more about iPython Notebook, please see the iPython Notebook documentation.

reStructuredText

reStructuredText is an easy-to-read plaintext markup syntax specifically designed for use in Python docstrings or to generate Python documentation. In fact, the reStructuredText parser is a component of Docutils, an open-source text processing system that is used by Sphinx to generate intelligent and beautiful software documentation, in particular the native Python documentation.

Python software has a long history of good documentation, particularly because of the idea that batteries should come included. And documentation is a very strong battery! PyPI, the Python Package Index, ensures that third party packages provide documentation, and that the documentation can be easily hosted online through Python Hosted. Because of the ease of use and ubiquity of the tools, Python programmers are known for having very consistently documented code; sometimes it’s hard to tell the standard library from third party modules!

In How to Develop Quality Python Code, I mentioned that you should use Sphinx to generate documentation for your apps and libraries in a docs directory at the top-level. Generating reStructuredText documentation in a docs directory is fairly easy:

$ mkdir docs
$ cd docs
$ sphinx-quickstart

The quickstart utility will ask you many questions to configure your documentation. Aside from the project name, author, and version (which you have to type in yourself), the defaults are fine. However, I do like to change a few things:

...
> todo: write "todo" entries that can be shown or hidden on build (y/n) [n]: y
> coverage: checks for documentation coverage (y/n) [n]: y
...
> mathjax: include math, rendered in the browser by MathJax (y/n) [n]: y

Similar to iPython Notebook, reStructuredText can render mathematical formulas written in LaTeX syntax. The quickstart utility will also create a Makefile for you; to generate HTML documentation, simply run the following command in the docs directory:

$ make html

The output will be built in the folder _build/html where you can open the index.html in your browser.

While hosting documentation on Python Hosted is a good choice, a better choice might be Read the Docs, a website that allows you to create, host, and browse documentation. One great part of Read the Docs is the stylesheet that they use; it’s more readable than older ones. Additionally, Read the Docs allows you to connect a Github repository so that whenever you push new code (and new documentation), it is automatically built and updated on the website. Read the Docs can even maintain different versions of documentation for different releases.

Note that even if you aren’t interested in the overhead of learning reStructuredText, you should use your newly found Markdown skills to ensure that you have good documentation hosted on Read the Docs. See MkDocs for document generation in Markdown that Read the Docs will render.

To learn more about reStructuredText syntax, please see the reStructuredText Primer.

AsciiDoc

When writing longer publications, you’ll need a more expressive tool that is just as lightweight as Markdown but able to handle constructs that go beyond simple HTML, for example cross-references, chapter compilation, or multi-document build chains. Longer publications should also move beyond the web and be renderable as an eBook (ePub or Mobi formats) or for print layout, e.g. PDF. These requirements add more overhead, but simplify workflows for larger media publication.

Writing for O’Reilly, I discovered that I really enjoyed working in AsciiDoc – a lightweight markup syntax, very similar to Markdown, which renders to HTML or DocBook. DocBook is very important, because it can be post-processed into other presentation formats such as HTML, PDF, EPUB, DVI, MOBI, and more, making AsciiDoc an effective tool not only for web publishing but also print and book publishing. Most text editors have an AsciiDoc grammar for syntax highlighting, in particular sublime-asciidoc and Atom AsciiDoc Preview, which make writing AsciiDoc as easy as Markdown.

AsciiDoctor is an AsciiDoc-specific toolchain for building books and websites from AsciiDoc. The project connects the various AsciiDoc tools and allows a simple command-line interface as well as preview tools. AsciiDoctor is primarily used for HTML and eBook formats, but at the time of this writing there is a PDF renderer, which is in beta. Another interesting project of O’Reilly’s is Atlas, a system for push-button publishing that manages AsciiDoc using a Git repository and wraps editorial build processes, comments, and automatic editing in a web platform. I’d be remiss not to mention GitBook which provides a similar toolchain for publishing larger books, though with Markdown.

Editor’s Note: GitBook does support AsciiDoc.

To learn more about AsciiDoc markup see AsciiDoc 101.

LaTeX

If you’ve done any graduate work in a STEM field then you are probably already familiar with using LaTeX to write and publish articles, reports, conference and journal papers, and books. LaTeX is not a simple markup language, to say the least, but it is effective. It is able to handle almost any publishing scenario you can throw at it, including (and in particular) rendering complex mathematical formulas correctly from a text markup language. Most data scientists still use LaTeX, using MathJax or the Daum Equation Editor, if only for the math.

If you’re going to be writing PDFs or reports, I can provide two primary tips for working with LaTeX. First consider cloud-based editing with Overleaf or ShareLaTeX, which allows you to collaborate and edit LaTeX documents similarly to Google Docs. Both of these systems have many of the classes and stylesheets already so that you don’t have to worry too much about the formatting, and instead just get down to writing. Additionally, they aggregate other tools like LaTeX templates and provide templates of their own for most document types.

My personal favorite workflow, however, is to use the Atom editor with the LaTeX package and the LaTeX grammar. When using Atom, you get very nice Git and Github integration – perfect for collaboration on larger documents. If you have a TeX distribution installed (and you will need to do that on your local system, no matter what), then you can automatically build your documents within Atom and view them in PDF preview.

A complete tutorial for learning LaTeX can be found at Text Formatting with LaTeX.

Conclusion

Software developers agree that testing and documentation are vital to the successful creation and deployment of applications. However, although Agile workflows are designed to ensure that documentation and testing are included in the software development lifecycle, too often testing and documentation are left to last, or forgotten. When managing a development project, team leads need to ensure that documentation and testing are part of the “definition of done.”

In the same way, writing is vital to the successful creation and deployment of data products, and is similarly left to last or forgotten. Through publication of our work and ideas, we open ourselves up to criticism, an effective methodology for testing ideas and discovering new ones. Similarly, by explicitly sharing our methods, we make it easier for others to build systems rapidly, and in return, write tutorials that help us better build our systems. And if we translate scientific papers into practical guides, we help to push science along as well.

Don’t get bogged down in the details of writing, however. Use simple, lightweight markup languages to include documentation alongside your projects. Collaborate with other authors and your team using version control systems, and use free tools to make your work widely available. All of this is possible because of lightweight markup languages, and the more proficient you are at including writing in your workflow, the easier it will be to share your ideas.

Helpful Links

This post is particularly link-heavy with many references to tools and languages. For reference, here are my preferred guides for each of the Markup languages discussed:

Books to Read

Special thanks to Rebecca Bilbro for editing and contributing to this post. Without her, this would certainly have been much less readable!

As always, please follow @DistrictDataLab on Twitter and subscribe to this blog by clicking the Subscribe button on the blog home page.

Benjamin Bengfort

 

Furigana (ふりがな) in LaTeX

Furigana example (Photo credit: Wikipedia)

Some time ago I wrote a post about adding furigana using MS Word for Mac. It seems that the post has been quite useful to a few readers. Nonetheless, some of you have contacted me about the remark I made about doing this in LaTeX.

So far I have helped people individually when they have asked, but as I promised in that post, I have finally got round to writing a post about adding furigana using LaTeX. Here is how:

You will need the following packages installed in your LaTeX distribution:

With these packages installed and working in your distribution, you can now use a document similar to the following:

\documentclass[12pt]{article}

\usepackage[10pt]{type1ec} % use only 10pt fonts
\usepackage[T1]{fontenc}
\usepackage{CJKutf8}
\usepackage[german, russian, vietnam, USenglish]{babel}
\usepackage[overlap, CJK]{ruby}
\usepackage{CJKulem}
\renewcommand{\rubysep}{-0.2ex}
\newenvironment{Japanese}{%
\CJKfamily{min}%
\CJKtilde
\CJKnospace}{}
\begin{document}
\begin{CJK}{UTF8}{}
\begin{Japanese}
\noindent これは日本語の文章
\noindent Hello 
\begin{equation}
 \frac{2}{\pi}
\end{equation}
私は日本語の勉強します!
furigana: \ruby{私}{わたし}
\end{Japanese}
\end{CJK}
\end{document}

The outcome of the script above can be seen below:

Furigana Latex

iBooks Author supports LaTeX now

When Apple launched iBooks Author back in January 2012 I was quite curious to see the things that you were able to do with it. It all looked very nice and relatively easy to use. You can create documents using some templates provided and you then are able to export them as PDF or even publish them as iBooks.

Unfortunately, at the time, Apple failed to provide any easy way to include equations or mathematical symbols. That alone put me off using the application altogether (see post). However, in the recent update (released on October 23rd) Apple has finally included an equation editor that uses LaTeX or MathML. I have just tried it and it seems to do a good job. It is definitely not as powerful as the actual LaTeX engine (it does not let you number the equations automatically, for instance), but it is an improvement.

Here are some screenshots of the little first trial I did. As you can see the update clearly states that the new editor accepts native LaTeX or MathML:

 

ibooks_latex1

Now, to insert a new equation:

ibooks_latex2

 

This opens up the equation editor:

ibooks_latex3

 

In the new window you can start typing your LaTeX commands. Notice that you don’t need to start an equation environment as you would do in LaTeX, you simply type the commands that will create the maths:

 

ibooks_latex4
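To make the difference concrete, here is a sketch with a formula chosen for illustration (it is not the one in the screenshots). In the iBooks Author equation editor you would type only the bare commands; in a LaTeX document the same commands sit inside an equation environment, which is also what provides the automatic numbering.

```latex
% What you type in the iBooks Author equation editor:
\int_{-\infty}^{\infty} e^{-x^{2}} \, dx = \sqrt{\pi}

% The same formula in a LaTeX document:
\begin{equation}
  \int_{-\infty}^{\infty} e^{-x^{2}} \, dx = \sqrt{\pi}
\end{equation}
```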

Once you have done that, simply tell iBooks Author to insert the equation, and voilà:

ibooks_latex5

 

Have you used iBooks Author? What do you think of it? What is your opinion about the support for LaTeX?

I think I may give it a go, but will probably continue using LaTeX itself. If you want to learn more about using it, have a look at these past posts:

Structured Documents in LaTeX

The LaTeX logo, typeset with LaTeX

Continuing with the brief introduction to LaTeX that I posted recently, in this video I discuss the use of LaTeX to produce a document with a structure similar to that of a book, for example. The idea is to build a master file that controls the flow of the document and to keep each “Chapter” in a separate file. This gives the author a lot of flexibility in organising content and makes large documents far more manageable than a single LaTeX file.

Enjoy! Any feedback, comments or suggestions are more than welcome.
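A minimal sketch of the master-file idea is given below. The file names (main.tex, introduction.tex and so on), the document class and the title are assumptions made for illustration; the video may organise things differently.

```latex
% main.tex: the master file (all file names here are hypothetical)
\documentclass[12pt]{report}
\usepackage{amsmath}   % loaded once, available to every chapter

% Optional: compile only the chapter currently being written.
%\includeonly{methods}

\begin{document}

\title{Thesis Title}
\author{Author Name}
\maketitle
\tableofcontents

% Each \include starts a new page and reads <name>.tex
\include{introduction}
\include{methods}
\include{conclusions}

\end{document}
```

Each chapter file then contains only its own content, with no preamble and no document environment:

```latex
% introduction.tex
\chapter{Introduction}
\label{ch:introduction}
Text of the first chapter goes here.
```

Uncommenting the \includeonly line rebuilds a single chapter, while page numbers and cross-references for the other chapters are taken from the auxiliary files of an earlier full run.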

Using LaTeX to write mathematics

I have been meaning to do something like this for a long time and finally got the courage to do it. A lot of times I get completely horrified by the way in which some documents that contain mathematical notations are mangled (quite literally) by using MS Word. It helps sometimes that some people have access to MathType but still…

So, in this video I intend to provide some help to those who are interested in using LaTeX to include mathematics and produce their documents. LaTeX is freely available for various platforms. You can obtain MiKTeX for Windows here, and MacTeX for Mac here. There is a great variety of editors to choose from; in this video I recommend TeXmaker, which I believe provides quite a lot of help to those of us who are still attached to the pointing and clicking of MS Word.

Let me know what you think! Any feedback is always welcome.

iBook Author

So, I just found out about Apple announcing iBooks Author, which according to the information they provide “is an amazing new app that allows anyone to create beautiful Multi-Touch textbooks” and is a free download from the Mac App Store.

Installation was not too slow, considering that perhaps lots of other users were doing exactly the same. I had a quick go at  selecting a template and it really seemed to be quite straightforward to use. It does look like a combination of Pages and Keynote. I will have to play more with it, but something that I did find disappointing was the lack of support to handle mathematics. I am not after LaTeX (I already use that quite a lot), but it would be nice to be able to handle equations natively. I do hope someone at Apple is reading!

 

Pretend you are Hercule Poirot…

Hercule Poirot explains how it all happened
Image by elena-lu via Flickr

I really enjoyed this error message in a LaTeX file that my office mate came across yesterday while preparing some slides. Usually the errors are quite obvious so there is no need to check the console. This time it was something a bit obscure… so much so that LaTeX suggested:

Pretend that you’re Hercule Poirot: Examine all the clues, and deduce the truth by order and method.

Great! Where is my hat and fake moustache???
