Natural Language Processing Talk – Newspaper Article

With the lockdown and social distancing rules forcing all of us to adjust our calendars, events and even lesson plans and lectures, I was not surprised to hear of speaking opportunities that otherwise may not arise.

A great example is the reprise of a talk I gave about a year ago while visiting Mexico. It was a great opportunity to talk to Social Science students at the Political Science Faculty of the Universidad Autónoma del Estado de México. The subject was open but had to cover the use of technology and I thought that talking about the use of natural language processing in terms of digital humanities would be a winner. And it was…

In March this year I was approached by the Faculty to re-run the talk but this time instead of doing it face to face we would use a teleconference room. Not only was I, the speaker, talking from the comfort of my own living room, but also all the attendees would be at home. Furthermore, some of the students may not have access to the live presentation (lack of broadband, equipment, etc) and recoding the session for later usage was the best option for them.

I didn’t hesitate in saying yes, and I enjoyed the interaction a lot. Today I learnt that the session was the focus of a small note in a local newspaper. The session was run in Spanish and the note in Portal, the local newspaper, is in Spanish too. I really liked that they picked a line I used in the session to convince the students that technology is not just for the natural sciences:

“Hay que hacer ciencias sociales con técnicas del Siglo XIX… El mundo es de los geeks.

“We should study social sciences applying techniques of the 21st Century. The world today belongs to us, the geeks.

The point is that although qualitative and quantitative techniques are widely used in social science, the use of new platforms and even programming languages such as python open up opportunities for social scientists too.

The talk is available in the blog the class uses to share their discussions: The Share Knowledge Network – Follow this link for the talk.

The newspaper article by Ximena Barragán can be found here.

Computer Programming Knowledge

I came across the image above in the Slack channel of the University of Hertfordshire Centre for Astrophysics Research. It summarises some of the fundamental knowledge in computer science that was assumed necessary at some point in time: Binar, CPU execution and algorithms.

They refer to 7 algorithms, but actually rather than actual algorithms they are classes:

  1. Sort
  2. Search
  3. Hashing
  4. Dynamic Programming
  5. Binary Exponentiation
  6. String Matching and Parsing
  7. Primality Testing

I like the periodic table shown at the bottom of the graphic. Showing some old friends such as Fortran, C, Basic and Cobol. Some other that are probably not used all that much, and others that have definitely been rising: Javascript, Java, C++, Lisp. It is great to se Python, number 35, listed as Multi-Paradigm!

Enjoy!

Structured Documents in LaTeX

This is a video I made a few years ago to encourage my students to use better tools to write dissertations, thesis and reports that include the use of mathematics. The principles stand, although the tools may have moved on since then. I am reposting them as requested by a colleague of mine, Dr Catarina Carvalho, who I hope will still find this useful.

In this video we continue explaining how to use LaTeX. Here we will see how to use a master document in order to build a thesis or dissertation.
We assume that you have already had a look at the tutorial entitled: LaTeX for writing mathematics – An introduction

Structured Documents in LaTeX

LaTeX for writing mathematics – An introduction

This is a video I made a few years ago to encourage my students to use better tools to write dissertations, thesis and reports that include the use of mathematics. The principles stand, although the tools may have moved on since then. I am reposting them as requested by a colleague of mine, Dr Catarina Carvalho, who I hope will still find this useful.

In this video we explore the LaTeX document preparation system. We start with a explaining an example document. We have made use of TeXmaker as an editor given its flexibility and the fact that it is available for different platforms.

LaTeX for writing mathematics – An introduction

Pandas 1.0 is out

If you are interested in #DataScience you surely have heard of #pandas and you would be pleased to hear that version 1.0 finally out. With better integration with bumpy and improvements with numba among others. Take a look!
— Read on www.anaconda.com/pandas-1-0-is-here/

Natural Language Processing – Talk

Last October I had the great opportunity to come and give a talk at the Facultad de Ciencias Políticas, UAEM, México. The main audience were students of the qualitative analysis methods course, but there were people also from informatics and systems engineering.

It was an opportunity to showcase some of the advances that natural language processing offers to social scientists interested in analysing discourse, from politics through to social interactions.

The talk covered a introduction and brief history of the field. We went through the different stages of the analysis, from reading the data, obtaining tokens and labelling their part of speech (POS) and then looking at syntactic and semantic analysis.

We finished the session with a couple of demos. One looking at speeches of Clinton and Trump during their presidential campaigns; the other one was a simple analysis of a novel in Spanish.

Thanks for the invite.

Python – Pendulum

Working with dates and times in programming can be a painful test at times. In Python, there are some excellent libraries that help with all the pain, and recently I became aware of Pendulum. It is effectively are replacement for the standard datetime class and it has a number of improvements. Check out the documentation for further information.

Installation of the packages is straightforward with pip:

$ pip install pendulum

For example, some simple manipulations involving time zones:

import pendulum

now = pendulum.now('Europe/Paris')

# Changing timezone
now.in_timezone('America/Toronto')

# Default support for common datetime formats
now.to_iso8601_string()

# Shifting
now.add(days=2)

Duration can be used as a replacement for the standard timedelta class:

dur = pendulum.duration(days=15)

# More properties
dur.weeks
dur.hours

# Handy methods
dur.in_hours()
360
dur.in_words(locale='en_us')
'2 weeks 1 day'

It also supports the definition of a period, i.e. a duration that is aware of the DateTime instances that created it. For example:

dt1 = pendulum.now()
dt2 = dt1.add(days=3)

# A period is the difference between 2 instances
period = dt2 - dt1

period.in_weekdays()
period.in_weekend_days()

# A period is iterable
for dt in period:
    print(dt)


Give it a go, and let me know what you think of it. 

File Encoding with the Command Line – Determining and Converting

With the changes that Python 3 has brought to bear in terms of dealing with character encodings, I have written before some tips that I use on my day to day work. It is sometimes useful to determine the character encoding of a files at a much earlier stage. The command line is a perfect tool to help us with these issues. 

The basic syntax you need is the following one:

$ file -I filename

Furthermore, you can even use the command line to convert the encoding of a file into another one. The syntax is as follows:

$ iconv -f encoding_source -t encoding_target filename

For instance if you needed to convert an ISO88592 file called input.txt into UTF8 you can use the following line:

$ iconv -f iso-8859-1 -t utf-8 < input.txt > output.txt

If you want to check a list of know coded characters that you can handle with this command simply type:

$ iconv --list

Et voilà!

 

IEEE Language Rankings 2018

Python retains its top spot in the fifth annual IEEE Spectrum top programming language rankings, and also gains a designation as an “embedded language”. Data science language R remains the only domain-specific slot in the top 10 (where it as listed as an “enterprise language”) and drops one place compared to its 2017 ranking to take the #7 spot.

Looking at other data-oriented languages, Matlab as at #11 (up 3 places), SQL is at #24 (down 1), Julia at #32 (down 1) and SAS at #40 (down 3). Click the screenshot below for an interactive version of the chart where you can also explore the top 50 rankings.

Language Rank

The IEEE Spectrum rankings are based on search, social media, and job listing trends, GitHub repositories, and mentions in journal articles. You can find details on the ranking methodology here, and discussion of the trends behind the 2018 rankings at the link below.

IEEE Spectrum: The 2018 Top Programming Languages