A collection of Data Science and Data Visualisation related posts, pics and thoughts. Take a look and enjoy.
If the mention of "Transformers" brings to mind the adventures of autonomous robots in disguise, you are probably, like me, a child of the 80s. However, a different kind of transformer is what a lot of people in Machine Learning are playing with. In my most recent post in the Domino Data Lab blog I cover what transformers are and how the idea of self-attention works.
Check it out here: https://blog.dominodatalab.com/transformers-self-attention-to-the-rescue
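To give a flavour of the idea, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the code from the post: the function names and the toy vectors are made up for illustration, and the learned query, key and value projections of a real transformer are left out, so each token simply attends to the raw vectors.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Simplifying assumption: a real transformer first multiplies the tokens by
    learned query, key and value matrices; they are omitted in this sketch.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token with every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional "token embeddings".
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = self_attention(toy_tokens)
```

Each row of `attention_weights` says how much one token draws on every other token when building its new representation, which is the core of the self-attention mechanism.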
I received a notification that my "Data Science and Analytics with Python" book is now available in Chinese. Great news and 谢谢!
The Chinese version is available here.
Surely you have heard of many fantastic applications where Deep Learning is being employed. There are a few options for us to implement deep learning methods: TensorFlow, PyTorch, Caffe, Keras and MXNet.
In my recent post in the Domino Data Blog I look at the origins, pros and cons, and considerations for choosing between three of the most popular deep learning frameworks out there. Check it out.
Python and R are not the only programming languages used in data science or machine learning applications. In a recent post in the Domino Data Blog I argue about the usefulness of MATLAB.
Check the post here.
While you are doing that, also check my "Essential MATLAB and Octave" book.
Always wanted to get some data from the web in a programmatic way? Well, check out my recent post in the Domino Data Blog where I discuss how to get data with the help of Beautiful Soup.
The aim is to show how we can create a script that grabs the pages we are interested in and obtains the information we are after. In the post I cover how to complete these steps:
- Identify the webpage with the information we need
- Download the source code
- Identify the elements of the page that hold the information we need
- Extract and clean the information
- Format and save the data for further analysis
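The steps above can be sketched in a few lines. Note that this is not the code from the post: to keep the example self-contained it uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup, and the page content, the `TableCellParser` class and the column names are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Steps 1-2: in practice the page would be downloaded, for example with
# urllib.request.urlopen(url).read(); a made-up snippet stands in for it here.
SAMPLE_PAGE = """
<html><body>
  <table id="books">
    <tr><th>Title</th><th>Year</th></tr>
    <tr><td> Book A </td><td>2017</td></tr>
    <tr><td>Book B</td><td> 2020 </td></tr>
  </table>
</body></html>
"""

class TableCellParser(HTMLParser):
    """Step 3: collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

parser = TableCellParser()
parser.feed(SAMPLE_PAGE)

# Step 4: clean the extracted text.
clean_rows = [[cell.strip() for cell in row] for row in parser.rows]

# Step 5: format as CSV (written to memory here; a file would work the same).
buffer = io.StringIO()
csv.writer(buffer).writerows(clean_rows)
csv_text = buffer.getvalue()
```

With Beautiful Soup the parsing class would typically shrink to a couple of calls for finding the table and its cells, which is a large part of why the library is so popular for this kind of task.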
Are you interested in exploring data using Python? If so, take a look at this blog post of mine, where I talk about using Pandas Profiler and D-Tale to carry out data exploration.
Exploring the data in this way helps to:
- Detect erroneous data.
- Determine how much missing data there is.
- Understand the structure of the data.
- Identify important variables in the data.
- Sense-check the validity of the data.
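As a rough idea of what some of these checks look like when done by hand, here is a minimal sketch using only the standard library. The records, the column names, the "?" placeholder for missing values and the valid range of 1 to 5 are all assumptions made up for this example; the tools mentioned above automate this kind of summary and much more.

```python
import csv
import io
from collections import Counter

# Made-up records for illustration; "?" marks a missing value in this sample.
RAW = """score,age,shape
4,45,1
5,?,3
55,60,?
3,38,2
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Structure: number of records and column names.
n_rows, columns = len(rows), list(rows[0].keys())

# Missing data: count the "?" placeholders per column.
missing = Counter(col for row in rows for col, value in row.items() if value == "?")

# Erroneous data: flag values outside a range we assume to be valid (1 to 5).
suspicious = [row for row in rows
              if row["score"] != "?" and not 1 <= int(row["score"]) <= 5]
```

Here `missing` tells us how many gaps each column has, and `suspicious` holds the record whose score of 55 looks like a data-entry slip.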
I use the Mammographic Mass Data Set from the UCI Machine Learning Repository. Information about this dataset can be obtained here.
Read the full blog post in the Domino Data Blog here.
Hello again, this is a video I recorded for my publisher about my book “Advanced Data Science and Analytics with Python”. You can get the book here and find out more about it here.
This companion to "Data Science and Analytics with Python" is the result of arguments with myself about writing something to cover a few of the areas that were not included in that first volume, largely due to space/time constraints. Like the previous book, this one exists thanks to the discussions, stand-ups, brainstorms and eventual implementations of algorithms and data science projects carried out with many colleagues and friends.
As the title suggests, this book continues to use Python as a tool to train, test and implement machine learning models and algorithms. The book is aimed at data scientists who would like to continue developing their skills and apply them in business and academic settings.
The subjects discussed in this book are complementary and a follow-up to the ones covered in "Data Science and Analytics with Python". The intended audience for this book is still composed of data analysts and early-career data scientists with some experience in programming and with a background in statistical modelling. In this case, however, the expectation is that they have already covered some areas of machine learning and data analytics. Although there are some references to the previous book, this volume is written to be read independently.
I have tried to keep the same tone as in the first book, peppering the pages with some bits and bobs of popular culture, science fiction and indeed Monty Python puns. The aim is still to focus on showing the concepts and ideas behind popular algorithms and their use.
In summary, each of the topics addressed in "Advanced Data Science and Analytics with Python" tackles the data science workflow from a practical perspective, concentrating on the process and results obtained. The material covered includes machine learning and pattern recognition algorithms, including time series analysis, natural language processing, topic modelling, social network analysis, neural networks and deep learning. The book discusses the need to develop data products and addresses the subject of bringing models to their intended audiences, in this case literally to the users’ fingertips in the form of an iPhone app.
I hope you enjoy it and if you want to know more about my other books, please check the related videos here:
The book provides an introduction to some of the most used algorithms in data science and analytics. This book is the result of very interesting discussions, debates and dialogues with a large number of people at various levels of seniority, working at startups as well as long-established businesses, and in a variety of industries, from science to media to finance.
“Data Science and Analytics with Python” is intended to be a companion to data analysts and budding data scientists that have some working experience with both programming and statistical modelling, but who have not necessarily delved into the wonders of data analytics and machine learning. The book uses Python as a tool to implement and exploit some of the most common algorithms used in data science and data analytics today.
Python is a popular and versatile scripting and object-oriented language; it is easy to use and has a large, active community of developers and enthusiasts, all of this helped by the versatility of the IPython/Jupyter Notebook.
In the book I address the balance between the knowledge required by a data scientist, such as mathematics and computer science, and the need for a good business background. To tackle the prevailing image of a unicorn data scientist, I am convinced that the use of a new symbol is needed. And a silly one at that! There is an allegory I usually propose to colleagues and those that talk about the data science unicorn. It seems to me to be a more appropriate one than the existing image: it is still another mythical creature, less common perhaps than the unicorn, but more importantly with some faint fact about its actual existence: a Jackalope. You will have to read the book to find out more!
The main purpose of the book is to present the reader with some of the main concepts used in data science and analytics using tools developed in Python such as Scikit-learn, Pandas, Numpy and others. The book is intended to be a bridge to the data science and analytics world for programmers and developers, as well as graduates in scientific areas such as mathematics, physics, computational biology and engineering, to name a few.
The material covered includes machine learning and pattern recognition, various regression techniques, classification algorithms, decision tree and hierarchical clustering, and dimensionality reduction. This text is not recommended for those just getting started with computer programming.
There are a number of topics that were not covered in this book. If you are interested in more advanced topics take a look at my book called “Advanced Data Science and Analytics with Python”. There is a follow-up video for that one! Keep an eye out for that!
Related Content: Please take a look at other videos about my books:
This is a reblog of a story in ScienceDaily. See the original here.
Underwhelming results underscore the complexity of language evolution while showing promise in some current applications
Researchers have investigated the ability of machine learning algorithms to identify lexical borrowings using word lists from a single language. Results show that current machine learning methods alone are insufficient for borrowing detection, confirming that additional data and expert knowledge are needed to tackle one of historical linguistics' most pressing challenges.
Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced already in Plato's Kratylos dialogue, in which Socrates discusses the challenge imposed by borrowed words on etymological studies. In historical linguistics, lexical borrowings help researchers trace the evolution of modern languages and indicate cultural contact between distinct linguistic groups -- whether recent or ancient. However, the techniques for identifying borrowed words have resisted formalization, demanding that researchers rely on a variety of proxy information and the comparison of multiple languages.
"The automated detection of lexical borrowings is still one of the most difficult tasks we face in computational historical linguistics," says Johann-Mattis List, who led the study.
In the current study, researchers from PUCP and MPI-SHH employed different machine learning techniques to train language models that mimic the way in which linguists identify borrowings when considering only the evidence provided by a single language: if sounds or the ways in which sounds combine to form words are atypical when comparing them with other words in the same language, this often hints at recent borrowings. The models were then applied to a modified version of the World Loanword Database, a catalog of borrowing information for a sample of 40 languages from different language families all over the world, in order to see how accurately words within a given language would be classified as borrowed or not by the different techniques.
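To make the idea of "atypical sound combinations" a little more concrete, here is a toy sketch of a character bigram language model. This is not one of the models used in the study: the word list is invented, the function names are hypothetical, and letters stand in for sounds. The model simply scores how surprising a word is given the letter sequences seen in a word list from a single language.

```python
import math
from collections import Counter

def train_bigram_model(words):
    """Count character bigrams (with start/end markers) over a word list."""
    pair_counts, context_counts, alphabet = Counter(), Counter(), set()
    for word in words:
        symbols = ["<"] + list(word) + [">"]
        alphabet.update(symbols)
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    return pair_counts, context_counts, len(alphabet)

def surprisal(word, model):
    """Average negative log-probability per transition.

    Uses add-one smoothing, with one extra slot for symbols never seen
    in training (an assumption made for this sketch).
    """
    pair_counts, context_counts, vocab_size = model
    symbols = ["<"] + list(word) + [">"]
    total = 0.0
    for left, right in zip(symbols, symbols[1:]):
        prob = (pair_counts[(left, right)] + 1) / (context_counts[left] + vocab_size + 1)
        total -= math.log(prob)
    return total / (len(symbols) - 1)

# Made-up "inherited" vocabulary with a simple consonant-vowel pattern.
known_words = ["mana", "nami", "kana", "mika", "nika", "kami", "mani"]
model = train_bigram_model(known_words)

typical = surprisal("kani", model)     # follows the same sound pattern
atypical = surprisal("strupf", model)  # clusters never seen in training
```

A word with a high surprisal relative to the rest of the list is a candidate borrowing, although, as the study's results suggest, a signal from a single language is often not enough on its own.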
In many cases the results were unsatisfying, suggesting that loanword detection is too difficult for the machine learning methods most commonly used. However, in specific situations, such as in lists with a high proportion of loanwords or in languages whose loanwords come primarily from a single donor language, the team's lexical language models showed some promise.
"After these first experiments with monolingual lexical borrowings, we can proceed to stake out other aspects of the problem, moving into multilingual and cross-linguistic approaches," says John Miller of PUCP, the study's co-lead author.
"Our computer-assisted approach, along with the dataset we are releasing, will shed a new light on the importance of computer-assisted methods for language comparison and historical linguistics," adds Tiago Tresoldi, the study's other co-lead author from MPI-SHH.
The study joins ongoing efforts to tackle one of the most challenging problems in historical linguistics, showing that loanword detection cannot rely on mono-lingual information alone. In the future, the authors hope to develop better-integrated approaches that take multi-lingual information into account.
Using lexical language models to detect borrowings in monolingual wordlists. PLOS ONE, 2020; 15 (12): e0242709 DOI: 10.1371/journal.pone.0242709
This survey paper extracts practical considerations from recent case studies of a variety of ML applications and is organized into sections that correspond to stages of a typical machine learning workflow: from data management and model learning to verification and deployment.
In recent years, machine learning has received increased interest both as an academic research field and as a solution for real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. Our survey shows that practitioners face challenges at each stage of the deployment. The goal of this paper is to lay out a research agenda to explore approaches addressing these challenges.
Right!!! It is early December and this post has been in the inkwell for a few months now. Earlier in the year I received the comments and suggestions from reviewers and the final approval from the excellent team at CRC Press for my 4th book.
After a few weeks of frank procrastination and a few more spent structuring the proposed ideas a bit further, I have got a clear head to start writing. So I am pleased to announce that I am officially starting to write “Statistics and Data Visualisation with #Python”.
"Statistics and Data Visualisation with Python" builds from the ground up the basis for statistical analysis underpinning a number of applications and algorithms in business analytics, machine learning and applied machine learning. The book will cover the basics of programming in Python as well as data analysis to build a solid background in statistical methods and hypothesis testing useful in a variety of modern applications.
Stay tuned!
Now Reading: Dark Data by David Hand
I first came across a mention of this book in the Summer 2020 number of Imperial, the magazine for the Imperial College Community in a feature note about the book.
It sounded like an interesting read and I had a look for the Princeton University Press book and to my surprise I found a version in Italian published by Rizzoli a few months earlier... I wonder how that worked out. It was cheaper and I was tempted to give it a go in Italian with the name Il tradimento dei numeri (i.e. “The betrayal of the numbers”...). I wonder what hidden story is behind all this...
In the end I decided to go for the English version... Let’s see how it goes.
David Hand is emeritus professor of mathematics at Imperial College London, a former president of the Royal Statistical Society, and a Fellow of the British Academy.
There is a website dedicated to the book: https://darkdata.website
I had an opportunity to be one of the panellists in the Data Skeptic podcast recently. It was great to have been invited and, as a listener to the podcast, it was a real treat to be able to take part. Also, recording it was fun...
You can listen to the episode here.
In the episode Kyle talks about the relationship between Covid-19 and Carbon Emissions. George tells us about the new Hateful Memes Challenge from Facebook. Lan joins us to talk about Google's AI Explorables. I talk about a paper that uses neural networks to detect infections in the ear.
Let me know what you guys think!
With the lockdown and social distancing rules forcing all of us to adjust our calendars, events and even lesson plans and lectures, I was not surprised to hear of speaking opportunities that otherwise may not arise.
A great example is the reprise of a talk I gave about a year ago while visiting Mexico. It was a great opportunity to talk to Social Science students at the Political Science Faculty of the Universidad Autónoma del Estado de México. The subject was open but had to cover the use of technology and I thought that talking about the use of natural language processing in terms of digital humanities would be a winner. And it was...
In March this year I was approached by the Faculty to re-run the talk but this time instead of doing it face to face we would use a teleconference room. Not only was I, the speaker, talking from the comfort of my own living room, but also all the attendees would be at home. Furthermore, some of the students may not have access to the live presentation (lack of broadband, equipment, etc.) and recording the session for later use was the best option for them.
I didn’t hesitate in saying yes, and I enjoyed the interaction a lot. Today I learnt that the session was the focus of a small note in a local newspaper. The session was run in Spanish and the note in Portal, the local newspaper, is in Spanish too. I really liked that they picked a line I used in the session to convince the students that technology is not just for the natural sciences:
“Hay que hacer ciencias sociales con técnicas del Siglo XXI... El mundo es de los geeks.”
“We should study social sciences applying techniques of the 21st Century. The world today belongs to us, the geeks.”
The point is that although qualitative and quantitative techniques are widely used in social science, the use of new platforms and even programming languages such as Python opens up opportunities for social scientists too.
The talk is available in the blog the class uses to share their discussions: The Share Knowledge Network - Follow this link for the talk.
The newspaper article by Ximena Barragán can be found here.
It is official! "Advanced Data Science and Analytics with Python" is published!
According to the information I had received from CRC Press, my publisher, the book would be available on May 7th. According to the official page of the book, the volume has been available since May 5th.
Looking forward to hearing what you think of the book.