Random thoughts about random subjects… From science to literature and between manga and watercolours, passing by data science and rugby; including film, physics and fiction, programming, pictures and puns.
This is a video I made for my publisher about my book “Data Science and Analytics with Python”. You can get the book here and more about the book here.
The book provides an introduction to some of the most used algorithms in data science and analytics. This book is the result of very interesting discussions, debates and dialogues with a large number of people at various levels of seniority, working at startups as well as long-established businesses, and in a variety of industries, from science to media to finance.
“Data Science and Analytics with Python” is intended to be a companion to data analysts and budding data scientists who have some working experience with both programming and statistical modelling, but who have not necessarily delved into the wonders of data analytics and machine learning. The book uses Python as a tool to implement and exploit some of the most common algorithms used in data science and data analytics today.
Python is a popular and versatile scripting and object-oriented language; it is easy to use and has a large, active community of developers and enthusiasts, not to mention its richness, all of this helped by the versatility of the iPython/Jupyter Notebook.
In the book I address the balance between the knowledge required by a data scientist, such as mathematics and computer science, and the need for a good business background. To tackle the prevailing image of a unicorn data scientist, I am convinced that the use of a new symbol is needed. And a silly one at that! There is an allegory I usually propose to colleagues and those who talk about the data science Unicorn. It seems to me to be a more appropriate one than the existing image: It is still another mythical creature, less common perhaps than the unicorn, but more importantly with some faint fact about its actual existence: a Jackalope. You will have to read the book to find out more!
The main purpose of the book is to present the reader with some of the main concepts used in data science and analytics using tools developed in Python such as Scikit-learn, Pandas, Numpy and others. The book is intended to be a bridge to the data science and analytics world for programmers and developers, as well as graduates in scientific areas such as mathematics, physics, computational biology and engineering, to name a few.
The material covered includes machine learning and pattern recognition, various regression techniques, classification algorithms, decision tree and hierarchical clustering, and dimensionality reduction. Though this text is not recommended for those just getting started with computer programming,
There are a number of topics that were not covered in this book. If you are interested in more advanced topics, take a look at my book called “Advanced Data Science and Analytics with Python”. There is a follow-up video for that one! Keep an eye out for that!
This is a reblog of a story in ScienceDaily. See the original here.
Underwhelming results underscore the complexity of language evolution while showing promise in some current applications
Researchers have investigated the ability of machine learning algorithms to identify lexical borrowings using word lists from a single language. Results show that current machine learning methods alone are insufficient for borrowing detection, confirming that additional data and expert knowledge are needed to tackle one of historical linguistics’ most pressing challenges.
Lexical borrowing, or the direct transfer of words from one language to another, has interested scholars for millennia, as evidenced already in Plato’s Kratylos dialogue, in which Socrates discusses the challenge imposed by borrowed words on etymological studies. In historical linguistics, lexical borrowings help researchers trace the evolution of modern languages and indicate cultural contact between distinct linguistic groups — whether recent or ancient. However, the techniques for identifying borrowed words have resisted formalization, demanding that researchers rely on a variety of proxy information and the comparison of multiple languages.
“The automated detection of lexical borrowings is still one of the most difficult tasks we face in computational historical linguistics,” says Johann-Mattis List, who led the study.
In the current study, researchers from PUCP and MPI-SHH employed different machine learning techniques to train language models that mimic the way in which linguists identify borrowings when considering only the evidence provided by a single language: if sounds or the ways in which sounds combine to form words are atypical when comparing them with other words in the same language, this often hints at recent borrowings. The models were then applied to a modified version of the World Loanword Database, a catalog of borrowing information for a sample of 40 languages from different language families all over the world, in order to see how accurately words within a given language would be classified as borrowed or not by the different techniques.
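To make the "atypical sound combinations" idea a bit more concrete, here is a minimal sketch of my own. It is not the models used in the study, which the story does not describe in detail; it simply swaps in a character-bigram model with add-one smoothing, trained on a made-up toy word list, and scores a word by its average surprisal. The word list, the function names and the boundary markers are all assumptions for illustration.

```python
import math
from collections import Counter

BOUNDARY_START, BOUNDARY_END = "^", "$"  # hypothetical word-boundary markers


def bigrams(word):
    # Pad the word so word-initial and word-final patterns are counted too.
    padded = BOUNDARY_START + word + BOUNDARY_END
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Count character bigrams and their left-hand contexts."""
    pair_counts = Counter()
    context_counts = Counter()
    symbols = set()
    for word in words:
        for first, second in bigrams(word):
            pair_counts[first + second] += 1
            context_counts[first] += 1
            symbols.add(second)
    # The +1 leaves room for symbols never seen in training.
    return pair_counts, context_counts, len(symbols) + 1


def surprisal(word, model):
    """Average bits per bigram; higher means less typical for the word list."""
    pair_counts, context_counts, vocab_size = model
    grams = bigrams(word)
    total = 0.0
    for gram in grams:
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        prob = (pair_counts[gram] + 1) / (context_counts[gram[0]] + vocab_size)
        total -= math.log2(prob)
    return total / len(grams)


# Invented toy "language", for illustration only.
toy_wordlist = ["kala", "kana", "lako", "naka", "kola", "lana", "nako", "kalo"]
model = train(toy_wordlist)

typical_score = surprisal("laka", model)    # built from familiar patterns
atypical_score = surprisal("strix", model)  # sounds unseen in the word list
```

A word with a high score relative to the rest of the list would be flagged as a borrowing candidate. As the results in the story suggest, a signal like this is weak on its own, which is why I would treat it as a starting point rather than a detector.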
In many cases the results were unsatisfying, suggesting that loanword detection is too difficult for the most commonly used machine learning methods. However, in specific situations, such as in lists with a high proportion of loanwords or in languages whose loanwords come primarily from a single donor language, the team’s lexical language models showed some promise.
“After these first experiments with monolingual lexical borrowings, we can proceed to stake out other aspects of the problem, moving into multilingual and cross-linguistic approaches,” says John Miller of PUCP, the study’s co-lead author.
“Our computer-assisted approach, along with the dataset we are releasing, will shed a new light on the importance of computer-assisted methods for language comparison and historical linguistics,” adds Tiago Tresoldi, the study’s other co-lead author from MPI-SHH.
The study joins ongoing efforts to tackle one of the most challenging problems in historical linguistics, showing that loanword detection cannot rely on monolingual information alone. In the future, the authors hope to develop better-integrated approaches that take multilingual information into account.
Using lexical language models to detect borrowings in monolingual wordlists. PLOS ONE, 2020; 15 (12): e0242709 DOI: 10.1371/journal.pone.0242709
This is a reblog of the post by Hayley Dunning in the Imperial College website. See the original here.
Researchers have used AI to control beams for the next generation of smaller, cheaper accelerators for research, medical and industrial applications.
Experiments led by Imperial College London researchers, using the Science and Technology Facilities Council’s Central Laser Facility (CLF), showed that an algorithm was able to tune the complex parameters involved in controlling the next generation of plasma-based particle accelerators.
The algorithm was able to optimize the accelerator much more quickly than a human operator, and could even outperform experiments on similar laser systems.
These accelerators focus the energy of the world’s most powerful lasers down to a spot the size of a skin cell, producing electrons and x-rays with equipment a fraction of the size of conventional accelerators.
The electrons and x-rays can be used for scientific research, such as probing the atomic structure of materials; in industrial applications, such as for producing consumer electronics and vulcanised rubber for car tyres; and could also be used in medical applications, such as cancer treatments and medical imaging.
Several facilities using these new accelerators are in various stages of planning and construction around the world, including the CLF’s Extreme Photonics Applications Centre (EPAC) in the UK, and the new discovery could help them work at their best in the future. The results are published today in Nature Communications.
First author Dr Rob Shalloo, who completed the work at Imperial and is now at the accelerator centre DESY, said: “The techniques we have developed will be instrumental in getting the most out of a new generation of advanced plasma accelerator facilities under construction within the UK and worldwide.
“Plasma accelerator technology provides uniquely short bursts of electrons and x-rays, which are already finding uses in many areas of scientific study. With our developments, we hope to broaden accessibility to these compact accelerators, allowing scientists in other disciplines and those wishing to use these machines for applications, to benefit from the technology without being an expert in plasma accelerators.”
First of its kind
The team worked with laser wakefield accelerators. These combine the world’s most powerful lasers with a source of plasma (ionised gas) to create concentrated beams of electrons and x-rays. Traditional accelerators need hundreds of metres to kilometres to accelerate electrons, but wakefield accelerators can manage the same acceleration within the space of millimetres, drastically reducing the size and cost of the equipment.
However, because wakefield accelerators operate in the extreme conditions created when lasers are combined with plasma, they can be difficult to control and optimise to get the best performance. In wakefield acceleration, an ultrashort laser pulse is driven into plasma, creating a wave that is used to accelerate electrons. Both the laser and plasma have several parameters that can be tweaked to control the interaction, such as the shape and intensity of the laser pulse, or the density and length of the plasma.
While a human operator can tweak these parameters, it is difficult to know how to optimise so many parameters at once. Instead, the team turned to artificial intelligence, creating a machine learning algorithm to optimise the performance of the accelerator.
The algorithm set up to six parameters controlling the laser and plasma, fired the laser, analysed the data, and re-set the parameters, performing this loop many times in succession until the optimal parameter configuration was reached.
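The set, fire, analyse and re-set loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: the article does not name the algorithm, so I have swapped in a plain random-perturbation hill climb, and `fire_shot` is a hypothetical stand-in that scores a toy function instead of measuring a real beam. The sketch also stops after a fixed number of shots rather than detecting an optimum.

```python
import random

N_PARAMS = 6  # the article mentions up to six laser and plasma parameters


def fire_shot(params):
    """Hypothetical stand-in for one laser shot, returning a quality score.

    A real system would analyse the measured electron or x-ray beam; this
    toy function simply peaks when every parameter equals 0.5.
    """
    return -sum((p - 0.5) ** 2 for p in params)


def optimise(n_shots=500, step=0.1, seed=0):
    rng = random.Random(seed)
    best = [rng.random() for _ in range(N_PARAMS)]  # set the parameters
    initial_score = best_score = fire_shot(best)    # fire and analyse
    for _ in range(n_shots):
        # Re-set: perturb the best settings so far, clipped to [0, 1].
        trial = [min(1.0, max(0.0, p + rng.gauss(0, step))) for p in best]
        score = fire_shot(trial)
        if score > best_score:  # keep a trial only if it improves the score
            best, best_score = trial, score
    return best, best_score, initial_score


best_params, best_score, initial_score = optimise()
```

The point of the sketch is the shape of the loop, not the search strategy: every iteration costs one shot, so a smarter choice of the next trial point is what makes an approach like this faster than a human operator.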
Lead researcher Dr Matthew Streeter, who completed the work at Imperial and is now at Queen’s University Belfast, said: “Our work resulted in an autonomous plasma accelerator, the first of its kind. As well as allowing us to efficiently optimise the accelerator, it also simplifies their operation and allows us to spend more of our efforts on exploring the fundamental physics behind these extreme machines.”
Future designs and further improvements
The team demonstrated their technique using the Gemini laser system at the CLF, and have already begun to use it in further experiments to probe the atomic structure of materials in extreme conditions and in studying antimatter and quantum physics.
The data gathered during the optimisation process also provided new insight into the dynamics of the laser-plasma interaction inside the accelerator, potentially informing future designs to further improve accelerator performance.
The experiment was led by Imperial College London researchers with a team of collaborators from the Science and Technology Facilities Council (STFC), the York Plasma Institute, the University of Michigan, the University of Oxford and the Deutsches Elektronen-Synchrotron (DESY). It was funded by the UK’s STFC, the EU Horizon 2020 research and innovation programme, the US National Science Foundation and the UK’s Engineering and Physical Sciences Research Council.
This is a reblog of the article by Cian Hughes and Sumit Agrawal in ENTNews. See the original here.
Machine learning in healthcare
Over the last five years there have been significant advances in high performance computing that have led to enormous scientific breakthroughs in the field of machine learning (a form of artificial intelligence), especially with regard to image processing and data analysis.
These breakthroughs now affect multiple aspects of our lives, from the way our phone sorts and recognises photographs, to automated translation and transcription services, and have the potential to revolutionise the practice of medicine.
The most promising form of artificial intelligence used in medical applications today is deep learning. Deep learning is a type of machine learning in which deep neural networks are trained to identify patterns in data [1]. A common form of neural network used in image processing is a convolutional neural network (CNN). Initially developed for general-purpose visual recognition, it has shown considerable promise in, for instance, the detection and classification of disease on medical imaging.
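For readers who have not met a CNN before, the operation it is named after is small enough to write out by hand. The sketch below is my own illustration, not anything from the article: it slides a hand-picked kernel over a tiny made-up image, whereas a real network learns its kernels from data and stacks many such layers. The function name and the example values are assumptions.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a CNN layer."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            # Weighted sum of the image patch under the kernel.
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1]]  # responds where brightness rises from left to right

response = convolve2d(image, edge_kernel)
# response == [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The response is non-zero only along the vertical edge, which is the kind of local pattern the early layers of a network tend to pick up before later layers combine them into larger structures.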
Automated image segmentation has numerous clinical applications, ranging from quantitative measurement of tissue volume, through surgical planning/guidance, medical education and even cancer treatment planning. It is hoped that such advances in automated data analysis will help in the delivery of more timely care, and alleviate workforce shortages in areas such as breast cancer screening [2], where patient demand for screening already outstrips the availability of specialist breast radiologists in many parts of the world.
Applications in otolaryngology
Artificial intelligence is quickly making its way into [our] specialty. Both otolaryngologists and audiologists will soon be incorporating this technology into their clinical practices. Machine learning has been used to automatically classify auditory brainstem responses and estimate audiometric thresholds. This has allowed for accurate online testing, which could be used for rural and remote areas without access to standard audiometry (see the article by Dr Matthew Bromwich here).
Machine learning algorithms have also been central to the development of multiple assistive technologies that can help patients to overcome or alleviate disabilities. For example, in the context of hearing loss, significant advances in automated transcription apps, driven by machine learning algorithms, have proven particularly useful in recent months for patients who find themselves unable to lipread due to the use of face coverings to prevent the spread of COVID-19.
In addition to their role in general image classification, CNNs are likely to play a significant role in the introduction of machine learning in healthcare, especially in image-heavy specialties such as otolaryngology. For otologists, deep learning algorithms can already identify detailed temporal bone structures from CT images [3-6], segment intracochlear anatomy [7], and identify individual cochlear implant electrodes [8] (Figure 1); automatic analysis of critical structures on temporal bone scans has already facilitated patient-specific virtual reality otologic surgery [9] (Figure 2). Deep learning will likely also be critical in customised cochlear implant programming in the future.
Convolutional neural networks have also been used in rhinology to automatically delineate critical anatomy and quantify sinus opacification [10-12]. Deep learning networks have been used in head and neck oncology to automatically segment anatomic structures to accelerate radiotherapy planning [13-18]. For laryngologists, voice analysis software will likely incorporate machine learning classifiers to identify pathology, as they have been shown to perform better than traditional rule-based algorithms [19].
In summary, artificial intelligence and, in particular, deep learning algorithms will radically change the way we manage patients within our careers. Although developed in high-resource settings, the technology has equally significant applications in low-resource settings to facilitate quality care even in the presence of limited human resources.
1. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35:1798-828.
2. McKinney SM, Sieniek M, Shetty S. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89-94.
3. Heutink F, Kock V, Verbist B, et al. Multi-Scale deep learning framework for cochlea localization, segmentation and analysis on clinical ultra-high-resolution CT images. Comput Methods Programs Biomed 2020;191:105387.
4. Fauser J, Stenin I, Bauer M, et al. Toward an automatic preoperative pipeline for image-guided temporal bone surgery. Int J Comput Assist Radiol Surg 2019;14(6):967-76.
5. Zhang D, Wang J, Noble JH, et al. Deep convolutional neural networks for accurate classification and multi-landmark localization of head CTs. Med Image Anal 2020;61:101659.
6. Nikan S, van Osch K, Bartling M, et al. PWD-3DNet: A deep learning-based fully-automated segmentation of multiple structures on temporal bone CT scans. Submitted to IEEE Trans Image Process.
7. Wang J, Noble JH, Dawant BM. Metal Artifact Reduction and Intra Cochlear Anatomy Segmentation in CT Images of the Ear With A Multi-Resolution Multi-Task 3D Network. IEEE 17th International Symposium on Biomedical Imaging (ISBI) 2020;596-9.
8. Chi Y, Wang J, Zhao Y, et al. A Deep-Learning-Based Method for the Localization of Cochlear Implant Electrodes in CT Images. IEEE 16th International Symposium on Biomedical Imaging (ISBI) 2019;1141-5.
9. Compton EC, et al. Assessment of a virtual reality temporal bone surgical simulator: a national face and content validity study. J Otolaryngol Head Neck Surg 2020;49:17.
10. Laura CO, Hofmann P, Drechsler K, Wesarg S. Automatic Detection of the Nasal Cavities and Paranasal Sinuses Using Deep Neural Networks. IEEE 16th International Symposium on Biomedical Imaging (ISBI) 2019;1154-7.
11. Iwamoto Y, Xiong K, Kitamura T, et al. Automatic Segmentation of the Paranasal Sinus from Computer Tomography Images Using a Probabilistic Atlas and a Fully Convolutional Network. Conf Proc IEEE Eng Med Biol Soc 2019;2789-92.
12. Humphries SM, Centeno JP, Notary AM, et al. Volumetric assessment of paranasal sinus opacification on computed tomography can be automated using a convolutional neural network. Int Forum Allergy Rhinol 2020.
13. Nikolov S, Blackwell S, Mendes R, et al. Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy. arXiv [cs.CV] 2018.
14. Tong N, Gou S, Yang S, et al. Fully automatic multi-organ segmentation for head and neck cancer radiotherapy using shape representation model constrained fully convolutional neural networks. Med Phys 2018;45:4558-67.
15. Ibragimov B, Xing L. Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Med Phys 2017;44:547-57.
16. Vrtovec T, Močnik D, Strojan P, et al. Auto-segmentation of organs at risk for head and neck radiotherapy planning: from atlas-based to deep learning methods. Med Phys 2020.
17. Zhu W, Huang Y, Zeng L, et al. AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Med Phys 2019;46(2):576-89.
18. Tong N, Gou S, Yang S, et al. Shape constrained fully convolutional DenseNet with adversarial training for multiorgan segmentation on head and neck CT and low-field MR images. Med Phys 2019;46:2669-82.
19. Cesari U, De Pietro G, Marciano E, et al. Voice Disorder Detection via an m-Health System: Design and Results of a Clinical Study to Evaluate Vox4Health. Biomed Res Int 2018;8193694.
20. Bosch WR, Straube WL, Matthews JW, Purdy JA. Data From Head-Neck_Cetuximab 2015.
21. Clark K, Vendt B, Smith K, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26:1045-5.
The paper is an analysis of the 10-K and 10-Q filings that American public companies are obliged to file with the Securities and Exchange Commission (SEC). The 10-K is a version of a company’s annual report, but without the glossy photos and PR hype: a corporate nerd’s delight. It has, says one guide, “the-everything-and-the-kitchen-sink data you can spend hours going through – everything from the geographic source of revenue to the maturity schedule of bonds the company has issued”. Some investors and commentators (yours truly included) find the 10-K impenetrable, but for those who possess the requisite stamina (big companies can have 10-Ks that run to several hundred pages), that’s the kind of thing they like. The 10-Q filing is the 10-K’s quarterly little brother.
The observation that triggered the research reported in the paper was that “mechanical” (ie machine-generated) downloads of corporate 10-K and 10-Q filings increased from 360,861 in 2003 to about 165m in 2016, when 78% of all downloads appear to have been triggered by request from a computer. A good deal of research in AI now goes into assessing how good computers are at extracting actionable meaning from such a tsunami of data. There’s a lot riding on this, because the output of machine-read reports is the feedstock that can drive algorithmic traders, robot investment advisers, and quantitative analysts of all stripes.
The NBER researchers, however, looked at the supply side of the tsunami – how companies have adjusted their language and reporting in order to achieve maximum impact with algorithms that are reading their corporate disclosures. And what they found is instructive for anyone wondering what life in an algorithmically dominated future might be like.
The researchers found that “increasing machine and AI readership … motivates firms to prepare filings that are more friendly to machine parsing and processing”. So far, so predictable. But there’s more: “firms with high expected machine downloads manage textual sentiment and audio emotion in ways catered to machine and AI readers”.
In other words, machine readability – measured in terms of how easily the information can be parsed and processed by an algorithm – has become an important factor in composing company reports. So a table in a report might have a low readability score because its formatting makes it difficult for a machine to recognise it as a table; but the same table could receive a high readability score if it made effective use of tagging.
The researchers contend, though, that companies are now going beyond machine readability to try and adjust the sentiment and tone of their reports in ways that might induce algorithmic “readers” to draw favourable conclusions about the content. They do so by avoiding words that are listed as negative in the criteria given to text-reading algorithms. And they are also adjusting the tones of voice used in the standard quarterly conference calls with analysts, because they suspect those on the other end of the call are using voice analysis software to identify vocal patterns and emotions in their commentary.
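The word-avoidance trick is easy to picture with a toy scorer. What follows is a rough sketch of my own, not the method from the paper: it assumes the "reader" simply measures the share of words that appear on a negative-word list, and the five-word list and both example sentences are invented for illustration.

```python
import re

# Hypothetical negative-word list, for illustration only; real text-reading
# tools rely on much larger dictionaries.
NEGATIVE_WORDS = {"loss", "decline", "litigation", "impairment", "adverse"}


def negative_share(text):
    """Fraction of words in the text that appear on the negative list."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in NEGATIVE_WORDS for word in words) / len(words)


original = "The decline in revenue led to a loss and an impairment charge."
reworded = "Lower revenue led to a shortfall and a write-down charge."

original_score = negative_share(original)  # 3 of 12 words are on the list
reworded_score = negative_share(reworded)  # none of the words are on the list
```

Both sentences report the same bad news, yet only the first one registers with a scorer like this. That gap between what is said and what the algorithm counts is exactly what makes the rewording worthwhile for a company.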
In one sense, this kind of arms race is predictable in any human activity where a market edge may be acquired by whoever has better technology. It’s a bit like the war between Google and the so-called “optimisers” who try to figure out how to game the latest version of the search engine’s page ranking algorithm. But at another level, it’s an example of how we are being changed by digital technology – as Brett Frischmann and Evan Selinger argued in their sobering book Re-Engineering Humanity.
After I’d typed that last sentence, I went looking for publication information on the book and found myself trying to log in to a site that, before it would admit me, demanded that I solve a visual puzzle: on an image of a road junction divided into 8 x 4 squares I had to click on all squares that showed traffic lights. I did so, and was immediately presented with another, similar puzzle, which I also dutifully solved, like an obedient monkey in a lab.
And the purpose of this absurd challenge? To convince the computer hosting the site that I was not a robot. It was an inverted Turing test in other words: instead of a machine trying to fool a human into thinking that it was human, I was called upon to convince a computer that I was a human. I was being re-engineered. The road to the future has taken a funny turn.
This survey paper extracts practical considerations from recent case studies of a variety of ML applications and is organized into sections that correspond to stages of a typical machine learning workflow: from data management and model learning to verification and deployment.
In recent years, machine learning has received increased interest both as an academic research field and as a solution for real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. Our survey shows that practitioners face challenges at each stage of the deployment. The goal of this paper is to lay out a research agenda to explore approaches addressing these challenges.
I had an opportunity to be one of the panellists in the Data Skeptic podcast recently. It was great to have been invited, and as a listener to the podcast it was a real treat to be able to take part. Also, recording it was fun…
In the episode Kyle talks about the relationship between Covid-19 and Carbon Emissions. George tells us about the new Hateful Memes Challenge from Facebook. Lan joins us to talk about Google’s AI Explorables. I talk about a paper that uses neural networks to detect infections in the ear.
There you go, the first checkpoint is completed: I have officially submitted the completed version of “Advanced Data Science and Analytics with Python”.
The book has been some time in the making (and in the thinking…). It is a follow-up to my previous book, imaginatively called “Data Science and Analytics with Python”. The book covers aspects that were necessarily left out in the previous volume; however, the readers in mind are still technical people interested in moving into the data science and analytics world. I have tried to keep the same tone as in the first book, peppering the pages with some bits and bobs of popular culture, science fiction and indeed Monty Python puns.
Advanced Data Science and Analytics with Python enables data scientists to continue developing their skills and apply them in business as well as academic settings. The subjects discussed in this book are complementary to, and a follow-up from, the topics discussed in Data Science and Analytics with Python. The aim is to cover important advanced areas in data science using tools developed in Python such as SciKit-learn, Pandas, Numpy, Beautiful Soup, NLTK, NetworkX and others. The development is also supported by the use of frameworks such as Keras, TensorFlow and Core ML, as well as Swift for the development of iOS and macOS applications.
The book can be read independently from the previous volume, and each of the chapters in this volume is sufficiently independent from the others, providing flexibility for the reader. Each of the topics addressed in the book tackles the data science workflow from a practical perspective, concentrating on the process and results obtained. The implementation and deployment of trained models are central to the book.
Time series analysis, natural language processing, topic modelling, social network analysis, neural networks and deep learning are comprehensively covered in the book. The book discusses the need to develop data products and tackles the subject of bringing models to their intended audiences, in this case literally to the users’ fingertips in the form of an iPhone app.
While the book is still in the oven, you may want to take a look at the first volume. You can get your copy here:
It was a pleasure to come to the opening day of ODSC Europe 2019. This time round I was the first speaker of the first session, and it was very apt as the talk was effectively an introduction to Data Science.
The next 4 days will be very hectic for the attendees, and if the quality is similar to the previous editions we are going to have a great time.