This article takes a look at what the online Data Science community has been writing about over the last couple of years, with the aim of understanding the trends in technology over the course of 2020. To do this, I’ve sampled roughly 30,000 unique Data Science stories from across Medium between January 2019 and mid-December 2020.
This article is broken into two parts:
Over the last decade or two, cloud computing has come to dominate many of the skills and processes needed to develop ‘modern’ software. This is increasingly true for adjacent fields too, including the world of Data Science (among others). One of the trends in this sweeping move towards ‘The Cloud’ has been the ever-increasing levels of abstraction with respect to how development teams interact with the infrastructure running their applications.
Arguably at the top of this pyramid of abstraction is the concept of serverless computing, which is built on the idea that (as the name suggests) developers need not spend time configuring servers and writing boilerplate app code, and should instead dive straight into writing and deploying the code that ‘really’ drives business value. This can also make it super easy for developers, Data Scientists and others to deploy simple applications and services with little-to-no experience of configuring the infrastructure needed to deploy ‘classic’ web apps. If that sounds like it may be useful to you, then great! …
Deploying software regularly and reliably is hard. Deploying software that utilises Machine Learning (ML) models regularly and reliably can be harder still. At the end of the day, the long-term value of your latest model pipeline will be determined (in part) by how much your company or your customers trust the resulting service, and how quickly you can address changing customer requirements by iterating on your pipeline.
That’s where automation can come in very handy: careful automation of ML pipelines can massively boost your productivity by allowing you to rapidly iterate on a pipeline in order to account for new business logic or modelling changes, while also ensuring those changes meet key performance criteria before going into service with your stakeholders/customers. …
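As a flavour of what that kind of automation can look like, here’s a minimal sketch of a deployment gate: a retrained model is only promoted if it clears a performance criterion and doesn’t regress against the model currently in service. The threshold and function names here are purely illustrative.

```python
ACCURACY_THRESHOLD = 0.85  # illustrative performance criterion


def should_deploy(candidate_score: float, incumbent_score: float) -> bool:
    """Promote the retrained model only if it clears the absolute threshold
    and doesn't regress against the model currently in service."""
    return candidate_score >= ACCURACY_THRESHOLD and candidate_score >= incumbent_score
```

In a real pipeline, a check like this would sit between the evaluation step and the deployment step, failing the run (rather than silently shipping) when the criterion isn’t met.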
In Part 1 of this series, you saw a few practical examples of how Object-Oriented Programming (OOP) can be used to help you resolve some code design problems. If you missed it, it’s over here:
Right, let’s dig in.
The language around OOP can seem intimidating. You’ve seen some of this language in the example in Part 1, but let’s make it a little more concrete. Firstly, let’s start with probably the most basic question: what’s the difference between a class and an …
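The snippet above cuts off, but the distinction it opens with can be made concrete in a few lines: a class is the blueprint, and the objects you create from it are what you actually work with. All the names below are made up purely for illustration.

```python
class Classifier:
    """The class: a blueprint describing data (attributes) and behaviour (methods)."""

    def __init__(self, name: str):
        self.name = name  # each object built from the class gets its own copy

    def describe(self) -> str:
        return f"Classifier: {self.name}"


# `Classifier` is the class; `model_a` and `model_b` are two distinct
# objects built from that one blueprint, each with its own state
model_a = Classifier("sentiment-rf")
model_b = Classifier("spam-filter")
```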
A few months ago, I had an enthusiastic outburst in which I expressed my appreciation for a little package called TQDM for creating progress bars. This post is in the same vein, but this time for Fire: a great package that makes getting a Command Line Interface (CLI) up and running in (literally) a couple of seconds.
So, why write a CLI? Practically, a simple CLI can make configuring a script as simple as changing a couple of command line arguments. Let’s say you’ve got a script set up on an orchestration service (maybe something like Jenkins) that regularly retrains your latest and greatest Tweet sentiment classifier. Let’s say it’s a Scikit-Learn Random Forest. …
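As a sketch of what that might look like with Fire, the hypothetical retraining script below turns a plain function into a CLI, with each parameter becoming a command line flag (the function name, parameters and default path are made up for illustration, and this assumes `pip install fire`):

```python
def retrain(data_path="data/tweets.csv", n_estimators=100, max_depth=None):
    """Hypothetical retraining entry point; each parameter maps to a CLI flag."""
    return (f"Retraining random forest on {data_path} "
            f"with {n_estimators} trees (max_depth={max_depth})")


if __name__ == "__main__":
    import fire

    # Fire builds the CLI from the function signature, so e.g.
    #   python retrain.py --data_path=other.csv --n_estimators=200
    # just works, with zero argument-parsing boilerplate
    fire.Fire(retrain)
```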
If you’re familiar with the Data Science software ecosystem in Python, you’ll likely have come across a handful of widely used dashboarding and data visualization tools designed for programmatic usage (e.g. to be embedded in notebooks or to be served as standalone web-apps). For the last few years, the likes of Dash, Bokeh and Voila have been some of the biggest open-source players in this space. Within the world of R, there’s also the long-standing champion of dashboarding tools: Shiny.
With this relatively mature ecosystem in place, you may question the need for yet another framework to join the pack. But that’s exactly what the team over at Streamlit are doing: introducing a brand new framework for building data applications. What’s more, they’ve created quite a bit of buzz around their project too, so much so that they recently closed a $21M Series A funding round to allow them to continue developing their framework. …
If you’ve been programming for at least a little while, you’ll likely have come across (and perhaps used) Object Oriented Programming (OOP) concepts and language features. This programming paradigm has been a central concept in software engineering since the mid-1990s, and can provide some very powerful capabilities to programmers — especially when used carefully.
However, it isn’t uncommon for many programmers to swirl around concepts like OOP for many years — perhaps gaining the odd bit of insight here and there — but never consolidating that understanding into a clear set of ideas. For beginners too, the concepts of OOP can be a little bewildering, with some guides utilising language-specific OOP implementations to illustrate ideas and many using subtly distinct or overloaded language, all of which can sometimes obfuscate OOP concepts in the more generic sense. …
This post aims to help you get started with building robust, automated ML pipelines (on a budget!) for automatically retraining, tracking and redeploying your models. It covers:
The tutorial section is designed to make use of free (or nearly free) services, so following along should cost you a few pennies at most. If you’re working on an MVP and need some ML infrastructure in place sharpish but want to avoid the price tag and technical overhead of AWS SageMaker or Azure ML deployments, you might find the example useful too. Finally, if you’re interested in understanding how the tutorial fits together to run it end-to-end for yourself, you should check out the previous post in this series on deploying lightweight ML models as serverless functions. …
Deploying machine learning (ML) models into production can sometimes be something of a stumbling block for Data Science (DS) teams. A common mode of deployment is to find somewhere to host your models and expose them via APIs. In practice, this can make it easy for your end users to integrate your model outputs directly into their applications and business processes. Furthermore, if the customer trusts the validity of your outputs and performance of your API, this can drive huge business value: your models can make a direct and lasting impact on the target business problem.
However, if you don’t have access to ongoing technical support in the form of DevOps or MLOps teams, then wading through cloud services to set up load balancers, API gateways, continuous integration and delivery pipelines, security settings etc. can be quite a lot of overhead. Moreover, unless you’re pretty confident with these concepts, delivering (and monitoring) an ML API for which you can guarantee security and performance at scale and thereby engender the trust of your users can be challenging. …
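To make the lightweight alternative concrete, here’s a rough sketch of the kind of serverless function that avoids most of that overhead: an AWS Lambda-style handler wrapping a model behind a tiny JSON API. The model here is a trivial stand-in, and the handler signature simply follows the common Lambda convention; a real deployment would load a serialised classifier at cold start.

```python
import json


def predict_sentiment(text: str) -> str:
    """Trivial stand-in for a real model; a deployed function would load a
    serialised classifier (e.g. with joblib) once, at cold start."""
    return "positive" if "great" in text.lower() else "negative"


def handler(event, context=None):
    """AWS Lambda-style entry point: JSON request body in, JSON prediction out."""
    body = json.loads(event.get("body") or "{}")
    prediction = predict_sentiment(body.get("text", ""))
    return {"statusCode": 200, "body": json.dumps({"sentiment": prediction})}
```

With a function like this behind a managed API gateway, the cloud provider handles the scaling, patching and load balancing that would otherwise need a dedicated DevOps effort.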
You can’t really beat a high-quality little package that makes you more productive.
tqdm is one such package: an easy-to-use, extensible Python library that makes adding simple progress bars to your processes trivial. If you’re a professional Data Scientist or Machine Learning (ML) Engineer, chances are you’ll have used or developed algorithms or data transformations that can take a fair while — perhaps many hours or even days — to complete.
It’s not uncommon for folks to opt to simply print status messages to the console, or in some slightly more sophisticated cases use the (excellent and recommended) built-in logging module. In a lot of cases this is fine. However, if you're running a task with many hundreds of steps (e.g. training epochs), or over a data structure with many millions of elements, these approaches are sometimes a little unclear and verbose, and frankly kind of ugly. Plus, adding little ‘developer experience’ touches to your code (such as progress bars!) …
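As a minimal sketch of the alternative: wrapping any iterable in tqdm gives you a live progress bar with a rate and ETA, without touching the loop body at all (this assumes `pip install tqdm`).

```python
from tqdm import tqdm

# Stand-in for a long-running job: the loop body is unchanged, only the
# iterable is wrapped, and tqdm renders the bar, rate and ETA to stderr
total = 0
for i in tqdm(range(1_000), desc="Summing"):
    total += i
```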