Jon Calder, full-time data scientist and part-time musician and blogger from Derivco Cape Town, talks to us about his recent lecture at the University of Cape Town and provides some insights into the rapidly growing discipline of data science.
Getting to know Jon Calder
I’m an aspiring data scientist (impostor syndrome is common in this line of work) with an MSc in Operations Research from UCT. Over the past few years I worked first as a data analyst in a business intelligence team and then later as part of a cross-functional product team, working on a mixture of product design, business analysis and mathematical modelling. Recently I joined the Yield Analytics team. Our role is to provide the (Platform – Yield) product owners and teams with data-driven insights and analysis of how their product offerings create value for the business. Outside of my involvement in data science at work, I am also a member of the RWeekly team, a track maintainer for R on exercism.io, and an active contributor to open source R projects on GitHub. But despite appearances, I do have other interests besides R :) – among them are music (I play piano/keyboard), sound engineering, and running.
How did you manage to land a role giving a guest lecture at UCT?
After completing my MSc in 2013, my supervisor and I collaborated on two related journal papers which were published in 2015, and though we communicate less frequently these days, we have generally stayed in touch via e-mail or by meeting up for breakfast occasionally. This year he has been involved in teaching one of the modules for UCT’s new Master’s in Data Science program. The module is “Data Science for Industry”, and given that much of the curriculum is made up of new content, he contacted me to ask if I would be willing to lecture on one of the topic areas he was less familiar with and knew that I had expertise in.
What was the main topic of your presentation?
I was asked to present on two of the key modern development tools used in the communication of data science results – R packages and Shiny apps. R packages are the primary vehicle for sharing R code, and in addition to their conventional use for extending the R language with additional features and functionality, they can also accompany academic papers in order to help communicate reproducible research methods, or be used for creating libraries of shared (internal) tooling within organizations. Shiny provides a framework which facilitates rapid creation of reactive web apps with R, allowing rich interactive visualizations, custom dashboards and the like to be built and used for sharing data analysis and insights.
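To give a flavour of the reactive model Jon describes, here is a minimal Shiny app sketch (an illustrative example, not taken from the lecture itself): a UI definition is paired with a server function, and the plot automatically re-renders whenever the slider input changes.

```r
library(shiny)

# UI: a slider input and a plot output
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: renderPlot() is reactive, so the histogram updates
# whenever input$bins changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations", xlab = "Minutes")
  })
}

# shinyApp(ui, server)  # uncomment to launch in an interactive R session
```

Even a toy app like this shows why Shiny lowers the barrier to communicating results: the analyst writes ordinary R code, and the framework handles the web plumbing and reactivity.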
Why is it so relevant in our industry today?
I think anyone who spends time online (particularly on tech-related websites and blogs) will probably already be aware that data science is a rapidly growing discipline due to the proliferation of all kinds of data in industry. Communication is one of the key challenges that data scientists face, especially when transitioning from the world of academia to that of industry. In computing terms, R is viewed by many (alongside Python) as the lingua franca of statistics and data science, and tends to be the most commonly taught programming language in the academic domain. As a result, development tooling (around R packages) and frameworks such as Shiny are hugely important since they help to address (at least from a technical perspective) some of the gaps between performing analysis (data computation) and communicating both methodologies and results.
What piece of advice would you give to someone trying to get into the field that you are currently in?
Data science as a field is still relatively immature, and very broad in terms of what it encompasses. Regardless of whether you’re just entering the workforce or moving into data science from another profession, the demands are both increasingly specialized and increasingly varied – often unrealistically so. It’s not uncommon to see job listings looking for experts in statistics, machine learning, distributed computing, and visualization, with experience working with both SQL and NoSQL databases, who are proficient in R and/or Python and also familiar with at least one other compiled language such as C++, Java or C#. Oh, and who also have good business sense and excellent verbal and written communication skills.
Even with a lifetime of experience it’s hard to tick all of those boxes properly. My advice would be to focus on developing deep skills in one or two key areas first (i.e. specialize to some extent), and then start to branch out and increase your knowledge and expertise in other areas. It’s valuable to have a working knowledge of a wide range of things, but it’s unrealistic and unproductive to try and learn all of these different things simultaneously. I tend to think of data scientists as either good statisticians who also know a fair amount of computer science, or good computer scientists who also know a reasonable amount of statistics. So if you’re still studying I would probably recommend focusing your attention on at least one of those subject areas. With that said, many established industry data scientists come from a diverse range of academic (or non-academic) backgrounds so there’s no set recipe, but a scientific degree in an applied, math-related subject area such as statistics, computer science, biostatistics or physics will probably provide the best foundation on which to build.
The other thing I would say is that you should never stop learning – and since data science is not a traditional field, some of the best ways to pick up data science skills and knowledge are non-traditional forms of learning. Attend conferences, do online courses, read blogs and tutorials, and look for open source projects to contribute to. These are great (and fun) ways to learn, especially since the flexibility of these mediums generally allows them to keep up with modern trends and tooling more effectively than dedicated learning institutions.