Who should read this?
If you are a bench scientist with a PhD in a ‘non-quantitative’ field like Biology, Life Sciences, Microbiology, Sociology or even Psychology, have you ever wondered whether Data Science is a career option for you? If yes, then read on! You have probably heard that Data Scientist is the fanciest job title of 2018 and that a lot of PhDs are moving into the field. But the real question is: how do I actually go about training myself so that I can get my first break in the industry as a ‘Data Scientist’?
Your questions are genuine, and while everything that you have heard about being a Data Scientist is probably true, there are a few things to keep in mind before you start your journey.
Transitioning into the field of Data Science will be extremely challenging and will involve a major shift in your thought process. It will also require regular learning and exploring.
As the field is continuously evolving, there is constant pressure to keep up with innovations and stay up to date with upcoming technology. I don’t want to scare you away from this exhilarating journey, but I want you to have a realistic overview of what you are getting into. At times all this can be very frustrating; nevertheless, once things start to work out, the rewards are plentiful. Apart from all of the above, the very first question that you should ask yourself is whether you are genuinely interested in the field. This probably sounds like the first year of your PhD, when you had to choose a topic in which you were genuinely interested, and that’s quite right: we all did that, and we have been successful to varying degrees in our scientific endeavors only because we chose a topic that was genuinely interesting to us. Similarly, to be a successful Data Scientist you have to understand the nature of the field in depth and develop a strong and sustainable interest, which will drive you to succeed in Data Science.
A good way to figure out if you are really going to like Data Science is by trying it out. Give yourself some time, get your hands dirty and then decide if this appeals to you. There is an immense amount of online material and toolkits available to practice and learn Data Science tools and techniques. These can help you develop your skills very well, but it is equally possible to get lost in the process and not achieve the aims you set for yourself. So, in order to use these resources productively, it is good to have a plan. I am here to discuss the plan that I followed, and that you can also follow, to develop your skills as a Data Scientist.
Expanding your skills
While training for your advanced degree, you develop a lot of soft skills in data analysis and problem-solving which can be valuable to a Data Scientist. These skills range from identifying a problem and making a workable strategy for solving it, to comprehending huge amounts of information and connecting sparse information to present an interesting story. Alongside these, you also develop interpersonal and leadership skills as well as strong expertise in communication. To be a Data Scientist, however, you need to leverage this expertise and build on it with some essential tools and techniques like Programming, Statistics and Machine Learning.
Developing Programming skills
Every data scientist needs a go-to language which they can use for everything they want to do. This can be R, Python or even SAS. The first step is to choose the language that you would like to learn. You can decide based on factors like your familiarity with any of these languages, ease of learning for newbies, the domain-specific job scenario, data handling capabilities, and machine learning and deep learning support. Once you have decided which language you want to learn, start by taking some introductory online courses for it from Datacamp, Dataquest, edx, Udemy, etc. The next step is to get more comfortable with the platform and move on to some intermediate and advanced programming courses.
Other than your go-to language, it is always good to have some familiarity with other commonly used platforms. It would be nice to have some understanding of how SQL queries are run, how to use Git, how to do some basic data visualizations like plotting a bar chart or scatter plot, and how databases work. It is crucial that you have a GitHub account to gain visibility and show off some of the code you have written. This list can vary between domains, so look into the specific requirements of yours and hone your skills accordingly.
At this stage you should be able to handle different types of data through data mining and perform some basic data cleaning and data wrangling on a sample dataset. Try out some very basic things like importing different types of datasets, removing or replacing missing values and merging datasets, and try to answer some very basic questions, such as which three values are the most prevalent in the data.
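The basic steps above (import, replace missing values, merge, find the top three) can be sketched in a few lines. This is only a minimal illustration that uses Python's standard library; the two tiny datasets, their column names and the choice to fill gaps with the mean are all made up for the example, and in practice many people would reach for a data-frame library instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice these would be files on disk.
samples_csv = """sample_id,condition,response
s1,control,21.5
s2,drug_a,
s3,control,19.0
s4,drug_b,20.4
s5,drug_a,
s6,control,22.1
"""
sites_csv = """sample_id,site
s1,lab_a
s2,lab_b
s3,lab_a
s4,lab_c
s5,lab_b
s6,lab_a
"""

# Import: parse each CSV into rows keyed by column name.
samples = list(csv.DictReader(io.StringIO(samples_csv)))
sites = {row["sample_id"]: row["site"]
         for row in csv.DictReader(io.StringIO(sites_csv))}

# Clean: replace missing responses with the mean of the observed ones
# (one simple strategy among many, chosen here only for illustration).
observed = [float(r["response"]) for r in samples if r["response"]]
mean_response = sum(observed) / len(observed)
for r in samples:
    r["response"] = float(r["response"]) if r["response"] else mean_response

# Merge: attach the site to each sample using the shared sample_id.
merged = [{**r, "site": sites.get(r["sample_id"])} for r in samples]

# Question: which three conditions are the most prevalent?
top_three = Counter(r["condition"] for r in merged).most_common(3)
print(top_three)
```

Each step is deliberately small, so you can swap in your own dataset and see where it breaks.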
Statistics and Math
To make any meaningful sense of the data, you should perform some basic inferential statistics and hypothesis testing. Most of us are familiar with hypothesis testing and with performing statistical tests to check whether the performance of a test group differs significantly from that of the control group. So now you just have to build upon that idea and become more familiar with thinking probabilistically to make sense of both discrete and continuous variables. You also need to learn the tools that let you apply statistical thinking to your sample dataset within your programming language. Some of the resources that can help you build upon your statistical knowledge are the Statistics and Probability courses by Khan Academy, the Statistical Thinking courses on DataCamp and many other courses available at Coursera, Udacity and edx. These courses will give you an in-depth understanding of hypothesis testing, statistical inference, confidence intervals, regression and correlation, so that you can do quantitative analysis on data.
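To make the test-group-versus-control idea concrete, here is a small sketch of a permutation test written with the standard library only. Note that this is a stand-in for the t-test you may be used to, chosen because it needs no extra packages; the measurements, the function name and the number of permutations are invented for illustration.

```python
import random
import statistics

# Made-up measurements for a control group and a treated group.
control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
treated = [5.0, 5.2, 4.9, 5.3, 5.1, 4.8]

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = a + b  # new list, so the inputs are not modified
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

p_value = permutation_p_value(control, treated)
print(f"p = {p_value:.4f}")
```

A small p-value says that a difference this large would rarely appear if the group labels did not matter.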
At this stage, you will be eager to perform exploratory data analysis on your sample data. This will give you some insights into the summary statistics, the distributions of the data and the interdependence of features, and let you make some cool inferences on the basis of statistical significance vs practical significance.
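As a rough sketch of what a first exploratory pass can look like, the snippet below computes a few summary statistics for one feature and the Pearson correlation between two features. The numbers and the `pearson` helper are illustrative assumptions, not a real dataset or a library function.

```python
import statistics

# Made-up paired measurements for two features.
heights = [150.0, 160.0, 165.0, 170.0, 180.0]
weights = [50.0, 58.0, 63.0, 68.0, 80.0]

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Summary statistics describe one feature at a time.
summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "stdev": statistics.stdev(heights),
}

# Correlation describes how two features move together.
r = pearson(heights, weights)
print(summary, round(r, 3))
```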
Now, it’s time to dig deeper and do some background work to prepare yourself to learn and understand complex Machine Learning models. This means we should not be afraid of Math, and should acquaint ourselves with some basic concepts from linear algebra and calculus that will help us better understand machine learning models. It would be nice to understand the concepts of matrix algebra and eigenvalues as well as derivatives and gradients, as these are the backbone of some machine learning algorithms. Good places to learn these concepts are Khan Academy, MIT OpenCourseWare and Coursera.
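To see why derivatives and gradients matter, here is a minimal sketch of gradient descent fitting a straight line. The data points, the learning rate and the number of steps are assumptions picked for the example; real libraries do this far more efficiently.

```python
# Toy data generated from the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def gradients(w, b):
    """Partial derivatives of the mean squared error with respect to w and b."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2 / n * sum(errors)
    return grad_w, grad_b

w, b = 0.0, 0.0        # start from an arbitrary guess
learning_rate = 0.05   # step size, small enough to be stable here
for _ in range(2000):
    grad_w, grad_b = gradients(w, b)
    # Step against the gradient, the direction in which the error shrinks.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 4), round(b, 4))
```

The loop recovers a slope close to 2 and an intercept close to 1 using nothing more than the derivative of the error.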
Machine Learning
By now you must be wondering: ‘Is this really what data scientists do? Then how is it different from being a data analyst? I probably already know this stuff and could easily have done it in Excel, so why do I need to spend so much time, effort and resources on learning Python? Am I really making any predictions?’ Well, the heart of being a Data Scientist lies in making sensible predictions on data by using Machine Learning algorithms. So, the next step is to understand these Machine Learning algorithms.
Machine learning is broadly classified into two categories: Supervised and Unsupervised Learning. Supervised learning algorithms create a model by looking at labelled examples, whereas unsupervised learning algorithms create a model by finding patterns in unlabelled datasets. Some of the most commonly used supervised learning algorithms are Linear regression, Logistic regression, Support Vector Machines, Decision trees and Naive Bayes, while a commonly used unsupervised learning algorithm is K-Means clustering. You should try to understand how these algorithms function and what the advantages, limitations and assumptions of each model are. There are many online resources; some of the most recommended are the Machine Learning courses on Coursera, Edx or Udemy, which can help you build a thorough understanding and give you lots of practice exercises on these algorithms. After understanding these models you will feel powerful and will be eager to apply them to your sample dataset.
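The difference between the two categories is easiest to see on the same data. The sketch below is a toy illustration with invented points: the supervised half uses a simple nearest-centroid classifier (a stand-in that is not one of the algorithms listed above, chosen only because it fits in a few lines), and the unsupervised half is a bare-bones K-Means with fixed starting centers rather than the random initialization real implementations use.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, n_iter=10):
    """Plain K-Means: alternate between assigning points and updating centers."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean_point(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Invented two-dimensional data with two obvious groups.
points = [(1.0, 1.1), (0.9, 1.3), (1.2, 0.8), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised: the labels are given, so learn one center per label
# and use the centers to predict the label of a new point.
classes = sorted(set(labels))
class_centers = [mean_point([p for p, l in zip(points, labels) if l == c])
                 for c in classes]
predicted = classes[nearest((4.5, 4.7), class_centers)]

# Unsupervised: no labels, so let K-Means discover the groups itself.
found_centers = kmeans(points, centers=[points[0], points[3]])
clusters = [nearest(p, found_centers) for p in points]

print(predicted, clusters)
```

Note that K-Means returns cluster numbers, not names: deciding what each cluster means is still up to you.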
So, now it is time to figure out which model is the most appropriate for your dataset and to learn the tools to apply it in your programming language of choice. Once you have done this successfully, you should evaluate your choice of model and investigate its performance. As a next step, you can either try to improve the performance of your model or explore other models to see if they are better suited to your dataset. These steps will help you learn how to arrive at the most appropriate algorithm, one that can provide some valuable insights.
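One common way to compare models is to hold back part of the data and score each model on examples it has not seen. The sketch below does this for a linear regression and a naive baseline that always predicts the mean. The synthetic data, the 15/5 split and the helper names are assumptions for illustration only.

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

# Synthetic data: a linear trend plus some random noise.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 2.0 + random.gauss(0, 1) for x in xs]

# Hold out a quarter of the data for testing.
pairs = list(zip(xs, ys))
random.shuffle(pairs)
train, test = pairs[:15], pairs[15:]

def fit_mean(data):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_line(data):
    """Linear regression via the closed-form least-squares solution."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Train on the training set, score on the held-out test set.
scores = {name: mse(fit(train), test)
          for name, fit in [("mean baseline", fit_mean),
                            ("linear regression", fit_line)]}
best = min(scores, key=scores.get)
print(scores, best)
```

Always including a trivial baseline like this one makes it obvious whether a fancier model is actually earning its keep.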
Build Portfolio
By this time you will have a good understanding of a typical data science procedure, and it’s time to get creative and build your portfolio. This includes honing your skills in machine learning by doing multiple projects. The most effective way is to do some serious projects with real-world data, either as part of a group or alone with a mentor who can guide you throughout the projects and keep you on the right path. Choose the projects carefully so that they have some practical implications and economic value, and so that you can make a mark by publishing them as a blog post or on a web page, or even by presenting a talk or a poster at a conference. Alongside this, build your profile for the job search by enhancing your LinkedIn profile, creating an interactive resume and growing your network.
Next steps…
At this stage, you will be aware of most of the skills that Data Scientists use in everyday life and will be ready to jump-start your career as a Data Scientist. There are, though, a few things that you should be aware of. A Data Scientist position comes with a range of responsibilities, extending from building models to making data-driven business decisions. The range of tools and techniques that data scientists use varies a lot from one domain to another. Hence, it goes without saying that you have to spend some time interning and doing lots of small projects to get a good idea of what you want to do, what type of data scientist you want to be and which part of the data science process you enjoy.
In closing, don’t be afraid to try anything that interests you and to go after what you really want. It might be challenging at the beginning, but it will soon get super exciting and you will be unstoppable.
Author:
Tarjani Agrawal is a Data Scientist at @ Point of Care. She did her Ph.D. in Neuroscience at NCBS, Bangalore. After that, she did a short postdoc at NYU Langone Medical Centre, New York. During her postdoc, she worked on sleep behavior in flies, which is quite an old yet fascinating problem, but for her the more challenging problem was to handle and make sense of the big data that was being collected every 5 minutes for months from some thousands of animals. That’s when she discovered the whole field of Data Science and decided to make a move. However, it was not possible to make that switch while doing a full-time postdoc, so she took a part-time teaching job as Adjunct Assistant Professor at CUNY, New York. She followed the steps described here and also got a position as a data science fellow at Springboard, where she was able to hone her skills by working on various projects and to build her portfolio. After a struggle of around one year and four months, she got her first job as a Data Scientist. Now, she enjoys working on huge datasets, building models to predict trends in the engagement and participation of the target audience, and making very interactive dashboards for immediate analysis. She is helping her team make better, more informed business decisions based on quantifiable data-driven evidence. In her spare time, she enjoys teaching high school kids and graduate students and also takes up new and challenging assignments related to data science.
Editors
Pawan Nandakishore is a postdoctoral researcher at the International Centre for Theoretical Sciences (ICTS). He holds a Ph.D. from the Max Planck Institute for Dynamics and Self-Organization. His background is in experimental physics, with a specialization in soft matter. Do feel free to write to him at pawan.nandakishore@gmail.com
Anshu Malhotra is an assistant scientist at Emory University and coordinates the activities of CSG’s flagship mentor-mentee program (Gurukool). She is actively involved in bench-based research in pediatric oncology and is strongly interested in developing skills in data science. CSG’s current venture, the Data Science club, is Anshu’s latest passion, and she hopes that this platform will bring more life scientists together to train themselves and network in this budding new profession. In her spare time, she dabbles in artwork, making 3D murals.