Data generation and analysis are not new concepts. High-throughput scientific data has long been generated by the likes of genome sequencers and the Large Hadron Collider, while large swathes of commercial data have been generated by Amazon, Netflix, Google and social media platforms. Viewed more broadly, every aspect of life, be it mechanics, biology, social media or weather, generates data. Analyzing this data reveals meaningful trends and patterns, which can be used to frame questions and find answers about many different aspects of life. For example, can we predict the next big disease outbreak using patient data from hospitals, so that healthcare organizations are better prepared to handle it? Can we combine such data with domain-specific knowledge from modern medicine to build models that suggest which treatments will work for a given disease? In other words, by analyzing data, we can make better decisions and allocate resources efficiently to tackle existing problems.
Many trillions of gigabytes of data are generated every day. It is estimated that by 2018 we will be generating about 50,000 GB per second. GIS images, preferences for shopping, movies and shows on television networks, GPS-based trends, social media behavior, and photos and videos captured on smartphones are all simple forms of data that companies generate and store. Another example is healthcare, where patient metrics have been collected for over a century in the form of qualitative observations, images and numbers. Meaningful trends can be identified from this data by using predictive models and visualization tools. To identify trends and build these models, appropriate questions need to be asked of a data set.
With the growing trend in data analysis, there is an increasing demand for people who can perform such analyses. Scientists, by virtue of their training, are required to frame questions as the starting point of any project. They work to find answers to their questions by designing experiments, building numerical or analytical models, and sometimes combining all three methods. Because they know how to extract useful information from messy, unclean data, they are fast becoming the lifeline of data science. A number of scientists are teaching themselves at least one relevant programming language and experimenting with open-source biological and other data sets.
It is important that scientists who wish to train themselves for this relatively new profession benefit not only from the plethora of resources available online but also from appropriate interactions with the data science community. The purpose is to avoid feeling lost in the huge online community that already exists, and instead to receive relevant pointers to resources and networking. The PhDCSG group has been instrumental in providing professional development guidance and resources to scientists, especially those from the STEM fields. Their recent initiative, the Data Science Club, has been created with this same objective: to train PhDs in relevant data science skills and to guide them towards career opportunities in this area. The club functions on a mentor-mentee format, with one mentor assigned to a group of 3-5 mentees. The mentees have been acquiring coding skills since the inception of the club and have an active network in which to discuss projects, job openings, problem solving and more. The club will soon take on challenges based on freely available data sets, as well as smaller projects that are being outsourced to club members.
The goal of the club is twofold. First, by introducing people to the wonderful world of data science and machine learning, the club aims to help those who want to familiarize themselves with data science tools and apply them to various types of problems in areas such as science, industry and education. Second, the club aims to keep pace with the rapidly advancing field of data science, its academic leaps and their subsequent applications to various domains. The aim is to present articles at various levels, from building data-based stories using visualizations to reviews and implementations of some of the newest techniques in machine learning.
Infographically Speaking…
From Visually.
There are vast opportunities waiting to be harnessed in this field, and scientists have a unique opportunity on their hands: to take up the challenge of this new profession and start making sense of the vast amounts of data that have already been generated in the world. After all, who understands data better than a scientist?
If you want to learn about data science or discuss interesting ideas and projects, please write to us. We would love to collaborate with data science practitioners and startups looking to develop machine learning and data science solutions. Please feel free to connect with us at csgdatascience@gmail.com.
About the Author:
Pawan Nandakishore is a postdoctoral researcher at the International Centre for Theoretical Sciences (ICTS). He holds a PhD from the Max Planck Institute for Dynamics and Self-Organization. His background is in experimental physics, with a specialization in soft matter. Do feel free to write to him at pawan.nandakishore@gmail.com.
Edited by:
Anshu Malhotra is an assistant scientist at Emory University and is actively involved in coordinating the activities of CSG’s flagship mentor-mentee program (Gurukool). She works in bench-based research in pediatric oncology and is strongly interested in developing skills in data science. CSG’s current venture, the Data Science Club, is Anshu’s latest passion, and she hopes that this platform will bring more life scientists together to train themselves and network in this budding new profession. In her spare time, she dabbles in 3D mural artwork.
Cover Image: Vinita Bharat
The contents of Club SciWri are the copyright of PhD Career Support Group for STEM PhDs (a US non-profit 501(c)(3)). PhDCSG is an initiative of the alumni of the Indian Institute of Science, Bangalore. The primary aim of this group is to build a NETWORK among scientists, engineers and entrepreneurs.
This work by Club SciWri is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.