Table of contents:
1. Who am I and to whom am I writing this note?
2. Why am I writing this note?
3. From Academia to Industry
4. What is Data Science and who is a Data Scientist?
5. What Skill Set Is Needed? [learning resources]
6. 3 Steps Between you and Becoming a Data Scientist
Who am I and to whom am I writing this note?
I am a Data Scientist with a background in Physics. I am writing this note for anyone with an academic background who is interested in becoming a data scientist in the industrial sector. Although this note targets academics with advanced degrees in quantitative fields such as Physics, Statistics, Math, and Engineering, interested people from other relevant backgrounds may also find it useful.
Why am I writing this note?
I believe that information on how to pursue one’s interests should be accessible to everyone, everywhere. In my opinion, differences in the level of success should not come from a lack of information. They can, however, come from differences in creativity, strategy, and hard work, for instance.
Although it is obvious that there is more than one way to succeed, it is not always obvious what a successful way would look like. It has been my own question. I have also frequently received similar questions from many fellows and juniors. Since I could not find a comprehensive, instructional note on the path from academia to data science in industry, I decided to write this note and share it with the public.
From Academia to Industry
Back in January 2016, a fellow physicist friend of mine, who was a postdoctoral researcher in Europe at the time, asked me about my new job as a Data Scientist and how I liked it so far. He mentioned that he was also interested in doing data science and wanted to know more about it. He essentially wanted to know where he should start in order to become a data scientist. Specifically, he wanted to know what skill set he would need and where he should search for a data science job. He was not the first to ask me these questions since I landed in the field of Data Science. I asked him to let me respond to these questions in detail later that weekend, when I would have some more free time.
In response to my friend’s questions, I would like to briefly sketch in this note the path I went through over the past 18 months to switch from a postdoctoral research position in academia to a data scientist role in the tech industry. Beyond answering my friend’s questions, I hope this article will also be useful for other people with advanced degrees in science and engineering who want to choose data science as their future career and may have similar questions.
As a former postdoc who has met many fellow postdocs during several years in academia, I believe that in a successful career a postdoc position is an excellent place for a transition (either to a permanent faculty job in academia or to a proper job in the industrial sector) but not necessarily the best place to stay for a long time. Not so many years ago, almost all science PhDs and postdocs intended to become university professors after one or two postdocs. This intention has changed quite a lot over the past decade for several reasons, including the lack of proper faculty jobs in academia on the one side and the fast-growing demand for science PhDs in the industrial sector on the other.
Therefore, for various reasons (to see some of them click here), many PhDs in science, engineering, and even other fields may find that academia is not the best place for them to settle and eventually decide to leave. In fact, only a very small percentage of PhDs remain in academia for their entire career, i.e., as professors [for example, see the above diagram from this report published in the UK, 2010].
For those with an academic background in science and technology, one justified path is to land in the fast-growing field of data science. The rest of this note will be most useful for people who have just started to think about taking this path to become a data scientist. They might want to know more about what it is, how they can prepare for it, and where they should look for the best opportunities. I will try to address these questions in the next sections.
What is Data Science and who is a Data Scientist?
In reality, because of the nature of this field, there is no unique definition of data science or of data scientists. The definitions and descriptions vary widely. But here is my version:
A data scientist is a skilled professional with a scientific mindset who uses past and current data to ask [and eventually answer] the right questions in order to make the most informed future decisions in an organization.
There are many popular quotes from senior figures in the field that define data science. Each describes one aspect quite well from one perspective, and together they give a good picture of the truth about data science and data scientists. Here are the quotes that I like the most on the importance and definition of data, data science, and data scientists.
“Data is the new oil? No: Data is the new soil.” — David McCandless
“Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” — Mike Loukides, VP, O’Reilly Media.
“Data scientist is a person who is better at statistics than any software engineer AND better at software engineering than any statistician.” — Josh Wills
“By 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge.”— Game changers: Five opportunities for US growth and renewal, McKinsey report.
“This hot new field [data science] promises to revolutionize industries from business to government, health care to academia.” — The New York Times
“The data scientist must have knowledge in applied science, with an extensive experience in its industry, and training in science.” — Juan F. Cía, BBVA
And here is my favorite joke on the definition of a data scientist, retweeted by Chris Dixon: “A Data Scientist is a statistician who lives in San Francisco!” It also delivers some part of the truth.
To get a broader view of the field of data science and big data, I highly encourage you to watch the 2-minute video message from the US President on data science in 2015 introducing DJ Patil, the first official Chief Data Scientist of the White House, followed by a remarkable 10-minute talk by DJ Patil, and also a fun 12-minute public talk by Hilary Mason in 2012.
What Skill Set Is Needed? [learning resources]
Generally speaking, the skills one would need to get a data scientist job usually fall into three categories [see the diagram]:
- Math and statistics
- Programming and hacking skills
- Domain knowledge
Depending on your academic background, you may have to focus on improving your skills in different categories from the above list. As a physicist with advanced academic degrees, you should have no serious problem with the math and statistics part. The other two categories, i.e., domain knowledge (substantive experience) and hacking skills (machine learning algorithms, programming skills, advanced software tools, etc.), are the ones you have to focus on more. A detailed and relatively complete list of the skills needed by a data scientist is available in this diagram. On the most important skills for data scientists, I also found this article very informative.
There are many subjects and tools under the hacking skills category that are worth becoming familiar with during the preparation phase. However, nobody knows everything about all of them, and that’s totally fine. In my view, the most important computer and hacking skills in high demand are the following.
- Machine learning algorithms: Regression, Classification, Clustering, Recommender Systems, Neural Networks, etc.
- Programming Languages: Python, R, SAS, etc.
- Big Data tools: MapReduce fundamentals, Hadoop, Spark, SQL, etc.
- Visualization tools: D3.js, ggplot2, matplotlib, etc.
I personally got interested in Data Science when I was a postdoctoral research fellow in the data analysis group of LIGO. As part of this job, I had the opportunity to learn and work with several tools and techniques for analyzing big data from gravitational-wave observatories, which helped me a lot along the way. I started my first Data Science projects when I took an online 10-week course on Machine Learning by Andrew Ng of Stanford. I enjoyed the lectures and homework projects so much that I took more courses in the field afterwards and started to think more seriously about choosing data science as my career path. The following is a schematic of my career path towards data science, which is only one of many possible ways to approach data science.
Among all the useful textbooks available on applied statistics and machine learning algorithms, which are essential for data science, I recommend the following books. They have all been made available online for free by their publishers:
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. By Trevor Hastie, Robert Tibshirani, Jerome Friedman. [Download] (see the course: Statistical Learning)
- An Introduction to Statistical Learning with Applications in R. By Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani [Download]
- Mining of Massive Datasets. By Jure Leskovec, Anand Rajaraman, Jeff Ullman. [Download] (see the course: Mining of Massive Data)
There is a growing number of blogs that write about data science on a regular basis. The ones I have found most interesting and useful are the following.
- Astronomer to Data Scientist, a brief but very useful post on transitioning from academia to data science, which was very helpful to me when I was getting started. [introductory] [if you are curious what happened to her after this post, read this interview with her a year and a half later]
- Data School [introductory]
- Data Science 101 [introductory and technical]
- Data Science Central [introductory and advanced]
- FlowingData [technical, details]
- R-bloggers [technical]
Online Courses [MOOC]:
There are many online courses that you may want to participate in. Many of them are available for free and many are not. Below I have listed the most popular educational websites with links to particular courses on data science. The courses that I have found most useful among all those I have taken so far are marked with [***]. In my view, they have the highest priority for an outsider to the field.
- from Coursera: Series on Data Science including but not limited to these that I took:
  - [***] Machine Learning [10 weeks, high workload]
  - [***] Mining Massive Datasets [7 weeks, high workload]
  - The Data Scientist’s Toolbox [4 weeks, medium workload]
  - R Programming [4 weeks, medium workload]
  - Getting and Cleaning Data [4 weeks, medium workload]
- from EdX: Series on Data Analysis and Statistics including:
  - Data Science and Machine Learning Essentials (Azure ML) [5 weeks, medium workload]
- from Stanford Online Lagunita:
  - [***] Statistical Learning [10 weeks, high workload]
- from Udacity: Series on Data Science including these that I took:
  - Intro to Hadoop and MapReduce [4 weeks, medium workload]
  - Data Visualization and D3.js [6 weeks, medium workload]
- from DataCamp: many useful, related courses are available.
  - Kaggle R Tutorial on Machine Learning [1 hour, low workload]
Bootcamps:
- Insight Data Science Fellows Program: A 6-week bootcamp on data science that started in New York City. A completed Ph.D., or one approaching completion, is required. [FREE]
- The Data Incubator: A 6-8 week bootcamp on data science in New York and Washington D.C. [FREE]
- Microsoft Research Data Science Summer School: An 8-week bootcamp on data science in New York [FREE]
Be aware that because these bootcamps are free and of high quality, and because the rate at which popular companies hire their graduates is very high, admission to them is quite competitive.
In addition to the above free bootcamps, there are several other popular data science bootcamps which are not free [~$12-16k in 2015]. To see a more complete list of data science bootcamps in the US click here. In any case, consider data science bootcamps optional. You don’t need to participate in one to get a nice data science job, but if you have the time and money, do it.
Academic Graduate Programs:
In this note I am targeting those with advanced academic degrees who want to make a transition to data science, and this option is not necessary in such a situation. But for the sake of completeness, I will write a few lines about it.
Recently, many master’s programs have been started at various schools and universities in the US, including Columbia and Stanford. A list of master’s programs in the US is available online. A full-time master’s program in data science usually takes 1-2 years. Because of their relatively long duration and high cost, these programs are only recommended for undergraduates interested in data science or for those who already have a job in an analytics team at a corporation and want to learn about data science in an academic setting.
3 Steps Between you and Becoming a Data Scientist
Assuming you are an academic with solid domain expertise in your field who wants to become a data scientist, there are three major steps that you should take for a smooth take-off from your field and landing in the field of data science. The time needed definitely depends on your skill level when you start and how much effort you intend to put into the process, but for me, for example, it took around 18 months from the point I started my first online course on data science to the point I got my first job offer. I group all the details of this journey under the following three major steps [see the schematic below].
A) Skill Development: Establish a solid data science skill set
Since this is the first and probably the most important step towards your goal, it’s very important to do it right. Because data science is an interdisciplinary field, there may be many topics that are totally new to you. Be patient and follow an organized plan to learn the new topics as deeply as you can. Don’t worry much about the time you spend in this step. As with any learning curve, you are going to go back and forth several times until you learn a topic deeply enough to be able to apply it elsewhere. Absorb as much as you can from the learning resources and get comfortable with the new concepts so that you can easily use them in real-world problems later on.
Choose your resources carefully and read, listen, watch, and learn as much as you can to develop and improve your basic knowledge and skills in programming, machine learning algorithms, and software toolboxes. The textbooks and online courses listed in the previous section are rich resources to get you started. I personally recommend starting with the Machine Learning course on Coursera by Andrew Ng, which requires almost no prior knowledge of machine learning and data science, and afterwards taking the more advanced Statistical Learning course by Trevor Hastie and Rob Tibshirani of Stanford. Both of these courses are linked above in the resource section. They both cover a broad range of topics at different depths, with slightly different approaches and tools, which I think makes taking both an excellent learning strategy.
Look around, find, follow, and communicate with people who are experienced in data science, including the seniors in the field, and also with those who are taking the same path as you, for company along the way. It is going to be a long journey, and having the right people around you at the right moments will certainly help you a lot. Since it’s a transition path, there will be moments of loss and darkness. Never give up and always be positive. The more informed your initial decision to take this journey is, the fewer such moments you will have. Gather as much information as you can about the new field that you are about to enter.
In addition to the new concepts and the various machine learning algorithms that are crucial for doing data science, it’s also very important to be familiar with the tools. Master one programming language, and try to learn as much as you can of other languages so that you can use them in special circumstances. There is a long-running debate among data scientists about which language is the best for doing data science. In the end, it doesn’t matter which programming language you use to implement your algorithms, but it is important to be able to communicate with the rest of the community.
For this reason, and because of their popularity, Python and R are the most recommended languages to learn and use. Python was designed by computer scientists and is therefore more structured than R. R, on the other hand, was developed by statisticians, so it is easier to learn and apply statistical concepts with it. I learned and worked with both of them, just to be on the safe side. I don’t regret it.
B) Project: Start and finish a data science project
Although developing a strong skill set is the most basic step towards becoming a data scientist, and therefore the longest one too, I cannot put enough emphasis on the positive effect that completing a small data science project has on your success in the next and final step, which is job searching. Think from the point of view of a hiring manager who is considering you, with your academic background, for your first data science position: regardless of your great resume and the long list of refereed papers in high-impact journals, the first thing they would like to know is whether you are actually capable of doing a data science project from beginning to end. Completing a data science project prior to the job interview and showing the procedure and results is certainly a key factor: it demonstrates the ability and quality of your work to the interviewer and helps them make a more informed, and therefore more confident, decision.
There are many interesting data science projects out there that you can work on at this stage. Participating in an internship program at a good company would certainly be a good idea if you have the time. Keep in mind that in this case you have to go through the whole application process and paperwork for the internship. But if for any reason a data science internship does not work for you, I recommend working on a project of your own. It is more flexible and less time consuming. If you choose to do so, I strongly recommend participating in one of the Kaggle competitions.
Kaggle is a platform for data scientists who want to work on very interesting, real-world problems. Competitions are held in different categories, but the structure is much the same for all of them. It’s very simple. There is a data science problem that you are supposed to solve before a firm deadline. All training data is available to download. The solution that you submit to the website is a single file in a standard format. It doesn’t matter what tool, method, or algorithm you have used; Kaggle will score your submitted file by comparing it with the real solution, which they keep confidential.
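To make the submit-and-score loop described above concrete, here is a minimal sketch in Python. It is not Kaggle's actual scoring code: the column names (`id`, `prediction`), the toy labels, and the use of plain accuracy as the metric are all assumptions for illustration. Each real competition defines its own file format and evaluation metric.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write {id: label} predictions as a two-column CSV with a header row.

    The column names are hypothetical; a real competition specifies its own.
    """
    writer = csv.writer(handle)
    writer.writerow(["id", "prediction"])
    for row_id in sorted(predictions):
        writer.writerow([row_id, predictions[row_id]])


def score_submission(submission_handle, solution):
    """Return the fraction of solution ids whose submitted label matches.

    Accuracy is an assumed metric; ids missing from the submission count as wrong.
    """
    reader = csv.DictReader(submission_handle)
    submitted = {row["id"]: row["prediction"] for row in reader}
    correct = sum(
        1 for row_id, truth in solution.items() if submitted.get(row_id) == truth
    )
    return correct / len(solution)


# Toy data: the competitor only ever sees `predictions`;
# the organizer keeps `hidden_solution` confidential.
predictions = {"1": "cat", "2": "dog", "3": "cat", "4": "dog"}
hidden_solution = {"1": "cat", "2": "dog", "3": "dog", "4": "dog"}

buffer = io.StringIO()  # stands in for the uploaded submission file
write_submission(predictions, buffer)
buffer.seek(0)

accuracy = score_submission(buffer, hidden_solution)
print(f"accuracy = {accuracy:.2f}")
```

The point of the sketch is that the scorer only ever sees the submitted file, so whatever language, model, or tool produced the predictions is irrelevant to the ranking.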
There are a few things that make Kaggle an excellent platform to use at this stage. First, all the problems on this website are standard real-world problems, and you have several to choose from. Second, since it is a competition, there will be many others like you working on the same problem. So, you can always communicate with your fellow Kagglers to share ideas and improve your algorithms and results. Third, at the end of the competition your solution will be ranked, and if you have done well, you can present it as a valid achievement that shows the quality of your work.
C) Job Searching: Shape up your resume and do the job search wisely
After the above steps, it’s time to reach out to the world and show what you have got and what you can do. You have all you need and are ready to start searching for your dream job. However, just like the previous steps, this step, i.e., searching for a proper job, won’t be easy either. Although data scientists are in high demand, there are many talented people like you with advanced degrees who are also active in the job market, so the environment is very competitive. To be successful in this final stage, you should follow some specific rules and strategies.
If you are coming from academia and are used to having a 10-page CV full of rocket science publications, achievements, and honors, with a list of all the talks and seminars you have given, burn it and create a one-page (two at most) resume instead. Make sure that the beginning of your resume makes clear who you are and what you want. Highlight your main skills relevant to the job you are looking for, which in this case is data science. List your experience to support and justify your highlighted skills.
Keep in mind that “Data scientist” is often used as an umbrella title to describe jobs that are drastically different. Be aware of, and careful about, what you will actually be doing under the title of data scientist and in what industry you will be working. The data engineer, data analyst, and data scientist roles may be confusing at first glance. This infographic describes the different roles in the business of data science pretty well.
When you are transitioning to a new field, prior information is your power. Do broad and careful research on the different aspects of the field and on what interests you the most. No question is stupid or naive. Find the answers to your questions about data science jobs. The 2015 Burtch Works report analyzing the status and compensation of 371 data scientists in the US is very useful at this stage, i.e., job searching. They may continue publishing new reports in the coming years as well. RJMetrics has also published a review on The Status of Data Science, analyzing 11,400 data scientists currently employed by companies known to LinkedIn. Read these reports carefully to enrich and update your knowledge of the field of data science. Glassdoor also has many useful features in this regard. The bottom line is to open your eyes and sharpen your mind about what you are getting into.
For the sake of completeness, I should write here about how to search for your dream job efficiently and about all the different techniques that you need during this step. But since it’s a broad subject that may go beyond the scope of this note, I decided not to go deep into this topic and to refer readers to other technical references on the subject instead.
However, I would like to briefly mention a few quick, crucial points. First, be prepared for the job interviews, but don’t expect your first few interviews to go as you hope. It always gets better in later interviews. Take notes and use your previous experiences on later occasions. Wrap your mind around what you have, who you are, and what you want to present in advance. It takes time and won’t be easy in the first interviews. I remember that when I got a phone interview with Google, I was so frustrated and stressed out just because I was not ready for it. It got much better in later interviews.
Second, work hard on your communication skills and build up your own network. If you have data scientist friends, reach out to them and ask for suggestions and get them involved in your job search process. They may also have information about job openings in their companies or in other places in their network that you want to hear about. Expand your options by connecting to others on the other side of the wall.
Third, always stay connected and up to date. Online social networks are powerful tools. Use LinkedIn, Monster, Indeed, the Kaggle job board, and Glassdoor to your advantage. They actually work. Face-to-face conversations are great, but do not underestimate online tools in your job search, and make the best use of them. Follow popular data scientists on social media, especially those in your target geographical region, and discover what they are up to. Always feel free to contact and follow @smirshekari if you have any question that you think he might be able to help with.