Analysing job post data from Indeed, can we make more evidence-informed career decisions?
Picture a novice data scientist, either straight out of education or pivoting careers. They are determined to enter the field but have limited time and funds to spare on extra training and are daunted by the sheer variety of skills and experiences employers request. Naturally, they sign up to intensive bootcamps and scramble to take online courses on everything from Azure to Excel.
How can you be sure that focusing on a particular skill won’t prove to be a massive opportunity cost? How many years of experience are employers expecting? What should you realistically expect to be paid? If you’re leafing through a data science syllabus and want to make sure your investment pays off, what should that syllabus include? That’s also a very pertinent question for people designing a data science course: how do we maximise the overlap between what recruiters want and what our course teaches?
Img 1.1: Every new or aspiring data scientist right now. Photo courtesy of Unsplash
I’m reporting on findings from an exploration of job data on Indeed.co.uk, so this focusses on the UK job market. Although this post provides useful answers, there are still plenty of questions to ask and it’s unclear how representative the findings are of the total population of jobs. This is why future iterations of the project in Q1 2021 and later will try to replicate / falsify the findings. If you have suggestions for other things to search for within the data or any critiques of methods used, please leave a comment. All feedback is appreciated.
If you just want the most important results, scroll to section II. Key insights.
If you want to know how those results were generated in more detail, scroll to section III. Methodology.
If you want to be able to replicate the findings, read the assumptions or see the results in full (including more of the null results), see the README in the project repo and the main notebook.
Harvard Business Review dubbed data scientist the sexiest job of the 21st century, and the rapid wave of data bootcamps and online courses over the past decade reflects the immense magnetism of the profession. Yet with so many people rushing into data science, it’s important to know the market and offer a competitive resume. This post and its corresponding project repository are my contribution to the trove of knowledge on the job market. This is an exploratory, single-researcher analysis, using data scraped only from Indeed, with a total sample of 1082 job descriptions and 382 annual salaries, so you should take the findings with a pinch of salt. The plan is to replicate the scraping and analysis during Q1 2021, broadening the range of analyses performed on the data and improving the quality and efficiency of the web-scraping tool. I searched for jobs with 3 different pairs of words in the title: “data scientist”, “data analyst” and “machine learning”.
I’ve included all of those under the umbrella of “data science roles” because even though it’s common knowledge that those roles do very different things and require different proficiencies, they still all do data science, and the lines between the categories are blurry.
The discrepancy is consistent across the three categories, though wider for roles with “machine learning” (ML) in the title than those with “data analyst” (DA). This is particularly daunting for the lone, fresh data scientist entering the field, especially if they have no prior experience of negotiating pay in any field.
Roles with “data scientist” (DS) and ML in the title are paid on average £25k per year more than roles with DA in the title. That’s approximately £13 per hour more.
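To show where the hourly figure can come from, here is a quick back-of-the-envelope conversion. The 37.5-hour week and 52 paid weeks are my assumptions for illustration; a different working pattern would shift the result slightly.

```python
# Assumptions (not taken from the job data): a 37.5-hour working week,
# paid for 52 weeks of the year.
HOURS_PER_WEEK = 37.5
WEEKS_PER_YEAR = 52


def annual_to_hourly(annual_gbp: float) -> float:
    """Convert an annual salary figure into an approximate hourly rate."""
    return annual_gbp / (HOURS_PER_WEEK * WEEKS_PER_YEAR)


hourly_gap = annual_to_hourly(25_000)
print(f"£{hourly_gap:.2f} per hour")  # prints £12.82 per hour
```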
Why does this matter?
If you’re aiming to maximise your salary, it’s important to know which parts of the field return the greatest financial rewards over time. Having said that, consider that experience can be the greatest barrier to getting a job in data science (see Insight 4). If an analyst role is easier to get into, but you still want to aim for the higher salaried positions, then why not build on it?
Recommendations:
For those just entering the field, that’s enough information to tell you where to start focusing your attention. Once you have a solid foundation in those two, here are your options:
Fig 2.1 — the top 10 most mentioned skills / languages for each group
Fig 2.2 — when ranges were stated they were averaged (e.g. “2–3 years” became 2.5)
Most jobs that explicitly stated an experience threshold asked for 2 to 3 years of experience. If representative, this presents the largest barrier to entering a data science role. If you’re completely lacking in experience now, you might have around 20 jobs open to you.
After 2 years working in data science, that number should be 3 times bigger. Moreover, the expected salary will increase too, as there is a moderate correlation between the two.
As you might expect, just over half of the jobs are in London. London jobs also pay on average 50% more than those in the rest of the country, which is to be expected given the much higher cost of living.
Fig 2.3
The median data analyst in London would expect to earn as much as the overall median for outside the capital.
Using topic modelling (LDA) to cluster and group jobs, I could determine 3 emergent topics that weren’t just noise. These were the
This last insight was generated with unsupervised techniques and it is not currently possible to verify the accuracy of those topics. However, they do corroborate other findings in Key Insight 3.
For this project, I’ve followed the PPDAC cycle for data science:
Fig 3.1 — PPDAC cycle
2. Plan —What libraries would I use? How many statistical tests should I run and what confidence level should I set in advance? What
3. Data — Web-scraped job descriptions from Indeed.co.uk, searched for between 24th and 25th November, 2020.
4. Analysis — Plot salary distributions for London and non-London jobs; different job categories and jobs by which programming language they mention. With an initial alpha of 0.05 I will run several NHST tests and report on results.
5. Conclusion — Report any insights, recommendations and future steps.
If we want to get ahead in the data science UK job market, it would be useful to be able to answer the following:
My plan was to scrape data from Indeed.co.uk using BS4 and Selenium. Then I would extract key information such as salary, location and programming languages using regex (regular expressions). I would plot and wrangle data using the Pandas and Seaborn libraries then carry out statistical tests. I would also attempt to build a predictive model for salary using linear regression, and report on the coefficients as measures of the importance of each job feature. Lastly I would use Latent Dirichlet Allocation to look for emerging topics within job descriptions.
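As a rough illustration of the regex step for salaries, here is a minimal sketch. The pattern, the `parse_annual_salary` name and the “a year” wording are assumptions for illustration, not the code from the project repo. Stated ranges are averaged into a single figure, which is how ranges are treated in the salary plots.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "£40,000 a year" or "£40,000 - £45,000 a year".
# The real wording on a listing may differ, so treat this as a starting point.
SALARY_RE = re.compile(
    r"£\s*([\d,]+(?:\.\d+)?)"                      # lower (or only) figure
    r"(?:\s*[-–]\s*£?\s*([\d,]+(?:\.\d+)?))?"      # optional upper figure
    r"\s*(?:a|per)\s*year",
    re.IGNORECASE,
)


def parse_annual_salary(text: str) -> Optional[float]:
    """Return the annual salary in pounds, averaging a stated range.

    Returns None when no annual salary is found in the text.
    """
    match = SALARY_RE.search(text)
    if match is None:
        return None
    figures = [float(group.replace(",", "")) for group in match.groups() if group]
    return sum(figures) / len(figures)
```

For example, `parse_annual_salary("£40,000 - £45,000 a year")` gives 42500.0, while a listing that only says “Competitive salary” gives `None` and would be left out of the salary analysis.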
The ideal situation would be to answer all questions stated and to be able to state Insights and Recommendations for every part of my analysis. However, in several areas, I had to report null results due to either insufficient data or a lack of a signal, and will have to revisit those questions in future iterations of the project.
From previous attempts at web-scraping jobs on Indeed.co.uk, I knew that the number of jobs listed at any one time is in the hundreds, so there would likely be a problem of insufficient data for some of the questions I wanted to answer. Another problem was trying to make sure that the search terms I used captured the field of data science as much as possible, rather than just one specific role within it.
To tackle those two problems, I retrieved job data from Indeed.co.uk based on 3 separate searches. Each search only returned job posts where the title of the job included one of the following pairs of words: “data scientist”, “data analyst” or “machine learning”.
Some jobs had titles containing words from 2 of the searches, so they were returned more than once; these duplicates were discarded. Moreover, it’s generally known that a data analyst is doing data science, an ML engineer does do some analysis, a data scientist does use machine learning, and so on.
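Combining several searches and discarding repeats can be sketched as below. The `job_id` field and the rule that the first search to return a posting keeps it are assumptions for illustration; the project may identify duplicates differently.

```python
def merge_searches(*searches):
    """Combine several lists of job records, keeping each posting once.

    Each record is a dict with a hypothetical "job_id" key. When a posting
    appears in more than one search, the copy from the earliest search wins.
    """
    seen_ids = set()
    merged = []
    for results in searches:
        for job in results:
            if job["job_id"] in seen_ids:
                continue  # already captured by an earlier search
            seen_ids.add(job["job_id"])
            merged.append(job)
    return merged


# Made-up records: the second posting matches two of the searches.
ds_jobs = [{"job_id": "a1", "title": "Data Scientist"},
           {"job_id": "b2", "title": "Machine Learning Data Scientist"}]
ml_jobs = [{"job_id": "b2", "title": "Machine Learning Data Scientist"},
           {"job_id": "c3", "title": "Machine Learning Engineer"}]

all_jobs = merge_searches(ds_jobs, ml_jobs)
```

With the records in a Pandas DataFrame, concatenating the three search results and calling `drop_duplicates(subset="job_id")` would do the same job.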
Fig 3.2: being coy about pay
From the perspective of a recent data science grad or someone fresh out of a bootcamp, one challenge they’ll face is salary negotiation — particularly daunting when most jobs do not directly state their salary. This applies across the board to all 3 categories, although the gap is wider for DA and ML roles.
For the bootcamp’s organisers, this makes it even more important to research salary estimates thoroughly and share them with students. To help reach the goal of maximising the average salary of bootcamp grads, students should also be given help with salary negotiation.
Fig 3.3: annual salary distribution — regarding the median with a decimal: salaries that stated a range (e.g. £40–45) had their average taken instead of both values. Hence the non-integer median value
Among the jobs that reported annual salary (Fig 3.3), DA jobs were not as well paid as ML or DS jobs by quite a margin — the median salaries for DS and ML are at least £12k above the median for all data science jobs; DA roles pay about £20k less!
This is a fairly solid finding since it’s supported by general background knowledge about the field: analyst roles tend to be less technically specialist and pay less than other data science roles. If you’re aiming to maximise salary, then a recommendation might be to prepare your grads to aim for DS and ML jobs. However, DA jobs are also the most numerous. They might form a reliable fallback for bootcamp grads not managing to hit targets for the DS and ML roles.
3. What are the main locations that data scientist roles appear in? (London expected to be the main one)
Fig 3.4 — dominated by the capital
As expected, London dominates the country in terms of data science roles. Even with all the mini tech hubs, the emerging Northern cities and the Silicon Fen in Cambridge, London still edges over the entire rest of the country (Fig 3.4).
Fig 3.5— number of jobs by top 10 location
The next 9 ‘locations’ with the most data science roles are utterly dwarfed by London (Fig 3.5). The fourth most popular location (as declared in the job) is ‘Home Based’, which is unfortunate for anyone hoping data science jobs might invigorate the North or anywhere that isn’t London.
The picture becomes even more dire when we consider the salary breakdown between the capital and the rest of the kingdom. Figure 3.6 shows the annual salaries for the 3 sub-groups in London. The purple and yellow lines show the non-London and London median salaries respectively. There is a £20k difference between the two!
Fig 3.6— London vs the rest of the country
4. What are some of the most frequent words mentioned in the job title?
A job title can communicate a lot of things. For instance, if a role mentions “R” in its title (e.g. “Data Scientist with R experience”) you’ll have good reason to ignore that role if you don’t code in R. If we look at the single terms and bigrams (2 word combos) that appear most frequently (Fig 3.7) we can infer the following:
Fig 3.7 — top terms appearing in the title
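Counting single terms and bigrams in titles can be sketched with the standard library alone. The titles below are made up, and the tokenising rule (lowercase, keeping letters, digits, `+` and `#`) is my assumption rather than the exact preprocessing used for the figure.

```python
import re
from collections import Counter


def title_ngrams(titles, n):
    """Count every run of n consecutive words across a list of job titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z0-9+#]+", title.lower())
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Made-up titles for illustration only.
titles = [
    "Senior Data Scientist",
    "Data Scientist with R experience",
    "Junior Data Analyst",
]
unigrams = title_ngrams(titles, 1)
bigrams = title_ngrams(titles, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```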
5. Which programming languages are in greatest demand? Do any of the languages correlate with higher salary?
For this question, I went beyond programming languages to include techniques, libraries, cloud services and skills, which greatly increased the usefulness of the findings.
Figure 3.8 illustrates how the 10 most popular languages and skills compare across the 3 groups. This particular graph can be used to tailor your own portfolio building journey.
For instance, if you have expertise in R and SQL, you’re better positioned to aim for DA roles. If you have experience in Java and are considering pivoting into data science, then focusing on Python and AWS will stand you in good stead.
Fig 3.8— Skills in demand
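Counting skill mentions is trickier than it looks, because short names collide with ordinary words: “R” hides inside “research”, “Java” inside “JavaScript” and “Excel” inside “excellent”. Here is a minimal sketch of one way to guard against that. The skill list, the function names and the choice to treat single-letter skills as case-sensitive are all assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative skill list; the project looked at many more terms.
SKILLS = ["Python", "SQL", "R", "AWS", "Java", "Excel"]


def skill_pattern(skill):
    """Build a regex that matches the skill only as a standalone token."""
    # Single letters such as "R" stay case-sensitive to avoid stray matches.
    flags = 0 if len(skill) == 1 else re.IGNORECASE
    return re.compile(
        rf"(?<![A-Za-z0-9]){re.escape(skill)}(?![A-Za-z0-9+#])", flags
    )


PATTERNS = {skill: skill_pattern(skill) for skill in SKILLS}


def count_skill_mentions(descriptions):
    """Count how many descriptions mention each skill at least once."""
    counts = Counter()
    for text in descriptions:
        counts.update(s for s, p in PATTERNS.items() if p.search(text))
    return counts


# Made-up descriptions for illustration only.
descriptions = [
    "We use Python, SQL and R daily.",
    "JavaScript and research experience required. Excellent communication.",
    "Strong python and AWS skills.",
]
counts = count_skill_mentions(descriptions)
```

Running this separately on the DA, DS and ML descriptions would give the per-group counts that a chart like Fig 3.8 compares.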
If you’re a data bootcamp, you can help inform more clearly which skills grads can leverage towards which roles, so they can optimise their job search. Towards the end of your course, you could create ‘Data Science Profiles’ that learners can gravitate towards depending on the kind of work they would prefer doing. Those that want heavier computational work (so more on the ML side), can attend classes on Docker, Java and AWS. Those drawn more towards analytics can spend more time refining their SQL and other relevant skills.
I next attempted to build a linear regression model for predicting salary, but it would seem that there wasn’t any way to reliably predict the salary from the features I built. After multiple iterations, dropping non-significant features and cross-validating, the summary for the best model is this:
I also tried binning the salary data into bands (such as £20,000–25,000) and predicting the band instead. However, even that had little predictive power: the best model reached an accuracy of about 10% across 10 different salary bands, which is what guessing a band at random would achieve. To conclude, I have rejected the model and will attempt to build another in future iterations of the project, when more data is available. However, it might just be the case that the features used have no real relationship with salary.
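To make the banding step and the “is 10% any good?” check concrete, here is a small sketch. The £5,000 band width comes from the example band above; the function names and the sample salaries are made up. The majority baseline is the accuracy you would get by always predicting the most common band, which any useful classifier should beat.

```python
from collections import Counter

BAND_WIDTH = 5_000  # matches the £20,000 to £25,000 example band


def salary_band(salary):
    """Return the (lower, upper) bounds of the band containing the salary."""
    lower = int(salary // BAND_WIDTH) * BAND_WIDTH
    return (lower, lower + BAND_WIDTH)


def majority_baseline(bands):
    """Accuracy achieved by always predicting the most frequent band."""
    most_common_count = Counter(bands).most_common(1)[0][1]
    return most_common_count / len(bands)


# Made-up salaries for illustration only.
salaries = [22_500, 24_000, 31_000, 47_500]
bands = [salary_band(s) for s in salaries]
baseline = majority_baseline(bands)
```

Here two of the four salaries fall in the same band, so the baseline is 0.5. A model scoring below its own dataset’s baseline has not learned anything useful from the features.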
Many new data scientists find it vexing or disappointing when they search for an “entry-level” position and find that it requires 3 years of experience in the field. But do the years of experience required stated in the job ad actually have anything to do with something more concrete, like pay?
After extracting the required experience in years from job posts using regex and averaging those that gave a range (“2–3 years experience” becomes 2.5), I compared it to salary for those jobs. Jobs with 0 years of experience were identified by searching for “junior role / data / position” (this is probably the most contentious assumption behind this feature). Unfortunately, I could extract such data for very few jobs (246 job posts), although the Spearman rank correlation test was significant (p-value < 0.001). Experience and salary have a weak-to-moderate positive correlation (0.37). We must remember that this was based on data mined with regex and relies on certain assumptions and limitations. Hence this is an important test to re-run in the next project iteration.
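The two steps described here, pulling out the required years and then correlating them with salary, can be sketched as follows. The regex patterns and function names are my assumptions, including the reading of “junior role / data / position” as the word “junior” followed by one of those three words. The Spearman coefficient is written out by hand so the sketch needs no extra libraries; it does not compute a p-value.

```python
import re

# Hypothetical patterns, not the ones from the project repo.
YEARS_RE = re.compile(
    r"(\d+(?:\.\d+)?)"                              # lower (or only) figure
    r"(?:\s*(?:-|–|to|or)\s*(\d+(?:\.\d+)?))?"      # optional upper figure
    r"\+?\s*years?",
    re.IGNORECASE,
)
JUNIOR_RE = re.compile(r"\bjunior\s+(?:role|data|position)\b", re.IGNORECASE)


def required_years(text):
    """Return required experience in years, averaging ranges; None if absent."""
    match = YEARS_RE.search(text)
    if match:
        figures = [float(g) for g in match.groups() if g]
        return sum(figures) / len(figures)
    if JUNIOR_RE.search(text):
        return 0.0  # contentious assumption: "junior" means no experience
    return None


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In practice `scipy.stats.spearmanr` returns both the coefficient and the p-value in one call, so the hand-written version is only here to show what the statistic measures.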
Fig 3.9 — how many years??
Out of those jobs, Figure 3.9 shows the general trend. The largest number of jobs in this group asked for 2 OR 3 years of experience. Hopefully this isn’t representative of the job market, but if it is then as a beginner data scientist, you’d have to find ways to make up for the lack of experience, e.g. by doing freelance work for a while.
Using topic modelling, can we see if there are natural groups within the job descriptions? Can we split apart our data in some semantic way?
Using Latent Dirichlet Allocation and pyLDAvis (here’s my previous post on it), I determined a few emergent topics of interest. It’s important to note that, since this is an unsupervised approach, there’s a strong chance that the outputs are mostly noise, and not useful insights. However, guided by domain knowledge and other pieces of information in this dataset, we can infer at least 3 useful topics.
Client and Business-centric — (Fig 3.10) Roles heavily featuring this topic are focused more on delivering insights towards customers and using tools such as dashboards, excel, (power) ‘bi’ and thus providing analytical insights for the stakeholders.
Fig 3.10— Client and Business centric
Development and Deep Learning— (Fig 3.11) This topic and associated job roles are focussed on development programming languages (‘java’), specific packages used for deep learning (‘tensorflow’, ‘pytorch’), niche areas (‘NLP’, ‘neural’ (networks)) and mentions ‘development’, ‘processing’ and ‘product’. This topic corresponds strongly to a lot of ML jobs.
Fig 3.11—Development and Deep Learning
Academic & Scientific (Fig 3.12) — there’s a very strong association with this topic and terms such as ‘university’ and ‘research’ — more so than for any other topic! Also the only other topic with a strong association with ‘AI’, ‘novel’, ‘publication’ and ‘academic’ is Topic 1 — Deep Learning and Development.
Fig 3.12 — Academic and Scientific roles
My plan is to repeat this project in late Q1 2021 with a fresh batch of job post data and to improve the functionality of the web scraper to be able to detect tags for things such as “Remote working”. I hope that this exploratory analysis proves useful to some people, although I repeat that all findings should be considered in light of the unstructured and semi-rigorous nature of the work. None of these findings can/should be interpreted as conclusive, only preliminary. As said before, I appreciate any feedback and any claps! If I had to summarise the most important advice for aspiring data scientists in one bullet point it would be this:
Invest most time into mastering Python and SQL, play the long game and lower your initial expectations of what pay or job you’ll get, with the realistic hope that 2 years of building experience in your less-than-ideal job will pay off. Don’t spend too much time initially on niche, fancy areas like deep learning or NLP; you can revisit those at a later stage. Knowledge of the basics, problem-solving ability and work experience are what pay off most.
Thank you.