Exploring the H-1B visa dataset
As part of my course Data Analysis with Python: Zero to Pandas, I analyzed the H-1B visa dataset from Kaggle. The goal of the assignment was to find underlying patterns in the data and to present the insights graphically.
What is the H-1B visa?
The H-1B is a United States visa that allows U.S. employers to temporarily employ foreign workers in specialty occupations. A foreign national who wants to work in the United States in such an occupation needs an H-1B visa.
We start by looking at the structure of the dataset (the number of columns and their data types).
We then peek at the first five rows to get an idea of the data we are going to analyze.
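A minimal sketch of these two inspection steps with pandas. The rows are made up for illustration and only stand in for the Kaggle CSV; apart from CASE_STATUS, the column names (JOB_TITLE, FULL_TIME_POSITION, PREVAILING_WAGE, YEAR) are assumptions about the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real CSV; column names other than
# CASE_STATUS are assumptions.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", "CERTIFIED", None, "WITHDRAWN", "CERTIFIED"],
    "JOB_TITLE": ["ANALYST", "ENGINEER", "ENGINEER", "ANALYST", "TEACHER", "ENGINEER"],
    "FULL_TIME_POSITION": ["Y", "N", "Y", "Y", "Y", "Y"],
    "PREVAILING_WAGE": [60000.0, 75000.0, 82000.0, 58000.0, 50000.0, 90000.0],
    "YEAR": [2011, 2012, 2013, 2014, 2015, 2016],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
print(n_rows, n_cols)
print(df.dtypes)            # data type of each column
df.info()                   # non-null count and dtype per column

first_five = df.head()      # head() returns the first five rows by default
print(first_five)
```

With the real file, the data frame would come from `pd.read_csv` on the downloaded CSV instead of being built by hand.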
The next logical step is to find the different case statuses of the applicants, which helps us explore the data better.
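One way to list the case statuses is `value_counts` with `dropna=False`, so that missing values are counted too. The status strings below are made-up examples, not the real counts.

```python
import pandas as pd

# Made-up statuses, including one missing value.
statuses = pd.Series(
    ["CERTIFIED", "CERTIFIED", "DENIED", None, "WITHDRAWN", "CERTIFIED"],
    name="CASE_STATUS",
)

# dropna=False keeps the missing values as their own row in the result.
status_counts = statuses.value_counts(dropna=False)
print(status_counts)

# nunique() ignores missing values by default.
n_statuses = statuses.nunique()
print(n_statuses)
```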
We can see that there are seven different case statuses and some NaN values. For the purposes of this project, we drop the rows whose CASE_STATUS is missing.
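Dropping the rows with a missing CASE_STATUS could look like the sketch below. The data is made up, and YEAR is an assumed column name.

```python
import pandas as pd

# Made-up rows; the second one has no case status.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", None, "DENIED", "WITHDRAWN"],
    "YEAR": [2011, 2012, 2013, 2014],
})

# subset= restricts the check to CASE_STATUS, so rows that are missing
# values in other columns are kept.
cleaned = df.dropna(subset=["CASE_STATUS"])
print(len(df), "->", len(cleaned))
```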
We begin exploring the data and finding insights by asking and answering questions.
We start our analysis with the applicants who have been granted the visa, so we create a separate data frame that contains only them.
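A sketch of building that separate data frame with a boolean mask. The rows are made up, and the exact status string "CERTIFIED" is an assumption about how granted cases are labeled.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", "CERTIFIED", "WITHDRAWN"],
    "YEAR": [2011, 2011, 2012, 2012],
})

# Keep only the certified rows; .copy() makes the subset independent of df,
# so later changes to it do not trigger chained-assignment warnings.
certified = df[df["CASE_STATUS"] == "CERTIFIED"].copy()
print(certified)
```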
Q: Has the number of H-1B visas granted increased or decreased over the years?
The graph above shows that the number of H-1B visas issued has been increasing over the years.
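The numbers behind such a graph can be computed by counting certified rows per year. This is a sketch on made-up data; YEAR is an assumed column name, and the counts say nothing about the real dataset.

```python
import pandas as pd

# Made-up certified applications.
certified = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED"] * 6,
    "YEAR": [2011, 2012, 2012, 2013, 2013, 2013],
})

# Count rows per year, then sort by year so the series reads chronologically.
certified_per_year = certified["YEAR"].value_counts().sort_index()
print(certified_per_year)

# True when each year's count is at least as large as the previous year's.
is_increasing = certified_per_year.is_monotonic_increasing
print(is_increasing)
```

Calling `certified_per_year.plot(kind="bar")` would draw the counts as a bar chart, provided matplotlib is installed.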
Q: Does having a full-time job position influence your chance of landing the visa?
The histogram shows that most people holding H-1B visas work in full-time positions.
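The split between full-time and part-time positions can also be read as proportions. In this sketch, the FULL_TIME_POSITION column and its "Y"/"N" values are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up certified applications; "Y" marks a full-time position.
certified = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED"] * 4,
    "FULL_TIME_POSITION": ["Y", "Y", "Y", "N"],
})

# normalize=True returns fractions of the total instead of raw counts.
full_time_share = certified["FULL_TIME_POSITION"].value_counts(normalize=True)
print(full_time_share)
```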
We return to our original data frame for further analysis.
Q: Of all the applicants in the data, how many were granted the visa?
In our data, a majority of the applicants were granted the visa.
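A sketch of computing that share: the mean of a boolean column is the fraction of True values. The rows are made up and the "CERTIFIED" label is an assumption.

```python
import pandas as pd

# Made-up applications.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "CERTIFIED", "DENIED", "CERTIFIED", "WITHDRAWN"],
})

# Boolean series: True where the application was certified.
is_certified = df["CASE_STATUS"].eq("CERTIFIED")

certified_count = int(is_certified.sum())     # number of certified rows
certified_share = float(is_certified.mean())  # fraction of all rows
print(f"{certified_count} of {len(df)} applications ({certified_share:.0%}) certified")
```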
Another interesting subset is the applicants whose visas were denied. Because most rows in our data are certified applications, any analysis of the entire dataset would be heavily skewed toward them, so we take the denied cases as a separate subset and analyze it further.
Q: Is there a trend in the number of visas denied over the years?
The number of visas denied has been declining over the years.
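The denied subset and its yearly counts could be obtained as sketched below, this time with `groupby` rather than `value_counts`. The rows are made up, and the "DENIED" label and YEAR column are assumptions.

```python
import pandas as pd

# Made-up applications.
df = pd.DataFrame({
    "CASE_STATUS": ["DENIED", "CERTIFIED", "DENIED", "DENIED", "CERTIFIED"],
    "YEAR": [2011, 2011, 2011, 2012, 2012],
})

# Separate data frame holding only the denied applications.
denied = df[df["CASE_STATUS"] == "DENIED"].copy()

# Number of denied applications per year; groupby sorts the years ascending.
denied_per_year = denied.groupby("YEAR").size()
print(denied_per_year)
```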
Q: Do people who have been certified for the visa generally earn higher wages than those whose visas were denied?
The comparison shows that, in this data, the average wage of applicants who were denied the visa is higher than the average wage of applicants who were granted it.
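A sketch of the wage comparison, assuming the wage is stored in a column named PREVAILING_WAGE. The figures are made up. The median is included next to the mean as an addition of this sketch, since a few very large wages can pull a mean upward.

```python
import pandas as pd

# Made-up wages; the denied group contains one very large value.
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "CERTIFIED", "CERTIFIED", "DENIED", "DENIED", "DENIED"],
    "PREVAILING_WAGE": [60000.0, 70000.0, 80000.0, 50000.0, 55000.0, 300000.0],
})

# Mean and median wage for each case status.
wage_by_status = df.groupby("CASE_STATUS")["PREVAILING_WAGE"].agg(["mean", "median"])
print(wage_by_status)
```

In this toy data the denied group has the higher mean but the lower median, which is why it can be worth checking both.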
Q: What are the top 10 jobs among applicants whose case status is “Withdrawn”?
The top 10 jobs of applicants whose case status is “Withdrawn” are listed below.
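One way to produce such a list is to filter on the status and count job titles. The JOB_TITLE column name, the "WITHDRAWN" label, and the rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up applications.
df = pd.DataFrame({
    "CASE_STATUS": ["WITHDRAWN"] * 6 + ["CERTIFIED"],
    "JOB_TITLE": [
        "PROGRAMMER ANALYST", "PROGRAMMER ANALYST", "PROGRAMMER ANALYST",
        "SOFTWARE ENGINEER", "SOFTWARE ENGINEER",
        "ACCOUNTANT",
        "PROGRAMMER ANALYST",
    ],
})

withdrawn = df[df["CASE_STATUS"] == "WITHDRAWN"]

# value_counts sorts by frequency, most common first; head(10) keeps at most ten.
top_jobs = withdrawn["JOB_TITLE"].value_counts().head(10)
print(top_jobs)
```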
Conclusion and Inferences
- The data used for the analysis is dominated by the “certified” status; to keep this from skewing the results, we created subsets by case status for the analysis.
- The data covers 2011–2016 and hence may not accurately represent current conditions for H-1B visas.
- In this project we looked for trends among applicants who were granted the visa and among those whose applications were denied or withdrawn.
References
- Official documentation of matplotlib
- Data Analysis with Python: Zero to Pandas by Jovian.ml and FreeCodeCamp
The entire notebook is embedded here, for your reference