Data Scientist and Analyst
Data Science Institute - Research and Education
Mar 2024 - Jun 2024
Chicago, IL
Project Overview
This project is designed to handle and process workplace injury and illness data provided by a governmental agency, ultimately creating a refined and consolidated dataset that can be used by an application to provide insights into workplace safety. The process starts from raw data ingestion, followed by a series of validation and cleaning steps, culminating in the generation of files that are presented to the end user through an application. This end-to-end process ensures that the data is clean, coherent, and ready for analysis and use in decision-making.
Data Ingestion and Download
The data ingestion process starts by accessing publicly available data from an online government source. This data includes information about workplace incidents, and each file contains extensive records over multiple years. To facilitate downloading, a script is used to systematically access the website, download all necessary data files, and save them locally. The downloaded data often comes in compressed formats, which means the next step involves extracting these files to obtain structured data in formats that are easier to work with, like spreadsheets or tables.
During this step, the data is carefully organized into separate directories that reflect its content. This separation allows the subsequent stages to handle each data type effectively. For example, files are categorized by year and type, ensuring that the data remains easily navigable for future processes.
Initial Validation and Pre-Processing
After downloading and extracting the data, the next step involves an initial validation phase. This phase ensures that the data adheres to specific quality standards. For example, each establishment is required to have unique identification information, which must be verified for consistency. One of the initial tasks is to validate critical identifiers like establishment IDs and company identification numbers. These identifiers are validated for format (e.g., a tax identification number must have a specific number of digits), and any values that do not conform are either corrected or flagged for removal.
Additionally, the data is examined for missing or malformed entries. If certain key attributes are missing (e.g., establishment name or address), those rows might be corrected using secondary information, or in cases where the missing data is critical, those rows are removed entirely. A priority of this validation step is to maintain data integrity so that all subsequent analysis is based on reliable information.
Data Cleaning and Standardization
Once the initial validation is complete, the data undergoes a series of transformations designed to clean and standardize it. Cleaning involves fixing inconsistencies, like different formats of company names (e.g., "Company LLC" versus "Company Limited"), while standardization involves transforming attributes so that all data adheres to a consistent format. This helps ensure consistency across records and minimizes discrepancies.
For example, addresses are standardized by breaking down the street, city, state, and zip code fields. Once these components are standardized, the dataset undergoes a merging process with geocoded data, which adds precise geographical attributes to each establishment. This geocoding data allows the system to categorize each establishment by its precise location, which is essential for further analysis and insights.
During this phase, entity resolution is also performed to identify duplicate records and merge them into a single, consolidated record. For example, if two records share the same address, establishment name, and company name but appear under slightly different IDs, they are merged into a single unified record, thereby reducing redundancy in the dataset.
Assigning Unique Identifiers and Mapping Relationships
Another key aspect of the data preparation process is to assign unique identifiers to establishments and companies. Initially, the raw data lacks a consistent company identifier, meaning that the same company might appear multiple times with slight variations in its name or ID. To address this, the project assigns a unique company identifier based on specific criteria, such as matching the company name, location, and other attributes.
Similarly, a clean establishment identifier is created to uniquely distinguish each physical location, even if multiple companies share similar names or operate under similar attributes. The cleaned and validated dataset is then used to create a mapping between the raw and clean identifiers, which can be used to track records back to their original data if needed.
Data Transformation and Aggregation
The cleaned and standardized data is next transformed to make it more useful for analysis and presentation. This involves aggregating key metrics such as injury rates, employee counts, and hours worked at the establishment level. Injury rates, for instance, are calculated based on the number of reported incidents relative to the total hours worked. This provides a normalized measure of workplace safety that can be compared across different establishments or industries.
To further enrich the dataset, the project also includes information about industry classification codes (such as NAICS codes) that categorize each establishment based on its primary activity. This information is used to calculate industry averages for key metrics, allowing individual establishments to be compared to industry benchmarks. The industry data is sourced from multiple government-provided datasets and merged into the overall data pipeline.
Applying Data Filters and Cutoffs
Once the data has been transformed and aggregated, a series of filtering criteria are applied to ensure that only relevant establishments are retained. For instance, establishments that have very low employee counts or that have not reached a certain threshold of hours worked may be filtered out from the final dataset. The idea behind these filters is to focus on establishments with sufficient data to provide meaningful insights while excluding smaller entities whose data might be more erratic or less reliable.
For example, establishments with fewer than fifty employees may be excluded from the analysis because they do not meet the threshold that ensures reliable comparison with larger companies. Similarly, establishments that have reported fewer than 100,000 total hours worked might also be excluded, as this represents a volume too low to be statistically significant.
Generating Output Files
After filtering, the cleaned, transformed, and validated data is used to generate a set of output files. These files include comprehensive lists of establishments, companies, counties, and other geographic data, as well as specific files detailing workplace injury rates and related statistics. Each output file is organized in a way that supports easy ingestion by the subsequent application.
The output files serve different purposes:
- Company and Establishment Files: These contain standardized data about companies and establishments, including unique identifiers, addresses, and industry codes.
- Workplace Injury Data: This file contains aggregated information on workplace incidents, including injury types, counts, and rates.
- Mapping Files: These files allow users to map back from cleaned data to raw data, maintaining a link between the original data and its processed form.
Data Integration and Website Presentation
The final step of the process involves integrating the cleaned and processed data into a web-based application. This application is designed to provide insights into workplace safety by visualizing the injury and illness data in a user-friendly manner. The web interface allows users to explore data across different dimensions, such as by industry, by geographic location, or by year.
The website includes interactive visualizations like charts, tables, and maps that help users quickly identify trends in workplace safety. For example, a user might be able to select a particular county or industry and view the average injury rates compared to national or industry-specific averages. This kind of visualization can be invaluable for companies looking to benchmark their performance, for regulators assessing industry compliance, or for workers interested in understanding the safety performance of their employers.
The web application uses the output files generated by the data pipeline to present accurate, up-to-date information. The mapping files ensure that even if a user wants to trace a particular value back to its raw form, they can do so seamlessly, providing transparency and accountability.
Conclusion
This project provides a comprehensive solution to managing and analyzing workplace injury data. The process starts from downloading raw data, validates and cleans it, transforms it for consistency and comparability, and ultimately generates output files that feed into an application designed to present this information visually. The entire pipeline is designed to ensure data accuracy, usability, and reliability, supporting informed decision-making in workplace safety.
The emphasis on validation, cleaning, and standardization ensures that the final dataset is of high quality and ready for meaningful analysis. The integration into a web application further enhances the usability of the data, making it accessible to a wide range of users with varying needs.