
Data Scientist and Analyst

Leyton - Consulting

Jun 2023 - Sept 2023

Casablanca, Morocco

1. Introduction

Overview

Leyton is a consulting firm that specializes in helping companies leverage tax credits and funding opportunities, particularly those related to research and development (R&D). The goal of this project was to automate the identification of potential clients who could benefit from tax credits and funding. By creating a pipeline to extract relevant information from company websites and classify their industries, this project aimed to enhance Leyton's capabilities in identifying suitable clients quickly and efficiently.

Objective

The core objective of this project was to develop a data-driven approach to identify companies with potential R&D activities and classify them into various industries. This included building a web scraping pipeline to collect relevant data from company websites, improving an industry classification model to enhance accuracy, and optimizing the entire workflow to make it efficient and scalable.

Scope

The project was composed of several key components:

  1. A web scraping pipeline to gather relevant content from company websites.
  2. An industry classification model to categorize companies based on the extracted information.
  3. Optimization of the data pipeline to scale for large volumes of companies.
  4. API integration to provide the extracted information and classifications to Leyton's consulting teams.

2. Web Scraping Pipeline

Purpose

The web scraping pipeline was designed to automate data collection from company websites. The aim was to extract relevant information that could help consultants determine whether a company might be eligible for tax credits or funding opportunities, particularly in R&D.

Process

  1. Accessing Websites

    The pipeline began by fetching company websites and navigating them with web scraping tools, primarily BeautifulSoup. To access these sites reliably, several user agents were rotated to mimic common web browsers, ensuring compatibility and avoiding server-side blocks. The user-agent list needed regular updates as websites changed their blocking rules. A sketch of this step appears after this list.

  2. Extracting Links

    After gaining access to a website, the scraper collected all the links (sub-pages) available. These links were compiled into a list for further processing.

  3. Keyword Selection

    To identify the relevant pages, a set of keywords was needed. Two keyword dictionaries, one for English websites and one for French websites, were built through a systematic approach:

    1. A list of all company websites from the custom dataset was gathered.
    2. Sub-pages from these websites were scraped, and all words from the links were extracted.
    3. Using TF-IDF (Term Frequency-Inverse Document Frequency), the most informative words were ranked. From the top 500 words, about 20 were hand-picked for each language (see the TF-IDF sketch after this list).
  4. Text Extraction

    Pages that contained keywords from the dictionaries were considered relevant. Specifically, pages like "About Us" or any descriptive content from the main page were targeted. The scraper then extracted text from these pages and combined it into a single document per company.

  5. Text Summarization

    Using a summarization model from Hugging Face, the combined text was summarized to provide a concise description of the company. If the website was in French, the extracted text was first summarized and then translated into English to make it usable by subsequent models. A sketch of this step also follows the list.
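
To make the flow concrete, here is a minimal sketch of the access, link-collection, filtering, and text-extraction steps (steps 1, 2, and 4). The user agent string, keyword subset, and helper names are illustrative assumptions, not Leyton's actual code.

```python
# Sketch of fetching a company site, collecting its links, filtering them by
# keyword, and combining the text of the relevant pages into one document.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # rotated in practice
KEYWORDS_EN = {"about", "services", "research", "technology"}          # illustrative subset

def get_soup(url: str) -> BeautifulSoup:
    """Fetch a page with a browser-like user agent and parse it."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def collect_links(base_url: str) -> list[str]:
    """Gather all sub-page links found on the landing page."""
    soup = get_soup(base_url)
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def relevant_links(links: list[str], keywords: set[str]) -> list[str]:
    """Keep only links whose URL contains one of the keywords."""
    return [link for link in links if any(kw in link.lower() for kw in keywords)]

def extract_company_text(base_url: str) -> str:
    """Combine the visible text of all relevant sub-pages into one document."""
    texts = []
    for url in relevant_links(collect_links(base_url), KEYWORDS_EN):
        try:
            texts.append(get_soup(url).get_text(separator=" ", strip=True))
        except requests.RequestException:
            continue  # skip pages that fail to load
    return " ".join(texts)
```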
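
The keyword-selection step (step 3) can be sketched with scikit-learn's TfidfVectorizer. The link tokenization and cut-off values below are assumptions; only the general approach (rank link tokens by TF-IDF, then hand-pick about 20 per language) comes from the project.

```python
# Sketch of deriving candidate keywords from sub-page links with TF-IDF.
# `links_per_site` is assumed to hold one list of link URLs per company website.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def link_tokens(links: list[str]) -> str:
    """Split a site's link URLs into a single whitespace-separated token string."""
    words = []
    for link in links:
        words.extend(re.split(r"[/\-_.?=]+", link.lower()))
    return " ".join(w for w in words if w.isalpha())

def top_keywords(links_per_site: list[list[str]], n_top: int = 500) -> list[str]:
    """Rank tokens across all sites by their summed TF-IDF weight."""
    documents = [link_tokens(links) for links in links_per_site]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    scores = matrix.sum(axis=0).A1                 # total weight per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)
    return [term for term, _ in ranked[:n_top]]    # ~20 were then hand-picked per language
```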
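
The summarization and translation step (step 5) maps naturally onto Hugging Face pipelines. The write-up does not name the exact checkpoints, so the model names below are placeholders rather than the ones actually used.

```python
# Sketch of summarization followed by FR -> EN translation using Hugging Face
# pipelines. Checkpoint names are illustrative; for French input a French-capable
# summarization checkpoint would be needed.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def summarize_company(text: str, is_french: bool = False) -> str:
    """Summarize the combined page text, translating French output into English."""
    snippet = text[:3000]  # rough heuristic to stay within the model's input limit
    summary = summarizer(snippet, max_length=150, min_length=40, do_sample=False)[0]["summary_text"]
    if is_french:
        summary = translator(summary)[0]["translation_text"]
    return summary
```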

Challenges

  • User Agents: Websites often blocked certain user agents, requiring frequent updates and testing to maintain access.
  • Multi-Language Handling: A translator model was used to convert French text into English, ensuring all data was in a uniform language format.
  • Rate Limiting: To avoid being blocked, the scraper waited between requests and retried failed requests, as sketched below.
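
A minimal sketch of that rate-limiting behaviour: a delay between requests and a few retries on failure. The delay length and retry count are assumptions.

```python
# Sketch of polite fetching: sleep between attempts and retry a few times.
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # see the scraping sketch above

def fetch_with_retries(url: str, retries: int = 3, delay: float = 2.0):
    """Try a request up to `retries` times, sleeping between attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            time.sleep(delay * (attempt + 1))  # back off a little more each time
    return None  # give up after the final attempt
```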

3. Industry Classification Model

Initial State

  • The custom dataset used for training contained the following columns: company_website, text_extracted, summary, and industries. The industries column listed the industries each company was classified under, based on a manual review of the website. The classification followed the GICS (Global Industry Classification Standard), which consists of 69 industries.
  • The initial industry classification model used was intfloat/e5-base-v2 from Hugging Face, which achieved an F1-Score of 52% on the dataset.
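
The write-up does not detail how intfloat/e5-base-v2 was wired up for multi-label classification; one plausible setup, shown purely as an assumption, is to attach a multi-label classification head via the transformers library and fine-tune it on the labelled summaries.

```python
# One plausible (assumed) setup: e5-base-v2 with a multi-label classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

N_INDUSTRIES = 69  # GICS industries
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "intfloat/e5-base-v2",
    num_labels=N_INDUSTRIES,
    problem_type="multi_label_classification",  # sigmoid output, one score per industry
)
# The classification head is randomly initialized and still has to be
# fine-tuned on the labelled company summaries.

def predict_industries(summary: str, threshold: float = 0.5) -> list[int]:
    """Return indices of industries whose predicted probability exceeds the threshold."""
    inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits).squeeze(0)
    return [i for i, p in enumerate(probs) if p > threshold]
```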

Improvements

  1. Model Splitting
    • To improve the model's performance, it was decided to split the industry classification task into three separate models. Each model was responsible for approximately 23 industries, making the classification problem more manageable.
    • The split was determined through an iterative process of testing which industries worked best together. Industries that were difficult for the model to distinguish were assigned to different models to improve accuracy.
  2. Custom Evaluator
    • A custom evaluation approach was developed to assess model performance. Using classification metrics from sklearn, the F1-Score, precision, and recall were calculated at thresholds from 0.05 to 0.95 in steps of 0.05, which helped identify the optimal decision threshold for each model (see the threshold-sweep sketch after this list).
  3. Outcome
    • After splitting the classification task and tuning the decision thresholds, the F1-Score improved from 52% to 67%, a substantial gain in classification quality.
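
The custom evaluator described above can be sketched as a simple threshold sweep over sklearn metrics. The micro-averaging choice and the array shapes are assumptions.

```python
# Sweep decision thresholds 0.05, 0.10, ..., 0.95 and keep the best F1.
# `y_true` and `y_prob` are assumed to be (n_samples, n_industries) arrays of
# binary labels and predicted probabilities.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def sweep_thresholds(y_true: np.ndarray, y_prob: np.ndarray):
    """Return (best_threshold, best_f1) over a grid of decision thresholds."""
    best_threshold, best_f1 = 0.5, 0.0
    for threshold in np.arange(0.05, 1.00, 0.05):
        y_pred = (y_prob >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
        precision = precision_score(y_true, y_pred, average="micro", zero_division=0)
        recall = recall_score(y_true, y_pred, average="micro", zero_division=0)
        print(f"t={threshold:.2f}  F1={f1:.3f}  P={precision:.3f}  R={recall:.3f}")
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```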

Handling Imbalanced Data

The dataset was imbalanced, with some industries heavily overrepresented. Undersampling the dominant industries and experimenting with different industry groupings were both tried, though splitting the model proved the most effective; a minimal undersampling illustration follows.
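
A minimal illustration of the undersampling experiment, assuming the multi-label data is flattened into one row per (company, industry) pair; the per-industry cap is an arbitrary placeholder.

```python
# Cap the number of examples kept per industry (simplified single-label view).
import pandas as pd

def undersample(df: pd.DataFrame, max_per_industry: int = 500, seed: int = 42) -> pd.DataFrame:
    """Keep at most `max_per_industry` rows for each industry."""
    return (
        df.groupby("industry", group_keys=False)
          .apply(lambda g: g.sample(min(len(g), max_per_industry), random_state=seed))
          .reset_index(drop=True)
    )
```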

4. Pipeline Optimization and Scaling

Processing Time Reduction

The goal of optimizing the pipeline was to reduce the manual effort required of consultants. Running the classification pipeline on 100,000 companies took roughly 3% of the time manual processing would require, a 97% reduction.

Infrastructure

Leyton's GPU server was used to speed up the training and inference processes. This hardware capability allowed for much faster model training, which was crucial for handling a large dataset.

Validation

Validation was performed using 10% of the dataset as a test set. The models were evaluated based on classification accuracy and F1-Score, and performance metrics were logged to ensure consistency across different iterations of the model.
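
The hold-out evaluation can be sketched as follows; the array names, the injected train_and_predict callable, and the log destination are illustrative assumptions.

```python
# Hold out 10% of the labelled companies, evaluate, and log the scores per run.
import logging
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

logging.basicConfig(filename="evaluation.log", level=logging.INFO)

def evaluate_split(X: np.ndarray, y: np.ndarray, train_and_predict) -> None:
    """Split off a 10% test set, train, predict, and log the metrics."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
    y_pred = train_and_predict(X_train, y_train, X_test)
    accuracy = accuracy_score(y_test, y_pred)                      # exact-match accuracy
    f1 = f1_score(y_test, y_pred, average="micro", zero_division=0)
    logging.info("accuracy=%.3f  f1=%.3f", accuracy, f1)
```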

5. API Integration

Purpose

The FastAPI framework was used to build an API that served the results of the web scraping and classification processes. The API provided the summarized company description and the industry classifications, making it easy for Leyton's consultants to access the information.

Implementation

The API had endpoints to return both the company summary and the list of industries for each company. While a user interface was not yet implemented, the API laid the groundwork for easy integration into future applications that consultants might use.
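
A minimal sketch of what such an API could look like with FastAPI. The endpoint paths, response fields, and the lookup_company helper are assumptions; the write-up only states that the API returned the summary and the industry list per company.

```python
# Illustrative FastAPI app exposing the company summary and industry list.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Company classification API")

class CompanyInfo(BaseModel):
    website: str
    summary: str
    industries: list[str]

def lookup_company(website: str) -> CompanyInfo | None:
    """Placeholder for fetching pre-computed results (e.g. from a database)."""
    ...

@app.get("/companies/{website}/summary")
def get_summary(website: str) -> dict:
    company = lookup_company(website)
    if company is None:
        raise HTTPException(status_code=404, detail="Company not found")
    return {"website": company.website, "summary": company.summary}

@app.get("/companies/{website}/industries")
def get_industries(website: str) -> dict:
    company = lookup_company(website)
    if company is None:
        raise HTTPException(status_code=404, detail="Company not found")
    return {"website": company.website, "industries": company.industries}
```

Served locally with `uvicorn main:app --reload`, FastAPI also exposes interactive documentation at /docs, which is convenient for non-developer users.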

Future Directions

A user interface (UI) could be built on top of this API to make the information more accessible to consultants. Such a UI could include features like search, filter, and interactive company profiles to streamline consultant workflows.

6. Challenges and Lessons Learned

Challenges

  • Project Handoff: Taking over the project from the previous team member took time, as it required understanding the existing code and infrastructure.
  • Tool Familiarity: As this was the first hands-on experience with tools like Docker and FastAPI, there was a learning curve. Learning through online resources and hands-on experimentation was necessary.
  • Iterative Model Tuning: Finding the right industry groupings and optimizing model parameters was a time-consuming and iterative process.

Key Takeaways

  • Iterative Improvement: Iterative testing and validation were key to improving the model's performance.
  • Importance of Automation: Automating manual tasks like company classification and data extraction can significantly enhance operational efficiency, providing tangible benefits for consulting workflows.
  • Real-World Tools: Gaining experience with tools like Docker, FastAPI, and GPU servers was invaluable, as it provided practical skills needed in a real-world data science setting.

7. Impact and Conclusion

Impact

  • The project provided significant efficiency gains for Leyton's consultants by automating the collection and classification of company information. By providing a summary of company activities along with an industry classification, consultants were better equipped to identify potential clients quickly.
  • The improved F1-Score of 67% meant a much higher quality of classification, reducing the time needed to manually review and validate potential clients.

Conclusion

The internship project successfully addressed the challenges associated with automating company classification for potential R&D funding opportunities. It laid a strong foundation for future work, including further scaling the dataset, improving the classification model, and integrating a user-friendly UI.

8. Appendix

Technical Details

  • Libraries and Tools: BeautifulSoup, TF-IDF (scikit-learn), Hugging Face Transformers (NLP models), FastAPI, Docker, GPU server for training and inference.

Dataset Statistics

  • The dataset contained over 100,000 companies, with data categorized into 69 industries as per GICS standards.

References

  • Hugging Face Models: intfloat/e5-base-v2 for industry classification; summarization and translation models for generating company descriptions.
  • Scikit-learn Metrics: Used for calculating F1-Score, precision, and recall.