Data Scientist and Analyst

Data Science Institute - Research and Education

Sept 2024 - Present

Chicago, IL

1. Project Overview

The 2024-2025 Invenergy Project aims to create a generative AI-powered chatbot focused on energy industry regulatory documents. The project provides a complete end-to-end solution involving data ingestion, text processing, similarity-based retrieval, and an interactive user interface. The core goals include:

  1. Automatically extracting regulatory documents and their metadata.
  2. Leveraging generative AI to respond to user queries contextually.
  3. Providing a web-based interface for users to interact with the chatbot.

This project targets industry professionals seeking quick access to complex regulatory information through conversational AI. The solution will provide source references to ensure transparency and traceability.

Example: Imagine an energy industry professional needs information about a specific regulation's effective date. Instead of searching through multiple PDFs, they can ask the chatbot, "What is the effective date for regulation XYZ?" The chatbot will provide the answer along with the document reference.

2. Project Scope and Objectives

Scope

The project covers three main stages:

  1. Document Ingestion and Processing: Convert energy regulatory PDFs into a structured and searchable format.
  2. Generative AI Chatbot: Build a retrieval-augmented generation (RAG) chatbot that pairs similarity-based document retrieval with a generative model to answer user questions accurately.
  3. Web Interface: Develop a user-friendly interface for querying the chatbot and getting responses.

Example: If a regulatory PDF contains information about different compliance requirements, the system will break this down into structured data fields, allowing users to easily search and retrieve specific sections through the chatbot.
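To make the idea of "structured data fields" concrete, the sketch below models one parsed document as a dataclass. The field names (`docket_number`, `effective_date`, `sections`) are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RegulatoryDocument:
    # Illustrative fields; the project's real schema may differ.
    name: str
    docket_number: str
    effective_date: str  # e.g. ISO format "2024-06-01"
    sections: dict = field(default_factory=dict)  # heading -> section text

doc = RegulatoryDocument(
    name="Order 2023-A",
    docket_number="RM22-14-000",
    effective_date="2024-06-01",
    sections={"Compliance Requirements": "Each transmission provider shall ..."},
)

# A chatbot can then retrieve a specific section by its heading.
print(doc.sections["Compliance Requirements"])
```

Breaking each PDF into named sections like this is what lets the chatbot return a specific passage rather than a whole document.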

Objectives

  • Flexible Integration: Ensure modular components (data ingestion, model generation, and interface) can be reused for different use cases.
  • Confidentiality: Keep the scraped data private while making the model-generation components publicly accessible.
  • Cited Responses: Provide references for all chatbot responses to ensure credibility.
  • Scalable Deployment: Allow easy deployment on platforms such as Databricks or Microsoft Teams.
  • Manual and Automated Data Updates: Support manual ingestion of documents with an option for future automated updates.

3. System Architecture

Components

  • Data Ingestion: PDF documents are manually downloaded and processed to extract metadata and text. The metadata fields include effective dates and docket numbers, and the extracted data is saved in a structured format for further processing.

    Example: A document about market regulations is manually downloaded and processed to extract relevant information like dates and identifiers. This information is saved in a structured format, making it easier to search later.

  • Data Storage: The processed data is loaded into a relational database, ensuring it conforms to a predefined schema for consistency.

    Example: After extracting metadata and text, the data is saved in a database where each document is assigned fields such as document name, date, docket number, and text content. This allows for organized retrieval later.

  • Embedding Generation: A pre-trained model is used to generate text embeddings. These embeddings are stored in the database to facilitate similarity matching.

    Example: The text from each document is converted into a numerical vector (embedding) that helps identify how similar it is to other documents. This is useful when users submit queries to find related information.

  • Similarity Matching and Query Handling: User queries are processed through pipelines that compute the similarity between query embeddings and document embeddings; the top matches are then passed to a generative model, which produces refined, context-aware answers.

    Example: If a user asks, "What are the compliance requirements for renewable energy in 2024?" the system finds the most similar documents and provides a summarized answer using a generative model.

  • Web Interface: The project leverages an intuitive chat interface, allowing users to submit questions and receive responses interactively.

    Example: Users can type in questions like, "Show me all regulations effective in 2024," and receive immediate answers in a chat format, with links to the source documents.
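The embedding and similarity-matching components above can be sketched end to end. Since the writeup does not name the pre-trained embedding model, this toy version uses simple term-frequency vectors in its place; the cosine-similarity ranking logic is the part the sketch is meant to illustrate:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; the real system would call a pre-trained model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "doc_a": "compliance requirements for renewable energy interconnection",
    "doc_b": "market regulations for natural gas pipelines",
}

query = "renewable energy compliance requirements"
q = embed(query)

# Rank documents by similarity to the query embedding.
ranked = sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)
print(ranked[0])  # doc_a, the document about renewable energy compliance
```

In the real system the document embeddings are precomputed and stored in the database, so only the query needs to be embedded at request time.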

4. Data Flow

  1. Manual Data Collection: PDFs are manually downloaded and placed in a specified directory.

    Example: Regulatory documents are downloaded from government websites and added to the system's data directory for processing.

  2. Data Extraction: Text and metadata are extracted from the PDFs and saved in a structured format.

    Example: The system extracts text, dates, and identifiers from each PDF to create a structured dataset that can be queried later.

  3. Data Loading: The structured data is processed and loaded into a relational database.

    Example: The extracted information is loaded into a database to enable efficient retrieval of specific documents or sections.

  4. Metadata Analysis: The extracted data is validated, checking for coverage and consistency of metadata fields.

    Example: The system checks if all documents have valid dates and docket numbers, and flags any inconsistencies for manual review.

  5. Query Processing: User queries are submitted via the web interface, where the appropriate pipeline processes the request, performs similarity matching, and generates responses.

    Example: A user asks, "What are the requirements for interconnection standards?" The system matches the query to relevant documents and generates an informative response.
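The extraction step (step 2 above) can be sketched with regular expressions. The patterns below are illustrative assumptions; real filings vary in wording, which is exactly why the later validation stage measures pattern coverage:

```python
import re

# Illustrative patterns; the project's actual regexes may differ.
DATE_RE = re.compile(r"effective\s+(?:date[:\s]+)?(\w+ \d{1,2}, \d{4})", re.IGNORECASE)
DOCKET_RE = re.compile(r"Docket\s+No\.?\s*([A-Z]{2}\d{2}-\d+-\d{3})")

def extract_metadata(text):
    # Pull the first effective date and docket number found, else None.
    date = DATE_RE.search(text)
    docket = DOCKET_RE.search(text)
    return {
        "effective_date": date.group(1) if date else None,
        "docket_number": docket.group(1) if docket else None,
    }

sample = "Docket No. RM22-14-000. This rule will become effective June 3, 2024."
meta = extract_metadata(sample)
print(meta)  # {'effective_date': 'June 3, 2024', 'docket_number': 'RM22-14-000'}
```

Records where either field comes back as `None` are the ones flagged for manual review in the metadata-analysis step.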

5. Infrastructure Overview

Containerization

The project components are managed as containers:

  • Database: Stores document data and embeddings.
  • Generative AI Model: Generates context-aware responses.
  • Pipelines: Handle query processing and data retrieval.
  • Data Processing App: Runs custom scripts for data ingestion and processing.
  • Database Management Interface: Provides a graphical interface for database management.

Example: Each part of the system runs in its own container, making it easy to manage and scale the project components. The database container stores extracted data, while the generative model container provides context-aware answers.

Automation

Automation scripts manage container setup, data ingestion, and model operations, allowing for easy setup and orchestration.

Example: A script is run to set up all components, load the data, and ensure everything is connected properly. This makes the deployment process much more streamlined.

6. Pipeline Descriptions

Generative Pipeline

This pipeline handles user queries by:

  1. Calculating similarity between query embeddings and document embeddings.
  2. Sending the context (top similarity matches) to the generative model.
  3. Returning a response generated by the model based on the query and context.

Example: If a user asks, "What changes were made to renewable energy policies in 2023?" the pipeline finds the most relevant documents and uses a generative model to provide a summary, along with references to specific sections.
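Step 2 of the pipeline, sending the top matches to the generative model as context, amounts to prompt assembly. This sketch shows one plausible shape for it; the instruction wording and the `call_model` stand-in are assumptions, since the writeup does not specify the model or prompt format:

```python
def build_prompt(query, matches):
    # Concatenate the top matches, each tagged with its source reference,
    # so the model can cite documents in its answer.
    context = "\n\n".join(
        f"[{m['name']}, Docket {m['docket']}]\n{m['excerpt']}" for m in matches
    )
    return (
        "Answer the question using only the context below, and cite the "
        "bracketed document references you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

matches = [
    {"name": "Order 2023-A", "docket": "RM22-14-000",
     "excerpt": "Renewable energy policies were revised to ..."},
]
prompt = build_prompt(
    "What changes were made to renewable energy policies in 2023?", matches
)
# In the real pipeline this prompt would be sent to the generative model:
# answer = call_model(prompt)
print(prompt)
```

Embedding the source tags directly in the context is what makes the "cited responses" objective cheap to enforce: the model only ever sees passages that already carry their references.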

Similarity-Only Pipeline

This simpler pipeline focuses on similarity matching without generating additional context-based answers. It is best suited for retrieving relevant document excerpts directly in response to user queries.

Example: A user asks, "Show me documents related to energy transmission." The pipeline retrieves the most similar documents but does not generate a summarized response.

7. Analysis and Validation

  • Extraction Success: The extracted metadata is analyzed to calculate the success rate for each document and identify inconsistencies.

    Example: If 90% of the documents have valid effective dates extracted, the extraction success rate is reported as 90%, helping identify areas for improvement.

  • Pattern Coverage: Regex patterns are validated to ensure comprehensive coverage of metadata fields such as effective dates and docket numbers.

    Example: The system checks if the regex patterns are correctly identifying effective dates in all documents and flags those that do not match.

  • Consistency Check: The metadata consistency within each document is assessed, flagging any discrepancies for manual review.

    Example: If a document contains multiple conflicting dates, it is flagged for review to ensure the information is accurate.
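The extraction-success and consistency checks above reduce to simple aggregates over the extracted metadata. A minimal sketch, using made-up records and illustrative field names:

```python
# Toy metadata records standing in for the real extraction output.
records = [
    {"doc": "a.pdf", "effective_date": "2024-06-01", "docket": "RM22-14-000"},
    {"doc": "b.pdf", "effective_date": None,         "docket": "RM22-15-000"},
    {"doc": "c.pdf", "effective_date": "2024-01-15", "docket": None},
]

def success_rate(records, field_name):
    # Fraction of documents where the field was successfully extracted.
    hits = sum(1 for r in records if r[field_name] is not None)
    return hits / len(records)

# Documents missing any field get flagged for manual review.
flagged = [r["doc"] for r in records
           if r["effective_date"] is None or r["docket"] is None]

print(f"date coverage:   {success_rate(records, 'effective_date'):.0%}")
print(f"docket coverage: {success_rate(records, 'docket'):.0%}")
print("flag for review:", flagged)  # ['b.pdf', 'c.pdf']
```

Reporting the rates per field, rather than one overall number, points directly at which regex pattern needs widening.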

8. Challenges and Future Work

  • Manual Data Ingestion: Currently, the data ingestion is a manual download process. Future work includes implementing automated web scraping for regulatory documents to improve scalability and efficiency.

    Example: Automating the download of regulatory documents from trusted sources would save time and ensure that the latest information is always available.

  • Feedback Integration: A mechanism to collect user feedback on chatbot responses could enhance model fine-tuning and improve accuracy.

    Example: Adding a thumbs-up/thumbs-down feature to chatbot responses would help gather user feedback and improve the quality of answers over time.

  • Performance Monitoring: Adding monitoring tools to track API response times, similarity matching accuracy, and overall system performance would help maintain reliability at scale.

    Example: Monitoring how quickly the chatbot responds to user queries can help identify bottlenecks and optimize system performance.

9. Conclusion

The 2024-2025 Invenergy Project aims to streamline access to complex energy regulatory documents through an AI-driven chatbot. By leveraging modular pipelines, embedding-based similarity, and a generative model for context generation, the solution offers an effective means for industry experts to get quick, reliable insights. Further enhancements, such as automated data ingestion and user feedback, will strengthen the robustness and efficiency of the system.

Example: With this solution, an energy professional can easily ask, "What are the current requirements for interconnecting solar energy systems?" and receive an immediate, detailed response, saving time and effort compared to searching through multiple documents manually.