ml_automation Module¶
Total files documented: 8
ml_automation/cleantxt.py¶
Purpose¶
This file, cleantxt.py, contains a function to clean and preprocess text data for machine learning tasks. It removes punctuation, converts text to lowercase, and removes numbers.
Key Responsibilities¶
- Clean and preprocess text data for machine learning tasks
- Remove punctuation and special characters
- Convert text to lowercase
- Remove numbers from text
Important Functions¶
clean_text(text)¶
This function takes in a string of text as input and returns the cleaned text. It performs the following operations:
* Converts the text to lowercase using the lower() method
* Removes punctuation and special characters using regular expressions (re.sub())
* Removes numbers from the text using regular expressions (re.sub())
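A minimal sketch of what clean_text may look like, reconstructed from the operations listed above; the exact regular expressions and whitespace handling are assumptions, since the source is not shown:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase text, strip punctuation/special characters, and drop numbers."""
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation and special characters
    text = re.sub(r"\d+", "", text)           # remove numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```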
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It is likely used in conjunction with other files to preprocess and clean text data before it is fed into machine learning models for analysis and prediction.
ml_automation/database.py¶
Purpose¶
This file, database.py, is responsible for interacting with a SQL Server database to extract and update data related to project instances. It uses the pyodbc library to connect to the database and pandas to manipulate the data.
Key Responsibilities¶
- Establish a connection to the SQL Server database
- Extract data from the database based on the last extracted ProjectID for each instance
- Update the log file with the last extracted ProjectID for each instance and the current timestamp
- Append new data to the existing CSV file or create a new one if it does not exist
Important Functions¶
- read_last_extraction(): Reads the last extracted ProjectID for each instance from the log file.
- update_last_extraction(log_data): Updates the log file with the last extracted ProjectID for each instance and the current timestamp.
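The two log functions could be sketched as follows; the JSON log format, file name, and exact fields are assumptions, since the source is not shown:

```python
import json
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("extraction_log.json")  # hypothetical log location

def read_last_extraction() -> dict:
    """Return {instance: last extracted ProjectID} from the log, or {} if absent."""
    if LOG_FILE.exists():
        return json.loads(LOG_FILE.read_text())
    return {}

def update_last_extraction(log_data: dict) -> None:
    """Write the last extracted ProjectID per instance plus the current timestamp."""
    log_data["last_updated"] = datetime.now().isoformat()
    LOG_FILE.write_text(json.dumps(log_data, indent=2))
```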
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system by providing a mechanism for extracting and updating data related to project instances. It is likely used in conjunction with other components of the system to perform data analysis and visualization. The extracted data is stored in a CSV file, which can be used by other components of the system to perform further analysis or visualization.
Notes¶
- The database connection parameters are hardcoded in this file, which may not be desirable in a production environment. Consider using environment variables or a secure storage mechanism to store sensitive information.
- The file assumes that the log file and CSV file are located in a specific directory, which may not be the case in all environments. Consider using a more flexible approach to determining the file locations.
- The file uses the pyodbc library to connect to the SQL Server database, which may not be the most efficient or secure approach. Consider using a more modern library such as sqlalchemy, optionally paired with pandas.read_sql for querying, to connect to the database.
ml_automation/forms.py¶
Purpose¶
This file, forms.py, is responsible for data preprocessing and feature engineering for the CrewAI system. It loads a dataset, filters and oversamples the data, and performs text preprocessing and vectorization. The file also trains a Linear SVC model using the preprocessed data and saves the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files.
Key Responsibilities¶
- Load and filter the dataset
- Oversample underrepresented classes
- Perform text preprocessing and vectorization
- Train a Linear SVC model
- Save the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files
Important Functions¶
- oversample(df, column, threshold): Oversamples underrepresented classes in the dataset by replicating groups with fewer rows than the specified threshold.
- clean_text(text): Not implemented in this file; it is imported from the cleantxt module and used for text preprocessing.
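A sketch of the oversample logic described above, assuming it replicates each underrepresented group until it reaches the threshold (the exact replication strategy is an assumption):

```python
import pandas as pd

def oversample(df: pd.DataFrame, column: str, threshold: int) -> pd.DataFrame:
    """Replicate rows of groups in `column` that have fewer than `threshold` rows."""
    parts = []
    for _, group in df.groupby(column):
        if len(group) < threshold:
            reps = -(-threshold // len(group))               # ceiling division
            group = pd.concat([group] * reps).head(threshold)  # pad up to the threshold
        parts.append(group)
    return pd.concat(parts, ignore_index=True)
```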
Important Classes¶
- None
System Fit¶
This file fits into the wider 3-Cubed Python system by providing a preprocessed dataset and a trained model that can be used for prediction and inference. The saved model, TF-IDF vectorizer, and MultiLabelBinarizer can be loaded and used in other parts of the system to perform similar tasks. The file is likely used as a component of a larger workflow that involves data ingestion, preprocessing, model training, and deployment.
ml_automation/iscontrol.py¶
Purpose¶
This file is responsible for training a machine learning model to predict whether a given activity is a control or not. It utilizes a dataset containing industry, activity name, systems and applications, and a label indicating whether the activity is a control or not. The model is trained using a random forest classifier and utilizes TF-IDF vectorization for text feature extraction.
Key Responsibilities¶
- Load the dataset from a CSV file
- Preprocess the text data using TF-IDF vectorization and cleaning
- Split the data into training and testing sets
- Oversample the training data using SMOTE to handle class imbalance
- Train a random forest classifier model
- Evaluate the model's performance on the training and testing sets
- Save the trained model, TF-IDF vectorizer, and label encoder to files for later use
Important Functions¶
- train_model(): Trains the random forest classifier model using the oversampled training data.
- evaluate_model(): Evaluates the model's performance on the training and testing sets.
- save_model(): Saves the trained model, TF-IDF vectorizer, and label encoder to files.
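The overall train/evaluate flow can be illustrated with a self-contained sketch. The toy data and column names are invented, and class_weight="balanced" stands in here for the SMOTE oversampling the actual script uses:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the CSV; real columns (industry, activity name, ...) are assumptions.
df = pd.DataFrame({
    "text": ["review access logs", "approve invoice", "monthly reconciliation",
             "send newsletter", "sign off report", "post blog update"] * 10,
    "is_control": ["yes", "yes", "yes", "no", "yes", "no"] * 10,
})

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])          # TF-IDF text features
y = LabelEncoder().fit_transform(df["is_control"])  # encode yes/no labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# class_weight="balanced" approximates handling class imbalance without SMOTE
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```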
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It is responsible for training a model to predict whether a given activity is a control or not, which can be used in various applications such as process automation and quality control. The trained model, TF-IDF vectorizer, and label encoder can be loaded and used in other parts of the system for prediction and analysis.
ml_automation/nva.py¶
NVA Model Training Script¶
Purpose¶
This script trains a machine learning model to predict NVA (Non-Voting Authority) types based on industry, activity name, and systems and applications data. It utilizes TF-IDF vectorization, SMOTE oversampling, and a random forest classifier.
Key Responsibilities¶
- Loads the dataset from a CSV file
- Preprocesses the data by cleaning and combining text features
- Applies TF-IDF vectorization and label encoding
- Oversamples the training data using SMOTE
- Trains a random forest classifier model
- Evaluates the model's performance on the training and testing sets
- Saves the trained model, vectorizer, and label encoder to disk
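Saving the fitted artifacts to disk and reloading them elsewhere, as described above, looks roughly like this with joblib (the file names are illustrative, not taken from the source):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Fit small stand-in artifacts
vectorizer = TfidfVectorizer().fit(["alpha beta", "beta gamma"])
encoder = LabelEncoder().fit(["nva", "va"])

# Persist them to disk
out_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(out_dir, "tfidf_vectorizer.joblib"))
joblib.dump(encoder, os.path.join(out_dir, "label_encoder.joblib"))

# Reload for inference in another part of the system
vec2 = joblib.load(os.path.join(out_dir, "tfidf_vectorizer.joblib"))
enc2 = joblib.load(os.path.join(out_dir, "label_encoder.joblib"))
```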
Important Functions¶
- clean_text: a function from the cleantxt library used to clean and preprocess text data.
- TfidfVectorizer: a scikit-learn class used to apply TF-IDF vectorization to text data.
- LabelEncoder: a scikit-learn class used to encode categorical labels.
- SMOTE: an imbalanced-learn class used to oversample the minority class in the training data.
- RandomForestClassifier: a scikit-learn class used to train the machine learning model.
Important Classes¶
- RandomForestClassifier: the machine learning model used to predict NVA types.
- TfidfVectorizer: the vectorizer used to transform text data into numerical features.
- LabelEncoder: the encoder used to convert categorical labels into numerical values.
System Fit¶
This script is part of the wider 3-Cubed Python system and is used to train and deploy machine learning models for NVA type prediction. The trained model, vectorizer, and label encoder are saved to disk and can be loaded and used in other parts of the system for prediction and inference.
ml_automation/outlookmail.py¶
Purpose¶
This file provides functionality to send emails using the Outlook application. It utilizes the win32com library to interact with the Outlook API.
Key Responsibilities¶
- Send emails to specified recipients
- Set email subject and body
Important Functions¶
send_email(to, subject, body): Sends an email to the specified recipient with the given subject and body.
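A minimal sketch of what send_email may look like, assuming the standard pywin32 Dispatch API for Outlook; the format_recipients helper is hypothetical:

```python
def format_recipients(to) -> str:
    """Join one or more addresses into the semicolon-separated string Outlook expects."""
    if isinstance(to, str):
        return to
    return "; ".join(to)

def send_email(to, subject: str, body: str) -> None:
    """Send a mail through the local Outlook application (Windows only)."""
    import win32com.client  # pywin32; imported lazily so this module loads on any OS
    outlook = win32com.client.Dispatch("Outlook.Application")
    mail = outlook.CreateItem(0)  # 0 = olMailItem
    mail.To = format_recipients(to)
    mail.Subject = subject
    mail.Body = body
    mail.Send()
```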
System Fit¶
This file is part of the wider 3-Cubed Python system, specifically within the ml_automation module. It is designed to be used in automated workflows, such as sending notifications or updates to team members.
Notes¶
- The send_email function uses the win32com.client library to interact with the Outlook application.
- The if __name__ == "__main__": block provides an example usage of the send_email function, sending an email to two specified recipients.
- The email recipients and subject/body content are hardcoded in this example, but can be modified to suit specific use cases.
ml_automation/product.py¶
Purpose¶
This file, product.py, is responsible for automating the machine learning process for product data. It reads data from two CSV files, preprocesses the data, trains a random forest classifier, and saves the trained model along with the label encoder and TF-IDF vectorizer.
Key Responsibilities¶
- Read data from two CSV files and concatenate them into a single DataFrame
- Preprocess the data by dropping null values, filtering product counts, and converting datetime columns to strings
- Combine text features for TF-IDF transformation
- Convert text data into vectors using TF-IDF
- Apply SMOTE oversampling to handle class imbalance
- Train a random forest classifier with class weight adjustment
- Evaluate the model using accuracy score and classification report
- Save the trained model, label encoder, and TF-IDF vectorizer using joblib
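The first few steps above, concatenating the two CSV files and combining text columns for TF-IDF, can be sketched as follows; the DataFrames stand in for the CSVs and the column names are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the two CSV files
df1 = pd.DataFrame({"Industry": ["Banking"], "Activity Name": ["approve loan"]})
df2 = pd.DataFrame({"Industry": ["Retail"], "Activity Name": ["stock shelves"]})

# Concatenate into a single DataFrame and drop null values
df = pd.concat([df1, df2], ignore_index=True).dropna()

# Combine the text columns into one feature for TF-IDF
df["combined"] = df["Industry"].str.lower() + " " + df["Activity Name"].str.lower()

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["combined"])  # one row of TF-IDF features per record
```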
Important Functions¶
Not clear from code. However, the following functions are likely to be important:
- clean_text: a function from the cleantxt library used to clean text data.
- train_test_split: a scikit-learn function used to split the data into training and testing sets.
- fit_transform: a scikit-learn method used to fit the TF-IDF vectorizer to the data and transform it into vectors.
- fit_resample: an imbalanced-learn method used to apply SMOTE oversampling to the training data.
- fit: a scikit-learn method used to train the random forest classifier.
- predict: a scikit-learn method used to make predictions with the trained model.
- score: a scikit-learn method used to evaluate the model's performance.
Important Classes¶
Not clear from code. However, the following classes are likely to be important:
- RandomForestClassifier: a scikit-learn class used to train a random forest classifier.
- LabelEncoder: a scikit-learn class used to encode categorical data.
- TfidfVectorizer: a scikit-learn class used to transform text data into vectors using TF-IDF.
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It reads data from CSV files, preprocesses the data, trains a model, and saves the trained model along with the necessary artifacts. The trained model can then be used for making predictions on new data.
ml_automation/rules.py¶
Purpose¶
This file, rules.py, is part of the ML Automation system and is responsible for data preprocessing, oversampling, and training a machine learning model for classification tasks.
Key Responsibilities¶
- Load and preprocess the dataset
- Oversample underrepresented classes
- Preprocess text data using TF-IDF vectorization
- Train a Linear SVC model using OneVsRestClassifier
- Save the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files
Important Functions¶
- oversample(df, column, threshold): Manually oversamples underrepresented classes in the dataset.
- clean_text(text): Not defined in this file; likely a function from the cleantxt module used for text cleaning.
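The multi-label setup described in this section, a MultiLabelBinarizer feeding a OneVsRestClassifier over LinearSVC, can be sketched with toy data; the example texts and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy multi-label data; real texts and rule labels come from the dataset
texts = ["segregate duties in payments", "encrypt customer data",
         "encrypt and log payments", "review change tickets"] * 5
labels = [["sod"], ["security"], ["security", "sod"], ["change_mgmt"]] * 5

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per rule label

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = OneVsRestClassifier(LinearSVC())  # one LinearSVC per label
clf.fit(X, Y)

# Predicted label sets for a new record
pred = mlb.inverse_transform(clf.predict(vectorizer.transform(["review tickets"])))
```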
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the ML Automation module. It is responsible for preprocessing and training a machine learning model for classification tasks, which can be used in conjunction with other modules to automate business rules and decision-making processes. The trained model, TF-IDF vectorizer, and MultiLabelBinarizer are saved to files and can be loaded and used in other parts of the system.