ml_automation Module¶
Total files documented: 8
ml_automation/cleantxt.py¶
Purpose¶
This file, cleantxt.py, contains a function to clean and preprocess text data for machine learning tasks. It removes punctuation, converts text to lowercase, and removes numbers.
Key Responsibilities¶
- Clean and preprocess text data for machine learning tasks
- Remove punctuation and special characters
- Convert text to lowercase
- Remove numbers from text
Important Functions¶
clean_text(text)¶
This function takes in a string of text as input and returns the cleaned text. It performs the following operations:
* Converts the text to lowercase using the lower() method
* Removes punctuation and special characters using regular expressions (re.sub())
* Removes numbers from the text using regular expressions (re.sub())
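A minimal sketch of what clean_text may look like, reconstructed from the operations listed above; the exact regular expressions and whitespace handling are assumptions, since the source is not shown:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase text, strip punctuation/special characters, and drop numbers."""
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation and special characters
    text = re.sub(r"\d+", "", text)           # remove numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```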
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It is likely used in conjunction with other files to preprocess and clean text data before it is fed into machine learning models for analysis and prediction.
ml_automation/database.py¶
Purpose¶
This file, database.py, is responsible for interacting with a SQL Server database to extract and update data related to project instances. It uses the pyodbc library to connect to the database and pandas to manipulate the data.
Key Responsibilities¶
- Establish a connection to the SQL Server database
- Extract data from the database based on the last extracted ProjectID for each instance
- Update the log file with the last extracted ProjectID for each instance and the current timestamp
- Append new data to the existing CSV file or create a new one if it does not exist
Important Functions¶
- read_last_extraction(): Reads the last extracted ProjectID for each instance from the log file.
- update_last_extraction(log_data): Updates the log file with the last extracted ProjectID for each instance and the current timestamp.
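The two log functions could be sketched as follows; the JSON log format, file name, and exact fields are assumptions, since the source is not shown:

```python
import json
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("extraction_log.json")  # hypothetical log location

def read_last_extraction() -> dict:
    """Return {instance: last extracted ProjectID} from the log, or {} if absent."""
    if LOG_FILE.exists():
        return json.loads(LOG_FILE.read_text())
    return {}

def update_last_extraction(log_data: dict) -> None:
    """Write the last extracted ProjectID per instance plus the current timestamp."""
    log_data["last_updated"] = datetime.now().isoformat()
    LOG_FILE.write_text(json.dumps(log_data, indent=2))
```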
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system by providing a mechanism for extracting and updating data related to project instances. It is likely used in conjunction with other components of the system to perform data analysis and visualization. The extracted data is stored in a CSV file, which can be used by other components of the system to perform further analysis or visualization.
Notes¶
- The database connection parameters are hardcoded in this file, which may not be desirable in a production environment. Consider using environment variables or a secure storage mechanism to store sensitive information.
- The file assumes that the log file and CSV file are located in a specific directory, which may not be the case in all environments. Consider using a more flexible approach to determining the file locations.
- The file uses the pyodbc library to connect to the SQL Server database, which may not be the most efficient or secure approach. Consider using a more modern library such as sqlalchemy, optionally paired with pandas.read_sql for querying, to connect to the database.
ml_automation/forms.py¶
Purpose¶
This file, forms.py, is responsible for data preprocessing and feature engineering for the CrewAI system. It loads a dataset, filters and oversamples the data, and performs text preprocessing and vectorization. The file also trains a Linear SVC model using the preprocessed data and saves the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files.
Key Responsibilities¶
- Load and filter the dataset
- Oversample underrepresented classes
- Perform text preprocessing and vectorization
- Train a Linear SVC model
- Save the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files
Important Functions¶
- oversample(df, column, threshold): Oversamples underrepresented classes in the dataset by replicating groups with fewer rows than the specified threshold.
- clean_text(text): Not implemented in this file; it is imported from the cleantxt module and used for text preprocessing.
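A sketch of the oversample logic described above, assuming it replicates each underrepresented group until it reaches the threshold (the exact replication strategy is an assumption):

```python
import pandas as pd

def oversample(df: pd.DataFrame, column: str, threshold: int) -> pd.DataFrame:
    """Replicate rows of groups in `column` that have fewer than `threshold` rows."""
    parts = []
    for _, group in df.groupby(column):
        if len(group) < threshold:
            reps = -(-threshold // len(group))               # ceiling division
            group = pd.concat([group] * reps).head(threshold)  # pad up to the threshold
        parts.append(group)
    return pd.concat(parts, ignore_index=True)
```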
Important Classes¶
- None
System Fit¶
This file fits into the wider 3-Cubed Python system by providing a preprocessed dataset and a trained model that can be used for prediction and inference. The saved model, TF-IDF vectorizer, and MultiLabelBinarizer can be loaded and used in other parts of the system to perform similar tasks. The file is likely used as a component of a larger workflow that involves data ingestion, preprocessing, model training, and deployment.
ml_automation/iscontrol.py¶
Purpose¶
This file is responsible for training a machine learning model to predict whether a given activity is a control or not. It utilizes a dataset containing industry, activity name, systems and applications, and a label indicating whether the activity is a control or not. The model is trained using a random forest classifier and utilizes TF-IDF vectorization for text feature extraction.
Key Responsibilities¶
- Load the dataset from a CSV file
- Preprocess the text data using TF-IDF vectorization and cleaning
- Split the data into training and testing sets
- Oversample the training data using SMOTE to handle class imbalance
- Train a random forest classifier model
- Evaluate the model's performance on the training and testing sets
- Save the trained model, TF-IDF vectorizer, and label encoder to files for later use
Important Functions¶
- train_model(): Trains the random forest classifier model using the oversampled training data.
- evaluate_model(): Evaluates the model's performance on the training and testing sets.
- save_model(): Saves the trained model, TF-IDF vectorizer, and label encoder to files.
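The overall train/evaluate flow can be illustrated with a self-contained sketch. The toy data and column names are invented, and class_weight="balanced" stands in here for the SMOTE oversampling the actual script uses:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the CSV; real columns (industry, activity name, ...) are assumptions.
df = pd.DataFrame({
    "text": ["review access logs", "approve invoice", "monthly reconciliation",
             "send newsletter", "sign off report", "post blog update"] * 10,
    "is_control": ["yes", "yes", "yes", "no", "yes", "no"] * 10,
})

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])          # TF-IDF text features
y = LabelEncoder().fit_transform(df["is_control"])  # encode yes/no labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# class_weight="balanced" approximates handling class imbalance without SMOTE
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```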
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It is responsible for training a model to predict whether a given activity is a control or not, which can be used in various applications such as process automation and quality control. The trained model, TF-IDF vectorizer, and label encoder can be loaded and used in other parts of the system for prediction and analysis.
ml_automation/nva.py¶
NVA Model Training Script¶
Purpose¶
This script trains a machine learning model to predict NVA (Non-Voting Authority) types based on industry, activity name, and systems and applications data. It utilizes TF-IDF vectorization, SMOTE oversampling, and a random forest classifier.
Key Responsibilities¶
- Loads the dataset from a CSV file
- Preprocesses the data by cleaning and combining text features
- Applies TF-IDF vectorization and label encoding
- Oversamples the training data using SMOTE
- Trains a random forest classifier model
- Evaluates the model's performance on the training and testing sets
- Saves the trained model, vectorizer, and label encoder to disk
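Saving the fitted artifacts to disk and reloading them elsewhere, as described above, looks roughly like this with joblib (the file names are illustrative, not taken from the source):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Fit small stand-in artifacts
vectorizer = TfidfVectorizer().fit(["alpha beta", "beta gamma"])
encoder = LabelEncoder().fit(["nva", "va"])

# Persist them to disk
out_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(out_dir, "tfidf_vectorizer.joblib"))
joblib.dump(encoder, os.path.join(out_dir, "label_encoder.joblib"))

# Reload for inference in another part of the system
vec2 = joblib.load(os.path.join(out_dir, "tfidf_vectorizer.joblib"))
enc2 = joblib.load(os.path.join(out_dir, "label_encoder.joblib"))
```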
Important Functions¶
- clean_text: a function from the cleantxt library used to clean and preprocess text data.
- TfidfVectorizer: a scikit-learn class used to apply TF-IDF vectorization to text data.
- LabelEncoder: a scikit-learn class used to encode categorical labels.
- SMOTE: an imbalanced-learn class used to oversample the minority class in the training data.
- RandomForestClassifier: a scikit-learn class used to train the machine learning model.
Important Classes¶
- RandomForestClassifier: the machine learning model used to predict NVA types.
- TfidfVectorizer: the vectorizer used to transform text data into numerical features.
- LabelEncoder: the encoder used to convert categorical labels into numerical values.
System Fit¶
This script is part of the wider 3-Cubed Python system and is used to train and deploy machine learning models for NVA type prediction. The trained model, vectorizer, and label encoder are saved to disk and can be loaded and used in other parts of the system for prediction and inference.
ml_automation/outlookmail.py¶
Purpose¶
This file provides functionality to send emails using the Outlook application. It utilizes the win32com library to interact with the Outlook API.
Key Responsibilities¶
- Send emails to specified recipients
- Set email subject and body
Important Functions¶
send_email(to, subject, body): Sends an email to the specified recipient with the given subject and body.
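A minimal sketch of what send_email may look like, assuming the standard pywin32 Dispatch API for Outlook; the format_recipients helper is hypothetical:

```python
def format_recipients(to) -> str:
    """Join one or more addresses into the semicolon-separated string Outlook expects."""
    if isinstance(to, str):
        return to
    return "; ".join(to)

def send_email(to, subject: str, body: str) -> None:
    """Send a mail through the local Outlook application (Windows only)."""
    import win32com.client  # pywin32; imported lazily so this module loads on any OS
    outlook = win32com.client.Dispatch("Outlook.Application")
    mail = outlook.CreateItem(0)  # 0 = olMailItem
    mail.To = format_recipients(to)
    mail.Subject = subject
    mail.Body = body
    mail.Send()
```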
System Fit¶
This file is part of the wider 3-Cubed Python system, specifically within the ml_automation module. It is designed to be used in automated workflows, such as sending notifications or updates to team members.
Notes¶
- The send_email function uses the win32com.client library to interact with the Outlook application.
- The if __name__ == "__main__": block provides an example usage of the send_email function, sending an email to two specified recipients.
- The email recipients and subject/body content are hardcoded in this example, but can be modified to suit specific use cases.
ml_automation/product.py¶
Purpose¶
This file, product.py, is responsible for automating the machine learning process for product data. It reads data from two CSV files, preprocesses the data, trains a random forest classifier, and saves the trained model along with the label encoder and TF-IDF vectorizer.
Key Responsibilities¶
- Read data from two CSV files and concatenate them into a single DataFrame
- Preprocess the data by dropping null values, filtering product counts, and converting datetime columns to strings
- Combine text features for TF-IDF transformation
- Convert text data into vectors using TF-IDF
- Apply SMOTE oversampling to handle class imbalance
- Train a random forest classifier with class weight adjustment
- Evaluate the model using accuracy score and classification report
- Save the trained model, label encoder, and TF-IDF vectorizer using joblib
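The first few steps above, concatenating the two CSV files and combining text columns for TF-IDF, can be sketched as follows; the DataFrames stand in for the CSVs and the column names are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the two CSV files
df1 = pd.DataFrame({"Industry": ["Banking"], "Activity Name": ["approve loan"]})
df2 = pd.DataFrame({"Industry": ["Retail"], "Activity Name": ["stock shelves"]})

# Concatenate into a single DataFrame and drop null values
df = pd.concat([df1, df2], ignore_index=True).dropna()

# Combine the text columns into one feature for TF-IDF
df["combined"] = df["Industry"].str.lower() + " " + df["Activity Name"].str.lower()

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["combined"])  # one row of TF-IDF features per record
```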
Important Functions¶
Not clear from code. However, the following functions are likely to be important:
- clean_text: a function from the cleantxt library used to clean text data.
- train_test_split: a scikit-learn function used to split the data into training and testing sets.
- fit_transform: a scikit-learn method used to fit the TF-IDF vectorizer to the data and transform it into vectors.
- fit_resample: an imbalanced-learn method used to apply SMOTE oversampling to the training data.
- fit: a scikit-learn method used to train the random forest classifier.
- predict: a scikit-learn method used to make predictions with the trained model.
- score: a scikit-learn method used to evaluate the model's performance.
Important Classes¶
Not clear from code. However, the following classes are likely to be important:
- RandomForestClassifier: a scikit-learn class used to train a random forest classifier.
- LabelEncoder: a scikit-learn class used to encode categorical data.
- TfidfVectorizer: a scikit-learn class used to transform text data into vectors using TF-IDF.
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the machine learning automation pipeline. It reads data from CSV files, preprocesses the data, trains a model, and saves the trained model along with the necessary artifacts. The trained model can then be used for making predictions on new data.
ml_automation/rules.py¶
Purpose¶
This file, rules.py, is part of the ML Automation system and is responsible for data preprocessing, oversampling, and training a machine learning model for classification tasks.
Key Responsibilities¶
- Load and preprocess the dataset
- Oversample underrepresented classes
- Preprocess text data using TF-IDF vectorization
- Train a Linear SVC model using OneVsRestClassifier
- Save the trained model, TF-IDF vectorizer, and MultiLabelBinarizer to files
Important Functions¶
- oversample(df, column, threshold): Manually oversamples underrepresented classes in the dataset.
- clean_text(text): Not defined in this file; likely a function from the cleantxt module used for text cleaning.
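The multi-label setup described in this section, a MultiLabelBinarizer feeding a OneVsRestClassifier over LinearSVC, can be sketched with toy data; the example texts and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy multi-label data; real texts and rule labels come from the dataset
texts = ["segregate duties in payments", "encrypt customer data",
         "encrypt and log payments", "review change tickets"] * 5
labels = [["sod"], ["security"], ["security", "sod"], ["change_mgmt"]] * 5

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per rule label

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = OneVsRestClassifier(LinearSVC())  # one LinearSVC per label
clf.fit(X, Y)

# Predicted label sets for a new record
pred = mlb.inverse_transform(clf.predict(vectorizer.transform(["review tickets"])))
```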
Important Classes¶
None
System Fit¶
This file fits into the wider 3-Cubed Python system as part of the ML Automation module. It is responsible for preprocessing and training a machine learning model for classification tasks, which can be used in conjunction with other modules to automate business rules and decision-making processes. The trained model, TF-IDF vectorizer, and MultiLabelBinarizer are saved to files and can be loaded and used in other parts of the system.