Practical Data Science Programming for Medical Datasets Analysis and Prediction with Python GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 402

Book Description
In this book, you will implement two data science projects using Scikit-Learn, SciPy, and other libraries with a Python GUI. In Chapter 1, you will learn how to use Scikit-Learn, SVM, NumPy, Pandas, and other libraries to predict early-stage diabetes using the Early Stage Diabetes Risk Prediction Dataset (https://viviansiahaan.blogspot.com/2023/06/practical-data-science-programming-for.html). This dataset contains the sign and symptom data of newly diabetic or would-be diabetic patients. It was collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor. The dataset consists of the 16 features listed here and one target variable named class. Age: age in years, ranging from 20 to 65; Gender: Male/Female; Polyuria: Yes/No; Polydipsia: Yes/No; Sudden weight loss: Yes/No; Weakness: Yes/No; Polyphagia: Yes/No; Genital Thrush: Yes/No; Visual blurring: Yes/No; Itching: Yes/No; Irritability: Yes/No; Delayed healing: Yes/No; Partial Paresis: Yes/No; Muscle stiffness: Yes/No; Alopecia: Yes/No; Obesity: Yes/No. You will develop a GUI using PyQt5 to plot the distribution of features, feature importance, cross-validation scores, and predicted values versus true values. The machine learning models used in this project are AdaBoost, Random Forest, Gradient Boosting, Logistic Regression, and Support Vector Machine. In Chapter 2, you will learn how to use Scikit-Learn, NumPy, Pandas, and other libraries to analyze and predict breast cancer using the Breast Cancer Prediction Dataset (https://viviansiahaan.blogspot.com/2023/06/practical-data-science-programming-for.html).
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body. This breast cancer dataset was obtained from Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. You will develop a GUI using PyQt5 to plot the distribution of features, pairwise relationships, test scores, predicted values versus true values, the confusion matrix, and the decision boundary. The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine.
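The model comparison described for the diabetes project can be sketched roughly as follows. This is a minimal illustration, not the book's code: the data is synthetic (randomly generated Yes/No answers with a made-up labeling rule), only a few of the listed columns are used, and the 1/0 encoding and the scaling step for Logistic Regression and SVM are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the questionnaire data (NOT the real dataset).
df = pd.DataFrame({
    "Age": rng.integers(20, 66, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Polyuria": rng.choice(["Yes", "No"], size=n),
    "Polydipsia": rng.choice(["Yes", "No"], size=n),
    "Obesity": rng.choice(["Yes", "No"], size=n),
})
# Made-up rule so the models have a signal to learn.
positive = (df["Polyuria"] == "Yes") | (df["Polydipsia"] == "Yes")
df["class"] = np.where(positive, "Positive", "Negative")

# Encode the categorical answers as 1/0 (an assumed encoding).
binary_map = {"Yes": 1, "No": 0, "Male": 1, "Female": 0}
X = df.drop(columns="class").copy()
for col in ["Gender", "Polyuria", "Polydipsia", "Obesity"]:
    X[col] = X[col].map(binary_map)
y = df["class"].map({"Positive": 1, "Negative": 0})

# The five model families named in the description.
models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validation accuracy per model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With the real dataset, the same loop would apply once the Yes/No columns are encoded; the scores printed here say nothing about performance on the actual data.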

Data Science For Programmer: A Project-Based Approach With Python GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 520

Book Description
Book 1: Practical Data Science Programming for Medical Datasets Analysis and Prediction with Python GUI
In this book, you will implement two data science projects using Scikit-Learn, SciPy, and other libraries with a Python GUI. In Project 1, you will learn how to use Scikit-Learn, NumPy, Pandas, Seaborn, and other libraries to predict early-stage diabetes using the Early Stage Diabetes Risk Prediction Dataset provided by Kaggle. This dataset contains the sign and symptom data of newly diabetic or would-be diabetic patients. It was collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor. You will develop a GUI using PyQt5 to plot the distribution of features, feature importance, cross-validation scores, and predicted values versus true values. The machine learning models used in this project are AdaBoost, Random Forest, Gradient Boosting, Logistic Regression, and Support Vector Machine. In Project 2, you will learn how to use Scikit-Learn, NumPy, Pandas, and other libraries to analyze and predict breast cancer using the Breast Cancer Prediction Dataset provided by Kaggle. Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body. This breast cancer dataset was obtained from Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. You will develop a GUI using PyQt5 to plot the distribution of features, pairwise relationships, test scores, predicted values versus true values, the confusion matrix, and the decision boundary.
The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine.
Book 2: Step by Step Tutorials For Data Science With Python GUI: Traffic And Heart Attack Analysis And Prediction
In this book, you will implement two data science projects using Scikit-Learn, SciPy, and other libraries with a Python GUI. In Chapter 1, you will learn how to use Scikit-Learn, SciPy, and other libraries to predict traffic (the number of vehicles) at four different junctions using the Traffic Prediction Dataset provided by Kaggle. This dataset contains 48.1k (48120) hourly observations of the number of vehicles at four different junctions, with four columns: 1) DateTime; 2) Junction; 3) Vehicles; and 4) ID. In Chapter 2, you will learn how to use Scikit-Learn, NumPy, Pandas, and other libraries to analyze and predict heart attack using the Heart Attack Analysis & Prediction Dataset provided by Kaggle.
Book 3: BRAIN TUMOR: Analysis, Classification, and Detection Using Machine Learning and Deep Learning with Python GUI
In this project, you will learn how to use Scikit-Learn, TensorFlow, Keras, NumPy, Pandas, Seaborn, and other libraries to implement brain tumor classification and detection with machine learning using the Brain Tumor dataset provided by Kaggle. This dataset contains five first-order features: Mean (the contribution of individual pixel intensity for the entire image), Variance (used to find how each pixel varies from the neighboring pixel), Standard Deviation (the deviation of measured values or the data from its mean), Skewness (a measure of symmetry), and Kurtosis (describes the peak of, e.g., a frequency distribution). It also contains eight second-order features: Contrast, Energy, ASM (Angular Second Moment), Entropy, Homogeneity, Dissimilarity, Correlation, and Coarseness.
The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine. The deep learning models used in this project are MobileNet and ResNet50. In this project, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, training loss, and training accuracy.
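As a rough illustration of the five first-order features named in the Book 3 description, the sketch below computes them from pixel intensities with NumPy. The function name `first_order_features` is hypothetical, the image is synthetic, and the skewness and kurtosis use the standard population-moment definitions (non-excess kurtosis); the description does not say which exact formulas the dataset authors used.

```python
import numpy as np


def first_order_features(image):
    """Return mean, variance, standard deviation, skewness, and kurtosis
    of the pixel intensities of a grayscale image (any array-like)."""
    pixels = np.asarray(image, dtype=np.float64).ravel()
    mean = pixels.mean()
    variance = pixels.var()          # population variance
    std = np.sqrt(variance)
    if std == 0:
        # A constant image has no spread; define both moments as 0 here.
        skewness, kurtosis = 0.0, 0.0
    else:
        z = (pixels - mean) / std    # standardized intensities
        skewness = float(np.mean(z ** 3))
        kurtosis = float(np.mean(z ** 4))   # non-excess kurtosis
    return {
        "Mean": float(mean),
        "Variance": float(variance),
        "Standard Deviation": float(std),
        "Skewness": skewness,
        "Kurtosis": kurtosis,
    }


# Synthetic 64x64 "image" used only to exercise the function.
rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(64, 64))
for name, value in first_order_features(demo_image).items():
    print(f"{name}: {value:.4f}")
```

The second-order features (Contrast, Energy, ASM, and so on) are texture measures and would need a different computation that is not sketched here.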

The Applied Data Science Workshop On Medical Datasets Using Machine Learning and Deep Learning with Python GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 1574

Book Description
WORKSHOP 1: Heart Failure Analysis and Prediction Using Scikit-Learn, Keras, and TensorFlow with Python GUI
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure. People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, wherein a machine learning model can be of great help. The dataset used in this project is from Davide Chicco and Giuseppe Jurman, "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone," BMC Medical Informatics and Decision Making 20, 16 (2020). The attribute information in the dataset is as follows: age: Age; anaemia: Decrease of red blood cells or hemoglobin (boolean); creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L); diabetes: If the patient has diabetes (boolean); ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage); high_blood_pressure: If the patient has hypertension (boolean); platelets: Platelets in the blood (kiloplatelets/mL); serum_creatinine: Level of serum creatinine in the blood (mg/dL); serum_sodium: Level of serum sodium in the blood (mEq/L); sex: Woman or man (binary); smoking: If the patient smokes or not (boolean); time: Follow-up period (days); and DEATH_EVENT: If the patient died during the follow-up period (boolean). The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D.
Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 2: Cervical Cancer Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
About 11,000 new cases of invasive cervical cancer are diagnosed each year in the U.S. However, the number of new cervical cancer cases has been declining steadily over the past decades. Although it is the most preventable type of cancer, each year cervical cancer kills about 4,000 women in the U.S. and about 300,000 women worldwide. Numerous studies report that high poverty levels are linked with low screening rates. In addition, lack of health insurance, limited transportation, and language difficulties hinder a poor woman’s access to screening services. Human papillomavirus (HPV) is the main risk factor for cervical cancer. In adults, the most important risk factor for HPV is sexual activity with an infected person. Women most at risk for cervical cancer are those with a history of multiple sexual partners, sexual intercourse at age 17 years or younger, or both. A woman who has never been sexually active has a very low risk of developing cervical cancer. Sexual activity with multiple partners increases the likelihood of many other sexually transmitted infections (chlamydia, gonorrhea, syphilis). Studies have found an association between chlamydia and cervical cancer risk, including the possibility that chlamydia may prolong HPV infection. Therefore, early detection of cervical cancer using machine and deep learning models can be of great help. The dataset used in this project was obtained from the UCI Repository, which is gratefully acknowledged. This file contains a list of risk factors for cervical cancer leading to a biopsy examination.
The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 3: Chronic Kidney Disease Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
Chronic kidney disease is a longstanding disease of the kidneys leading to renal failure. The kidneys filter waste and excess fluid from the blood. As kidneys fail, waste builds up. Symptoms develop slowly and aren't specific to the disease. Some people have no symptoms at all and are diagnosed by a lab test. Medication helps manage symptoms. In later stages, filtering the blood with a machine (dialysis) or a transplant may be required. The dataset used in this project was taken over a 2-month period in India with 25 attributes (e.g., red blood cell count, white blood cell count, etc.). The target is the 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease). It contains measures of 24 features for 400 people. That is quite a lot of features for just 400 samples. There are 14 categorical features, while 10 are numerical. The dataset needs cleaning: it has NaNs, and the numeric features need to be forced to floats.
Attribute Information: Age(numerical) age in years; Blood Pressure(numerical) bp in mm/Hg; Specific Gravity(categorical) sg - (1.005,1.010,1.015,1.020,1.025); Albumin(categorical) al - (0,1,2,3,4,5); Sugar(categorical) su - (0,1,2,3,4,5); Red Blood Cells(categorical) rbc - (normal,abnormal); Pus Cell(categorical) pc - (normal,abnormal); Pus Cell Clumps(categorical) pcc - (present,notpresent); Bacteria(categorical) ba - (present,notpresent); Blood Glucose Random(numerical) bgr in mgs/dl; Blood Urea(numerical) bu in mgs/dl; Serum Creatinine(numerical) sc in mgs/dl; Sodium(numerical) sod in mEq/L; Potassium(numerical) pot in mEq/L; Hemoglobin(numerical) hemo in gms; Packed Cell Volume(numerical); White Blood Cell Count(numerical) wc in cells/cumm; Red Blood Cell Count(numerical) rc in millions/cmm; Hypertension(categorical) htn - (yes,no); Diabetes Mellitus(categorical) dm - (yes,no); Coronary Artery Disease(categorical) cad - (yes,no); Appetite(categorical) appet - (good,poor); Pedal Edema(categorical) pe - (yes,no); Anemia(categorical) ane - (yes,no); and Class(categorical) class - (ckd,notckd). The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 4: Lung Cancer Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on their risk status.
The data was collected from an online lung cancer prediction system website. The total number of attributes in the dataset is 16, while the number of instances is 309. The attribute information of the dataset is as follows: Gender: M(male), F(female); Age: Age of the patient; Smoking: YES=2, NO=1; Yellow fingers: YES=2, NO=1; Anxiety: YES=2, NO=1; Peer_pressure: YES=2, NO=1; Chronic Disease: YES=2, NO=1; Fatigue: YES=2, NO=1; Allergy: YES=2, NO=1; Wheezing: YES=2, NO=1; Alcohol: YES=2, NO=1; Coughing: YES=2, NO=1; Shortness of Breath: YES=2, NO=1; Swallowing Difficulty: YES=2, NO=1; Chest pain: YES=2, NO=1; and Lung Cancer: YES, NO. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 5: Alzheimer’s Disease Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
Alzheimer's is a type of dementia that causes problems with memory, thinking, and behavior. Symptoms usually develop slowly and get worse over time, becoming severe enough to interfere with daily tasks. Alzheimer's is not a normal part of aging. The greatest known risk factor is increasing age, and the majority of people with Alzheimer's are 65 and older. But Alzheimer's is not just a disease of old age. Approximately 200,000 Americans under the age of 65 have younger-onset Alzheimer’s disease (also known as early-onset Alzheimer’s). The dataset consists of longitudinal MRI data from 374 subjects aged 60 to 96. Each subject was scanned at least once. All subjects are right-handed.
206 of the subjects were grouped as 'Nondemented' throughout the study. 107 of the subjects were grouped as 'Demented' at the time of their initial visits and remained so throughout the study. 14 subjects were grouped as 'Nondemented' at the time of their initial visit and were subsequently characterized as 'Demented' at a later visit. These fall under the 'Converted' category. The following are some important features in the dataset: EDUC: Years of Education; SES: Socioeconomic Status; MMSE: Mini Mental State Examination; CDR: Clinical Dementia Rating; eTIV: Estimated Total Intracranial Volume; nWBV: Normalized Whole Brain Volume; and ASF: Atlas Scaling Factor. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 6: Parkinson Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders. This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column).
The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD. The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient, and the name of the patient is identified in the first column. The attribute information of this dataset is as follows: name - ASCII subject name and recording number; MDVP:Fo(Hz) - Average vocal fundamental frequency; MDVP:Fhi(Hz) - Maximum vocal fundamental frequency; MDVP:Flo(Hz) - Minimum vocal fundamental frequency; MDVP:Jitter(%); MDVP:Jitter(Abs); MDVP:RAP; MDVP:PPQ; Jitter:DDP - Several measures of variation in fundamental frequency; MDVP:Shimmer; MDVP:Shimmer(dB); Shimmer:APQ3; Shimmer:APQ5; MDVP:APQ; Shimmer:DDA - Several measures of variation in amplitude; NHR; HNR - Two measures of the ratio of noise to tonal components in the voice; status - Health status of the subject: (one) Parkinson's, (zero) healthy; RPDE, D2 - Two nonlinear dynamical complexity measures; DFA - Signal fractal scaling exponent; and spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
WORKSHOP 7: Liver Disease Classification and Prediction Using Machine Learning and Deep Learning with Python GUI
The number of patients with liver disease has been continuously increasing because of excessive consumption of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors. This dataset contains 416 liver patient records and 167 non-liver patient records collected from the North East of Andhra Pradesh, India. The "Dataset" column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This dataset contains 441 male patient records and 142 female patient records. Any patient whose age exceeded 89 is listed as being of age "90". Columns in the dataset: Age of the patient; Gender of the patient; Total Bilirubin; Direct Bilirubin; Alkaline Phosphotase; Alamine Aminotransferase; Aspartate Aminotransferase; Total Protiens; Albumin; Albumin and Globulin Ratio; and Dataset: field used to split the data into two sets (patient with liver disease, or no disease). The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, AdaBoost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross-validation scores, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
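The Workshop 3 description notes that the chronic kidney disease data has NaNs and that numeric features need to be forced to floats. A minimal sketch of that kind of cleaning with pandas is shown below. The tiny table is made up for illustration; the "?" placeholder, the stray whitespace, and the median/mode imputation are assumptions and are not stated in the description.

```python
import pandas as pd

# Hypothetical raw rows using a few of the column abbreviations from the
# attribute list; the messy values are invented for this example.
raw = pd.DataFrame({
    "age": ["48", "7", "?", "62"],
    "bp": ["80", "50", "80", "\t70"],
    "rbc": ["normal", None, "abnormal", "normal"],
    "htn": ["yes", "no", " yes", "no"],
    "classification": ["ckd", "ckd\t", "notckd", "notckd"],
})
numeric_cols = ["age", "bp"]
categorical_cols = ["rbc", "htn"]

clean = raw.copy()

# Force numeric columns to floats; unparseable entries become NaN.
for col in numeric_cols:
    stripped = clean[col].str.strip()
    clean[col] = pd.to_numeric(stripped, errors="coerce").astype(float)
    # Assumed strategy: fill missing numbers with the column median.
    clean[col] = clean[col].fillna(clean[col].median())

# Trim whitespace in text columns, then fill gaps with the most common value.
for col in categorical_cols + ["classification"]:
    clean[col] = clean[col].str.strip()
for col in categorical_cols:
    clean[col] = clean[col].fillna(clean[col].mode()[0])

# Binary target: 1 for 'ckd', 0 for 'notckd'.
clean["target"] = (clean["classification"] == "ckd").astype(int)

print(clean)
print(clean.dtypes)
```

Other imputation strategies (dropping rows, or model-based imputation) are equally plausible; with only 400 samples, dropping rows is usually the costlier choice.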

DATA VISUALIZATION, TIME-SERIES FORECASTING, AND PREDICTION USING MACHINE LEARNING WITH TKINTER

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 267

Book Description
This "Data Visualization, Time-Series Forecasting, and Prediction using Machine Learning with Tkinter" project is a comprehensive and multifaceted application that leverages data visualization, time-series forecasting, and machine learning techniques to gain insights into bitcoin data and make predictions. This project serves as a valuable tool for financial analysts, traders, and investors seeking to make informed decisions in the stock market. The project begins with data visualization, where historical bitcoin market data is visually represented using various plots and charts. This provides users with an intuitive understanding of the data's trends, patterns, and fluctuations. Features distribution analysis is conducted to assess the statistical properties of the dataset, helping users identify key characteristics that may impact forecasting and prediction. One of the project's core functionalities is time-series forecasting. Through a user-friendly interface built with Tkinter, users can select a stock symbol and specify the time horizon for forecasting. The project supports multiple machine learning regressors, such as Linear Regression, Decision Trees, Random Forests, Gradient Boosting, Extreme Gradient Boosting, Multi-Layer Perceptron, Lasso, Ridge, AdaBoost, and KNN, allowing users to choose the most suitable algorithm for their forecasting needs. Time-series forecasting is crucial for making predictions about stock prices, which is essential for investment strategies. The project employs various machine learning regressors to predict the adjusted closing price of bitcoin stock. By training these models on historical data, users can obtain predictions for future adjusted closing prices. This information is invaluable for traders and investors looking to make buy or sell decisions. The project also incorporates hyperparameter tuning and cross-validation to enhance the accuracy of these predictions. 
These models are evaluated with metrics such as Mean Absolute Error (MAE), which quantifies the average absolute discrepancy between predicted values and actual values. Lower MAE values signify better model performance. Additionally, Mean Squared Error (MSE) is used to calculate the average squared difference between predicted and actual values, with lower MSE values indicating better model performance. Root Mean Squared Error (RMSE), the square root of MSE, is expressed in the same units as the target variable, and lower values again denote better performance. Lastly, R-squared (R2) evaluates the fraction of variance in the target variable that can be predicted from the independent variables, with higher values signifying better model fit. An R2 of 1 implies a perfect model fit. In addition to close price forecasting, the project extends its capabilities to predict daily returns. By implementing grid search, users can fine-tune the hyperparameters of machine learning models such as Random Forest, Gradient Boosting, Support Vector, Decision Tree, Extreme Gradient Boosting, Multi-Layer Perceptron, and AdaBoost classifiers. This optimization process aims to maximize the predictive accuracy of daily returns. Accurate daily return predictions are essential for assessing risk and formulating effective trading strategies. Key metrics for these classifiers include Accuracy, which represents the ratio of correctly predicted instances to the total number of instances; Precision, which measures the proportion of true positive predictions among all positive predictions; and Recall (also known as Sensitivity or True Positive Rate), which assesses the proportion of true positive predictions among all actual positive instances. The F1-Score is the harmonic mean of Precision and Recall, offering a balanced evaluation, especially when considering the trade-off between false positives and false negatives.
The ROC Curve illustrates the trade-off between Recall and False Positive Rate, while the Area Under the ROC Curve (AUC-ROC) summarizes this trade-off. The Confusion Matrix provides a comprehensive view of classifier performance by detailing true positives, true negatives, false positives, and false negatives, facilitating the computation of various metrics like accuracy, precision, and recall. The selection of these metrics hinges on the project's specific objectives and the characteristics of the dataset, ensuring alignment with the intended goals and the ramifications of false positives and false negatives, which hold particular significance in financial contexts where decisions can have profound consequences. Overall, the "Data Visualization, Time-Series Forecasting, and Prediction using Machine Learning with Tkinter" project serves as a powerful and user-friendly platform for financial data analysis and decision-making. It bridges the gap between complex machine learning techniques and accessible user interfaces, making financial analysis and prediction more accessible to a broader audience. With its comprehensive features, this project empowers users to gain insights from historical data, make informed investment decisions, and develop effective trading strategies in the dynamic world of finance. You can download the dataset from: http://viviansiahaan.blogspot.com/2023/09/data-visualization-time-series.html.
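The regression and classification metrics defined in this description can be written out directly. The sketch below uses only the Python standard library; the function names and the sample numbers are illustrative assumptions and do not come from the book.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                      # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                 # assumes y_true is not constant
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)    # harmonic mean
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1}


# Made-up example values, e.g. predicted vs. actual prices and up/down labels.
reg = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                             [1, 0, 0, 1, 0, 1, 1, 0])
print(reg)
print(clf)
```

In practice the same quantities are available from scikit-learn's metrics module; writing them by hand mainly makes the definitions concrete.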

TKINTER, DATA SCIENCE, AND MACHINE LEARNING

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 173

Book Description
In this project, we embarked on a comprehensive journey through the world of machine learning and model evaluation. Our primary goal was to develop a Tkinter GUI and assess various machine learning models on a given dataset to identify the best-performing one. This process is essential in solving real-world problems, as it helps us select the most suitable algorithm for a specific task. By crafting this Tkinter-powered GUI, we provided an accessible and user-friendly interface for users engaging with machine learning models. It simplified intricate processes, allowing users to load data, select models, initiate training, and visualize results without necessitating code expertise or command-line operations. This GUI introduced a higher degree of usability and accessibility to the machine learning workflow, accommodating users with diverse levels of technical proficiency. We began by loading and preprocessing the dataset, a fundamental step in any machine learning project. Proper data preprocessing involves tasks such as handling missing values, encoding categorical features, and scaling numerical attributes. These operations ensure that the data is in a format suitable for training and testing machine learning models. Once our data was ready, we moved on to the model selection phase. We evaluated multiple machine learning algorithms, each with its strengths and weaknesses. The models we explored included Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), Decision Trees, Gradient Boosting, Extreme Gradient Boosting (XGBoost), Multi-Layer Perceptron (MLP), and Support Vector Classifier (SVC). For each model, we employed a systematic approach to find the best hyperparameters using grid search with cross-validation. This technique allowed us to explore different combinations of hyperparameters and select the configuration that yielded the highest accuracy on the training data. 
These hyperparameters included settings like the number of estimators, learning rate, and kernel function, depending on the specific model. After obtaining the best hyperparameters for each model, we trained them on our preprocessed dataset. This training process involved using the training data to teach the model to make predictions on new, unseen examples. Once trained, the models were ready for evaluation. We assessed the performance of each model using a set of well-established evaluation metrics. These metrics included accuracy, precision, recall, and F1-score. Accuracy measured the overall correctness of predictions, while precision quantified the proportion of true positive predictions out of all positive predictions. Recall, on the other hand, represented the proportion of true positive predictions out of all actual positives, highlighting a model's ability to identify positive cases. The F1-score combined precision and recall into a single metric, helping us gauge the overall balance between these two aspects. To visualize the model's performance, we created key graphical representations. These included confusion matrices, which showed the number of true positive, true negative, false positive, and false negative predictions, aiding in understanding the model's classification results. Additionally, we generated Receiver Operating Characteristic (ROC) curves and area under the curve (AUC) scores, which depicted a model's ability to distinguish between classes. High AUC values indicated excellent model performance. Furthermore, we constructed true values versus predicted values diagrams to provide insights into how well our models aligned with the actual data distribution. Learning curves were also generated to observe a model's performance as a function of training data size, helping us assess whether the model was overfitting or underfitting. Lastly, we presented the results in a clear and organized manner, saving them to Excel files for easy reference. 
This allowed us to compare the performance of different models and make an informed choice about which one to select for our specific task. In summary, this project was a comprehensive exploration of the machine learning model development and evaluation process. We prepared the data, selected and fine-tuned various models, assessed their performance using multiple metrics and visualizations, and ultimately arrived at a well-informed decision about the most suitable model for our dataset. This approach serves as a valuable blueprint for tackling real-world machine learning challenges effectively.
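The evaluation metrics described above can be illustrated with a minimal sketch in plain Python. This is not the book's code: the function name `classification_metrics` and the sample labels are hypothetical, chosen only to show how accuracy, precision, recall, and F1-score follow from the four confusion-matrix counts. In practice, Scikit-Learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute binary classification metrics from true and predicted labels.

    Hypothetical helper for illustration; not taken from the book.
    """
    pairs = list(zip(y_true, y_pred))

    # The four cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 marks the positive class, 0 the negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these sample labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.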

Data Science and Deep Learning Workshop For Scientists and Engineers

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 1977

Book Description
WORKSHOP 1: In this workshop, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to implement deep learning for recognizing traffic signs using the GTSRB dataset, detecting brain tumors using the Brain Image MRI dataset, classifying gender, and recognizing facial expressions using the FER2013 dataset. In Chapter 1, you will learn to create GUI applications that display line graphs using PyQt. You will also learn how to display an image and its histogram. In Chapter 2, you will learn how to use TensorFlow, Keras, Scikit-Learn, Pandas, NumPy, and other libraries to predict handwritten digits using the MNIST dataset with PyQt. You will build a GUI application for this purpose. In Chapter 3, you will learn how to recognize traffic signs using the GTSRB dataset from Kaggle. There are several different types of traffic signs, such as speed limits, no entry, traffic signals, turn left or right, children crossing, and no passing of heavy vehicles. Traffic sign classification is the process of identifying which class a traffic sign belongs to. In this Python project, you will build a deep neural network model that can classify traffic signs in images into different categories. With this model, you will be able to read and understand traffic signs, which is a very important task for all autonomous vehicles. You will build a GUI application for this purpose. In Chapter 4, you will learn how to detect brain tumors using the Brain Image MRI dataset provided by Kaggle (https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection) with a CNN model. You will build a GUI application for this purpose. In Chapter 5, you will learn how to classify gender using a dataset provided by Kaggle (https://www.kaggle.com/cashutosh/gender-classification-dataset) with MobileNetV2 and CNN models. You will build a GUI application for this purpose.
In Chapter 6, you will learn how to recognize facial expressions using the FER2013 dataset provided by Kaggle (https://www.kaggle.com/nicolejyt/facialexpressionrecognition) with a CNN model. You will also build a GUI application for this purpose. WORKSHOP 2: In this workshop, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to implement deep learning for classifying fruits, classifying cats and dogs, detecting furniture, and classifying fashion items. In Chapter 1, you will learn to create GUI applications that display line graphs using PyQt. You will also learn how to display an image and its histogram. Then, you will learn how to use OpenCV, NumPy, and other libraries to perform feature extraction with a Python GUI (PyQt). The feature detection techniques used in this chapter are Harris Corner Detection, the Shi-Tomasi Corner Detector, and Scale-Invariant Feature Transform (SIFT). In Chapter 2, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to classify fruits using the Fruits 360 dataset provided by Kaggle (https://www.kaggle.com/moltean/fruits/code) with transfer learning and CNN models. You will build a GUI application for this purpose. In Chapter 3, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to classify cats and dogs using a dataset provided by Kaggle (https://www.kaggle.com/chetankv/dogs-cats-images) with a CNN and a data generator. You will build a GUI application for this purpose. In Chapter 4, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to detect furniture using the Furniture Detector dataset provided by Kaggle (https://www.kaggle.com/akkithetechie/furniture-detector) with the VGG16 model. You will build a GUI application for this purpose.
In Chapter 5, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to classify fashion items using the Fashion MNIST dataset provided by Kaggle (https://www.kaggle.com/zalando-research/fashionmnist/code) with a CNN model. You will build a GUI application for this purpose. WORKSHOP 3: In this workshop, you will implement deep learning for detecting vehicle license plates, recognizing sign language, and detecting surface cracks using TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries. In Chapter 1, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to detect vehicle license plates using the Car License Plate Detection dataset provided by Kaggle (https://www.kaggle.com/andrewmvd/car-plate-detection/download). In Chapter 2, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to recognize sign language using the Sign Language Digits Dataset provided by Kaggle (https://www.kaggle.com/ardamavi/sign-language-digits-dataset/download). In Chapter 3, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to detect surface cracks using the Surface Crack Detection dataset provided by Kaggle (https://www.kaggle.com/arunrk7/surface-crack-detection/download). WORKSHOP 4: In this workshop, you will implement deep learning-based image classification for detecting face masks, classifying weather, and recognizing flowers using TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries. In Chapter 1, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to detect face masks using the Face Mask Detection Dataset provided by Kaggle (https://www.kaggle.com/omkargurav/face-mask-dataset/download).
In Chapter 2, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to classify weather using the Multi-class Weather Dataset provided by Kaggle (https://www.kaggle.com/pratik2901/multiclass-weather-dataset/download). WORKSHOP 5: In this workshop, you will implement deep learning-based image classification for classifying monkey species, recognizing rock, paper, and scissors, and classifying airplanes, cars, and ships using TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries. In Chapter 1, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to classify monkey species using the 10 Monkey Species dataset provided by Kaggle (https://www.kaggle.com/slothkong/10-monkey-species/download). In Chapter 2, you will learn how to use TensorFlow, Keras, Scikit-Learn, OpenCV, Pandas, NumPy, and other libraries to recognize rock, paper, and scissors using the Rock Paper Scissors dataset provided by Kaggle (https://www.kaggle.com/sanikamal/rock-paper-scissors-dataset/download). WORKSHOP 6: In this workshop, you will implement two data science projects using Scikit-Learn, Scipy, and other libraries with Python GUI. In Chapter 1, you will learn how to use Scikit-Learn, Scipy, and other libraries to predict traffic (number of vehicles) in four different junctions using the Traffic Prediction Dataset provided by Kaggle (https://www.kaggle.com/fedesoriano/traffic-prediction-dataset/download). This dataset contains 48.1k (48120) hourly observations of the number of vehicles in four different junctions. It has four columns: 1) DateTime; 2) Junction; 3) Vehicles; and 4) ID.
In Chapter 2, you will learn how to use Scikit-Learn, NumPy, Pandas, and other libraries to analyze and predict heart attacks using the Heart Attack Analysis & Prediction Dataset provided by Kaggle (https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/download). WORKSHOP 7: In this workshop, you will implement two data science projects using Scikit-Learn, Scipy, and other libraries with Python GUI. In Project 1, you will learn how to use Scikit-Learn, NumPy, Pandas, Seaborn, and other libraries to predict early stage diabetes using the Early Stage Diabetes Risk Prediction Dataset provided by Kaggle (https://www.kaggle.com/ishandutta/early-stage-diabetes-risk-prediction-dataset/download). This dataset contains the sign and symptom data of newly diabetic or would-be diabetic patients. It was collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh and approved by a doctor. You will develop a GUI using PyQt5 to plot the distribution of features, feature importance, cross validation score, and predicted values versus true values. The machine learning models used in this project are Adaboost, Random Forest, Gradient Boosting, Logistic Regression, and Support Vector Machine. In Project 2, you will learn how to use Scikit-Learn, NumPy, Pandas, and other libraries to analyze and predict breast cancer using the Breast Cancer Prediction Dataset provided by Kaggle (https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset/download). Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray).
After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body. This breast cancer dataset was obtained from Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. You will develop a GUI using PyQt5 to plot the distribution of features, pairwise relationships, test scores, predicted values versus true values, confusion matrix, and decision boundary. The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine. WORKSHOP 8: In this workshop, you will learn how to use Scikit-Learn, TensorFlow, Keras, NumPy, Pandas, Seaborn, and other libraries to implement brain tumor classification and detection with machine learning using the Brain Tumor dataset provided by Kaggle. This dataset contains five first order features: Mean (the contribution of individual pixel intensity for the entire image), Variance (used to find how each pixel varies from the neighboring pixel), Standard Deviation (the deviation of measured values or the data from its mean), Skewness (measures of symmetry), and Kurtosis (describes the peak of e.g. a frequency distribution). It also contains eight second order features: Contrast, Energy, ASM (Angular second moment), Entropy, Homogeneity, Dissimilarity, Correlation, and Coarseness. The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, and Support Vector Machine. The deep learning models used in this project are MobileNet and ResNet50. In this project, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, training loss, and training accuracy.
WORKSHOP 9: In this workshop, you will learn how to use Scikit-Learn, Keras, TensorFlow, NumPy, Pandas, Seaborn, and other libraries to perform COVID-19 epitope prediction using the COVID-19/SARS B-cell Epitope Prediction dataset provided by Kaggle. All three datasets contain protein and peptide information: parent_protein_id : parent protein ID; protein_seq : parent protein sequence; start_position : start position of peptide; end_position : end position of peptide; peptide_seq : peptide sequence; chou_fasman : peptide feature; emini : peptide feature, relative surface accessibility; kolaskar_tongaonkar : peptide feature, antigenicity; parker : peptide feature, hydrophobicity; isoelectric_point : protein feature; aromaticity : protein feature; hydrophobicity : protein feature; stability : protein feature; and target : antibody valence (target value). The machine learning models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, Gradient Boosting, XGB classifier, and MLP classifier. Then, you will learn how to use sequential CNN and VGG16 models to detect and predict COVID-19 from X-ray images using the COVID-19 Xray Dataset (Train & Test Sets) provided by Kaggle. The dataset folder consists of two subfolders: test and train. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, training loss, and training accuracy. WORKSHOP 10: In this workshop, you will learn how to use Scikit-Learn, Keras, TensorFlow, NumPy, Pandas, Seaborn, and other libraries to analyze and predict stroke using a dataset provided by Kaggle.
The dataset consists of the following attributes: id: unique identifier; gender: "Male", "Female" or "Other"; age: age of the patient; hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension; heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease; ever_married: "No" or "Yes"; work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"; Residence_type: "Rural" or "Urban"; avg_glucose_level: average glucose level in blood; bmi: body mass index; smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"; and stroke: 1 if the patient had a stroke or 0 if not. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. WORKSHOP 11: In this workshop, you will learn how to use Scikit-Learn, Keras, TensorFlow, NumPy, Pandas, Seaborn, and other libraries to classify and predict Hepatitis C using a dataset provided by the UCI Machine Learning Repository. All attributes in the dataset except Category and Sex are numerical. Attributes 1 to 4 refer to the data of the patient: X (Patient ID/No.), Category (diagnosis) (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis'), Age (in years), and Sex (f, m). Attributes 5 to 14 refer to laboratory data: ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, and PROT. The target attribute for classification is Category (attribute 2): blood donors vs. Hepatitis C patients (including the stage of progression: 'just' Hepatitis C, Fibrosis, Cirrhosis).
The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and ANN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.

DATA SCIENCE WORKSHOP: Lung Cancer Classification and Prediction Using Machine Learning and Deep Learning with Python GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 294

Book Description
This Data Science Workshop presents a comprehensive journey through lung cancer analysis. Beginning with data exploration, the dataset is thoroughly examined to uncover insights into its structure and contents. The focus then shifts to categorizing features and understanding their distribution patterns, revealing key trends and relationships that could impact the predictive models. To predict lung cancer using machine learning models, an extensive grid search is conducted, fine-tuning model hyperparameters for optimal performance. The iterative process involves training various models, such as K-Nearest Neighbors, Decision Trees, Random Forests, Gradient Boosting, Naive Bayes, Extreme Gradient Boosting, Light Gradient Boosting, and Multi-Layer Perceptron, and evaluating their outcomes to select the best-performing approach. Utilizing GridSearchCV aids in systematically optimizing parameters to enhance predictive accuracy. Deep Learning is harnessed through Artificial Neural Networks (ANN), which involve building multi-layered models capable of learning intricate patterns from data. The ANN architecture, comprising input, hidden, and output layers, is designed to capture the complex relationships within the dataset. Metrics like accuracy, precision, recall, and F1-score are employed to comprehensively evaluate model performance. These metrics provide a holistic view of the model's ability to classify lung cancer cases accurately and minimize false positives or negatives. The Graphical User Interface (GUI) aspect of the project is developed using PyQt, enabling user-friendly interactions with the predictive models. The GUI design includes features such as radio buttons for selecting preprocessing options (Raw, Normalization, or Standardization), a combobox for choosing the ANN model type (e.g., CNN 1D), and buttons to initiate training and prediction. 
The PyQt interface enhances usability by allowing users to visualize predictions, classification reports, confusion matrices, and loss-accuracy plots. The GUI's functionality expands to encompass the entire workflow. It enables data preprocessing by loading and splitting the dataset into training and testing subsets. Users can then select machine learning or deep learning models for training. The trained models are saved for future use to avoid retraining. The interface also facilitates model evaluation, showcasing accuracy scores, classification reports detailing precision and recall, and visualizations depicting loss and accuracy trends over epochs. The project's educational value lies in its comprehensive approach, taking participants through every step of a data science pipeline. Attendees gain insights into data preprocessing, model selection, hyperparameter tuning, and performance evaluation. The integration of machine learning and deep learning methodologies, along with GUI development, provides a well-rounded understanding of creating predictive tools for real-world applications. Participants leave the workshop empowered with the skills to explore and analyze medical datasets, implement machine learning and deep learning models, and build user-friendly interfaces for effective interaction. The workshop bridges the gap between theoretical knowledge and practical implementation, fostering a deeper understanding of data-driven decision-making in the realm of medical diagnostics and classification.
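The three preprocessing options mentioned above (Raw, Normalization, Standardization) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the workshop's implementation: the function name `scale_column`, the mode strings, and the sample ages are hypothetical. Here "Normalization" is taken to mean min-max scaling to the range [0, 1] and "Standardization" to mean rescaling to zero mean and unit variance, which is what Scikit-Learn's `MinMaxScaler` and `StandardScaler` do by default.

```python
from statistics import mean, pstdev


def scale_column(values, mode="raw"):
    """Scale one numeric feature column according to the selected option.

    Hypothetical helper for illustration; the mode names are assumptions.
    """
    values = list(values)
    if mode == "raw":
        # Leave the feature untouched.
        return values
    if mode == "normalization":
        # Min-max scaling: map the smallest value to 0 and the largest to 1.
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]
    if mode == "standardization":
        # Z-score scaling: subtract the mean, divide by the population
        # standard deviation.
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma if sigma else 0.0 for v in values]
    raise ValueError(f"unknown preprocessing mode: {mode!r}")


# Made-up ages used only to demonstrate the three options.
ages = [20, 35, 50, 65]
raw = scale_column(ages, "raw")
normalized = scale_column(ages, "normalization")
standardized = scale_column(ages, "standardization")
print(raw, normalized, standardized, sep="\n")
```

In a real pipeline the scaling statistics (minimum, maximum, mean, standard deviation) would normally be computed on the training subset only and then reused for the test subset, so that no information leaks from the test data into training.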

PYTHON GUI PROJECTS WITH MACHINE LEARNING AND DEEP LEARNING

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 917

Book Description
PROJECT 1: THE APPLIED DATA SCIENCE WORKSHOP: Prostate Cancer Classification and Recognition Using Machine Learning and Deep Learning with Python GUI Prostate cancer is cancer that occurs in the prostate. The prostate is a small walnut-shaped gland in males that produces the seminal fluid that nourishes and transports sperm. Prostate cancer is one of the most common types of cancer. Many prostate cancers grow slowly and are confined to the prostate gland, where they may not cause serious harm. However, while some types of prostate cancer grow slowly and may need minimal or even no treatment, other types are aggressive and can spread quickly. The dataset used in this project covers 100 patients and can be used to implement machine learning and deep learning algorithms. It consists of 100 observations and 10 variables (eight numeric variables, one categorical variable, and an ID), which are as follows: Id, Radius, Texture, Perimeter, Area, Smoothness, Compactness, Diagnosis Result, Symmetry, and Fractal Dimension. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 2: THE APPLIED DATA SCIENCE WORKSHOP: Urinary Biomarkers Based Pancreatic Cancer Classification and Prediction Using Machine Learning with Python GUI Pancreatic cancer is an extremely deadly type of cancer. Once diagnosed, the five-year survival rate is less than 10%. However, if pancreatic cancer is caught early, the odds of surviving are much better.
Unfortunately, many cases of pancreatic cancer show no symptoms until the cancer has spread throughout the body. A diagnostic test to identify people with pancreatic cancer could be enormously helpful. In a paper by Silvana Debernardi and colleagues, published in 2020 in the journal PLOS Medicine, a multi-national team of researchers sought to develop an accurate diagnostic test for the most common type of pancreatic cancer, called pancreatic ductal adenocarcinoma or PDAC. They gathered a series of biomarkers from the urine of three groups of patients: healthy controls, patients with non-cancerous pancreatic conditions such as chronic pancreatitis, and patients with pancreatic ductal adenocarcinoma. When possible, these patients were age- and sex-matched. The goal was to develop an accurate way to identify patients with pancreatic cancer. The key features are four urinary biomarkers: creatinine, LYVE1, REG1B, and TFF1. Creatinine is a waste product of muscle metabolism that is often used as an indicator of kidney function. LYVE1 is lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role in tumor metastasis. REG1B is a protein that may be associated with pancreas regeneration. TFF1 is trefoil factor 1, which may be related to regeneration and repair of the urinary tract. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, and MLP classifier. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.
PROJECT 3: DATA SCIENCE CRASH COURSE: Voice Based Gender Classification and Prediction Using Machine Learning and Deep Learning with Python GUI This dataset was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0hz-280hz (human vocal range). The following acoustic properties of each voice are measured and included within the CSV: meanfreq: mean frequency (in kHz); sd: standard deviation of frequency; median: median frequency (in kHz); Q25: first quantile (in kHz); Q75: third quantile (in kHz); IQR: interquantile range (in kHz); skew: skewness; kurt: kurtosis; sp.ent: spectral entropy; sfm: spectral flatness; mode: mode frequency; centroid: frequency centroid (see specprop); peakf: peak frequency (frequency with highest energy); meanfun: average of fundamental frequency measured across acoustic signal; minfun: minimum fundamental frequency measured across acoustic signal; maxfun: maximum fundamental frequency measured across acoustic signal; meandom: average of dominant frequency measured across acoustic signal; mindom: minimum of dominant frequency measured across acoustic signal; maxdom: maximum of dominant frequency measured across acoustic signal; dfrange: range of dominant frequency measured across acoustic signal; modindx: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range; and label: male or female. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. 
Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 4: DATA SCIENCE CRASH COURSE: Thyroid Disease Classification and Prediction Using Machine Learning and Deep Learning with Python GUI Thyroid disease is a general term for a medical condition that keeps your thyroid from making the right amount of hormones. The thyroid typically makes hormones that keep the body functioning normally. When the thyroid makes too much thyroid hormone, the body uses energy too quickly. The two main types of thyroid disease are hypothyroidism and hyperthyroidism. Both conditions can be caused by other diseases that impact the way the thyroid gland works. The dataset used in this project comes from the Garavan Institute in Sydney, Australia: six databases supplied, with documentation, by Ross Quinlan. Each database has approximately 2800 training (data) instances and 972 test instances. The dataset contains plenty of missing data and has 29 or so attributes, either Boolean or continuously valued. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, MLP classifier, and CNN 1D. Finally, you will develop a GUI using PyQt5 to plot the decision boundary, ROC, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.

5 FIVE DATA SCIENCE PROJECTS FOR ANALYSIS, CLASSIFICATION, PREDICTION, AND SENTIMENT ANALYSIS WITH PYTHON GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 979

Book Description
PROJECT 1: SUPERMARKET SALES ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset used in this project reflects the growth of supermarkets, and the strong market competition among them, in the most populated cities. It contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months. Predictive data analytics methods are easy to apply with this dataset. Attribute information in the dataset is as follows: Invoice id: Computer generated sales slip invoice identification number; Branch: Branch of supercenter (3 branches are available, identified by A, B and C); City: Location of supercenters; Customer type: Type of customers, recorded as Member for customers using a member card and Normal for those without one; Gender: Gender type of customer; Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel; Unit price: Price of each product in $; Quantity: Number of products purchased by customer; Tax: 5% tax fee for customer buying; Total: Total price including tax; Date: Date of purchase (records available from January 2019 to March 2019); Time: Purchase time (10am to 9pm); Payment: Payment used by customer for purchase (3 methods are available: Cash, Credit card and Ewallet); COGS: Cost of goods sold; Gross margin percentage: Gross margin percentage; Gross income: Gross income; and Rating: Customer stratification rating on their overall shopping experience (on a scale of 1 to 10). In this project, you will predict the rating using machine learning. The machine learning models used in this project to predict clusters as the target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP.
Finally, you will plot the decision boundary, distribution of features, feature importance, cross validation score, predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 2: DETECTING CYBERBULLYING TWEETS USING MACHINE LEARNING AND DEEP LEARNING WITH PYTHON GUI As social media usage becomes increasingly prevalent in every age group, a vast majority of citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity means that cyberbullying can effectively impact anyone at any time or anywhere, and the relative anonymity of the internet makes such personal attacks more difficult to stop than traditional bullying. On April 15th, 2020, UNICEF issued a warning in response to the increased risk of cyberbullying during the COVID-19 pandemic due to widespread school closures, increased screen time, and decreased face-to-face social interaction. The statistics of cyberbullying are outright alarming: 36.5% of middle and high school students have felt cyberbullied and 87% have observed cyberbullying, with effects ranging from decreased academic performance to depression to suicidal thoughts. In light of all of this, this dataset contains more than 47000 tweets labelled according to the class of cyberbullying: Age; Ethnicity; Gender; Religion; Other type of cyberbullying; and Not cyberbullying. The data has been balanced to contain ~8000 tweets of each class. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, LSTM, and CNN. The three feature scaling options used for the machine learning models are raw, minmax scaler, and standard scaler.
Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 3: HIGHER EDUCATION STUDENT ACADEMIC PERFORMANCE ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset used in this project was collected from the Faculty of Engineering and Faculty of Educational Sciences students in 2019. The purpose is to predict students' end-of-term performances using ML techniques. Attribute information in the dataset are as follows: Student ID; Student Age (1: 18-21, 2: 22-25, 3: above 26); Sex (1: female, 2: male); Graduated high-school type: (1: private, 2: state, 3: other); Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full); Additional work: (1: Yes, 2: No); Regular artistic or sports activity: (1: Yes, 2: No); Do you have a partner: (1: Yes, 2: No); Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410); Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other); Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other); Mother's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.); Father's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.); Number of sisters/brothers (if available): (1: 1, 2:, 2, 3: 3, 4: 4, 5: 5 or above); Parental status: (1: married, 2: divorced, 3: died - one of them or both); Mother's occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other); Father's occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other); Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours); Reading 
frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often); Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often); Attendance to the seminars/conferences related to the department: (1: Yes, 2: No); Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral); Attendance to classes (1: always, 2: sometimes, 3: never); Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable); Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never); Taking notes in classes: (1: never, 2: sometimes, 3: always); Listening in classes: (1: never, 2: sometimes, 3: always); Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always); Flip-classroom: (1: not useful, 2: useful, 3: not applicable); Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49); Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49); Course ID; and OUTPUT: Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA). The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling options are used: raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy.

PROJECT 4: COMPANY BANKRUPTCY ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI

The dataset was collected from the Taiwan Economic Journal for the years 1999 to 2009.
Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. The attributes in the dataset are as follows: Y - Bankrupt?: Class label; X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C); X2 - ROA(A) before interest and % after tax: Return On Total Assets(A); X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B); X4 - Operating Gross Margin: Gross Profit/Net Sales; X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales; X6 - Operating Profit Rate: Operating Income/Net Sales; X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales; X8 - After-tax net Interest Rate: Net Income/Net Sales; X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio; X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales; X11 - Operating Expense Rate: Operating Expenses/Net Sales; X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales; X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities; X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity; X15 - Tax rate (A): Effective Tax Rate; X16 - Net Value Per Share (B): Book Value Per Share(B); X17 - Net Value Per Share (A): Book Value Per Share(A); X18 - Net Value Per Share (C): Book Value Per Share(C); X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income; X20 - Cash Flow Per Share; X21 - Revenue Per Share (Yuan ¥): Sales Per Share; X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share; X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share; X24 - Realized Sales Gross Profit Growth Rate; X25 - Operating Profit Growth Rate: Operating Income Growth; X26 - After-tax Net Profit Growth Rate: Net Income Growth; X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth; X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain
or Loss Growth; X29 - Total Asset Growth Rate: Total Asset Growth; X30 - Net Value Growth Rate: Total Equity Growth; X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth; X32 - Cash Reinvestment %: Cash Reinvestment Ratio; X33 - Current Ratio; X34 - Quick Ratio: Acid Test; X35 - Interest Expense Ratio: Interest Expenses/Total Revenue; X36 - Total debt/Total net worth: Total Liability/Equity Ratio; X37 - Debt ratio %: Liability/Total Assets; X38 - Net worth/Assets: Equity/Total Assets; X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets; X40 - Borrowing dependency: Cost of Interest-bearing Debt; X41 - Contingent liabilities/Net worth: Contingent Liability/Equity; X42 - Operating profit/Paid-in capital: Operating Income/Capital; X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital; X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity; X45 - Total Asset Turnover; X46 - Accounts Receivable Turnover; X47 - Average Collection Days: Days Receivable Outstanding; X48 - Inventory Turnover Rate (times); X49 - Fixed Assets Turnover Frequency; X50 - Net Worth Turnover Rate (times): Equity Turnover; X51 - Revenue per person: Sales Per Employee; X52 - Operating profit per person: Operation Income Per Employee; X53 - Allocation rate per person: Fixed Assets Per Employee; X54 - Working Capital to Total Assets; X55 - Quick Assets/Total Assets; X56 - Current Assets/Total Assets; X57 - Cash/Total Assets; X58 - Quick Assets/Current Liability; X59 - Cash/Current Liability; X60 - Current Liability to Assets; X61 - Operating Funds to Liability; X62 - Inventory/Working Capital; X63 - Inventory/Current Liability; X64 - Current Liabilities/Liability; X65 - Working Capital/Equity; X66 - Current Liabilities/Equity; X67 - Long-term Liability to Current Assets; X68 - Retained Earnings to Total Assets; X69 - Total income/Total expense; X70 - Total expense/Assets; X71 - Current Asset Turnover
Rate: Current Assets to Sales; X72 - Quick Asset Turnover Rate: Quick Assets to Sales; X73 - Working Capital Turnover Rate: Working Capital to Sales; X74 - Cash Turnover Rate: Cash to Sales; X75 - Cash Flow to Sales; X76 - Fixed Assets to Assets; X77 - Current Liability to Liability; X78 - Current Liability to Equity; X79 - Equity to Long-term Liability; X80 - Cash Flow to Total Assets; X81 - Cash Flow to Liability; X82 - CFO to Assets; X83 - Cash Flow to Equity; X84 - Current Liability to Current Assets; X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise; X86 - Net Income to Total Assets; X87 - Total assets to GNP price; X88 - No-credit Interval; X89 - Gross Profit to Sales; X90 - Net Income to Stockholder's Equity; X91 - Liability to Equity; X92 - Degree of Financial Leverage (DFL); X93 - Interest Coverage Ratio (Interest expense to EBIT); X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise; and X95 - Equity to Liability. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling options are used: raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy.

PROJECT 5: DATA SCIENCE FOR RAIN CLASSIFICATION AND PREDICTION WITH PYTHON GUI

This dataset contains about 10 years of daily weather observations from many locations across Australia. RainTomorrow is the target variable to predict. You will predict whether or not it rains on the next day. This column is Yes if the rain for that day was 1mm or more. Observations were drawn from numerous weather stations.
The daily observations are available from http://www.bom.gov.au/climate/data. The dataset contains 23 attributes. Some of them are as follows: DATE - The date of observation; LOCATION - The common name of the location of the weather station; MINTEMP - The minimum temperature in degrees Celsius; MAXTEMP - The maximum temperature in degrees Celsius; RAINFALL - The amount of rainfall recorded for the day in mm; EVAPORATION - The so-called Class A pan evaporation (mm) in the 24 hours to 9am; SUNSHINE - The number of hours of bright sunshine in the day; WINDGUESTDIR - The direction of the strongest wind gust in the 24 hours to midnight; WINDGUESTSPEED - The speed (km/h) of the strongest wind gust in the 24 hours to midnight; and WINDDIR9AM - Direction of the wind at 9am. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling options are used: raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy.
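
The three feature scaling options named in these project descriptions (raw, minmax scaler, and standard scaler) can be illustrated with a minimal sketch. This is not code from the book: it uses only the Python standard library to show what each option computes for a single numeric column, and the temperature readings and function names are hypothetical. In scikit-learn the corresponding classes are `MinMaxScaler` and `StandardScaler`.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standard_scale(values):
    """Centre values on zero mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical MAXTEMP readings in degrees Celsius (illustrative values only).
max_temp = [18.0, 22.0, 26.0, 30.0, 34.0]

raw = list(max_temp)                       # option 1: leave the feature as-is
scaled_minmax = minmax_scale(max_temp)     # option 2: minmax scaler
scaled_standard = standard_scale(max_temp) # option 3: standard scaler

print(raw)
print(scaled_minmax)
print([round(v, 3) for v in scaled_standard])
```

Distance-based models such as K-Nearest Neighbor and Support Vector Machine are usually the most sensitive to this choice, while tree-based models such as Random Forest are largely unaffected by monotonic rescaling of individual features.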

DATA SCIENCE WORKSHOP: Parkinson Classification and Prediction Using Machine Learning and Deep Learning with Python GUI

Author: Vivian Siahaan
Publisher: BALIGE PUBLISHING
ISBN:
Category : Computers
Languages : en
Pages : 373

Book Description
In this data science workshop focused on Parkinson's disease classification and prediction, we begin by exploring the dataset containing features relevant to the disease. We perform data exploration to understand the structure of the dataset, check for missing values, and gain insights into the distribution of features. Visualizations are used to analyze the distribution of features and their relationship with the target variable, which is whether an individual has Parkinson's disease or not. After data exploration, we preprocess the dataset to prepare it for machine learning models. This involves handling missing values, scaling numerical features, and encoding categorical variables if necessary. We ensure that the dataset is split into training and testing sets to evaluate model performance effectively. With the preprocessed dataset, we move on to the classification task. Using various machine learning algorithms such as Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, Gradient Boosting, Naive Bayes, Adaboost, Extreme Gradient Boosting, Light Gradient Boosting, and Multi-Layer Perceptron (MLP), we train multiple models on the training data. To optimize the hyperparameters of these models, we utilize Grid Search, a technique to exhaustively search for the best combination of hyperparameters. For each machine learning model, we evaluate their performance on the test set using various metrics such as accuracy, precision, recall, and F1-score. These metrics help us understand the model's ability to correctly classify individuals with and without Parkinson's disease. Next, we delve into building an Artificial Neural Network (ANN) for Parkinson's disease prediction. The ANN architecture is designed with input, hidden, and output layers. We utilize the TensorFlow library to construct the neural network with appropriate activation functions, dropout layers, and optimizers. 
The ANN is trained on the preprocessed data for a fixed number of epochs, and we monitor its training and validation loss and accuracy to ensure proper training. After training the ANN, we evaluate its performance using the same metrics as the machine learning models, comparing its accuracy, precision, recall, and F1-score against the previous models. This comparison helps us understand the benefits and limitations of using deep learning for Parkinson's disease prediction. To provide a user-friendly interface for the classification and prediction process, we design a Python GUI using PyQt. The GUI allows users to load their own dataset, choose data preprocessing options, select machine learning classifiers, train models, and predict using the ANN. The GUI provides visualizations of the data distribution, model performance, and prediction results for better understanding and decision-making. In the GUI, users have the option to choose different data preprocessing techniques, such as raw data, normalization, and standardization, to observe how these techniques impact model performance. The choice of classifiers is also available, allowing users to compare different models and select the one that suits their needs best. Throughout the workshop, we emphasize the importance of proper evaluation metrics and the significance of choosing the right model for Parkinson's disease classification and prediction. We highlight the strengths and weaknesses of each model, enabling users to make informed decisions based on their specific requirements and data characteristics. Overall, this data science workshop provides participants with a comprehensive understanding of Parkinson's disease classification and prediction using machine learning and deep learning techniques. Participants gain hands-on experience in data preprocessing, model training, hyperparameter tuning, and designing a user-friendly GUI for efficient and effective data analysis and prediction.
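
The evaluation metrics mentioned in the description (accuracy, precision, recall, and F1-score) can be sketched as follows. This is an illustrative example rather than the workshop's own code: it is written in plain Python, the function name is hypothetical, and the labels are made up, with 1 standing for "has Parkinson's disease" and 0 for "does not".

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up test-set labels and predictions (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all four numbers matters for a medical dataset because accuracy alone can hide a model that misses many positive cases; recall makes that failure visible. In scikit-learn the same quantities are available from `sklearn.metrics` (for example `accuracy_score` and `f1_score`).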