vineri, 20 ianuarie 2023

Disease Symptom Prediction

Introduction:

Machine learning is programming computers to optimize a performance using example data or past data.

The development and exploitation of several Machine Learning techniques in numerous real-world application areas (e.g. Industry, Healthcare and Bio science) has led to a more accurate analysis of medical databases, in order to extract useful pieces of information of the specified data in healthcare communities, biomedical fields.


 

Disease prediction and in a broader context, medical informatics, have recently gained significant attention from the data science research community in recent years. This is primarily due to the wide adaptation of computer-based technology into the health sector in different forms (e.g., electronic health records and administrative data) and subsequent availability of large health databases for researchers.

The aim of developing classifier system using machine learning algorithms is to    immensely help to solve the health-related issues by assisting the physicians to predict  and diagnose diseases at an early stage.


 

Various classification and data mining techniques are used to classify the disease data and predict particular diseases.

Machine learning techniques provide the solution to reduce false and late prediction and understand the symptoms for the particular disease.

 

Methods: 

1.  Supervised machine learning algorithm 

At its most basic sense, machine learning uses programmed algorithms that  learn and optimize their operations by analysing input data to make predictions within an acceptable range. With the feeding of new data, these algorithms tend to make more accurate predictions.

In supervised machine learning algorithms, a labelled training data set is used first to train the underlying algorithm. This trained algorithm is then fed on the unlabelled test data  set to categorize them into similar groups. 

 

 

2. Support vector machine

Support vector machine (SVM) algorithm can classify both linear and non-linear  data.  It first maps each data item into an n-dimensional feature space where n is the number of features. It then identifies the hyperplane that separates the data items into two classes while maximising the marginal distance for both classes and minimising the classification errors


 

3. Decision tree

Decision tree (DT) is one of the earliest and prominent machine learning algorithms. A decision tree models the decision logics i.e., tests and corresponds outcomes for classifying data items into a tree-like structure.

  
 

4. Random forest

A random forest (RF) is an ensemble classifier and consisting of many DTs similar to the way a forest is a collection of many trees. The different DTs of an RF are trained using the different parts of the training dataset. To classify a new sample, the input vector of that sample is required to pass down with each DT of the forest.


  

5. Naïve Bayes

Naïve Bayes (NB) is a classification technique based on the Bayes’ theorem. This theorem can describe the probability of an event based on the prior knowledge of conditions related to that event. This classifier assumes that a particular feature in a class is not directly related to any other feature although features for that class could have interdependence among themselves.

 

        Disease prediction:

       Proposed system flow:


       Conclusion:

Developing a medical diagnosis system based on machine learning (ML) algorithms for  prediction of any disease can help in a more accurate diagnosis than the conventional method. 

 


 

 

 

 

 

 

Bibliography:

https://www.irjmets.com/uploadedfiles/paper/issue_5_may_2022/24065/final/fin_irjmets1653367944.pdf

https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-1004-8 

https://ieeexplore.ieee.org/document/8819782

https://d1wqtxts1xzle7.cloudfront.net/61714216/IRJET-V6I1212220200108-1822-1wxmovz-libre.pdf?1578481824=&response-content-disposition=inline%3B+filename%3DIRJET_Disease_Prediction_using_Machine_L.pdf&Expires=1674210215&Signature=H9YtWH1Zv9-9FJlDCIgL~Oph2LqaMxk4wcabGj9ZtSwahQnQ6IE2fDMwf~nSJ7hPly8uesRd9jyYj0OAi0Ril08KZDY6IsNw~xVCnatjGPvRY0fNjDdUGREMlvFBKBVtLLOf6IlhqwqED3vxxtmOB-jtnI23e-q3im9376qUdFiBCXngdl9RzFrTd9UYPFtLqNLHlsx0jzPPH6negRQv5Q5ZnAX8z~eXehag~lasxn6J0kfYi6ZjfGX6eHV6tEF7s5J0EfqDeQUYvFRzac-RnlERd10XjdYgNVfPpQip8K73vauY5WEqab5osiuqKoThbmzgMevlZBvbPxgTIDGadw__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA 

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3661426

 
 

miercuri, 18 ianuarie 2023

            Customer Segmentation 




What is customer segmentation?

The practice of dividing a company’s customers into groups of individuals that are similar in specific ways relevant to marketing such as age, gender, interests, and spending habits is nothing but Customer Segmentation.

Customer segmentation is basically identifying key differentiators that divide customers into groups that can be targeted. The main goal of segmenting customers is to decide how to relate to customers in each segment to maximize the value of each customer to the business.

Nowadays Companies want to know their customers better so that they can provide useful customers with some goodies and also attract customers to purchase or make some kind of business with the company. When Companies know about customers and segment it into groups, it becomes easy for companies to send customers special offers meant to encourage them to buy more products.

The goals of customer segmentation are customer acquisition, customer retention, increasing customer profitability, customer satisfaction, resource allocation by designing marketing measures or programs and improving target marketing measures .

Customer segmentation also improves customer service and assists in customer loyalty and retention.

Customer segmentation can also help to save waste of money on marketing campaigns. As we will be knowing the customers we need to target.

About Machine Learning Customer Segmentation

Earlier Customer segmentation was a challenging and time-consuming task. This is because for segmenting customers we need to perform hours of manual poring on different tables and querying the data in hopes of finding ways to group customers together. So to overcome this task, machine learning is used to segment customers.

K means clustering is one of the most popular clustering algorithms and usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. The goal of K means is to group data points into distinct non-overlapping subgroups. One of the major application of K means clustering is segmentation of customers to get a better understanding of them which in turn could be used to increase the revenue of the company.



 


marți, 17 ianuarie 2023

House Price Prediction (Kaggle competition)



 Introduction

    Predicting the sale price of a house is a real-life problem that can have a significant impact on the housing market. Accurately predicting the sale price of a home can benefit both buyers and sellers by providing them with a more accurate understanding of a home's value. However, the real estate market is constantly changing and predicting the sale price of a house can be a difficult task. In this blog post, we will explore how we can use advanced regression techniques and machine learning algorithms to accurately predict the sale price of a house.







The Problem

    The real estate market is constantly changing and predicting the sale price of a house can be a difficult task. Real-world issues like estimating a home's sale price can have a big impact on the housing market. By giving a better idea of a house's value, accurate home sale price predictions can be advantageous to both buyers and sellers. Accurate price forecasts can assist purchasers in making more knowledgeable judgments about their purchases and may even prevent them from overpaying for a property. Accurate price forecasts can assist sellers in setting a more appealing asking price and even expedite the sale of their home.

    However, the problem is challenging because the real estate market is affected by many variables such as local economy, housing market conditions, and interest rates. Also, the prices of houses can vary greatly depending on the location, type of house and other factors. 'House pricing prediction' will provide a great opportunity to apply machine learning techniques to a real-world problem and make valuable predictions that can be used by real estate agents, investors, and homebuyers.



The Solution

    With the advancements in artificial intelligence (AI), it is now possible to use machine learning models to accurately predict the sale price of a house. We can use regression techniques and machine learning algorithms to assess a variety of traits and attributes associated with a given property in order to address this issue. The goal of this project is to train an ML model to predict the sale price of a house based on a set of input features. These features can include characteristics of the house such as size, number of bedrooms, location, age, and other factors that can affect the price of a house. By doing this, we can offer



Steps Followed

    -Data Collection: The first step in this project was to collect the data for training and testing the model. The dataset used for this project is available on Kaggle and is part of the competition named “House Prices - Advanced Regression Techniques” (https://www.kaggle.com/c/house-prices-advanced-regressiontechniques). This dataset refers to the reality of residential homes in Ames, Iowa and includes information on houses that have already been sold, such as the sale price and the input features.

    -Data Exploration: Once the dataset was obtained, we performed an initial exploration of the data. This included analyzing the variable types, missing values, and the distribution of the target variable.

    -Data Cleaning: After the initial exploration, we performed data cleaning tasks to ensure that the dataset was ready for modeling. This included filling missing values, transforming categorical variables into numerical, and removing outliers.




    -Feature Selection: Next, we selected the most relevant features from the dataset to be used in the model. We used various feature selection techniques such as correlation matrix, chi-square test, and recursive feature elimination.






    -Model Building: With the cleaned and preprocessed dataset, we built a machine learning model to predict the sale price of a house. We used advanced regression techniques such as Random Forest, XGBoost, and LightGBM to build the model.

    -Model Evaluation: After building the model, we evaluated its performance using various evaluation metrics such as R-squared and root mean squared error (RMSE). We also compared the performance of the different models built to select the best performing one.





Conclusion

    In this blog post, we explored how we can use advanced regression techniques and machine learning algorithms to predict the sale price of a house. By following the steps outlined above, we were able to build a model that can accurately predict the sale price of a house based on a set of input features. This will enable us to provide a better understanding of the value of a property and help buyers and sellers make more informed decisions in the housing market.






Bibliography


[1] Marill, K. A. (2004). Advanced statistics: Linear regression,part I: Simple linear regression. Academic Emergency Medicine, 87 - 93. doi:10.1111/j.1553-2712.2004.tb01378.x

[2] python-reference_catboostregressor. (2022, 04 30). Retrieved from catboost.ai:

https://catboost.ai/en/docs/concepts/python-reference_catboostregressor

[3] Sykes, A. O. (1993). An introduction to regression analysis.

[4] Fedorov, N., & Petrichenko, Y. (2020). Gradient boosting–based machine learning methods in real estate market forecasting. ITIDS 2020. doi:10.2991/aisr.k.201029.039

[5] Arora, M., Sharma, A., Katoch, S., Malviya, M., & Chopra, S. (2021). A state of the Art Regressor Model’s comparison for effort estimation of Agile Software. 2021 2nd International Conference on Intelligent Engineering and Management (ICIEM). doi:10.1109/iciem51511.2021.9445345

[6] Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project”. In: Journal of Statistics Education 19.3 (2011).


joi, 12 ianuarie 2023

MorpheusID: A journey into verifying users' identity smoothly

 

 MorpheusID

A journey into verifying users' identity smoothly

 https://morpheus-identity.com/static/media/logo.8f548da9ac1ae0d7966d.png

Introduction

MorpheusID is an innovative solution for identity verification that uses a combination of identity card validation and gesture-based detection to verify the authenticity of a user in just 30 seconds. By using a custom-trained model with TensorFlow, MorpheusID is able to track Romanian identity cards and enhance the image quality for more accurate results. The plugin also utilizes Google Vision as an OCR tool to extract the necessary details from the ID, and has a data pipeline in place to process and template each ID pattern. The application is visible at https://morpheus-identity.com/.

 

Caption 1. Short example of current UI integration

 

To further improve the accuracy of the verification process, MorpheusID utilizes mediapipe for gesture-based tracking and employs a Laplassian variation to avoid blurring. The whole flow is delivered in a user-friendly manner through the use of React with Chakra UI.

Journey so far

MorpheusID first began back in August as a project by two individuals at HackTM, the biggest hackathon in Eastern Europe, where it won in the AI, Payments, and Cybersecurity verticals. 

From there, the team behind MorpheusID went on to win Startup Weekend, a local competition focused on the business side of things. They were also selected as finalists in the Innovation Labs, the largest pre-accelerator for startups in Romania.

 

Caption 2. HackTM result back in August 
 

Caption 3.Moment during the InnovationLabs presentation
 
Since then, the team has expanded to seven members and has gone on to win the Rubik Cube accelerator. They have also successfully implemented their solution with two clients.
 

Machine Learning Course Application

For the context of the Machine Learning laboratory, we took the opportunity to focus improving the whole flow with a dedicated precheck mechanism for incoming IDs. Essentially, we needed a firewall to prevent 'bad images' to reach our backend. If this were to be handled completely on the client side, this would mean fewer requests and lower operation costs for our product.

This precheck implied two things: 
- a manual verification of the gestures before sending them to the backend pipeline;
- a way to ensure the quality of the image containing the ID. 
 
 
Unfortunately, there is no clear way of measuring that so we resorted to a more obvious approach. In order to figure out if the MRZ on the ID is visible, we need to initially find the ID and control the blur level by giving the end user dynamic prompts.

Technical details

As mentioned earlier, MorpheusID uses a custom-trained model with TensorFlow to track Romanian identity cards and enhance the image quality for more accurate results. This model is trained specifically to recognize the unique patterns and features of Romanian IDs, ensuring a high level of accuracy in the verification process.

We used a cured dataset of over 200 images containing Romanian Identity Cards collected from a campaign soon after InnovationLabs took place. This enabled us to tweak the algorithm easier for our specific use case and allowed us to give improved directions to the end user in order to obtain the best possible output.

 

Caption 4. Real-time ID tracking using Tensorflow for Romanian IDs

 

The plugin also utilizes Google Vision as an OCR tool to extract the necessary details from the ID, such as the user's name and date of birth. This allows for a quick and efficient way to verify the user's identity without the need for manual data entry, which is one of the main reasons why companies look for thrid parties to handle user verification, as the process in itself is rather time consuming.

MorpheusID has a data pipeline in place to process and template each ID pattern, allowing for easy comparison and verification of the user's identity. The plugin also employs mediapipe for gesture-based tracking and uses a Laplassian variation to avoid blurring, further improving the accuracy of the verification process.

Finally, the whole flow is delivered in a user-friendly manner through the use of React with Chakra UI, ensuring a smooth and seamless experience for the user.

Overall, MorpheusID is a cutting-edge solution for identity verification that uses a combination of identity card validation and gesture-based detection to verify the authenticity of a user in just 30 seconds. With a custom-trained model, OCR technology, a data pipeline, and gesture-based tracking, MorpheusID is able to provide a high level of accuracy and convenience for users.

Research and further developments

In today's digital age, user identity verification is more important than ever. With the increasing prevalence of online fraud and identity theft, it is crucial that businesses and organizations have reliable ways to verify the authenticity of their users. MorpheusID offers a unique solution for identity verification that combines identity card validation with gesture-based detection to provide a fast and accurate way to verify users.

According to a study by Javelin Strategy & Research, identity fraud has cost consumers billions of dollars in the past year alone. In addition, a report by the National Institute of Standards and Technology (NIST) found that user identity verification using internet activity can be an effective way to combat fraud and improve security.

MorpheusID's combination of identity card validation and gesture-based detection provides a reliable and convenient solution for user identity verification. Its custom-trained model, OCR technology, data pipeline, and gesture-based tracking make it a valuable tool for businesses and organizations looking to protect their users and prevent fraudulent activity.

Bibliography

Interesting papers related to ways of preventing malicious users  through other means, such as montioring user activity and using that as a measure of user reliability:

1. "Digital Identity Verification: A Survey of Techniques and Challenges" by F. Abbasi and M. K. Rehman (IEEE Access, 2018): This paper provides an overview of various techniques used to verify digital identities and discusses the challenges involved in this process.
2. "The Digital Footprint: An Overview of Online Identity Management" by S. K. D. J. K. D. D. R. D. W. (International Journal of Information Management, 2016): This paper discusses the concept of digital footprint and the various factors that contribute to it, as well as the implications of online identity management for individuals and organizations.
3. "Online Identity Verification: A Survey" by M. A. Baset, M. A. Imran, and M. Yousaf (IEEE Access, 2019): This paper presents a survey of various approaches to online identity verification, including biometric methods, knowledge-based methods, and possession-based methods.
4. "A Survey of Digital Identity Management: Issues and Challenges" by S. M. A. Hossain and M. M. Rahman (IEEE Access, 2016): This paper provides an overview of digital identity management, including the challenges and issues involved in establishing and maintaining a digital identity.
5. "Internet Footprint: A New Form of Digital Identity" by C. C. S. H. and M. B. (International Journal of Computer Science and Information Security, 2013): This paper discusses the concept of internet footprint as a new form of digital identity and the potential risks and benefits associated with it.

I have posted this blog initially in December 2022, check it out at: https://morpheus-machine-learning.blogspot.com/2022/12/morpheusid-journey-into-verifying-users.html

Disease Symptom Prediction

Introduction: Machine learning is programming computers to optimize a performance using example data or past data. The development and e...