ANJANA TIHA

Research Software Engineer, University of Pennsylvania, USA.
Specialized in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, and Distributed Big Data Analytics.

RESEARCH

Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Distributed Big Data Analytics, Information Retrieval, Recommendation Engine

GENERATIVE OPEN DOMAIN CHATBOT APPLICATION WITH DEEP LEARNING

Algorithm & Techniques: Machine Learning, Deep Learning, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bidirectional LSTM, Sequence to Sequence (Seq2Seq), Beam Search, Neural Attention Mechanism

Language: Python

Technology: TensorFlow, PyQt

Tools: Anaconda, Linux

Date: January 2018 - May 2018

Description:

- Developed a generative, open-domain conversational agent (human vs. AI) using the state-of-the-art Sequence-to-Sequence (Seq2Seq) architecture, attaining a validation perplexity of 46.82 and a BLEU score of 10.6.

- Trained the encoder-decoder Seq2Seq model from scratch and further improved the RNN-based model with bidirectional LSTM cells, a neural attention mechanism, and beam search.

- Used the preprocessed Cornell Movie Subtitle Corpus as training data, PyQt for the chat interface (GUI), and an untrained Google Neural Machine Translation (NMT) model for the Seq2Seq module.
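
To illustrate the beam search decoding step mentioned above, here is a minimal sketch in plain Python. It is not the project's TensorFlow code: `beam_search`, `toy_step`, and `TOY_MODEL` are hypothetical names, and the toy probability table stands in for a trained decoder. No length normalization is applied.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict {next_token: probability}.
    """
    # Each hypothesis is (cumulative log-probability, token list).
    beams = [(0.0, [start_token])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), tokens + [token]))
        if not candidates:
            break
        # Keep only the beam_width best hypotheses at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached end_token.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Hypothetical next-token table standing in for a trained decoder.
TOY_MODEL = {
    ("<s>",): {"hi": 0.6, "hello": 0.4},
    ("<s>", "hi"): {"there": 0.4, "</s>": 0.6},
    ("<s>", "hello"): {"there": 0.95, "</s>": 0.05},
}


def toy_step(prefix):
    # Unknown prefixes simply end the sentence.
    return TOY_MODEL.get(tuple(prefix), {"</s>": 1.0})


greedy = beam_search(toy_step, "<s>", "</s>", beam_width=1)
wider = beam_search(toy_step, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 the search is greedy and commits to "hi" early; with a width of 2 it keeps "hello" alive long enough to find the more probable full response.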

Source code

Chat Bot Interface

Link to GitHub Repository (Code)

Chat Bot Responses

Link to GitHub Repository (Code)

AI Powered Chat Bot

Link to Demo

DISTRIBUTED MACHINE LEARNING FOR BIOMARKERS DETECTION FROM WEARABLE SENSOR BIG DATA

Algorithm & Techniques: Machine Learning, Distributed Machine Learning, Classification, Supervised Learning, Mobile Health, Big Data Analytics

Language: Python

Technology: Apache Spark, scikit-learn, Git, GitHub

Tools: IntelliJ Idea, Linux

Date: January 2017 - April 2017

Description:

- Developed a machine learning (ML) module for training models across multiple clusters with Apache Spark.

- Developed a grid search and randomized search cross-validation (CV) module to reduce training time and optimize the hyperparameter search.

- Detected biomarkers (psychological stress) in large-scale streaming data (accelerometer, ECG, respiration rate) from multimodal wearable sensors, with an initial F1 score of 87% using an SVM with a radial (RBF) kernel.
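
The difference between grid search and randomized search can be sketched as follows. This is a single-process illustration in plain Python, not the project's Spark module; `PARAM_GRID`, `toy_score`, and the helper functions are hypothetical, and `toy_score` stands in for a cross-validated F1 score. Because each candidate is scored independently, the loop in `search` is the part a distributed implementation would parallelize.

```python
import itertools
import math
import random

# Hypothetical search space for an SVM with an RBF kernel.
PARAM_GRID = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}


def grid_candidates(param_grid):
    """Yield every parameter combination (exhaustive grid search)."""
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))


def random_candidates(param_grid, n_iter, seed=0):
    """Sample n_iter distinct combinations (randomized search)."""
    all_combos = list(grid_candidates(param_grid))
    rng = random.Random(seed)
    return rng.sample(all_combos, min(n_iter, len(all_combos)))


def search(candidates, score_fn):
    """Return (best_params, best_score); higher scores are better."""
    best_params, best_score = None, float("-inf")
    for params in candidates:
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Assumed stand-in for a cross-validated score; peaks at C=10, gamma=0.01.
    return (-(math.log10(params["C"]) - 1) ** 2
            - (math.log10(params["gamma"]) + 2) ** 2)


best_grid, _ = search(grid_candidates(PARAM_GRID), toy_score)
sampled = random_candidates(PARAM_GRID, n_iter=5)
best_random, _ = search(sampled, toy_score)
```

Grid search evaluates all 12 combinations here, while randomized search evaluates only the 5 sampled ones, trading a possibly worse optimum for less training time.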

Source code

Bio-Marker Detection using Machine Learning from Wearable Sensor Data

Link to GitHub Repository (Code)

Mobile Sensor

Link to GitHub Repository (Code)

Smoking Detection using Machine Learning from Wearable Sensor Data

Link to GitHub Repository (Code)

SURVEY ON MACHINE LEARNING BASED PHYSICAL ACTIVITY RECOGNITION METHODS FROM SENSOR DATA

Date: December 2016 - February 2017

- Surveyed machine learning based algorithms for physical activity recognition (e.g., walking, running, eating, and drinking) from multimodal wearable sensor data.

Source code

Detecting Locomotion

Walking, Running, Jogging

Activity Detection

Detecting Eating, Drinking

DISTRIBUTED BIG DATA APPLICATION FOR LARGE SCALE US STOCK MARKET DATA ANALYSIS

Algorithm & Techniques: Financial Analysis, Stock Market Analysis, Anomaly Detection, Distributed Big Data Analytics, Big Data Analytics, Big Data, Data Analytics, FinTech

Language: Java, Python

Technology: Apache Spark, Maven, Git

Tools: IntelliJ Idea, Linux

Date: May 2017 - August 2017

Description:

- Developed a framework for processing and analyzing 7 years of historical US stock market data (50 TB) at nanosecond granularity from 13 US exchanges, running on multiple clusters with Apache Spark.

- Added support for extracting information from binary files based on field specifications covering multiple years and file formats.

- Conducted multi-market analysis (to detect market dominance) and anomaly detection (for the Flash Crash day).

- Proposed using unsupervised learning (clustering) on large-scale unlabeled stock market data for anomaly detection and general market analysis.

Source code

Multi-Market Data Analysis

Link to GitHub Repository (Code)

Stock Market Big Data Analysis

Link to GitHub Repository (Code)

Anomaly Detection

Link to GitHub Repository (Code)

ECONOMIC MODEL DEVELOPMENT FOR COVID-19 PANDEMIC WITH MACHINE LEARNING

Algorithm & Techniques: Machine Learning, Data Analytics, Economics

Language: Python

Technology:

Tools: Anaconda

Date: July 2020 - September 2020

Description:

- Analyzed 20 years of country-level economic data and applied machine learning algorithms to develop an economic model of the COVID-19 pandemic.

Source Code

Covid 19

Would you like to learn more about my research projects?

Check out my GitHub repository

PROJECTS

Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Distributed Big Data Analytics, Information Retrieval, Recommendation Engine

GitHub

MOVIE REVENUE & RATING PREDICTION FROM IMDB MOVIE DATA

Algorithm & Techniques: Machine Learning, Supervised Learning, Regression Analysis

Language: Python

Technology: scikit-learn

Tools: Anaconda

Date: October 2016 - December 2016

Description:

- Developed regression models for predicting revenue and ratings from 5,000 movies, attaining a mean squared error (MSE) of 0.0005 on a scale of 1 for revenue under 5-fold cross-validation.

- Conducted preprocessing and feature extraction (28 numerical, textual, and categorical features).

- Performed data analysis, visualization, feature extraction, cleaning (missing values, anomalies), and preprocessing (rescaling, normalization, one-hot encoding of features), then trained with cross-validation.
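
The 5-fold cross-validated MSE reported above can be computed along these lines. This is a sketch in plain Python rather than the project's scikit-learn code: the helper names are hypothetical, the folds are contiguous with no shuffling, and a one-feature least-squares line stands in for the actual regression models.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def cross_validate(xs, ys, fit, predict, k=5):
    """Return the validation MSE averaged over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the other k-1 folds, then score on the held-out fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in held_out]
        scores.append(mean_squared_error([ys[i] for i in held_out], preds))
    return sum(scores) / k


# Toy data already rescaled to [0, 1]; the targets are exactly linear.
xs = [i / 10 for i in range(10)]
ys = [0.5 * x + 0.1 for x in xs]
cv_mse = cross_validate(xs, ys, fit_line, predict_line, k=5)
```

Each sample is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.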

Source code

Movie Revenue and Rating Prediction Using Machine Learning

Link to GitHub Repository (Code)

Movie Revenue Prediction

Link to GitHub Repository (Code)

WEB RETRIEVAL & SEARCH ENGINE IMPLEMENTATION FOR UNIVERSITY WEB DOMAIN

Algorithm & Techniques: Search Engine, Search Relevance, Information Retrieval, Vector Space Model, Cosine Similarity

Language: Python

Technology: Django

Tools: Anaconda

Date: August 2017 - December 2017

Description:

- Developed an end-to-end web retrieval engine based on the vector space model for the University of Memphis and evaluated it on 10,000 web pages and documents (text, PDF, DOCX, and PPTX) from the university domain.
- Used a TF-IDF vector space model and cosine similarity for matching and ranking web pages.
- Developed the following modules: web crawler (with memory), text preprocessor (preprocessing, tokenizing, and stemming raw HTML and documents), page indexer, page relevance ranker, and performance evaluator (F1, precision, recall).
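
The TF-IDF and cosine similarity ranking described above can be sketched in a few lines of plain Python. This is an illustration only: the function names and the three toy pages in `DOCS` are hypothetical, tokenization is a bare whitespace split (no stemming), and the weighting is raw term count times log(N / document frequency), which may differ from the project's exact scheme.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unknown terms are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(docs):
    """Return (idf, doc_vectors) for a dict {doc_id: text}."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    idf = {t: math.log(len(docs) / df) for t, df in doc_freq.items()}
    vectors = {d: tfidf_vector(tokens, idf) for d, tokens in tokenized.items()}
    return idf, vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, idf, vectors):
    """Return doc ids with non-zero similarity, best match first."""
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages standing in for crawled university content.
DOCS = {
    "admissions": "graduate admissions deadlines and application",
    "library": "library hours and study rooms",
    "cs": "computer science graduate courses",
}
idf, vectors = build_index(DOCS)
results = rank("graduate admissions", idf, vectors)
```

The rarer query term "admissions" carries more weight than "graduate", so the admissions page ranks above the computer science page, and the library page is excluded because it shares no query terms.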

Source code

Web Retrieval/Search Engine Interface using

Link to GitHub Repository (Code)

Web Retrieval/Search Engine Query Operation

Link to GitHub Repository (Code)

Web Retrieval/Search Engine using Vector Space Model

Link to GitHub Repository (Code)

MOVIE RECOMMENDATION ENGINE USING USER BASED COLLABORATIVE FILTERING

Algorithm & Techniques: Recommendation Engine, Recommendation Systems, Collaborative Filtering

Language: C++, Python

Technology: NA

Tools: Sublime Text

Date: February 2017 - April 2017

Description:

- Developed a user-based movie recommender system using user-user collaborative filtering, optimized for runtime and space complexity, with separate implementations in C++ and Python.
- Used the Netflix movie dataset with 100K user records.
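
A minimal sketch of user-user collaborative filtering follows. It assumes mean-centered cosine similarity and a top-k weighted average of neighbor deviations, which is one common formulation and not necessarily the one used in the project; the function names and the tiny `RATINGS` table are hypothetical, and predictions are not clipped to the rating scale.

```python
import math


def mean_rating(user_ratings):
    return sum(user_ratings.values()) / len(user_ratings)


def similarity(a, b):
    """Mean-centered cosine similarity over movies both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    mean_a, mean_b = mean_rating(a), mean_rating(b)
    dot = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    norm_a = math.sqrt(sum((a[m] - mean_a) ** 2 for m in common))
    norm_b = math.sqrt(sum((b[m] - mean_b) ** 2 for m in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, user, movie, k=2):
    """Predict user's rating for movie from the k most similar raters."""
    target = ratings[user]
    neighbours = []
    for other, other_ratings in ratings.items():
        if other != user and movie in other_ratings:
            sim = similarity(target, other_ratings)
            if sim > 0:  # ignore dissimilar users
                neighbours.append((sim, other))
    top = sorted(neighbours, reverse=True)[:k]
    if not top:
        return mean_rating(target)
    weighted = sum(s * (ratings[o][movie] - mean_rating(ratings[o])) for s, o in top)
    return mean_rating(target) + weighted / sum(s for s, _ in top)


def recommend(ratings, user, n=1, k=2):
    """Return the n unseen movies with the highest predicted rating."""
    seen = set(ratings[user])
    all_movies = {m for r in ratings.values() for m in r}
    scored = sorted(((predict(ratings, user, m, k), m) for m in all_movies - seen),
                    reverse=True)
    return [movie for _, movie in scored[:n]]


# Hypothetical ratings on a 1-5 scale.
RATINGS = {
    "ann": {"Alien": 5, "Brazil": 1, "Casablanca": 4},
    "bob": {"Alien": 5, "Brazil": 2, "Casablanca": 4, "Dune": 5, "Elf": 1},
    "cat": {"Alien": 1, "Brazil": 5, "Dune": 1, "Elf": 5},
}
top_pick = recommend(RATINGS, "ann", n=1)
```

Here "ann" agrees with "bob" and disagrees with "cat", so only bob's ratings influence her predictions, and the movie he rated highly is recommended.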

Source code

User Based Movie Recommendation Engine Using Collaborative Filtering and Netflix Movie Dataset

Link to GitHub Repository (Code)

Movie Recommendation Engine Using Netflix Movie Dataset

Link to GitHub Repository (Code)

User Based Movie Recommendation Engine

Link to GitHub Repository (Code)

RESTAURANT RECOMMENDATION SYSTEM USING RELATIONAL DATABASE

Algorithm & Techniques: Recommendation System, Relational Database

Language: Python

Technology: MySQL, Django

Tools: Anaconda

Date: October 2017 - December 2017

Description:

- Implemented a restaurant recommendation system based on user information (e.g., location, cuisine preference) and restaurant information (location, cuisine, ratings, reviews).
- Included features to derive review effectiveness and user trustworthiness from the available data.

Source code

Restaurant Recommendation System (Entity Relationship (E-R Diagram))

Link to GitHub Repository (Code)

Restaurant Recommendation System Using Relational Database

Link to GitHub Repository (Code)

TOXIC COMMENT IDENTIFICATION / CLASSIFICATION

Algorithm & Techniques: Machine Learning, Supervised Learning, Classification, Natural Language Processing, Text Classification, Text Analysis

Language: Python

Technology: scikit-learn, NLTK

Tools: Anaconda

Date: August 2018 - September 2018

Description:

- Classified around 130,000 text comments (34 MB) into the categories "Toxic", "Severe Toxic", "Obscene", "Threat", "Insult", "Identity Hate", "Any of the Above", and "None of the Above".
- Used features from the AAAI 2018 paper "Anatomy of Online Hate: Developing a Taxonomy and Machine Learning Models for Identifying and Classifying Hate in Online News Media" by Salminen, Almerekhi.
- Built machine learning training pipelines covering file reading, train/test dataset creation, preprocessing, feature extraction, and grid-search-based training and evaluation of multiple models.
- Generated visualizations and an aggregated report on the performance of the various models.
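
As a small illustration of the label scheme above, the two aggregate categories can be derived from the six base labels. This sketch assumes that "Any of the Above" means at least one base label is set and "None of the Above" means none is; the column names in `BASE_LABELS` and the function name are hypothetical.

```python
# Hypothetical column names for the six base categories.
BASE_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def add_aggregate_labels(row):
    """Return the six base flags plus the two derived aggregate flags."""
    flags = {name: int(bool(row.get(name, 0))) for name in BASE_LABELS}
    any_flag = int(any(flags.values()))
    flags["any_of_the_above"] = any_flag
    flags["none_of_the_above"] = 1 - any_flag
    return flags


clean = add_aggregate_labels({})
flagged = add_aggregate_labels({"toxic": 1, "insult": 1})
```

Under this assumption the two aggregate flags are always complementary, so each comment falls into exactly one of them.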

Source code

Toxic Comment Classification

Link to GitHub Repository (Code)

REGRESSION MODELING FOR HOUSING PRICE PREDICTION

Algorithm & Techniques: Machine Learning, Supervised Learning, Regression

Language: Python

Technology: scikit-learn, NLTK

Tools: Anaconda

Date: August 2018 - September 2018

Description:

- Built a regression model for predicting housing prices from 79 numerical and categorical features, with a mean squared error (MSE) of 0.000685 on a scale of 1.

- Built machine learning (regression) training pipelines with preprocessing (normalization, label encoding of categorical features), feature extraction, and grid-search-based training and evaluation of multiple regression models, with visualizations and an aggregated performance report.
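
The two preprocessing steps named above, label encoding and normalization, can be sketched in plain Python. The helper names and sample values are hypothetical, and min-max scaling to [0, 1] is assumed here as the normalization; the project may have used a different scheme.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical columns from a housing table.
exterior = ["brick", "wood", "brick", "stone"]
lot_area = [100, 150, 200]

encoder = fit_label_encoder(exterior)
encoded = [encoder[v] for v in exterior]
scaled = min_max_scale(lot_area)
```

Label encoding keeps one column per categorical feature, unlike one-hot encoding, at the cost of imposing an arbitrary order on the categories.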

Source code

Housing Price Prediction using Numerical and Categorical Features

Link to GitHub Repository (Code)

IMAGE RECOGNITION USING DEEP CONVOLUTIONAL NEURAL NETWORK

Algorithm & Techniques: Image Classification, Deep Learning, Convolutional Neural Network (CNN), Transfer Learning

Language: Python

Technology: Keras, TensorFlow

Tools: Anaconda

Date: September 2018 - December 2018

Description:

- Developed image classification tools using two separate approaches: a deep convolutional neural network built from scratch with Keras, and a pretrained InceptionV3 model fine-tuned on new class labels.

- Trained on multiple datasets: Flower dataset (testing accuracy 85.68%, 5 species, 4.5K images, 228 MB), 10 Monkey Species dataset (validation accuracy 97.06%, 553 MB), and Dog Breed dataset (testing accuracy 76.41%, 120 classes, 10.2K images, 344 MB).

Source code