ANJANA TIHA
Research Software Engineer, University of Pennsylvania, USA.
Specialized in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, and Distributed Big Data Analytics.
RESEARCH
Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Distributed Big Data Analytics, Information Retrieval, Recommendation Engine
GENERATIVE OPEN DOMAIN CHATBOT APPLICATION WITH DEEP LEARNING
Algorithm and Techniques: Machine Learning, Deep Learning, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Bidirectional LSTM, Sequence to Sequence (Seq2Seq), Beam Search, Neural Attention Mechanism
Language: Python
Technology: TensorFlow, PyQt
Tools: Anaconda, Linux
Date: January 2018 - May 2018
Description:
- Developed a generative open-domain conversational agent (Human vs AI) using the state-of-the-art Sequence-to-Sequence (Seq2Seq) architecture, attaining a validation perplexity of 46.82 and a BLEU score of 10.6.
- Trained the encoder-decoder Seq2Seq model from scratch and further optimized the Recurrent Neural Network based model with Bidirectional LSTM cells, a Neural Attention Mechanism, and Beam Search.
- Used the preprocessed Cornell Movie Subtitle Corpus as training data, PyQt for the chat interface (GUI), and an untrained instance of Google’s Neural Machine Translation (NMT) model as the Seq2Seq module.
Link to GitHub Repository (Code)
Link to Demo
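The Beam Search decoding step mentioned above can be illustrated with a minimal, framework-free sketch. It is not the project's TensorFlow code: `beam_search`, `toy_step`, and the `TOY_MODEL` probability table are hypothetical names and values chosen only to show how keeping several hypotheses can beat greedy decoding.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(sequence) must return a dict {next_token: probability}.
    """
    # Each hypothesis is (cumulative log-probability, token list).
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, seq))
                continue
            for token, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the beam_width best hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return beams[0][1]


# Hypothetical toy "decoder": the next-token distribution depends only on
# the last token. A real Seq2Seq decoder would condition on the encoder
# state and the full prefix.
TOY_MODEL = {
    "<s>": {"hi": 0.6, "hello": 0.4},
    "hi": {"there": 0.5, "</s>": 0.5},
    "hello": {"there": 0.9, "</s>": 0.1},
    "there": {"</s>": 1.0},
}


def toy_step(seq):
    return TOY_MODEL[seq[-1]]


best = beam_search(toy_step, "<s>", "</s>", beam_width=2)
greedy = beam_search(toy_step, "<s>", "</s>", beam_width=1)
```

With a beam width of 1 the search is greedy and commits to "hi" because it has the highest first-step probability; with a width of 2 it keeps "hello" alive long enough to find the sequence with the higher overall probability.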
DISTRIBUTED MACHINE LEARNING FOR BIOMARKERS DETECTION FROM WEARABLE SENSOR BIG DATA
Algorithm & Techniques: Machine Learning, Distributed Machine Learning, Classification, Supervised Learning, Mobile Health, Big Data Analytics
Language: Python
Technology: Apache Spark, scikit-learn, Git, GitHub
Tools: IntelliJ IDEA, Linux
Date: January 2017 - April 2017
Description:
- Developed a Machine Learning (ML) module for training ML models on multiple clusters with Apache Spark.
- Developed a Grid Search and Randomized Search cross-validation (CV) module to optimize training time and parameter search.
- Detected biomarkers (psychological stress) from big stream data (accelerometer, ECG, respiration rate) collected by multi-modal wearable sensors, with an initial F1 score of 87% using an SVM with a radial kernel.
Link to GitHub Repository (Code)
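As a rough illustration of grid search with k-fold cross-validation scored by F1, here is a small pure-Python sketch. It is an assumption-laden stand-in, not the project's Spark module: the one-feature threshold classifier replaces the SVM, and `threshold_classifier`, `HEART_RATE`, and `STRESSED` are hypothetical names and values.

```python
from itertools import product


def f1_score(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def grid_search_cv(fit_predict, X, y, param_grid, k=3):
    """Try every parameter combination; return (best_params, best_mean_f1)."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for train, test in k_fold_indices(len(X), k):
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in test], **params)
            scores.append(f1_score([y[i] for i in test], preds))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


def threshold_classifier(X_train, y_train, X_test, threshold):
    # Hypothetical stand-in model: ignores the training data and labels a
    # sample as stressed (1) when its single feature exceeds the threshold.
    return [1 if x > threshold else 0 for x in X_test]


# Hypothetical toy data: one feature per sample, 1 = stressed.
HEART_RATE = [60, 95, 62, 100, 65, 98, 61, 97, 64]
STRESSED = [0, 1, 0, 1, 0, 1, 0, 1, 0]

best_params, best_score = grid_search_cv(
    threshold_classifier, HEART_RATE, STRESSED, {"threshold": [50, 80, 110]})
```

A randomized search would sample a fixed number of combinations from the grid instead of enumerating all of them, which trades exhaustiveness for shorter search time.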
SURVEY ON MACHINE LEARNING BASED PHYSICAL ACTIVITY RECOGNITION METHODS FROM SENSOR DATA
Date: December 2016 – February 2017
- Conducted research on machine learning based algorithms for physical activity recognition (e.g., walking, running, eating and drinking) from multimodal wearable sensor data.
Figures: Walking, Running, Jogging; Detecting Eating, Drinking
DISTRIBUTED BIG DATA APPLICATION FOR LARGE SCALE US STOCK MARKET DATA ANALYSIS
Algorithm & Techniques: Financial Analysis, Stock Market Analysis, Anomaly Detection, Distributed Big Data Analytics, Big Data Analytics, Big Data, Data Analytics, FinTech
Language: Java, Python
Technology: Apache Spark, Maven, Git
Tools: IntelliJ IDEA, Linux
Date: May 2017 - August 2017
Description:
- Developed framework for processing and analysis of 7 years of historical US stock market data (50TB) of nanosecond granularity from 13 US exchanges on multiple clusters with Apache Spark.
- Added support for information extraction from binary files based on a field spec, covering multiple years and file formats.
- Conducted multi-market analysis (for market dominance detection) and anomaly detection (for the Flash Crash day).
- Proposed using unsupervised learning/clustering on large-scale unlabeled stock market data for anomaly detection and general market analysis in absence of labels.
Link to GitHub Repository (Code)
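Field-spec-driven extraction from fixed-width binary records can be sketched with Python's standard `struct` module. The layout below is purely hypothetical (`FIELD_SPEC`, its field names, and its widths are assumptions for illustration, not the actual exchange file format), and the project itself ran on Apache Spark rather than in a single process.

```python
import struct

# Hypothetical field spec: (name, struct format code). A different year or
# file format would be handled by swapping in a different spec list.
FIELD_SPEC = [
    ("timestamp_ns", "Q"),  # unsigned 64-bit nanosecond timestamp
    ("symbol", "8s"),       # fixed-width ASCII ticker, space padded
    ("price", "d"),         # 64-bit float
    ("size", "I"),          # unsigned 32-bit share count
]


def build_struct(spec, byte_order="<"):
    """Compile a field spec into a struct.Struct (little-endian, no padding)."""
    return struct.Struct(byte_order + "".join(code for _, code in spec))


def decode(value):
    """Turn fixed-width byte fields into stripped strings; pass others through."""
    if isinstance(value, bytes):
        return value.decode("ascii").rstrip(" \x00")
    return value


def parse_records(data, spec):
    """Yield one dict per fixed-width record found in data."""
    record = build_struct(spec)
    names = [name for name, _ in spec]
    for values in record.iter_unpack(data):
        yield {name: decode(value) for name, value in zip(names, values)}


# Build two sample records in memory instead of reading a real file.
packer = build_struct(FIELD_SPEC)
sample = packer.pack(1_000, b"AAPL    ", 101.5, 200) + \
         packer.pack(2_000, b"MSFT    ", 55.25, 300)
rows = list(parse_records(sample, FIELD_SPEC))
```

Keeping the layout in data rather than in code means the parser itself does not change when a new file format is added.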
ECONOMIC MODEL DEVELOPMENT FOR COVID-19 PANDEMIC WITH MACHINE LEARNING
Algorithm & Techniques: Machine Learning, Data Analytics, Economics
Language: Python
Technology:
Tools: Anaconda
Date: July 2020 - September 2020
Description:
- Analyzed 20 years of country-level economic data and applied machine learning algorithms to develop an economic model of the COVID-19 pandemic.
MOVIE REVENUE & RATING PREDICTION FROM IMDB MOVIE DATA
Algorithm & Techniques: Machine Learning, Supervised Learning, Regression Analysis
Language: Python
Technology: scikit-learn
Tools: Anaconda
Date: October 2016 - December 2016
Description:
- Developed a regression model for predicting revenue and ratings from 5,000 movies, attaining a Mean Squared Error of 0.0005 on a scale of 1 for revenue after 5-fold cross-validation.
- Conducted preprocessing and feature extraction (28 numerical, textual and categorical features).
- Performed data analysis, visualization, feature extraction, cleaning (missing values, anomalies), and preprocessing (rescaling, normalization, one-hot encoding for feature transformation), and trained with cross-validation.
Link to GitHub Repository (Code)
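The preprocessing and error metric named above can be sketched in a few lines of plain Python. This assumes that "on a scale of 1" means the target was rescaled to the range [0, 1] before the error was measured, which is one plausible reading and not confirmed by the project; the toy budgets, genres, and predictions are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data standing in for movie features and revenue.
budgets = [10, 20, 30]
genres = ["drama", "action", "drama"]

scaled_budgets = min_max_scale(budgets)
genre_names, genre_rows = one_hot(genres)

# Targets already rescaled to [0, 1]; predictions are made up for the example.
mse = mean_squared_error([0.0, 0.5, 1.0], [0.0, 0.5, 0.5])
```

Because the target lies in [0, 1], the squared error is bounded by 1, which makes small values such as the reported one easier to interpret.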
WEB RETRIEVAL & SEARCH ENGINE IMPLEMENTATION FOR UNIVERSITY WEB DOMAIN
Algorithm & Techniques: Search Engine, Search Relevance, Information Retrieval, Vector Space Model, Cosine Similarity
Language: Python
Technology: Django
Tools: Anaconda
Date: August 2017 - December 2017
Description:
- Developed a vector space model based end-to-end web retrieval engine for the University of Memphis and evaluated performance with 10,000 web pages and documents (text, pdf, docx and pptx) from the university domain.
- Used TF-IDF vector space model and cosine similarity function for web page matching and ranking.
- Developed modules - web crawler (with memory), text preprocessor (preprocess, tokenize, stem from raw HTML/docs), page indexer, page relevance ranker and performance evaluator (F1, precision, recall).
Link to GitHub Repository (Code)
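The TF-IDF and cosine similarity ranking described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: whitespace tokenization with no stemming, raw term counts for TF, and log(N/df) for IDF. The function names and the four toy documents are hypothetical and do not come from the project.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (the real pipeline also stems)."""
    return text.lower().split()


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unknown terms are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(docs):
    """Return (idf table, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors):
    """Return matching documents, best match first."""
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [docs[i] for score, i in sorted(scored, reverse=True) if score > 0]


docs = [
    "graduate admissions deadlines",
    "computer science course catalog",
    "campus parking permits",
    "computer science graduate program",
]
idf, vectors = build_index(docs)
results = search("computer science graduate", docs, idf, vectors)
```

The page that shares all three query terms ranks first, and the page with no terms in common is left out of the results entirely.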
MOVIE RECOMMENDATION ENGINE USING USER BASED COLLABORATIVE FILTERING
Algorithm & Techniques: Recommendation Engine, Recommendation Systems, Collaborative Filtering
Language: C++, Python
Technology: NA
Tools: Sublime Text
Date: February 2017 - April 2017
Description:
- Developed a user-based movie recommender system by implementing user-user collaborative filtering with runtime and space complexity optimization, with separate implementations in C++ and Python.
- Used the Netflix movie dataset with 100K user records.
Link to GitHub Repository (Code)
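A minimal sketch of user-user collaborative filtering is shown below. It assumes cosine similarity over co-rated movies and a similarity-weighted average of the nearest neighbours' ratings; the project may have used a different similarity measure, and the `ratings` table is hypothetical toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity restricted to the movies both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, movie, k=2):
    """Similarity-weighted average over the k most similar users who rated movie."""
    neighbours = sorted(
        ((cosine_similarity(ratings[user], other_ratings), other_ratings[movie])
         for other, other_ratings in ratings.items()
         if other != user and movie in other_ratings),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour rated this movie
    return sum(sim * rating for sim, rating in neighbours) / total


# Hypothetical ratings on a 1-5 scale, stored sparsely as nested dicts so
# that unrated movies take no space.
ratings = {
    "alice": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 5},
    "carol": {"A": 1, "B": 5, "C": 1},
}

prediction = predict_rating(ratings, "alice", "C", k=2)
nearest_only = predict_rating(ratings, "alice", "C", k=1)
```

Alice's tastes match Bob's, so her predicted rating for movie C is pulled toward Bob's 5 rather than Carol's 1.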
RESTAURANT RECOMMENDATION SYSTEM USING RELATIONAL DATABASE
Algorithm & Techniques: Recommendation System, Relational Database
Language: Python
Technology: MySQL, Django
Tools: Anaconda
Date: October 2017 - December 2017
Description:
- Implemented a restaurant recommendation system based on user (e.g., location, cuisine preference) and restaurant (location, cuisine, ratings, reviews) info.
- Included features to derive review effectiveness and user trustworthiness from available data.
Link to GitHub Repository (Code)
TOXIC COMMENT IDENTIFICATION / CLASSIFICATION
Algorithm & Techniques: Machine Learning, Supervised Learning, Classification, Natural Language Processing, Text Classification, Text Analysis
Language: Python
Technology: scikit-learn, NLTK
Tools: Anaconda
Date: August 2018 - September 2018
Description:
- Classified around 130,000 text comments (34 MB) into the categories "Toxic", "Severe Toxic", "Obscene", "Threat", "Insult", "Identity Hate", "Any of the Above", "None of the Above".
- Used features from AAAI 2018 paper "Anatomy of Online Hate: Developing a Taxonomy and Machine Learning Models for Identifying and Classifying Hate in Online News Media" by "Salminen, Almerekhi".
- Built machine learning training pipelines covering file reading, train/test dataset creation, preprocessing, feature extraction, and grid search based training and evaluation for multiple models.
- Generated visualization and aggregated report on the performance of various models.
Link to GitHub Repository (Code)
REGRESSION MODELING FOR HOUSING PRICE PREDICTION
Algorithm & Techniques: Machine Learning, Supervised Learning, Regression
Language: Python
Technology: scikit-learn, NLTK
Tools: Anaconda
Date: August 2018 - September 2018
Description:
- Built a regression model for predicting housing prices using 79 numerical and categorical features, with a Mean Squared Error of 0.000685 on a scale of 1.
- Built pipelines for training regression models with preprocessing (normalization, label encoding of categorical features), feature extraction, and grid search based training and evaluation of multiple regression models, with visualization and an aggregated performance report.
Link to GitHub Repository (Code)
IMAGE RECOGNITION USING DEEP CONVOLUTIONAL NEURAL NETWORK
Algorithm & Techniques: Image Classification, Deep Learning, Convolutional Neural Network (CNN), Transfer Learning
Language: Python
Technology: Keras, TensorFlow
Tools: Anaconda
Date: September 2018 - December 2018
Description:
- Developed image classification tools using a Deep Convolutional Neural Network built from scratch with Keras and, separately, the pretrained “InceptionV3” model fine-tuned with new class labels.
- Trained on multiple datasets: Flower dataset (testing accuracy 85.68%, 5 species, 4.5K images, 228 MB), 10 Monkey Species (validation accuracy 97.06%, 553 MB), and Dog Breed dataset (testing accuracy 76.41%, 120 classes, 10.2K images, 344 MB).
Figures: Sample prediction on 5 species of flowers; Prediction on 5 species of flowers for 64 images; Prediction on 120 dog breeds
SERVER-CLIENT CHAT APPLICATION
TECHNOLOGY
Algorithm & Techniques: Machine Learning, Supervised Learning, Classification, Natural Language Processing, Text Classification, Text Analysis
Language: Java, Android
Technology: TCP/IP
Tools: Android Studio
Date: August 2015
Description:
- Developed TCP/IP based chat server and client application where multiple clients can chat simultaneously.
- Built client applications for both Android and desktop platforms.
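The multi-client design can be sketched as a thread-per-connection TCP server that relays each message to every other connected client. The original application was written in Java; this sketch uses Python for consistency with the other examples, and `ChatServer` and its methods are hypothetical names rather than the project's actual classes.

```python
import socket
import threading


class ChatServer:
    """Relay every message received from one client to all other clients."""

    def __init__(self, host="127.0.0.1", port=0):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.sock.bind((host, port))  # port 0 lets the OS pick a free port
        self.sock.listen()
        self.address = self.sock.getsockname()
        self.clients = []
        self.lock = threading.Lock()  # guards the shared client list

    def serve_forever(self):
        """Accept connections and start one handler thread per client."""
        while True:
            try:
                conn, _ = self.sock.accept()
            except OSError:
                break  # listening socket was closed
            with self.lock:
                self.clients.append(conn)
            threading.Thread(target=self._handle, args=(conn,), daemon=True).start()

    def _handle(self, conn):
        """Read from one client until it disconnects, relaying each chunk."""
        try:
            while True:
                data = conn.recv(1024)
                if not data:
                    break  # client closed the connection
                self._broadcast(data, sender=conn)
        except OSError:
            pass
        finally:
            with self.lock:
                if conn in self.clients:
                    self.clients.remove(conn)
            conn.close()

    def _broadcast(self, data, sender):
        """Send data to every connected client except the sender."""
        with self.lock:
            targets = [c for c in self.clients if c is not sender]
        for client in targets:
            try:
                client.sendall(data)
            except OSError:
                pass  # a failed client is cleaned up by its own handler

    def close(self):
        self.sock.close()
```

To try it, create a `ChatServer()`, run `serve_forever` in a background thread, and connect two or more clients to `server.address` with `socket.create_connection`; anything one client sends arrives at the others.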