Overview


I finished my PhD in October 2020 and currently work as Apps Dev Tech Lead, Vice President, at Citi Bank Canada.

I hold a Ph.D. in Computer Science from the Data Science Lab, McMaster University. My research interests are data cleaning, data privacy, data mining, entity resolution, and applying machine learning techniques to improve data quality and discover useful patterns.

Research Statement: Big data plays a crucial role in the decision-making process; however, data can also be rife with errors, inconsistencies, and incompleteness. I aim to provide clean, trustworthy, valuable data for data scientists, industry, and the market. The design of intelligent data cleaning systems involves data quality metrics, data cleansing, data privacy protection, data repair, and machine learning algorithms.

Publications

An important challenge in data cleaning is the trade-off between data repair and privacy protection. Data cleaning focuses on scalable techniques to resolve inconsistencies quickly, whereas data privacy protection concerns how to protect and hide sensitive information. We design a privacy-aware data cleaning framework that resolves data inconsistencies while minimizing the amount of sensitive information disclosed.

[1] Yu Huang, Mostafa Milani, Fei Chiang. “Privacy-aware data cleaning-as-a-service”. The Journal of Information Systems 2020. (Link)

[2] Yu Huang, Mostafa Milani, Fei Chiang. “PACAS: Privacy-Aware, Data Cleaning-as-a-Service”. IEEE International Conference on Big Data 2018. (Link)

[3] Yu Huang, Fei Chiang, Albert Maier, Martin Petitclerc, Yannick Saillet, Damir Spisic, Calisto Zuzarte. “Quantifying Duplication to Improve Data Quality”. In ACM CASCON Conference 2017. (Link)

[4] Yu Huang, Fei Chiang. “Refining Duplicate Detection for Improved Data Quality”. Meta-Data Quality Workshop (in conjunction with TPDL 2017), 10 pages, 2017. (Link)

[5] Yu Huang, Fei Chiang. “Towards a Unified Framework for Data Cleaning and Data Privacy”. QUAT 2015 (in conjunction with WISE). (Link)

[6] Dejun Huang, Dhruv Gairola, Yu Huang, Zheng Zheng, Fei Chiang. “PARC: Privacy-Aware Data Cleaning”. CIKM 2016. (Link)

Evaluation and storage in Big Data

Another challenge in big data is how to evaluate the performance of predictors and how to store data in a scalable and efficient manner.

[7] Yu Huang, Tiejian Luo, Xiang Wang, Kai Hui, Wen-Jie Wang, Ben He: On Evaluating Query Performance Predictors. ICPCA/SWS 2013. (Link)

[8] Yu Huang, Tiejian Luo. NoSQL Database: A Scalable, Availability, High Performance Storage for Big Data. ICPCA/SWS 2013. (Link)

Research Projects

Privacy-Aware Data Cleaning

October 2015 -- October 2020

Most data cleaning algorithms focus only on searching for optimal repair candidates for erroneous data, and do not consider data privacy when they consult a clean database. In practice, records are not always accessible for privacy reasons, and even within one record, different attribute values may have different privacy requirements. Despite the proliferation of sensitive, confidential user information, data privacy concerns have remained largely unexplored in data cleaning techniques. We present a new privacy-aware data cleaning framework that aims to resolve data inconsistencies while protecting sensitive information. To protect the privacy of individuals in the master database, we apply generalization techniques. We also propose a pricing scheme that assigns prices to generalized queries posed over generalized databases. This pricing scheme can serve as a mechanism for applying privacy measures such as k-anonymity, and for pricing and selling sensitive data at different levels of granularity.
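To make the generalization idea concrete, here is a minimal Python sketch of how generalizing quasi-identifiers can bring a small table to k-anonymity. It is an illustration under stated assumptions, not the framework's actual implementation: the attribute names (age, zip, diagnosis), the bucket width, and the helper functions are all hypothetical, and the pricing scheme is not modelled.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width (hypothetical rule)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters and mask the rest (hypothetical rule)."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def generalize(record):
    """Return a copy of the record with generalized quasi-identifiers."""
    return {
        **record,
        "age": generalize_age(record["age"]),
        "zip": generalize_zip(record["zip"]),
    }

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Toy master database; the values are made up for illustration.
master = [
    {"age": 34, "zip": "90210", "diagnosis": "flu"},
    {"age": 37, "zip": "90213", "diagnosis": "asthma"},
    {"age": 52, "zip": "10001", "diagnosis": "flu"},
    {"age": 58, "zip": "10009", "diagnosis": "diabetes"},
]
generalized = [generalize(r) for r in master]

print(is_k_anonymous(master, ["age", "zip"], 2))       # raw records are all unique
print(is_k_anonymous(generalized, ["age", "zip"], 2))  # generalized records form groups of two
```

Coarser generalization hides more but also leaves less detail for choosing repairs, which is the trade-off between privacy and data utility discussed above.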

A Dynamic and User-Centric Data Cleaning System for Watson Analytics

May 2015 -- May 2019

The project was funded by the OCE-SOSCIP Smart Computing R&D Challenge and IBM. I worked with IBM teams from several branches (IBM Toronto Lab, IBM Ottawa Lab, IBM Germany, IBM Chicago) to improve the data quality metrics in Watson Analytics, IBM's cloud-based data analytics platform. As the lead developer, I designed a set of detailed quality metrics and repair algorithms that provide in-depth information on data quality problems and help users better understand their data.

We present a record deduplication framework that differentiates terms during the matching process to improve overall accuracy. We also define a duplication metric that quantifies the level of duplication for an attribute value, and within an attribute. This metric can be used by analysts to understand the distribution and similarity of values during the data cleaning process.
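As a rough illustration of what a duplication metric at the value level and at the attribute level could look like, here is a simple frequency-based sketch in Python. It is not the metric defined in the project or its publications: the normalization step, the two formulas, and the function names are assumptions chosen only to show the idea.

```python
from collections import Counter

def normalize(value):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())

def value_duplication(values):
    """Share of records carrying each (normalized) value. Assumed formula."""
    counts = Counter(normalize(v) for v in values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def attribute_duplication(values):
    """Share of records whose (normalized) value occurs more than once. Assumed formula."""
    if not values:
        return 0.0
    counts = Counter(normalize(v) for v in values)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(values)

# Toy attribute column; the values are made up for illustration.
cities = ["Hamilton", "hamilton ", "Toronto", "Ottawa", "Toronto"]

print(value_duplication(cities))
print(attribute_duplication(cities))
```

An analyst could use scores like these to see which values dominate a column before deciding how aggressively to merge records.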

PARC: A Privacy-Aware Data Cleaning System

July 2016 -- October 2017

We implement a Privacy-Aware Data Cleaning system that corrects data inconsistencies with respect to a set of functional dependencies (FDs), and limits the disclosure of sensitive values during the cleaning process. The project is implemented in JavaScript and Groovy, with the core algorithms in Java.
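For readers unfamiliar with FDs, the sketch below shows in Python how violations of a single functional dependency can be detected: an FD X -> Y is violated when two records agree on X but differ on Y. This is a generic textbook check, not PARC's code (which is in Java); the function name, the attribute names, and the sample rows are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the LHS value combinations that map to more than one RHS value.

    rows: list of dicts, lhs: list of attribute names, rhs: one attribute name.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(row[rhs])
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

# Toy table for the FD zip -> city; the values are made up for illustration.
rows = [
    {"zip": "L8S4L8", "city": "Hamilton"},
    {"zip": "L8S4L8", "city": "Hamiltn"},
    {"zip": "M5V2T6", "city": "Toronto"},
]

violations = fd_violations(rows, ["zip"], "city")
print(violations)
```

Each violating group is a candidate for repair. A privacy-aware cleaner must then choose a repair value while limiting how much of the sensitive data it consults or reveals.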

Design of a Large-Scale Social Network System for China National Network Television

May 2012 -- September 2013

This project was part of China Network Television, where I worked as an architect designing the data access and storage module of a large-scale social network system.

Experience

Lead Developer on Dynamic and User-centric Data Cleaning System For IBM Watson Analytics

IBM Center for Advanced Studies, Ontario, Canada. May 2015 -- August 2019

In collaboration with several IBM branches (IBM Toronto Lab, IBM Ottawa Lab, IBM Germany, IBM Chicago), I developed new data quality metrics for IBM’s cloud-based data analytics platform, Watson Analytics. The metrics can be customized according to the properties a user needs for a given data analysis task. This project aims to provide organizations with more accurate measurements for cleaning their data, thereby saving money and time and enabling faster decision making.
The project is also funded by OCE-SOSCIP Smart Computing R&D Challenge.

Research Assistant at McMaster University

Ontario, Canada. September 2014 -- October 2019
  • Developed a unified continuous data cleaning framework that combines machine learning algorithms with data cleaning algorithms, and leverages the semantics and statistics of the data to predict the type of repairs needed in a dynamic environment.
  • In collaboration with IBM, developed new data quality metrics for IBM's cloud-based data analytics platform, Watson Analytics, and a deduplication framework that provides finer-grained measurement of duplicate values within an attribute and accurately identifies duplicate records in a dataset.
  • Proposed a set of new repair operations that increase data utility while preserving data privacy, and developed a privacy-preserving data cleaning system that gives users options for improving data utility (i.e., cleanliness) while carefully controlling the level of information disclosure from sensitive data values.

Invited Academic Visitor at IBM Deutschland Research & Development Centre

Stuttgart, Germany. October 3 -- October 19, 2016

Funded by IBM Canada as an academic visitor to the IBM Deutschland Research & Development Centre in Stuttgart, I worked within the Data Forge team to build data quality metrics and to research entity resolution.

Teaching Assistant at McMaster University

  • Teaching Assistant for Web Systems and Web Computing (4WW3), Sep 2019
  • Teaching Assistant for Real-Time Systems and Control Application (4AA4), Sep 2019
  • Teaching Assistant for Data Structures and Algorithms (2C03), Jan 2019 - May 2019
  • Teaching Assistant for Advanced Topics in Data Management (CAS764), Jan 2019 - May 2019
  • Teaching Assistant for Principles of Programming (2S03), Sep 2018 - Dec 2018
  • Teaching Assistant for Performance Analysis of Computer Systems (4E03), Sep 2018 - Dec 2018
  • Teaching Assistant for Database (3DB3), Sep 2017 - Dec 2017
  • Teaching Assistant for Modern Software Technology for eHealth (CAS757), Sep 2017
  • Teaching Assistant for Modern Software Technology for eHealth (CAS757), Jan 2017 - May 2017
  • Teaching Assistant for Computer Science Practice and Experience: Binding Theory to Practice, Jan 2017 - May 2017
  • Teaching Assistant for Database (3DB3), Sep 2015 - Dec 2015
  • Teaching Assistant for Database (3DB3), Jan 2015 - May 2015

Instructor at ZTE University

July 2009 -- September 2010

Worked as a full-time lecturer at ZTE University. Taught the Telecommunication Network and Optical Transport Network courses to undergraduate students and junior engineers.

Talks

  1. Privacy-Aware, Data Cleaning-as-a-Service: Given at 2018 IEEE International Conference on Big Data, Dec 12, 2018, Westin Hotel, Seattle, US

  2. Semantic-Aware Disambiguation for Duplication Detection: Given at McMaster Engineering Technology Research and Innovation Conference (METRIC), August 23, 2018, McMaster Innovation Park, Hamilton, Canada

  3. A Dynamic and User-Centric Data Cleaning System for Watson Analytics: Given at IBM Centre for Advanced Studies Technical Link Event (CASTLE) 2018, May 8, 2018, IBM Canada Lab, Markham, Canada

  4. Health Data Privacy: Given at IBM Hacking Health Meetup, March 29, 2018, IBM Innovation Space, Hamilton, Canada

  5. Data Quality for IBM Watson Analytics: Given at IBM CASCON2017 Expo, November 6, 2017, Toronto, Canada

  6. Quantifying Duplication to Improve Data Quality: Given at ACM CASCON Conference 2017, November 6, 2017, Toronto, Canada

  7. Refining Duplicate Detection for Improved Data Quality: Given at MDQUAL 2017 workshop, September 21, 2017, Thessaloniki, Greece

  8. Data Quality Metrics for IBM Watson Analytics: Given at IBM TechConnect, May 2, 2017, IBM Canada Software Lab, Toronto, Ontario, Canada

  9. New Data Quality Metrics for Watson Analytics: Given at PechaKucha CASCON2016 Expo presentations

  10. A Data Quality Framework for Customer Relationship Analytics: Given at WISE2015 International Conference, Miami, FL, USA, Nov 2 2015

  11. Data Cleaning and Data Privacy: Given at WISE2015 International Conference, Miami, FL, USA, Nov 2 2015

  12. Better Data -- Better Cities: Given at President’s Club Reception “Big Ideas - Better Cities”, McMaster David Braley Health Sciences Centre, Hamilton, ON, Sep 27 2015

  13. Improving Water Quality Via Improved Data Quality: Given at Canada-China Workshop on Smart Water Monitoring and Control, McMaster Innovation Park, Hamilton, ON, Apr 29 2015

Funding Participation

  1. Ontology-Driven Semantic Disambiguation. Ontario Centres of Excellence TalentEdge Internship Program (TIP), April, 2018

  2. Voucher for Industry Association R & D Challenge (VIA), Ontario Centres of Excellence (OCE-VIA), June, 2016

  3. Natural Sciences and Engineering Research Council of Canada (NSERC) Smart Computing R & D Challenge Program, June 2016

  4. A Dynamic and Scalable Data Cleaning System for Watson Analytics. SOSCIP Smart Computing for Innovation, April, 2016

  5. IBM CAS Research Fellowship for A Dynamic and Adaptable Data Cleaning System for Watson Analytics. June 2015

Press

  1. The project with IBM Watson Analytics is featured in the SOSCIP article "McMaster researcher to “clean” big data", Smart Computing for Innovation, November 28, 2016

  2. Quality Counts, November 2015

Honors & Awards

  1. 1st Place Award at McMaster 20-year anniversary of CAS 3-Min Research Presentation and Big Ideas Competition, June 1, 2018
  2. 3rd Place Award at McMaster CAS Poster & Demo Competition, April 5, 2018
  3. 2nd Place Award at McMaster AI Workshop Poster Contest, November 29, 2017
  4. CIKM/SIG2016 Travel Grants, November 2016
  5. Ontario’s Ministry of Advanced Education Tuition Reduction Award, October 2016
  6. International Excellence Award, McMaster University, July 2015
  7. International Excellence Award, McMaster University, September 2014
  8. Honor Master Graduate Student Award, University of Chinese Academy of Sciences, June 2014
  9. Excellent Student Award, University of Chinese Academy of Sciences, June 2012