Research

Systems: database, cloud, supercomputing, edge, distributed systems, OS, blockchain, …

  • Video analytics
    • E.g., How can we efficiently count the birds in a year-long video when YOLOv3 inference is expensive?
  • System Machine Learning
    • E.g., How can we reduce the number of cache misses when running inference on an ARM processor?
  • Machine Learning for Database Systems
    • E.g., Learned indexes
  • Blockchain
    • E.g., Transactions across multiple chains
  • Data Science
    • We also develop software for physicists at CERN’s ATLAS and for bioinformaticians
  • GPGPU, serverless, data lake, lock-free programming, Kubernetes, distributed consensus, …

The following are some of my projects.


Recently, the impressive accuracy of deep neural networks (DNNs) has created great demand for practical analytics over video data. Accuracy alone is not enough, however: practical video analytics systems must also be highly efficient. In this project, we investigate techniques to speed up video analytics across different platforms and hardware.
  • Status: Incubating
  • Areas: Video analytics
  • Keywords: Deep Learning Systems, Video Analytics
  • Selected Publications:
    • PengFei Zhang, Eric Lo, Baotong Lu: “High-Performance Depthwise and Pointwise Convolutions on Mobile Devices”.  In Proceedings of AAAI, 2020.
    • Ziliang Lai, Chenxia Han, Chris Liu, PengFei Zhang, Eric Lo, Ben Kao: “Top-K Deep Video Analytics: a Probabilistic Approach”.  SIGMOD, 2021.

Decentralized Web, or DWeb, is envisioned as a promising future of the Web. Being decentralized, DWeb has no dedicated web servers; devices that retrieve web content also serve their cached data to peer devices under strict privacy-preserving mechanisms. The fact that content in DWeb is distributed, replicated, and decentralized leads to a number of key advantages over the conventional web. These include better resiliency against network partitioning and distributed denial-of-service (DDoS) attacks, and a better browsing experience in terms of shorter latency and higher throughput. Moreover, DWeb provides tamper-proof content because each content piece is uniquely identified by a cryptographic hash. DWeb also fits well with future Internet architectures, such as Named Data Networking (NDN).

Search engines have been an inseparable element of the Web. Contemporary (“Web 2.0”) search engines, however, provide centralized services. They are thus subject to DDoS attacks, insider threats, and ethical issues like search bias and censorship. As the web moves from being centralized to being decentralized, search engines ought to follow.

QueenBee, a decentralized search engine for DWeb, is our latest project. QueenBee is so named because worker bees and honeycomb are a common metaphor for distributed architectures, with the queen being the one that holds the colony together. QueenBee aims to revolutionize the search engine business model by offering incentives to both content providers and peers that participate in QueenBee’s page indexing and ranking operations.
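
The idea that a content piece is identified by its cryptographic hash can be illustrated with a minimal sketch. The choice of SHA-256, the hex encoding, and the function names `content_id` and `verify` are assumptions made for illustration; actual DWeb systems may use different hash functions and identifier formats.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def verify(expected_id: str, data: bytes) -> bool:
    """A peer-served copy is accepted only if it hashes to the requested identifier."""
    return content_id(data) == expected_id


page = b"<html><body>Hello, DWeb</body></html>"
cid = content_id(page)

# An untampered copy from any peer passes the check; a modified copy does not.
print(verify(cid, page))
print(verify(cid, page.replace(b"Hello", b"Hacked")))
```

Because the identifier is derived from the bytes, a device can fetch content from an untrusted peer and still detect any modification locally.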

Architecture-Conscious Data Processing

The continued drive for ever-faster processing is leading computer scientists to directly leverage modern hardware innovations in the design of software systems. This trend is further amplified by the end of linear clock-frequency scaling due to physical limits.  Therefore, software system designs that disregard hardware innovations are doomed to failure.

Our goal is to study different approaches to leveraging modern and emerging hardware to accelerate data processing in big data management.  Hardware under consideration includes many-core processors (e.g., Xeon Phi), data-parallel processing units (SIMD), persistent memory (PM), fast networks (RDMA), etc.

  • Status: Ongoing
  • Areas: Computer Architecture × Operating Systems × Supercomputing × Big Data
  • Keywords: Modern Hardware, SIMD, RDMA, many-core, GPGPU, software-hardware co-design
  • Selected Publications:
    • B. Lu, X. Hao, T. Wang, Eric Lo: “Dash: Scalable Hashing on Persistent Memory”.  In Proceedings of VLDB, 2020.
    • Wenjian Xu, Eric Lo, PengFei Zhang: “DIFusion: Fast Skip-Scan with Zero Space Overhead”.  In Proceedings of IEEE ICDE Conference, 2018.
    • Wenjian Xu, ZiQiang Feng, Eric Lo: “Fast Multi-column Sorting in Main-Memory Column-Stores”.  In Proceedings of ACM SIGMOD Conference, 2016.
    • ZiQiang Feng, Eric Lo, Ben Kao, Wenjian Xu: “ByteSlice: Pushing the Envelop of Main Memory Query Processing with a New Storage Layout”.  In Proceedings of ACM SIGMOD Conference, 2016.
    • ZiQiang Feng, Eric Lo: “Accelerating aggregation using intra-cycle parallelism”.  In Proceedings of IEEE ICDE Conference, 2015.

Thrifty: Massively Parallel Database as a Service (Now open source as: Vault)

Massively parallel databases are the highest-end data analytics systems in the big data market. Examples include HP’s Vertica (used by Obama’s election team during the 2012 US presidential election) and SAP’s HANA (used by the German soccer team during the 2014 World Cup). If a Massively Parallel Database as a Service (PDaaS) is available, one can enjoy the power of parallel databases without the operational burden of provisioning machines and configuring the database.

The only PDaaS in the market is Amazon’s Redshift. However, Redshift is based on decade-old Virtual Machine (VM) technology. Recently, our research team has successfully applied the much more powerful Shared-Process (SP) technology to PDaaS. Unlike VM technology, which shares only the cluster computers among clients, SP technology shares both the cluster computers and the database installations among clients, thereby greatly reducing duplicated resources in the cluster. We have developed a prototype PDaaS management software called Thrifty. Using Thrifty, we have shown that a PDaaS provider can reduce its resource requirements (e.g., number of computers in the cluster, electricity) by 80%.

  • Status: Completed
  • Areas: Cloud computing × Big Data
  • Keywords: Database-as-a-Service, Parallel Database
  • Selected Publications:
    • Petrie Wong, Andy He, Eric Lo: “Parallel Analytics as-a-Service”.  In Proceedings of ACM SIGMOD Conference, 2013.
    • Petrie Wong, Andy He, Ziqiang Feng, Wenjian Xu, Eric Lo: “Thrifty: Offering Parallel Database as a Service using the Shared-Process Approach”. In Proceedings of ACM SIGMOD Conference, 2015.

Genome Analytics

Modern biological and medical research depends heavily on high-throughput experiments such as massively parallel sequencing. These experiments produce large amounts of data that require various types of analysis to unravel the important information hidden within. Currently, genome data analyses are carried out in diverse ways. At one extreme, bioinformaticians in individual research groups write their own custom scripts to achieve their analysis goals, resulting in significant redundant effort. At the other extreme, projects like Galaxy contain many separate programs that one needs to become familiar with in order to perform an integrated analysis.

To provide a unified system for performing many types of common genomic data analysis seamlessly and efficiently, we are collaborating with bioinformaticians and query language experts to develop a large-scale distributed platform for signal track analyses. Genomic signal tracks are fundamental units in bioinformatics: they are sets of genomic intervals that contain experimental measurements, biological annotations, or other types of biomedical information. The whole project involves the design of a signal track query language, the implementation of a distributed signal track query platform, the collection of public genomic datasets (currently terabytes in size and still growing), and the public offering of signal-track-analytics-as-a-service.
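
To make the notion of a signal track concrete, the following sketch models a track as a list of intervals carrying a measurement and joins each interval of one track with its closest interval in another. The `(start, end, value)` layout, the half-open interval convention, and the brute-force search are illustrative assumptions; this is not the MapReduce algorithm or the query language developed in the project.

```python
from typing import List, Tuple

# Hypothetical layout: half-open interval [start, end) plus a measurement.
Interval = Tuple[int, int, float]


def gap(a: Interval, b: Interval) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def closest_interval_join(
    left: List[Interval], right: List[Interval]
) -> List[Tuple[Interval, Interval, int]]:
    """Pair every interval in `left` with its nearest interval in `right`.

    Brute force, O(len(left) * len(right)); fine for a small illustration only.
    """
    result = []
    for a in left:
        best = min(right, key=lambda b: gap(a, b))
        result.append((a, best, gap(a, best)))
    return result


peaks = [(100, 200, 1.5), (500, 600, 0.7)]
genes = [(0, 50, 9.0), (220, 300, 3.0), (580, 700, 4.0)]

for peak, gene, distance in closest_interval_join(peaks, genes):
    print(peak, "->", gene, "distance", distance)
```

At the scale of terabytes, a per-pair scan like this is impractical, which is what motivates a distributed platform.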

  • Status: Completed
  • Areas: Data Science
  • Keywords: Genome Analysis, MapReduce
  • Selected Publications:
    • Q. Zhang, A. He, C. Liu, E. Lo: “Closest Interval Join Using MapReduce”.  In IEEE International Conference on Data Science and Advanced Analytics, 2016.
    • X. Zhu, Q. Zhang, E. Ho, K. Yu, C. Liu, T. Huang, A. Cheng, B. Kao, E. Lo, K. Yip: “START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries”.  BMC Bioinformatics, 2017.

Cache Management in SSD-Based Search Engine Infrastructure

Caching is an important optimization in search engine architectures. Existing caching techniques for search engines are mostly biased towards reducing random accesses to disks, because random accesses are known to be much more expensive than sequential accesses in traditional magnetic hard disk drives (HDDs). Recently, the solid-state drive (SSD) has emerged as a new kind of secondary storage medium, and some search engines like Baidu have already used SSDs to completely replace HDDs in their infrastructure. One notable property of an SSD is that its random access latency is comparable to its sequential access latency. Therefore, replacing HDDs with SSDs in a search engine infrastructure may invalidate the assumptions behind existing cache management.

In this project, my team collaborated with Baidu (the largest search engine in China) to carry out a series of empirical experiments to study the impact of SSDs on search engine cache management. The experiments were based on terabytes of real data and months of search log replays.  The initial results gave important insights to practitioners and researchers on how to adapt the infrastructure and how to redesign the caching policies for SSD-based search engines. We then devised a set of optimal cache management techniques for SSD-based search engine architectures.

  • Status: Completed
  • Areas: Information Retrieval × Storage
  • Keywords: Caching, SSD, Search Engine
  • Selected Publications:
    • J. Wang, Eric Lo, M. L. Yiu, J. Tong, G. Wang, and X. Liu: “The impact of solid state drive on search engine cache management”. In ACM conference on research and development in Information Retrieval (SIGIR), 2013.
    • J. Wang, Eric Lo, M. L. Yiu, J. Tong, G. Wang, and X. Liu: “Cache management for solid state drive based search engine”. ACM Transactions on Information Systems (TOIS), 2014.

Answering Why-Not Questions on Preference Queries

After decades of work on database performance, the database research community has recently paid more attention to database usability, i.e., how to make database systems and database applications more user-friendly.  Among the studies that focus on improving database usability (e.g., SQL query auto-completion), the feature of explaining why some expected tuples are missing from a query result, the so-called “why-not?” feature, is gaining momentum.

A why-not question is posed to a database when a user wants to know why her expected tuples do not appear in the query result.  In this project, our goal is to develop algorithms to answer why-not questions on preference queries.
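
As a minimal illustration of what a why-not question on a top-k preference query looks like, the sketch below ranks tuples by a weighted sum and explains a missing tuple by reporting the smallest k that would include it. The data, the weights, and the function names are hypothetical, and enlarging k is only the simplest possible explanation; the sketch does not reproduce the algorithms from the publications below.

```python
from typing import Dict, List, Sequence


def score(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted-sum preference score (higher is better)."""
    return sum(w * v for w, v in zip(weights, values))


def ranking(table: Dict[str, Sequence[float]], weights: Sequence[float]) -> List[str]:
    """All tuple ids by descending score; ties broken by id for determinism."""
    return sorted(table, key=lambda tid: (-score(table[tid], weights), tid))


def top_k(table: Dict[str, Sequence[float]], weights: Sequence[float], k: int) -> List[str]:
    return ranking(table, weights)[:k]


def why_not(table: Dict[str, Sequence[float]], weights: Sequence[float], k: int, missing: str) -> str:
    """Explain a missing tuple by the smallest k that would include it."""
    rank = ranking(table, weights).index(missing) + 1
    if rank <= k:
        return f"{missing} is already in the top-{k} (rank {rank})."
    return f"{missing} has rank {rank}; it appears only if k is raised from {k} to {rank}."


# Hypothetical hotels scored on (cheapness, rating), both normalized to [0, 1].
hotels = {"A": (0.9, 0.2), "B": (0.6, 0.7), "C": (0.3, 0.9), "D": (0.5, 0.5)}
weights = (0.5, 0.5)

print(top_k(hotels, weights, 2))
print(why_not(hotels, weights, 2, "A"))
```

A richer explanation could also suggest a different weighting under which the expected tuple enters the result; that is beyond this sketch.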

  • Status: Completed
  • Areas: Human Computer Interaction (HCI) × Database
  • Keywords: Why-Not, Usability, Query Processing
  • Selected Publications:
    • Wenjian Xu, Zhian He, Eric Lo, C.Y. Chow: “Explaining Missing Answers to Top-K SQL Queries”.  TKDE, 2017.
    • Zhian He, Eric Lo: “Answering Why-Not Questions on Top-K Queries”.  TKDE, 2014 (Special Issue for Best of ICDE’12).
    • Zhian He, Eric Lo: “Answering Why-Not Questions on Top-K Queries”.  In Proceedings of IEEE ICDE Conference, 2012.

Online Analytical Processing on Big Sequence Data

Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. Examples of sequence data include web query logs, stock data, archived data streams, and various kinds of RFID logs, such as those generated by a commodity tracking system in a supply chain or by smart-card-based electronic payment systems like the Octopus system in Hong Kong.  As with conventional data, there is a strong demand to warehouse and analyze the vast amount of sequence data in a user-friendly and efficient way.  This project aims to research and develop an OnLine Analytical Processing (OLAP) engine for big sequence data analysis.
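
The kind of analysis that sequence data calls for can be sketched as follows: events are grouped into one ordered sequence per smart card, and occurrences of a pattern are then aggregated across all sequences. The event layout `(card_id, timestamp, station)`, the round-trip pattern (X, Y, X), and the station names are assumptions chosen for illustration; the code is not the engine developed in the project.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical event layout: (card_id, timestamp, station).
Event = Tuple[int, int, str]


def build_sequences(events: List[Event]) -> Dict[int, List[str]]:
    """Group events by card and order each group by timestamp."""
    sequences: Dict[int, List[str]] = defaultdict(list)
    for card, _ts, station in sorted(events, key=lambda e: (e[0], e[1])):
        sequences[card].append(station)
    return dict(sequences)


def count_round_trips(sequences: Dict[int, List[str]]) -> Counter:
    """Count occurrences of the pattern (X, Y, X) with X != Y, grouped by (X, Y)."""
    counts: Counter = Counter()
    for stations in sequences.values():
        for i in range(len(stations) - 2):
            x, y, z = stations[i : i + 3]
            if x == z and x != y:
                counts[(x, y)] += 1
    return counts


events = [
    (1, 10, "Central"), (1, 20, "Mongkok"), (1, 30, "Central"),
    (2, 11, "Central"), (2, 25, "Mongkok"), (2, 40, "Central"), (2, 55, "Shatin"),
    (3, 12, "Shatin"), (3, 18, "Central"),
]

print(count_round_trips(build_sequences(events)))
```

Unlike a conventional GROUP BY, the grouping key here is a pattern over the order of events, which is what distinguishes sequence OLAP from OLAP on unordered records.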

  • Status: Completed
  • Areas: Database
  • Keywords: Sequence Data, Query Processing
  • Selected Publications:
    • Andy He, Petrie Wong, Ben Kao, Eric Lo, Reynold Cheng, ZiQiang Feng: “Efficient Pattern-Based Aggregation on Sequence Data”.  TKDE, 2017.
    • Chun Kit Chui, Ben Kao, Eric Lo, David W. Cheung: “S-OLAP: an OLAP system for analyzing sequence data”.  In Proceedings of ACM SIGMOD Conference, 2010.
    • Eric Lo, Ben Kao, Wai-Shing Ho, Sau Dan Lee, Chun Kit Chui, David W. Cheung: “OLAP on sequence data”. In Proceedings of ACM SIGMOD Conference, 2008.

MyBenchmark

To evaluate the performance of database applications and DBMSs, we usually execute workloads of queries on generated databases of different sizes and measure the response time.  This project aims to develop MyBenchmark, an offline data generation tool that takes a set of queries as input and generates database instances for which the users can control the characteristics of the resulting workload.  Applications of MyBenchmark include database testing, database application testing, and application-driven benchmarking.
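
The idea of generating data so that given queries behave in a controlled way can be illustrated on a very small scale. The sketch below fills a single integer column so that each range query of the form `x < threshold` returns a requested number of rows. The constraint format, the function name `generate_column`, and the restriction to positive thresholds on one column are assumptions for illustration; this is not MyBenchmark's algorithm.

```python
from typing import List, Tuple


def generate_column(constraints: List[Tuple[int, int]], total_rows: int) -> List[int]:
    """Generate values for column x so that `SELECT * FROM t WHERE x < threshold`
    returns exactly `cardinality` rows for each (threshold, cardinality) pair.

    Assumes positive, distinct thresholds and non-decreasing cardinalities.
    """
    values: List[int] = []
    prev_threshold, prev_count = 0, 0
    for threshold, count in sorted(constraints):
        if threshold <= prev_threshold or count < prev_count:
            raise ValueError("constraints are not satisfiable by this sketch")
        # These rows satisfy `x < threshold` but none of the smaller thresholds.
        values.extend([prev_threshold] * (count - prev_count))
        prev_threshold, prev_count = threshold, count
    if total_rows < prev_count:
        raise ValueError("total_rows is smaller than the largest cardinality")
    # Remaining rows fall outside every query range.
    values.extend([prev_threshold] * (total_rows - prev_count))
    return values


column = generate_column([(30, 2), (50, 5)], total_rows=8)
print(column)
print(sum(1 for x in column if x < 30), sum(1 for x in column if x < 50))
```

Real workloads involve joins, multiple columns, and interacting predicates, where satisfying all constraints at once is far harder than in this single-column case.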

  • Status: Completed
  • Areas: Database × Software Engineering
  • Keywords: Benchmarking, Testing, Data Generation
  • Selected Publications:
    • Eric Lo, Nick Cheng, Wilfred W. K. Lin, Wing-Kai Hon, Byron Choi: “MyBenchmark: generating databases for query workloads”. VLDB Journal, 2014.
    • Eric Lo, Nick Cheng, Wing-Kai Hon: “Generating Databases for Query Workloads”. PVLDB, 2010.
    • Eric Lo, Carsten Binnig, Donald Kossmann, M. Tamer Özsu, Wing-Kai Hon: “A framework for testing DBMS features”. VLDB Journal, 2010.
    • Carsten Binnig, Donald Kossmann, Eric Lo, M. Tamer Özsu: “QAGen: Generating Query-aware Test Databases”. In Proceedings of ACM SIGMOD Conference, 2007.
    • Carsten Binnig, Donald Kossmann, Eric Lo: “Reverse Query Processing”. In Proceedings of IEEE ICDE Conference, 2007.