My (almost) full publication list can be accessed here.
[VLDB’24] DEX: Scalable Range Indexing on Disaggregated Memory
– Baotong Lu, Kaisong Huang, Chieh-Jan Mike Liang, Tianzheng Wang, Eric Lo
Abstract: Memory disaggregation can potentially allow memory-optimized range indexes such as B+-trees to scale beyond one machine while attaining high hardware utilization and low cost. Designing scalable indexes on disaggregated memory, however, is challenging due to rudimentary caching, unprincipled offloading and excessive inconsistency among servers. This paper proposes DEX, a new scalable B+-tree for memory disaggregation. DEX includes a set of techniques to reduce remote accesses, including logical partitioning, lightweight caching and cost-aware offloading. Our evaluation shows that DEX can outperform the state-of-the-art by 1.7–56.3X, and the advantage remains under various setups, such as cache size and skewness.
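For intuition on the cost-aware offloading idea, here is a minimal sketch of the kind of decision involved: traverse the index from the compute node (paying one remote read per uncached level) or ship the whole operation to the memory node. The cost model and all constants below are hypothetical illustrations, not DEX's actual design.

```python
# Illustrative cost-aware offloading decision. The cost model and all constants
# below are hypothetical; they are not DEX's actual design.

REMOTE_READ_US = 3.0      # assumed latency of one one-sided remote read (us)
OFFLOAD_RPC_US = 5.0      # assumed RPC round trip to the memory node (us)
REMOTE_CPU_PENALTY = 1.5  # assumed slowdown of the memory node's weaker CPU

def should_offload(tree_height: int, cache_hit_ratio: float,
                   local_work_us: float) -> bool:
    """Return True if shipping the operation to the memory node looks cheaper
    than traversing the tree from the compute node with a partial cache."""
    expected_misses = tree_height * (1.0 - cache_hit_ratio)
    compute_side_cost = expected_misses * REMOTE_READ_US + local_work_us
    memory_side_cost = OFFLOAD_RPC_US + local_work_us * REMOTE_CPU_PENALTY
    return memory_side_cost < compute_side_cost

# A cold cache favors offloading; a warm cache favors compute-side traversal.
print(should_offload(tree_height=4, cache_hit_ratio=0.1, local_work_us=1.0))  # True
print(should_offload(tree_height=4, cache_hit_ratio=0.9, local_work_us=1.0))  # False
```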
[VLDB’24] Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
– C. Chang, Eric Lo, C. Yu
Abstract: Machine learning inference pipelines commonly encountered in data science and industry often require real-time responsiveness due to their user-facing nature. However, meeting this requirement becomes particularly challenging when certain input features require aggregating a large volume of data online. Recent literature on interpretable machine learning reveals that most machine learning models exhibit a notable degree of resilience to variations in their input. This suggests that machine learning models can effectively accommodate approximate input features with minimal discernible impact on accuracy. In this paper, we introduce Biathlon, a novel ML serving system that leverages the inherent resilience of models and determines the optimal degree of approximation for each aggregation feature. This approach enables maximum speedup while ensuring a guaranteed bound on accuracy loss. We evaluate Biathlon on real pipelines from both industry applications and data science competitions, demonstrating its ability to meet real-time latency requirements by achieving 5.3× to 16.6× speedup with almost no accuracy loss.
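To illustrate the general idea of approximating an aggregation feature under an accuracy bound, here is a toy sketch that grows a random sample until a confidence interval on the aggregate is tight enough. The thresholds and stopping rule are illustrative assumptions; this is not Biathlon's algorithm.

```python
# Toy sketch of sampling-based approximate feature aggregation with an error
# bound. Thresholds and the stopping rule are illustrative assumptions.
import random
import statistics

def approx_mean(values, rel_error=0.05, z=1.96, batch=100):
    """Estimate mean(values) from a growing random sample, stopping once the
    normal-approximation confidence interval is within rel_error of the estimate."""
    pool = list(values)
    random.shuffle(pool)
    sample = []
    for v in pool:
        sample.append(v)
        if len(sample) % batch == 0:
            est = statistics.fmean(sample)
            half_width = z * statistics.stdev(sample) / len(sample) ** 0.5
            if est != 0 and half_width <= rel_error * abs(est):
                return est, len(sample)
    return statistics.fmean(sample), len(sample)

data = [random.gauss(100, 15) for _ in range(100_000)]
estimate, rows_used = approx_mean(data)
print(f"estimate={estimate:.2f} using {rows_used} of {len(data)} rows")
```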
[SIGMOD’23] When Blockchain meets Deterministic Databases
– Z. Lai, C. Liu, Eric Lo
Abstract: Private blockchains, as replicated transactional systems, share many commonalities with distributed databases. However, the close relationship between private blockchains and deterministic databases has never been studied. In essence, both ensure replica consistency through determinism. In this paper, we present a comprehensive analysis that uncovers the connections between private blockchains and deterministic databases. While private blockchains have only recently started to pursue deterministic transaction execution, deterministic databases have studied deterministic concurrency control protocols for almost a decade. This motivates us to propose Harmony, a novel deterministic concurrency control protocol designed for blockchain use. We use Harmony to build a new relational blockchain, namely HarmonyBC, which features low abort rates, hotspot resiliency, and inter-block parallelism, all of which are especially important to disk-oriented blockchains. Empirical results on Smallbank, YCSB, and TPC-C show that HarmonyBC offers 2.0x to 3.5x higher throughput than state-of-the-art private blockchains.
Slides
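For intuition on "replica consistency by determinism", the sketch below shows the shared principle in its simplest form: every replica applies the same pre-agreed batch of transactions in the same order under the same deterministic rules, so all replicas reach an identical state. It is only a conceptual illustration, not the Harmony protocol.

```python
# Minimal illustration of replica consistency by determinism: every replica
# applies the same pre-agreed batch in the same order with the same rules,
# so all replicas end in the same state. This is NOT the Harmony protocol.

def execute_batch(state: dict, batch: list) -> dict:
    """Apply a pre-ordered batch of transfer transactions deterministically."""
    state = dict(state)
    for src, dst, amount in batch:         # fixed, agreed-upon order
        if state.get(src, 0) >= amount:    # deterministic abort rule
            state[src] = state.get(src, 0) - amount
            state[dst] = state.get(dst, 0) + amount
    return state

initial = {"alice": 100, "bob": 50}
batch = [("alice", "bob", 30), ("bob", "carol", 60), ("carol", "alice", 10)]

replica_1 = execute_batch(initial, batch)
replica_2 = execute_batch(initial, batch)
assert replica_1 == replica_2              # identical state on every replica
print(replica_1)
```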
[VLDB’22] Are Updatable Learned Indexes Ready?
– C. Wongkham, B. Lu, C. Liu, Z. Zhong, Eric Lo, T. Wang
Abstract: Recently, numerous promising results have shown that updatable learned indexes can perform better than traditional indexes with much lower memory space consumption. However, it is unknown how these learned indexes compare against each other, and against traditional indexes, under realistic workloads with changing data distributions and concurrency levels. As a result, practitioners remain wary of how these new indexes would actually behave in practice. To fill this gap, this paper conducts the first comprehensive evaluation of updatable learned indexes. Our evaluation uses ten real datasets and various workloads to challenge learned indexes in three aspects: performance, memory space efficiency, and robustness. Based on the results, we give a series of takeaways that can guide the future development and deployment of learned indexes.
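For readers unfamiliar with the core mechanism being evaluated, the sketch below shows the general learned-index recipe (fit a model from key to position, record its maximum error, and correct lookups with a bounded local search). It is a toy single-model illustration, not any of the surveyed indexes.

```python
# Minimal learned-index sketch: fit a linear model from key to position, record
# the maximum prediction error, and answer lookups with a bounded local search.
# This illustrates the general technique only, not any surveyed index.
import bisect

class TinyLearnedIndex:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Least-squares slope/intercept via the simple closed form.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        var = sum((k - mean_k) ** 2 for k in sorted_keys) or 1.0
        self.slope = cov / var
        self.intercept = mean_p - self.slope * mean_k
        # Maximum prediction error of the model over the indexed keys.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        return pos if pos < len(self.keys) and self.keys[pos] == key else None

idx = TinyLearnedIndex(sorted(range(0, 100_000, 3)))
print(idx.lookup(300), idx.lookup(301))   # 100 None
```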
[VLDB’22] APEX: A High-Performance Learned Index on Persistent Memory
– B Lu, J Ding, Eric Lo, Umar F Minhas, T. Wang
Abstract: The recently released persistent memory (PM) has been gaining popularity. PM offers high performance (still lower than DRAM), persistence, and a lower cost than DRAM. This opens up new possibilities for index design. Existing indexes on PM typically evolve from traditional B+-tree indexes; however, this is not necessarily the best choice. Recently proposed learned indexes use machine learning (ML) for data-distribution-aware indexing and perform significantly better than B+-trees, but none of them support persistence and instant recovery. In this paper, we propose APEX, a new learned index on PM with very high performance, persistence, concurrency, and instant recovery. Our careful design combines the best of both worlds: APEX is designed to reduce PM accesses, especially heavy-weight writes, flushes, and memory fences, while still exploiting ML to achieve high performance. Our in-depth experimental evaluation on Intel DCPMM shows that APEX achieves up to ~15X higher throughput compared to state-of-the-art PM indexes, and can recover from failures in ~42ms. To the best of our knowledge, APEX is the first full-fledged and practical learned index on PM.
[SIGMOD’21] Top-K Deep Video Analytics: A Probabilistic Approach
– Z Lai, C Han, C Liu, P Zhang, Eric Lo, B Kao
Abstract: The impressive accuracy of deep neural networks (DNNs) has created great demand for practical analytics over video data. Although efficient and accurate, the latest video analytics systems have not supported analytics beyond selection and aggregation queries. In data analytics, Top-K is a very important analytical operation that enables analysts to focus on the most important entities. In this paper, we present Everest, the first system that supports efficient and accurate Top-K video analytics. Everest ranks and identifies the most interesting frames/moments from videos with probabilistic guarantees. Everest is built with a careful synthesis of deep computer vision models, uncertain data management, and Top-K query processing. Evaluations on real-world videos and the latest Visual Road benchmark show that Everest achieves 14.3× to 20.6× higher efficiency than baseline approaches with high result accuracy.
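The sketch below illustrates the flavor of probabilistic Top-K over uncertain scores: each frame's true score is modeled as a distribution around a cheap proxy score, and Monte Carlo sampling estimates each frame's probability of ranking in the top k. The Gaussian noise model and parameters are assumptions for illustration; this is not Everest's algorithm.

```python
# Toy sketch of probabilistic Top-K over uncertain scores. The Gaussian noise
# model around a hypothetical proxy score is an assumption for illustration.
import random

def topk_probabilities(proxy_scores, k, sigma=0.1, trials=5_000):
    """Monte Carlo estimate of P(frame is in the true top-k)."""
    hits = [0] * len(proxy_scores)
    for _ in range(trials):
        sampled = [(random.gauss(s, sigma), i) for i, s in enumerate(proxy_scores)]
        for _, i in sorted(sampled, reverse=True)[:k]:
            hits[i] += 1
    return [h / trials for h in hits]

proxy = [0.9, 0.85, 0.5, 0.45, 0.1]       # hypothetical proxy-model scores
for frame, p in enumerate(topk_probabilities(proxy, k=2)):
    print(f"frame {frame}: P(in top-2) = {p:.2f}")
```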
[VLDB’20] Dash: Scalable Hashing on Persistent Memory (ACM SIGMOD 2021 Research Highlight Award)
– B. Lu, X. Hao, T Wang, Eric Lo
Abstract: Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence, and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piece-wise and partial solutions that side-step many important properties, in particular good scalability, high load factor and instant recovery.
We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core machine with Intel Optane DCPMM, we show that compared to state-of-the-art, Dash-enabled hash tables can achieve up to ∼3.9× higher performance with up to over 90% load factor and an instant recovery time of 57ms regardless of data size.
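As background on one of the two adapted schemes, the sketch below shows plain in-memory extendible hashing (a directory of bucket pointers, bucket splits, and directory doubling). None of Dash's PM-specific techniques, such as fingerprinting, optimistic concurrency, or instant recovery, appear here.

```python
# Minimal extendible hashing sketch: a directory of pointers to fixed-size
# buckets; a full bucket splits and may double the directory. Only the base
# dynamic-hashing scheme is shown; Dash's PM-specific techniques are omitted.
BUCKET_SIZE = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def put(self, key, value):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < BUCKET_SIZE:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.put(key, value)                            # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory            # directory doubling
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():                  # redistribute entries
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v
        for i, b in enumerate(self.directory):          # repoint directory slots
            if b is bucket and i & high_bit:
                self.directory[i] = sibling

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

table = ExtendibleHash()
for i in range(100):
    table.put(i, i * i)
print(table.get(7), table.get(99), table.global_depth)
```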
[TODS’19 (SIGMOD’19)] Historic Moments Discovery in Sequence Data
– Ran Bai, WK Hon, Eric Lo, Zhian He, Kenny Zhu
Abstract: Many emerging applications are based on finding interesting subsequences in sequence data. Finding "prominent streaks", the longest contiguous subsequences whose values are all above (or below) a certain threshold, is one such problem that has received much attention. Motivated by real applications, we observe that prominent streaks alone are not insightful enough and should be accompanied by what we coin "historic moments". In this paper, we present an algorithm to efficiently compute historic moments from sequence data. The algorithm is incremental and space-optimal, meaning that when new data arrives, it can efficiently refresh the results while keeping only minimal information. Case studies show that historic moments can significantly improve the insights offered by prominent streaks alone. Furthermore, experiments show that our algorithm outperforms the baseline in both time and space.
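The sketch below shows only the streak part of the setting, finding the longest contiguous runs above a threshold in a single pass; the discovery and incremental, space-optimal maintenance of historic moments are the paper's contribution and are not reproduced here.

```python
# Minimal sketch: find the longest contiguous streak(s) of values above a
# threshold in a single pass. Historic-moment discovery and the incremental,
# space-optimal maintenance described in the paper are not shown here.
def longest_streaks_above(values, threshold):
    """Return (best_length, [(start, end)]) for the longest runs > threshold."""
    best_len, best_runs = 0, []
    run_start = None
    for i, v in enumerate(values + [threshold]):   # sentinel closes a final run
        if v > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length > best_len:
                best_len, best_runs = length, [(run_start, i - 1)]
            elif length == best_len:
                best_runs.append((run_start, i - 1))
            run_start = None
    return best_len, best_runs

# Example: the longest streak above 20 spans indices 4..8 (length 5).
scores = [18, 25, 26, 19, 30, 31, 27, 22, 24, 15]
print(longest_streaks_above(scores, 20))
```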
[SIGMOD’16] Fast Multi-column Sorting in Main-Memory Column-Stores
– Wenjian Xu, ZiQiang Feng, Eric Lo
Abstract: Sorting is a crucial operation that can be used to implement SQL operators such as GROUP BY, ORDER BY, and SQL:2003 PARTITION BY. Queries with multiple attributes in those clauses are common in real workloads. When executing such queries, state-of-the-art main-memory column-stores require one round of sorting per input column. With the advent of recent fast scans and denormalization techniques, this kind of multi-column sorting can become a bottleneck. In this paper, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time is reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data-level parallelism. Empirical results show that a main-memory column-store with code massaging can achieve speedups of up to 4.7X, 4.7X, 4X, and 3.2X on TPC-H, TPC-H skew, TPC-DS, and a real workload, respectively.
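A simplified way to see why bit manipulation can eliminate sorting rounds: if several columns' bits are packed into one composite key (most significant column first), a single sort orders the data by all of them. The sketch below shows that packing in scalar form; the paper's actual code massaging and SIMD-level details are more involved.

```python
# Simplified illustration of multi-column sorting via a packed composite key:
# instead of one sorting round per ORDER BY column, the columns' bits are
# concatenated (most significant column first) so a single sort suffices.
# This scalar sketch omits the SIMD-level bit manipulation the paper performs.

def pack_key(row, widths):
    """Concatenate column values into one integer key, given bit widths."""
    key = 0
    for value, bits in zip(row, widths):
        assert 0 <= value < (1 << bits), "value exceeds its column's bit width"
        key = (key << bits) | value
    return key

# ORDER BY col_a, col_b with 8-bit and 16-bit unsigned columns (hypothetical).
widths = (8, 16)
rows = [(3, 500), (1, 4464), (3, 42), (1, 9)]
rows.sort(key=lambda r: pack_key(r, widths))   # one round of sorting
print(rows)                                    # ordered by col_a, then col_b
```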
[SIGMOD’15] ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout
– ZiQiang Feng, Eric Lo, Ben Kao, Wenjian Xu
Abstract: Scan and lookup are two core operations in main-memory column stores. A scan operation scans a column and returns a result bit vector that indicates which records satisfy a filter. Once a column scan is completed, the result bit vector is converted into a list of record numbers, which is then used to look up values from other columns of interest to a query. Recently, several in-memory data layouts have been proposed to improve the performance of in-memory data processing. However, these solutions all stand at either end of a trade-off: each is good in either lookup performance or scan performance, but not both. In this paper we present ByteSlice, a new main-memory storage layout that supports both highly efficient scans and lookups. ByteSlice is a byte-level columnar layout that fully leverages SIMD data parallelism. Micro-benchmark experiments show that ByteSlice achieves a data scan speed of less than 0.5 processor cycles per column value, a new limit of main-memory data scanning, without sacrificing lookup performance. Our experiments on TPC-H data and real data show that ByteSlice offers significant performance improvements over all state-of-the-art approaches.
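The sketch below gives a scalar illustration of byte-at-a-time scanning: each value is split into byte slices, a predicate is evaluated on the most significant slice first, and lower slices are consulted only for undecided values. The real layout evaluates many values per SIMD instruction; the encoding and predicate here are simplified assumptions.

```python
# Scalar sketch of byte-sliced scanning: each 16-bit value is split into a
# high-byte slice and a low-byte slice; a predicate is evaluated on the high
# bytes first and the low bytes are consulted only for undecided values.
# The actual ByteSlice layout evaluates many values per SIMD instruction.

def byteslice_encode(values):
    """Split 16-bit values into two byte slices (high byte, low byte)."""
    high = [v >> 8 for v in values]
    low = [v & 0xFF for v in values]
    return high, low

def scan_less_than(high, low, constant):
    """Return a result bit vector for 'value < constant'."""
    c_high, c_low = constant >> 8, constant & 0xFF
    result = []
    for h, l in zip(high, low):
        if h != c_high:                 # decided by the first (high) byte slice
            result.append(h < c_high)
        else:                           # tie on high byte: consult the low slice
            result.append(l < c_low)
    return result

values = [100, 300, 4660, 70, 65535]
high, low = byteslice_encode(values)
print(scan_less_than(high, low, 300))   # [True, False, False, True, False]
```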
[SIGMOD’13] Parallel Analytics as a Service
– Petrie Wong, Zhian He, Eric Lo
Abstract: Recently, massively parallel processing relational database systems (MPPDBs) have gained much momentum in the big-data analytics market. With the advent of hosted cloud computing, we envision that the offering of MPPDB-as-a-Service (MPPDBaaS) will become attractive to companies with analytical tasks over only hundreds of gigabytes to a few tens of terabytes of data, because they can enjoy high-end parallel analytics at low cost. This paper presents Thrifty, a prototype implementation of MPPDB-as-a-Service. The major research issue is how to achieve a lower total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, with a performance SLA that guarantees the tenants can obtain their query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal with a tenant-driven design that includes (1) a cluster design that carefully arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement scheme that assigns each tenant to several MPPDBs (for high-availability service through replication), and (3) a query routing algorithm that routes a tenant's query to the proper MPPDB at run time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests a 2-to-32-node MPPDB to query against 200GB to 3.2TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.
Slides Code (Open source as project Vault)
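The sketch below shows only the overall shape of tenant placement and query routing (replicate each tenant on a few MPPDBs, route each query to the least-loaded replica). The placement policy and load metric are hypothetical stand-ins, not Thrifty's cluster design or SLA model.

```python
# Toy sketch of tenant placement and query routing across shared MPPDBs:
# each tenant is replicated on several MPPDB groups, and a query is routed to
# the currently least-loaded replica. The round-robin placement and the load
# metric are hypothetical; Thrifty's actual design and SLA model differ.
import itertools

class MPPDBaaS:
    def __init__(self, num_mppdbs, replication=3):
        self.load = [0] * num_mppdbs               # outstanding queries per MPPDB
        self.placement = {}                        # tenant -> list of MPPDB ids
        self._next = itertools.cycle(range(num_mppdbs))
        self.replication = replication

    def place_tenant(self, tenant):
        """Assign the tenant to `replication` distinct MPPDBs (round-robin here)."""
        replicas = []
        while len(replicas) < self.replication:
            candidate = next(self._next)
            if candidate not in replicas:
                replicas.append(candidate)
        self.placement[tenant] = replicas

    def route_query(self, tenant):
        """Send the query to the tenant's least-loaded replica."""
        target = min(self.placement[tenant], key=lambda m: self.load[m])
        self.load[target] += 1
        return target

service = MPPDBaaS(num_mppdbs=8)
for t in ("tenant_a", "tenant_b", "tenant_c"):
    service.place_tenant(t)
print(service.placement["tenant_a"], service.route_query("tenant_a"))
```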
[ICDE’12] Answering Why-Not Questions on Top-K Queries
– Zhian He, Eric Lo
Abstract: After decades of effort on database performance, the quality and usability of database systems have received more attention in recent years. In particular, explaining missing tuples in a query result, the so-called "why-not" questions, has recently become an active topic. In this paper, we study the problem of answering why-not questions on top-k queries. Our motivation is that many users rely on top-k queries when making multi-criteria decisions. However, they often feel frustrated when asked to quantify their preferences as a set of numeric weightings, and even more frustrated when the query results do not include their expected answers. We use a query-refinement method to approach the problem: given as input the original top-k query and a set of missing tuples, our algorithm returns to the user a refined top-k query that includes the missing tuples. A case study and experimental results show that our approach returns high-quality explanations to users efficiently.
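The sketch below illustrates the query-refinement idea in its crudest form for linear top-k scoring: grid-search candidate weight vectors for the one closest to the original weighting that brings the missing tuple into the top-k. The grid search and the example data are illustrative assumptions; the paper's algorithm is far more efficient and also considers refining k.

```python
# Brute-force toy sketch of why-not refinement for a linear top-k query: search
# a small grid of candidate weightings for one that brings the missing tuple
# into the top-k while staying close to the original weights. The paper's
# algorithm is far more efficient and also considers refining k itself.
from itertools import product

def topk(tuples, weights, k):
    scored = sorted(tuples, key=lambda t: -sum(w * a for w, a in zip(weights, t[1])))
    return [name for name, _ in scored[:k]]

def refine_weights(tuples, original_w, k, missing, step=0.1):
    """Return the candidate weighting closest to original_w whose top-k
    contains `missing`, or None if no grid point qualifies."""
    grid = [x * step for x in range(int(1 / step) + 1)]
    best = None
    for cand in product(grid, repeat=len(original_w)):
        if abs(sum(cand) - 1.0) > 1e-9:                 # keep weights normalized
            continue
        if missing in topk(tuples, cand, k):
            dist = sum((c - o) ** 2 for c, o in zip(cand, original_w))
            if best is None or dist < best[0]:
                best = (dist, cand)
    return best[1] if best else None

# Hypothetical hotels scored on (price_score, rating_score).
hotels = [("A", (0.9, 0.2)), ("B", (0.8, 0.4)), ("C", (0.1, 0.9)), ("D", (0.3, 0.8))]
original = (0.9, 0.1)
print(topk(hotels, original, 2))                 # missing tuple "C" is not here
print(refine_weights(hotels, original, 2, "C"))  # a refined weighting including C
```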
Notice: The materials contained here are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders.
Disclaimer: By accessing the code, you acknowledge that all code was written by my research personnel, who are fully accountable for the validity of the results.