Short version of "Fuzzy matching on big-data" in GoodIT22 ACM Proceedings

Figure: Siamese network performance

Abstract

Food retailers’ scanner data provide unprecedented details on local consumption, provided that product identifiers allow a linkage with features of interest, such as nutritional information.

In this paper, we enrich a large retailer dataset with nutritional information extracted from crowd-sourced and administrative nutritional datasets. To compensate for imperfect matching through barcodes, we develop a methodology to efficiently match short textual descriptions.

After a preprocessing step to normalize short labels, we resort to fuzzy matching based on several tokenizers (including n-grams) by querying a customized Elasticsearch index, and we validate candidate echoes as matches with a Levenshtein edit distance and an embedding-based similarity measure produced by a siamese neural network. The pipeline is composed of several steps that successively relax constraints to find relevant matching candidates.
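To illustrate the retrieval-and-validation steps, here is a minimal Python sketch assuming an elasticsearch-py 8.x client and the rapidfuzz library; the index name, analyzer settings, and distance threshold are hypothetical placeholders, not the paper's exact configuration.

```python
from elasticsearch import Elasticsearch          # assumed: elasticsearch>=8
from rapidfuzz.distance import Levenshtein       # assumed: rapidfuzz>=2

es = Elasticsearch("http://localhost:9200")      # placeholder endpoint
INDEX = "products"                               # hypothetical index name

def create_index():
    """Custom index whose `label` field is split into character trigrams."""
    es.indices.create(
        index=INDEX,
        settings={
            "analysis": {
                "tokenizer": {
                    "trigrams": {"type": "ngram", "min_gram": 3, "max_gram": 3}
                },
                "analyzer": {
                    "trigram_analyzer": {
                        "type": "custom",
                        "tokenizer": "trigrams",
                        "filter": ["lowercase", "asciifolding"],
                    }
                },
            }
        },
        mappings={
            "properties": {"label": {"type": "text", "analyzer": "trigram_analyzer"}}
        },
    )

def candidates(label, size=10):
    """Retrieve fuzzy candidates: shared trigrams make the match typo-tolerant."""
    hits = es.search(index=INDEX, query={"match": {"label": label}}, size=size)
    return hits["hits"]["hits"]

def validate(query, echo, max_dist=0.3):
    """Keep a candidate echo only if its normalized edit distance is small enough."""
    return Levenshtein.normalized_distance(query.lower(), echo.lower()) <= max_dist
```

In the actual pipeline, such retrieval and validation passes are chained with successively relaxed constraints, so a label that fails a strict pass gets another chance under a looser one.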

Paper in the Association for Computing Machinery (ACM) Proceedings is available at https://dl.acm.org/doi/10.1145/3524458.3547244

Purpose

To make the most of automatically collected scanner data for consumption studies, we link the products they record with crowd-sourced nutritional databases using textual search techniques. This approach requires state-of-the-art textual analysis methods, including word embeddings, as well as efficient search tools to scale up.
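To give an idea of what the embedding-based similarity could look like, here is a minimal PyTorch sketch of a character-level siamese encoder; the architecture, dimensions, and loss are illustrative assumptions, not the model reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelEncoder(nn.Module):
    """Shared tower of the siamese network: char embedding + GRU pooling."""
    def __init__(self, vocab_size=128, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):
        _, h = self.gru(self.embed(x))           # h: (1, batch, dim)
        return F.normalize(h.squeeze(0), dim=1)  # unit-norm label embeddings

def similarity(encoder, a, b):
    """Cosine similarity between the embeddings of the two label batches."""
    return (encoder(a) * encoder(b)).sum(dim=1)

def contrastive_loss(sim, y, margin=0.5):
    """y = 1 for matching label pairs, 0 otherwise."""
    return (y * (1 - sim) + (1 - y) * F.relu(sim - margin)).mean()
```

Because the two towers share weights, both labels are projected into the same space, and the resulting cosine similarity can be thresholded, much like the edit distance, to accept or reject a candidate pair.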

Understanding the nature and the nutritional or environmental quality of food products bought in supermarkets will help develop sustainable and healthy consumption. The development of applications that provide information on products (nutritional characteristics, packaging, carbon footprint, etc.) opens up new perspectives for the analysis of scanner data at population scale once the two sources have been matched. It is thus important to propose a method for associating these data sources that is reliable, flexible, and efficient.

This work allowed us to evaluate the contributions and limitations of some NLP methods in a context where the textual data are noisy. Beyond the constructed database, which can serve multiple applications, one possibility is to make the most efficient text-processing and matching models available to the community.


Lino Galiana
Data Scientist

I am a data scientist at Insee, the French national statistical institute. I study how emerging data sources and new computational methods can help renew the production of statistical knowledge.