Fuzzy matching on big-data: an illustration with scanner data and crowd-sourced nutritional data

Similarity between products


Food retailers’ scanner data provide unprecedented details on local consumption, provided that product identifiers allow a linkage with features of interest, such as nutritional information.

In this paper, we enrich a large retailer dataset with nutritional information extracted from Open Food Facts, supplemented with the ANSES Ciqual dataset. To compensate for imperfect matching through the bar code, we develop a methodology to efficiently match short textual descriptions. After a preprocessing step that normalizes short labels, we resort to fuzzy matching based on several tokenizers (including n-grams) by querying a customized Elasticsearch index, and we validate the candidate echoes it returns as matches using a Levenshtein edit distance. The pipeline is composed of several steps that successively relax constraints to find relevant matching candidates.
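The three stages described above (normalization, n-gram candidate retrieval, edit-distance validation) can be illustrated with a minimal, standard-library Python sketch. It is not the pipeline used in the paper: a character trigram Jaccard overlap stands in for the Elasticsearch query, and the function names and both thresholds (`min_overlap`, `max_relative_distance`) are hypothetical values chosen only for illustration.

```python
import re
import unicodedata


def normalize(label):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def ngrams(text, n=3):
    """Set of character n-grams of a space-padded string."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def best_match(query, reference, min_overlap=0.3, max_relative_distance=0.25):
    """Return the reference label that best matches `query`, or None.

    Assumed thresholds: candidates need a trigram overlap of at least
    `min_overlap`, then must pass a length-relative Levenshtein check.
    """
    q = normalize(query)
    q_grams = ngrams(q)
    best, best_dist = None, None
    for label in reference:
        r = normalize(label)
        if jaccard(q_grams, ngrams(r)) < min_overlap:
            continue  # not retrieved as a candidate
        dist = levenshtein(q, r) / max(len(q), len(r), 1)
        if dist <= max_relative_distance and (best_dist is None or dist < best_dist):
            best, best_dist = label, dist
    return best
```

Relaxing constraints step by step, as the pipeline does, would correspond here to calling `best_match` again with a lower `min_overlap` or a higher `max_relative_distance` for the labels that remain unmatched.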

Finally, we develop a similarity measure based on a word embedding obtained by training a Siamese network on bar code matches. This alternative measure is used to evaluate our final matching.
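To make the idea of an embedding-based similarity concrete, here is a minimal Python sketch. It does not train a Siamese network: the token vectors in `TOY_VECTORS` are hypothetical placeholders for what a trained model would produce, and averaging token vectors followed by a cosine similarity is an assumption made for illustration, not necessarily the measure used in the paper.

```python
import math

# Hypothetical 3-dimensional token vectors, for illustration only.
# In practice these would come from the trained embedding.
TOY_VECTORS = {
    "yaourt": [0.9, 0.1, 0.0],
    "yogurt": [0.8, 0.2, 0.1],
    "nature": [0.1, 0.9, 0.0],
    "jus": [0.0, 0.1, 0.9],
    "orange": [0.1, 0.0, 0.8],
}


def embed(label, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    tokens = [t for t in label.lower().split() if t in vectors]
    if not tokens:
        return None
    dim = len(vectors[tokens[0]])
    return [sum(vectors[t][k] for t in tokens) / len(tokens) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def label_similarity(a, b, vectors=TOY_VECTORS):
    """Similarity between two product labels, or None if one cannot be embedded."""
    ea, eb = embed(a, vectors), embed(b, vectors)
    if ea is None or eb is None:
        return None
    return cosine(ea, eb)
```

Because such a score does not depend on character overlap, it can serve as an independent check on pairs that were matched through their textual descriptions.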

A preliminary version of the research I conducted with Milena Suarez-Castillo on how state-of-the-art NLP techniques can be used to link data sources through food product names.

The working paper can be downloaded here.


Lino Galiana
Data Scientist

I am a data scientist at Insee, the French national statistical institute. I study how emerging data and new computational methods help renew the production of statistical knowledge.