Unlocking the Potential of Open Data: How to Make it Work for You

Analytics & I
Artificial Intelligence in Plain English
11 min read · Jan 22, 2023


Photo by Emily Morter on Unsplash

In this blog post I have asked our friendly AI to translate into English one of my articles previously published in a different language. The original article explores how to use machine learning to combine multiple datasets from open data sources and not lose your mind. It describes the implementation of the following tools and technologies: TF-IDF, faiss, pgvector, and transformers. Alright, let’s have a look at what we’ve got.

Open data

Photo by Jan Antonin Kolar on Unsplash

The concept of open data implies that certain data should be freely available for use and further republication without copyright, patent, or other control mechanisms. Open data can be published by various sources. The largest body of data is produced by government agencies and services:

  • Tax Authorities
  • Federal Treasury
  • Ministry of Economic Development
  • Supreme Court
  • Bailiff Department
  • etc.

In theory, it sounds good. Download the data and use it. In practice, however, it is not possible to simply find, download and start using a dataset. And here is why: file formats and structures differ, webpage layouts change constantly, documentation is sparse (if it exists at all), and sometimes the data is not available for download at all. All of this turns the extraction of open data into a whole adventure (you know, the “Let’s go. In and out. 20 minutes adventure.” one). One blog post cannot fit all the pain you can experience while trying to get hold of some open data.

But let’s imagine that we have managed to cope with all the difficulties of getting the data we need. Now we have stable datasets in our storage. Flawless victory? Well, almost. We still need to merge these datasets to get the most out of them.

Why are open datasets so hard to merge?

Usually the creator of a dataset is responsible for their own dataset only, and rarely does anyone care what happens to it afterwards. Let’s take, for example, the data of the Bailiff Department. The data contains information on enforcement proceedings, the subject of enforcement, the amount of unpaid debt, contact information of the bailiff in charge, etc. What else could you wish for, it would seem? But for the debtor we only have the company’s name and its address. There is no taxpayer identification number (TIN) and no company registration number (CRN) by which it would be easy to link these data with data from other sources and registries.

Linking Bailiff Department data directly by the name of the legal entity and/or its address is problematic. Firstly, some registries may contain only the TIN/CRN and no name at all. Secondly, company names are far from unique.

Thirdly, the names of legal entities can be recorded in different ways: some add quotes to the name, some don’t, sometimes the legal form is written out in full, sometimes it is shortened to an abbreviation, and so on. Fourthly, the address information may be outdated, and different registries may hold different addresses: one the registration address, another the actual one. Fifthly, after working with open data for a while, you can safely say that there are dozens of ways to record the same company address differently. I hope the main message is clear, so there is no need to elaborate further.

To summarise: if we want to successfully merge Bailiff Department data with other open data sources, we need the TIN and CRN for each and every business entity in our dataset. We decided to obtain this information from another open dataset we already have, through the following steps:

  • Data cleaning
  • Preparing the dataset for neural network training
  • Training a neural network
  • Deploying the model in production.

We will explain each step in more detail below.

Data cleaning

First, we will list the data that we will use:

  • Bailiff Department data without TIN and CRN
  • a historical set of Bailiff Department data with TIN and CRN that we managed to label earlier
  • data from the registry of business entities with TIN and CRN.

From this entire array of data we need the company’s name, its address, TIN and CRN. Data cleaning consisted of normalising addresses and company names, i.e. making sure addresses and names are written in the same manner. This is pretty standard:

  • remove punctuation, spaces at the beginning and end of the line, as well as hidden symbols, tabulation, line breaks, etc.
  • unify the abbreviations of organizational and legal forms
  • move the abbreviations to the beginning of the names

Only three simple steps, and we get a version of the dataset that suits us in terms of legal entity names.
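To make this a bit more concrete, here is a minimal sketch of what such name normalisation could look like in Python. The abbreviation map and the helper function are purely illustrative, not the exact code from our pipeline.

```python
import re

# Illustrative mapping of full legal-form names to their abbreviations
LEGAL_FORMS = {
    "limited liability company": "llc",
    "joint stock company": "jsc",
}

def normalize_name(name: str) -> str:
    name = name.lower().strip()
    # remove quotes, punctuation, tabs, line breaks and repeated spaces
    name = re.sub(r"[\"'«»„“”]", "", name)
    name = re.sub(r"[^\w\s]", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    # unify the abbreviations of organisational and legal forms
    for full, abbr in LEGAL_FORMS.items():
        name = name.replace(full, abbr)
    # move the legal-form abbreviation to the beginning of the name
    tokens = name.split()
    if tokens and tokens[-1] in LEGAL_FORMS.values():
        tokens = [tokens[-1]] + tokens[:-1]
    return " ".join(tokens)

print(normalize_name('Limited Liability Company "Sunrise"'))  # llc sunrise
```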

Similar transformations were also applied to the company addresses. An additional complexity was that the region or settlement was sometimes missing from the address. In such cases we were able to derive this information from the postal code and fill in the gaps.
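A rough sketch of the postal-code trick, assuming a simple prefix-to-region lookup table (the table below is purely illustrative):

```python
# Illustrative prefix-to-region table; in practice this comes from a postal reference dataset
POSTAL_PREFIX_TO_REGION = {
    "101": "moscow",
    "190": "saint petersburg",
}

def fill_region(address: str, postal_code: str) -> str:
    """Prepend the region derived from the postal code if it is missing from the address."""
    region = POSTAL_PREFIX_TO_REGION.get(postal_code[:3])
    if region and region not in address:
        address = f"{region} {address}"
    return address

print(fill_region("tverskaya st 1", "101000"))  # moscow tverskaya st 1
```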

Now that we have cleaned and standardised the addresses and company names, we can build concatenations of company names and addresses from the Bailiff Department data and do the same for the registry of business entities. We are now ready for the next stage of our project.

Preparing the dataset for neural network training

Now we will compare the concatenations from the Bailiff Department dataset with the concatenations from the registry of business entities dataset and, for each of them, find a random concatenation among the hundred closest ones in the vector space.

As mentioned above, we already have a labeled dataset from which we can collect positive examples of matches between the Bailiff Department data and the registry of business entities data. However, for our neural network to learn faster and not overfit on the training data, we need not only to “praise” it for correctly found matches but also to have data on which we impose “penalties”. A random value from the hundred nearest vectors fits this role well. Such a value is not the correct answer: it is similar, but not quite right. For example, the street in the address will match (luckily, there is an Independence Street in almost every settlement in the country), but the house number, city name and region will not. Of course, it is possible to come up with a more sophisticated method of generating negative examples, but for now we will use this simple solution.

Now, a bit about the tools and technologies. We use TfidfVectorizer to build the vector matrix, and the faiss library to find random similar pairs.

Tf-idf (Term Frequency — Inverse Document Frequency) shows the weight of a certain word for a given text, while taking into account how often that word occurs in the entire set of texts being considered (documents).

The TF-IDF formula: tf-idf(t, d) = tf(t, d) × idf(t), where tf(t, d) is the frequency of term t in document d, and idf(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t.

In simple words, Tf-idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TfidfVectorizer tool is used to convert a collection of raw documents to a matrix of Tf-idf features. The term frequency is the number of times a word appears in a document and the inverse document frequency is the logarithmically scaled inverse fraction of the documents that contain the word. The Tf-idf weight of a word is the product of these two statistics.

And here is an example:

  • The word “peace” appears 3 times in a text 100 words long, so tf = 3/100 = 0.03.
  • In our dataset of 10,000 texts, the word “peace” appears in only 10 of them, so idf = log(10000/10) = 3.
  • Therefore, tf-idf = 0.03 × 3 = 0.09.

The TfidfVectorizer class from scikit-learn calculates the tf-idf for each word in the text and produces an array of tf-idf values as output. After vectorisation, we will look for random similar pairs for the concatenations of company names and addresses from the Bailiff Department data and the registry of business entities data. If the dataset is relatively small, the standard KDTree module from scikit-learn can be used for this purpose. However, in our case KDTree did not allow us to obtain the desired result within a reasonable amount of time, so we had to turn to the faiss library.
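For illustration, here is a minimal sketch of building such a matrix with scikit-learn’s TfidfVectorizer. The use of character n-grams is our assumption here, chosen because they tend to be robust to small spelling differences in names and addresses; the sample rows are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy concatenations of "company name + address" from the two sources
bailiff = ["llc sunrise moscow tverskaya st 1"]
registry = [
    "llc sunrise 101000 moscow tverskaya street 1",
    "jsc sunset 190000 saint petersburg nevsky pr 10",
]

# Character n-grams (an assumption on our side) handle noisy spellings well
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
registry_matrix = vectorizer.fit_transform(registry)  # one row per registry entry
bailiff_matrix = vectorizer.transform(bailiff)        # same feature space as the registry

print(registry_matrix.shape, bailiff_matrix.shape)
```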

Faiss is a library for nearest-neighbour search and clustering of data in vector space. According to the developers, faiss can work efficiently with sets of billions of vectors. The library is written in C++ and comes with Python bindings that work with NumPy arrays. High performance is achieved by indexing the vectors and then clustering them using Voronoi diagrams.

Inside one cluster, all points are closer to the centre of that cluster (the centroid) than to any other. This way, when searching for the nearest vector, we don’t have to go through the entire set of vectors: it is enough to compare the query with the existing centroids and then search within the cluster of the closest centroid. If the search results are not precise enough, we simply increase the number of clusters searched around the found centroid. We can also speed up the search by compressing the vectors using Product Quantisation (more details here). Performance can be improved further by switching from CPU to GPU, which faiss allows without any problems.
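Below is a minimal, self-contained sketch of such an IVF index in faiss. The dimensionality, number of clusters and nprobe value are illustrative, and the random matrix stands in for the real (densified) Tf-idf vectors.

```python
import faiss
import numpy as np

# Stand-in for the dense float32 matrix of vectors built from the registry of business entities
dim = 256
registry_vectors = np.random.rand(10_000, dim).astype("float32")

nlist = 100                                  # number of Voronoi cells (clusters)
quantizer = faiss.IndexFlatL2(dim)           # used to assign vectors to the nearest centroid
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

index.train(registry_vectors)                # learn the centroids
index.add(registry_vectors)                  # distribute the vectors across the cells

index.nprobe = 10                            # search the 10 nearest cells instead of just one
queries = np.random.rand(5, dim).astype("float32")
distances, ids = index.search(queries, 100)  # the 100 nearest registry rows for each query
print(ids.shape)                             # (5, 100)
```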

As a result, after applying TfidfVectorizer and faiss, we have negative matching examples between the Bailiff Department data and the registry of business entities data: pairs whose concatenations of addresses and company names are similar but do not fully match. From the positive and negative examples we assemble the final dataset on which we will train our neural network.
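And this is roughly how a negative example can be picked from the hundred nearest neighbours returned by faiss. The ids and labeled matches below are toy values; in our case the correct matches come from the historical, already labeled part of the Bailiff Department data.

```python
import random
import numpy as np

# Toy stand-in for the faiss output: the 100 nearest registry ids for each of 5 queries
ids = np.random.randint(0, 10_000, size=(5, 100))

# labeled_match[i] is the registry row known to be the correct match for query i
labeled_match = {0: 17, 1: 42, 2: 5, 3: 99, 4: 1000}

negatives = {}
for query_id, neighbour_ids in enumerate(ids):
    # drop the true match (and -1, which faiss returns when fewer neighbours are found)
    candidates = [int(j) for j in neighbour_ids
                  if j != labeled_match[query_id] and j != -1]
    negatives[query_id] = random.choice(candidates)  # similar, but not the correct answer

print(negatives)
```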

Training a neural network

We have the dataset generated in the previous steps, and we want to train a neural network to distinguish identical pairs of address-and-name concatenations from different ones. This task is called Semantic Textual Similarity (STS) and is usually solved through metric learning. Essentially, we want to teach the neural network to vectorise our concatenations of addresses and names in such a way that semantically identical examples have the same or very similar vector representations in terms of a certain metric. We will use the cosine distance as that metric and optimise it during training.

At first, we took the pre-trained bert-base-cased-sentence model from Hugging Face. This is a 12-layer BERT transformer, fine-tuned on several large language datasets. Using this model and the Sentence-Transformers library, we immediately obtained an accuracy of about 77% on the validation dataset. Not bad for a start. But we continued to fine-tune the model on the previously obtained data, training a Siamese network to produce embeddings of our addresses and company names that optimise the cosine distance between them. After about 4 days and 20 training epochs on an NVIDIA RTX A5000, our loss function stopped falling.
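As a sketch, the fine-tuning loop with the Sentence-Transformers library looks roughly like this. The training pairs and hyperparameters below are illustrative, and the model identifier is given as mentioned above (the full Hugging Face id may include an organisation prefix).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Model name as mentioned in the post; the full Hugging Face id may differ
model = SentenceTransformer("bert-base-cased-sentence")

# Positive pairs get label 1.0, the faiss-generated negatives get 0.0 (toy examples here)
train_examples = [
    InputExample(texts=["llc sunrise moscow tverskaya st 1",
                        "llc sunrise 101000 moscow tverskaya street 1"], label=1.0),
    InputExample(texts=["llc sunrise moscow tverskaya st 1",
                        "llc sunrise 344000 rostov on don sunny st 7"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# CosineSimilarityLoss optimises the cosine similarity between the two embeddings of a pair
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    warmup_steps=100,
    output_path="sts-company-matcher",  # hypothetical path for the fine-tuned model
)
```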

We measured the prediction accuracy on the validation dataset and obtained a value of 91%. A huge improvement over the result we had before, so we decided to stop at this accuracy level and move on to deploying the model in production. Of course, there is still a lot of room for improvement in the model, but given the goals and resources, we are satisfied with the result.

Deploying the model in production

Now, when we need to restore the TIN for a company from the Bailiff Department dataset, we form a vector from the transformed concatenation of the address and company name. Then, from all the vectors obtained in the same way from the registry of business entities dataset, we find the closest one in terms of cosine distance. It sounds simple enough, doesn’t it?

But there is one big problem with it: at the time of writing, there were tens of millions of companies in the registry, so it is simply not feasible to calculate the cosine distance between the vector of a company from the Bailiff Department dataset and every vector from the registry of business entities dataset. We could go back to KDTree or faiss, but these approaches also have certain problems in production: we need to keep the library indexes in memory, and that is dozens of gigabytes.

Here pgvector comes to our aid, a wonderful extension for Postgres. It is slower than faiss but still fast enough for our purposes, and it stores the index on disk, so you don’t have to keep it in memory all the time. An additional advantage is that, being a Postgres extension, it can be used directly in SQL queries.
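Here is a rough sketch of how this can look from Python with psycopg2. The table, column and index names are invented for illustration, and the 768-dimensional vectors correspond to the BERT-base embeddings described above.

```python
import psycopg2
from sentence_transformers import SentenceTransformer

conn = psycopg2.connect("dbname=opendata user=postgres")  # illustrative connection string
cur = conn.cursor()

# One-off setup: enable pgvector and create a table for the registry embeddings
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS registry_embeddings (
        tin       text,
        crn       text,
        embedding vector(768)
    );
""")
# An IVF index on cosine distance; Postgres keeps it on disk
cur.execute("""
    CREATE INDEX IF NOT EXISTS registry_embeddings_cos_idx
    ON registry_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
""")
conn.commit()

# At matching time: embed the concatenation and take the closest row by cosine distance (<=>)
model = SentenceTransformer("sts-company-matcher")  # the fine-tuned model from the previous step
query_vector = model.encode("llc sunrise moscow tverskaya st 1").tolist()
cur.execute(
    "SELECT tin, crn FROM registry_embeddings ORDER BY embedding <=> %s::vector LIMIT 1;",
    (str(query_vector),),
)
print(cur.fetchone())
```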

Now we have both components for our matching:

  • A job that checks the registry of business entities dataset for new companies or companies whose name or address has changed. If there are any, the neural network generates vectors for them and we add them to our database.
  • A job that checks the Bailiff Department dataset for companies without a TIN. For each such company we form a vector with the help of the neural network, then use pgvector to find the nearest vector from the registry of business entities dataset and obtain the corresponding TIN and CRN.

We wrapped this setup in Docker and rolled it out to our shared Kubernetes cluster. And now it’s a Flawless Victory!

To sum up…

We put in some effort and managed to merge two open datasets that were never intended for matching. What for? Well, for example, we can now add the Bailiff Department data to the credit scoring model. This improves the scoring model’s accuracy, and the efficiency of our company increases as well. Of course, this is not the only use of open data in our company’s processes and products, and the approach outlined above can be applied to any dataset that lacks TIN/CRN data. Unfortunately, there are still plenty of such datasets out there.

Thanks for reading!

If you liked this post and want to support me…

  • Clap, it will motivate me to chat about analytics with AI and post more
  • Share it with your friends and followers on social media
  • Follow me on Twitter
  • And here on Medium
