Introduction to invoice hashing: securing VAT reporting with cryptography


Previously, we explained that real-time invoice reporting systems can tackle VAT fraud without collecting massive amounts of data. Companies will still need to register invoices, but the data no longer has to be stored in “plain text” at a centralized location where authorized personnel can always access it: VAT fraud can be detected even if the data is encrypted. An essential part of confidential real-time invoice reporting is that, instead of storing the invoice data itself, a unique fingerprint of the invoice is created using so-called cryptographic hashing. This article aims to give everyone interested in VAT an introduction to hashing (crypto experts will want to consult other sources too), show how hashing is already applied in existing solutions, and explain how to make it work for invoice reporting systems.

What is hashing?

First things first: let’s build a firm understanding of hashing. In English, ’to hash’ means ’to chop’, ’to confuse’ or ’to muddle’ [1]. According to the Oxford Dictionary, the word comes “from French hacher, from hache ‘axe’”. [2] This matches the way in which hashing can be applied to invoice reporting: by chopping up the original invoice data into pieces, the data can be made unrecognisable. Just as stones can be cut up into unrecognisable pieces (see figure 1 below), data can be hashed by means of a hashing algorithm, also called a hash function.

Figure 1: could you assemble the pieces back together?


Source: Khadeeja Yasser - Unsplash

So what does a hash function do exactly? A hash function transforms input data of any length (e.g. invoice data) into a fixed-size output. The output is also called the hash, message digest or just digest. The figure below shows five simple hashing examples:

Figure 2: hashing examples


Source: Wikimedia Commons - Public Domain

As you can see: if you hash a different input, you get a different output of the same size. Even when changing just one letter (as is the case in the bottom four examples) the output is unrecognizably different. As a result, the hash function creates a kind of fingerprint of the input data. The hash function hashes (scrambles) the input data in such a way that the same input data will always hash to the same random-looking fingerprint. [3]
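The behaviour described above is easy to reproduce. The following is a minimal sketch using Python's standard `hashlib` module; the choice of SHA-256 and the helper name `fingerprint` are assumptions made for illustration and are not necessarily what figure 2 uses.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-256 digest of `text` as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


inputs = [
    "Fox",
    "The red fox jumps over the blue dog",
    "The red fox jumps ouer the blue dog",  # a single letter changed
]

for text in inputs:
    print(f"{fingerprint(text)}  <- {text!r}")

# The same input always gives the same fingerprint, and every digest has
# the same size: 64 hexadecimal characters, i.e. 256 bits.
same = fingerprint("Fox") == fingerprint("Fox")
lengths = {len(fingerprint(text)) for text in inputs}
print(same, lengths)
```

Whatever the length of the input, the printed digests have the same length, and the two sentences that differ by one letter produce digests with no visible resemblance.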

How is hashing used to verify data?

So, what can we use these hash algorithms for? Many things! Hashes are essential building blocks for digital signatures, for example, which are widely used in the realm of e-invoicing and e-archiving. If you find two fingerprints which are the same, you can be practically certain that the input data was the same as well, provided a secure hash function was used. This is a powerful tool when used correctly (we’ll discuss some best practices below; for example, you may not want to use hash functions which are too old!).

Note that not all hash functions are equally strong or effective. For example, for a hash function to be strong it should be practically impossible to find two inputs that hash to the same output (this is known as collision resistance). Furthermore, if someone knows a hash output but nothing else, it should be practically impossible to find a corresponding input (preimage resistance). A full mathematical explanation of how these properties can be achieved is out of scope for this blog post, but the footnotes point the interested reader to some materials to learn more.

Over the years, cryptographers have built better and better hash functions, but people have also become better at “breaking” these algorithms. The MD5 hash function, published in 1992 by cryptographer Ronald Rivest, can nowadays be broken on an ordinary laptop in seconds: that is how long it takes to find two different inputs with the same MD5 hash. [4] The Secure Hash Algorithm 1 (SHA-1), standardized in 1995, has been used around the world, but in 2017 it was broken in practice for the first time after years of attempts. [5] At the time of writing, the renowned U.S. National Institute of Standards and Technology advises the use of newer algorithms such as SHA-2 or SHA-3. Note that many alternatives exist, each offering different technical trade-offs in order to be optimal for certain applications. [6]

Why use hashes to secure and verify invoice data?

To illustrate how hashing can improve the way invoice information is verified, let’s have a look at an example in which Alice the automaker sends an invoice to Bob the builder:

Figure 3: Alice sends an invoice to Bob


Source: Freepik, summitto

A wide variety of auditors perform cross checks on the invoices in order to verify the correctness of the information. Think of the following situations:

auditors may want to know whether you included all purchase and sales invoices in your bookkeeping
tax authorities may want to know whether your counterparty registers the same amount of VAT
financial institutions such as lending providers and credit rating agencies may want to know whether you are submitting the same information to auditors and tax authorities as to the bank

Figure 4: Many parties are interested in their financial information


Source: Freepik, summitto

In all of the above cases, the auditing party doesn’t actually need to see your invoice data. All they need to know is whether the data registered in different places is the same. This can be made possible by requiring companies to hash their invoices and to publicise the resulting invoice fingerprints. Auditors can be sure that the invoices of your counterparty are in your registration; tax authorities can be sure that your counterparty reports the same invoice for tax purposes; and financial institutions can be sure that you are submitting the same information everywhere. This is only possible if all parties can verify that the other parties are using the same source of information.
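The cross-check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the invoice fields, the sample data, the `fingerprint` helper and the choice to hash a canonical JSON encoding with SHA3-256 are all hypothetical and do not describe any particular reporting system.

```python
import hashlib
import json


def fingerprint(invoice: dict) -> str:
    """Hash a canonical JSON form of the invoice (sorted keys, fixed separators)."""
    canonical = json.dumps(invoice, sort_keys=True, separators=(",", ":"))
    return hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records: Alice's sales ledger and Bob's purchase ledger.
alice_sales = [
    {"number": "2021-001", "seller": "Alice", "buyer": "Bob", "net": "1000.00", "vat": "210.00"},
    {"number": "2021-002", "seller": "Alice", "buyer": "Bob", "net": "500.00", "vat": "105.00"},
]
bob_purchases = [
    # Same invoice as Alice's first one, with the fields in a different order.
    {"buyer": "Bob", "seller": "Alice", "number": "2021-001", "vat": "210.00", "net": "1000.00"},
    # Bob registered a different VAT amount for the second invoice.
    {"number": "2021-002", "seller": "Alice", "buyer": "Bob", "net": "500.00", "vat": "150.00"},
]

# Each party publishes fingerprints only; the auditor never sees the invoices.
alice_published = {fingerprint(inv) for inv in alice_sales}
bob_published = {fingerprint(inv) for inv in bob_purchases}

matching = alice_published & bob_published
unmatched = alice_published ^ bob_published
print(f"{len(matching)} matching, {len(unmatched)} unmatched fingerprints")
```

Both parties must serialize an invoice in exactly the same way before hashing, otherwise identical invoices would yield different fingerprints. That is why the sketch sorts the keys and stores amounts as strings rather than floating-point numbers.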

Figure 5: Everyone should be able to access a single source of truth


Source: Freepik, summitto

Invoice fingerprinting demo: try it yourself!

To get a better understanding of invoice fingerprinting, you can try it out for yourself below. Here you see an invoice with several data fields on the left side, and the resulting SHA-3 hash output on the right side. If you change just a single character or number in the invoice, the fingerprint will change completely!
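The same experiment can also be run offline. The sketch below is not the code behind the demo: the invoice fields, the JSON encoding and the helper names are assumptions chosen for illustration. It hashes a hypothetical invoice with SHA3-256, changes a single digit, and counts how many of the 256 output bits differ.

```python
import hashlib
import json


def fingerprint(invoice: dict) -> bytes:
    """SHA3-256 over a canonical JSON encoding of the invoice fields."""
    canonical = json.dumps(invoice, sort_keys=True, separators=(",", ":"))
    return hashlib.sha3_256(canonical.encode("utf-8")).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equally long digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical invoice; only the VAT amount is edited, by one cent.
invoice = {
    "number": "2021-001",
    "seller": "Alice the automaker",
    "buyer": "Bob the builder",
    "net": "1000.00",
    "vat": "210.00",
}
edited = dict(invoice, vat="210.01")

original_fp = fingerprint(invoice)
edited_fp = fingerprint(edited)

print(original_fp.hex())
print(edited_fp.hex())
print(differing_bits(original_fp, edited_fp), "of 256 bits differ")
```

For a well-designed hash function, roughly half of the output bits are expected to flip after any change to the input, which is why the two fingerprints look unrelated.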

Best practices

There are a number of things to keep in mind when building a system which uses hash functions to fingerprint invoice data.

First, it is important to use a secure hash function. It is advisable to use the latest and most secure hash functions available during the initial setup, and to stay on top of the latest developments around secure hashing. New attacks are discovered every year, but SHA-1 took over 20 years to break; in all likelihood the extensively and publicly tested SHA-3 will be usable for many years to come. Unfortunately, a number of tax authorities are using SHA-1 even today, including in new systems.

The use of such insecure algorithms allows anyone with a sufficiently powerful computer to craft different invoices that result in the same hash, which renders the fingerprints unusable for any auditing purposes.

Second, in order to allow invoice fingerprints to be published to a public registry, creating a single source of truth for invoices, it should not be possible for a random person who sees a hash to uncover the invoice details through brute-force methods. [7] In order to achieve this, invoice fingerprints should be created or encrypted in such a way that only the sender, the receiver and anyone they specifically authorize can reconstruct the hash in order to verify from which invoice it was generated.
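To make the brute-force risk concrete, the sketch below first guesses the amount behind an unprotected fingerprint, and then shows one possible countermeasure: a keyed hash (HMAC) whose secret key is known only to the sender and the receiver. Using HMAC here is an assumption chosen for illustration, not a description of how any existing reporting system protects its fingerprints. The VAT numbers, the field layout and the function names are hypothetical as well.

```python
import hashlib
import hmac
import secrets

# Hypothetical VAT numbers, assumed to be publicly known.
SELLER = "NL000000000B01"
BUYER = "NL000000000B02"


def encode(seller: str, buyer: str, amount_cents: int) -> bytes:
    """Illustrative, fixed encoding of the invoice fields."""
    return f"{seller}|{buyer}|{amount_cents}".encode("utf-8")


def plain_fingerprint(seller: str, buyer: str, amount_cents: int) -> str:
    return hashlib.sha3_256(encode(seller, buyer, amount_cents)).hexdigest()


def keyed_fingerprint(key: bytes, seller: str, buyer: str, amount_cents: int) -> str:
    return hmac.new(key, encode(seller, buyer, amount_cents), hashlib.sha3_256).hexdigest()


def guess_amount(target: str, max_cents: int):
    """Attacker without a key: hash every candidate amount and compare."""
    for cents in range(max_cents + 1):
        if plain_fingerprint(SELLER, BUYER, cents) == target:
            return cents
    return None


secret_amount = 121_000  # 1,210.00 expressed in cents
shared_key = secrets.token_bytes(32)  # known only to sender and receiver

published_plain = plain_fingerprint(SELLER, BUYER, secret_amount)
published_keyed = keyed_fingerprint(shared_key, SELLER, BUYER, secret_amount)

print(guess_amount(published_plain, 200_000))  # the amount is recovered
print(guess_amount(published_keyed, 200_000))  # no match without the key

# Bob holds both the key and the invoice, so he can still verify the fingerprint.
bob_check = keyed_fingerprint(shared_key, SELLER, BUYER, secret_amount)
print(hmac.compare_digest(bob_check, published_keyed))
```

The plain hash fails because an invoice contains few unknowns, so an attacker can simply try them all. With a random 256-bit key mixed in, guessing the invoice details is no longer enough, while anyone the sender and receiver give the key to can still recompute the fingerprint.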

Conclusion

In this post we shared a high-level introduction to hash functions and their application to invoice reporting. We showed that hash functions can transform arbitrary data into a fixed-size output, thereby fingerprinting the data. This fingerprinting can be used to report invoice data in a more secure way, because it allows auditors to verify certain invoice data without revealing all of it by default. A word of caution: this is by no means a complete deep dive; there is a lot more to discover and understand. If you’d like to learn more about the topic, reach out to us at info@summitto.com!

[1] https://www.merriam-webster.com/dictionary/hash

[2] https://www.oxfordlearnersdictionaries.com/definition/english/hash_2?q=hashing

[3] For a full technical deep dive into SHA-256, a famous hash function from the SHA-2 family, check out: https://github.com/in3rsha/sha256-animation

[4] Xie, T., Liu, F., & Feng, D. (2013). Fast Collision Attack on MD5. Retrieved from: https://eprint.iacr.org/2013/170.pdf

[5] Stevens, M. et al. (2017). The first collision for full SHA-1. Retrieved from: https://shattered.io/

[6] For more information, see: https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions

[7] Some attackers may try to simply guess how the invoice hash was made, by trying to hash many different combinations of invoice details, hoping to end up at a certain target invoice hash. For more information, see: https://en.wikipedia.org/wiki/Brute-force_attack