Blog | Sourcify Docs

Finding Auxdatas in the Bytecode

February 12, 2024 · 8 min read

The problem

Source code verification requires compiling a contract written in a high-level language (e.g. Solidity, Vyper) to the bytecode, and comparing the compiled bytecode with the onchain bytecode. If there’s a match, we can say the given high-level code is the source-code of the contract at the given address.

The runtime bytecode of a contract by default also contains a special field at the end in CBOR encoding (auxdata). This field contains the hash of the contract metadata file (metadata hash), which acts as a fingerprint of the compilation. The metadata file has the compiler settings and the source file hashes, so the slightest change in the compiler settings, or even a whitespace in any of the source files, will cause a change in the metadata hash.

For a visual explanation of everything above, check out playground.sourcify.dev

Because of its sensitivity, some verifiers leave this field out in verification. In Sourcify’s case, if the recompiled and the onchain bytecodes match each other exactly (including the auxdata), we get a “full match”. If not, we need to find the auxdatas and leave them out of the comparison to be able to get at least a "partial match".
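The full vs. partial distinction can be sketched roughly as follows. This is only an illustration and not Sourcify's actual code: `AuxdataRange`, `maskRanges`, and `matchStatus` are hypothetical names, bytecodes are assumed to be hex strings without the `0x` prefix, and the ranges are assumed to be already known and given in hex characters.

```typescript
// Hypothetical types and helpers, for illustration only.
interface AuxdataRange {
  start: number; // offset in hex characters
  length: number; // length in hex characters
}

type MatchStatus = "full" | "partial" | null;

// Overwrite every range with zeros so it no longer affects the comparison.
function maskRanges(bytecode: string, ranges: AuxdataRange[]): string {
  let masked = bytecode;
  for (const { start, length } of ranges) {
    masked =
      masked.slice(0, start) + "0".repeat(length) + masked.slice(start + length);
  }
  return masked;
}

function matchStatus(
  recompiled: string,
  onchain: string,
  auxdataRanges: AuxdataRange[]
): MatchStatus {
  // Byte-for-byte equality, auxdata included.
  if (recompiled === onchain) return "full";
  // Equality once the auxdata regions are ignored.
  if (maskRanges(recompiled, auxdataRanges) === maskRanges(onchain, auxdataRanges)) {
    return "partial";
  }
  return null;
}
```

The hard part, as the rest of this article shows, is producing `auxdataRanges` in a way that cannot be tricked.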

However, this is not always trivial, especially in these cases:

The creation bytecode of a contract does not necessarily have the CBOR encoded part at the very end of the bytecode. Although sometimes it’s found there, this field can be anywhere. In fact the only reason the CBOR encoded part is in the creation bytecode is because the runtime bytecode is embedded inside the creation bytecode as a whole.
When executing the creation bytecode i.e. deploying the contract, the contract’s runtime bytecode needs to be returned. The runtime bytecode is already inside the creation bytecode so this part is extracted and returned by taking the offset and the length for the related bytecode and returning it. This can be anywhere inside the code. (Check this article for a comprehensive deep dive into contract creation)
The runtime bytecode has the CBOR encoded part always at the end of the contract (unless turned off with appendCbor: false). But the bytecode can contain other contract bytecodes nested inside, which also can have their own auxdatas, and these parts need to be ignored for a verification. This is found for example in factory contracts where a contract creates another contract and the child contract’s code is nested in the factory’s bytecode.

Now for other “special” parts of the bytecode, the compiler outputs the positions such as immutables in immutableReferences. Unfortunately this is not the case for auxdatas and we need to look elsewhere and find workarounds.

Workarounds

The compiler does not output the exact positions of the auxdatas, but it does at least output their values. Inside the legacyAssembly object of the compiler output, each auxdata can be found under the key .auxdata.

example legacyAssembly:

{
    ".code": [],
    ".data": {
        "0": {
            ".auxdata": "a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033",
            ".code": [...]
        }
    }
}
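Since child contracts are nested inside `.data`, collecting every auxdata means walking the object recursively. A minimal sketch, assuming the shape shown in the example above (the `LegacyAssembly` type and `collectAuxdatas` are hypothetical names and the type is far from complete):

```typescript
// Only the parts of `legacyAssembly` used here; an assumption based on the
// example above, not a complete type.
interface LegacyAssembly {
  ".auxdata"?: string;
  ".code"?: unknown[];
  ".data"?: { [key: string]: LegacyAssembly | string };
}

// Depth-first walk that gathers every `.auxdata` value it encounters.
function collectAuxdatas(assembly: LegacyAssembly, found: string[] = []): string[] {
  const auxdata = assembly[".auxdata"];
  if (auxdata !== undefined) found.push(auxdata);

  const data = assembly[".data"] ?? {};
  for (const key of Object.keys(data)) {
    const child = data[key];
    // Plain strings are treated as raw data; only objects can nest contracts.
    if (typeof child !== "string") collectAuxdatas(child, found);
  }
  return found;
}
```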

At this point, one could think to do a simple string search in the bytecode for the auxdatas found in legacyAssembly, but it would be possible for an attacker to trick the search function and falsely ignore parts of the bytecode that are not supposed to be ignored.

The vulnerability

Imagine we have the auxdata string from the compiler’s legacyAssembly above.

a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033

This could be the auxdata of a simple child contract inside the whole bytecode that we know won’t be affected by the changes of our main contract.

For this specific example, the attacker could embed these bytes inside the bytecode with code like this in the main contract:

assembly {
    // Split the code from a push opcode:
    // a26469706673582212203a05097003697b26b1da819218bcd957
    // 79 (PUSH26)
    // 598eaa90539e82a59ecbe4c09757e364736f6c63430007000033

    mstore(0x598eaa90539e82a59ecbe4c09757e364736f6c63430007000033, 0xa26469706673582212203a05097003697b26b1da819218bcd957)
    // PUSH26 0xa26469706673582212203a05097003697b26b1da819218bcd957
    // PUSH26 0x598eaa90539e82a59ecbe4c09757e364736f6c63430007000033
    // MSTORE
}

By chance (really), this 53-byte auxdata is split exactly in the middle, but this doesn’t have to be the case. Remember that the large middle portion of the CBOR encoding contains the IPFS hash, so one can salt the sources and iterate until a suitable hash comes up.

Imagine the source code of the attacker compiles to the code below. New lines are added to highlight the (alleged) auxdata part:

0x6080...b732b960691b604482015260640160405180910390fd5b5f80546040516001600160a01b03808516939216917f342827c97908e5e2f71151c08502a66d44b6f758e3ac2f1de95f02eb95f0a73591a35f80546001600160a01b0319166001600160a01b039290921691909117905579
  a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033
  52565b5f6a636f6e736f6c652e6c6f6790505f80835160208501845afa505050565b6101756101a4565b565b5f60208284031215610187575f80fd5b81356001600160a01b038116811461019d575f80fd5b9392505050565b634e487b7160e01b5f52605160045260245ffdfea2646970667358221220212c0514e8c0db310d02690fc2def199f4fba3828f2401ec2b8d7104e450b8b164736f6c63430008180033

This is what we get from the source code the attacker gives us to verify. So we go: “Oh right there's an auxdata a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033 in this bytecode. We should ignore the corresponding part in the (onchain) bytecode to have a partial match.”

Oops, now we are ignoring a part of the bytecode that we're not supposed to. Ignored parts are only meant for non-executable data, whereas these bytes were embedded with an assembly block.

In the attacker’s onchain bytecode (what actually will be executed vs. the verified code) the attacker could have placed anything in this assembly block for 53 bytes. I leave it up to your imagination what can be done with this ignored bytecode block.
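To make this concrete, here is what a naive lookup does on a short excerpt of the bytecode above. `naiveFindAuxdata` is a hypothetical helper that stands for the approach we should not rely on.

```typescript
// The auxdata value reported in `legacyAssembly` (from the example above).
const childAuxdata =
  "a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033";

// A short excerpt of the attacker's compiled bytecode around the embedded bytes.
const excerpt = "91909117905579" + childAuxdata + "52565b5f";

// Hypothetical naive lookup: trusts any occurrence of the auxdata string.
function naiveFindAuxdata(bytecode: string, auxdata: string): number {
  return bytecode.indexOf(auxdata);
}

const foundAt = naiveFindAuxdata(excerpt, childAuxdata);
const byteBefore = excerpt.substring(foundAt - 2, foundAt);
const byteAfter = excerpt.substring(
  foundAt + childAuxdata.length,
  foundAt + childAuxdata.length + 2
);

console.log(foundAt); // 14
console.log(byteBefore); // "79" = PUSH26, so the "auxdata" starts as push data
console.log(byteAfter); // "52" = MSTORE, which consumes the two pushed values
```

The search reports a hit, yet the bytes sit in the middle of executable code.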

The gist is, we need to make sure these to-be-ignored blocks are actually auxdatas and do not come from an executable code block. How do we do it?

The solution(s)

Well, we know that the IPFS hash inside the auxdata is the hash of the metadata file and the metadata file contains the source file hashes. So we can touch all source files to change their hashes, e.g. by adding a whitespace at the end of each. By touching every single source file, we make sure the nested auxdatas will be modified as well. If we compile again, we will have the exact same bytecode just with differences at the metadata hashes. Then we can locate the metadata hashes by comparing the original and edited bytecodes side by side.
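The side-by-side comparison can be sketched like this. `findDiffRanges` is a hypothetical helper working on hex strings of equal length. Note one simplification: two different hashes can still agree on individual characters by chance, so a single metadata hash may show up as several short ranges that would need to be merged.

```typescript
// Half-open range [start, end) in hex characters.
interface DiffRange {
  start: number;
  end: number;
}

// Hypothetical helper: list the contiguous regions where two bytecodes differ.
function findDiffRanges(original: string, edited: string): DiffRange[] {
  if (original.length !== edited.length) {
    throw new Error("Bytecodes must have the same length to be compared");
  }
  const ranges: DiffRange[] = [];
  let start = -1; // -1 means we are currently not inside a differing region
  for (let i = 0; i < original.length; i++) {
    const differs = original[i] !== edited[i];
    if (differs && start === -1) start = i;
    if (!differs && start !== -1) {
      ranges.push({ start, end: i });
      start = -1;
    }
  }
  if (start !== -1) ranges.push({ start, end: original.length });
  return ranges;
}
```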

But we need one more thing: Now we know where the metadata hashes are but that is just a substring of the whole CBOR auxdata. So we need to figure out where the CBOR auxdata starts and ends.

Blockscout solution

One way to do this is to start at the metadata hash positions we've found by comparing, extend the substring byte by byte, and each time try to decode the whole byte string as CBOR. Once decoding succeeds, we know that the auxdata ends here. Remember that right after the CBOR encoding you'll find the length of the encoded part, so we know where it starts as well.

Indeed this is how Blockscout finds the auxdata positions.
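The length suffix mentioned above is easiest to see on an auxdata that sits at the very end of a bytecode. In Solidity's encoding the length takes two bytes, big-endian, and counts only the CBOR payload. The sketch below is not Blockscout's implementation; `splitTrailingAuxdata` is a hypothetical helper, and it only yields a candidate since it does not check that the bytes really decode as CBOR.

```typescript
// Hypothetical helper: split a bytecode into code and trailing auxdata
// using the 2-byte length suffix.
function splitTrailingAuxdata(bytecode: string): { code: string; auxdata: string } {
  const hex = bytecode.startsWith("0x") ? bytecode.slice(2) : bytecode;
  // The last two bytes hold the byte length of the CBOR payload (big-endian).
  const cborLength = parseInt(hex.slice(-4), 16);
  // Auxdata = CBOR payload + the 2 length bytes, converted to hex characters.
  const auxdataHexLength = (cborLength + 2) * 2;
  if (Number.isNaN(cborLength) || auxdataHexLength > hex.length) {
    throw new Error("No plausible auxdata length at the end of the bytecode");
  }
  return {
    code: hex.slice(0, hex.length - auxdataHexLength),
    auxdata: hex.slice(hex.length - auxdataHexLength),
  };
}
```

For the example auxdata in this article the suffix is `0033`, i.e. 51 bytes of CBOR, which together with the two length bytes gives the 53 bytes mentioned earlier.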

Sourcify solution

The way we approach this in Sourcify is by again making use of the legacyAssembly.

These are roughly the steps:

Use bytecodes: Compare the original bytecode to the whitespaced (edited) contract’s bytecode. This will give us the positions of the metadata hashes. Remember, these are not the whole auxdatas.
Use legacyAssembly: Compare the auxdatas from the legacyAssembly outputs of both contracts. We will get an auxdataDiff between each pair of auxdatas (1st auxdata in original vs 1st in edited etc.). The diff will not be exactly the whole metadata hash: CIDv0 IPFS hashes always start with Qm, so only the rest of the hash differs. The other parts of the auxdatas will be the same. We also keep the position of the diff inside the whole auxdata as diffStart:
```
interface AuxdataDiff {
    real: string;
    diffStart: number;
    diff: string;
}
```

Remember these are the metadata hashes. If they are equal, we can now find where the whole auxdata starts with:

for (const position of positions) {
    for (const auxdataDiff of auxdataDiffs) {
        // Check whether the diff found in the raw bytecode equals the diff from the `legacyAssembly` auxdatas
        if (editedBytecode.substring(position, position + auxdataDiff.diff.length) === auxdataDiff.diff)
            return originalBytecode.substring(position - auxdataDiff.diffStart, position + auxdataDiff.diff.length);
    }
}

Original:

0x6080...                CBOR auxdata     
1909117905579a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c6343000700003352565b5f6a636f6e736f6c652e6c6

Edited:

0x6080...                CBOR auxdata     
1909117905579a2646970667358221220dceca8706b29e917dacf25fceef95acac8d90d765ac926663ce4096195952b6164736f6c6343000700003352565b5f6a636f6e736f6c652e6c6
             └──────────────────┘↑
                diffStart        position
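For the pair of auxdatas shown above, an AuxdataDiff could be computed along these lines. This is a sketch under assumptions: `getAuxdataDiff` is a hypothetical name, and the meaning given to `real` (the original auxdata) and `diff` (the differing part of the edited auxdata) is our reading of the interface, not something the snippet above confirms. Offsets are in hex characters.

```typescript
interface AuxdataDiff {
  real: string; // assumed: the original auxdata
  diffStart: number; // offset of the first differing hex character
  diff: string; // assumed: the differing part of the edited auxdata
}

// Hypothetical helper: span from the first to the last differing character.
function getAuxdataDiff(original: string, edited: string): AuxdataDiff | null {
  if (original.length !== edited.length) return null;
  let first = -1;
  let last = -1;
  for (let i = 0; i < original.length; i++) {
    if (original[i] !== edited[i]) {
      if (first === -1) first = i;
      last = i;
    }
  }
  if (first === -1) return null; // identical auxdatas, nothing to diff
  return { real: original, diffStart: first, diff: edited.substring(first, last + 1) };
}

const originalAuxdata =
  "a26469706673582212203a05097003697b26b1da819218bcd95779598eaa90539e82a59ecbe4c09757e364736f6c63430007000033";
const editedAuxdata =
  "a2646970667358221220dceca8706b29e917dacf25fceef95acac8d90d765ac926663ce4096195952b6164736f6c63430007000033";

const exampleDiff = getAuxdataDiff(originalAuxdata, editedAuxdata);
console.log(exampleDiff?.diffStart); // 20
console.log(exampleDiff?.diff.length); // 64 hex characters, i.e. 32 bytes
```

If the two hashes happened to share their first or last characters, the diff would come out slightly shorter than 32 bytes, which is why diffStart is stored instead of assumed.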

An Alternative

Start with a string search inside the bytecode for the auxdatas from legacyAssembly of the contract. Now we have the positions of potential auxdatas of the original contract.
Next we whitespace the source files and compile the contract again. Let’s call it the edited contract.
Finally we check if the bytecode substrings from the original contract and the edited contract have changed at the positions we found at the 1st step. We expect these to change if they indeed contain a real auxdata and not some custom bytecode.
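The three steps above can be sketched as follows. The helper names are hypothetical, bytecodes are hex strings, and the edited bytecode is assumed to come from recompiling the whitespaced sources.

```typescript
// Step 1: every position where `needle` occurs in `haystack`.
function findAllOccurrences(haystack: string, needle: string): number[] {
  const positions: number[] = [];
  if (needle.length === 0) return positions;
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    positions.push(index);
    index = haystack.indexOf(needle, index + 1);
  }
  return positions;
}

// Step 3: keep only the candidates whose bytes changed in the edited bytecode.
function confirmAuxdataPositions(
  originalBytecode: string,
  editedBytecode: string,
  auxdatas: string[]
): number[] {
  const confirmed: number[] = [];
  for (const auxdata of auxdatas) {
    for (const position of findAllOccurrences(originalBytecode, auxdata)) {
      const editedSlice = editedBytecode.substring(position, position + auxdata.length);
      // A real auxdata changes when the sources change; embedded bytes do not.
      if (editedSlice !== auxdata) confirmed.push(position);
    }
  }
  return confirmed;
}
```

Bytes planted with an assembly block stay identical after the whitespace edit, so they are filtered out while the genuine auxdata is kept.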

Thanks to Rim from Blockscout for pointing out this alternative.

Making life easier for verifiers

To avoid all these fiddly workarounds, we have proposed that the Solidity compiler output the positions of the auxdatas, similar to the immutableReferences field: https://github.com/ethereum/solidity/issues/14827

We are still going to need to do this for the compiler versions before this gets implemented but still it would be less work in verification, particularly not having to compile contracts twice.

Since we edited the original source code with whitespaces and compiled the contract, we also have the legacyAssembly for the edited contract, which contains auxdatas. If we compare all the auxdatas extracted from the legacyAssembly outputs of both, we will get a diff of each auxdata field, which will be the metadata hashes. The rest of the auxdatas will be the same.

We Need to Talk About the On-Chain Metadata Hash

August 13, 2023 · 8 min read

Kaan Uzdogan

Introduction

The Solidity compiler has a feature, not known by everyone, that appends the IPFS hash of the contract metadata to the contract bytecode. This hash effectively acts as a fingerprint of the compilation, and when deployed, goes onchain. With that, we can verify the contracts "perfectly" and fetch the contract source code from IPFS. One of our missions at Sourcify is to make this feature more known and used, but not everyone is a fan of it.

(If you don't fully understand the metadata hash check out our playground to see it in action.)

I argue this is the only foolproof way to verify contracts. Languages and tooling should come together and come up with a common standard. We should look back at what worked and what didn't, and come up with a better next version.

Runtime code vs Creation code

In source-code verification you compare a bytecode to a high-level code (Solidity, Vyper).

When you compile a contract you get two bytecodes:

Runtime bytecode is the code of the contract living on the blockchain. This is what really gets executed when you call a contract. You'll find it if you look at the bytecode of an unverified contract in a block explorer or when you call eth_getCode(address) on the contract.

Creation bytecode is the code that will be executed by the EVM when the contract is being deployed, which will store the runtime code at contract's address.

Since the terms are not well defined, some terminology:

"code" = "bytecode" in this context. Sometimes people just call it "runtime code", or "creation code".
"Init code" = "Creation bytecode". This is usually used in create2 context.
"Deployed Bytecode" = "Runtime Bytecode". This is another common way to refer to the runtime bytecode by the Solidity compiler and frameworks. I refrain from using this as sometimes the contract is not deployed and "runtime code" is more accurate.
evm.bytecode = "Creation bytecode". The Solidity compiler refers to it as this in the output.
evm.deployedBytecode = "Runtime bytecode". Same as above.

Which bytecode?

Let's go back to source code verification. The problem we are trying to solve is that we have a contract and we want to see its original source code, because we humans can't really read bytecode.

However, a contract has two bytecodes, which one should we compare the source code to?

Verifying with Creation Bytecode

One can say that the bytecode counterpart of a contract written in a high-level language is the creation bytecode, because in a typical contract deployment this is what you give to the EVM to execute.

The problem with the creation bytecode is that it's not always stored onchain. The only time you see this is when you deploy a contract from an Externally Owned Account (EOA) by putting the creation bytecode in the tx.data and setting the receiver tx.to to null. In that case you'll see the creation bytecode if you look at the transaction.

However, for contracts created by other contracts (e.g. factories) it is executed once and then discarded. So someone needs to index and save the creation bytecodes somewhere and you need to trust them. Whereas the runtime bytecode is stored onchain and you can request it from your node with eth_getCode.

On the other hand, the creation bytecode of a contract is not necessarily what the compiler outputs. The creation bytecode can be any code that will execute and store the runtime bytecode at the contract address. See @ricmoo's CREATE2 example. He demonstrates how to deploy and SELFDESTRUCT a contract, and finally deploy a completely different contract at the same address, even though CREATE2 addresses depend on the init code. In this case the init code is the same, but it dynamically gets and writes the contract code from somewhere else. If you change the code where it's dynamically fetched from, you deploy a different contract at the same address. So for this contract, even if we knew its original source code, we couldn't compile it and compare against its creation code.

Verifying with the Runtime Bytecode

The runtime bytecode is the actual code of the contract and is readily available at eth_getCode. The compiler also outputs the runtime bytecode so one can verify contracts with the runtime bytecode too. With that, you can easily verify a contract on the "edge" (i.e. on your machine) trustlessly by getting the bytecode from your execution client.

The compiler output can be different than the onchain one as during deployment the runtime bytecode can be modified by writing the immutable values and the linked libraries in the placeholders. It's ok because, for Solidity, the compiler outputs the immutableReferences and libraries have a __$ placeholder, so we know where these are positioned in the bytecode.
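As a rough illustration of how immutableReferences can be used: the compiler output has zeros where the immutable values go, so the same byte ranges can be zeroed in the onchain bytecode before comparing. `zeroImmutables` is a hypothetical helper, bytecodes are assumed to be hex strings without the `0x` prefix, and library placeholders are not handled here.

```typescript
// Shape of `immutableReferences` in the compiler output: AST id -> byte ranges.
interface ImmutableReferences {
  [astId: string]: { start: number; length: number }[];
}

// Hypothetical helper: zero out the immutable slots of the onchain runtime
// bytecode so it can be compared with the compiler's runtime bytecode.
function zeroImmutables(onchainRuntime: string, refs: ImmutableReferences): string {
  let result = onchainRuntime;
  for (const astId of Object.keys(refs)) {
    for (const { start, length } of refs[astId]) {
      // `start` and `length` are in bytes; the string is in hex characters.
      result =
        result.slice(0, start * 2) +
        "0".repeat(length * 2) +
        result.slice((start + length) * 2);
    }
  }
  return result;
}
```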

The problem is, not everything in high-level contract code is represented in the runtime bytecode. Imagine this contract excerpt:

    constructor() {
        owner = msg.sender;
        emit OwnerSet(address(0), owner);
    }

I can deploy this contract but verify it with a slightly different contract with the following constructor, which can have huge implications:

    constructor() {
        owner = tx.origin;
        emit OwnerSet(address(0), owner);
    }

This is because this constructor code part will not be included in the runtime bytecode, and the owner value is not stored inside the bytecode but in the contract's storage.

Verifying with the Runtime Bytecode + Metadata Hash

There's a way around this problem. If you verify a contract with its metadata hash appended to the runtime bytecode, you'll get a full match. This means the source code you are looking at is exactly the same as the one that was originally compiled, because if you change anything about the contract (even a whitespace), the metadata hash will change and you will not get a "full match" but a "partial match".

This, I'd argue, is the only foolproof way to verify a contract's source code. This method covers all the cases above and the ones I haven't mentioned or we don't know about yet. By being based on the runtime code, this also removes the need to trust a third party to index the creation bytecode, and instead you can get the bytecode from your own execution client's JSON RPC interface.

Problems with the Metadata Hash

The main criticism of this feature is that the hash is too sensitive. It's both a bug and a feature that the hash changes even with a whitespace change.

A bigger problem is with the paths of the .sources.

  ...
  "sources": {
    "myDirectory/myFile.sol": {
      "keccak256": "0x123...",
      "license": "MIT",
      "urls": [ "bzz-raw://7d7a...", "dweb:/ipfs/QmN..." ]
    }
  }

The keys here are actually not file paths but source-unit names, meaning they can be arbitrary strings. This is especially a problem for projects deploying with CREATE2, where the address of the contract depends on the init code. Any difference in "path" will lead to a different metadata hash --> different bytecode --> different contract address. As a result, most of them just turn off this feature.

It's a bigger problem if the same codebase does not compile to the same bytecode on different platforms. The differences caused by comments/whitespaces are not that big of a deal if we can verify contracts at the deployment pipeline i.e. right at the point when they are deployed. This also means we need to stop flattening contracts. Ideally you never drag and drop any files to a website, but use a verification plugin on your tooling (Foundry, Hardhat) or IDE (Remix). No medium size contract would manually be verified.

What would be a more clever way to do this? If we are able to get this right, we solve most of the problems.

Conclusion

The two bytecodes associated with a contract are not always sufficient to correctly verify a contract. The only foolproof and decentralized way to do it is to use the runtime bytecode with the metadata hash appended to it. I believe this needs to be the default way to verify contracts, and only when you can't do it (like this bug) should you fall back to the partial match. Although at Sourcify we base our verification on this, most of the ecosystem doesn't make the partial vs full match distinction or just isn't aware of it.

As an outcome of this article I'd really want to see:

Other cases where a runtime bytecode or creation bytecode fails to correctly verify a contract.
Counter-arguments to the usefulness of the metadata hash.
Clever ways to mitigate the problems with the metadata hash.
Languages other than Solidity adopting this feature, and coming up with a standard for it.

Do you have anything to add to the points above? Please reach out to me on Twitter or add your remarks in the discussion issue for this article (I'll link). I'll also be updating this article with the feedback I get, and be linking to discussions. This will be a living document.

Human-Readable Transactions Working Group

April 3, 2023 · 4 min read

Kaan Uzdogan

TLDR;

Human-readability of Ethereum Transactions is a multi-faceted and complex problem that requires ecosystem-wide collaboration. Therefore, it makes sense to create a working group to gather people, projects, and knowledge.

Motivation

It is a well-known UX problem in Ethereum that users usually don't/can't verify the action they are about to take, because they are not presented with human-readable information. This has led to social engineering hacks where victims lost millions. In one case, a hacker was able to replace the browser wallet, which made the victim sign a transfer transaction on his HW wallet that sends all the tokens to the hacker. In another, the hacker created an offline signature for the victim to list all his NFTs for free.

As a basic example, our goal is to show something similar to the one on the right rather than on the left.

Bytecode vs Human-Readable Tx

Nowadays, many wallets can do the basic ABI decoding and show a verified contract link but users still lack a description of the action they are about to take and additional safety information about the contract they are going to interact with.

How we achieve this at Sourcify is through the NatSpec documentation. If you document your code using NatSpec's @notice and @dev fields and fully verify your contract on Sourcify, the wallet can show the users the description you wrote when calling the function. (details in this talk at Devcon VI or this lightning talk).

Over time it became clear to me that even if we convince the majority of developers to document using NatSpec and fully verify on Sourcify, this single route won't solve this wicked problem of Human-readable Transactions. The problem is multi-faceted and requires different approaches for different cases. For instance, you can't add NatSpec docs to an already deployed contract, or you can't use Dynamic Expressions for a commit-reveal transaction (e.g. ENS commit).

Actually, there are different approaches, some of which we gathered in the Sourcify docs. Unfortunately, most of them seem to be stale.

Another motivation for us has been the lack of knowledge of what's going on in the space. Even though we were working on this problem, we haven't been aware of the following for a long time:

I wasn't aware of the two EIPs EIP-4430: Described Transactions and EIP-3224: Described Data for a long time. Similarly, I didn't know (Draft) EIP: Rich Site-Proposed Contract Metadata
Although we mostly think of the software wallets when thinking about human readability, the hardware wallets work in a much more contained environment and need different approaches (Illustrated by alexmiller.eth from GridPlus)
Until Devcon 6, we weren't aware of the Rosette Protocol and that they'd written radspec in Typescript, which was what we needed.

Solving this problem of transaction human readability is hard and requires ecosystem-wide collaboration.

For this reason, it makes sense to form a "Human-Readable Transactions Working Group" focused on this specific problem with different interested parties.

Scope

How do we define the scope?

Our starting point is the human-readability of the transactions but this really cannot be separated from the safety, UX and human-friendliness. Depending on the progress, other UX and safety aspects are expected to be included in the general work (audits, token registries etc.). Initially, it's called “human-readable tx's WG”, but we'll see where it goes.

The work will mostly be on EVM, but not specific to the Ethereum network.

Goals

🎯 Being the Schelling Point: Gather different parties working on the transaction readability, security, and UX in the same place. Enable collaboration between parties, and make sure everyone knows who's working on what.
📚 Being the knowledge base: Discuss and compile the different approaches to the problem. Lay out the advantages and disadvantages of different methods. Document them for the public.
🌟 Open-source the solutions to solve it once and for all.

The goal, however, is not to work on a single agreed solution to the problem. As said, there is no single solution to this problem due to its complexity and context dependence. Likely, there will be conflicts and forks, and each team will focus on what they think is the best way. Ideation and active feedback should allow us to reach the best solutions faster.

Structure

This is also a TBD but one potential place for this WG is CASA.

Interested?

Are you working on similar problems and want to collaborate? Reach out to me on Twitter @kaanuzdogan, Matrix @kuzdogan:matrix.org, or Telegram (@kuzdogan)!

Sourcify v2

March 13, 2023 · 3 min read

Kaan Uzdogan

Today we released Sourcify v2 🎉

The changes do not affect the Sourcify Server API in a non-backwards compatible way. If you are using the Sourcify API you don't need to worry. However, there are some non-breaking additions, detailed below.

Why is this a major update then?

We are removing and deprecating the npm packages:
Introducing the backbone library @ethereum-sourcify/lib-sourcify
Rewriting the server and monitor code based on @ethereum-sourcify/lib-sourcify.

Motivation

The motivation for these changes is to make Sourcify verification more reusable. The lib-sourcify package can be imported into other projects and verify a contract given the source files, and chain&address. Another goal was to create modularity in the codebase with more separated concerns. With these changes, Sourcify server consumes the core lib-sourcify functionality, and takes care of the rest: providing an API, validating inputs, and storing the results (in the repo) etc.

This is in line with what we want to achieve with edge verification. We believe contract verification should be easily reproducible and you should be able to verify contracts locally without relying on a third party.

Imagine you're interacting with a contract on your wallet. Before you sign a transaction your wallet:

fetches the contract's source code from IPFS
compiles and verifies with lib-sourcify

without even talking to Sourcify or any other verifier, everything happens on your local machine. Similarly a block explorer like Otterscan can give its users the option to either fetch the verified source code directly from a verifier (like Sourcify), or verify the contract locally on the frontend.

However, the library as it is now is not compatible with browsers yet, and we are working on it. If you are knowledgeable on this front and want to help us, please reach out to us.

lib-sourcify

The brand new @ethereum-sourcify/lib-sourcify is the library that will do all the weightlifting of assembling a contract (e.g. source files) into a compilable CheckedContract, compiling, and verifying it. You can pass checkFiles your contract source code and metadata.json to pack compilable CheckedContracts.

const pathBuffers: PathBuffer[] = [];
pathBuffers.push({
  path: filePath,
  buffer: fs.readFileSync(filePath),
});
const checkedContracts: CheckedContract[] = await checkFiles(pathBuffers);

Then you can verify this CheckedContract against a contract that is deployed on a chain at an address.

const goerliChain = {
  name: "Goerli",
  rpc: [
    "https://localhost:8545/",
    "https://goerli.infura.io/v3/${INFURA_API_KEY}",
  ],
  chainId: 5,
};

const match = await verifyDeployed(
  checkedContracts[0],
  goerliChain,
  '0x00878Ac0D6B8d981ae72BA7cDC967eA0Fae69df4'
)

console.log(match.status) // 'perfect'

Creator Tx Hash

We can also verify contracts by looking at the tx.input of the transaction that created the contract. If this matches the creation bytecode of the compiled contract AND the address resulting from the tx.from and tx.nonce matches the given address, we can verify the contract.

const match = await verifyDeployed(
  checkedContracts[0],
  goerliChain,
  "0x00878Ac0D6B8d981ae72BA7cDC967eA0Fae69df4",
  undefined,
  "0xe75fb554e433e03763a1560646ee22dcb74e5274b34c5ad644e7c0f619a7e1d0" // tx hash
);

(In the server API, find the field creatorTxHash)

CREATE2

You can also verify CREATE2 created contracts:

const match = await verifyCreate2(
  checkedContracts[0],
  deployerAddress,
  salt,
  create2Address,
  abiEncodedConstructorArguments
);

console.log(match.chainId); // '0'. create2 matches return 0 as chainId
console.log(match.status); // 'perfect'

Questions? Feedback?

As usual, feel free to reach out to us on Twitter, Matrix chat, or Gitter.

✅ Happy verifying!

Verify Contracts Perrrrrfectly: Why and How?

September 2, 2022 · 7 min read

Kaan Uzdogan

In an ecosystem with the core values of transparency, security, and trust (and trustlessness), all contract developers are expected to publish their source code. If you're even slightly familiar with Ethereum, there is no need for further explanation.

But if I give you a source code, how do you make sure the published source code really is the source code of the contract? That's where source code verification comes into play.

note

Throughout this article and 99% of the time in Sourcify context, by verification we will be referring to smart contract verification. Verification sometimes also refers to formal verification.

What is source code verification?

First things first, all the smart contracts on the blockchain are stored as bytecode. Just like our physical machines that only speak bits and bytes, the Ethereum Virtual Machine also only understands bytes. If you ask the Ethereum blockchain for the code of a contract, you only get a byte string.

So, let's say I give you a contract in Solidity and claim that this is the code behind the contract at "0xabcdef...". To verify, you need to make sure this code compiles to the same bytecode as the claimed contract at "0xabcdef...". This is the basic idea behind the smart contract verification: we compile a contract and check if the bytecode matches the one on blockchain.

Visualization of the compilation of a contract Checking if bytecodes match

You have probably made use of contract verification before. For many users this is the green checkmark in Etherscan:

Green checkmark in a verified contract page on Etherscan

You see the green checkmark and you are happy!

But is it really exactly the same code that is deployed?

The answer is, you don't know 🤷

In fact, no one else would be able to know except the contract developer, and he/she can't really prove it. The reason is, when compiling the contract i.e. translating the human-readable source code (in Solidity or any other higher-level language) to machine-readable bytecode, some information is lost. These include internal variable names, internal function names, names of contracts etc.

So yes this is functionally the same code as deployed: it compiles to the same bytecode as the original mysterious source code 🕵.

And you might be thinking, sure this is good enough. But:

Someone can insert misleading comments, (internal) function or variable names
Whoever verifies a contract first is chosen as the matching result, not the "authentic" one
We can't verify things other than the contract's code itself (i.e. metadata)

In fact when not verified properly, it is possible to inject code that would be shown in the verified source code.

Enough bad news... There's actually a way to verify Solidity contracts that would cryptographically ensure the exactness of the source files and it is already here: It's called Sourcify!

This way of verifying contracts is what we call a perfect verification (in contrast to partial verification). It is enabled by the Solidity contract metadata and the fact that its hash is appended to the contract's bytecode. The metadata hash acts as a fingerprint of the whole compilation, and with the information in the metadata file we can completely reproduce the contract compilation.

Contract Metadata

The Solidity compiler by default appends some information to the contract's bytecode in CBOR encoding. This special field, which I like referring to as auxdata, usually contains the "Solidity version", the "metadata hash", and occasionally the "experimental" flag. The encoded data and its decoding look like this:

Decoding of the auxdata appended to the bytecode

You can actually inspect this field and see the decoding in action for any contract in playground.sourcify.dev.
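
To make the decoding concrete, here is a minimal Python sketch that pulls the auxdata off the end of a bytecode and decodes it. It assumes the common layout where the last two bytes give the length of the CBOR data that precedes them, and it implements only the handful of CBOR types the auxdata needs. The names `decode_auxdata` and `SAMPLE_BYTECODE`, as well as the sample digest, are made up for illustration and are not Sourcify's implementation.

```python
def _decode_item(data, pos):
    """Decode one CBOR item starting at pos; return (value, next_pos)."""
    major, info = data[pos] >> 5, data[pos] & 0x1F
    pos += 1
    if major == 7:  # simple values: only false (20) and true (21) are handled
        if info in (20, 21):
            return info == 21, pos
        raise ValueError("unsupported CBOR simple value")
    if info < 24:  # small argument stored in the initial byte itself
        arg = info
    elif info <= 27:  # argument stored in the following 1, 2, 4 or 8 bytes
        size = 1 << (info - 24)
        arg = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("unsupported CBOR length encoding")
    if major == 0:  # unsigned integer
        return arg, pos
    if major == 2:  # byte string
        return data[pos:pos + arg], pos + arg
    if major == 3:  # text string
        return data[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 5:  # map with `arg` key/value pairs
        result = {}
        for _ in range(arg):
            key, pos = _decode_item(data, pos)
            value, pos = _decode_item(data, pos)
            result[key] = value
        return result, pos
    raise ValueError("unsupported CBOR major type")


def decode_auxdata(bytecode_hex):
    """Decode the CBOR auxdata assumed to sit at the very end of the bytecode."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    cbor_length = int.from_bytes(code[-2:], "big")  # last 2 bytes: CBOR length
    if cbor_length + 2 > len(code):
        raise ValueError("length bytes do not describe a valid auxdata")
    cbor = code[-(cbor_length + 2):-2]
    decoded, _ = _decode_item(cbor, 0)
    return decoded


# Illustrative auxdata: {"ipfs": <34-byte multihash>, "solc": 0.8.18}.
# The 32-byte digest is a placeholder, not a real metadata hash.
SAMPLE_CBOR = (
    bytes.fromhex("a2" "6469706673" "5822" "1220") + bytes(range(32))
    + bytes.fromhex("64736f6c63" "43" "000812")
)
SAMPLE_BYTECODE = (
    "0x6080604052" + SAMPLE_CBOR.hex() + len(SAMPLE_CBOR).to_bytes(2, "big").hex()
)

aux = decode_auxdata(SAMPLE_BYTECODE)
print(".".join(str(b) for b in aux["solc"]))  # 0.8.18
print(aux["ipfs"].hex())
```

Reading the length from the last two bytes only works when the auxdata really is at the end, which is usually the case for runtime bytecode but not guaranteed for creation bytecode.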

To see how this cryptographically ensures the exactness of the source files we need to look into the contents of the metadata file. The metadata file is a JSON document that looks like this and contains information on two things:

How to interact with the contract: ABI, documentation
How to reproduce a contract compilation: compiler version and settings, source file information

The latter is the relevant part for our purposes, specifically the fact that the metadata file contains source file hashes. To illustrate this, let's walk through what happens when you compile a contract and what happens when you change a source file.

When you compile a contract, the compiler computes the hashes of the source files and embeds this information in the metadata file. On the right side, you see the relevant fields of the metadata file:

Embedding of the hash of the source files inside the metadata file

Then the compiler takes the hash of this whole file:

Taking the IPFS hash of the metadata file

And encodes it in the auxdata at the end of the bytecode:

Encoding of the metadata hash at the end of the bytecode

So if you were to decode the auxdata you'd see:

Decoding of the metadata hash at the end of the bytecode

What happens when we change something in the source files? Say we change a variable name or a comment in the new MyContract-diff.sol file. In turn the hash of the file changes, as well as the hash in the metadata:

The change of the hash when the source file changes

...and of course the hash of the metadata file changes:

The change of the hash when the metadata file changes

...and the auxdata changes:

The change of the auxdata when the hash changes
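
The chain of hashes above can be imitated in a few lines of Python. This is only a sketch of the idea: `hashlib.sha256` stands in for both the keccak256 source hash and the IPFS hash of the metadata file, the metadata layout is heavily simplified, and the function name `metadata_fingerprint` is hypothetical.

```python
import hashlib
import json


def sha256_hex(data):
    """Stand-in hash; the real metadata uses keccak256 and IPFS hashes."""
    return hashlib.sha256(data).hexdigest()


def metadata_fingerprint(sources, settings):
    """Hash every source file, embed the hashes in a metadata-like
    document, then hash that whole document."""
    metadata = {
        "settings": settings,
        "sources": {
            path: {"hash": sha256_hex(content.encode("utf-8"))}
            for path, content in sources.items()
        },
    }
    # Serialize deterministically so identical inputs give identical hashes.
    serialized = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return sha256_hex(serialized.encode("utf-8"))


settings = {"optimizer": {"enabled": False, "runs": 200}}
original = {"MyContract.sol": "contract MyContract { uint x; } // original comment\n"}
changed = {"MyContract.sol": "contract MyContract { uint x; } // edited comment\n"}

print(metadata_fingerprint(original, settings))
print(metadata_fingerprint(changed, settings))  # differs: only a comment changed
```

Even though the comment has no effect on the compiled code, the two fingerprints differ, which is exactly why the auxdata changes.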

So if both the bytecode and the appended auxdata match, we have byte-for-byte exactly the same source code and compilation settings as the original deployed contract. This is a perfect verification.

The perfect verification

If the bytecode matches but not the auxdata (which includes the metadata hash), we have a partial verification.

The partial verification
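
The distinction between the two outcomes can be sketched as a small comparison routine. This is a simplified illustration, not Sourcify's matching logic: it assumes each bytecode carries a single auxdata at the very end, preceded by a two-byte length, and the helper names (`strip_auxdata`, `match_type`, `with_auxdata`) are hypothetical.

```python
def strip_auxdata(code):
    """Drop the trailing CBOR auxdata and its 2-byte length suffix."""
    cbor_length = int.from_bytes(code[-2:], "big")
    if cbor_length + 2 > len(code):
        raise ValueError("length bytes do not describe a valid auxdata")
    return code[:-(cbor_length + 2)]


def match_type(recompiled, onchain):
    """Return "perfect", "partial", or None for two bytecodes."""
    if recompiled == onchain:
        return "perfect"  # executable code and auxdata both match
    if strip_auxdata(recompiled) == strip_auxdata(onchain):
        return "partial"  # same executable code, different auxdata
    return None


def with_auxdata(body, digest):
    """Build sample bytecode: body + CBOR {"ipfs": ..., "solc": 0.8.18} + length."""
    cbor = (
        bytes.fromhex("a2" "6469706673" "5822" "1220") + digest
        + bytes.fromhex("64736f6c63" "43" "000812")
    )
    return body + cbor + len(cbor).to_bytes(2, "big")


body = bytes.fromhex("6080604052")
onchain = with_auxdata(body, bytes(32))             # placeholder digest
same = with_auxdata(body, bytes(32))
other_metadata = with_auxdata(body, bytes([1]) * 32)
other_code = with_auxdata(bytes.fromhex("6080604053"), bytes(32))

print(match_type(same, onchain))            # perfect
print(match_type(other_metadata, onchain))  # partial
print(match_type(other_code, onchain))      # None
```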

Did you notice?

If you are familiar with IPFS and paid attention, you might ask: Can't we already get everything from the bytecode itself?

And yes, if published on IPFS, you can actually fetch the source code from the bytecode of a contract, because all the information is already there:

The metadata IPFS hash is appended to the bytecode so (if published) you can fetch the metadata file.
The metadata file contains (alongside the normal keccak256) the IPFS hashes of the source files so you can fetch the complete source code from IPFS.
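
As a rough sketch of the first step, the 34-byte `ipfs` value from the auxdata can be turned into the familiar `Qm...` identifier by base58-encoding it. The code below assumes a sha2-256 multihash (prefix `0x12 0x20`), uses a placeholder digest, and builds a URL for one public gateway (`ipfs.io`) purely as an example; the function names are hypothetical.

```python
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc_encode(data):
    """Encode bytes with the Bitcoin base58 alphabet."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    # Each leading zero byte is represented by a leading "1".
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def metadata_cid(ipfs_field):
    """Turn the auxdata's `ipfs` bytes into a base58 IPFS identifier."""
    if len(ipfs_field) != 34 or ipfs_field[:2] != b"\x12\x20":
        raise ValueError("expected a sha2-256 multihash (0x12 0x20 + 32 bytes)")
    return base58btc_encode(ipfs_field)


# Placeholder multihash; a real one comes from decoding a contract's auxdata.
sample_multihash = b"\x12\x20" + bytes(range(32))
cid = metadata_cid(sample_multihash)
print(cid)
print("https://ipfs.io/ipfs/" + cid)  # example gateway URL, nothing is fetched
```

The metadata file fetched this way lists the source files with their own IPFS addresses, so the same lookup can be repeated for each source.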

So there's only one thing that you need to do as a contract developer: Publish your source files and metadata on IPFS.

Why do you need verification then? Isn't the source file already out there?

Although unlikely, since the compiler appends it automatically, someone can change the auxdata of a contract before deploying it and point you to different, unrelated source code. We make sure it really is the same code by fully recompiling the provided files and comparing the resulting bytecode. Plus, we share all verified contracts in our repository on IPFS to make sure they stay available.

Conclusion

Perfect verification enables more secure and transparent verification of contracts, as well as other useful things such as decoding transactions and enabling human-readable contract interactions, but that is a topic for another article.

Next level smart contract verification is already here. We just need to adopt this way of verifying contracts as a community. Obviously, we need a lot of tooling, integrations, and more awareness. Let's step up and make this the standard way of verifying contracts!

(This article is a summary of my recent talks about Sourcify. If you are interested in learning more, check out one of the latest talks)
