Spam or Legit? Analyzing 4byte Selector Collisions

Sourcify recently took over openchain.xyz's 4byte signature APIs, as well as the domain itself, which were maintained by @samczsun. Since we also built the database behind them, I wanted to run a quick analysis of the selector collisions and see how many are legitimate and how many were deliberately generated (spam).

samczsun's tweet

We built the dataset and a service to serve the data. The dataset contains:

  1. Data from openchain's dataset
  2. Data from 4byte.directory
  3. Signatures from verified contracts in Sourcify.

You can see the database schema in the database docs and the service in services/4byte in the repo. As of now we have 4.7 million signatures: 1.9 million do not come from verified contracts, and the remaining majority appear in at least one verified contract (stats).

While it's possible to submit signatures to the database via the /import endpoint, we also add signatures automatically whenever a contract is verified. The 4byte databases are known to be spam-prone: a function selector is only 4 bytes long (the first 4 bytes of the Keccak-256 hash of the signature), so it's trivial to find a collision with an otherwise legitimate signature.

For example, see the collisions of the ERC20 transfer(address,uint256) function under its selector 0xa9059cbb on our 4byte.sourcify.dev page: https://4byte.sourcify.dev/?q=0xa9059cbb

4byte.sourcify.dev transfer(address,uint256) collisions
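To make the 4-byte selector concrete, here is a minimal sketch that derives it in pure Python. It is not code from the Sourcify service: the helper names (`rol64`, `keccak_f1600`, `keccak256`, `selector`) are made up for illustration, and the permutation follows the compact public reference description of Keccak. Note that Ethereum uses the original Keccak-256 padding, so `hashlib.sha3_256` would give a different result.

```python
MASK64 = (1 << 64) - 1


def rol64(value: int, shift: int) -> int:
    """Rotate a 64-bit integer left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def keccak_f1600(lanes):
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 matrix of lanes."""
    lfsr = 1  # small LFSR that generates the round constants bit by bit
    for _ in range(24):
        # theta: mix each column's parity into the neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR the round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (padding byte 0x01, unlike SHA3-256's 0x06)."""
    rate = 136  # bytes absorbed per permutation (1088-bit rate, 512-bit capacity)
    padded = bytearray(data) + bytearray(rate - len(data) % rate)
    padded[len(data)] ^= 0x01
    padded[-1] ^= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate):
        for i in range(rate // 8):
            start = offset + 8 * i
            lanes[i % 5][i // 5] ^= int.from_bytes(padded[start:start + 8], "little")
        lanes = keccak_f1600(lanes)
    # The first 32 output bytes are the first four lanes of row 0.
    return b"".join(lanes[i][0].to_bytes(8, "little") for i in range(4))


def selector(signature: str) -> str:
    """Return the 4-byte selector of a canonical signature as a 0x-prefixed hex string."""
    return "0x" + keccak256(signature.encode("ascii")).hex()[:8]


if __name__ == "__main__":
    # Expected: 0xa9059cbb, the selector quoted above.
    print(selector("transfer(address,uint256)"))
```

Because only 32 bits of the hash survive, a spammer needs on the order of 2^32 candidate names to hit a chosen selector, which is cheap with an optimized hasher (this pure-Python version is far too slow for that and is only meant to show the derivation).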

Seeing this, and having the data at hand, I wanted to do a quick analysis of the selector collisions and see how many are legitimate and how many were deliberately generated (spam).

Analysis

A simple query finds the signatures that share the same 4-byte selector:

Query
-- Group all signatures by their 4-byte selector and keep only the
-- selectors that map to more than one signature.
SELECT
  concat('0x', encode(signature_hash_4, 'hex')) AS signature_hash_4,
  COUNT(*) AS num_signatures,
  ARRAY_AGG(signature ORDER BY signature) AS signatures
FROM public.signatures
GROUP BY signature_hash_4
HAVING COUNT(*) > 1
ORDER BY num_signatures DESC;

In the end we find 2789 4-byte selectors that have more than one signature. Here are the top 5 with the most collisions:

collisions.csv

collisions.json

| signature_hash_4 | num_signatures | signatures |
| --- | --- | --- |
| 0x00000000 | 61 | AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256)); abcei51243fdgjkh(bytes); adfepixw(); ... |
| 0x00000001 | 15 | account_info_rotate_tine(uint256); exec_606BaXt(bytes[]); f00000001_bdmvamqo(); ... |
| 0xa9059cbb | 8 | _____$_$__$___$$$___$$___$__$$(address,uint256); fakeTransfer_4570999670(bytes); func_2093253501(bytes); ... |
| 0x095ea7b3 | 8 | __$$$$___$$_$_$$__$_$_$$__$$$$(address,uint256); approve(address,uint256); as9q06we_7x8z(uint256,address[],address[],uint256); ... |
| 0x70a08231 | 7 | $_$$$_$$$$$_$_$____$$$$_$$_$__(address); balanceOf(address); branch_passphrase_public(uint256,bytes8); ... |

Looking at the top collisions, it might seem like a lot. Still, the spamming does not seem excessive: spammers generally find a single funny signature and call it a day. Out of the 2789 selectors, 2740 have only 2 signatures (i.e. a single collision) and only 49 have more than 2 signatures.
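The same grouping and size breakdown can be sketched outside the database. The snippet below is only an illustration: the function names are hypothetical, and the input rows are a handful of (selector, signature) pairs copied from the results above rather than the full dataset.

```python
from collections import Counter, defaultdict


def group_by_selector(rows):
    """Map each selector to its sorted signatures, keeping only colliding selectors."""
    groups = defaultdict(set)
    for sel, sig in rows:
        groups[sel].add(sig)
    return {sel: sorted(sigs) for sel, sigs in groups.items() if len(sigs) > 1}


def size_histogram(collisions):
    """Count how many selectors have 2 signatures, 3 signatures, and so on."""
    return Counter(len(sigs) for sigs in collisions.values())


# A few rows taken from the tables in this post (not the full dataset).
rows = [
    ("0x095ea7b3", "approve(address,uint256)"),
    ("0x095ea7b3", "as9q06we_7x8z(uint256,address[],address[],uint256)"),
    ("0x70a08231", "balanceOf(address)"),
    ("0x70a08231", "branch_passphrase_public(uint256,bytes8)"),
    ("0x70a08231", "$_$$$_$$$$$_$_$____$$$$_$$_$__(address)"),
    ("0x025313a2", "proxyOwner()"),
]

collisions = group_by_selector(rows)
histogram = size_histogram(collisions)

if __name__ == "__main__":
    for size, count in sorted(histogram.items()):
        print(f"{count} selector(s) with {size} signatures")
```

Run over the full table, a histogram like this is where a split such as "2740 selectors with 2 signatures, 49 with more" would come from.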

The interesting question is, how many of these collisions are actually unintended collisions vs. how many are deliberately generated (spam)?

Looking at them one by one would take some time. First, I want to see only the collisions between signatures that appear in a verified contract. That is, if f00000001_bdmvamqo() has never been seen in a verified contract, let's assume it's spam.

Query
-- Same grouping as before, restricted to signatures that appear
-- in at least one verified (compiled) contract.
SELECT
  concat('0x', encode(s.signature_hash_4, 'hex')) AS signature_hash_4,
  COUNT(*) AS num_signatures,
  ARRAY_AGG(DISTINCT s.signature ORDER BY s.signature) AS signatures
FROM public.signatures s
WHERE EXISTS (
  SELECT 1
  FROM public.compiled_contracts_signatures ccs
  WHERE ccs.signature_hash_32 = s.signature_hash_32
)
GROUP BY s.signature_hash_4
HAVING COUNT(*) > 1
ORDER BY num_signatures DESC;

Now we're left with 1023 "verified" collisions:

collisions_verified.csv

collisions_verified.json

| signature_hash_4 | num_signatures | signatures |
| --- | --- | --- |
| 0x00000000 | 28 | AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256)); arb_wcnwzblucpyf(); batchLock_63efZf(); buyAndFree22457070633(uint256); call_g0oyU7o(address,uint256,bytes32,bytes); ... |
| ... | | (4 rows with 4-5 signatures skipped) |
| 0x415565b0 | 3 | JunionYoutubeXD_clgqmmkfvuba(); Sub2JunionOnYouTube_wuatcyecupza(); transformERC20(address,address,uint256,uint256,(uint32,bytes)[]) |
| 0x00000002 | 3 | callWithPlaceholders4845164670(address,uint256,bytes32,bytes,(address,bytes,uint64,uint64,uint64)[]); wipeBlockchain_EkJWPe(); yoov6(address,address,uint256) |
| 0x6c5b47d2 | 3 | addDegree(uint256,string); isBlacklisted5(address); RenounceFungibleOwnership() |
| 0x9aa7c0e5 | 3 | gain_network883718828((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address); openTrade((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address); TigrisTrade(int8,int56,uint80,bytes15,int88,int16) |
| 0x014ed8d2 | 2 | CannotChangePaymentToken(); ModelRegistered(uint256,address,string,uint256) |
| 0x0161a64a | 2 | cleanupExpiredListing(uint256); MissingRole(address,bytes32) |
| 0x0182a6da | 2 | initiateWalletTransfer(address); withdrawStakingAmount(uint256) |
| 0x01a754a3 | 2 | AutoSwap(); updateTeamFeeContract(address) |
| 0x025313a2 | 2 | getACLRole5999294130779334338(); proxyOwner() |

Now it starts to get interesting. Again, the selectors with many collisions are mostly spam. But among the selectors with 3 or fewer signatures we find some legitimate collisions.

For example, for 0x01a754a3 we have AutoSwap() and updateTeamFeeContract(address). It's really difficult to tell whether either of these is spam.

But for the last row, 0x025313a2, we have getACLRole5999294130779334338() and proxyOwner(). Here the former is clearly spam and the latter is not.
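The "clearly spam" cases tend to share visible patterns, such as long digit runs or runs of underscores and dollar signs in the function name. Below is a rough heuristic sketch of that idea. It is not the method used in this post (which relies on an LLM further down), the function name and thresholds are arbitrary assumptions, and it will miss brute-forced names with random letter suffixes such as arb_wcnwzblucpyf().

```python
import re

# Assumed thresholds, chosen by eye from the examples in this post.
LONG_DIGIT_RUN = re.compile(r"\d{6,}")   # e.g. getACLRole5999294130779334338
SYMBOL_RUN = re.compile(r"[_$]{3,}")     # e.g. _____$_$__$___$$$___$$___$__$$


def looks_generated(signature: str) -> bool:
    """Return True if the function name shows obvious brute-force patterns."""
    name = signature.split("(", 1)[0]  # drop the argument list
    return bool(LONG_DIGIT_RUN.search(name) or SYMBOL_RUN.search(name))


if __name__ == "__main__":
    for sig in [
        "getACLRole5999294130779334338()",
        "proxyOwner()",
        "_____$_$__$___$$$___$$___$__$$(address,uint256)",
        "transfer(address,uint256)",
    ]:
        print(sig, "->", "generated?" if looks_generated(sig) else "looks ordinary")
```

A filter like this is only good for the easy cases. Pairs such as AutoSwap() and updateTeamFeeContract(address) need the kind of judgment that simple patterns cannot provide.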

Next, we can actually ask an LLM to pick out the ones that look like spam! Since the dataset is small enough, I shoved all of it into Claude and asked it to filter out the spam-looking entries. In the end it gave me a list of 648 collisions that it thinks are legitimate. I peeked at the list and it seems to be mostly accurate:

legitimate_collisions.csv

Here are 10 interesting examples of legitimate unintended collisions:

| signature_hash_4 | num_signatures | signatures |
| --- | --- | --- |
| 0x04d742dc | 2 | adminResetRank(); startSale(uint256,uint256,uint256) |
| 0x0536f755 | 2 | FreeMintTokenSent(address,uint256); NFTReward(address) |
| 0x092338cc | 2 | maxPurchasableInOneTx(); usdcGHSTOracle() |
| 0x17915e8d | 2 | getCluster(address); getTotalFeeBps() |
| 0x2025e52c | 2 | createSaleTokensVault(); mintWithERC721(uint256) |
| 0x22061379 | 2 | getBaseStakeAmountForPlay(); vaultFees(uint256) |
| 0x55fcd027 | 2 | DepositAmountTooLow(); masterLogicAddress() |
| 0x667022fd | 2 | bought(address); iceCreamVan() |
| 0x67bf975c | 2 | NotAllowedToRecover(); RewardThresholdReached(uint256) |
| 0x706b8722 | 2 | pauseAtId(); USDTBorrowed(address,uint256) |

These are all legitimate functions and events from different smart contracts that happen to share the same 4-byte selector purely by chance. This demonstrates that while 4-byte collisions are rare, they do happen naturally in the wild!
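As a rough sanity check on "purely by chance", a birthday-style estimate gives the number of colliding pairs one would expect if selectors were uniformly random. This is a back-of-the-envelope sketch, not part of the original analysis: it assumes independent, uniformly distributed 32-bit selectors, uses the rounded figures quoted earlier (4.7 million signatures, 1.9 million of them not from verified contracts), and counts pairs rather than selectors.

```python
def expected_colliding_pairs(n_signatures: int, selector_bits: int = 32) -> float:
    """Expected number of signature pairs sharing a selector, assuming uniform hashing.

    There are n * (n - 1) / 2 pairs, and each pair collides with
    probability 1 / 2**selector_bits.
    """
    return n_signatures * (n_signatures - 1) / 2 / 2 ** selector_bits


# Rounded figures from this post; the verified count is derived by subtraction.
verified = 4_700_000 - 1_900_000
estimate = expected_colliding_pairs(verified)

if __name__ == "__main__":
    print(f"~{estimate:.0f} chance collisions expected among {verified:,} signatures")
```

Under these assumptions the estimate comes out on the order of 900 pairs, the same order of magnitude as the "verified" and "legitimate" counts above, so a few hundred natural collisions in a dataset of this size is not surprising.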

At this stage I just ran this for fun. For now we only have a list of popular selectors for which we know the "correct" signature (canonical-signatures.json). We filter out the non-canonical ones by default and expose a filtered field (filtering can be turned off in the API). But if the community thinks this is useful, we can do a more thorough analysis and filter out the spam via LLMs.