Content addressable hashing summary¶
redun uses hashing to create deterministic ids for many records in the redun data model. The design of this hashing scheme has several properties:
Every hash is unique across all records.
While a cryptographically secure hash function can be assumed to not produce hash collisions, we can still have collisions on pre-image data if we are not careful. For example, a CallNode could have the same hash as a file that contains a pickle of a CallNode, if the hashing schema is not carefully designed.
To achieve this property, every pre-image is prefixed with a record type, such as ‘Task’, ‘CallNode’, etc.
It is also easy to have pre-image collisions if arguments to a hash are not well escaped (e.g. if tab is used as an argument delimiter, you need to always escape tabs in string arguments). We solve this systematically by defining a consistent way of efficiently serializing structures using bittorrent encoding format (bencode).
When one parental record refers to a child record, the child record’s hash is included in the parent’s hash.
This forms a Merkle Tree that gives unique hashes to nodes in the resulting DAG.
Hashes are represented by hex strings instead of bytes to aid easy debugging and display.
If greater hashing efficiency is desired, hashes could be kept in their binary byte representation.
Hashes are truncated to 40 hex bytes for easier display.
If the reduced hash space is a concern this truncation could be removed.