Content addressable hashing summary#

redun uses hashing to create deterministic ids for many records in the redun data model. The design of this hashing scheme has several properties:

Every hash is unique across all records.
- While a cryptographically secure hash function can be assumed to not produce hash collisions, we can still have collisions on pre-image data if we are not careful. For example, a CallNode could have the same hash as a file that contains a pickle of a CallNode, if the hashing schema is not carefully designed.
- To achieve this property, every pre-image is prefixed with a record type, such as ‘Task’, ‘CallNode’, etc.
- It is also easy to have pre-image collisions if arguments to a hash are not well escaped (e.g. if tab is used as an argument delimiter, you need to always escape tabs in string arguments). We solve this systematically by defining a consistent way of efficiently serializing structures using bittorrent encoding format (bencode).
When one parental record refers to a child record, the child record’s hash is included in the parent’s hash.
- This forms a Merkle Tree that gives unique hashes to nodes in the resulting DAG.
Hashes are represented by hex strings instead of bytes to aid easy debugging and display.
- If greater hashing efficiency is desired, hashes could be kept in their binary byte representation.
Hashes are truncated to 40 hex bytes for easier display.
- If the reduced hash space is a concern this truncation could be removed.