File hashing

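To make my work more transparent, I looked into file hashing. A hash function computes a fixed-length value from a file's contents using a defined algorithm. Identical contents always give the same value, and for practical purposes different contents give different values, so the hash acts as a fingerprint of the file.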

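In R, the rlang package provides the hash_file() function, which uses the XXH128 algorithm to generate a 128-bit hash. XXH128 is a fast, non-cryptographic hash, so it is suited to detecting changes in a file rather than to guarding against deliberate tampering.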
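To make this concrete, here is a minimal sketch, assuming the rlang package is installed. The temporary files and their contents are made up for illustration. It shows that hash_file() depends on a file's contents rather than its name, and that a small change in the contents gives a different hash.

```r
library(rlang)

# Two temporary files with identical content, and a third that differs
# by a single character (all made up for this illustration).
path_a <- tempfile(fileext = ".txt")
path_b <- tempfile(fileext = ".txt")
path_c <- tempfile(fileext = ".txt")
writeLines("a,b\n1,2", path_a)
writeLines("a,b\n1,2", path_b)
writeLines("a,b\n1,3", path_c)

hash_a <- hash_file(path_a)
hash_b <- hash_file(path_b)
hash_c <- hash_file(path_c)

nchar(hash_a)      # the 128-bit hash is returned as 32 hexadecimal characters
hash_a == hash_b   # same content, same hash, even though the file names differ
hash_a == hash_c   # one changed character, different hash
```

The hash is computed from the bytes stored on disk, so two files that hold the same data but were written differently (for example with different line endings or column order) will generally not share a hash.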

This can be used to identify a data file. For reproducible research, you can record the hash values of the data files used in a project, so that anyone rerunning the analysis can check they are working with exactly the same files.

Here, I write two data sets from the datasets package to CSV files and calculate the hash values of these files.

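library(tidyverse)  # attaches readr, dplyr and tibble

# Write two built-in data sets to CSV files in the working directory
write_csv(datasets::iris, "iris.csv")
write_csv(datasets::mtcars, "mtcars.csv")

# List the CSV files and compute one hash per file
tibble(files = fs::dir_ls(glob = "*.csv")) |>
  mutate(hash = rlang::hash_file(files))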

files	hash
iris.csv	dbdc1846dff7fba30a88d5b23e15ea80
mtcars.csv	1d350737ac40dc6fb6ae8f5ad616fc4e
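
Hash values like the ones in the table above can be stored alongside a project and checked again later. The following is a rough sketch of that idea, assuming rlang is installed. verify_hashes() and the manifest data frame are hypothetical names introduced here for illustration and are not part of rlang, and the example works in a temporary directory so it does not touch the files created above.

```r
library(rlang)

# Hypothetical helper (not part of rlang): recompute the hash of each file
# listed in the manifest and compare it with the recorded value.
verify_hashes <- function(manifest) {
  current <- hash_file(manifest$file)
  data.frame(
    file = manifest$file,
    unchanged = current == manifest$hash,
    stringsAsFactors = FALSE
  )
}

# Write a data file into a temporary directory
data_dir <- tempfile("hash-demo-")
dir.create(data_dir)
data_file <- file.path(data_dir, "iris.csv")
write.csv(datasets::iris, data_file, row.names = FALSE)

# Record the hash once, for example when the analysis is published
manifest <- data.frame(
  file = data_file,
  hash = hash_file(data_file),
  stringsAsFactors = FALSE
)

# Later: the file is untouched, so the check passes
before_edit <- verify_hashes(manifest)$unchanged

# Simulate an accidental edit by appending one row, then check again
cat("5.1,3.5,1.4,0.2,setosa\n", file = data_file, append = TRUE)
after_edit <- verify_hashes(manifest)$unchanged

before_edit  # TRUE: contents match the recorded hash
after_edit   # FALSE: contents no longer match
```

In a real project the manifest could be saved as a small CSV file next to the data, so that the check can be rerun whenever the analysis is repeated.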