Integrate longitudinal high-dimensional data in a data warehouse using the git commit graph to store temporal information and git annex to store large data in a content-addressable fashion.

Maxime Wack 323b6be7d2 (2 months ago): Change branch naming convention !! Separate type and name with '/' instead of '|'. '|' is not a widely accepted character, whereas '/' is natural to group branches together.

README.org

Installation
Operations
- Add
  - Patient
  - Sample
  - Data
  - Result
  - Diagnosis
- List
  - Patient
  - Sample/Data/Result/Diagnosis
- Get
  - PROV
  - Graph
  - Timeline
  - Last
  - Object
  - File
  - Log
  - SPARQL

Git ommix helps manage high-dimensional data (e.g. omics, imagery, pathology) in a longitudinal manner, coupled with a representation of its provenance using the PROV ontology.

Git ommix creates patient-level repositories to store sample references, versioned data obtained from the samples, and the versioned results of the data analysis and the ensuing diagnoses.

Large files are retrieved only on demand thanks to git annex, which decouples navigating the history from actually downloading all of it.

Git ommix also stores a representation of the provenance of each of those entities using the PROV ontology. Git ommix allows querying the repository structure and implements several useful operations. These operations can apply to the patient's whole history or be constrained to one or more specific objects (sample/data/result/diagnosis):

- list the objects contributing to the target (the data contributing to a result or to a diagnosis, the samples contributing to a diagnosis)
- get the most recent version of the target
- get the PROV-O provenance of the target, as Turtle triples or as a visual graph
- display a timeline of diagnoses
- execute any SPARQL query on a repo

Installation

Requirements

GitOmmix is implemented as a bash script. It relies mostly on git, but also uses:

- git annex (https://git-annex.branchable.com/) to handle large files (10.20230926)
- rapper (https://librdf.org/raptor/rapper.html) to manage RDF stores (2.0.15)
- roqet (https://librdf.org/rasqal/roqet.html) to query RDF stores (0.9.33)
- graphviz (https://graphviz.org/) to generate visual representations (2.42.2)
- bash-completion (https://github.com/scop/bash-completion/) to benefit from autocompletions in bash (2.11)

Git ommix has been tested on Ubuntu 22.04.3 LTS (Jammy). Install raptor2-utils and rasqal-utils to get rapper and roqet. Bash-completion should already be installed, and graphviz is also available from the official repositories. However, the version of git-annex provided by Ubuntu is too old (8), and version 10 should be installed instead. The latest version can be obtained from this repo: http://neuro.debian.net/pkgs/git-annex-standalone.html

macOS users can find all the required dependencies on Homebrew.
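Before installing, it can help to confirm that the tools listed in the requirements are on the PATH. The following is a minimal sketch and not part of git ommix: the function names missing_tools and version_major are made up for this example, and dot is assumed to be the graphviz binary that matters here.

```shell
#!/bin/sh
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Extract the major version from a version string such as 10.20230926.
version_major() {
    printf '%s\n' "${1%%.*}"
}

# Tools named in the requirements; "dot" is assumed to stand for graphviz.
missing=$(missing_tools git git-annex rapper roqet dot)
if [ -n "$missing" ]; then
    # Unquoted on purpose: one "missing:" line per tool name.
    printf 'missing: %s\n' $missing
else
    echo "all required tools found"
fi

# The requirements ask for git annex version 10 rather than version 8.
if command -v git-annex >/dev/null 2>&1; then
    major=$(version_major "$(git-annex version --raw)")
    [ "$major" -ge 10 ] || echo "git annex $major is too old, version 10 is needed"
fi
```

The check only looks at whether the commands exist and at the git annex major version. It does not compare the other tools against the versions listed in the requirements.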

Installation

Run sudo make install to install git ommix on your computer.

Running tests

From the root directory of this repository, run tests/<name>.test to run the test called <name>.
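To run every test in one go, a small loop over the test files is enough. This is a sketch under two assumptions that the README does not confirm: each tests/<name>.test file is executable, and a test signals failure through a non-zero exit status. The function name run_all_tests is hypothetical.

```shell
#!/bin/sh
# Run every executable *.test file in a directory (default: tests).
# Returns non-zero if at least one test failed.
# Assumption: a test reports failure through a non-zero exit status.
run_all_tests() {
    dir=${1:-tests}
    failed=0
    for t in "$dir"/*.test; do
        # Skips non-executable files and the unexpanded glob when nothing matches.
        [ -x "$t" ] || continue
        if "$t" >/dev/null 2>&1; then
            echo "ok   ${t##*/}"
        else
            echo "FAIL ${t##*/}"
            failed=$((failed + 1))
        fi
    done
    echo "$failed test(s) failed"
    [ "$failed" -eq 0 ]
}
```

Called as run_all_tests tests from the root directory of the repository, it invokes each test through the same tests/<name>.test path as described above.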

Operations

All git ommix commands follow the same pattern:

git ommix {verb} {object} [--options] [--message] [rest]

git ommix does not have to, and should not, be called from within the git ommix store. It can be run from any directory containing files to add to a patient's history.

Add

Group of operations used to create the patient stores.

All operations accept these options:

- --id: the new object's id, if one needs to be provided; otherwise an id is randomly generated
- --method: an optional PROV Activity used to generate the new object
- --provider: an optional PROV Agent involved in generating the new object
- --date: the date of creation of the object; defaults to the current date

Patient

git ommix add patient

Sample

git ommix add sample -p|--patient <patient>

Add a sample to <patient>.

Data

git ommix add data -p|--patient <patient> -s|--sample <sample> [--revision_of <data>] [--invalidate <data>] [FILES]

Add [FILES] to a data object in <sample> of <patient>. FILES defaults to all the files in the current directory. All data in a sample derive from (use) the <sample>. New data files can be a revision of previous <data> in the same <sample>, and can also invalidate previous <data> in the same <sample>. --invalidate can be specified multiple times to invalidate multiple <data> in the same <sample> with the new data.

Result

git ommix add result -p|--patient <patient> -s|--sample <sample> --use <data> [--revision_of <result>] [--invalidate <result>] [FILES]

Add [FILES] to a result object in <sample> of <patient>. FILES defaults to all the files in the current directory. A result derives from (use) <data> in the same <sample>. --use can be specified multiple times to derive the new result from multiple <data> in the same <sample>. New result files can be a revision of previous <result> in the same <sample>, and can also invalidate previous <result> in the same <sample>. --invalidate can be specified multiple times to invalidate multiple <result> in the same <sample> with the new result.
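Since --use and --invalidate may each be given several times, a script that holds the ids in a list has to expand them into repeated flags (long options are written with two hyphens). A minimal sketch follows; repeat_flag is a hypothetical helper, and patient1, sample1, data1, data2 and result1 are placeholder ids. The command is only printed, not executed.

```shell
#!/bin/sh
# Expand a flag and a list of values into "flag value" pairs,
# e.g. repeat_flag --use a b  ->  --use a --use b
# Assumption: the values contain no whitespace.
repeat_flag() {
    flag=$1
    shift
    out=
    for value in "$@"; do
        out="${out:+$out }$flag $value"
    done
    printf '%s\n' "$out"
}

# Placeholder ids, for illustration only.
cmd="git ommix add result --patient patient1 --sample sample1 $(repeat_flag --use data1 data2) $(repeat_flag --invalidate result1)"

# Print the command for review instead of running it.
echo "$cmd"
```

No FILES argument is given in this example, so the new result would take all the files in the current directory, as described above.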

Diagnosis

git ommix add diagnosis -p|--patient <patient> --use <result|diagnosis> [--revision_of <diagnosis>] [--invalidate <diagnosis>]

Diagnoses live outside of samples and can be used to tie multiple results from different samples into a clinically coherent history. A diagnosis derives from (use) a <result> or a previous <diagnosis>. --use can be specified multiple times to derive the new diagnosis from multiple <result> or <diagnosis>. A new diagnosis can be a revision of a previous <diagnosis>, and can also invalidate previous <diagnosis>. --invalidate can be specified multiple times to invalidate multiple <diagnosis> with the new diagnosis.
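The add operations build on each other: a sample belongs to a patient, data to a sample, a result uses data, and a diagnosis uses results. The sketch below walks through that order with long options written with two hyphens. All ids (patient1, biopsy1, rnaseq1, expr1, diag1) and file names are invented for illustration, and passing --id with an explicit value to every command is an assumption about how one would keep the ids predictable. By default the commands are only printed; the run and build_history helpers are hypothetical.

```shell
#!/bin/sh
# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Build a minimal history: patient -> sample -> data -> result -> diagnosis.
# All ids and file names are placeholders.
build_history() {
    run git ommix add patient --id patient1
    run git ommix add sample --patient patient1 --id biopsy1
    run git ommix add data --patient patient1 --sample biopsy1 --id rnaseq1 reads.fastq
    run git ommix add result --patient patient1 --sample biopsy1 --id expr1 --use rnaseq1 counts.tsv
    run git ommix add diagnosis --patient patient1 --id diag1 --use expr1
}

build_history
```

Setting DRY_RUN=0 would execute the commands for real, which requires git ommix to be installed and the listed files to exist in the current directory.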

List

Patient

git ommix list patient

List all the patients known in the local store

Sample/Data/Result/Diagnosis

git ommix list sample|data|result|diagnosis -p|--patient <patient> [ref]

List all the sample|data|result|diagnosis objects in <patient>. [ref] limits the list to the history of [ref]. [ref] can be expressed as a commit hash or an object name (type:id or id). Multiple [ref] can be provided. IDs matching multiple objects expand to multiple [ref].
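Because a bare id can match several objects, scripts may want to tell the two object-name forms apart and prefer the qualified type:id form. The helpers below are a sketch of how the type:id notation can be split in shell. They are not git ommix's own parser, the names ref_type and ref_id are hypothetical, and commit hashes are not handled.

```shell
#!/bin/sh
# Print the type part of an object name, or an empty line for a bare id.
ref_type() {
    case $1 in
        *:*) printf '%s\n' "${1%%:*}" ;;
        *)   printf '\n' ;;
    esac
}

# Print the id part of an object name (everything after the first colon,
# or the whole name when there is no colon).
ref_id() {
    printf '%s\n' "${1#*:}"
}

# Placeholder object names, for illustration only.
for ref in data:rnaseq1 rnaseq1; do
    printf 'ref=%s type=%s id=%s\n' "$ref" "$(ref_type "$ref")" "$(ref_id "$ref")"
done
```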

Get

Nearly all the get commands accept or even require a [ref]. As before, [ref] constrains the result to the context of [ref]. [ref] can be expressed as a commit hash or an object name (type:id or id). Multiple [ref] can be provided. IDs matching multiple objects expand to multiple [ref].

PROV

git ommix get prov -p|--patient <patient> [ref]

Output the RDF graph as Turtle triples

Graph

git ommix get graph -p|--patient <patient> [ref]

Output a graphical representation of the RDF graph

Timeline

git ommix get timeline -p|--patient <patient> [ref]

Output a graphical representation of the patient's clinical history, omitting samples, data, and results

Last

git ommix get last -p|--patient <patient> <ref>

Get the up-to-date version of the referenced object, as well as the most recent diagnosis it contributes to

Object

git ommix get object -p|--patient <patient> <ref>

Check out the patient's repo at the given object

File

git ommix get file -p|--patient <patient> [ref]

List the files added by the given object

Log

git ommix get log -p|--patient <patient> [ref]

Print the git log of the patient's repo

SPARQL

git ommix get sparql -p|--patient <patient> "SPARQL query"

Output the result of the SPARQL query as Turtle triples
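As an illustration, a query over the provenance graph could be kept in a shell variable and passed as the single quoted argument. This is a sketch under several assumptions: the README does not say which PROV-O predicates the store contains, so prov:wasDerivedFrom is a guess based on the "derives from" wording used for the add operations, and the CONSTRUCT form is chosen only because the output is described as triples. patient1 is a placeholder id, and the command is printed rather than executed.

```shell
#!/bin/sh
# A SPARQL query listing every derivation link in the provenance graph.
# Assumption: derivations are stored as prov:wasDerivedFrom triples.
query=$(cat <<'EOF'
PREFIX prov: <http://www.w3.org/ns/prov#>
CONSTRUCT { ?derived prov:wasDerivedFrom ?source }
WHERE     { ?derived prov:wasDerivedFrom ?source }
EOF
)

# Print the command for review; patient1 is a placeholder id.
printf 'git ommix get sparql --patient patient1 "%s"\n' "$query"
```

Quoting the heredoc delimiter ('EOF') keeps the shell from expanding the ?variables or anything else inside the query.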
