carrot-transform

Quick Start

Carrot transform is run from the command line. It now supports Poetry for managing its Python dependencies. To run it, enter:

poetry run python carrot_transform.py [args]

For example, you can get the version number with:

poetry run python carrot_transform.py -v

There are many mandatory and optional arguments for carrot transform. In the quick start, we will demonstrate the mandatory arguments on a test case (taken from carrot-CDM) included in the repository. Enter the following (as one command):

poetry run python carrot_transform.py run mapstream carrottransform/examples/test/inputs \
--rules-file \
carrottransform/examples/test/rules/rules_14June2021.json \
--person-file \
carrottransform/examples/test/inputs/Demographics.csv \
--output-dir \
carrottransform/examples/test/test_output \
--omop-ddl-file \
carrottransform/config/OMOPCDM_postgresql_5.3_ddl.sql \
--omop-config-file \
carrottransform/config/omop.json

This should create a set of output files in this directory:

carrottransform/examples/test/test_output
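
Because the quick-start command is long, it can help to build it as a list of arguments so that each flag and path stays a separate token. The following is a minimal sketch, not part of carrot-transform itself. The helper name build_quick_start_command is hypothetical. The flags and paths are copied from the command above, and the sketch assumes it is run from the repository root.

```python
import shlex

# Paths and flags are copied from the quick-start command above.
EXAMPLE_DIR = "carrottransform/examples/test"
CONFIG_DIR = "carrottransform/config"


def build_quick_start_command():
    """Return the quick-start invocation as a list of separate arguments.

    Hypothetical helper for illustration; carrot-transform does not ship it.
    """
    return [
        "poetry", "run", "python", "carrot_transform.py",
        "run", "mapstream", f"{EXAMPLE_DIR}/inputs",
        "--rules-file", f"{EXAMPLE_DIR}/rules/rules_14June2021.json",
        "--person-file", f"{EXAMPLE_DIR}/inputs/Demographics.csv",
        "--output-dir", f"{EXAMPLE_DIR}/test_output",
        "--omop-ddl-file", f"{CONFIG_DIR}/OMOPCDM_postgresql_5.3_ddl.sql",
        "--omop-config-file", f"{CONFIG_DIR}/omop.json",
    ]


command = build_quick_start_command()
# Print a single-line form of the command that can be pasted into a shell.
print(shlex.join(command))
```

To execute the command rather than print it, the list can be passed to subprocess.run(command, check=True) from the standard library.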

Arguments

Required:

input-dir
	Directory containing input files.

--rules-file
	JSON file containing mapping rules.

--person-file
	File containing person_ids in the first column.

--output-dir
	Output directory for OMOP-format TSV files.

Either:

--omop-ddl-file
	File containing OHDSI DDL statements for OMOP tables. Rather than being specified explicitly, this file can be found automatically if --omop-version is given. See --omop-version for further details.

AND

--omop-config-file
	File containing additional/override JSON config for OMOP outputs. Rather than being specified explicitly, this file can be found automatically if --omop-version is given. See --omop-version for further details.

OR:

--omop-version
	OMOP version, e.g. "5.3". Required if neither --omop-ddl-file nor --omop-config-file is set. In that case, the software will look for carrottransform/config/omop.json and carrottransform/config/OMOPCDM_postgresql_XX_ddl.sql to import, where XX is the version number entered as the argument.
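
The lookup rule for the OMOP files can be illustrated with a short sketch. This is not carrot-transform's actual code. The function name resolve_omop_files and its error message are hypothetical, and the sketch assumes that each file falls back to the version-based path independently when it is not given explicitly.

```python
from pathlib import Path

# Directory named in the --omop-version description.
CONFIG_DIR = Path("carrottransform/config")


def resolve_omop_files(omop_ddl_file=None, omop_config_file=None, omop_version=None):
    """Return (ddl_path, config_path) following the Either/OR rule.

    Hypothetical helper: explicit files win; a missing file is derived
    from omop_version; with no file and no version, an error is raised.
    """
    missing_file = omop_ddl_file is None or omop_config_file is None
    if missing_file and omop_version is None:
        raise ValueError(
            "give --omop-ddl-file and --omop-config-file, or --omop-version"
        )
    if omop_ddl_file is not None:
        ddl_path = Path(omop_ddl_file)
    else:
        ddl_path = CONFIG_DIR / f"OMOPCDM_postgresql_{omop_version}_ddl.sql"
    if omop_config_file is not None:
        config_path = Path(omop_config_file)
    else:
        config_path = CONFIG_DIR / "omop.json"
    return ddl_path, config_path


ddl, config = resolve_omop_files(omop_version="5.3")
print(ddl.as_posix())
print(config.as_posix())
```

With omop_version="5.3", the derived DDL path matches the file passed explicitly in the quick-start command.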

Optional:

--write-mode
	default = w
	options: w, a
	Select whether to write new output files or append to existing output files.

--saved-person-id-file
	Full path to the person id file used to save person_id state and share person_ids between data sets.

--use-input-person-ids
	default = N
	options: Y, N
	If set to anything other than "N", person ids will be used from the input files. If set to "N" (default behaviour), person ids will be replaced with new integers.

--last-used-ids-file
	Full path to the last used ids file for OMOP tables. The file should be in a tab-separated format:
	tablename	last_used_id
	where last_used_id must be an integer.
	  
--log-file-threshold
	default = 0
	Change the output count limit for logfile output. The logfile will contain the threshold number of output results.
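
The tab-separated layout expected by --last-used-ids-file can be checked with a small sketch before a run. This is an illustration only, not carrot-transform's own parser. The function name parse_last_used_ids is hypothetical, the table names and id values in the sample are made up, and the sketch assumes the file has no header row.

```python
import csv
import io


def parse_last_used_ids(lines):
    """Parse 'tablename<TAB>last_used_id' rows into a dict.

    Hypothetical helper; assumes no header row is present.
    """
    last_used = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {row!r}")
        tablename, raw_id = row
        # int() raises ValueError if last_used_id is not an integer.
        last_used[tablename] = int(raw_id)
    return last_used


# Made-up sample content in the documented format.
sample = io.StringIO("person\t100\nobservation\t2500\n")
print(parse_last_used_ids(sample))
```

For a real file, the open file object can be passed in place of the StringIO sample, for example with open(path, newline="") as handle.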

Carrot transform reduces complexity relative to the original CaRROT-CDM version by covering only the Transform part of ETL. In practice, Extract is always performed by Data Partners, and Load by database bulk-load software.