Carrot-Transform Quick Start Guide

Installation

Carrot-Transform is available on PyPI, so you can install it with:

pip install carrot-transform

If you are working with the source code, refer to the Development Notes.

Running Carrot-Transform

To execute Carrot-Transform, run:

carrot-transform [command] [options]

For example, you can get the version number with:

carrot-transform -v

Carrot-Transform accepts a number of mandatory and optional arguments. This quick start demonstrates the mandatory arguments on a test case (taken from carrot-CDM) included in the repository.

Basic Example

To process the test dataset, run the following as a single command:

carrot-transform run mapstream \
  --input-dir @carrot/examples/test/inputs \
  --rules-file @carrot/examples/test/rules/rules_14June2021.json \
  --person-file @carrot/examples/test/inputs/Demographics.csv \
  --output-dir carrottransform/examples/test/test_output \
  --omop-ddl-file @carrot/config/OMOPCDM_postgresql_5.3_ddl.sql \
  --omop-config-file @carrot/config/omop.json

‘@carrot’ is an alias for the folder containing the carrot-transform module, and it works with either installation method. When using your own files, give their actual paths and omit the alias.

This will generate a set of output files in this directory:

carrottransform/examples/test/test_output

If it doesn’t exist, this directory should be created for you.
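For scripted runs, the same invocation can be assembled in Python. This is a minimal sketch, not part of Carrot-Transform: the helper name build_mapstream_command is hypothetical, and it only arranges the flags and paths shown in the example above.

```python
import shlex


def build_mapstream_command(input_dir, rules_file, person_file, output_dir,
                            omop_ddl_file, omop_config_file):
    """Return the argument list for `carrot-transform run mapstream`.

    Hypothetical helper: it only arranges the flags used in the basic example.
    """
    return [
        "carrot-transform", "run", "mapstream",
        "--input-dir", input_dir,
        "--rules-file", rules_file,
        "--person-file", person_file,
        "--output-dir", output_dir,
        "--omop-ddl-file", omop_ddl_file,
        "--omop-config-file", omop_config_file,
    ]


command = build_mapstream_command(
    input_dir="@carrot/examples/test/inputs",
    rules_file="@carrot/examples/test/rules/rules_14June2021.json",
    person_file="@carrot/examples/test/inputs/Demographics.csv",
    output_dir="carrottransform/examples/test/test_output",
    omop_ddl_file="@carrot/config/OMOPCDM_postgresql_5.3_ddl.sql",
    omop_config_file="@carrot/config/omop.json",
)

# Print a single-line form of the command that can be pasted into a shell.
print(shlex.join(command))
```

To launch the run from the script, the list can be passed to subprocess.run(command, check=True). Passing a list rather than one string avoids quoting problems when a path contains spaces.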

Arguments

Required Arguments

Flag            Description
--input-dir     Directory containing input files
--rules-file    JSON file defining mapping rules
--person-file   CSV file (or table name) with person IDs and DOB
--output-dir    Directory to write OMOP-format TSV files

Person ID File

Carrot-Transform uses a single Person ID file (or table), which must be specified using --person-file.

  • The first column must contain person IDs (these will be anonymized).
  • A column named "date" must hold each person’s date of birth.
  • Person IDs not found in this file will be excluded from all OMOP tables.
  • This file must also reside in the same directory as the other input files (when using CSV mode).
  • This file can itself be an input to transformation rules.
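The rules above can be checked before a run. The following is a sketch only, under stated assumptions: check_person_file is a hypothetical helper (not part of Carrot-Transform), the file is assumed to be comma-delimited UTF-8 with a header row, and the "date" column name is matched case-sensitively.

```python
import csv
from pathlib import Path


def check_person_file(person_file, input_dir):
    """Return a list of problems found in a CSV person file (empty if none)."""
    person_path = Path(person_file)
    problems = []

    # CSV mode: the person file must sit alongside the other input files.
    if person_path.resolve().parent != Path(input_dir).resolve():
        problems.append("person file is not inside the input directory")

    with person_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return problems + ["person file is empty"]
        # The date of birth must be in a column named "date".
        if "date" not in header:
            problems.append('no column named "date"')
        # The first column holds the person IDs, so it should never be blank.
        # Row 1 is the header, so data rows are counted from 2.
        for row_number, row in enumerate(reader, start=2):
            if not row or not row[0].strip():
                problems.append(f"row {row_number}: missing person ID")

    return problems
```

An empty list means the file passed these three checks. It does not confirm that the dates themselves are in a format Carrot-Transform accepts.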

OMOP Configuration (Choose One Approach)

Approach          Required Arguments
Specify Files     --omop-ddl-file (DDL statements for OMOP tables) and --omop-config-file (override JSON config)
Specify Version   --omop-version (e.g., 5.3, which will automatically find carrottransform/config/omop.json and carrottransform/config/OMOPCDM_postgresql_XX_ddl.sql)
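When a run is scripted, the choice between the two approaches can be made explicit. The sketch below uses a hypothetical helper, omop_arguments, that returns the flags for exactly one approach. Rejecting a mix of the two is an assumption of this sketch; how Carrot-Transform itself handles that case is not described here.

```python
def omop_arguments(omop_version=None, omop_ddl_file=None, omop_config_file=None):
    """Return the OMOP-related flags for one of the two approaches."""
    files_given = omop_ddl_file is not None or omop_config_file is not None

    # Assumption of this sketch: the two approaches are mutually exclusive.
    if omop_version is not None and files_given:
        raise ValueError("give either --omop-version or the two file flags, not both")
    if omop_version is not None:
        return ["--omop-version", str(omop_version)]
    if omop_ddl_file is not None and omop_config_file is not None:
        return ["--omop-ddl-file", omop_ddl_file,
                "--omop-config-file", omop_config_file]
    raise ValueError("give --omop-version, or both --omop-ddl-file and --omop-config-file")


# Approach 1: name the version and let the bundled files be located.
print(omop_arguments(omop_version="5.3"))

# Approach 2: name both files explicitly.
print(omop_arguments(
    omop_ddl_file="@carrot/config/OMOPCDM_postgresql_5.3_ddl.sql",
    omop_config_file="@carrot/config/omop.json",
))
```

The returned list can be appended to the rest of the carrot-transform argument list.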

Optional Arguments

Flag                      Default   Description
--write-mode              w         Set to w (overwrite) or a (append) for output files
--saved-person-id-file    None      Path to a file to save and share person_id state
--use-input-person-ids    N         Use input person IDs (Y) or replace with new integers (N)
--last-used-ids-file      None      Path to a file tracking last used IDs (tab-separated format)
--log-file-threshold      0         Change output limit for log files
--input-db-url            None      SQLAlchemy connection string for database input
Database Input (Experimental)

Instead of reading from .csv files, Carrot-Transform can read directly from a database using SQLAlchemy.

  • Provide a connection string via --input-db-url.
  • All input tables must match the format expected by Carrot-Transform.
  • The --rules-file must still point to a local file on disk.
  • The --person-file will be interpreted as a table name, not a file path.

Examples:

  • --person-file C:/foo/bar/all_the_people.csv will run:
    SELECT * FROM all_the_people;
  • --person-file demographics_makeup.csv will run:
    SELECT * FROM demographics_makeup;
  • --input-db-url "sqlite:///./testing.db" will connect to an SQLite database in the file ./testing.db
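The two --person-file examples suggest that the table name is the file name with its folders and extension removed. The sketch below reproduces that mapping. The helper person_table_name is hypothetical and only mirrors the examples; it is not necessarily how Carrot-Transform derives the name internally.

```python
from pathlib import PureWindowsPath


def person_table_name(person_file):
    """Guess the table name queried for a given --person-file value.

    Assumption: the table name is the file name without folders or extension,
    as in the two examples above.
    """
    # PureWindowsPath accepts "/" and "\" separators as well as drive letters,
    # and works the same on every operating system.
    return PureWindowsPath(person_file).stem


for value in ("C:/foo/bar/all_the_people.csv", "demographics_makeup.csv"):
    print(f"SELECT * FROM {person_table_name(value)};")
# prints:
# SELECT * FROM all_the_people;
# SELECT * FROM demographics_makeup;
```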

Database Workflow

Carrot-Transform can read input tables through SQLAlchemy. This is experimental and requires passing a connection string as --input-db-url instead of an --input-dir folder. The --person-file parameter and the carrot-mapper workflow are still used as if working with .csv files; only the source of the input data changes.

  1. Extract/export some rows from the various tables.
    • For example, the output of SELECT column_name(s) FROM patients LIMIT 1000; is written to patients.csv.
  2. Perform the usual scan reports on these subsets.
  3. When invoking carrot-transform, specify --input-db-url with a database connection string instead of --input-dir.
    • The --person-file parameter should still point to the equivalent of person_tablename.csv.
    • The --rules-file parameter needs to refer to a file on disk as usual.
  4. Carrot-Transform will still write data to --output-dir and otherwise operate as normal.
  • The following parameters have undefined behaviour with this functionality:
    • --write-mode
    • --saved-person-id-file
    • --use-input-person-ids
    • --last-used-ids-file
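The export in the first step of the workflow can be scripted. The sketch below is one way to do it, under assumptions that go beyond this guide: it uses Python's built-in sqlite3 module as a stand-in for whichever database is in use, it selects every column (SELECT *) rather than a chosen list, and the helper name export_sample and the demo rows are made up for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_sample(connection, table, csv_path, limit=1000):
    """Write up to `limit` rows of `table` to `csv_path`, header row first.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so validate before quoting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = connection.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    header = [column[0] for column in cursor.description]

    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Demo with an in-memory database and made-up rows.
with tempfile.TemporaryDirectory() as folder:
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE patients (person_id TEXT, date TEXT)")
    connection.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [("p1", "1980-01-01"), ("p2", "1975-06-30")],
    )
    sample_file = Path(folder) / "patients.csv"
    print(export_sample(connection, "patients", sample_file), "rows written")
    connection.close()
```

The resulting patients.csv is what the scan report step would then be run on.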