Cell Maps for AI Data Release

This section describes how to generate a new CM4AI data release. The intended audience is data generation sites and anyone interested in how the datasets on the CM4AI site are created.

There are five steps to the data release:

  1. Each individual dataset must be run through the appropriate cellmaps_utilscmd.py converter command (crisprconverter, apmsconverter, or ifconverter, described below) to generate RO-Crate directories.

  2. These RO-Crate directories must be compressed.

  3. cellmaps_utilscmd.py rocratetable must be given these RO-Crates to generate a table.

  4. The compressed RO-Crate files must be uploaded to CM4AI.

  5. The generated table must be sent to the admin of the CM4AI site so they can load it and display the new data release.

1) Perturbation/CRISPR data release (step 1 above)

The command line tool cellmaps_utilscmd.py crisprconverter takes an h5ad file and copies that file, along with other metadata files, into a RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.

The example below generates a RO-Crate directory under the 0.1alpha folder using an h5ad file named foo.h5ad passed in via the --h5ad flag.

echo "completely fake h5ad file" > foo.h5ad

cellmaps_utilscmd.py -vv crisprconverter 0.1alpha --h5ad foo.h5ad --author 'Mali Lab' \
                     --name 'CRISPR' --organization_name 'Mali Lab' \
                     --project_name CM4AI --release '0.1 alpha' --treatment untreated \
                     --dataset 4channel --cell_line KOLF2.1J --gene_set chromatin \
                     --tissue undifferentiated --num_perturb_guides 6 \
                     --num_non_target_ctrls 109 --num_screen_targets 108

Example contents generated by above command:

0.1alpha/
└── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha
    ├── perturbation.h5ad
    ├── dataset_info.json
    ├── readme.txt
    └── ro-crate-metadata.json
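
A quick way to confirm that a converter run produced the files shown above is to test for each one. The sketch below is a hypothetical helper and not part of cellmaps_utils; it assumes only the three metadata file names that appear in the example listings.

```shell
# Hypothetical helper (not part of cellmaps_utils): report any of the
# metadata files from the example listing that are missing in a RO-Crate.
check_rocrate_files() (
  status=0
  for name in ro-crate-metadata.json dataset_info.json readme.txt; do
    if [ ! -f "$1/$name" ]; then
      echo "missing: $1/$name"
      status=1
    fi
  done
  # the function body is a subshell, so exit sets the function's return code
  exit "$status"
)

# usage: check_rocrate_files 0.1alpha/<rocrate directory>
```

The function returns non-zero when a file is missing, so it can gate later steps in a script.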

Note

Invoke cellmaps_utilscmd.py crisprconverter -h for usage information

Warning

This tool does not currently validate the .h5ad file. Once validation is added, the example above will fail because foo.h5ad is not a real h5ad file.
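
Until the tool validates its input, a rough sanity check can be done by hand. h5ad files are stored as HDF5, and an HDF5 file normally begins with a fixed 8-byte signature. The sketch below is a hypothetical helper, not what cellmaps_utilscmd.py does or will do; it only inspects the first 8 bytes, so it does not prove the file is a usable h5ad file, and it would reject HDF5 files that place a user block before the signature.

```shell
# Hypothetical helper (not part of cellmaps_utils): succeed only if the
# file starts with the HDF5 signature bytes 89 48 44 46 0d 0a 1a 0a.
is_hdf5() (
  # od prints the first 8 bytes as hex; tr removes spaces and newlines
  sig=$(od -An -tx1 -N8 "$1" | tr -d ' \n')
  [ "$sig" = "894844460d0a1a0a" ]
)

# usage: is_hdf5 foo.h5ad || echo "foo.h5ad does not look like HDF5"
```

The fake foo.h5ad created in the example above fails this check, as expected.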

1) Affinity Purification Mass Spectrometry (AP-MS) data release

The command line tool cellmaps_utilscmd.py apmsconverter consumes one or more tsv files that are combined and stored in a RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.

The example below generates a RO-Crate directory under the 0.1alpha folder using a tsv file named DNMT3A.tsv that is passed in via the --inputs flag.

echo -ne 'Bait\tPrey\tPreyGene.x\tSpec\tSpecSum\tAvgSpec\tNumReplicates.x\t' > DNMT3A.tsv
echo -ne 'ctrlCounts\tAvgP.x\tMaxP.x\tTopoAvgP.x\tTopoMaxP.x\tSaintScore.x\t' >> DNMT3A.tsv
echo -e 'logOddsScore\tFoldChange.x\tBFDR.x\tboosted_by.x' >> DNMT3A.tsv
echo -ne 'DNMT3A\tO00422\tSAP18_HUMAN\t6|7|8|10\t31\t7.75\t4\t0|0|0|0|0|0|0|0\t' >> DNMT3A.tsv
echo -e '1\t1\t1\t1\t1\t13.51\t77.5\t0\tNA' >> DNMT3A.tsv
echo -ne 'DNMT3A\tO00571\tDDX3X_HUMAN\t3|7|11|9\t30\t7.5\t4\t0|1|3|3|0|0|0|0\t' >> DNMT3A.tsv
echo -e '0.99\t1\t0.99\t1\t0.99\t3.63\t8.57\t0\tNA' >> DNMT3A.tsv

cellmaps_utilscmd.py apmsconverter 0.1alpha --inputs DNMT3A.tsv \
                     --author 'Krogan Lab' --name 'AP-MS' \
                     --organization_name 'Krogan Lab' --project_name 'CM4AI' \
                     --release '0.1 alpha' --treatment untreated \
                     --cell_line 'MDA-MB-468' --gene_set 'chromatin'

Example contents generated by above command:

0.1alpha/
└── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha
    ├── apms.tsv
    ├── dataset_info.json
    ├── readme.txt
    └── ro-crate-metadata.json

Note

Invoke cellmaps_utilscmd.py apmsconverter -h for usage information
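
Because the AP-MS input is assembled from several tab-separated fragments, it is easy to end up with a row that has too few or too many fields. The sketch below is a hypothetical pre-flight check, not part of cellmaps_utils; it assumes only that the first line is a header and that every later line should have the same number of tab-separated fields. It does not check which columns apmsconverter expects.

```shell
# Hypothetical helper (not part of cellmaps_utils): succeed only when every
# row of a tab-separated file has as many fields as its header line.
check_tsv_columns() (
  awk -F '\t' '
    NR == 1 { expected = NF; next }     # remember the header field count
    NF != expected {
      printf "line %d: %d fields, expected %d\n", NR, NF, expected
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
)

# usage: check_tsv_columns DNMT3A.tsv
```

Each mismatched line is reported with its line number, and the function returns non-zero if any mismatch was found.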

1) Size Exclusion Chromatography with Mass Spectrometry (SEC-MS) data release

TODO

1) Immunofluorescent Image (IFImage) data release

The command line tool cellmaps_utilscmd.py ifconverter consumes a csv file containing image links and other information, downloads the images, and stores them in a RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.

The example below generates a RO-Crate directory under the 0.1alpha folder using a csv file named example.csv that is passed in via the --input flag.

# be sure to download this file: https://github.com/idekerlab/cellmaps_utils/raw/main/examples/iftool/example.csv
# and name it example.csv
wget https://github.com/idekerlab/cellmaps_utils/raw/main/examples/iftool/example.csv

cellmaps_utilscmd.py ifconverter 0.1alpha --input example.csv \
                     --author 'Lundberg Lab' --name 'IF images' \
                     --organization_name 'Lundberg Lab' --project_name 'CM4AI' \
                     --release '0.1 alpha' --treatment paclitaxel \
                     --cell_line 'MDA-MB-468' --gene_set 'chromatin'

Example contents generated by above command:

0.1alpha/
└── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha
    ├── antibody_gene_table.tsv
    ├── blue
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_blue.jpg
    ├── dataset_info.json
    ├── green
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_green.jpg
    ├── readme.txt
    ├── red
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_red.jpg
    ├── ro-crate-metadata.json
    └── yellow
        └── B2AI_1_Paclitaxel_C1_R1_z01_yellow.jpg

Note

Invoke cellmaps_utilscmd.py ifconverter -h for usage information
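
Since ifconverter downloads images, a partial download is one way a RO-Crate could end up incomplete. The sketch below is a hypothetical check, not part of cellmaps_utils. It assumes the layout shown in the example listing (blue, green, red, and yellow directories holding .jpg files) and further assumes that each color directory should hold the same number of images, which the listing suggests but the tool's documentation here does not state.

```shell
# Hypothetical helper (not part of cellmaps_utils): print the .jpg count
# for each color directory and fail if a directory is missing or the
# counts differ. Equal counts per color are an assumption.
check_if_channels() (
  crate=$1
  expected=""
  for color in blue green red yellow; do
    [ -d "$crate/$color" ] || { echo "missing directory: $color"; exit 1; }
    count=$(find "$crate/$color" -type f -name '*.jpg' | wc -l | tr -d ' ')
    echo "$color: $count"
    if [ -z "$expected" ]; then
      expected=$count
    elif [ "$count" != "$expected" ]; then
      echo "image count differs in $color"
      exit 1
    fi
  done
)

# usage: check_if_channels 0.1alpha/<ifimage rocrate directory>
```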

2) Compress RO-Crate from step one

In this step the RO-Crate directories are compressed into files.

Note

The code fragment below assumes all RO-Crate directories were put into 0.1alpha directory.

# assuming all RO-Crates above were put into 0.1alpha directory
cd 0.1alpha

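# -maxdepth is placed before the tests to avoid a warning from GNU find
for Y in `find . -maxdepth 1 -type d -name "*_*"` ; do
  echo "$Y"
  # write the gzip-compressed archive next to the directory
  tar -czf "${Y}.tar.gz" "$Y"
done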

If examples above were run then the 0.1alpha directory will look like this:

.
├── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha
├── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha.tar.gz
├── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha
├── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha.tar.gz
├── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha
└── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha.tar.gz
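
Before moving on, it can be worth confirming that each archive can actually be read back. The sketch below is a hypothetical helper, not part of cellmaps_utils; it lists the contents of every .tar.gz file in a directory with tar -tzf and reports any archive that tar cannot read.

```shell
# Hypothetical helper (not part of cellmaps_utils): check that every
# .tar.gz file in the given directory can be listed by tar.
verify_archives() (
  cd "$1" || exit 1
  status=0
  for archive in *.tar.gz; do
    # when nothing matches, the pattern is left unexpanded
    [ -e "$archive" ] || { echo "no .tar.gz files found"; exit 1; }
    if tar -tzf "$archive" > /dev/null 2>&1; then
      echo "ok: $archive"
    else
      echo "unreadable: $archive"
      status=1
    fi
  done
  exit "$status"
)

# usage: verify_archives 0.1alpha
```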

3) Run cellmaps_utilscmd.py rocratetable on RO-Crate files

In this step, the RO-Crate files are examined and a table is generated that can be sent to the CM4AI site admin so the new data release can be displayed.

Note

The code fragment below assumes all RO-Crate directories were put into 0.1alpha directory.

# assuming all RO-Crates above were put into 0.1alpha directory
# along with gzip files

cd 0.1alpha

cellmaps_utilscmd.py rocratetable table --downloadurlprefix 'https://cm4ai.org/Data/' --rocrates `/bin/ls | grep -v ".gz"`

The above command will create a directory named table containing a tsv file named data.tsv:

table
└── data.tsv

Contents of the data.tsv file:

FAIRSCAPE ARK ID    Date    Version Type    Cell Line       Tissue  Treatment       Gene set        Generated By Software   Name    Description     Keyword    Download RO-Crate Data Package   Download RO-Crate Data Package Size MB  Generated By Software   Output Dataset  Responsible Lab
d4d80b1d-8d49-4204-8c0d-209c5b9ccdf2:cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha  2024-04-29      0.1 alpha       Data    KOLF2.1J        undifferentiated        untreated       chromatin               CRISPR  CM4AI 0.1 alpha KOLF2.1J untreated CRISPR undifferentiated 4channel chromatin   CM4AI,0.1 alpha,KOLF2.1J,untreated,CRISPR,undifferentiated,4channel,chromatin   https://cm4ai.org/Data/cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha.tar.gz     1       Mali Lab
134e01c8-90ea-457d-9e6e-ca046ecc860f:cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha        2024-04-29      0.1 alpha       Data    MDA-MB-468      breast; mammary gland   paclitaxel      chromatin               IF images       CM4AI 0.1 alpha MDA-MB-468 paclitaxel IF microscopy images breast; mammary gland chromatin      CM4AI,0.1 alpha,MDA-MB-468,paclitaxel,IF microscopy,images,breast; mammary gland,chromatin      https://cm4ai.org/Data/cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha.tar.gz   1       Lundberg Lab
7240c7d7-327c-423c-834d-1e99ab8a417b:cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha    2024-04-29      0.1 alpha       Data    MDA-MB-468      breast; mammary gland   untreated       chromatin               AP-MS   CM4AI 0.1 alpha MDA-MB-468 untreated breast; mammary gland AP-MS edgelist chromatin     CM4AI,0.1 alpha,MDA-MB-468,untreated,breast; mammary gland,AP-MS edgelist,chromatin     https://cm4ai.org/Data/cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha.tar.gz       1                       Krogan Lab

Note

cellmaps_utilscmd.py rocratetable runs much faster if the uncompressed RO-Crate directories are passed in. The script does need the .gz files in the same directory to report file sizes in the generated table.
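
The download URLs recorded in the table are what the site will link to, so it can be useful to pull them out for review. The sketch below is a hypothetical helper, not part of cellmaps_utils. It assumes data.tsv is tab-separated and that the URL column is headed exactly "Download RO-Crate Data Package", as the example output above suggests; the column is located by name so that the similarly named size column is not picked up.

```shell
# Hypothetical helper (not part of cellmaps_utils): print the values of the
# "Download RO-Crate Data Package" column. The column name is an assumption
# taken from the example table.
list_download_urls() (
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++)
        if ($i == "Download RO-Crate Data Package") col = i
      if (!col) { print "column not found" > "/dev/stderr"; exit 1 }
      next
    }
    $col != "" { print $col }
  ' "$1"
)

# usage: list_download_urls table/data.tsv
```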

4) Upload RO-Crate files

For this step, the compressed RO-Crate files (ending with .gz) should be uploaded to the location matching the prefix set via --downloadurlprefix in Step 3.

Note

Be sure to verify that the URLs for the uploaded files resolve
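
One way to do this check from the command line is to request only the headers for each URL with curl. The sketch below is a hypothetical helper, not part of cellmaps_utils; it reads one URL per line from standard input and assumes the server answers HEAD requests, which not every host does.

```shell
# Hypothetical helper (not part of cellmaps_utils): read URLs from stdin,
# one per line, and report whether each one resolves.
check_urls() (
  status=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    # --head requests headers only; --fail turns HTTP errors into a
    # non-zero exit code; --location follows redirects
    if curl --silent --head --fail --location --output /dev/null "$url"; then
      echo "ok: $url"
    else
      echo "FAILED: $url"
      status=1
    fi
  done
  exit "$status"
)

# usage: check_urls < urls.txt   (urls.txt is a hypothetical file name)
```

The function returns non-zero if any URL fails, so it can be used as a final gate before notifying the site admin.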

5) Send table from Step 3 to admin of CM4AI site

In this step, send the table/data.tsv file to the CM4AI admin and let them know whether this table should be appended to or overwrite the existing data.