Cell Maps for AI Data Release
This section describes how to generate a new CM4AI data release. It is intended for data generation sites and for anyone interested in how the datasets on the CM4AI site are created.
There are five steps to the data release:
1) Each individual dataset must be run through the matching cellmaps_utilscmd.py XX converter command to generate RO-Crate directories.
2) These RO-Crate directories must be compressed.
3) cellmaps_utilscmd.py rocratetable must be run on these RO-Crates to generate a table.
4) The compressed RO-Crate files must be uploaded.
5) The generated table must be sent to the admin of the CM4AI site so they can load it and display the new data release.
1) Perturbation/CRISPR data release (step 1 above)
The command line tool cellmaps_utilscmd.py crisprconverter takes an h5ad file and copies that file, along with other metadata files, into an RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.
The example below generates an RO-Crate directory under the 0.1alpha folder using an h5ad file named foo.h5ad passed in via the --h5ad flag:
echo "completely fake h5ad file" > foo.h5ad
cellmaps_utilscmd.py -vv crisprconverter 0.1alpha --h5ad foo.h5ad --author 'Mali Lab' \
--name 'CRISPR' --organization_name 'Mali Lab' \
--project_name CM4AI --release '0.1 alpha' --treatment untreated \
--dataset 4channel --cell_line KOLF2.1J --gene_set chromatin \
--tissue undifferentiated --num_perturb_guides 6 \
--num_non_target_ctrls 109 --num_screen_targets 108
Example contents generated by above command:
0.1alpha/
└── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha
    ├── perturbation.h5ad
    ├── dataset_info.json
    ├── readme.txt
    └── ro-crate-metadata.json
Note
Invoke cellmaps_utilscmd.py crisprconverter -h for usage information.
Warning
This tool does not currently validate the .h5ad file. Once validation is added, the example above will fail because foo.h5ad is not a real h5ad file.
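Until the converter validates its input, a quick pre-flight check can catch files that are obviously not h5ad. The sketch below is not part of cellmaps_utils, and the function name looks_like_hdf5 is hypothetical. It relies only on the fact that an h5ad file is an HDF5 file, which normally begins with the 8-byte HDF5 signature. It does not inspect the AnnData layout, and it would reject a valid HDF5 file that places a user block before the signature.

```shell
# Hypothetical helper (not part of cellmaps_utils): return 0 if the file
# starts with the 8-byte HDF5 signature, non-zero otherwise.
looks_like_hdf5() {
    # Dump the first 8 bytes as hex and strip spaces and newlines
    sig=$(head -c 8 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "894844460d0a1a0a" ]
}

# Example: warn before running the converter on a file that is not HDF5.
# looks_like_hdf5 foo.h5ad || echo "foo.h5ad is not an HDF5 file" >&2
```

With the fake foo.h5ad created above, the check fails, which is the behavior the warning anticipates.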
1) Affinity Purification Mass Spectrometry (AP-MS) data release
The command line tool cellmaps_utilscmd.py apmsconverter consumes one or more tsv files that are combined and stored in an RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.
The example below generates an RO-Crate directory under the 0.1alpha folder using a tsv file named DNMT3A.tsv that is passed in via the --inputs flag:
echo -ne 'Bait\tPrey\tPreyGene.x\tSpec\tSpecSum\tAvgSpec\tNumReplicates.x\t' > DNMT3A.tsv
echo -ne 'ctrlCounts\tAvgP.x\tMaxP.x\tTopoAvgP.x\tTopoMaxP.x\tSaintScore.x\t' >> DNMT3A.tsv
echo -e 'logOddsScore\tFoldChange.x\tBFDR.x\tboosted_by.x' >> DNMT3A.tsv
echo -ne 'DNMT3A\tO00422\tSAP18_HUMAN\t6|7|8|10\t31\t7.75\t4\t0|0|0|0|0|0|0|0\t' >> DNMT3A.tsv
echo -e '1\t1\t1\t1\t1\t13.51\t77.5\t0\tNA' >> DNMT3A.tsv
echo -ne 'DNMT3A\tO00571\tDDX3X_HUMAN\t3|7|11|9\t30\t7.5\t4\t0|1|3|3|0|0|0|0\t' >> DNMT3A.tsv
echo -e '0.99\t1\t0.99\t1\t0.99\t3.63\t8.57\t0\tNA' >> DNMT3A.tsv
cellmaps_utilscmd.py apmsconverter 0.1alpha --inputs DNMT3A.tsv \
--author 'Krogan Lab' --name 'AP-MS' \
--organization_name 'Krogan Lab' --project_name 'CM4AI' \
--release '0.1 alpha' --treatment untreated \
--cell_line 'MDA-MB-468' --gene_set 'chromatin'
Example contents generated by above command:
0.1alpha/
└── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha
    ├── apms.tsv
    ├── dataset_info.json
    ├── readme.txt
    └── ro-crate-metadata.json
Note
Invoke cellmaps_utilscmd.py apmsconverter -h for usage information.
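Because the AP-MS input is assembled from tab-separated lines, a row with a missing or extra field is easy to introduce. The sketch below is a generic sanity check, not something apmsconverter is known to require; the function name check_tsv_columns is hypothetical. It only confirms that every line has the same number of tab-separated fields as the header.

```shell
# Hypothetical helper (not part of cellmaps_utils): succeed only if every
# line of a tab-separated file has the same field count as its header.
check_tsv_columns() {
    awk -F'\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: expected %d fields, got %d\n", NR, n, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
# check_tsv_columns DNMT3A.tsv && echo "DNMT3A.tsv has consistent columns"
```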
1) Size Exclusion Chromatography with Mass Spectrometry (SEC-MS) data release
TODO
1) Immunofluorescent Image (IFImage) data release
The command line tool cellmaps_utilscmd.py ifconverter consumes a csv file that contains image links and other information, downloads the images, and stores them in an RO-Crate suitable for persistence to FAIRSCAPE and ultimately publication on CM4AI.
The example below generates an RO-Crate directory under the 0.1alpha folder using a csv file named example.csv that is passed in via the --input flag:
# download the example input file and save it as example.csv
wget https://github.com/idekerlab/cellmaps_utils/raw/main/examples/iftool/example.csv
cellmaps_utilscmd.py ifconverter 0.1alpha --input example.csv \
--author 'Lundberg Lab' --name 'IF images' \
--organization_name 'Lundberg Lab' --project_name 'CM4AI' \
--release '0.1 alpha' --treatment paclitaxel \
--cell_line 'MDA-MB-468' --gene_set 'chromatin'
Example contents generated by above command:
0.1alpha/
└── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha
    ├── antibody_gene_table.tsv
    ├── blue
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_blue.jpg
    ├── dataset_info.json
    ├── green
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_green.jpg
    ├── readme.txt
    ├── red
    │   └── B2AI_1_Paclitaxel_C1_R1_z01_red.jpg
    ├── ro-crate-metadata.json
    └── yellow
        └── B2AI_1_Paclitaxel_C1_R1_z01_yellow.jpg
Note
Invoke cellmaps_utilscmd.py ifconverter -h for usage information.
2) Compress RO-Crates from step 1
In this step, each RO-Crate directory is compressed into a .tar.gz file.
Note
The code fragment below assumes all RO-Crate directories were put into the 0.1alpha directory.
# assuming all RO-Crates above were put into 0.1alpha directory
cd 0.1alpha
for Y in `find . -maxdepth 1 -type d -name "*_*"` ; do
    echo "$Y"
    # -f writes the archive directly instead of relying on tar's default output
    tar -czf "${Y}.tar.gz" "$Y"
done
If the examples above were run, the 0.1alpha directory will look like this:
.
├── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha
├── cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha.tar.gz
├── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha
├── cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha.tar.gz
├── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha
└── cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha.tar.gz
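Before moving on, it can be worth confirming that each archive is readable and actually holds an RO-Crate. The sketch below is an optional check, not part of cellmaps_utils; the function name verify_rocrate_archive is hypothetical. It assumes, based on the example listings above, that every RO-Crate contains a ro-crate-metadata.json file.

```shell
# Hypothetical helper (not part of cellmaps_utils): succeed only if the
# archive can be listed and contains a ro-crate-metadata.json entry.
verify_rocrate_archive() {
    tar -tzf "$1" 2>/dev/null | grep -q 'ro-crate-metadata\.json$'
}

# Example, run inside the 0.1alpha directory:
# for A in *.tar.gz ; do
#     verify_rocrate_archive "$A" || echo "problem with $A" >&2
# done
```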
3) Run cellmaps_utilscmd.py rocratetable on RO-Crate files
In this step, the RO-Crates are examined and a table is generated that can be sent to the CM4AI site admin so they can display the new data release.
Note
The code fragment below assumes all RO-Crate directories were put into the 0.1alpha directory.
# assuming all RO-Crates above were put into 0.1alpha directory
# along with gzip files
cd 0.1alpha
cellmaps_utilscmd.py rocratetable table --downloadurlprefix 'https://cm4ai.org/Data/' --rocrates `/bin/ls | grep -v ".gz"`
The above command creates a directory named table containing a tsv file named data.tsv:
table
└── data.tsv
Contents of the data.tsv file:
FAIRSCAPE ARK ID Date Version Type Cell Line Tissue Treatment Gene set Generated By Software Name Description Keyword Download RO-Crate Data Package Download RO-Crate Data Package Size MB Generated By Software Output Dataset Responsible Lab
d4d80b1d-8d49-4204-8c0d-209c5b9ccdf2:cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha 2024-04-29 0.1 alpha Data KOLF2.1J undifferentiated untreated chromatin CRISPR CM4AI 0.1 alpha KOLF2.1J untreated CRISPR undifferentiated 4channel chromatin CM4AI,0.1 alpha,KOLF2.1J,untreated,CRISPR,undifferentiated,4channel,chromatin https://cm4ai.org/Data/cm4ai_chromatin_kolf2.1j_undifferentiated_untreated_crispr_4channel_0.1_alpha.tar.gz 1 Mali Lab
134e01c8-90ea-457d-9e6e-ca046ecc860f:cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha 2024-04-29 0.1 alpha Data MDA-MB-468 breast; mammary gland paclitaxel chromatin IF images CM4AI 0.1 alpha MDA-MB-468 paclitaxel IF microscopy images breast; mammary gland chromatin CM4AI,0.1 alpha,MDA-MB-468,paclitaxel,IF microscopy,images,breast; mammary gland,chromatin https://cm4ai.org/Data/cm4ai_chromatin_mda-mb-468_paclitaxel_ifimage_0.1_alpha.tar.gz 1 Lundberg Lab
7240c7d7-327c-423c-834d-1e99ab8a417b:cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha 2024-04-29 0.1 alpha Data MDA-MB-468 breast; mammary gland untreated chromatin AP-MS CM4AI 0.1 alpha MDA-MB-468 untreated breast; mammary gland AP-MS edgelist chromatin CM4AI,0.1 alpha,MDA-MB-468,untreated,breast; mammary gland,AP-MS edgelist,chromatin https://cm4ai.org/Data/cm4ai_chromatin_mda-mb-468_untreated_apms_0.1_alpha.tar.gz 1 Krogan Lab
Note
cellmaps_utilscmd.py rocratetable runs much faster when the uncompressed RO-Crate directories are passed in. The script still needs the .gz files in the same directory so it can report file sizes in the generated table.
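Since the file sizes come from the compressed files, a directory without a matching archive is worth catching before the table is generated. The sketch below is one way to do that; it is not part of cellmaps_utils and the function name missing_archives is hypothetical. It assumes the naming from step 2, where each directory X has a sibling X.tar.gz, and it reuses the "*_*" pattern so that unrelated directories such as table are skipped.

```shell
# Hypothetical helper (not part of cellmaps_utils): print each RO-Crate
# directory under "$1" that has no sibling .tar.gz archive.
missing_archives() {
    for D in "$1"/*_*/ ; do
        [ -d "$D" ] || continue      # glob matched nothing
        D=${D%/}                     # drop the trailing slash
        [ -f "${D}.tar.gz" ] || echo "$D"
    done
}

# Example, run from the parent of the 0.1alpha directory:
# missing_archives 0.1alpha
```

No output means every directory has its archive.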
4) Upload RO-Crate files
For this step, the RO-Crate files ending with .gz should be uploaded to the path matching the prefix set via --downloadurlprefix in Step 3.
Note
Be sure to verify that the URLs resolve for the uploaded files.
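The URL check can be scripted against the generated table. The sketch below is an illustration rather than a documented workflow; the function names list_download_urls and check_download_urls are hypothetical. It assumes each download URL sits in its own tab-separated field of data.tsv and ends in .tar.gz, as in the example output above, and that curl is installed.

```shell
# Hypothetical helpers (not part of cellmaps_utils).

# Print every .tar.gz download URL found in a data.tsv file, one per line.
list_download_urls() {
    tr '\t' '\n' < "$1" | grep -E '^https?://.*\.tar\.gz$'
}

# Send a HEAD request to each URL and report the ones that do not resolve.
check_download_urls() {
    list_download_urls "$1" | while read -r URL ; do
        curl --silent --fail --head --location "$URL" > /dev/null \
            || echo "unreachable: $URL"
    done
}

# Example:
# check_download_urls table/data.tsv
```

A HEAD request avoids downloading the archives; some servers handle HEAD differently from GET, so an unreachable report is a prompt to look closer rather than proof of a broken link.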
5) Send table from Step 3 to admin of CM4AI site
In this step, send the table/data.tsv file to the CM4AI admin and let them know whether this table should be appended to the existing data or should overwrite it.