Analyze¶
Enter the output directory.
cd out
Extract scores¶
extract_scores.py
This should create several files, including all_scores_sorted_uniq.csv.
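Before moving on, it can help to confirm that the score file was actually written. The following is a minimal sketch, not part of Assembline: the function name check_scores is hypothetical, and it assumes (this page does not confirm it) that the first line of the CSV is a header.

```shell
# Sanity-check a score file: confirm it exists and is non-empty,
# then print its first line and the number of remaining lines.
# Assumption: the first line is a header; this is not confirmed by the docs.
check_scores() {
    file="${1:-all_scores_sorted_uniq.csv}"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    head -n 1 "$file"
    echo "rows: $(tail -n +2 "$file" | grep -c '')"
}
```

Called without arguments inside the out directory, check_scores inspects all_scores_sorted_uniq.csv; pass another path to inspect a different score file.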
Create a CIF file of the top N models¶
rebuild_atomic.py --top 10 --project_dir <full path to the original project directory, e.g. CR_Y_complex> config.json all_scores_sorted_uniq.csv
Assess sampling exhaustiveness¶
Run the sampling performance analysis with the imp-sampcon tool (described by Viswanath et al. 2017).
Warning
For the global optimization, the sampling exhaustiveness analysis is not always applicable. In some cases, the optimization at this stage works so well that all or most models are the same, resulting in very few clusters. The sampling is then exhaustive under the assumptions encoded in the JSON file, but the sampling precision cannot be estimated. In such cases we recommend intensively refining (e.g. with high initial temperatures in simulated annealing) the top models (or all models) to create a diverse set of models for analysis.
Prepare the density.txt file:
create_density_file.py --project_dir ../ config.json --by_rigid_body
Note
An example density.txt file is provided in CR_Y_complex/
Run the setup_analysis.py script to prepare input files for the sampling exhaustiveness analysis:
setup_analysis.py -s all_scores.csv -o analysis -d density.txt --score_thresh <score to use as threshold>
--score_thresh is optional and is used to filter out rare, very poorly scoring models (the threshold can be adjusted based on the scores.pdf generated above).
Note
For further descriptions of the settings for setup_analysis.py, please see Sampling exhaustiveness and precision with Assembline.
Run the imp_sampcon exhaust tool (a command-line tool provided with IMP) to perform the actual analysis:
cd analysis
imp_sampcon exhaust -n <prefix for output files> \
    --rmfA sample_A/sample_A_models.rmf3 \
    --rmfB sample_B/sample_B_models.rmf3 \
    --scoreA scoresA.txt --scoreB scoresB.txt \
    -d ../density.txt \
    -m cpu_omp \
    -c <int for cores to process> \
    -gp \
    -g <float with clustering threshold step>
Note
For further descriptions of the settings for imp_sampcon, please see Sampling exhaustiveness and precision with Assembline.
In the output you will get, among other files:
<prefix for output files>.Sampling_Precision_Stats.txt: estimation of the sampling precision.
Directories and files with names starting with cluster: the clusters obtained after clustering at the sampling precision determined by imp-sampcon, containing information about the models in the clusters and the cluster localization densities.
<prefix for output files>.Cluster_Precision.txt: the precision of each cluster.
PDF files with plots of the results of the exhaustiveness tests.
See Viswanath et al. 2017 for a detailed explanation of these concepts.
Optimize the plots
The fonts and the value ranges of the X and Y axes in the default plots from imp_sampcon exhaust are frequently not optimal, so you have to adjust them manually.
Copy the original gnuplot scripts to the current analysis directory by executing:
copy_sampcon_gnuplot_scripts.py
This will copy four scripts to the current directory:
Plot_Cluster_Population.plt for the <prefix for output files>.Cluster_Population.pdf plot
Plot_Convergence_NM.plt for the <prefix for output files>.ChiSquare.pdf plot
Plot_Convergence_SD.plt for the <prefix for output files>.Score_Dist.pdf plot
Plot_Convergence_TS.plt for the <prefix for output files>.Top_Score_Conv.pdf plot
Edit the scripts to adjust the plots to your liking.
Run the scripts again:
gnuplot -e "sysname='<prefix for output files>'" Plot_Cluster_Population.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_NM.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_SD.plt
gnuplot -e "sysname='<prefix for output files>'" Plot_Convergence_TS.plt
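The four gnuplot commands above differ only in the script name, so when you tweak the scripts repeatedly it may be convenient to wrap them in a small shell function. This is a sketch rather than part of Assembline or IMP: the function name replot_all and the GNUPLOT override variable are hypothetical and introduced here only for illustration.

```shell
# Re-run all four plotting scripts for one output prefix.
# Usage: replot_all <prefix for output files>
# GNUPLOT is a hypothetical override (not an Assembline or IMP setting);
# set GNUPLOT=echo to print the arguments instead of plotting.
replot_all() {
    for script in Plot_Cluster_Population.plt Plot_Convergence_NM.plt \
                  Plot_Convergence_SD.plt Plot_Convergence_TS.plt; do
        # Stop at the first script that fails so errors are not hidden.
        "${GNUPLOT:-gnuplot}" -e "sysname='$1'" "$script" || return 1
    done
}
```

Run it from the analysis directory that contains the copied .plt scripts, passing the same prefix that was given to imp_sampcon exhaust with -n.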
Extract cluster models for visualization
extract_cluster_models.py \
    --project_dir <full path to the original project directory> \
    --outdir <cluster directory> \
    --ntop <number of top models to extract> \
    --scores <path to the score CSV file used as input for analysis> \
    Identities_A.txt Identities_B.txt <list of cluster models> <path to the json>
For example, to extract the 5 top scoring models from cluster 0:
extract_cluster_models.py \
    --project_dir ../../ \
    --outdir cluster.0/ \
    --ntop 5 \
    --scores ../all_scores.csv \
    Identities_A.txt Identities_B.txt cluster.0.all.txt ../config.json
The models are saved in the CIF format to the cluster.0 directory.
If you want to re-cluster at a specific threshold (e.g. to get bigger clusters), you can do:
mkdir recluster
cd recluster/
cp ../Distances_Matrix.data.npy .
cp ../*ChiSquare_Grid_Stats.txt .
cp ../*Sampling_Precision_Stats.txt .
imp_sampcon exhaust -n <prefix for output files> \
    --rmfA ../sample_A/sample_A_models.rmf3 \
    --rmfB ../sample_B/sample_B_models.rmf3 \
    --scoreA ../scoresA.txt --scoreB ../scoresB.txt \
    -d ../density.txt \
    -m cpu_omp \
    -c 4 \
    -gp \
    --skip \
    --cluster_threshold <float greater than already calculated precision>