Simulated Overexpression Screens
Building a Bioinformatics Pipeline Without Real Data
1. Intention
I wanted to build a Nextflow DSL2 pipeline for analyzing barcoded overexpression libraries using the Bobaseq workflow. Unfortunately, no publicly available sequencing datasets existed for either the mapping or barseq elements of Bobaseq, so both needed to be simulated from scratch.
This project involved creating analysis workflows as well as synthetic datasets that mirror real experimental designs, enabling full end-to-end testing of the pipeline. It is presented in greater depth on the following portfolio page: Simulating and Analyzing Barcoded Overexpression Libraries with Nextflow.
2. Designing the Simulation Workflows
To cover the full barseq experiment lifecycle, I created four modular workflows:
- Library Simulation: Generate a synthetic plasmid library with random genomic inserts and random barcodes using custom Python scripts and PBSIM3.
- Library Mapping: Map barcodes to inserts using CCS and Boba-seq.
- Barseq Simulation: Simulate barcode amplicon sequencing from experimental conditions with custom python scripts.
- Barseq Analysis: Quantify barcodes (MultiCodes) and calculate fitness scores (BobaseqFitness).
The result was a fully containerized, portable pipeline able to simulate experiments and process results exactly as if real experimental data had been collected.
3. Lessons Learned Along the Way
Docker Image Design
Initially, I built extremely lean Docker images, stripping out any unnecessary software to reduce size. Later, I discovered that Nextflow and AWS Batch expected certain utilities (like bash
and procps
), and were unreliable even when they were added to minimal images like Alpine. This led to rebuilding several images for full compatibility.
Unexpected AMI Challenges
Even though newer AWS Batch AMIs have awscli
installed in them, I found that Nextflow could not interact with it properly. Mysterious failures during S3 syncing led me to rebuild a custom AMI with a clean awscli installation.
Simulating Experiment Scale
Running the pipeline with different library sizes showed a clear effect: reducing the plasmid library size by a factor of 10 made it significantly harder to identify true winning gene regions in simulated selection experiments. Simulation provided a useful lens for thinking about experimental design.
4. Future Efforts
Although the project began with synthetic data, the pipelines are designed to process real-world barseq and mapping experiments as well.
Improved Simulations
If simulated data are to be used for the basis of experimental design, the barcode abundance distributions could be better modeled to represent real (possibly skewed) data and bring the models closer to realism. One option would be to reverse-engineer the distribution from the MultiCodes outputs from the Boba-seq paper. Another option would be to base distributions on those form public RB-TnSeq datasets.
AWS Batch Tuning
Many processes finish within two minutes. During testing, Nextflow started 12 different instances — one per sample — each taking several minutes to spin up but only about one minute to run. Setting concurrency limits for fast steps would increase workflow efficiency.
Spot Instance Failures
The workflow was configured to run using AWS spot instances, which typically cost about one-third of on-demand instances. However, spot instances carry the risk of being reclaimed, causing pipeline failure. Adding error handling or retry strategies would improve resilience.
5. Closing Thoughts
Building a genomics pipeline without real data forced a deeper understanding of both experimental and computational design. Simulating library creation, mapping, selection, and analysis helped reveal important lessons about workflow robustness, scalability, and real-world feasibility.
You can view the full project and example runs in the portfolio or on GitHub.