class: center, middle, inverse, title-slide

# Research Project Management
### Jinliang Yang and Gen Xu
### Jan. 23, 2025

---

# Challenges in research project management

- Many steps and many rounds of revisions
- Spanning years
- Tracing the changes
- __version control__

--

- Disseminating to your co-workers, collaborators, etc.
- Reproducibility, transparency, visualization
- __Project implementation__

--

- Heavy computational demand
- Computational resources and software
- Data backup
- __Computation__

---

# Challenges in research project management

### Version control
- Employ __git__ or __GitHub__

--

### Project implementation
- Construct a clear __dir system__ within an __RStudio Project__

--

### Computation
- Use the __HCC cluster__ system

---

# Version control

__Git__ is a [free and open source](https://git-scm.com/) distributed __version control system__ designed to handle everything from small to very large projects with speed and efficiency.

--

- GitHub: a Git-based repository hosting platform
  - GitHub Education: [student pack](https://education.github.com/students)
- GitLab: another repository manager that lets teams collaborate on code
  - GitLab UNL edition: https://git.unl.edu/

--

#### Git Usage

Type `git` in your terminal to list the most commonly used Git commands.
```bash
git
```

--

Git [cheat-sheet](https://github.github.com/training-kit/downloads/github-git-cheat-sheet.pdf)

---

# Version control

### Create Repositories

```bash
# Clone (download) a repo that already exists on GitHub
git clone [url]
git clone git@github.com:jyanglab/2022-agro932-lab.git
```

--

### Make changes

```bash
# Stage (snapshot) all changed files in preparation for versioning
git add --all
# Record the staged snapshot permanently in the version history
git commit -m "descriptive message"
```

---

# Version control

### Synchronize changes

```bash
# Upload local branch commits to GitHub
git push
# Update your current local working branch with
# all new commits from the corresponding remote branch
git pull
```

--

## GitHub Glossary

- `git`: an open source, distributed version-control system.
- `GitHub`: a platform for hosting and collaborating on Git repositories.
- `fork`: make your own copy, on GitHub, of a repository owned by a different user.
- `git clone`: create a local copy of a repository, including all commits and branches.
- `remote`: a common repository on GitHub that all team members use to exchange their changes.

---

# Project Implementation

### Construct your own project __directory system__

In a typical research project, I copy the following folders into the project dir. The layout of the dir is based on the idea from [ProjectTemplate](http://projecttemplate.net/architecture.html).

- __cache/__: Intermediate datasets generated during the preprocessing steps.
- __data/__: Raw data of small size.
- __graphs/__: Graphs produced during the analysis.
- __lib/__: Functions used within this project.
- __profiling/__: Main scripts for the project, including code documentation. It contains some sub-directories.

---

# Project Implementation

### Construct your own project __directory system__

- __.gitignore__: Specifies intentionally untracked files to ignore.
- __.git/__: Git-related files.
- __TODO__: A to-do list, as a markdown file.
- __README__: The readme file.

--

- __largedata/__: Untracked folder containing large files, e.g., sequencing data.
- __slurm-log/__: Log files from slurm scripts.
- __slurm-script/__: Scripts for submitting slurm jobs.
- __*.Rproj__: RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents.

---

# Project Implementation

### Some tips regarding best practice for project management

A __path__ specifies a unique location in a file system.

--

- An __absolute path__ points to the same location in a file system, regardless of the current working directory.

> `/Users/jyang/Documents/courses/2022-agro932-lab`

- A __relative path__ specifies the location of a file or directory relative to another directory, usually the current working directory.

> `courses/2022-agro932-lab`

---

# Project Implementation

### Some tips regarding best practice for project management

I employ a numbering system to sort the research code.

- Scripts are named with a number, a letter, and further numbers, separated by dots. For example:
  - `profiling/1.pheno/A.1_pheno_processing.Rmd`
  - `profiling/1.pheno/A.2_pheno_plot.Rmd`

--

> A commit message shows whether a developer is a good collaborator (to others or to your future self)

Use informative commit messages. Read the following suggestions:

- [How to write a git commit message](https://chris.beams.io/posts/git-commit/)
- [On commit messages](http://who-t.blogspot.com/2009/12/on-commit-messages.html)

---

# Computation

### Quick Introduction to HCC

- An HCC cluster, i.e., __crane__ or __rhino__, has a head node, which controls the cluster, and compute nodes, which are where the action happens.

--

- __DO NOT__ run processes on the head node.
--

- The only tasks that are acceptable on the head node are:
  - Compiling/building files
  - Installing software or R packages
  - Submitting or checking on jobs

--

- Pre-installed modules
  - `module avail`
  - `module load R/3.6`

---

# Computation

### File systems on `crane`

#### `$HOME`:
- `$HOME` directories are backed up daily.
- You can read and write.
- But the size is small (20GB per user).
- Normally used for configuration files, user-defined functions, and user-installed software packages.
  - `cd $HOME; mkdir bin`

--

#### `$WORK`:
- `$WORK` is large, 50TB per user.
- NOT backed up.
- A purge policy exists on `$WORK`.
- For computing and working. But, __DO NOT use it to store RAW Data__.

---

# Computation

### File systems on `crane`

#### `$COMMON`:
- New file system. 1TB per user for free.
- __No backups are made__! Don’t be silly!
- No purge policy.
- Used to store things (e.g., code, git repos) that are routinely needed on multiple clusters.
  - `cd $COMMON; mkdir courses`
  - `git clone git@github.com:jyanglab/2022-agro932-lab.git`
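
---

# Computation

### Sketch: setting up a project dir

The directory layout from the Project Implementation slides can be created in one step. This is a minimal sketch, not a fixed recipe: `make_project_skeleton` is a hypothetical helper name, the `.md` extensions for README and TODO are an assumption, and the example path under `$COMMON` is a placeholder.

```bash
# Sketch only: create the project skeleton described in the slides.
# make_project_skeleton is a hypothetical helper, not part of git, RStudio, or HCC.
make_project_skeleton() {
  root="${1:-}"
  if [ -z "$root" ]; then
    echo "usage: make_project_skeleton <project-dir>" >&2
    return 1
  fi
  # One folder per entry in the directory system
  for d in cache data graphs lib profiling largedata slurm-log slurm-script; do
    mkdir -p "$root/$d" || return 1
  done
  # README and TODO as markdown files (the .md extension is an assumption)
  touch "$root/README.md" "$root/TODO.md"
  # Keep the large-file folder untracked; never overwrite an existing .gitignore
  [ -f "$root/.gitignore" ] || printf '%s\n' 'largedata/' > "$root/.gitignore"
}

# Example call (placeholder path):
# make_project_skeleton "$COMMON/courses/my-project"
```

Run `git init` inside the new folder, or clone an existing repo first and call the helper on the clone, to put the skeleton under version control.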