One of the tasks I seem to be spending a lot of time thinking about these days is how to name files and structure them in the appropriate directories so that they follow a consistent logic. This is because my current research involves developing analysis pipelines for Next Generation Sequencing data, where the output files of one program are the input to the next. These processing steps allow raw data straight out of the machine to help answer the biological questions for which the experiments were run in the first place.
File and directory naming conventions may sound like a trivial thing to do but I have found that their complexity increases exponentially when many components are run. To illustrate my current approach to tackling this problem, I present here a simple example. Suppose a project (‘project_name’) that runs two programs, ‘program_1’ and ‘program_2’. Each time the pipeline is run, input files may vary and so I create a new ‘job_name’ for each run. I have come up with this directory architecture:
/project_name
/project_name/data
/project_name/data/job_name_1
/project_name/data/job_name_1/input_data_type_1
/project_name/data/job_name_1/input_data_type_2
/project_name/data/job_name_1/input_data_type_3
/project_name/results
/project_name/results/job_name_1/program_1
/project_name/results/job_name_1/program_1/output_1
/project_name/results/job_name_1/program_1/output_2
...
/project_name/results/job_name/program_2/output_1
/project_name/results/job_name/program_2/output_1
...
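A layout like this is easier to keep consistent if the paths are generated rather than typed by hand. Here is a minimal Python sketch of that idea, not the way I actually build my pipelines: the function names `job_dirs` and `create_job` are made up for illustration, and the demo writes into a temporary directory so nothing real is touched.

```python
from pathlib import Path
import tempfile


def job_dirs(root, project, job, input_types, programs):
    """Return the data and results directories for one job."""
    base = Path(root) / project
    data = [base / "data" / job / t for t in input_types]
    results = [base / "results" / job / p for p in programs]
    return data + results


def create_job(root, project, job, input_types, programs):
    """Create the directories for one job and return them."""
    dirs = job_dirs(root, project, job, input_types, programs)
    for d in dirs:
        # parents=True builds the intermediate levels; exist_ok=True makes reruns harmless
        d.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = create_job(
            tmp,
            "project_name",
            "job_name_1",
            ["input_data_type_1", "input_data_type_2", "input_data_type_3"],
            ["program_1", "program_2"],
        )
        for d in created:
            print(d.relative_to(tmp))
```

Adding a third program or a new job then means changing one argument rather than remembering the convention.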
What would happen if, instead of running 2 programs as I did above, I ran 5 or 6? And what if I had replicates for each input data file? What about maximising the number of steps run in parallel? You can start to see that things really get complicated.
File and directory naming conventions are something I am teaching myself, but any directives or systematic methods taught during my years as a computer science student would have come in handy now. In future bioinformatics lectures I will definitely challenge my students to think about this issue very carefully.
One thing I’m trying to remind myself of more often is that you need not limit yourself to just one file naming / directory structure for the same content. The directory tree is after all just an index of the files and you can, through hard or soft linking, have more than one of them at the same time.
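As a rough illustration of the linking idea, here is a small Python sketch that adds a second, program-centred view of a result file through a soft link. The `by_program` directory and the `link_view` helper are hypothetical names chosen for this example, and creating symlinks may need extra privileges on Windows.

```python
import os
import tempfile
from pathlib import Path


def link_view(source: Path, view: Path) -> Path:
    """Create a soft link at `view` that points at the existing file `source`."""
    view.parent.mkdir(parents=True, exist_ok=True)
    # A relative target keeps the link valid if the whole project tree is moved.
    target = os.path.relpath(source, start=view.parent)
    view.symlink_to(target)
    return view


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "project_name"
        # The "real" location, organised by job.
        source = root / "results" / "job_name_1" / "program_1" / "output_1"
        source.parent.mkdir(parents=True)
        source.write_text("example output\n")
        # A second index of the same file, organised by program.
        view = link_view(source, root / "by_program" / "program_1" / "job_name_1_output_1")
        print(view.read_text(), end="")
```

The file exists once on disk, but it can be reached through either tree.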
Nick Loman (@pathogenomenick)
I try to do this, and invariably fail. The best advice I give myself is to start simple and re-organise logically as complexity mounts up. Also, document and run every command with a Makefile or similar. It is a bit like writing an article. Sometimes trying to get this right at the start is futile, as bioinformatics experiments don’t progress in the linear way you might expect, constantly branching off into dead ends or useful avenues.
Nice metaphor, Nick, the one about writing an article. Yes, I also compare this problem to the way I organize my files as I create a software project: you never know where you will end and you want to avoid having to spend a lot of time trying to remember what you did 6 months ago…
I had the same problem while doing a pipeline for NGS-data with lots of different tasks (assembly and annotation using pfam & gene ontology). And the problem gets worse if you have to handle multiple users.
@cariaso: what is the benefit of the xml-file? You’d still have to make sure that you have a 1:1 relationship between entries in the xml and the filename or am I missing something?
Yes, you still need a 1:1 relationship. It’s minor extra work, but it adds these gains:
It avoids the fragile dependencies between filenames. (“I’ll parse X from filename Y, and use that for filename Z”)
It follows the same idea that database primary keys should be meaningless.
It makes it much easier for additional tools to play nicely together. (Program X changed its format. Program Y now reparses X’s output and generates a new file which follows the old format.)
I wish I could point to some authoritative source, but I’m just sharing a hard-learned principle I’ve adopted over the years. This is what I do for code I actually care about maintaining for a while.
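To make the manifest idea concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The element and attribute names (`manifest`, `file`, `meta`, `id`, `path`) and the helper functions are assumptions made for this example, not an established schema: each file gets a meaningless id, and everything a tool needs to know lives in the metadata rather than in the filename.

```python
import xml.etree.ElementTree as ET


def build_manifest(entries):
    """Build a manifest from (path, metadata dict) pairs."""
    root = ET.Element("manifest")
    for i, (path, meta) in enumerate(entries, start=1):
        # The id is a plain counter: it carries no meaning, like a database primary key.
        el = ET.SubElement(root, "file", id=f"f{i:04d}", path=path)
        for name, value in meta.items():
            ET.SubElement(el, "meta", name=name, value=value)
    return root


def find_paths(root, **wanted):
    """Return the paths of files whose metadata matches every requested key."""
    hits = []
    for el in root.iter("file"):
        meta = {m.get("name"): m.get("value") for m in el.iter("meta")}
        if all(meta.get(k) == v for k, v in wanted.items()):
            hits.append(el.get("path"))
    return hits


if __name__ == "__main__":
    manifest = build_manifest([
        ("results/job_name_1/program_1/output_1",
         {"job": "job_name_1", "program": "program_1"}),
        ("results/job_name_1/program_2/output_1",
         {"job": "job_name_1", "program": "program_2"}),
    ])
    print(ET.tostring(manifest, encoding="unicode"))
    print(find_paths(manifest, program="program_2"))
```

A downstream tool asks the manifest for "the output of program_2 for this job" instead of parsing that information out of a path, so files can be renamed or moved without breaking anything, provided the manifest is updated.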
What about just having a plain simple config file with the names of files?
If you prefer .ini, fine. I’d certainly stick with something standard to avoid the bad habits that come from rolling your own.
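For what it's worth, a standard .ini file can be read with Python's built-in `configparser`, so there is no need to invent a format. The sketch below is only an illustration: the section name, the keys, the `reads.fastq` filename and the `load_files` helper are all made up for the example.

```python
import configparser

# Hypothetical config: one section per job, one key per logical file.
EXAMPLE = """
[job_name_1]
reads = data/job_name_1/input_data_type_1/reads.fastq
program_1_output = results/job_name_1/program_1/output_1
"""


def load_files(text, job):
    """Return a {logical name: path} mapping for one job section."""
    # interpolation=None so a literal '%' in a path is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[job])


if __name__ == "__main__":
    for name, path in load_files(EXAMPLE, "job_name_1").items():
        print(name, "->", path)
```

Programs then refer to files by their logical name (`reads`, `program_1_output`) and never need to know how the path is spelled.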
Ah, okay. I agree on keeping meta-data out of the file names. I haven’t worried too much about this so far, as all files have some representation inside a database table in my applications. So it boiled down to keeping the files structured without worrying about meta-data at this level.
File naming conventions are fine for small tasks, but when it matters, never encode any data in filenames. For anything of size or importance, generate a single xml file which contains the filenames.