There have been three discussions on public datasets, held separately because we could not find a single date suitable for everyone. Peter Kunszt and Adam Srebniak were present at all three.
Discussion at IMSB, January 19th 2010
The first discussion was with Andreas Quandt at the IMSB (Aebersold lab). They would very much like a mechanism to unify the download and distribution of public dataset files. Issues raised by Andreas:
- The DAT format is preferred for download where available, as it contains the most information. Andreas uses A. Masselot's Perl scripts (InSilicoSpectro, available through CPAN) to convert them.
- FASTA file formats. These are not uniform but vary depending on the context in which they are used. Converters are needed if workflows are to be built.
- File names. Mascot restricts file names to 13 characters, so Andreas names his files according to a convention like sp_9606.fasta, where sp stands for SwissProt and 9606 is the taxonomy.
- Some binary formats add to or change the .fasta extension.
- Versioning is non-uniform. Some databases ship a version number with the package. People are usually interested in the latest version; storing old versions is expensive, and reanalysis with the latest data is often cheaper. For comparisons between old and new data, the smaller protXML format is usually used.
For Andreas, a future system must have the following features:
- Download scripts, modular and configurable, with the possibility to choose DAT or FASTA
- Splitting by taxonomy, with the split itself configurable
- Conversion into FASTA as well, with care taken about naming. A naming scheme should be agreed on, and be somewhat configurable.
- Optional binary formats, with their extension ADDED to .fasta for simplicity
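The download requirements above could be sketched in shell; the db_filename helper, the FORMAT variable and all names here are hypothetical illustrations, not an agreed design:

```shell
# Hypothetical sketch of a modular download step.
# Build a Mascot-safe file name (at most 13 characters) from a source
# abbreviation and a taxonomy id, e.g. sp_9606.fasta for SwissProt/human.
db_filename() {
    src=$1      # e.g. "sp" for SwissProt
    taxid=$2    # e.g. 9606 for human
    printf '%s_%s.fasta\n' "$src" "$taxid"
}

# Choose DAT or FASTA per database; in a real system FORMAT would come
# from a per-database configuration file rather than the environment.
FORMAT=${FORMAT:-fasta}
case $FORMAT in
    dat)   echo "fetch DAT release, convert with InSilicoSpectro" ;;
    fasta) echo "fetch $(db_filename sp 9606) directly" ;;
esac
```

A per-database configuration file would then only need to set the source abbreviation, the taxonomy split and the desired format.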
He suggested using RPM repositories for distribution and auto-updating.
Discussion at the Biozentrum, January 20th 2010
The second discussion was at the Biozentrum. Present were, from the Biozentrum: Michael Podvinec, Rainer Poehlmann, Konstantin Arnold and Torsten Schwede; as well as Can Tuerker and Hubert Rehrauer (FGCZ) and Hans-Rudolf Hotz (FMI). To start, we discussed the basic problem and how each site is dealing with it. We saw three rather different approaches:
- The Biozentrum runs a weekly update, switching from the old to the new data early Monday morning. This way the public datasets are always up to date, i.e. at most one week out of sync.
- The FGCZ lets each project grab its own datasets, process them into whatever format is needed and update on its own schedule, arguing that every project does things differently. Some want the latest and greatest; some do not want to change anything for a year, to keep their code stable. If users ask the support staff, they will also download datasets, but this happens only on demand, never automatically.
- The FMI has a central (public) data repository managed by the Computational Biology group. Individual research groups share this data. New datasets are downloaded as they become available (e.g. a new human genome assembly) or created as required by the project (e.g. an updated annotation database) and put into versioned directories. Since all those datasets are small, old data is kept in its original location.
The Biozentrum automatic system works as follows:
- Mirror a public resource to a local drive (wget, rsync, mirror, ftp or whatever tool suits) - this is done by a cron job executing a makefile every day
- Copy this to a central directory called /pending
- Here processing is done, such as unzipping, indexing and joins, depending on the package and its needs
- The copying and processing are logged and displayed on a status web page during the week
- If everything is OK, /pending is renamed to /database early Monday morning, and the previous /database becomes the new /pending (the two are simply switched). Then the cron update cycle starts again.
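The weekly cycle above can be sketched in a few shell commands; $ROOT and the file names are stand-ins for the real Biozentrum paths, and the real system drives this from a cron-executed makefile:

```shell
# Sketch of the Biozentrum-style switch between the freshly processed
# pending tree and the live database tree (paths are illustrative).
ROOT=$(mktemp -d)                  # stand-in for the real filesystem root
mkdir -p "$ROOT/pending" "$ROOT/database"

# ...during the week: mirror and process into $ROOT/pending...
touch "$ROOT/pending/uniprot.fasta"

# Early Monday, if all checks passed: swap the two directories, so the
# database path always shows a complete, consistent tree.
mv "$ROOT/database" "$ROOT/database.old"
mv "$ROOT/pending"  "$ROOT/database"
mv "$ROOT/database.old" "$ROOT/pending"   # old data becomes next cycle's workspace
```

The three renames make the cut-over near-instantaneous; users never see a half-processed tree under the database path.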
A future system must provide the following features or improvements:
- Modularize the update. Some databases need more frequent updates, or updates on a different day, say Wednesday; others do not need to be checked every week.
- Clear and modular versioning. Just switching back and forth between latest and pending is not good enough, sometimes more versions need to be kept.
- The makefile method is bad: if something fails in the middle, the rest of the processing never happens, even when it is independent of the failure
- Dependencies need to be clear, i.e. which download or processing step happens before another; this is where make was good
- Download methods for each dataset should be available, so that people can just take the script or code and use it to download directly on their own
- Processing scripts must also be versioned by the system, as modules that anyone could get and use
- A framework of configurable triggers to fetch and process data would suit everyone
- Ability to get each other's processed data
We discussed various implementation options, including RPM and SVN, and Adam and I got a quick run-through of the makefile and the web status monitor.
Discussion at VitalIT, January 21st 2010
The third discussion was held at VitalIT. Present were: Jacques Rougemont from the BBCF, and Heinz Stockinger, Marco Pagni and Roberto Fabbretti from VitalIT. I quickly summarized what was said at the previous meetings.
At VitalIT there is a system that is quite successful. Everyone sees a mount on the VitalIT cluster called /db, where people can create their own directory, download data and process it on the spot. The result is then visible to everyone else, so they can simply send a message to their collaborators stating what data is available. Issues:
- One problem is that the cluster's support staff have no idea which data is where; there is certainly a lot of duplication.
- They also have no idea which data is used by whom or how data are updated, except for the sets they use or update themselves.
- Sometimes the quality of the available data is very low; the scripts used for processing are sometimes available, but mostly not.
- The format of the datasets is an issue here too
- It is unknown which tools are being used and how they differ
- There is no self-documenting page listing what is available; nobody really has the full picture
We discussed versioning extensively. The same conclusions were drawn as in Basel, i.e. versions can be kept by naming the directories appropriately, with a symlink pointing to the latest version. Some databases have incremental updates that are not versioned (e.g. EMBL); a method is needed to deal with those.
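A minimal shell sketch of this directory-per-version layout with a "latest" symlink; the database name, paths and dates are illustrative:

```shell
# One directory per version, plus a "latest" symlink that is atomically
# repointed on release (illustrative layout, not the agreed scheme).
DB=$(mktemp -d)/uniprot            # stand-in for e.g. /db/uniprot
mkdir -p "$DB/2010-01" "$DB/2010-02"

ln -sfn 2010-01 "$DB/latest"       # release the January version
ln -sfn 2010-02 "$DB/latest"       # -n replaces the link itself, not its target

# Projects that pin a version use the dated path; everyone else follows latest.
readlink "$DB/latest"
```

Old versions can then be removed or archived without touching the link, and a pinned project keeps working as long as its dated directory exists.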
We discussed several use cases. In all of them, the end user must have a super easy, intuitive interface to just work with the data. Marco said it must be no more difficult than the SVN clients. Heinz also argued for web-based tools, which would be easy to provide on top of command-line tools.
1. Reusing older data
- Check out database
- Work with a given version
- Drop it
- Be able to check out the same version again much later
2. Publishing your own data
- Send the data somewhere
- Provide metadata and documentation
3. Automate update of data
- Check whether newer data is available
- If yes, fetch the new data as a new version
- Keep or drop the old version according to policy
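The steps of use case 3 could be sketched in shell as follows; remote_version, the release numbers and the retention policy are all hypothetical stand-ins:

```shell
# Hypothetical sketch of an automated update: fetch only when the remote
# release differs from the local one, store it as a new versioned
# directory, and prune old versions by policy.
DB=$(mktemp -d)/embl
KEEP=2                                    # policy: keep the two newest versions

remote_version() { echo "103"; }          # stand-in for a real FTP/HTTP check
local_version()  { readlink "$DB/latest" 2>/dev/null; }

mkdir -p "$DB/101" "$DB/102"              # pretend two releases exist already
ln -sfn 102 "$DB/latest"

if [ "$(remote_version)" != "$(local_version)" ]; then
    new=$(remote_version)
    mkdir -p "$DB/$new"                   # download and processing would land here
    ln -sfn "$new" "$DB/latest"
fi

# Drop versions beyond the retention policy, newest first.
ls "$DB" | grep -v '^latest$' | sort -rn | tail -n +$((KEEP + 1)) |
while read -r old; do rm -rf "$DB/$old"; done
```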
4. Select databases using a web form
- A web page should list available datasets
- People should be able to choose the datasets they need based on various criteria
- Either they can download the data directly
- Or trigger the correct processing and receive a link to the result by email
We have also discussed that in order to achieve any kind of automation, the structure of the filesystem hosting the data must be agreed on. One idea jotted on the whiteboard looked like this
/raw data with md5 checksums of download
This is just one idea; we had many variations. We discussed implementations, e.g. based on RPM or SVN, but always found cases where the chosen existing system was not suitable. The agreement was that, to the client, it must be an easy-to-use system.
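The /raw-with-checksums idea from the whiteboard could look like this in shell; all paths, the accession and the example file are purely illustrative:

```shell
# Sketch: store each downloaded file next to its md5 checksum, so anyone
# (including a later re-download) can verify the copy is intact.
RAW=$(mktemp -d)/raw/uniprot       # stand-in for e.g. /raw/uniprot
mkdir -p "$RAW"
printf '>sp|P99999|EXAMPLE\nMKV\n' > "$RAW/sp_9606.fasta"   # pretend download

# Record the checksum at download time...
( cd "$RAW" && md5sum sp_9606.fasta > sp_9606.fasta.md5 )

# ...and verify it later, e.g. before reprocessing.
( cd "$RAW" && md5sum -c sp_9606.fasta.md5 )
```

Checking into the same directory keeps raw data and checksum together, so the pair can be copied or mirrored as a unit.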
Peter and Adam will clarify open questions as needed, gather the existing code, and come up with a suggestion for how to proceed.
This will be posted on this wiki and discussed online.