GENOMIC UNIFIED SCHEMA (GUS) 3.5
GUS supports Oracle and PostgreSQL. Since we chose PostgreSQL, this manual covers only the PostgreSQL installation. Check the GUS home page or the GUS project on SourceForge for guidance on installing and using GUS with Oracle.
– Perl 5.8.1 or above;
– Perl Modules: DBI, DBD::Oracle and/or DBD::Pg, Parse::Yapp, XML::Simple, and XML::Parser or XML::SAX;
– Java 1.5 or above, including the SDK. We experienced a problem with the JDBC driver included in Java 1.4_06 with GUS installed on Fedora Core 4. You could try that version, but we recommend Java 1.5.
– PostgreSQL 7.4 or above;
– Apache Ant 1.6.2 or above;
– Tomcat 5.0 or above (if you wish to install WDK).
Command lines beginning with # need root access; those beginning with $ may be executed by regular users. We use $YOUR_PATH as notation for “any directory you desire”. We strongly suggest installing all dependencies under a single directory, e.g. /usr/local/.
$ ./configure
$ su
# gmake
# gmake install
# adduser postgres
From this point on we create the database itself. Depending on the work your group develops, the database may need a large amount of hard disk space. With that in mind, the database may be allocated anywhere on your hard disk.
# mkdir /$YOUR_PATH/data
# chown postgres /$YOUR_PATH/data
# su - postgres
$ /usr/local/pgsql/bin/initdb -D /$YOUR_PATH/data
Now you can start PostgreSQL:
$ /usr/local/pgsql/bin/postmaster -D /$YOUR_PATH/data -l logfile 2>&1 &
or
$ export PGDATA=/$YOUR_PATH/data
$ /usr/local/pgsql/bin/pg_ctl start -l filename.log &
pg_ctl starts postmaster
Variable PGDATA can be set in the user profile file.
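For example, PGDATA can be made permanent by appending the export to the profile. A minimal sketch, in which a local example file stands in for ~/.bash_profile and the data directory path is only an example:

```shell
# Persist PGDATA in a profile file (a local example file stands in for ~/.bash_profile).
PROFILE=./bash_profile.example
echo 'export PGDATA=/usr/local/pgsql/data' >> "$PROFILE"
. "$PROFILE"          # reload the profile
echo "$PGDATA"        # the variable is now set for every new session
```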
Check the logfile to see if the database was successfully started.
You can also check the status of PostgreSQL by typing:
$ export PGDATA=/$YOUR_PATH/data
$ /usr/local/pgsql/bin/pg_ctl status
To stop PostgreSQL, type:
$ export PGDATA=/$YOUR_PATH/data
$ /usr/local/pgsql/bin/pg_ctl stop
To view other options of pg_ctl:
$ /usr/local/pgsql/bin/pg_ctl --help
Create a test database:
$ /$YOUR_PATH/pgsql/bin/createdb test
$ /$YOUR_PATH/pgsql/bin/psql test
Edit .bash_profile or /etc/profile (where [POSTGRES_PATH] is the directory where you installed PostgreSQL on your server):
export PSQL_HOME=/PATH/[POSTGRES_PATH]/
export PATH=$PSQL_HOME/bin:$PATH
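After reloading the profile (e.g. source ~/.bash_profile), you can confirm that the PostgreSQL bin directory now leads the PATH. A small sketch; the install path is only an example:

```shell
# Reproduce the profile exports and inspect the first PATH entry.
export PSQL_HOME=/usr/local/pgsql      # example install path
export PATH=$PSQL_HOME/bin:$PATH
echo "$PATH" | cut -d: -f1             # should print the pgsql bin directory
```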
For easier GUS administration, the dependencies' paths should be included either in the BASH profile (/home/username/.bash_profile) of each user using or administrating GUS, or in /etc/profile as a global setting for all users; alternatively, you might create a group for GUS users and grant these permissions to that group.
I needed a way to grant permissions on all tables of a given schema. Not really knowing how, I followed someone's advice. The script below adds SELECT permission on the dots, core and sres schemas to a role called "aluno"; I later added three users to that role.
SELECT 'grant select on ' || n.nspname || '.' || c.relname || ' to \"aluno\";'
FROM pg_class c, pg_namespace n
WHERE c.relnamespace = n.oid
  AND c.relname NOT LIKE 'pg_%'
  AND c.relkind IN ('r','v','s')
  AND n.nspname IN ('dots','core','sres');
# tar xzvf apache-ant-x.x.x-bin.tar.gz
Add the following to .bash_profile or /etc/profile:
export ANT_HOME=/$YOUR_PATH/apache-ant-x.x.x/
export PATH=$ANT_HOME/bin:$PATH
# sh jdk-x_x_x_x-linux-i586.bin
Add the following to .bash_profile or /etc/profile:
export JAVA_HOME=/$YOUR_PATH/jdkx.x.x_xx/
export PATH=$JAVA_HOME/bin:$PATH
On Ubuntu 7.10, the DBD::Pg module can be installed like this:
sudo apt-get install libdbd-pg-perl
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  libdbd-pg-perl
0 upgraded, 1 newly installed, 0 to remove and 4 not upgraded.
Need to get 129kB of archives.
After unpacking 385kB of additional disk space will be used.
Get:1 http://br.archive.ubuntu.com gutsy/universe libdbd-pg-perl 1.49-2build1 [129kB]
Fetched 129kB in 1s (88.0kB/s)
Selecting previously deselected package libdbd-pg-perl.
(Reading database ... 33537 files and directories currently installed.)
Unpacking libdbd-pg-perl (from .../libdbd-pg-perl_1.49-2build1_i386.deb) ...
Setting up libdbd-pg-perl (1.49-2build1) ...
CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/postgresql-x.X-xxx.jar
export CLASSPATH
Old manual generated by Poliana
Create /GUS/ directory in your chosen path:
$ mkdir GUS
Create database “$GUS_DB” (in PostgreSQL) [$GUS_DB as “any name you might want”]
$ createdb $GUS_DB
Create database user “$GUS_USER” (in PostgreSQL as well) [$GUS_USER as “any username you might be creative enough to think of”]
$ createuser $GUS_USER
Our research group wrote a script to facilitate this process. Copy this script to $GUS_HOME and fill in the following lines in build.properties:
# Properties for the build.xml buildfile
# Directories to be created
dir.project_home = $YOUR_GUS_PATH/project_home
dir.gus_home = $YOUR_GUS_PATH/gus_home
Where $YOUR_GUS_PATH is the directory where GUS will be installed (e.g., /GUS/). And further down, in the same file:
# Values for the properties of the gus.properties file
gus.propriedade.dbVendor.valor = Postgres
gus.propriedade.dbiDsn.valor = dbi:Pg:dbname=gustest
gus.propriedade.jdbcDsn.valor = jdbc:postgresql:gustest
gus.propriedade.databaseLogin.valor = usertest
gus.propriedade.databasePassword.valor = usertest
gus.propriedade.userName.valor = dba
gus.propriedade.group.valor = dba
gus.propriedade.project.valor = Database administration
All these lines come filled in by default with the values listed above; adjust them to match the database and the user you created in Postgres, as shown earlier. Next, change the /GUS path permissions to give GUS admins access:
# chown -R username:group /usr/local/GUS
Add the following lines to .bash_profile for each user administrating GUS (or as root in /etc/profile as a global pattern):
export GUS_HOME=/$YOUR_PATH/GUS/gus_home
export PROJECT_HOME=/$YOUR_PATH/GUS/project_home
export GUS_CONFIG_FILE=/$YOUR_PATH/GUS/gus.properties
export PERL5LIB=$GUS_HOME/lib/perl
export PATH=$GUS_HOME/bin:$PROJECT_HOME/install/bin:$PATH
To register the basic plugins, use the following command:
$ ga +meta --commit
Bug:
[amanda1@kineto4 ~]$ ga +create GUS::Supported::Plugin::LoadGusXml
FATAL: Database version does not match required version for Plugin. Database Version: Plugin Version: 3.5
FATAL: Failed running 'ga +meta --commit' with stderr:
Solution:
INSERT INTO "databaseversion" ("database_version_id", "version", "modification_date", "user_read", "user_write", "group_read", "group_write", "other_read", "other_write", "row_user_id", "row_group_id", "row_project_id", "row_alg_invocation_id")
VALUES (1, 3.5, '2007-08-02 16:02:31.694305', 1, 1, 1, 1, 1, 0, 1, 1, 1, 1);
The GUS WDK is designed to accelerate the creation of "data mining" websites. It can work on any relational database system and on any schema.
Download latest WDK version;
Decompress WDK_version_x-x.tar.gz in /$PROJECT_HOME;
Execute:
build WDKToySite install -append;
Edit the following lines of toyModel-config.xml, located in $GUS_HOME/config:
<modelConfig login="usertest"
             password="usertest"
             connectionUrl="jdbc:postgresql:gustest"
             queryInstanceTable="QueryInstance"
             maxQueryParams="50"
             platformClass="org.gusdb.wdk.model.implementation.PostgreSQL"
             initialSize="3"
             maxActive="20"
             maxIdle="3"
             minIdle="3"
             maxWait="500"/>
Create a Postgres schema for the GUS DB:
$ psql -U usertest gustest
gustest=# CREATE SCHEMA usertest;
gustest=# \q
Execute the following commands: wdkCache, wdkTestDb, wdkSanityTest
$ wdkCache -model toyModel -new
$ wdkTestDb -model toyModel -create
ERROR: column "row_project_id" is of type numeric but expression is of type character varying
java.sql.SQLException: ERROR: column "row_project_id" is of type numeric but expression is of type character varying
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1471)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1256)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:330)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:282)
        at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:233)
        at org.gusdb.wdk.model.implementation.SqlUtils.getResultSet(SqlUtils.java:152)
        at org.gusdb.wdk.model.test.TestDBManager.main(TestDBManager.java:116)
To get around it, you can manually add data to the tables.
1. Manually edit and remove the first line of each of the files in WDKToySite/Model/data/testTables.
2. Run the command "wdkTestDb -model toyModel -create" 8 times to create the 8 tables.
3. Load data using \copy from Postgres, as described below.
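Step 1 need not be done by hand; a small loop can strip the header line of every file. A sketch: the directory name mirrors the transcript below, and the sample file contents are invented for illustration:

```shell
# Strip the first (header) line of each test-table file before loading.
mkdir -p modifiedTestTables
printf 'HEADER\ndata1\ndata2\n' > modifiedTestTables/WDKTestEst   # invented sample file
for f in modifiedTestTables/*; do
  sed -i '1d' "$f"       # GNU sed; on BSD/macOS use: sed -i '' '1d' "$f"
done
head -n 1 modifiedTestTables/WDKTestEst   # only the data rows remain
```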
cd /home/kary1/data/modifiedTestTables
[kary1@kineto4 modifiedTestTables]$ psql
Password:
Welcome to psql 8.2.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

kary1=> \copy wdktestassembly from 'WDKTestAssembly' with null as ''
kary1=> \copy wdktestest from 'WDKTestEst' with null as ''
kary1=> \copy wdktestassemblysequence from 'WDKTestAssemblySequence' with null as ''
kary1=> \copy wdktestlibrary from 'WDKTestLibrary' with null as ''
kary1=> \copy wdktestzipcode from 'WDKTestZipCode'
kary1=> \copy wdktestgoterm from 'WDKTestGoTerm'
kary1=> \copy wdktesttaxonname from 'WDKTestTaxonName'
kary1=> \copy wdktestexternalnasequence from 'WDKTestExternalNaSequence'
Now you can proceed to the sanity tests.
$ wdkSanityTest -model toyModel
The last command should print a message like the following:
14 queries passed, 0 queries failed (the number of queries passed may change from version to version)
2 records passed, 0 records failed
6 questions passed, 0 questions failed
Sanity Test PASSED
For further documentation on WDK, check GUS Web Development Kit website.
Download latest Apache Tomcat version;
Decompress in your chosen directory;
Edit /$YOUR_PATH/jakarta-tomcat-x.x.xx/conf/server.xml, starting in line 376:
<Context path="/wdktoysite"
         docBase="/usr/local/GUS/gus_home/webapp"
         debug="1"
         privileged="false"
         allowLinking="true" >
  <Logger className="org.apache.catalina.logger.FileLogger"
          prefix="wdktoysite-log." suffix=".txt" timestamp="true"
          docBase="/usr/local/GUS/gus_home/webapp/logs" />
</Context>
</Host>
In /$YOUR_PATH/jakarta-tomcat-x.x.xx/conf/tomcat-users.xml add the following line:
<user username="admin" password="admin" roles="manager,role1,tomcat"/>
</tomcat-users>
Inside /$GUS_HOME create a directory named /webapp, and inside it, create a directory named /logs;
Still in /$GUS_HOME, create a file named webapp.conf, using the following command line:
$ echo "webappTargetDir=/usr/local/GUS/project_home/webapp/" > webapp.conf
Then, build WDKToySite:
$ build WDKToySite webinstall -append -webPropFile webapp.conf
Inside $GUS_HOME/webapp execute the following commands:
$ cd WEB-INF
$ cp web.xml.toy web.xml
$ cd tags/site
$ cp footer.tag.toy footer.tag
$ cp header.tag.toy header.tag
Initialize Tomcat:
# /$YOUR_PATH/jakarta-tomcat-x.x.xx/bin/startup.sh
Make sure port 8080 is accepting connections:
# /sbin/iptables -I INPUT -p tcp --dport 8080 -j ACCEPT
To test your WDK, open a browser and access http://localhost:8080/wdktoysite.
The GUS Schema comes with several plugins, and different groups have different needs, so the plugins you need may not all be listed below. Search the mailing list for help with a plugin not listed here, or email the gusdev mailing list to get help from the community. The plugins are listed in the following order because entries made by one plugin may be needed by the next. The descriptions are currently copied from each plugin's own help. Any value marked with a $ (as in $YOUR_DB_NAME) is a field to be filled in with information from your project, or from yourself.
DESCRIPTION
Inserts data from a simple GUS XML format into GUS.
$ ga +create GUS::Supported::Plugin::LoadGusXml --commit
$ ga GUS::Supported::Plugin::LoadGusXml --filename /$YOUR_PATH/ExternalDatabase.xml --commit
$ ga GUS::Supported::Plugin::LoadGusXml --filename /$YOUR_PATH/ExternalDatabaseRls.xml --commit
$ ga GUS::Supported::Plugin::LoadGusXml --filename /$YOUR_PATH/SequenceType.xml --commit
The XML files can be found at http://www.gusdb.org/wiki/index.php/Bootstrap%20data You may write your own XML files, but that requires some experience, not to mention some knowledge of the GUS Schema. We suggest, however, that you add the following entries to ExternalDatabase.xml and ExternalDatabaseRls.xml:
These entries refer to several database repositories that are needed when the LoadNRDB plugin is used (below). In the XML files provided by GUS Schema (*where? that's a good question*), the entries needed for it are marked with (nrDB). However, a few entries are missing: genpept, lcl, tpd, tpe, tpg. There are two ways to solve this:
1) Edit the ExternalDatabase.xml and ExternalDatabaseRls.xml files. The originals of both files hold entries for several external databases that you will probably not use, so there is no real need to load them. You can edit both files so that only the databases you will actually use are loaded. Below is a model for both files:
This is the model from ExternalDatabase.xml
<SRes::ExternalDatabase>
  <external_database_id>$DATABASE_ID</external_database_id>
  <name>$DATABASE_NAME</name>
</SRes::ExternalDatabase>
Where $DATABASE_NAME stands for the name of the external database you need, and $DATABASE_ID for its number. The plugin GUS::Supported::Plugin::InsertExternalDatabase (described below) will check for repeated external database ids, and will crash if it finds one.
The following XML structure refers to ExternalDatabaseRls.xml:
<SRes::ExternalDatabaseRelease>
  <external_database_id>$DATABASE_ID</external_database_id>
  <id_type>$ID_TYPE</id_type>
  <description>$DESCRIPTION</description>
  <version>$VERSION</version>
</SRes::ExternalDatabaseRelease>
Where $DATABASE_ID must be the same one present in ExternalDatabase.xml (or the plugin will crash); $ID_TYPE describes the type of entry (like accession for GenBank entries, etc. *need to clarify this*); $DESCRIPTION is where you can type a short description of your release; and $VERSION stands for the version of your release. Filling in these fields for each external database you will be using, and loading a single XML file with all the external databases your group needs, may feel handier than the alternative:
2) Use InsertExternalDatabase and InsertExternalDatabaseRls to load your external databases. Both plugins exist to do the work of loading this information into the correct tables in GUS. If you use many external databases, loading all that information at once in an XML file (as described above) may feel a lot handier than repeating many command lines (both plugins are described below). These plugins are useful, however, if you need to load only one or two extra databases and don't want to bother mapping them in an XML file.
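As a concrete illustration of option 1, here is a filled-in pair of entries for a hypothetical "genbank" database; the id 22 and version 148.0 are invented examples, not values your installation must use:

```xml
<SRes::ExternalDatabase>
  <external_database_id>22</external_database_id>
  <name>genbank</name>
</SRes::ExternalDatabase>

<SRes::ExternalDatabaseRelease>
  <external_database_id>22</external_database_id>
  <id_type>accession</id_type>
  <description>GenBank nucleotide entries</description>
  <version>148.0</version>
</SRes::ExternalDatabaseRelease>
```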
DESCRIPTION
Simple plugin that is the easiest way to create a row representing a new database that exists in the outside world that will be imported into GUS. This entry serves as a stable identifier across multiple releases of the database (which are stored in SRes.ExternalDatabaseRelease and point back to the entry created by this plugin). Protects against making multiple entries in GUS for an external database that already exists there. Creates a new entry in table SRes.ExternalDatabase to represent a new source of data imported into GUS.
$ ga +create GUS::Supported::Plugin::InsertExternalDatabase --commit
$ ga GUS::Supported::Plugin::InsertExternalDatabase --name $EXT_DATABASE_NAME --commit
DESCRIPTION
Simple plugin that is the easiest way to create a row representing a new release of a database from an external source. Protects against making an entry for a version that already exists for a particular database. Creates new entry in table SRes.ExternalDatabaseRelease for new external database versions.
$ ga +create GUS::Supported::Plugin::InsertExternalDatabaseRls --commit
$ ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName $YOUR_DATABASE_NAME --databaseVersion $VERSION --commit
DESCRIPTION
Populates the SRes::Taxon, SRes::GeneticCode, and SRes::TaxonName tables using the nodes.dmp, names.dmp, gencode.dmp and merged.dmp files downloaded from NCBI's taxonomy FTP site.
$ ga +create GUS::Supported::Plugin::LoadTaxon --commit
$ ga GUS::Supported::Plugin::LoadTaxon --nodes /$YOUR_PATH/nodes.dmp --names /$YOUR_PATH/names.dmp --gencode /$YOUR_PATH/gencode.dmp --merged /$YOUR_PATH/merged.dmp --commit
DESCRIPTION
This plugin either inserts or updates a row in the database, depending upon whether you provide a primary key as one of the attributes. If you do, that row will be read, updated with the attribute values you supply, and written. Otherwise, a new row will be inserted.
$ ga +create GUS::Supported::Plugin::LoadRow --commit
$ ga GUS::Supported::Plugin::LoadRow --tablename Core::UserInfo --attrlist "login,password,first_name,last_name,e_mail" --valuelist $A_LOGIN_HERE^^^$A_PASSWORD_HERE^^^$YOUR_1ST_NAME^^^$YOUR_LAST_NAME^^^$A_EMAIL --commit
$ ga GUS::Supported::Plugin::LoadRow --tablename Core::GroupInfo --attrlist name --valuelist $A_GROUP_NAME --commit
If you want to test, you may use our XML file below:
$ echo "<Core::ProjectInfo>
  <name>test</name>
  <description>No projects were hurt or killed during this test</description>
</Core::ProjectInfo>
//" > testproject.xml
Using the following command:
$ ga GUS::Supported::Plugin::LoadGusXml --filename testproject.xml --commit
DESCRIPTION
Extracts the id, name, and def fields from a so.obo file and inserts new entries into the SRes.SequenceOntology table in the form so_id, ontology_name, so_version, so_cvs_version, term_name, definition. The plugin InsertSequenceOntologyOBO replaces the old InsertSequenceOntology, because the old .definition format became outdated. The new plugin uses the .obo format, available from the Sequence Ontology project.
The following change in the database has to be made for the Sequence Ontology to be loaded properly:
gustest=# ALTER TABLE sres.sequenceontology ALTER definition TYPE character varying(2000);
Also, several primary key constraints have to be dropped for the process to work properly. These keys are located in the sres.goevidencecode, sres.gorelationship, sres.gorelationshiptype, sres.gosynonym, and sres.goterm tables. Following is an example, using the sres.gorelationship table:
gustest=# \d sres.gorelationship
                 Table "sres.gorelationship"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 go_relationship_id      | numeric(10,0)               | not null
 parent_term_id          | numeric(10,0)               | not null
 child_term_id           | numeric(10,0)               | not null
 go_relationship_type_id | numeric(10,0)               | not null
 modification_date       | timestamp without time zone | not null
 user_read               | numeric(1,0)                | not null
 user_write              | numeric(1,0)                | not null
 group_read              | numeric(1,0)                | not null
 group_write             | numeric(1,0)                | not null
 other_read              | numeric(1,0)                | not null
 other_write             | numeric(1,0)                | not null
 row_user_id             | numeric(12,0)               | not null
 row_group_id            | numeric(4,0)                | not null
 row_project_id          | numeric(4,0)                | not null
 row_alg_invocation_id   | numeric(12,0)               | not null
Indexes:
    "gorelationship_pk" PRIMARY KEY, btree (go_relationship_id)
    "gorelationship_ind01" btree (parent_term_id, child_term_id)
    "gorelationship_ind02" btree (child_term_id, parent_term_id)
    "gorelationship_ind03" btree (go_relationship_type_id)
Foreign-key constraints:
    "gorelationship_fk02" FOREIGN KEY (child_term_id) REFERENCES sres.goterm(go_term_id)
    "gorelationship_fk03" FOREIGN KEY (go_relationship_type_id) REFERENCES sres.gorelationshiptype(go_relationship_type_id)
    "gorelationship_fk01" FOREIGN KEY (parent_term_id) REFERENCES sres.goterm(go_term_id)
The foreign keys should be dropped with the following command (e.g., dropping gorelationship_fk02):
gustest=# alter table sres.gorelationship drop constraint gorelationship_fk02;
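Rather than looking up and dropping each constraint by hand, the system catalogs can generate the DROP statements for you. A sketch using pg_constraint (contype 'f' marks foreign keys); adjust the table list to your needs and review the generated statements before running them:

```sql
SELECT 'ALTER TABLE ' || n.nspname || '.' || c.relname
       || ' DROP CONSTRAINT ' || con.conname || ';'
FROM pg_constraint con
     JOIN pg_class c ON con.conrelid = c.oid
     JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE con.contype = 'f'
  AND n.nspname = 'sres'
  AND c.relname IN ('goevidencecode', 'gorelationship',
                    'gorelationshiptype', 'gosynonym', 'goterm');
```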
These are the commands for InsertSequenceOntologyOBO:
$ ga +create GUS::Supported::Plugin::InsertSequenceOntologyOBO --commit
$ ga GUS::Supported::Plugin::InsertSequenceOntologyOBO --inputFile /home/thiago/work/GO/so.definition --soCvsVersion '1.42' --soVersion '1.419' --user $USER --group $GROUP --commit
DESCRIPTION
Module used to parse GenBank records into the DB. Replaced by InsertSequenceFeatures.
$ ga +create GUS::Community::Plugin::GBParser --commit
$ ga GUS::Community::Plugin::GBParser --file $YOUR_PATH/$ANY_FILE.gb --db_rel_id 22 --gbRel '148.0' --commit
Where db_rel_id refers to the column "external_database_id" in the sres.externaldatabase table, and gbRel is the GenBank release number found at the NCBI FTP site. GBParser is the old plugin used by GUS to load sequences in GenBank format into its tables. It is no longer supported by the GUS community because it can no longer handle all the features present in the GenBank format. Its successor, InsertSequenceFeatures, handles features using an XML file to map them.
DESCRIPTION
Inserts or updates sequences from a FASTA file or a set of FASTA files. A set of regular expressions provided on the command line extracts from the definition lines of the input sequences the various pieces of information to store in the database.
$ ga +create GUS::Supported::Plugin::LoadFastaSequences --commit
$ ga GUS::Supported::Plugin::LoadFastaSequences --externalDatabaseName '$YOUR_EXT_DB_NAME' --externalDatabaseVersion '$YOUR_EXT_DB_VERSION' --regexSourceId 'gb\|(\w+)' --sequenceFile /$YOUR_PATH/$YOUR_SEQUENCE_FILE.fasta --regexChromosome 'chromosome\s(\w+)' --regexDesc '\|\s(.*$)' --sequenceTypeName 'DNA' --tableName 'DoTS::ExternalNASequence' --regexSeqVersion 'gb\|\w+\.(\d+)' --commit
DESCRIPTION
Plugin that inserts and updates rows in ExternalAASequence and NRDBEntry using data in the nr protein files downloaded from NCBI.
$ ga GUS::Supported::Plugin::LoadNRDB --externalDatabaseName "$YOUR_NR_NAME" --externalDatabaseVersion "$NR_VERSION" --gitax gi_taxid_prot.dmp --nrdbFile /$YOUR_PATH/$YOUR_NR_FILE --sourceDB "$YOUR_DATABASE_ID:$YOUR_DATABASE_NAME" --tempLogin $YOUR_LOGIN --tempPassword $YOUR_PASSWD --dbiStr 'dbi:Pg:dbname=$YOUR_DB_NAME' --maketemp --restartTempTable --plugin --verbose --commit
Note: the nr database needs an external database reference, which should be created with the InsertExternalDatabase plugin or loaded via ExternalDatabase.xml (as described before). The several databases that make up nr also need external database references. They are: gb (GenBank), emb (EMBL Data Library), dbj (DNA Database of Japan), pir (NBRF PIR), prf (Protein Research Foundation), sp (Swiss-Prot), pdb (Brookhaven Protein Data Bank), pat (Patents), bbs (GenInfo Backbone Id), gnl (General Database Identifier), ref (NCBI Reference Sequence), lcl (Local Sequence Identifier), genpept (GenPept), and tpe, tpd, tpg (Third Party Annotation entries from EMBL, DDBJ and GenBank, respectively).
DESCRIPTION
Insert files containing NA sequence and features that are in a format handled by bioperl's Bio::SeqIO parser package (eg, genbank, embl, TIGR). See that package for more details about what the bioperl parser can handle. Also supports files in GFF2 and GFF3 format.

(Note that, like all plugins that start with Insert, this plugin only inserts rows and never updates them. This means that the plugin cannot be used to update an existing set of sequences in the database. It assumes a database management strategy that rebuilds from scratch the set of sequences instead of attempting the much more difficult task of updating. This strategy will work for most datasets, but not for huge ones such as the human genome.)

The sequence level attributes that are currently supported are: taxon, SO assignment, comments, keywords, secondary accessions and references. The sequence may be absent from the input file, but in that case it must have been loaded into GUS already. (The seqIdColumn argument indicates which column of the subclass specified by the naSequenceSubclass argument should be matched against the input to locate the proper sequence in the database.)

This plugin is designed to be flexible in its mapping from the features and qualifiers found in the input to the tables in GUS. The flexibility is specified in the "XML mapping file" provided on the command line (--mapFile). A default mapping is provided as a reference (see $GUS_HOME/config/genbank2gus.xml). That file provides a mapping for the complete genbank feature table. It is often the case that your input may come from unofficial sources that, while conforming to one of the bioperl supported formats, invent features or qualifiers, and stuff unexpected values into already existing qualifiers. In that case, you will need to write your own mapping file. Use the provided default as an example.
The main purpose of the file is to specify which subclass of NAFeature should store the input feature, and which columns should store which qualifiers. Another way to generate a template mapping file is by using the command reportFeatureQualifiers. This command analyzes a set of input files and reports the feature/qualifier structure found in them. It can output the report in simple text form, or in XML that is a template for the mapping file.

There is an additional level of configurability provided by "plugable" qualifier handlers. As you will see in the default XML file, some qualifiers do not fit neatly into a column in the feature table. Those that don't are called "special cases." In the XML file you can declare the availability of one or more special case qualifier handlers to handle those special cases. The default file declares one of these at the top of the file. Your XML file can declare additional handlers that you write. (See the code in the default handler to learn how to write your own.) In the XML file, qualifiers that need special handling specify the name of the handler object and a method in it to call to handle the qualifier.

You can also use a qualifier handler to modify or even ignore the feature that the qualifier belongs to. An argument to the handler is $feature, so if you need to modify it directly you can. Also, if the handler returns undef, that is a signal to force the entire feature to be ignored.

An advanced use of the handler exploits the fact that the handler method is an instance method on a handler object that lives throughout the plugin run. So, you can store state in the handler object during one call to a handler method, and that state will be available to other calls. This is a way to combine data from more than one qualifier.
$ ga +create GUS::Supported::Plugin::InsertSequenceFeatures --commit
$ ga GUS::Supported::Plugin::InsertSequenceFeatures --mapfile $YOUR_PATH/genbank2gus.xml --seqFile YOUR_GB_SEQUENCE --fileFormat $FILE_FORMAT --extDbName "$EXT_DB_NAME" --extDbRlsVer "$EXT_DB_RLS_VERSION" --soCvsVersion 1.41
This plugin loads BLAST results from a condensed file format into the DoTS.Similarity table. It was common practice in several GUS user groups to first load all sequences into GUS, and only then dump them into a FASTA file for BLASTing. When sequences are loaded into the database, they receive a GUS id (na_sequence_id, aa_sequence_id) that is the primary key of a table in the GUS schema. The FASTA file dumped from the database ($GUS_HOME/bin/dumpSequencesFromTable.pl) would have these ids, and only these ids, in the defline (the FASTA header). A more modern approach uses the source_id (e.g. a GenBank accession) as the main identifier, with regular expressions to extract those from the deflines.
ga GUS::Supported::Plugin::InsertBlastSimilarities --file file
   --subjectTable tablename [--subjectTableSrcIdCol string] [--subjectExtDbRlsVer string] [--subjectExtDbName string]
   --queryTable tablename [--queryTableSrcIdCol string] [--queryExtDbRlsVer string] [--queryExtDbName string]
   [--batchSize integer] [--noHSPs] [--testnumber integer] [--restartAlgInvs string-list]
   [--subjectsLimit integer] [--hspsLimit integer] [--minSubjects integer] [--maxSubjects integer]
   [--subjectPvalue float] [--subjectPctIdent float] [--subjectMatchLength integer]
   [--hspPvalue float] [--hspPctIdent float] [--hspMatchLength integer]
   [--commit] [--debug] [--sqlVerbose] [--verbose] [--veryVerbose]
   [--user string] [--group string] [--project string] [--comment string]
   [--algoinvo integer] [--gusconfigfile file]
The required commandline arguments are the following:
--file *file* (Required)
    Input file containing BLAST results in condensed form.
    format: the format produced by the blastSimilarities command (a line describing the query, one line per subject hit, followed by one for each HSP)
This "BLAST condensed" file should look like this:
>97336279 (2 subjects)
  Sum: smart00177:270:2e-24:1:72:63:278:1:72:49:54:0:+3
   HSP1: smart00177:49:54:72:270:2e-24:1:72:63:278:0:+3
  Sum: pfam00025:85:4e-17:1:70:72:278:1:70:41:51:0:+3
   HSP1: pfam00025:41:51:70:85:4e-17:1:70:72:278:0:+3
>97336342 (0 subjects)
We list each value and its meaning separately for better visualization:
>97336344 (1 subjects)  ->  >querySourceId (# subjects)
Sum:      ->  Sum:
COG5096:  ->  subjectSourceId:
48:       ->  score:
4e-06:    ->  pValue:
65:       ->  minSubjectStart:
99:       ->  maxSubjectEnd:
257:      ->  minQueryStart:
361:      ->  maxQueryEnd:
1:        ->  numberOfMatches:
35:       ->  totalMatchLength:
20:       ->  numberIdentical:
28:       ->  numberPositive:
0:        ->  isReversed:
+2        ->  readingFrame
The same follows for HSP:
HSP1:     ->  HSP:
COG5096:  ->  subjectSourceId:
20:       ->  numberIdentical:
28:       ->  numberPositive:
35:       ->  matchLength:
48:       ->  score:
4e-06:    ->  pvalue:
65:       ->  subjectStart:
99:       ->  subjectEnd:
257:      ->  queryStart:
361:      ->  queryEnd:
0:        ->  isReversed:
+2        ->  readingFrame
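Since every field is colon-separated, the mappings above are easy to check with standard text tools. A small sketch (the sample line is the one from the example above):

```shell
# Split one condensed 'Sum:' line into its colon-separated fields.
line='Sum: COG5096:48:4e-06:65:99:257:361:1:35:20:28:0:+2'
echo "$line" | awk '{split($2, f, ":");
  print "subjectSourceId=" f[1], "score=" f[2], "pValue=" f[3]}'
# prints: subjectSourceId=COG5096 score=48 pValue=4e-06
```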
This condensed file may be generated from default BLAST output files using the script parseBlastFilesForSimilarity.pl, that can be found in $GUS_HOME/bin.
An example follows:
echo /home/john_doe/random_blast_run.blast | parseBlastFilesForSimilarity.pl --regex="(\S+).*" --outputFile random_blast_run.condensed
The regex argument aims to extract the source_ids from the BLAST output file.
--subjectTable *tablename* (Required) Subjects are taken from this table (schema::table format, e.g., DoTS::ExternalNASequence)
This argument tells the plugin which table within the GUS schema holds the subject sequences - assuming you have already inserted them, either with InsertSequenceFeatures or LoadFastaSequences.
--queryTable *tablename* (Required) Queries are taken from this table (schema::table format, e.g., DoTS::ExternalNASequence)
This argument tells the plugin which table within the GUS schema holds the query sequences - which, again, should have been inserted previously, either with InsertSequenceFeatures or LoadFastaSequences.
The subjectTable and queryTable arguments indicate which tables the subject and query sequences are in, respectively. (This is because the Similarity table relates table/row pairs in GUS; the plugin must therefore know which tables the sequences live in.) The plugin assumes all query sequences are in a single table, which may differ from the single table holding all the subject sequences.
DESCRIPTION
This plugin populates the tables sres.GoTerm, sres.GoRelationship and sres.GoSynonym. First, it is necessary to download the GO ontology data files and save them in /usr/local/src/gus-related/ontology/. The files needed are:
* process.ontology
* component.ontology
* function.ontology
Register the LoadGoOntology plugin:
$ ga +create GUS::GOPredict::Plugin::LoadGoOntology --commit
Add entries to sres.GORelationshipType by creating an XML file with the following content:
<GUS::Model::SRes::GORelationshipType>
  <go_relationship_type_id>1</go_relationship_type_id>
  <name>isa</name>
</GUS::Model::SRes::GORelationshipType>
<GUS::Model::SRes::GORelationshipType>
  <go_relationship_type_id>2</go_relationship_type_id>
  <name>partof</name>
</GUS::Model::SRes::GORelationshipType>
Finally, load the ontology data with the command:
ga GUS::GOPredict::Plugin::LoadGoOntology --file_path=/usr/local/src/gus-related/ontology/ \
  --process_db_id=3003 \
  --process_ext_db_rel=165 \
  --function_db_id=3001 \
  --function_ext_db_rel=164 \
  --component_db_id=3002 \
  --component_ext_db_rel=166 \
  --commit \
  >& /home/apps/GUS/dev/uploads/go-ontology-upload.out
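For reference, in the legacy GO flat-file format used by these .ontology files, a term introduced by `%` relates to its parent by isa and one introduced by `<` by partof, matching the two relationship types registered in sres.GORelationshipType. A minimal illustrative sketch (hypothetical helper, not part of LoadGoOntology) that tallies the markers:

```python
def tally_relationships(lines):
    """Count isa (%) and partof (<) term markers in legacy GO flat-file lines."""
    counts = {"isa": 0, "partof": 0}
    for line in lines:
        marker = line.lstrip()[:1]  # first non-blank character encodes the relationship
        if marker == "%":
            counts["isa"] += 1
        elif marker == "<":
            counts["partof"] += 1
    return counts

# Tiny made-up excerpt (term names and the third GO id are illustrative):
sample = [
    "$biological_process ; GO:0008150",
    " %metabolism ; GO:0008152",
    "  <regulation ; GO:0000000",
]
print(tally_relationships(sample))  # {'isa': 1, 'partof': 1}
```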
My initial suggestion is that you take a look at the LoadDepreciatedFeatureIds and GenerateKeywords plugins. The first is simpler, and the second has some functions that may be useful.
They are located at:
/home/pablo1/GUS/project_home/TcruziDBData/Load/plugin/perl/LoadDepreciatedFeatureIds.pm
/home/pablo1/GUS/project_home/BiowebDB/Load/plugin/perl/GenerateKeywords.pm
The plugin you are going to write has the same structure.
package BiowebDB::Load::Plugin::GenerateKeywords;
use GUS::PluginMgr::Plugin;
use GUS::Model::DoTS::IndexWord;
We will describe the problem through an example that maps alternative identifiers to the primary identifier of our database (source_id). The plugin takes the following arguments:
- mapFile (a tab-delimited file with source_id, alternative_id)
- sourceExtDbName (name of the external database the source_ids came from; example: pfa_gb)
- sourceExtDbRlsVer (version of the database above)
- altExtDbName (name of the external database the alternative_ids came from; this one will have to be created; example: plasmo_db)
- altExtDbRlsVer (version of the database above)
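The mapFile is a plain tab-delimited file with one source_id/alternative_id pair per line. A minimal Python sketch of parsing it (the identifiers below are illustrative only; this helper is not part of GUS):

```python
def parse_map_lines(lines):
    """Parse tab-delimited (source_id, alternative_id) lines into pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        source_id, alternative_id = line.split("\t")[:2]
        pairs.append((source_id, alternative_id))
    return pairs

# Two hypothetical mapFile lines:
sample = ["PFA0010c\tMAL1P1.1\n", "PFA0015c\tMAL1P1.2\n"]
print(parse_map_lines(sample))
```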
my $iw = GUS::Model::DoTS::IdentifierGroup->new({'source_id' => $source_id});
$iw->retrieveFromDB();
Run the installation to copy it to $GUS_HOME:
build TcruziDBData/Load install -append
The purpose of the InsertExternalDatabaseLink plugin is to load into the sres.externaldatabaselink table the information needed to map our sequences to sequences in external databases, making it possible to build a link to the corresponding external database URL using the id_url column of the sres.externaldatabaserelease table.
1-) The current database has no entries for the databases GeneDB_Tcruzi, TcruziDB, PlasmoDB, and Superfamily.
So they will need to be loaded into the sres.externaldatabase and sres.externaldatabaserelease tables.
The GeneDB_Pfalciparum database is already loaded in sres.externaldatabase, so only the corresponding entry in sres.externaldatabaserelease is missing.
To do this, use the GUS plugins InsertExternalDatabase and InsertExternalDatabaseRls.
Example:
ga GUS::Supported::Plugin::InsertExternalDatabase --name GeneDB_Tcruzi --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName GeneDB_Tcruzi --databaseVersion 2.1 --idUrl 'http://www.genedb.org/genedb/Search?submit=Search+for&organism=tcruzi&desc=yes&wildcard=yes&name=' --commit
ga GUS::Supported::Plugin::InsertExternalDatabase --name TcruziDB --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName TcruziDB --databaseVersion 5.1 --idUrl 'http://tcruzidb.org/tcruzidb/showRecord.do?name=GeneRecordClasses.GeneRecordClass&primary_key=' --commit
ga GUS::Supported::Plugin::InsertExternalDatabase --name PlasmoDB --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName PlasmoDB --databaseVersion 5.3 --idUrl 'http://www.plasmodb.org/plasmo/showRecord.do?name=GeneRecordClasses.GeneRecordClass&primary_key=' --commit
ga GUS::Supported::Plugin::InsertExternalDatabase --name Supfam_pfal --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName Supfam_pfal --databaseVersion 1.69 --idUrl 'http://www.supfam.org/SUPERFAMILY/cgi-bin/gene.cgi?genome=pl;seqid=' --commit
ga GUS::Supported::Plugin::InsertExternalDatabase --name Supfam_tcruzi --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName Supfam_tcruzi --databaseVersion 1.69 --idUrl 'http://www.supfam.org/SUPERFAMILY/cgi-bin/gene.cgi?genome=uz;seqid=' --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName GeneDB_Pfalciparum --databaseVersion 2.1 --idUrl 'http://www.genedb.org/genedb/Search?submit=Search+for&organism=malaria&desc=yes&wildcard=yes&name=' --commit
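Each id_url above is a URL prefix: the final link is formed by appending the external identifier to it. A minimal sketch (the helper name and the identifier are illustrative; the id_url is the TcruziDB one registered above):

```python
def external_link(id_url, external_id):
    """Build an external-database link: id_url is a prefix, the identifier is appended."""
    return id_url + external_id

link = external_link(
    "http://tcruzidb.org/tcruzidb/showRecord.do?"
    "name=GeneRecordClasses.GeneRecordClass&primary_key=",
    "Tc00.1047053506173.10",  # hypothetical TcruziDB gene identifier
)
print(link)
```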
2-) Now we can run the plugin that loads the sres.externaldatabaselink table.
Plugin definition:
$ ga TcruziDBData::Load::Plugin::InsertExternalDatabaseLink --mapFile $MAPFILE --sourceExtDbName $DATABASENAME --extDbRlsVer $VERSIONEXTERNALDB --tableName $TABLENAME --commit
- $MAPFILE: a tab-delimited file containing the data that maps each source_id to the external database identifier.
- $DATABASENAME: name of the external database.
- $VERSIONEXTERNALDB: version of the external database mentioned above.
- $TABLENAME: name of the table the GUS-generated sequence_id comes from (e.g. dots.TranslatedAASequence).
Example:
ga TcruziDBData::Load::Plugin::InsertExternalDatabaseLink --mapFile pfal_accession_mapping.txt --sourceExtDbName GeneDB_Pfalciparum --extDbRlsVer 2.1 --tableName TranslatedAASequence --commit
ga TcruziDBData::Load::Plugin::InsertExternalDatabaseLink --mapFile tcruzi_accession_mapping.txt --sourceExtDbName GeneDB_Tcruzi --extDbRlsVer 2.1 --tableName TranslatedAASequence --commit
The SRes.ExternalDatabaseLink table contains information to map the sequences loaded into the ProtozoaDB database to other external databases (such as Superfamily, GeneDB, TcruziDB, PlasmoDB).
The columns are defined as follows:
- link_table_id => id of the table referenced to obtain our sequence ids (e.g. the id of dots.translatedaasequence)
- link_primary_id => id of the sequences in our database (e.g. the dots.translatedaasequence.aa_sequence_id column)
- external_database_release_id => foreign key referencing sres.externaldatabaserelease
- external_primary_identifier => identifier of the sequences in the external database (e.g. for tcr_gb, Tc00.1047053506173.10)
- external_secondary_identifier => GenBank accession number (e.g. dots.translatedaasequence.source_id)
The purpose of the InsertCdhit plugin is to load into the sres.externaldatabaselink table the result of cd-hit, that is, the identity percentage of the proteins grouped into each cluster.
1-) Insert into the sres.externaldatabase table one entry for each cd-hit run. The name should describe the database and the identity percentage (e.g. tcruzi100), along with its corresponding entry in sres.externaldatabaserelease.
Example:
ga GUS::Supported::Plugin::InsertExternalDatabase --name tcruzi100 --commit
ga GUS::Supported::Plugin::InsertExternalDatabaseRls --databaseName tcruzi100 --databaseVersion 1.0 --commit
2-) Run the InsertCdhit plugin with the following parameters:

ga TcruziDBData::Load::Plugin::InsertCdhit --mapFile $MAPFILE --sourceExtDbName $DATABASENAME --extDbRlsVer $DATABASEVERSION --tableName $TABLENAME --commit
Example:
ga TcruziDBData::Load::Plugin::InsertCdhit --mapFile ptnas_GUS_gb_nr.100.fasta.clstr --sourceExtDbName tcruzi100 --extDbRlsVer 1.0 --tableName TranslatedAASequence --commit
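The --mapFile here is a cd-hit .clstr file, which lists each cluster header (`>Cluster N`) followed by its member sequences. A minimal sketch (hypothetical helper, not the plugin's actual code) of deriving the accession-to-cluster mapping that ends up in external_secondary_identifier:

```python
import re

def clusters_from_clstr(lines):
    """Map each sequence accession to its cd-hit cluster number."""
    mapping = {}
    cluster = None
    for line in lines:
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
        else:
            # Member lines look like: "0\t4411aa, >XP_843163.1... *"
            m = re.search(r">(\S+?)\.\.\.", line)
            if m and cluster is not None:
                mapping[m.group(1)] = cluster
    return mapping

# Tiny excerpt modeled on cd-hit output (lengths are illustrative):
sample = [
    ">Cluster 0",
    "0\t4411aa, >XP_843163.1... *",
    "1\t4410aa, >AAZ14281.1... at 99.80%",
]
print(clusters_from_clstr(sample))  # {'XP_843163.1': 0, 'AAZ14281.1': 0}
```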
- link_table_id => id of the table referenced to obtain our sequence ids (e.g. the id of dots.translatedaasequence)
- link_primary_id => id of the sequences in our database (e.g. the dots.translatedaasequence.aa_sequence_id column)
- external_database_release_id => foreign key referencing sres.externaldatabaserelease
- external_primary_identifier => GenBank accession number (e.g. dots.translatedaasequence.source_id = XP_843163.1)
- external_secondary_identifier => number of the cluster the sequence belongs to.
Example:
SELECT edbl.external_primary_identifier
FROM sres.externaldatabaselink edbl,
     sres.externaldatabaserelease edbr,
     sres.externaldatabase edb
WHERE edb.name = 'tcruzi100'
  AND edb.external_database_id = edbr.external_database_id
  AND edbr.external_database_release_id = edbl.external_database_release_id
  AND edbl.external_secondary_identifier = 0;

 external_primary_identifier
-----------------------------
 AAZ14281.1
 XP_843163.1
(2 rows)
Refer to this older material: http://www.gusdb.org/documentation/3.5/developers/html/ch02.html
CREATE TABLE Dots.Identifier (
    identifier_id numeric(3,0) NOT NULL,
    alternative_id character varying(32) NOT NULL,
    identifier_group_id numeric(3,0),
    external_database_release_id numeric(10,0), -- will allow us to link back to the source
    modification_date timestamp without time zone NOT NULL,
    user_read numeric(1,0) NOT NULL,
    user_write numeric(1,0) NOT NULL,
    group_read numeric(1,0) NOT NULL,
    group_write numeric(1,0) NOT NULL,
    other_read numeric(1,0) NOT NULL,
    other_write numeric(1,0) NOT NULL,
    row_user_id numeric(12,0) NOT NULL,
    row_group_id numeric(4,0) NOT NULL,
    row_project_id numeric(4,0) NOT NULL,
    row_alg_invocation_id numeric(12,0) NOT NULL
);

CREATE TABLE Dots.IdentifierGroup (
    identifier_group_id numeric(3,0),
    source_id character varying(32), -- preferred source_id for those IDs in the group
    modification_date timestamp without time zone NOT NULL,
    user_read numeric(1,0) NOT NULL,
    user_write numeric(1,0) NOT NULL,
    group_read numeric(1,0) NOT NULL,
    group_write numeric(1,0) NOT NULL,
    other_read numeric(1,0) NOT NULL,
    other_write numeric(1,0) NOT NULL,
    row_user_id numeric(12,0) NOT NULL,
    row_group_id numeric(4,0) NOT NULL,
    row_project_id numeric(4,0) NOT NULL,
    row_alg_invocation_id numeric(12,0) NOT NULL
);

ALTER TABLE Dots.Identifier ADD PRIMARY KEY (identifier_id);
ALTER TABLE Dots.IdentifierGroup ADD PRIMARY KEY (identifier_group_id);
ALTER TABLE Dots.Identifier ADD CONSTRAINT identifier_group_id
    FOREIGN KEY (identifier_group_id) REFERENCES Dots.IdentifierGroup (identifier_group_id);

CREATE SEQUENCE Dots.identifier_sq;
CREATE SEQUENCE Dots.identifiergroup_sq;
Make sure everything was created correctly.
INSERT INTO Core.tableinfo VALUES (963, 'Identifier', 'Standard', 'IDENTIFIER_ID', 3, 1, 0, NULL, NULL, 1, '2007-09-12 10:52', 1,1,1,1,1,0,1,1,1,1);
INSERT INTO Core.tableinfo VALUES (964, 'IdentifierGroup', 'Standard', 'IDENTIFIER_GROUP_ID', 3, 1, 0, NULL, NULL, 1, '2007-09-12 10:55', 1,1,1,1,1,0,1,1,1,1);
build GUS -installDBSchema
I could not find the SQL files, so my guess is that they now use an XML file found here:
/home/pablo1/GUS/project_home/GusSchema/Definition/config
WARNING: If this file is not modified, it is likely that the next time we install GUS the tables will not be created and, therefore, the plugin will not work.
build GUS install -append
/home/pablo1/GUS/project_home/GUS/Model/lib/perl/DoTS
Some queries are executed frequently and are worth keeping in a handy place.
SELECT d.name, r.version, r.external_database_release_id
FROM Sres.ExternalDatabase d, Sres.ExternalDatabaseRelease r
WHERE d.external_database_id = r.external_database_id;
SELECT tas.source_id, tas.aa_sequence_id
FROM Dots.TranslatedAASequence tas, Sres.ExternalDatabaseRelease r, Sres.ExternalDatabase d
WHERE d.name = 'Pfalciparum'
  AND r.version = '159.t'
  AND d.external_database_id = r.external_database_id
  AND r.external_database_release_id = tas.external_database_release_id
  AND tas.source_id LIKE '%001351080%';
SELECT DISTINCT gf.source_id
FROM dots.ExternalNaSequence enas, dots.GeneFeature gf, dots.Transcript t,
     dots.TranslatedAaFeature taf, dots.RnaType r, core.TableInfo ti
WHERE enas.na_sequence_id = gf.na_sequence_id
  AND gf.na_feature_id = r.parent_id
  AND r.na_feature_id = t.parent_id
  AND t.na_feature_id = taf.na_feature_id
  AND ti.name = 'TranslatedAASequence'
ORDER BY gf.source_id;