Introduction
This is an advanced reference document on rebuilding the triple store database.
- Some of this information is outdated and needs to be updated.
- This document is provided as an outline for those interested.
ncbo-cron
Stop all ncbo-cron jobs in your system. Make sure nobody restarts them during the process of rebuilding the RDF store.
Purge spam ontologies
It is a good idea to purge spam ontologies/notes/projects/users before reprocessing everything since those ontologies should be deleted anyway.
You can perform this step after you start generating the new UMLS RDF files, if you plan to do that step.
Generate and upload new UMLS RDF files
The UMLS steps are only required if we have a new version of UMLS.
See Handling UMLS section for directions on generating and uploading UMLS ontologies using umls2rdf.
Note 1: This step takes a lot of time, during which you can perform useful maintenance.
Note 2: When new UMLS ontologies are going through the OWL parser, you will see a lot of logging.
Validate UMLS ontologies
Go to several UMLS ontology summary pages (for example, SNOMED and NCBITAXON) and make sure that there are new submissions with status upload, and with their new version properly recorded.
Create new 4store instance
If you do not want to remove your existing triple store, perform this step on a separate copy of your Appliance. We will call this the ‘staging’ system.
- Location: /srv/ncbo/share/scratch/metadata/
- User: 4store
Steps
- Run reset_rebuild_kb.sh
- 4s-size rebuild_kb should return stats with all segments set to 0 (see the sketch below).
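A minimal sketch of this reset-and-verify step, assuming reset_rebuild_kb.sh lives in the location listed above:
```
# As the 4store user on the staging 4store host.
sudo su - 4store
cd /srv/ncbo/share/scratch/metadata/

# Recreate an empty rebuild_kb store.
./reset_rebuild_kb.sh

# Verify: every segment should report 0 quads.
4s-size rebuild_kb
```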
Metadata migration
- Location: /srv/ncbo/share/scratch/metadata/
- User: any
We will move the metadata from the production system to the newly created rebuild_kb 4store instance.
Steps
- Delete any data folder with a previous metadata dump.
- Run ./dump_metadata_graphs.sh (on the production system).
- Run ./restore_metadata_graphs.sh (on the system with your rebuilt instance).
- Make sure no errors were generated as part of the restore. For example, there could be RDF parsing errors. If errors are encountered, then after fixing them in your metadata_graphs file, you need to re-create the 4store instance you created under "Create new 4store instance" before you restore the metadata graphs in step 3 above. If errors are found (example: raptor error - Turtle URI error - illegal Unicode escape \u0020 in URI), find the data files with errors using grep -R u0020 data/* and replace the illegal characters in each file that contains them: sed -i 's/\\u0020/ /g' data/*/*/* (a consolidated sketch of the dump/restore flow appears at the end of this section).
- Check the Notes section for additional tips on how to deal with problematic metadata exports/imports.
- Start 4s-httpd for rebuild_kb on port 8083 with: 4s-httpd -p 8083 -s-1 -D rebuild_kb
NOTES
- It is better to start this in debug mode so you can easily kill it with Ctrl-C.
- Start this process in a tmux or screen session so you can return to it later.
- Test: go to http://ncbostage-4store1:8083/status/ and make sure it shows around 500K quads. It will grow over time, but not very quickly.
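A minimal sketch of the dump/restore flow above, assuming both scripts live in the location listed for this section; adjust paths to your environment:
```
# On the production system: export the metadata graphs.
cd /srv/ncbo/share/scratch/metadata/
rm -rf data                     # remove any previous metadata dump
./dump_metadata_graphs.sh

# On the system with the rebuilt instance: load the dump into rebuild_kb.
cd /srv/ncbo/share/scratch/metadata/
./restore_metadata_graphs.sh

# If the restore reports illegal Unicode escapes (\u0020), locate and fix
# them, re-create the 4store instance, and restore again.
grep -R u0020 data/*
sed -i 's/\\u0020/ /g' data/*/*/*
```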
Synchronize your file repository from the production file repo
- Host: staging
- User: ncbo-deployer
- Location:
! This step's instructions are missing
Reparse ontology content
- FIXME: We no longer use ncbo_migration to reparse ontologies. The following section is outdated.
! The new process involves using ncbo_cron to reparse all ontologies.
Outdated content:
- Host: ncboprod-parser1
- Location: /srv/ncbo/share/scratch/migrations/prd/ncbo_migration
- User: ncbo-deployer
Steps
First, locally in a dev box:
- Clear existing logs in ./logs and ./parsing
- In the master branch of the ncbo_migration project, run git pull and bundle update
- Commit Gemfile.lock and push the lock file to master.
In the parsing box:
- Pull the latest code: git pull
- Pull the latest dependencies: bundle install --deployment
- Run FLUSHALL on the stage Goo redis instance dedicated to rebuilding the KB: redis-cli -h localhost -p 6380 FLUSHALL; also clear out the logs in the parsing directory
- Make sure the config file (settings.rb) points to the sparql endpoint at ncbostage-4store1:8083/
LinkedData.config do |config|
  config.goo_host = "ncbostage-4store1"
  config.goo_port = 8083 # port of the rebuild_kb 4s-httpd instance (goo_port assumed from the standard LinkedData config)
end
- Run time bundle exec ruby workflows/reprocess.rb
Note: This process takes approximately 7 hours; make sure it runs in a screen or tmux session (a consolidated sketch of these parsing-box commands appears at the end of this section). It will show progress as it goes. Press Ctrl+D when the script pauses with the following message.
[ncbo-deployer@ncbo-prd-app-25 ncbo_migration]$ time bundle exec ruby workflows/reprocess.rb
(LD) >> Using rdf store ncbostage-4store1:8083
(LD) >> Using search server at http://ncbostage-solr1.stanford.edu:8080/solr/
(LD) >> Using HTTP Redis instance at ncbostage-redis2:6380
(LD) >> Using Goo Redis instance at ncbostage-redis1:6380
(LD) >> Enable SPARQL monitoring with cube ncbo-cube1:1180
Using cube options in Goo {:host=>"ncbo-cube1", :port=>1180}
(AN) >> Using ANN Redis instance at ncboprod-redis1:6379
Gathering ontologies of type all
Found 567 all ontologies
From: /srv/ncbo/share/scratch/migrations/prd/ncbo_migration/workflows/reprocess.rb @ line 41 :
36: #acronyms.each do |acr|
37: # submissions << LinkedData::Models::Ontology.find(acr).first.latest_submission(status: :any)
38: #end
39: submissions = get_submissions(:all)
40:
=> 41: binding.pry
42: puts "", "Parsing #{submissions.length} submissions..."
43: pbar = ProgressBar.new("Parsing", submissions.length)
44: FileUtils.mkdir_p("./parsing")
45: submissions.each do |s|
46: s.bring_remaining
[1] pry(main)>
Parsing 567 submissions...
- The screen where 4s-httpd is running should start showing progress right away. The progress it shows corresponds to graphs being deleted from and added to the new KB.
- At the end there should be around 110M quads in the triple store.
Note: Parsing logs are located at /srv/ncbo/share/scratch/migrations/prd/ncbo_migration/parsing
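A consolidated sketch of the (now outdated) parsing-box commands above, assuming the staging Goo redis instance dedicated to the rebuild listens on localhost:6380 as noted earlier:
```
# In the ncbo_migration checkout on the parsing box (run inside tmux or screen).
cd /srv/ncbo/share/scratch/migrations/prd/ncbo_migration
git pull
bundle install --deployment

# Flush the Goo redis instance used for the rebuild and clear old logs.
redis-cli -h localhost -p 6380 FLUSHALL
rm -rf logs/* parsing/*

# Reparse all submissions against the rebuild_kb endpoint (approx. 7 hours).
time bundle exec ruby workflows/reprocess.rb
```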
Restore REST mappings
- Host: staging
- Location: /srv/ncbo/share/scratch/migrations/prd/ncbo_migration
- User: ncbo-deployer
Steps
time bundle exec ruby ./workflows/restore_backup_mappings.rb
Perform quality assurance using the staging UI
Steps
- Host: ncbostage-4store1
- User: 4store
Start the rebuilt KB instead of the existing ontologies_api KB
- Make sure that the ncbo_cron process is stopped on the ncbostage-parser1 server.
- Stop the staging ontologies_api triple store KB: a. Exit the 4store user and return to your own SSH user b. Run sudo service 4s-httpd-ontologies_api stop
- Run sudo su - 4store
- Stop the 4s-httpd rebuild_kb that is listening on 8083 (just SSH to your original tmux/screen session on ncbostage-4store and hit Ctrl-C to stop the running server)
- Start rebuild_kb on 8080 with: 4s-httpd -p 8080 -s-1 -D rebuild_kb (see the sketch below)
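A minimal sketch of this swap, using the service and store names shown above:
```
# As your own SSH user: stop the staging ontologies_api KB.
sudo service 4s-httpd-ontologies_api stop

# As the 4store user: stop the rebuild_kb server on 8083 (Ctrl-C in its
# tmux/screen session), then start it on the standard port 8080.
sudo su - 4store
4s-httpd -p 8080 -s-1 -D rebuild_kb
```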
Flush ALL caches (UI, Memcached, Goo, HTTP)
- SSH to a REST endpoint server (ncbostage-rest2, ncbostage-rest3, or any existing REST server) and cd /srv/ncbo/ontologies_api/current
- Flush Goo and HTTP caches: run bundle exec rake cache:clear:staging (you can use your own SSH user)
- Use the BioPortal Admin UI to flush UI Memcached
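A quick sketch of the cache flush, run under your own SSH user on any REST endpoint server (ncbostage-rest2 is used here as an example):
```
ssh ncbostage-rest2
cd /srv/ncbo/ontologies_api/current

# Flush the Goo and HTTP caches.
bundle exec rake cache:clear:staging

# UI Memcached is flushed separately through the BioPortal Admin UI.
```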
QA UI using a browser
- Check the number of ontologies and number of classes: these numbers should be a close match between production and staging
- Browse 4 ontologies (representative of each format - OBO, OWL, UMLS). Suggested ontologies: SNOMEDCT, Disease Ontology, BRO, RADLEX. Compare the class tree to what it is in production.
- Browse to the mapping count pages and make sure they work.
Rename rebuild_kb KB to ontologies_api KB
- If QA went fine then rebuild_kb can be renamed to ontologies_api in staging:
  a. Shut down the existing 4store rebuild_kb KB (Ctrl-C on ncbostage-4store1), and make sure other KBs are stopped: 4s-admin stop-stores rebuild_kb ontologies_api
  b. SSH to the 4store slave boxes. To find the slave boxes, run grep nodes /etc/4store.conf on ncbostage-4store1. You should see output similar to: nodes = ncbo-stg-app-06;ncbo-stg-app-07;ncbo-stg-app-08;ncbo-stg-app-09
  c. On each of those servers, execute the following commands (see the sketch after this list):
     * sudo su - 4store
     * cd /srv/4store/data/
     * mv ontologies_api/ ontologies_api_old
     * mv rebuild_kb/ ontologies_api
     * sed -i 's/rebuild_kb/ontologies_api/g' ontologies_api/metadata.nt
- Start the ontologies_api KB on the staging server:
  a. SSH to the staging system
  b. Run sudo service 4s-httpd-ontologies_api start
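A minimal sketch of the per-slave rename, assuming the data directory layout shown above:
```
# Run on each 4store slave box listed in /etc/4store.conf.
sudo su - 4store
cd /srv/4store/data/

# Keep the old store around, promote the rebuilt one, and fix its metadata.
mv ontologies_api/ ontologies_api_old
mv rebuild_kb/ ontologies_api
sed -i 's/rebuild_kb/ontologies_api/g' ontologies_api/metadata.nt
```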
Synchronize Staging repo with Production
- SSH to ncbostage-parser1
- sudo su - ncbo-deployer
- /srv/ncbo/share/tools/scripts/reposync_staging.sh
Move to Production
At this point move the staging data back to production.
Switch Production 4store to the re-processed KB
Steps
- Host:
- User:
! To be provided.
Perform metrics calculation
- Host: {my-appliance-hostname}
- User: ncbo-deployer
- pwd: /srv/ncbo/ncbo_cron/
Run bin/ncbo_ontology_metrics -a -l logs/UMLS_metrics.log in a screen session.
In some cases this script can fail to calculate metrics for some ontologies, so it is necessary to verify that metrics are calculated for all ontologies. Re-run the script if there is a large number of ontologies with missing metrics, or manually update metrics for those ontologies. Missing metrics for any ontology in the "ready" status will cause the recommender service to fail until it is fixed.
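A short sketch of the metrics run; the screen session name is only an example:
```
# On the appliance host, as ncbo-deployer.
cd /srv/ncbo/ncbo_cron/
screen -S metrics
bin/ncbo_ontology_metrics -a -l logs/UMLS_metrics.log

# Afterwards, check logs/UMLS_metrics.log for ontologies whose metrics were
# not calculated, and re-run the script (or update those metrics manually).
```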
Re-Index all ontologies
- Host: ncboprod-parsers1
- User: ncbo-deployer
- pwd: /srv/ncbo/ncbo_cron
Reindex terms: run ./bin/ncbo_ontology_index -a -l logs/UMLS_reindex_terms.log in a screen session.
Reindex properties: run bin/ncbo_ontology_property_index -a -l logs/UMLS_reindex_property.log
Next you will need to swap the Solr cores.
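A short sketch of both reindex runs in one screen session; the session name is illustrative:
```
# On the parser host, as ncbo-deployer.
cd /srv/ncbo/ncbo_cron
screen -S reindex

# Reindex terms, then properties, logging each run separately.
./bin/ncbo_ontology_index -a -l logs/UMLS_reindex_terms.log
./bin/ncbo_ontology_property_index -a -l logs/UMLS_reindex_property.log
```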
Re-generate annotator cache
- Host: ncboprod-parsers1
- User: ncbo-deployer
- pwd: /srv/ncbo/ncbo_cron
See Annotator Management for details.
NCBO-Cron
At this point, after moving to production, we can start the cron jobs in staging and production.
Troubleshooting
Issues with metadata export
If the script breaks with an error similar to the one below, you need to check for bad submissions and delete them.
Running on prod platform
Retrieving umls ontologies ...
./bin/create_umls_submissions.rb:44:in `block in <main>': undefined method `umls?' for nil:NilClass (NoMethodError)
from ./bin/create_umls_submissions.rb:38:in `each'
from ./bin/create_umls_submissions.rb:38:in `<main>'
Issues with metadata imports
https://github.com/ncbo/ontologies_api/issues/25
<http://data.bioontology.org/metadata/authorProperty> <local:Joseph\u0020Romano> .
Replace <local:Joseph\u0020Romano> with <foaf:Person>.
<local:A\u0020http://www.w3.org/2000/01/rdf-schema#label>
Replace with <local:www.w3.org/2000/01/rdf-schema#label>.
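A hypothetical one-off fix for the first example above, assuming the metadata dump lives under data/ as in the metadata migration steps; adjust the pattern and replacement to the actual error you hit:
```
# Find the files containing the offending URI and replace it in place.
grep -Rl 'Joseph\\u0020Romano' data/ | xargs sed -i 's|<local:Joseph\\u0020Romano>|<foaf:Person>|g'
```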
Issues with Puppet
Stop the Puppet daemon so that it does not interfere with the rebuild process. (Puppet can start the ncbo_cron process back up if it does not see it running.)
Issues with indexing
Read the logs, look for ERROR entries, and take the necessary actions.