Welcome to OAStats Backend’s documentation!

This is a command-line application that processes the Apache logs for DSpace and generates download statistics for the OA collection.

The current version is an intermediate step in the migration from Mongo to PostgreSQL, and some functionality will be removed once the migration is complete. For now, a separate step is still required to generate a summary collection in Mongo, but that step now uses the data from the PostgreSQL database.

Installation

Use pip to install into a virtualenv:

(oastats)$ pip install \
    https://github.com/MITLibraries/oastats-backend/zipball/master

This will make an oastats command available when your virtualenv is active.

Usage

The oastats command has four subcommands: db, load, pipeline and summary. The full documentation for each command can be accessed with:

(oastats)$ oastats <subcommand> --help

Each subcommand needs to connect to the PostgreSQL database. Provide a valid SQLAlchemy database URI with the --oastats-database option, or set the OASTATS_DATABASE environment variable instead of passing the command line option.
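
As a minimal sketch of the environment variable approach, the lines below export a PostgreSQL URI in the usual SQLAlchemy form. The user name, password, host, port and database name are placeholders assumed for illustration, not values taken from this project.

```shell
# Placeholder connection details (assumptions); substitute your own.
# General SQLAlchemy form: postgresql://user:password@host:port/dbname
export OASTATS_DATABASE="postgresql://oastats:secret@localhost:5432/oastats"

# With the variable exported, subcommands can be run without the
# --oastats-database option, for example:
#   oastats db create
echo "$OASTATS_DATABASE"
```

Exporting the variable once per shell session avoids repeating the URI (and the password) on every command line.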

Creating the Database

pipeline.cli.db()

Create/drop the PostgreSQL database tables.

This creates or drops the database tables depending on which command is provided (create or drop). Make sure the database itself exists first.

Example:

(oastats)$ oastats db create

Full command documentation:

(oastats)$ oastats db --help

Migrating the Mongo Data

Important

This subcommand will be removed once the data has been migrated.

pipeline.cli.load()

Load the Mongo requests collection into PostgreSQL.

The entire Mongo requests collection is iterated over and loaded into PostgreSQL. The collection is sorted by time, descending, before being iterated, so that the most recent (and most complete) identity data is taken from the denormalized Mongo database. It is recommended to make sure the requests collection has a descending index on the time field before running:


(oastats)$ mongo oastats --eval \
    "db.requests.createIndex({time: -1})"
(oastats)$ oastats load

Full command documentation:

(oastats)$ oastats load --help

Running the Pipeline

pipeline.cli.pipeline()

Process the Apache logs and populate the database with identities.

This command processes the logs and prints the output to STDOUT. The output format is CSV suitable for passing to PostgreSQL’s COPY command. The field order is: status, country, url, referer, user_agent, datetime, document_id. Any requests that cannot be processed because of malformed log entries are logged to STDERR.

IP addresses are converted to three-letter country codes using the GeoLite2 country database. Make sure to use the binary format (.mmdb) and that your copy is current; the database is updated regularly. Pass the location of this file using the --geo-ip option.
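
One way to check that the GeoLite2 file is present and reasonably fresh before a run is sketched below. The path, the 30-day threshold and the check_geo_db helper are all assumptions made for illustration; they are not part of oastats.

```shell
# Hypothetical path; point this at your own copy of the database.
geo_db="${GEO_DB:-data/GeoLite2.mmdb}"
# Arbitrary threshold (assumption): treat copies older than 30 days as stale.
max_age_days=30

# Hypothetical helper: prints "missing", "stale" or "ok" for a file.
check_geo_db() {
    if [ ! -f "$1" ]; then
        echo "missing"
    elif [ -n "$(find "$1" -mtime +"$2")" ]; then
        # find prints the path only if it was last modified more than $2 days ago
        echo "stale"
    else
        echo "ok"
    fi
}

check_geo_db "$geo_db" "$max_age_days"
```

The check only looks at the file's modification time, so it says nothing about whether a newer release is actually available.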

The pipeline can filter log entries by date. Use the --month/-m option to specify a month to select. This can be repeated as many times as desired to collect more than one month of requests. The format must match the one used in the log entries, namely MMM/YYYY (for example, Sep/2016). If no month is provided, all log entries will be processed.

Identity data is collected from a custom DSpace identity bitstream. This can be specified using the --dspace option.

The paths to one or more log files should be passed as arguments to the pipeline. For example:
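
A value in the MMM/YYYY format can be produced with the standard date utility, as in the sketch below. It assumes the logs use English month abbreviations, as Apache's default timestamp format does, which is why the C locale is forced; the month variable name is only illustrative.

```shell
# Force the C locale so the abbreviation is English (e.g. "Sep"),
# regardless of the locale configured for the current user.
month="$(LC_ALL=C date +%b/%Y)"
echo "$month"

# The value can then be passed to the pipeline, for example:
#   oastats pipeline -m "$month" ...
```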


(oastats)$ oastats pipeline -m Sep/2016 -m Oct/2016 \
    --geo-ip data/GeoLite2.mmdb \
    logs/2016/{09,10}/access.log 2>errors.log > output.csv
(oastats)$ psql -d database -c "COPY requests (status, country, \
    url, referer, user_agent, datetime, document_id) FROM STDIN \
    WITH CSV" < output.csv

Full command documentation:

(oastats)$ oastats pipeline --help

Generating the Summary Collection

Important

This subcommand will be removed once Mongo is no longer needed for the main OAStats website.

pipeline.cli.summary()

Create the summary collection in Mongo.

The current OAStats website uses a summary collection in Mongo which effectively functions as a pregenerated query cache. This command will generate and insert the necessary JSON objects into Mongo.

Though not required, it is recommended to create a temporary summary collection in Mongo and rename it to summary once this command has finished. For example:


(oastats)$ oastats summary --mongo-coll summary_new
(oastats)$ mongo oastats --eval \
    'db.summary_new.renameCollection("summary", true)'

Full command documentation:

(oastats)$ oastats summary --help