Welcome to OAStats Backend’s documentation!
This is a command-line application that processes the Apache logs for DSpace and generates download statistics for the OA collection.
The current version is an intermediate step in moving off Mongo to Postgres, and some of the functionality will be removed once the migration is complete. As such, a separate step is still required to generate a summary collection in Mongo, but this is now done using the data from the Postgres database.
Installation
Use pip to install into a virtualenv:
(oastats)$ pip install \
https://github.com/MITLibraries/oastats-backend/zipball/master
This will make an oastats command available when your virtualenv is active.
Usage
The oastats command has four subcommands: db, load, pipeline and summary. The full documentation for each command can be accessed with:
(oastats)$ oastats <subcommand> --help
Each subcommand needs to connect to the Postgres database. This can be done by providing a valid SQLAlchemy database URI to the --oastats-database option. Alternatively, the URI can be passed through the OASTATS_DATABASE environment variable instead of on the command line.
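As a rough illustration of how the two ways of supplying the URI relate, the sketch below resolves a database URI from an explicit option value or, failing that, from the OASTATS_DATABASE environment variable, and then splits it into its parts. The helper name `resolve_database_uri`, the example URI and the precedence (option over environment) are assumptions made for illustration, not a description of how oastats is implemented.

```python
import os
from urllib.parse import urlsplit


def resolve_database_uri(option_value=None, environ=None):
    """Return the URI from the option value, else from OASTATS_DATABASE.

    Hypothetical helper: the option-over-environment precedence is an
    assumption, not taken from the oastats source.
    """
    environ = os.environ if environ is None else environ
    uri = option_value or environ.get("OASTATS_DATABASE")
    if not uri:
        raise ValueError("no database URI given")
    return uri


# Made-up URI in the usual SQLAlchemy form:
# dialect://user:password@host:port/database
example_env = {
    "OASTATS_DATABASE": "postgresql://oastats:secret@localhost:5432/oastats"
}
uri = resolve_database_uri(environ=example_env)
parts = urlsplit(uri)
print(parts.scheme, parts.hostname, parts.port, parts.path.lstrip("/"))
```

Passing the environment mapping as a parameter keeps the helper easy to exercise without touching the real process environment.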
Creating the Database
pipeline.cli.db()
Create/drop the Postgres database tables.
This will create or drop the database tables depending on which command is provided (create or drop). Make sure the database exists first.
Example:
(oastats)$ oastats db create
Full command documentation:
(oastats)$ oastats db --help
Migrating the Mongo Data
Important
This subcommand will be removed once the data has been migrated.
pipeline.cli.load()
Load the Mongo requests collection into Postgres.
The entire Mongo requests collection will be iterated over and loaded into Postgres. The collection is sorted by time descending before being iterated. This is done in order to get the most recent (and complete) identity data from the denormalized Mongo database. It is recommended to make sure the requests collection has a descending index on the time field before running:
(oastats)$ mongo oastats --eval \
"db.requests.createIndex({time: -1})"
(oastats)$ oastats load
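The reason for the descending sort can be illustrated with a small sketch: when records are visited newest first, the first identity value seen for a given key is also the most recent one, so older duplicates can simply be skipped. The function name, the field names `document` and `title`, and the sample records are all hypothetical; this is not the loader's actual code.

```python
def latest_identities(requests):
    """Keep the most recent title per document by iterating newest first."""
    ordered = sorted(requests, key=lambda r: r["time"], reverse=True)
    identities = {}
    for request in ordered:
        # setdefault stores only the first value seen per key, which is
        # the most recent one because of the descending sort.
        identities.setdefault(request["document"], request["title"])
    return identities


# Made-up denormalized records: the same document appears with an older,
# less complete title and a newer, corrected one.
records = [
    {"time": "2016-01-05T10:00:00", "document": "doc-1", "title": "Untitled"},
    {"time": "2016-09-20T14:30:00", "document": "doc-1", "title": "Final Title"},
    {"time": "2016-03-11T09:15:00", "document": "doc-2", "title": "Other Paper"},
]
print(latest_identities(records))
```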
Full command documentation:
(oastats)$ oastats load --help
Running the Pipeline
pipeline.cli.pipeline()
Process the Apache logs and populate the database with identities.
This command will process the logs and print the output to STDOUT. The output format is CSV suitable for passing to Postgres’s COPY command. The field order is: status, country, url, referer, user_agent, datetime, document_id. Any requests that could not be processed due to malformed log entries will be logged to STDERR.
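To make the field order concrete, here is a minimal sketch that writes one record as a CSV line in the documented order using Python's standard csv module. The helper name `to_csv_row` and all sample values are made up for illustration; the actual value formats produced by the pipeline may differ.

```python
import csv
import io

# Field order as documented for the pipeline output.
FIELDS = ["status", "country", "url", "referer", "user_agent",
          "datetime", "document_id"]


def to_csv_row(request):
    """Serialize one request dict as a single CSV line in FIELDS order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([request[name] for name in FIELDS])
    return buffer.getvalue()


# Made-up sample values; the referer contains a comma to show quoting.
sample = {
    "status": 200,
    "country": "USA",
    "url": "/example/path.pdf",
    "referer": "https://example.org/search?q=a,b",
    "user_agent": "Mozilla/5.0",
    "datetime": "2016-10-01T12:00:00",
    "document_id": 1,
}
print(to_csv_row(sample), end="")
```

Fields containing commas are quoted by the writer, which is the form the CSV mode of COPY expects.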
IP addresses are converted to three-letter country codes using the GeoLite2 country database. Make sure to use the binary format (.mmdb) and that it’s current; these are updated regularly. Pass the location of this file using the --geo-ip option.
The pipeline can filter for log entries by date. Use the --month/-m option to specify a month to select. This can be repeated as many times as desired to collect more than one month of requests. The format should be the same as appears in the log entries, specifically, MMM/YYYY. If no month is provided, all log entries will be processed.
Identity data is collected from a custom DSpace identity bitstream. This can be specified using the --dspace option.
The path to one or more log files should be passed as arguments to the pipeline. For example:
(oastats)$ oastats pipeline -m Sep/2016 -m Oct/2016 \
--geo-ip data/GeoLite2.mmdb \
logs/2016/{09,10}/access.log 2>errors.log > output.csv
(oastats)$ psql -d database -c "COPY requests (status, country, \
url, referer, user_agent, datetime, document_id) FROM STDIN \
WITH CSV" < output.csv
Full command documentation:
(oastats)$ oastats pipeline --help
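The MMM/YYYY month format matches the way months appear inside Apache timestamps, such as [01/Sep/2016:09:30:00 -0400]. The sketch below shows one way such a filter could work: it pulls the month and year out of each line and keeps the line if it matches any requested month, or keeps everything when no month is given. The function name, the regular expression and the sample log lines are illustrative assumptions rather than the pipeline's actual implementation.

```python
import re

# Captures the "MMM/YYYY" part of an Apache timestamp such as
# [01/Sep/2016:09:30:00 -0400].
TIMESTAMP_MONTH = re.compile(r"\[\d{2}/([A-Z][a-z]{2}/\d{4}):")


def filter_by_month(lines, months=()):
    """Yield lines whose timestamp month is in months (all lines if empty)."""
    wanted = set(months)
    for line in lines:
        if not wanted:
            yield line
            continue
        match = TIMESTAMP_MONTH.search(line)
        if match and match.group(1) in wanted:
            yield line


# Made-up log lines using a documentation-only IP address.
log_lines = [
    '192.0.2.1 - - [15/Aug/2016:08:00:00 -0400] "GET /a HTTP/1.1" 200 512',
    '192.0.2.1 - - [01/Sep/2016:09:30:00 -0400] "GET /b HTTP/1.1" 200 512',
    '192.0.2.1 - - [31/Oct/2016:23:59:59 -0400] "GET /c HTTP/1.1" 200 512',
]
selected = list(filter_by_month(log_lines, ["Sep/2016", "Oct/2016"]))
print(len(selected))
```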
Generating the Summary Collection
Important
This subcommand will be removed once Mongo is no longer needed for the main OAStats website.
pipeline.cli.summary()
Create the summary collection in Mongo.
The current OAStats website uses a summary collection in Mongo which effectively functions as a pregenerated query cache. This command will generate and insert the necessary JSON objects into Mongo.
Though not required, it is recommended to create a temporary summary collection in Mongo and rename it to summary once this command has finished. For example:
(oastats)$ oastats summary --mongo-coll summary_new
(oastats)$ mongo oastats --eval \
'db.summary_new.renameCollection("summary", true)'
Full command documentation:
(oastats)$ oastats summary --help