Source code for pg_jts.pg_jts

# Copyright (C) 2015 ibu radempa <ibu@radempa.de>
#
# Permission is hereby granted, free of charge, to
# any person obtaining a copy of this software and
# associated documentation files (the "Software"),
# to deal in the Software without restriction,
# including without limitation the rights to use,
# copy, modify, merge, publish, distribute,
# sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is
# furnished to do so, subject to the following
# conditions:
#
# The above copyright notice and this permission
# notice shall be included in all copies or
# substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY
# OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
# LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

r"""
Create a generalized JSON-table-schema structure from a live postgres database.

The JSON data structure returned from :func:`get_database` is a generalization
of the `JSON-table-schema`_: The *resources* in our structure comply with the
table definition there (we extend it in allwoed ways). Our structure comprises
the whole database. It is the JSON-encoded form of a dictionary with these keys
(values being strings, if not otherwise indicated):

.. _`JSON-table-schema`:
   http://dataprotocols.org/json-table-schema/

    * **source**: the string 'PostgreSQL'
    * **source_version**: the PostgreSQL version returned by the server
    * **database_name**: the database name
    * **database_description**: the comment on the database
    * **generation_begin_time**: begin datetime as returned from PostgreSQL
    * **generation_end_time**: end datetime as returned from PostgreSQL
    * **datapackages**: a list of dictionaries, one for each PostgreSQL schema,
      with these keys:

      * **datapackage**: the name of the PostgreSQL schema
      * **resources**: a list of dictionaries, each describing a table within
        the current PostgreSQL schema and having these keys:

        * **name**: the name of the table
        * **description**: the table comment (only those components not part of
          a weak foreign key definition)
        * **primaryKey**: the primary key of the table, which is a list of
          column names
        * **fields**: a list of dictionaries describung the table columns and
          having these keys:

          * **name**: the column name
          * **description**: the column comment
          * **position**: 
          * **type**: the PostgreSQL data type, e.g., 'varchar(100)' or 'int4'
          * **defaultValue**: the default value of the column, e.g., '0', or
            'person_id_seq()' in case of a sequence
          * **constraints**: a dictionary describing constraints on the current
            column, with these keys:

            * **required**: boolean telling whether the column has a 'NOT NULL'
              constraint

        * **indexes**: a list of dictionaries, one per index and column,
          having these keys:

          * **name**: name of the index
          * **columns**: a list with the names of the columns used in the
            index and ordered by priority
          * **creation**: the SQL statement for creating the index
          * **definition**: the index definition, e.g., 'btree (id1, id2)'
          * **primary**: boolean telling whether the indexed columns form
            a primary key
          * **unique**: boolean telling whether the indexed columns are
            constrained to be unique

        * **foreignKeys**: a list of foreign keys used by the current table:

          * **columns**: the names of the columns in the current table which
            are referencing a remote relation
          * **enforced**: a boolean telling whether the foreign key constraint
            is being enforced in PostgreSQL (True), or if it is a weak
            reference and the constraint is kept only by the application
            software (False)
          * **reference**: a dict for specifying the reference target, having
            these keys:

            * **datapackage**: the name of the PostgreSQL schema in which the
              referenced table resides
            * **resource**: the name of the referenced table
            * **name**: the name of the foreign key constraint
            * **columns**: a list of the names of the referenced columns
            * **cardinalitySelf**: (optional) the cardinality of the foreign
              key relation (as obtained from a column or table comment)
              on the side of the current table
            * **cardinalityRef**: (optional) the cardinality of the foreign
              key relation (as obtained from a column or table comment)
              on the side of the remote table
            * **label**: (optional) a label describing the foreign key relation
              (as obtained from a column or table comment)

.. _foreign-key-syntax:

Foreign key syntax
~~~~~~~~~~~~~~~~~~

Foreign keys will be recognized where either a (hard) foreign key constraint
is present in PostgreSQL, or a table or column comment describes a foreign key
relation according to these syntax rules (we call this *weak reference*):

  * the comment is split at 1) ``;`` followed by a space character
    or 2) ``\n``, and results in what we call *components*
  * if a component matches one of the *relation_regexps*, we try to find
    a column name, a table name and an optional schema name in it; we match
    existing names in one of these four formats:

      * schema.table.column
      * table.column
      * schema.table(column1, column2, ..., columnN)
      * table(column1, column2, ..., columnN)

  * if a relation is valid, we also extract both cardinalities on the side of
    the table (card1) and on the foreign side (card2); the syntax is
    ``card1`` ``link`` ``card2``, where card1 and card2 are values in
    :any:`cardinalities` and link is one of ``--``, ``-`` with an optional
    space character on both sides (independently).
  * if a relation is valid, we also extract a label for the relation: when
    the component contains a string like ``label="<LABEL>"``, ``<LABEL>`` will
    be extracted. (On both sides of '=' an arbitrary number of white spaces may
    appear.

In cases where both a foreign key constraint and a weak reference are present,
the weak reference information supplements the constraint, in particular by
adding cardinalities (if present).
"""

import re
import json
from .pg_query import db_init
from . import pg_database as pd


re_components = re.compile('; |\n')
cardinalities = [
    '0..1',
    '1',
    '0..N',
    '1..N'
]
"""
Cardinalities.

These values are allowed in weak references.
"""
cards = '|'.join([re.escape(x) for x in cardinalities])
re_cardinalities = re.compile('(^| )(%s) ?--? ?(%s)( |,|$)'
                              % (cards, cards), re.I)
re_label = re.compile('(^| )label\s*=\s*"([^"]*)"( |$)')


[docs]def get_database(db_conn_str,
                 relation_regexps=None,
                 exclude_tables_regexps=None):
    """
    Return a JSON data structure representing the PostgreSQL database.

    Returns a JSON string and a list of notifications.
    The notifications inform about invalid or possibly unwanted
    syntax of the weak references (contained in the comments).

    A valid PostgreSQL connection string (*db_conn_str*) is required for
    connecting to a live PostgreSQL database with read permissions.

    The resulting data structure is missing some details. Currently
    mainly these structures are extracted from the database:

      * tables
      * foreign key relations (both constraints and weak references)
      * indexes

    The optional arguments have these meanings:

      * *exclude_tables_regexps* is a list of regular expression strings;
        if a table name matches any of them, the table and all its relations
        to other tables are omitted from the result
      * *relation_regexps* is a list of regular expression strings;
        if a table comment or a column comment matches any of them, it is
        parsed for a 'weak' foreign key relation
        (cf. :ref:`foreign-key-syntax`)
    """
    db_init(db_conn_str)
    if exclude_tables_regexps is None:
        exclude_tables_regexps = []
    begin_time = pd.get_now()
    database = pd.get_database()
    schemas = pd.get_schemas()
    res = []
    for schema in schemas:
        res_schema = {}
        schema_name = schema['schema_name']
        res_schema['datapackage'] = schema_name
        res_tables = []
        for table in pd.get_tables(schema_name):
            table_name = table['table_name']
            if not _check_exclude_table(exclude_tables_regexps, table_name):
                res_table = {}
                res_table['name'] = table_name
                table_comment = table['table_comment']
                if table_comment is not None:
                    res_table['description'] = table_comment
                constraints = _reshuffle_constraints(schema_name, table_name)
                if constraints['primary_key']:
                    res_table['primaryKey'] = constraints['primary_key']
                res_table['foreignKeys'] = constraints['foreign_keys']
                if constraints['unique']:  ## ????
                    res_table['unique'] = constraints['unique']  ## ????
                res_table['fields'] = _collect_columns(
                    schema_name,
                    table_name,
                    constraints['unique']
                )
                res_table['indexes'] = pd.get_indexes(schema_name, table_name)
                res_tables.append(res_table)
        res_schema['resources'] = res_tables
        res.append(res_schema)
    notifications = _add_annotated_foreign_keys(res, relation_regexps)
    end_time = pd.get_now()
    return json.dumps({
        'source': 'PostgreSQL',
        'source_version': pd.get_server_version(),
        'database_name': database,
        'database_description': pd.get_database_description(),
        'generation_begin_time': begin_time,
        'generation_end_time': end_time,
        'datapackages': res
    }), notifications


def _check_exclude_table(exclude_tables_regexps, table_name):
    """
    Return whether table *table_name* is to be excluded.

    Apply the patterns from *exclude_tables_regexps*.
    If at least one matches, return True.
    """
    for exclude_tables_regexp in exclude_tables_regexps:
        if re.search(exclude_tables_regexp, table_name):
            return True
    return False


def _reshuffle_constraints(schema_name, table_name):
    """
    Return primary key, foreign key and unique constraints for a table.

    See also: :func:`_collect_column_constraints`
    """
    constraint_names = []
    pk_column_names = []
    foreign_keys = {}
    unique = {}
    for constraint in pd.get_constraints(schema_name, table_name):
        # constraints are ordered by column_position
        column_name = constraint['column_name']
        constraint_type = constraint['constraint_type']
        c_oid = constraint['constraint_oid']
        if constraint_type == 'u':
            if c_oid not in unique:
                c_schema = constraint['constraint_schema']
                c_name = constraint['constraint_name']
                unique[c_oid] = {
                    'name': c_name,
                    'fields': []
                }
            unique[c_oid]['fields'].append(column_name)
        elif constraint_type == 'p':
            pk_column_names.append(column_name)  # using proper column ordering
        elif constraint_type == 'f':
            if c_oid not in foreign_keys:
                foreign_keys[c_oid] = {
                    'fields': [column_name],
                    'reference': {
                        'datapackage': constraint['referenced_schema'],
                        'resource': constraint['referenced_table'],
                        'name': constraint['constraint_name'],
                        'fields': []
                    },
                    'enforced': True
                }
            ref_col = constraint['referenced_column']
            foreign_keys[c_oid]['reference']['fields'].append(ref_col)
    return {
        'unique': list(unique.values()),
        'primary_key': pk_column_names,
        'foreign_keys': list(foreign_keys.values())
    }


def _collect_columns(schema_name, table_name, unique):
    """
    Return a column structure for a given table in a given schema.

    Return a list of dicts, each describing a table column.
    """
    res_columns = []
    columns = pd.get_columns(schema_name, table_name)
    for column in columns:  # columns are ordered by ordinal position
        res_column = {}
        res_column['name'] = column['column_name']
        res_column['type'] = column['datatype']
        description = column['column_comment']
        if description:
            res_column['description'] = description
        default_expr = column['column_default']
        if default_expr:
            res_column['default_value'] = _format_default(default_expr)
        constraints = _collect_column_constraints(column, unique)
        if constraints:
            res_column['constraints'] = constraints
        res_columns.append(res_column)
    return res_columns


def _collect_column_constraints(column, unique):
    """
    Collect constraints for a column.

    Use column information as well as unique constraint information.
    Note: for a unique constraint on a single column we set
          column / constraints / unique = True
          (and store all multicolumn uniques in the table realm)
    """
    res = {}
    if 'null' in column:
        res['required'] = column['null']
    for constr_i, constr in enumerate(unique):
        if column['column_name'] in constr['fields']:
            if len(constr['fields']) == 1:
                res['unique'] = True
    return res


def _format_default(expr):
    """
    Return text from a default value expression.

    Return a simplified form of a PostgreSQL default value.
    """
    if expr.lower().startswith('nextval('):
        r = expr.split("'", 1)[1]
        r = r.rsplit("'", 1)[0]
        return r + '()'
    elif expr.startswith("'"):
        r = expr.split("'", 1)[1]
        r = r.rsplit("'", 1)[0]
        return "'" + r + "'"
    else:
        return expr


def _add_annotated_foreign_keys(schemas, relation_regexps):
    """
    Add foreign keys defined in column comments.

    *schemas* must be a list of schemas as in :func:`get_database`.
    *relation_regexps* must be a list of regular expression strings for
    matching a 'weak' foreign key reference.
    """
    all_notifications = []
    schema_table_column = get_schema_table_column_triples(schemas)
    if relation_regexps:
        res_relation = [re.compile(x) for x in relation_regexps]
        for schema in schemas:
            schema_name = schema['datapackage']
            for table in schema['resources']:
                all_relations = []
                table_name = table['name']
                for column in table['fields']:
                    column_name = column['name']
                    if 'description' in column:
                        relations, notifications, comments = \
                            _parse_description(
                                schema_name,
                                table_name,
                                column_name,
                                column['description'],
                                schema_table_column,
                                res_relation
                            )
                        all_notifications += notifications
                        all_relations += relations
                        column['description'] = '; '.join(comments)
                if 'description' in table:
                    relations, notifications, comments = _parse_description(
                        schema_name,
                        table_name,
                        None,
                        table['description'],
                        schema_table_column,
                        res_relation
                    )
                    all_notifications += notifications
                    all_relations += relations
                    table['description'] = '; '.join(comments)
                _merge_foreign_keys(table['foreignKeys'], all_relations)
    return all_notifications


def _parse_description(schema_name, table_name, column_name,
                       description, schema_table_column, relation_regexps):
    r"""
    Extract relation information from a column or table comment.

    Split the description into components at '\n' as well as at '; '.
    Check each component for whether one of the *relations_regexps* does match.
    If so try to match (optionally a schema name,) a table name and
    the name(s) of a (tuple of) column(s) as well as two cardinalities.
    In case of a table comment also match another tuple of column names
    of the current table. For a table comment set *column_name*=None.

    Return a list of the found relations, a list of notifications from
    syntax parsing and a list remaining component (i.e., comment parts
    in which no relation was found).
    """
    current = (schema_name, table_name, column_name)
    current_text = '(schema=%s, table=%s, column=%s)' % current
    relations = []
    notifications = []
    components = re_components.split(description)
    comments = []  # remaining components
    if column_name is None:
        table_column_names = [x[2] for x in schema_table_column
                              if x[0] == schema_name and x[1] == table_name]
    for component in components:
        if not any([regex.search(component) for regex in relation_regexps]):
            comments.append(component)
            continue
        for s, t, c in schema_table_column:
            s_ = re.escape(s)
            t_ = re.escape(t)
            c_ = re.escape(c)
            found = 0
            p = re.search(' %s\.%s\.%s( |$)' % (s_, t_, c_), component)
            if p:
                found = 1
            if not found:
                p = re.search(' %s\.%s ?\( ?%s[, \)]' % (s_, t_, c_),
                              component)
                if p:
                    found = 2
            if not found:
                p = re.search(' %s\.%s( |$)' % (t_, c_), component)
                if p:
                    found = 3
            if not found:
                p = re.search(' %s ?\( ?%s[, \)]' % (t_, c_), component)
                if p:
                    found = 4
            if not found:
                continue
            matched1 = p.group(0)
            related_schema = s
            related_table = t
            related_columns = [c]
            if found in (3, 4):
                related_schema = 'public'
                if (related_schema, related_table) == current[:2]:
                    notifications.append(
                      ('INFO', current_text +
                       ' Dropping reference to same table ("%s")' % component)
                    )
                    continue
            if found in (2, 4):
                s1 = s + '.' if found == 2 else ''
                pattern = re.escape(s1 + t) + ' ?\((' +\
                          re.escape(c) + '[^\)]*)\)'
                m = re.search(pattern, component)
                if m:
                    matched1 = m.group(0)
                    cols = m.group(1).split(',')
                    related_columns = []
                    for col in cols:
                        col_name = col.strip()
                        related = (related_schema, related_table, col_name)
                        if related in schema_table_column:
                            if related[:2] == current[:2]:
                                notifications.append(
                                  ('INFO', current_text +
                                   ' Dropping reference to same table ("%s")'
                                     % component)
                                )
                            else:
                                related_columns.append(col_name)
                        else:
                            notifications.append(
                              ('WARN', current_text +
                               ' Inexistent referenced column "%s" in "%s"'
                                 % (col_name, component))
                            )
                else:
                    notifications.append(
                      ('WARN',
                       current_text + ' No closing bracket: "%s"' % component)
                    )
            m = re_cardinalities.search(component)
            cardinality_self = None
            cardinality_ref = None
            matched2 = ''
            if m:
                matched2 = m.group(0)
                cardinality_self = m.group(2)
                cardinality_ref = m.group(3)
            else:
                matched2 = ''
            m_label = re_label.search(component)
            matched3 = ''
            label = None
            if m_label:
                matched3 = m_label.group(0)
                label = m_label.group(2)
            if found:
                break
        else:
            notifications.append(
              ('WARN', current_text +
               ' No valid reference target found: "%s"' % component)
            )
        if found:
            if column_name is not None:
                # column comment
                relations.append({
                    'fields': [column_name],
                    'reference': {
                        'datapackage': related_schema,
                        'resource': related_table,
                        'fields': related_columns,
                        'cardinalitySelf': cardinality_self,
                        'cardinalityRef': cardinality_ref,
                        'label': label
                    },
                    'enforced': False
                })
            else:
                # table comment
                rest = component.replace(matched1, '')\
                                .replace(matched2, '')\
                                .replace(matched3, '')
                for col_name in table_column_names:
                    r = re.search('(^|\s+)\(\s*(%s\s*,[^\)]+)\)\s'
                                  % re.escape(col_name), rest)
                    if r:
                        col_s = r.group(2)
                        col_names = [s.strip() for s in col_s.split(',')]
                        if not set(col_names) <= set(table_column_names):
                            notifications.append(
                              ('WARN', current_text +
                               ' Invalid source column names "%s" found in:'
                               '"%s"' % (col_s, component))
                            )
                        else:
                            relations.append({
                                'fields': col_names,
                                'reference': {
                                    'datapackage': related_schema,
                                    'resource': related_table,
                                    'fields': related_columns,
                                    'cardinalitySelf': cardinality_self,
                                    'cardinalityRef': cardinality_ref,
                                    'label': label
                                },
                                'enforced': False
                            })
        else:
            comments.append(component)
    return relations, notifications, comments


def _merge_foreign_keys(fk_constraints, fk_relations):
    """
    Merge annotated foreign key relations into foreign key constraints.

    (Both constraints and relations are for the same table.)
    Add all relations, except if a matching constraint exists: then amend
    the constraint by adding cardinality information.
    """
    for rel in fk_relations:
        for constr in fk_constraints:
            if rel['fields'] == constr['fields']:
                r1 = rel['reference']
                r2 = constr['reference']
                if r1['datapackage'] == r2['datapackage'] and \
                   r1['resource'] == r2['resource'] and \
                   r1['fields'] == r2['fields']:
                        r2['cardinalitySelf'] = r1['cardinalitySelf']
                        r2['cardinalityRef'] = r1['cardinalityRef']
                        break
        else:
            fk_constraints.append(rel)


[docs]def get_schema_table_column_triples(database):
    """
    Return a list of all (schema_name, table_name, column_name)-combinations.

    *database* must have the same structure as obtained from
    :func:`get_database`.
    """
    res = []
    for schema in database:
        schema_name = schema['datapackage']
        for table in schema['resources']:
            table_name = table['name']
            for column in table['fields']:
                res.append((schema_name, table_name, column['name']))
    return res