Tuesday, May 15, 2012

Greenplum's distribution keys

Greenplum is a distributed database.  Specifically, when you create a table, it creates the same table on all it's child nodes and distributes out the data based on the key you provide.   If you omit the key, Greenplum uses round-robin to distribute the data, and you can get nodes end up shouldering too much of the load.   At ClickFox we had an issue with omitted and bad distribution keys.   I wrote this 'little' query this afternoon to generate a report on all the distribution keys.  Hopefully someone find it useful:



SELECT pgn.nspname,
       pgc.relname,
       pga.attname
FROM   (SELECT gdp.localoid,
               CASE
                 WHEN ( Array_upper(gdp.attrnums, 1) > 0 ) THEN
                 Unnest(gdp.attrnums)
                 ELSE NULL
               END AS attnum
        FROM   gp_distribution_policy gdp
        ORDER  BY gdp.localoid) AS distrokey
       INNER JOIN pg_class AS pgc
         ON distrokey.localoid = pgc.oid
       INNER JOIN pg_namespace pgn
         ON pgc.relnamespace = pgn.oid
       LEFT OUTER JOIN pg_attribute pga
         ON distrokey.attnum = pga.attnum
            AND distrokey.localoid = pga.attrelid
ORDER  BY pgn.nspname,
          pgc.relname