[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: student project and slo-o-o-o-o-ooo-ow database progress



Stupendous Man wrote:
> 
>   Here's the situation.  I stupidly thought that my "observ" table
> (the one with all the raw observations of each detection) didn't have
> an index by "tass_id" -- although I suspect now that it actually did.
> So, I told postgres to make one:
> 
>    >   CREATE UNIQUE INDEX tass_TASS_ID_ux  ON TASS_Cat (TASS_ID);
> 
>   That was yesterday, about 28 hours ago.  The machine has been cranking
> and cranking away since then, and still hasn't finished with this job.
> My $PGDATA directory has the following entries (among others):
> 
> -rw-------   1 postgres postgres 50929664 Apr  4 10:43 tassm16_id_ux
> -rw-------   1 postgres postgres 50937856 Apr  4 10:43 tassm16_ra_bx
> -rw-------   1 postgres postgres 50970624 Apr  4 10:43 tassm16_dec_bx
> -rw-------   1 postgres postgres 172228608 Apr  4 10:43 tassm16

I discovered and error in my tassm16 importer.  All the RAs are off
by a factor of 10 or 15.  A one line typo.  The new stuff (0.7.1)
works OK
If the RAs in your tassm16 don't cover the full 360 span you can
drop the table as it is useless.  a re-compute takes "only" a few
hours. and is fully automated now with a script.

> -rw-------   1 postgres postgres 230531072 Apr  8 00:02 obs_ra_bx
> -rw-------   1 postgres postgres 274219008 May 25 15:41 tass_cat
> -rw-------   1 postgres postgres 304259072 May 27 16:31 obs_tass_id_hx
> -rw-------   1 postgres postgres 606158848 Apr 22 05:58 observ
> 
>   Even though the index is more than half the size of the "observ" table,
> it's still growing.  Yikes!  What's going on?
> 
>   I plan to leave it overnight (for a second night) and come back
> tomorrow.  If it's STILL running, I'll just kill it and bite the
> bullet.  Sigh.
> 
>   Do you have any idea why this is creating such a large index file?
> And do you also have any idea of why this is taking so long?

My largest table is smaller, about 250MB as I split observ.  It took
only a couple hours to build the index.  The per-record overhead is high
for a btree index but I don't think that high.

How are you starting Postgres?  I am now using a 2048 block buffer cache
and have fsync disabled.  I think I have also bumped the sort space up
by 10x (-S option)  All the advice I get says I should be using multiple
UW SCSI disks and have 250~512 MB of RAM for a database of this size.
That's an $8K system.  I have a computer like this here at work and I'll
try it out
> 
>   At the moment, my routine is a Perl script which
> 
>          1. grabs about 5000 records from the "tass_cat" and "observ"
>                tables, and sticks them into ASCII flat files

Because FETCH must be inside a BEGIN/END pair and you can not nest BEGIN/END
you will need to first get all the tass_cat data and hold it.  All you
should need is the tass_id.  Even 1M of these will fit in RAM.  As your
selection criteris is likely obs_count > n you will likely only get < 500K
rows.
> 
>          2. walks through the ASCII files once, calculating mean and
>                sigma for each star as it goes

For each ID in the above array.  I would do a "SELECT * from ovserv where 
tass_id = ..";
Use the result to compute Sigma the discard the ovserv records
> 
>          3. for each star, uses an "insert" command to add information
>                to a new table, which is very much like "tass_cat",
>                but contains mean and stdev values calculated properly

If you are doing more then a few 1000 INSERTS it is worth writing the data
to a file, I just do binary writes.  then when you are all finished
a) drop the table (in case it still exists)
b) COPY the above file into the table
c) create any index files required.
be sure to do c after b as it is faster overall.

Each INSERT requires a pass through the SQL parser and the query planner
COPY does too but each INSERT only adds one record. COPY can add millions.

I have found C to be faster then Perl but Postgres is always the bottleneck.

That's what I was thinking too.  Make a "tass_mags" tables that contains
magnitudes I, R, V for the well observed stars in tass_cat.  We would 
re-compute this once per week or so.
> 
>          4. loops back to step 1.
> 
>   This is limited by the speed at which information is taken from
> the "tass_cat" and "observ" tables.
> 
>   I'll try using "FETCH" and "COPY" commands, as you suggest.

You can hack code out of .../TDB_merge/catalog.c

  If,
> as you say, the INDEX files are broken or not used in Postgres, then
> I'm just wasting my time waiting for this INDEX to be created, aren't I?

No, INDEXes are not broken.  The query otimizer desides based on your
query, available statistics and available index files how best to do the
query.  Normally, if there are multiple ways to do a search it will estimate
the time required for each and choose the quickest.  Currently Postgres
does not concider using indexes if the "where" condition is a set of
expressions joined with "ORs".  and in the case in "IN".  "ANDs" work fine.

The command "EXPLAIN SELECT ...." will print out the "query execution
plan" but not run the query.  You can create an index file, do a vacuum
and see if that effects the plan.

This is a missing feature not a bug.  Indexes are also not used to
compute the min() and max() functions.  I see this on the development
groups published "to do" list along with writing a simga() function.
All we can do is either wait or help, or gripe.

> Sigh.
> 
>   Thanks very much for your help.
> 
> --
> -----
> Michael Richmond                   "This is the heart that broke my finger."
> mwrsps@rit.edu                     http://a188-l009.rit.edu/richmond/

-- 
--Chris Albertson

  chris@topdog.logicon.com                Voice:  626-351-0089  X127
  Logicon RDA, Pasadena California          Fax:  626-351-0699