
RE: size musings



I am not so sure that a completely general data base with 1E9 entries is 
the desired format.

We will really have 100 measurements each on 1E7 stars.  It is more complex 
than that.  We will have a few "good" measurements on each star, and many 
"fair" measurements.

Now when we ask for "all stars with a color index > XX brighter than
mag=12", it is not so difficult to do, even with a flat file.  This assumes
we have already reduced the 1E9 measurements to a 1E7 star file, which I
think we will want to do in any case.  Once we have the stars of interest,
we can go to a second flat file to get all the measurements for those stars.
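
The two-pass idea above can be sketched as follows.  This is only an illustration under stated assumptions: the record layouts (SUMMARY, MEASUREMENT), the field order, and the function names are all hypothetical, since the post does not define a file format.

```python
import io
import struct

# Hypothetical fixed-width layouts (assumptions, not from the post):
# summary record: star id, mean magnitude, color index
SUMMARY = struct.Struct("<Iff")
# measurement record: star id, Julian date, magnitude
MEASUREMENT = struct.Struct("<Idf")


def read_records(stream, layout):
    """Yield unpacked tuples from a stream of fixed-width records."""
    while True:
        chunk = stream.read(layout.size)
        if len(chunk) < layout.size:
            return
        yield layout.unpack(chunk)


def select_stars(summary_stream, min_color, mag_limit):
    """Pass 1: one linear scan of the per-star summary file."""
    return {
        star_id
        for star_id, mag, color in read_records(summary_stream, SUMMARY)
        # "brighter than" means a numerically smaller magnitude
        if color > min_color and mag < mag_limit
    }


def fetch_measurements(measurement_stream, wanted):
    """Pass 2: one linear scan of the measurement file, keeping wanted stars."""
    found = {star_id: [] for star_id in wanted}
    for star_id, jd, mag in read_records(measurement_stream, MEASUREMENT):
        if star_id in found:
            found[star_id].append((jd, mag))
    return found


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two flat files.
    summary = io.BytesIO(SUMMARY.pack(1, 11.5, 0.75) + SUMMARY.pack(2, 12.5, 0.75))
    measurements = io.BytesIO(MEASUREMENT.pack(1, 2451957.5, 11.5))
    stars = select_stars(summary, min_color=0.5, mag_limit=12.0)
    print(fetch_measurements(measurements, stars))
```

Both passes read their file once from front to back, so the cost is set by file size rather than by how selective the query is.  If the measurement file were sorted by star id, the second pass could presumably seek instead of scan, but that is a further assumption.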

I think my personal goal will be to get out a summary file which contains
the 1E7 stars that we have measured, with codes which tell what can be
quickly determined, i.e. color, variable type with some probability,
magnitude, location, etc.  This should give us two lists: 9.9E6 pretty
boring stars, and 1E5 stars of possible interest.  Thus I hope to "publish"
a 1E7 catalog that contains all the stars, and a 1E5 catalog that contains 
the interesting stars with all the measurements for each star.  Looks like 
each catalog will fit on a handful of CDs.  Note that I would not attempt 
to "classify" each star but rather make the data available so that experts 
could "mine" it.  I will also keep all my raw data, and attempt to respond 
to requests for raw data sets.

That is the beauty of the TASS organization.  Chris can set up his 200 GB
subsystem and let people search it for tasks that involve complex
arguments.  You may recall that one of our lurkers is from the Microsoft 
Huge Data Base Group.  So it may be possible to get the hardware set up 
somewhere.  I will want to circulate CD sets.  We shall see which data gets 
used.  I would hope that both produce science.

Chris, tell me how your file would work if I asked the following question:
"Give me all the stars where most of the measurements are tightly grouped
but which have 1, 2, or 3 outliers of more than 4 sigma."  (Or whatever
sigma one should specify to make the probability low that you would find
one by chance in 100 measurements.)  Would this not require immense
computation from your data base?  Would it not be just as easy to compute
this answer from a flat file?  I think this is the question that Mike wants to ask to find
planets.  So it is a serious question.  My guess is that it would be faster 
to linearly grind through the whole data set to get this answer.  There 
must be some overhead related to the index data base structure which would 
make it less efficient.  I think my argument is that there will be some 
tasks that are well defined that we will surely want to do.  These might 
best be done with a linear approach.  Others will be "I wonder if" types of 
questions.  These will want the indexed data base.
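
As a rough sketch of the linear outlier search described above: the code below assumes the measurements are already grouped per star, and it uses the median and the scaled median absolute deviation as the "tightly grouped" center and sigma so that the outliers themselves do not inflate the spread.  That choice, and all function names, are assumptions for illustration and not anything specified in this thread.  The last function estimates how often a chance outlier would appear if the errors were Gaussian.

```python
import math
import statistics


def count_outliers(mags, n_sigma=4.0):
    """Count points more than n_sigma robust sigmas from the median."""
    center = statistics.median(mags)
    mad = statistics.median([abs(m - center) for m in mags])
    sigma = 1.4826 * mad  # scales MAD to a Gaussian standard deviation
    if sigma == 0.0:
        # More than half the points are identical; any differing point counts.
        return sum(1 for m in mags if m != center)
    return sum(1 for m in mags if abs(m - center) > n_sigma * sigma)


def is_candidate(mags, n_sigma=4.0, max_outliers=3):
    """True when a star has between 1 and max_outliers outliers."""
    return 1 <= count_outliers(mags, n_sigma) <= max_outliers


def scan(stars, n_sigma=4.0, max_outliers=3):
    """One linear pass over (star_id, magnitudes) pairs."""
    return [
        star_id
        for star_id, mags in stars
        if is_candidate(mags, n_sigma, max_outliers)
    ]


def chance_probability(n_measurements, n_sigma):
    """Probability of at least one two-sided n_sigma point, Gaussian errors."""
    p_single = math.erfc(n_sigma / math.sqrt(2.0))
    return 1.0 - (1.0 - p_single) ** n_measurements
```

Under the Gaussian assumption, chance_probability(100, 4.0) comes out at roughly 0.6 percent, so a 4 sigma cut over 1E7 stars would still pass tens of thousands of stars by chance alone.  Real photometric errors are unlikely to be purely Gaussian, so treat that figure as a lower bound on the false alarms rather than a prediction.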

Tom Droege

At 06:08 PM 2/16/01 -0800, you wrote:

>When you have LOTS of data you have to choose between
>storing it so that it is compact or so that it can be
>searched quickly.  For example suppose you have 1E9
>photometric measurements (not an unreasonable number)
>the most compact storage method is (say) one 40 byte
>record per observation in one large flat file.  This
>sets a lower limit of 40GB.  Now suppose you need to
>find "all stars with a color index > XX brighter than
>mag=12."  Well, you are sunk.  The only method that would
>work is to read all 1E9 records.  It would take a long
>time.  As soon as you demand fast access to selected
>parts of your data you need index files.  In fact you need many index
>files and some method of keeping them up to date.
>Now your data is not so compact.  The index files will
>be at least as big as the data.  More likely three or
>more times the size of the data itself.
>
>The up side is that if you need a subset of the data the
>time to read it out can be proportional to the size of the
>subset and (mostly) independent of the total amount of data.
>
>The good news is that 100 or 200GB disk subsystems and computers
>with 1GB RAM are becoming personally affordable.  Yes, I think
>the Mk IV database will come in the 100~200GB range.
>
> > -----Original Message-----
> > From: Creager, Robert S [mailto:CreagRS@LOUISVILLE.STORTEK.COM]
> > Sent: Friday, February 16, 2001 1:48 PM
> > To: 'Tass Mailing List'
> > Cc: 'dgo'
> > Subject: size musings
> >
> >
> >
> > Pardon, but I was explaining the size problem to a co-worker,
> > and started
> > musing about possibilities of the entire system...  These
> > numbers might be
> > on the low side - any comments?
> >
> > 2000 stars per image
> >
> > 2 images per picture set
> > 100 picture sets per night
> > 100 nights per year
> >
> > 20,000 images to be analyzed a year
> > 8MB per image
> > 160 Gigabytes of data per site to be analyzed per year
> >
> > 40,000,000 stars per site per year, and if we keep every
> > observation in a
> > database:
> > 26 bytes per star (observation list only)
> > 1 Gigabyte per site per year of star data.
> >
> > 12 sites...
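
The per-site figures quoted above can be checked with a few lines of arithmetic.  This is a minimal sketch that takes the quoted inputs at face value and assumes the image size is 8 megabytes (decimal); the constant and function names are made up for illustration, and the 12-site totals are simply the per-site numbers multiplied out.

```python
# Inputs taken from the quoted estimate.
STARS_PER_IMAGE = 2000
IMAGES_PER_SET = 2
SETS_PER_NIGHT = 100
NIGHTS_PER_YEAR = 100
IMAGE_BYTES = 8_000_000  # assumption: 8 megabytes per image
BYTES_PER_OBSERVATION = 26
SITES = 12


def yearly_totals(sites=1):
    """Return yearly image and observation counts and sizes in gigabytes."""
    images = IMAGES_PER_SET * SETS_PER_NIGHT * NIGHTS_PER_YEAR * sites
    observations = images * STARS_PER_IMAGE
    return {
        "images": images,
        "image_gb": images * IMAGE_BYTES / 1e9,
        "observations": observations,
        "observation_gb": observations * BYTES_PER_OBSERVATION / 1e9,
    }


if __name__ == "__main__":
    print(yearly_totals())       # one site
    print(yearly_totals(SITES))  # all twelve sites
```

For one site this reproduces the 20,000 images, 160 gigabytes of image data, and 40,000,000 observations at about 1 gigabyte per year given in the quoted message.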
> >
> >
> > Robert Creager
> > Senior Software Engineer
> > Client Server Library
> > 303.673.2365 V
> > 303.661.5379 F
> > 888.912.4458 P
> > StorageTek
> > INFORMATION made POWERFUL
> >
> >
> >