GEEK Speek: January 2007

One of my pet projects is log management. Yeah, I know - log management doesn't sound like fun - and most of the time it's really not. I became interested in log management when the company I work for wanted to consolidate all the firewall, IPS, etc. logs for easy review and, ideally, correlation. This became a pet project of mine simply because of all the MISINFORMATION many of the vendors were selling, and my FRUSTRATION at the lack of clear information. Admittedly, some of this frustration also came from my LACK OF UNDERSTANDING of what I was looking for - and without clear requirements, I've found that I'm rarely satisfied with any result.

So this blog will be dedicated to reviewing the path I followed in my learning - I hope there are a few "gold nuggets of wisdom" that can be gleaned along the way.

Frustration & Misinformation
After looking at a few different products, their capabilities and extensibility, I became frustrated with the lack of information the vendor "technical leads" really had on how the products worked, what their limitations were, and the fact that they all said their products were "infinitely scalable". Does anyone actually believe that a product is "infinitely scalable"? I hope it comes with a silver bullet.

If the "technical leads" can't explain, in terms I can understand, what differentiates their product from another, how can I ever make an educated decision? After all, they're the ones who understand this log management space, right?

Lack of Understanding
I took this on more as a frustrated challenge than actual "fun". I wanted to prove, to myself at least, that I could probably do as much (or nearly so) as any of these vendors could with the $1 Million dollar price-tag software they were pushing. After all, Syslog has been in use for a couple of decades, right? How tough could this be?

As it turns out, you CAN do a lot with free tools - but not everything - or at least not yet. Even if I can't do everything, I've learned a lot (and am still learning). With all the work I've put into this so far, I hope this info is worth sharing - so if you are buying $1Million dollar piece of software, hopefully this will help you know what to look for, determine what you need, and be able to tell the vendor when you think they're full of crap.

As I said, this posting is going to delve into the complexities of Log Management. I'll explain, briefly, what log correlation is, but I won't get into that just yet - that'll be another post. There's a lot of info to share already.

Correlation vs Consolidation
Log correlation is the act of taking logs from disparate sources, and combining the information together in some meaningful way to help determine the underlying problem. For example, a simple failure event on a firewall may be commonplace, and something you'd typically ignore as background noise; but if it's coming from a UNIX box that just had someone successfully logon remotely, it may be more meaningful - perhaps even a compromised box. That's rather simplified, but I hope you get the idea.

Log consolidation is simply that - getting all logs into a common repository.

They are complementary technologies, but the primary goals are vastly different since. With correlation, not all of the informatin is necessarily maintained, just the correlated result. Most IDS/IPS systems work in a similar manner - entire packets aren't maintained, just the fact that the packed matched some signature.

A common misconception is that log correlation is a superset of log consolidation. After all, if correlating logs, the system doing the correlation has to get data from all the sources it's correlating, so it's very similar to a repository of data. Unfortunately, it's not that simple - but I'll have to address that in my future correlation blog. I will say that I believe correlation to be much more complex than collection and leave it at that for now.

Consolidating the Logs
My journey along the the log management path started with the relatively simple (though non-trivial) task of log consolidation. In it's simplest terms, all that's necessary for log consolidation is to get the data from a remote source and write it to disk.

From a performance standpoint, a ridiculous amount of data can be handled by pushing all of this data to a flat file. I can easily handle tens of gigs per day, even on a relatively slow server using this simplified architecture. The primary bottleneck in this scenario is Disk I/O - assuming the disk is not big$$ disk arrays. Typical server RAID arrays can handle Megs of data per second. From a performance standpoint, no big problems so far.

It's when trying to READ this data, however, things start to go downhill; and, let's face it, the data is being collected for a reason. With this volume of data in a flat file, though, it's like looking for a needle in a haystack if to answer something as simple as "who caused this account to get locked out", or "are all these SSH sessions coming from one (misbehaving) host, or is someone trying to break in", etc.

So, being the rational sort, I ask "Why not just push all the data into a database to allow for easy querying and reporting?". This should resolve most of my READ issues by limiting the amount of data I need to pour through.

Now, with databases, a few more issues came up:

transaction times
indexes
parsing
data archival
maintenance

As I already mentioned, relative to the rest of the system, disk is slow - databases are even slower. Why? Well, this gets even more convoluted to understand, so please bear with me as I explain some technical details while I go through the description of my trials.

Transactions
Because of the transactional nature of databases, all events involved in a single "transaction" against the database either all succeed or all fail. So, to keep it simple, if I'm INSERTing a new record into the database, the database needs to first "stage" the data to a temporary location - called the transaction log - in case the insert fails (cancelled, out of disk space, database not available, etc.). Then, the database software will periodically say "all the transactions that have successfully completed in the transaction log will now be written to the actual database and expired from the transaction log". That's a lot of disk management and data movement, especially since most event records will be less than 1024 bytes; at least 3 disk writes take place, one to the transaction log, one to the actual database, and one to expire the transaction log data once written to the database. The nature of the beast means that this is SLOW, so databases reduce the number of events per second the system can handle.

Indexes
The whole point of writing the data to a database was so that we could search through it quickly rather than scanning Gigs of log files for individual events and then trying to pick out the relevant ones. That means indexes are needed for fast search, which translates to analyzing the data for sort-order, and writing the index. Index updates are also transactional, and all of this takes up CPU time and disk I/O.

Parsing
Most events found in logs are full-text items and look similar to the following:
"June 10 2006 3:02pm Event ID 32: User XXX logged on from Y.Y.Y.Y IP address" or
"July 11 2006 4:04am Event ID 108: Firewall allowed TCP port 80 to destination A.A.A.A from source B.B.B.B".

Short of performing substring searches for every query run, or creating a full-text index (not recommended when dealing with millions or billions of records), the relevant pieces of information need to be parsed out and store in the database.

So I'd want to parse out the following values for the first fictitious event above:
June 10 2006, 3:02pm, XXX, Y.Y.Y.Y
For the second event, I'd want to parse out:
July11 2006, 4:04am, 108, TCP, 80, A.A.A.A, B.B.B.B

This takes up CPU time for each and every event, and also requires architecing a database format that allows for multiple strings associated with a single event - not a flat database table. And I'd have to parse these out FAST, because there are more events coming in all the time.

There are some trade-offs that can be made, like parsing out just the Event ID number, and date/time stamps, and indexing just those. Then, when I need to look something up, I can at least narrow the criteria, and then do substring searches. In fact, I do exactly this all the time - it works very well, but does have limits when I'm creating a report making use of a very large number of data records.

Data Archival
Archival caused me quite a few problems with database management - mostly because I was using MS SQL Server 2000. I didn't understand, at the time, how important a function called "partitioning" could really be - something many database systems have, but not introduced into MSSQL until SQL 2005.

Obviously, I couldn't just let my database grow forever. I had to delete old data after a while, and I wanted to be able to put it to backup tape so I could restore it later if needed. Restoring database data without restoring an entire database is a whole new art-form that DBAs and backup software companies have been dealing with for a long time, but as it turns out, that's not something I needed to get involved with. I simply wrote the data to a flat file (remember, there's very little overhead to doing this) as I wrote it to the database. No more need for archival since I'm creating the archive as I insert data into the database.

I still had the problem of expiring/deleting the old data, though. Remember, databases are transactional - more disk writes, more CPU, more memory used - even for delete activities. What's worse, I'm trying to delete a LOT of data all at once, not just simple transactions like writing individual events to the database - I'm talking 20 million records per day.

Whenever this task ran, all the events coming in from my remote devices would get queued up (because the database is busy with a huge delete event) and I'd eventually run out of physical memory due to all the queued events and the collection process would crash. There was also a disk space issue since my transaction log was storing millions of records during this delete process. So, breaking this down into smaller chunks (let's say, one hour at a time) helped, but this could still take several minutes - causing more event queueing. I couldn't guarantee memory wouldn't run low especially if a system was really under attack - and I really didn't want to lose the data during those instances.

Here's the advantage of "partitioning". Partitioning allows chunks of data to be stored into separate database files rather than one large file like traditional databases. Then, an entire partition file can be dropped rather than trying to delete chunks of data from a single file. So, if by creating daily partitions, it's pretty easy to drop any single day of data - there's still some performance considerations, but not nearly as bad.

SQL 2005 wasn't available at the time, and I didn't have the luxury of moving to Oracle. So, I created a batch job that was scheduled to run every hour, which deleted all the records older than 30 days in chunks of 100,000 records. Even though it specified data older than 30 days, since this runs hourly, it's really only deleting 1 hour worth of data during any typical run. Further, since I'm deleting it in 100,000 event chunks, then pausing for a few seconds, it gives other processes time to access the database (less queueing). There are a few other optimizations to this technique, but I've already documented most of that here on the Adiscon site (I was using WinSyslog at the time I developed the T-SQL Script).

So far, so good - but every now and again, maintenance still needs to be run (update indexes, shrink files, make sure there's no corruption, etc).

Maintenance
Maintenance is a brutal time consumer of SQL databases. MSSQL 2000 promises being able to run maintenance on a live database, which is true, but the same maintenance run against a live database runs 18 hours vs. 2.5 hours if the same database is off-line. These times were what I experiences on my hardward - these time will vary depending on how many and which maintenance tasks are being run. I'll be honest that I haven't revisited this aspect of MSSQL 2005 yet to see if there are any significant improvements, but they would have to be very significant to improve by 15.5 hours.

The best I could do
The best I could ever get was a sustained rate of about 1,800 events/second with everything else happening. I would get periodic spikes which would cause backlog, but these would finish up during low-usage times and the system would continue running.

But this wasn't enough for me. Trying to insert data, delete data, and run maintenance with the volume of events, all in real time, did me in. I simply couldn't live with just 1,800 events/second on average. I would get hours at a time where I'd see 2,000+ events/second.

Time to Re-evaluate my Priorities
Big question: Do I need to have all of this data into the database real-time?

For me, the answer was "no". I was doing next-day reporting and forensic analysis with the data. I know I could easily pull all of this to a flat file. So, for me, the solution was to do a nightly batch INSERT of all the events in my flat file. As long as maintenance and delete scripts weren't running, this worked really well, and allowed me to run all the reports I wanted to. I even created a modified version of my import script which would allow me to do a bulk-import of the files restored from tape during an investigation. This method also allowed me to focus on database optimization so queries would run faster, rather than just keeping the database running at all.

So, I had 4 scheduled tasks on my server:
1. Hourly data delete (impact of doing this throughout the day is low, and prevents any nighttime bottlenecks). This would check to ensure none of the other 3 tasks are running first, so sometimes it would actually delete 2-3 hours worth during some cycles.
2. Nightly import script - at the end of which I update the datetime index used regularly by Step 1 - I found this make Step 1 run better each day. The script also makes sure that the next 2 scripts aren't running first.
3. Saturday mornings I do a full reindex (not just the datetime index).
4. Sunday mornings I perform a database defrag & shrink operations. This has a significant impact on improving performance.

There was no need for integrity checking since the database could always be rebuilt from the flat files, so I also set the database to a "simple" recovery model rather than the "full" recovery model when I was doing only direct-to-database logging.

But I still had a really big question...

How do the vendors manage to resolve all of these problems?

I started this article off by highlighting 3 problems I was experiencing: Misinformation, Frustration, and Lack of Understanding.

They say "forewarned is forearmed" - and after all my work - I was ready to talk to vendors again.

At this point, I knew exactly what to ask to ensure I received clear answers; I knew what I needed the product to be capable of (including data volumes which is important); I knew how to get an answer that made sense to me rather than "smoke and mirrors" answers that I'd received before. My belief, at this point, is that any database-driven solution won't handle my 10+GB of data per day (and growing) without some VERY serious hardware on the back end which would make the solution cost prohibitive.

But some vendors claim they can collect tens-of-thousands of events per second - and still allow for fast querying & reporting - even though they're just and appliance. How could that be?

They maintain flat files, with database indexes.

This makes perfect sense to me. Speed of flat files for storage, and using database-like indexes for fast searches. This also requires very little maintenance since, in a worst-case scenario, the index can simply be rebuilt. I'm sure though, that there would need to be some special handling routines for detection of corrupted files, etc, but overall a sound solution. This also makes restoring archived data very fast since it means just re-creating an index once the flat files are restored.

Some vendors - probably all of them handling large volumes - have been down the same path as I have, and learned the same mistakes I have, and have developed this hybrid approach. To my knowledge, there is no open-source solution performing in this manner, but unless very high volume collection is required, one of the above approaches may already work.

I really hope my sojourn down memory lane helps others to avoid some of the same pitfalls that I've experienced - and hope you've learned something that will help you. Keep in mind, I've only talked about log collection this time around.

Don't think, though, that this is the only thing that differentiates the vendors from home-grown, or even from each other. Very few logs are standardized, and definitely no standardization between competing vendor products, so event parsing can be a lot of work. There's also correlation, reporting & analysis, which are all hefty topics.

But, there's plenty of internet whitespace left to discuss these topics in the future...

I've read a lot of articles recently like this one, often quoting (mis-representing?) some very respectable sources such as SANS and Postini. They imply that Vista will somehow magically fix all the security woes that MS has been undergoing and that the internet will be a magically safer place. I'm sorry people, but that's just a pipedream!

Is Vista more secure than previous OSes - for sure. Will it "change the threat landscape" - hell no!

Look, I'm not going to quote any real statistics - mainly because I can't be bothered to look them up, but there are still Win95, Win98, WinME, NT4, Win2000, Win2003, old Linux & Unix boxes that have never been maintained (or not adequately from a security perspective) - and let's not forget that this month is the Month of Apple Bugs.

Let's say, just for arguments sake, that everyone who could afford the whopping $239USD pricetag on Vista, ran out to buy it and installed it on their probably-not-fast-enough-to-really-run-Vista machine. Let's further assume, as is probably reasonable, that a whole bunch of people will also install a pirated copy of this same OS. This would mean that we'd get what...maybe 40-50 percent of the Windows workstations out there converted to Vista? And that, my friends, is in a perfect world where everyone who could do so ran out and got a copy.

This scenario would have some (short term, I think) impact on the number of available machines to infect/take-over, but that doesn't make the Internet a safer place. If you take the analogy of the Internet being like New York city, where there are a lot of wonderful places to visit, and a lot of dark corner where bad people lurk, then this would be akin to installing flood-lights in 40-50% of the New York city alleyways. Does it make New York safer? No. In fact, it means some areas are worse off than they ever were because it forces the bad-guys to congregate (and there's strength in numbers). Same holds true for the Internet.

For the Internet, this probably translates into meaning that primarily 1st-world countries, who can afford the software and/or hardware upgrade, will be "safer"; while poorer countries will be the congregation points for Internet bad-guys.

But it can still be argued that this pushes all of our problems to areas that we can more heavily monitor. In theory, this is true.

Are there any plans to do this monitoring? No. Does "monitoring" this activity make us any safer? No. Can we stop people in these places from sending data to us? Not in our new global economy.

And don't think for a second that Vista is without holes. When XP came out, it took a few months for the first 0Day exploit to hit - and the same will be true for Vista. In fact, we'll never truely know when the first 0Day for Vista comes out because the bad guys are a lot more about stealth these days than ever. So, while we're carefully watching these "dark areas" of the Internet, the domestic security space will slowly revert back to the same state it's in today.

I'm not slagging MS here, you'll note I mentioned several other OSes above - but MS is simply the most prevalent and, as a result of prevalence, the most frequently targetted.

My suggestion to those reading this blog: don't upgrade you personal PC to Vista because of security. If you like some of the features (like multimedia) that Vista offers, go for it. Otherwise, stay with XP (or find someone who has purchased Vista and get their XP license from them). It will probably run faster on your machine anyway.

Whichever one you do, here are 6 easy things to make sure you're not easy-pickings for the bad guys:
1. Make sure to patch aggressively - enable automatic updates for all software (and operating systems) that have the option.
2. Make sure you have a good AV that hasn't expired and update it at least weekly (I update dialy).
3. Enable a personal firewall (hardware-based routers are good too) and disallow all inbound traffic unless you really know what you're doing.
4. Install buffer-overflow protection software (eEye has a free personal edition of Blink)
5. If you're going to spend money, spend it on a good anti-spyware software - some of them even have buffer overflow protection software, and personal firewall built in.
6. Make sure to auto-update (or subscribe to mailing lists) for all the above (yes, I know this was point #1, but it's worth re-iterating).

If you have the know-how, also make sure to subscribe to vendor patch notices for all internet-enabled software (and browser tie-ins) including multimedia software (QuickTime, iTunes, DivX, WinAmp, etc..), Adobe Acrobat, your mail reader, browser, Peer2Peer software, etc. Again, auto-update is your friend if it's built-in.

This won't make you indestructible, but it's "secure enough" for most people and makes you a harder target. Bad guys usually go after the low-handing fruit.

rG0d

GEEK Speek

Tuesday, January 9, 2007

Dealing with Logs (Part 1): What Vendors don't tell you about Log Management

Thursday, January 4, 2007

Month of Apple Bugs

So long Superman... Vista will save the world

Wednesday, January 3, 2007

Gotta start somewhere

Blog Archive

About Me

Just a bunch of links