Open Computing: Standards

Showing posts with label Standards. Show all posts

Tuesday, July 27, 2010

Response to .DOC Attachment

What do you do when you receive an email file attachment in Microsoft's .DOC format? Sure, we could open it in OpenOffice.org and go about our business, but there's no guarantee that this will remain true for future versions of .DOC. Use of proprietary formats are a danger to interoperability and to future innovation. One good thing to do is to reply to the email and explain why they should send the attachment in a standard format. I've seen several canned email responses in the past, but most are too brief to explain the issue, or too harsh in their wording to satisfy my tastes. I was prompted to write a response after receiving such an email attachment today, and I have combined some of the best ideas from elsewhere and added some of my own. Here is what I would recommend:

The document you have sent was not saved in an accepted format for Internet mail.

It was saved in a proprietary format that is unreadable on several types of computers because the method for decoding the document is kept secret by Microsoft and is purposefully changed with each new release of Microsoft Word so that existing users of Microsoft Word will be forced to pay for expensive upgrades in order to continue to read Word Documents sent by others. For example, in 2010, Microsoft Office Home & Business 2010 was priced at $279.99, and Microsoft Office Professional 2010 was priced at $499.99.

Recent versions of Word have started using a newer, patented OOXML format. In many countries, it is actually illegal for other products to decipher this format. This is a lock-in technique used by Microsoft to maintain their monopoly on Office software, and by extension, their monopoly on the operating system market, since they have not released a cross-platform version of Microsoft Office compatible with other operating systems.

In most cases, the size of the file saved in Microsoft's secret, proprietary format is also substantially larger than a standards compliant file containing the same information and the same formatting.

It is also important to note that Microsoft Word documents are often infected with viruses. Excel, Access, and Power Point files are also vulnerable to infection. This potential for infection is largely due to the Macro language and the "Visual Basic for Applications" language which are built into the format to provide powerful programming capabilities. While powerful, these features were not protected with proper security precautions, and the majority of users do not actually use these features or even know that they exist.

What to do instead:

If you continue to use Microsoft Word, please have the courtesy to “Save As” one of the following formats: ODT (if available), DOS Text, HTML, or Portable Document Format (PDF) and after saving, send the resulting file as an attachment.

Alternatively, you could use a product such as AbiWord, KOffice, Google Docs, NeoOffice, or OpenOffice.org that allows you to save your document in the Open Document Text Format (ODT), which is an ISO/IEC International Standard, and is supported by such notable companies as Apple, Adobe, Google, IBM, Intel, Nokia, Novell, and Sun Microsystems. If you don't have one of these programs, I would recommend OpenOffice.org, which may be downloaded free of charge and used for any purpose, personal or commercial.

A third option is to simply type your message directly into mail (instead of typing into Microsoft Outlook or Microsoft Word) so that you won't need to use an attachment at all.

In the highly unlikely event that your document cannot be converted to an open, non-proprietary format, consider printing it and mailing it by post, or scanning it, and sending it in a standard graphic format such as PNG or JPG.

Thank you.

Tuesday, July 13, 2010

Open Database Conventions

I've been using a set of database table and field naming conventions and related standards that have been gradually adopted and refined over the last 10 years. I decided today that it would be a good idea to share these with others, particularly because I decided to bring Jim from UVOG in on a database project I'm starting on. I realized that sharing them with others on an even wider scale might be a good thing on the off-chance that, if someone else out there adopts them, and if our code ever ends up crossing paths, we will gain greater interoperability and understanding on account of using the same techniques.

Feel free to take some or all of these ideas and implement them in your own projects. I am willing to hear feedback and incorporate it back into the standard if you think there's a better way to do something that I've mentioned.

Some of these conventions will apply specifically to MySQL, while others will apply to the greater spectrum of databases. My experience spans MS Access, MS SQL Server, dBase, FoxPro, the Apollo Database Engine, and MySQL, with some very light experience in SQLite and PostgreSQL, but the bulk of my work today happens in MySQL.

Naming the Database

The name of the database itself is the least of my concerns. I would just recommend that in code that operates with or upon the database, this value should be easily configurable along with the hostname, port, and authentication credentials, so that any database could be selected.

Naming Tables

The name of a table should be a succinct name descriptive of what is to be represented by a single record in the table. If each record in the table represents an account, for example, name the table "account" (in the singular, all lowercase.) The reason we use singular forms is so that dot notation naming of fields makes sense when read out loud, for example, account.balance. There are many reasons we use all lowercase. Most production Database servers are case sensitive, and by using all lowercase we can eliminate the possibility of an error relating to case. It also improves code readability when SQL statements are written out with keywords and function names capitalized and with table and field names in lowercase.

There is a special case of naming in the event of a junction table. (A junction table is used to create a many-to-many relationship between two tables, and consists of nothing more than an id field, and foreign keys for the two tables whose records it is tying together.) The name of a junction table should be the name of the other two tables mashed together, with the name of the greater entity first. For example, if I have a company table and an customer table, and customers are potentially associated with multiple companies, I would name the junction table companyemployee, and it would contain id, kcompany, and kemployee, in that order. If there is a toss-up in determining which entity is "greater," place the two table names in alphabetical order.

Naming Fields

Always name your fields as though they are going to be used in dot notation. We want account.balance, not account.accountbalance, so do not redundantly repeat the table name within the field name. Avoid abbreviations that seem like a hack. For example, number is better than "no" or "num." Widely used abbreviations like ipaddress, ssn, or id are ok--be particularly careful to avoid expanding acronyms that are not expanded in common speech.

The Primary Key

In order to be properly maintained, every record needs to be addressable by a unique identifier. This identifier should not be meaningful in any real-world way for reasons that are explained in depth elsewhere, but I will give one brief example. If you were making an employee database, you might be tempted to use a person's SSN as the primary key for the employee table. Three years after the program is written, your company might hire someone who is in the country on a work visa, and therefore does not have an SSN, and now your system is broken. One will follow this line of thinking out and conclude that the key should be completely arbitrary and meaningless so that no such conflicting situation may ever occur. Additionally, for the sake of consistency, the primary key field in each table you make should be of the same name and type. I recommend naming the first field of each table "id", setting it to "auto_increment," and setting it to become the PRIMARY KEY index.

The value of this id field should be used internally for programming purposes and for describing relations between tables, but it should never be shown prominently to the user. If it is shown, it is merely for debugging or reference purposes. The number in this field, once generated, should not be changeable by the user, is not guaranteed to be sequential with the records around it, and is not guaranteed to fall into any range of values more specific than those allowed by an int (11) field.

Foreign Keys

A foreign key can be thought of as a "pointer" referring from the current record out to another record in another table (or possibly another record in the same table.) All foreign keys should be prefixed with the letter "k" (meaning key), and should either be named "parent" if the key is referring back to the same table for the purpose of producing a hierarchical system, or if referring to another table, it should be given the name of the table into which it points. If more than one key in a record must point to the same table for different purposes, the purpose should be appended to the name followed by an underscore. For example, kaddress_from and kaddress_to would be pointers to two records in the table named "address." The relation to one of them is described as "from" and the other as "to." This is the only time an underscore is permitted to be used in a field name, and the purpose of the underscore is to make the table name unambiguous.

When possible, try to put keys to parent records or records which are considered to "own" the current record near the top of the field listing, following just after id.

If these standards are followed for the primary key and foreign key fields, I could look up any reference using the following technique:

Given a field whose name begins with "k": Take everything in the field name following the "k" up to the end of the name, or up to (but not including) the first encountered underscore "_" character, and consider this to be the "lookup table name," if such a table exists. Retrieve the data by issuing a query in the form of:

SELECT * FROM [lookup table name] WHERE id = [value of the field]

Choosing Appropriate Data Types for Fields

Always choose the most restrictive type that can safely store the data, including valid values that you might not anticipate.

KEY FIELDS

For id fields or foreign key fields, use INT (11)

For boolean (yes/no or true/false) fields, use TINYINT (1) A value of zero means false, and any nonzero value means true. You can distinguish between individual nonzero values for record housekeeping if you wish, as long as others reading the data as merely true or false will not be under a misconception from not making such a distinction.

CURRENCY

For currency values less than a million dollars, use DECIMAL (8, 2) --- increase the first size by a sufficient number of digits if you need larger currency values.

NUMERIC DATA

For numeric data involving whole numbers or integers only, use INT (11) --- unless the size is astronomical, in which case, you will need to upgrade to a larger type.

For weights or non-integer quantities less than ten thousand, use DECIMAL (8, 4) --- increase the first size if values beyond ten thousand are needed, or both sizes if more precision than four decimal positions is needed. Four decimal places was chosen so that a number like 12.05% could be fully represented as 0.1205.

DATES AND TIMES

For date or time values or stamps where the "time part" has meaning (particularly in sorting), use the DATETIME type. A value of "0000-00-00 00:00:00" means not entered or unknown.

For date values where the time is irrelevant, use the DATE type. A value of "0000-00-00" means not entered or unknown.

MULTIPLE CHOICE

For multiple choice values: If there are a discrete number of choices presently and in the near future, few choices (about 8 or less), and little or no benefit would be had by allowing the user to customize the set of choices, an enumerated value may be used. Example: alignment ENUM ('left', 'right', 'center') DEFAULT 'left'

For multiple choice values where the choices are numerous or user defined or potentially user managed, a foreign key and a lookup table should be used instead of an enum.

STRINGS OF TEXT (NO MULTI-LINE VALUES)

For any string data: If the options are very well established, choose the smallest size category from the table below that can hold all possible values, or for data that is free-form, choose one category larger than you think is really necessary. I will give some example with each size category:

VARCHAR (4) -- A standard name suffix like Jr., Sr., II., III.

VARCHAR (8) -- Data known for certain to be less than or equal to 8 characters.

VARCHAR (16) -- A zip code. (Minimum to store a US zip code is presently 9 digits plus one hyphen.)

VARCHAR (24) -- A phone number, with decorations. The following number is 23 characters long, including spaces: +1 (541) 375-0448 x8888

VARCHAR (32) -- Product codes, SKU numbers, or generated numbers that are guaranteed to be less than or equal to 32 characters.

VARCHAR (48) -- A first(given) name by itself, or a last name(surname) by itself.

VARCHAR (64) -- A full name field (first and last name together.)

VARCHAR (128) -- An email address.

VARCHAR (248) -- A web address.

PARAGRAPHS OF TEXT, NOTES, MESSAGE CONTENT, CODE SNIPPETS

For anything multi-line, or potentially longer than 248 characters, use the TEXT type. If the data or text will be more than a couple of kilobytes, investigate the LONGTEXT type.

IMAGES OR BINARY FILES

I don't like big files taking up space in my InnoDB tables and slowing down my replication server. Instead, I store the original filename only (and only as a convenience) in a VARCHAR field, and then save the file separately named based upon the table name and record number the file is associated with. If I need to replicate these files, I let rsync take care of that. If the "attached" files MUST be synchronized with the other data in the record at any given moment, then it might be acceptable to resort to BLOB fields.

Order of Fields

When creating tables, the order of the fields really doesn't matter to the database engine. The sequence, therefore, should be chosen for purposes of clarity in documentation.

Well, that's all I have for now. Have at it, tear it up, and send me suggestions.

Thursday, March 13, 2008

Desktop Diet!

A great presentation by Rasterman, the head coder of Enlightenment.

I tried out E17 the other day, and it's not as good as I hoped it would be. Ah well!

Saturday, February 24, 2007

Watching Television on the Internet

After making the switch to Desktop Linux, I have finally got around to figuring out what to do for entertainment. Getting DVD media to play was a breeze, perhaps because it seems to be the first thing anyone asks after getting their OS up and running. Of course, we should do our best to push for open formats to replace DVD, but in the real world a little bit of "closed media" is likely to keep people from actively using Open Source everything-else. If a computer can't fulfill the basic human need for entertainment, it won't last long.

My next place to turn was to Television programs. There are a few series that I have enjoyed in the past, and I just gave away my 23" television to free up some space in the apartmet, leaving me with a 13" TV that I have decided only to use for 8-bit Nintendo. Television over the Internet would be a sweet alternative. Some shows are available for "free" viewing on their official websites, others are not.

The shows I think I'm interested in viewing right now are: 24 (Fox), LOST (ABC), The O.C. (Fox), The Office (NBC), Prison Break (Fox) and Veronica Mars (CW).

I will chronicle my level of success in this endeavor here:

I asked my brother "So how does one watch TV on the Internet?" and off the top of his head he referred me to ABC's website. I want to say "well done" to ABC for their episodes available online. They have done a decent job and it seems like Flash player is all that is required, making it a trivial matter to watch their content in Firefox on Linux. This takes care of my need for LOST, and opens up the possibility to some new programs which at the moment don't look very compelling for me, but may be possibilities. The first episode I watched had the same commercial played in every single break. I wonder about the effectiveness of this, and would have rather seen three separate commercials, even if they were all for the same product, simply because watching the same thing three times within an hour is not more productive than seeing it once.

Next I turned to Fox. Obviously I'm a fan of Fox, I have my eye on three of their shows, and I have purchased more Fox DVD box sets than any other show. In fact, it may well be true that every DVD box set I own is produced by Fox. The programs they put out seem to be of a high quality. I looked up the web site for Fox, clicked on Prison Break, and saw an inviting link that I may be able to watch full episodes for free. It directed me to a MySpace page belonging to Fox, which was a bit surprising: I thought MySpace was home to teenagers, not corporations. I observed right away that the site wasn't as clean and streamlined as ABC's, but then I got confronted with a major problem:

We're sorry, but only the following operating systems are supported at this time:

Microsoft Windows 2000/XP (not Vista)
(Intel) Apple Macintosh OS X or later
(PPC) Apple Macintosh OS X or later

Please check back soon for support for other operating systems.

It seems that Fox also has its programs available on iTunes, but I hear they will only play on the newer iPod devices, and the free iTunes software is only supported on Windows and Macintosh. I read an interesting article about "DVD-Jon" and some of his buddies writing some software that would let you purchase songs at the iTunes store using Linux, but I think it only worked for music, and furthermore, it was abandoned and does not work with the current iTunes Store. I would pay Fox directly to watch these shows if I could. So it sounds like I may have to wait until the current seasons of 24, The O.C., and Prison Break are released on DVD (if they are.)

NBC doesn't offer any apparent way to play full episodes from their websites, and if they are offering them for viewing in other venues they are not making this obvious by advertising them in any way. They do seem to offer short video clips and deleted scenes, etc., but I'm not one for spoilers. I will investigate at a later time to see if they offer them through another outlet, but except for my academic curiosity to investigate this at a later time, they would have lost my viewership.

The CW has a full episode feature, which links to this page, but the center area of the page just remains a white box and selecting a series from the right-hand list doesn't seem to do a thing. Right clicking on the white reveals that it is a Flash player. I don't know why it is broken, but I will presume they aren't supporting the Linux version of Flash (I don't know how that is possible), but I will give it another shot later in case they are just temporarily down.

So it looks like ABC wins. Stay tuned for an upcoming article on why even their solution isn't the best choice and why Television producers need to make some decisions in the upcoming years.

Monday, February 5, 2007

Why do we sudo in Ubuntu, and who is Charlie Root?

It is a tradition in Linux and other Unix-like operating systems to have a superuser account named root. root is named such because "he" has full access to everything in the filesystem from the root directory "/" on down the line (whereas most ordinary users only have limited access with full access in their home directory alone.) root has a home directory too, named after himself, found at /root, not to be confused with "/" which is also called the root directory. :-) You'll notice that on most systems his home directory is not in the same place as the other user's home directories (usually /home/yourname).

In some Linux distributions, you log in to the root account frequently to perform administrative tasks on the machine, but in Ubuntu and a few other modern distros, this is not the case. Instead, we use a tool called sudo.

sudo means "switch user and do." There is another command called su (switch user), which can be used to switch to another user and open a shell, which is great if you know their password (which su prompts for), but on Ubuntu you aren't supposed to know the root password. That's right, that is how it is designed, and you should not change it unless you have a really good idea of what you are doing. The reason you don't know the root password on Ubuntu is that you don't need it at all because the admin group to which all "Administrator" users belong is listed in the sudoers list (/etc/sudoers) If I needed to change the owner of a file (hypothetically called myfile), and I need root access in order to do this, I would type:

sudo chown jeffd myfile

I will be prompted for a password. This is not the root password, but your own password. It is requested as a security precaution before proceeding to run the program with root access. After entering the password, the command "chown jeffd myfile" is executed as root, after which control returns to your ordinary user (subsequent commands are not executed as root, although if you sudo again within about five minutes, the Ubuntu defaults are set to not ask you for the password again.)

So what if you need to do several commands as root? In all the other Linuces, you just log into a shell as root. Well, if you really want to do this, you can either type sudo bash (to open the bash shell) or sudo su root (to open root's default shell). Either way, you will enter your own password and then be presented with a root shell prompt. When you are done with the shell prompt, type exit to return to your own shell. Do not forget this step! It is never a good idea to run applications or compile things as root. You should perform these tasks as your own user. In fact, some newer Linux software is designed so that it will actually refuse to compile or run as root, giving you a gentle slap on the wrist.

Again, please do not change the root password. Doing this will only weaken the security of your system, and make it easier for you to get into bad administrative habits.

If you are using gnome and would like to launch a graphical application as root you use gksudo instead of sudo, but it works the same way aside from its name and the way that the password prompt appears in a window.

Oh, one last thing: I mentioned Charlie Root in the subject. This is the full name sometimes given to the root user, when a first and last name are required (for example, in the address headers of administrative emails automatically generated by the system and sent to the root account.) I think this name is only used on BSD Unix systems, (FreeBSD, OpenBSD, and NetBSD), so you may not ever see it in Ubuntu, but because I was familiar with administering FreeBSD servers, it is still the name I affectionately call the root user by. Here's some more Charlie Root trivia.

Friday, January 26, 2007

Using Standard Dates

Working on a project recently, I received feedback from another team member that "all date formats, whether for entry or exhibit, should be in a readable & usable format, i.e. mm/dd/yyyy." Although offense was probably not intended, I was extremely disappointed and even a bit insulted by this request, as I had taken care to represent date formats in a consistent, readable, and usable way throughout the project, or at least to choose tools that did so by default.

While not strictly an "Open Software" issue, I think usability issues and internationalization issues are often in the interests of Open Software advocates, so I thought I'd take a bit of time to discuss the ISO Date Format, why it is important, why and where you should consider using it and teaching other people about it.

For those who have not been officially introduced to it yet, the international standard (iso) date format is:

YYYY-MM-DD

YYYY is the year on the Gregorian calendar, MM is the month from 01 to 12, and DD is the day of the month from 01 to 31.

For example, 2007-01-26 represents the 26th day of January, 2007.

Prior to standardization, common ways to represent this date are numerous, including:

1/26/07
26/1/07
07/1/26
26.1.2007
26-JAN-2007
26-January-2007

Each of these included variants with either slashes, hyphens, or periods. Sometimes the usage of a particular one of these formats focused around a geographic location, such as the M/D/YY which was somewhat prevalent in the United States, and other times they would focus around groups with special interest or profession.

The trouble with nonstandard dates is that they cannot be interpreted in a consistent manner. It becomes especially problematic when the day of the month is less than 12, and the month is rendered numerically, making the month and day field entirely indistinguishable.

[Tip: In addition to being more legible, the ISO format also has the benefit of being able to be sorted chronologically in its raw format by any alphanumeric sorter, such as in a spreadsheet.]

With the advent of the Internet era, we now have instant communication around the globe. The ISO date format is an absolute must in order to achieve effective and expected communication in regards to dates.

There are still some places where it would be appropriate to use other date formats, for example:

In a localized or personalized formal invitation such as a wedding or graduation announcement where a verbose format such as "Tuesday, the first of January, two thousand eight" would be appropriate.
In any educational material instructed someone how to interpret legacy date formats, or for use in examples to teach someone how to read ISO date format compared to their historic format.
In a dynamic environment (such as a preferences page) where the reader (not the publisher) has specifically requested their date to be presented in a non-standard format.
Interoperation with legacy applications or that are unable to be upgraded at the time. Even in this case, ISO should also be supported so that you do not become the reason the other application is unable to be upgraded, and visible representations outside of these legacy communications should be translated back into the standard format.

Anything other than these examples only serves to create confusion as the ISO date format comes into greater use.

When the opportunity presents itself you should not be shy, but be prepared to teach others about using the ISO date format, especially if you or they are in an industry where they transact business or correspondence on the Internet. Only a few moments of explanation can save them much confusion or embarrassment that could result from misinterpreted dates or missed appointments.

For more information on ISO 8601 date format see: