Word Count Blog

August 3, 2009

Word Count and Frequency Count Are Not the Same

Filed under: tips and tricks - Thomas Vysokos @ 12:07 pm

As Google conquered the search market worldwide, search engine optimization (SEO) became a core activity for many corporate webmasters. Over the last 10 years, lots of companies have emerged that help businesses climb to the top of the search results.

But as SEO developed, people started to mix up two essential content parameters: word count and frequency count. That’s pretty strange, since confusing them is like confusing time and distance in the speed formula.

Word count is the total number of meaningful words (excluding tags) in a piece of text.

Frequency count is the number of times a given word or phrase appears in a piece of text.

You can find an example of a classic frequency counter here and a word count tool here.

At first glance you may think that a frequency counter beats word count software in functionality, because it reports both the total word count and per-word statistics. But if you feed tagged text into a frequency count tool, you will be disappointed to find that all the tags are also included in the word and frequency counts.

So if you need to count the meaningful words (excluding all the tags) to measure the volume of a job in most text and even graphics formats, you need a word count tool. However, if you are writing SEO-optimized content and want to know whether you have put enough keywords into it, you’ll need a frequency count web app.
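The difference between the two metrics can be sketched in a few lines of shell. This is only an illustration, not how any of the tools mentioned above work: the sample text and the helper names strip_tags, word_count and frequency_of are made up for this example, and the tag removal is a naive pattern that assumes every tag opens and closes on the same line.

```shell
#!/bin/sh
# Hypothetical sample: a short tagged snippet (made up for illustration).
sample='<p class="intro">Word count is not <b>frequency</b> count.</p>'

# Naive tag removal: replace everything between < and > with a space.
strip_tags() {
    sed 's/<[^>]*>/ /g'
}

# Word count: total number of space-separated words, tags excluded.
word_count() {
    strip_tags | wc -w | tr -d ' '
}

# Frequency count: how often one given word occurs (case-insensitive).
frequency_of() {
    strip_tags |
        tr '[:upper:]' '[:lower:]' |   # fold case
        tr -cs '[:alpha:]' '\n' |      # put one word per line
        grep -c -x "$1"                # count exact matches
}

printf '%s\n' "$sample" | wc -w | tr -d ' '     # 7: tag fragments are counted too
printf '%s\n' "$sample" | word_count            # 6: tags excluded
printf '%s\n' "$sample" | frequency_of count    # 2: occurrences of "count"
```

The first command shows the problem described above: without stripping, pieces of the markup inflate the total.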

July 28, 2009

Word Count Journal - Writer’s Discipline via Word Count

Filed under: tips and tricks - Thomas Vysokos @ 6:37 am

I do love sci-tech and all the new opportunities driven by progress. The Web helps art and literature evolve. Some 30 years ago every writer was isolated during the creative process. If authors wanted to cooperate, whether to learn from each other or to write a book together, they had to gather and work in a single location.

But with interactive blogging the situation changed. Both private and publicly shared blogs started providing common ground for cooperation between writers at all levels. And how is all this related to word count?

Well, everybody who has tried writing something knows that having a general idea is one thing and writing at least an A4 page a day is another. All writers, both beginners and masters, need writing discipline. So a group of guys who “wants to improve their writing and like building web apps” built a perfect discipline training tool called Word Count Journal.

The idea is simple: on the first day you write 1 word, on the second day 2, and so on, so that at the end of the year you have 66,795 words (if you write exactly the required minimum and no more). As the guys say: “Little by little, through the power of series, the total of your written words will add up to more words than contained in the average novel.”
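The yearly total is simply the sum 1 + 2 + … + 365, so it is easy to check, either with a loop or with the closed form n(n + 1)/2. A quick sketch, assuming a 365-day year (the variable names are mine):

```shell
#!/bin/sh
# Day N requires N words; add up the daily quotas for a 365-day year.
days=365
total=0
day=1
while [ "$day" -le "$days" ]; do
    total=$((total + day))
    day=$((day + 1))
done
echo "$total"                       # 66795

# The same result from the closed form n * (n + 1) / 2.
closed_form=$((days * (days + 1) / 2))
echo "$closed_form"                 # 66795
```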

I do like the idea. Just think it over once again: a daily word count target that keeps growing trains you to write a page of unique content per day by the end of the year. By simply following the rule of writing one more word each day, you can become a commercial blogger, journalist or prominent writer within the next 365 days.

I made my start here and today have exceeded the planned word count 21 times. Feel free to join the initiative and leave a link to your own journal in the comments to this post, so we can exchange ideas and become more prominent writers and word counters :-)

July 20, 2009

Word Count in Unix

Filed under: tips and tricks - Thomas Vysokos @ 5:35 pm

All my previous posts were about Windows-based word count software, but I thought it would be pretty unfair to forget about millions of UNIX users, so today we have a UNIX word count session.

Of course, my UNIX experience is not that extensive (it amounts to testing Fedora Core in a virtual machine), so below you will mostly find reprints of word count tips published elsewhere on the Internet.

Word count using the “wc” command (taken from Wiki)

“wc” (short for word count) is a command in Unix-like operating systems. The program reads either standard input or a list of files and generates one or more of the following statistics: number of bytes, number of words, and number of lines (specifically, the number of newline characters). If a list of files is provided, both individual file and total statistics follow.

This is how to use the “wc” command:
• wc -l print the line count
• wc -c print the byte count
• wc -m print the character count
• wc -L print the length of the longest line (an extension, not part of POSIX wc)
• wc -w print the word count

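Well, for languages that don’t put spaces between words, the character count (wc -m) will be very much appreciated by users from China, Japan and Thailand. Note that the byte count (wc -c) is not the same thing: in UTF-8 a single Chinese, Japanese or Thai character typically takes three bytes.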
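Here is a minimal sketch of these options in action. The temporary sample file and its contents are made up for illustration, and input is fed through a redirect so that wc prints only the number. The result of wc -m on non-ASCII text depends on the locale, so no output is shown for that case.

```shell
#!/bin/sh
# Create a small two-line sample file (contents are made up).
sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"

wc -l < "$sample"    # lines: 2
wc -w < "$sample"    # words: 5
wc -c < "$sample"    # bytes: 24
wc -m < "$sample"    # characters: same as the byte count for plain ASCII

# Keep the numbers in variables (tr strips the padding some wc versions add).
lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')
rm -f "$sample"

# Three Chinese characters stored as UTF-8 occupy nine bytes.
utf8_bytes=$(printf '这是鸟' | wc -c | tr -d ' ')
echo "$utf8_bytes"   # 9
```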

Word count without using the “wc” command (taken from computing.net)

After user Gburg asked how to run a loop that counts each word in a file individually, without using the wc command, the following reply was posted:

You can use the script below if your words are space-separated.

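#!/bin/ksh
# Count the space-separated words in the file named by the first argument.
set -f                                  # no filename expansion while splitting
typeset -i I=0                          # integer word counter
while read -r line || [ -n "$line" ]    # also catch a last line without newline
do
    for wort in $line                   # unquoted on purpose: split into words
    do
        (( I += 1 ))
    done
done < "$1"
echo $I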

Many thanks to user Frank from everybody except users from China, Japan and Thailand, for whom there should be another script :-)

That's basically all for UNIX word count today.

P.S. Fans of the open source platforms will definitely like this video. Even such a Windows guy as me liked the trick :-)

July 14, 2009

Interview With A Word Count Software Creator

Filed under: industry news - Thomas Vysokos @ 1:03 pm

I work at the same company as Dmitry Chaplay, chief developer of the word count software AnyCount. Recently Dmitry and the R&D team released AnyCount 7.0 and made a kind of breakthrough in the word count experience by introducing word count for image files (BMP, GIF, JPG, PNG).

This is a remarkable piece of industry news, since the guys are the first to put an OCR feature into a word count tool. So I took my favorite mug and went to Starport (as we call AIT’s R&D center) for a coffee with Dmitry.

T: Dima, what’s the idea behind AnyCount?

D: It’s pretty simple and obvious – just customizable word, character or line count.

T: And nothing more?

D: Did you expect a text editor in word count software? Multipurpose tools show average performance across a wide variety of tasks and provide extensive functionality for only one or two of them. Look at MS Word: it is a cool text editor, but when it comes to word count statistics, AnyCount is more accurate.

T: AIT developers were the first to introduce word count for image files (BMP, GIF, JPG, PNG). Was that a kind of strategic vision from the marketing point of view?

D: No, actually it is already a need. Our support department gets a pile of requests from translation agencies like this: “We’ve got a scanned contract, how can we count the word statistics to quote the client?” So we acquired an OCR engine, optimized it and incorporated it into AnyCount.

T: That’s easy to say, but in fact there must be a great deal of real work behind it :-) A few days ago I tried out a free way to count words in graphic files and got my result without paying a penny. So what do people pay for with OCR in AnyCount?

D: Hm, if I tell you that they pay for comfort, you won’t believe me, huh?

T: Me? Definitely not :-)

D: Actually, our OCR solution supports 20+ languages, while most free tools handle only 7 (such as English, Spanish and Portuguese; the remaining 4 vary from tool to tool). So your free OCR method won’t work for languages written in Cyrillic.

T: You almost made me believe that comfort matters :-) But what about the future? You know, a successful product always needs to be two steps ahead of user expectations. Do you already have any ideas for improving AnyCount?

D: We have just released a major update :-) It’s not as simple as “bang! Idea now, new feature tomorrow”. We need to analyze user requests and target the new test formats. But I promise that as soon as we find something worth improving, we’ll bring it to life.

The dinner break was almost over, so I had to run back to Babylon (the AIT translation division office) and get back to my PM duties. Stay in touch - more word count tips and news are on their way.

July 10, 2009

How to Count Word Statistics in an Image File for Free

Filed under: tips and tricks - Thomas Vysokos @ 12:35 pm

Let’s imagine that you are a freelance translator and your customer asks you to translate a contract. You eagerly agree and get… a scanned copy of the document. That’s fine if you have agreed beforehand that scan jobs are paid on a per-hour basis. But what if not? What if your customer demands that the job be done on a per-word basis? And, even worse, asks you to send a quote immediately?

Well, where there is a will, there is a way. Let’s get a free OCR tool and tackle the problem.

1. After googling for free OCR tools I chose SimpleOCR. It is absolutely free for typed text and can be downloaded here (direct link to the EXE file).

2. Double-click the file and proceed with the installation until you see this screen.

choosing simple ocr free mode

3. Click “Machine print” to access the free feature (see screenshot above).

4. Click “Select” to proceed to the OCR features.

how to proceed for OCR

5. Click the Process button to load the image.

how to load the files into SimpleOCR

Note: this is a sample screenshot made from a scan.

test english screenshot for OCR

6. Click the “Convert to the text” button to start the OCR.

converting image to the text

7. Edit the garbled and unrecognized words to get a more accurate word count (the more stray spaces you have, the more “words” you are likely to get in the statistics later).

using suggestion tool to fix the document

8. Export the result into a DOC file.

saving the ocred text as a doc

9. After you open the saved DOC you will see a surprise… There is an image in the document and the text appears twice (the original OCR output and the edited version). Delete the duplicate text and the picture.

deleting unnecessary data to get the correct statistics

10. Get some statistics using the MS Word built-in tool.

ms word stats after the ocr

If this process seems a bit complicated or time-consuming, you can submit your file to the free online OCR at http://www.free-ocr.com/ (available only for English, German, French, Italian, Dutch and Spanish). But before using anything free and web-based, think twice about privacy.

Of course, this is just a quick, temporary, one-time solution. If you need a fast and extensive word count (or any other statistics, like character and line count), it is better to use professional word count software (accuracy means budget here). Moreover, a commercial word count tool will provide you with accurate word count statistics even for Cyrillic and Scandinavian languages, which is far more than the 6 or 7 languages offered by free OCR tools.

July 8, 2009

Top 5 Professions That Need Word Count

Filed under: more than just history - Thomas Vysokos @ 3:55 pm

Quite a number of people in the world are paid based on how many words of content they produce per day. Let’s have a look at the professions where people typically need word count software to get their wages calculated accurately.

1. Translators. Everyone involved in the translation process is a word count guru: trimming the actual word count means either saving some budget or squeezing out more profit. However, localizers use not the pure word count but a “weighted word count” that is closely tied to translation memory tools.

2. Medical transcriptionists. MTs have to digitize handwritten or tape-recorded medical data. The distinctive feature of this profession is that one has to listen and type at the same time. A skilled medical transcriptionist is a valued worker who is usually paid based on the word count.

3. Commercial bloggers. These guys “blog for fun and profit”. Their task is to keep a corporate, news or any other blog popular (i.e. filled with interesting content). They are paid per word of generated content: pure word counters.

4. Freelance journalists. They are very much like commercial bloggers, but they sell their content to the “real media” (unlike commercial bloggers, they know nothing about SEO). Sometimes they are paid based on the actual word count, but in most cases their wages are based on estimates (more details here).

5. Writers. Yes, the folks who write big books that are printed on white paper and that you can buy at Barnes & Noble. Of course, there are less successful guys whose books you are unlikely to find even after digging through Amazon all day long. But both the successful and the not very successful ones are paid on a word count basis (you can find more on the topic in the History of Word Count Metrics).

If you know any other profession where people are paid on a per-word, per-character or per-line basis, leave your comment and it will be included in an update to this article :-)

June 28, 2009

Word Count in Oriental Languages

Today you’ll learn about the standards and peculiarities of word count in oriental languages. I made up my mind to write about them separately, since they differ greatly from the others.

Chinese. The writing unit in Chinese is the character (sometimes loosely called a hieroglyph). The main difficulty for word count is that characters are not separated by spaces. This means that the Chinese sentence «这是鸟» (This is a bird, 3 words) is counted as a single word if the word count tool splits words on spaces (there was even a related query on the WordPress support page).

But if you think that the 3 characters «工业化» also form a sentence, you are wrong: together they are just the single word “industrialization”. So the most logical way to evaluate text volume in Chinese is the character count. E.g. a 1000-word English text translated into Chinese will be 1300-1800 characters long. You may read more about the English->Chinese word count ratio here.

Japanese. Japanese is written in a mixture of three main systems: kanji (Chinese characters) and two syllabaries, hiragana and katakana. This makes word count even more complicated than in Chinese. So the usual counting scheme in Japanese is based on characters, excluding spaces, which seems quite logical.

Korean. Modern Korean is written with spaces between words (unlike Chinese or Japanese). Traditionally, Korean was written in columns from top to bottom, right to left, but it is now usually written in rows from left to right, top to bottom. This means that the traditional word count scheme, where words are delimited by spaces, can be applied.

Other. Apart from those mentioned above, the most common Asian language written without spaces between words is Thai (Lao, Khmer and Burmese behave the same way), so job estimates are based on the character count. The rest, including the Indian languages (Bengali, Gujarati, Marathi, Urdu, Oriya, Tamil etc.), Indonesian, Farsi, Arabic, Turkish and Hebrew, use spacing, which means that words can be easily counted with a word count tool.

To sum up. Languages that don’t use spacing and require a character count include Chinese, Japanese and Thai. The rest of the oriental languages use spacing and get a word count instead of a character count.
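The two counting schemes can be compared with a short shell sketch. It is an illustration under stated assumptions rather than a production tool: the function names char_count and space_word_count are mine, the input is assumed to be UTF-8, and only ASCII whitespace is stripped (the full-width ideographic space, for example, would still be counted as a character). To stay independent of the locale, characters are counted by dropping UTF-8 continuation bytes and counting what remains.

```shell
#!/bin/sh
# Character count for UTF-8 text, excluding ASCII whitespace.
# Every UTF-8 character has exactly one byte outside the range 0x80-0xBF,
# so deleting those continuation bytes leaves one byte per character.
char_count() {
    LC_ALL=C tr -d ' \t\n\r' | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# Space-based word count, suitable for languages that separate words.
space_word_count() {
    awk '{ n += NF } END { print n + 0 }'
}

printf '%s\n' '这是鸟' | space_word_count          # 1: no spaces, one "word"
printf '%s\n' '这是鸟' | char_count                # 3 characters
printf '%s\n' '工业化' | char_count                # 3 characters, one real word
printf '%s\n' 'this is a bird' | space_word_count  # 4
```

For Chinese, Japanese or Thai text the job estimate would be based on char_count, and for languages with spacing on space_word_count.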

June 23, 2009

A Free Browser Word Count Add-in for Firefox

Filed under: tips and tricks - Thomas Vysokos @ 11:01 am

Have you ever needed to count the words on a web page? Have you ever solved this task by copy/pasting the content into a word processor and running its statistics tool? What if there were a free browser add-in capable of showing the statistics right in the browser window?

Firefox boasts of being one of the most extensible browsers, and even web humor proves this. Today I’m reviewing a free word count Firefox add-in called Word Count Plus. It may be of great benefit to you, so let’s get started.

Step 1. Install the Firefox browser.

If you don’t have Firefox installed, just download it here and run the installation using the default options (not a single problem even on Vista).

Step 2. Install the Word Count Plus add-in.

Visit the Word Count Plus webpage, then click the “Install version 1.3.0” button (the version number may differ).

download word count plus

Firefox will prompt you to allow the add-in installation. Do so.

allow mozilla to install word count plus

Click “Install now” to install the add-in.

start word count plus installation

Restart Firefox.

restart firefox after word count plus installation

Step 3. Start counting.

You can either press the word count button

getting word count statistics in browser by pressing a button

or right click it and get some shortcuts that make the word count much easier

word count plus shortcuts for faster work

Summary

Pros: 1) free; 2) flexible word count (you can count the words in the first and last paragraphs of a page with no copy/pasting); 3) supports addition and undoing the last action.

Cons: 1) it is a browser add-in (to count words, you need to open a browser); 2) no bulk file processing (counting statistics for 10 files becomes a time-consuming task); 3) incomplete statistics (alt text, page title and keywords are not counted, since they sit in the page code rather than in the visible text).

A good tool for ad hoc use when you need to count the words or characters on a web page. Occasional word counters should thank Sam Waters, who built this fine app.

But professionals who need accurate and complete word count statistics for HTML files, including the page title and alt text, should turn to professional word count software.

June 22, 2009

The History of Word Count Metrics

Filed under: more than just history - Thomas Vysokos @ 7:48 am

There are a number of jobs where people are paid based on how much text they produce, proofread, type or process in any other way. And there are a number of standards by which they are paid. Anyone who has needed a word count has come across several of them: 250 words, 300 words, 1800 signs or even 3500 signs. But why not just pay on a per-word basis?

Paying on a per-word basis looks simpler only at first glance. Every language group is special and has its own word count traditions. And size matters: some words are long, some words are short. So, years ago, two standard methods were developed to count words in a text. I call them the Western and the Soviet ones.

In the Western method, one standard word consists of six characters including spaces (the average English word is 5.1 characters long). “Antiautomorphism” (16 characters) is thus 2 and 2/3 words long, roughly the same as the phrase “during the dinner” (17 characters including spaces). This model is fair, because it would be a bit unfair to count articles on a par with meaningful words, which are usually twice as long as articles.
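The six-character rule is easy to try out. A minimal sketch, assuming plain ASCII input (the function name standard_words is made up for this example):

```shell
#!/bin/sh
# Western "standard word": 6 characters, spaces included.
# Prints the length of the given text in standard words, to two decimals.
standard_words() {
    awk -v text="$1" 'BEGIN { printf "%.2f\n", length(text) / 6 }'
}

standard_words "Antiautomorphism"     # 16 characters -> 2.67
standard_words "during the dinner"    # 17 characters -> 2.83
```

So the single long word and the three-word phrase come out at almost the same billable length.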

The Western word count method, in turn, has 2 industry standards. In earlier times, when most manuscripts were prepared on typewriters with fixed-pitch (monospace) fonts, 250 words per page was generally considered standard, and many editors still use it. But in the PC era an average manuscript page in 12 point Times Roman contains about 23 lines of type and about 13 words per line, or roughly 300 words (23 × 13 = 299) per manuscript page.

In the Soviet Union the dominant language was Russian. As you may know, Russian has no articles, and an average Russian word is 6.36 characters long.

In the early 1920s a new industry standard called the “author’s sheet” was created. It consisted of an unbelievable 40,000 signs (including spaces, numbers and all punctuation). Unlike the Western standards, manuscripts in the Soviet Union were submitted double-spaced, so an average typewritten page was 1800 characters long (paradoxically, that is 300 words by the Western printing standard, although an average Russian word equals 1.24 English words). And if printed on a PC in 12 point Times Roman with single spacing, an average page in Russian holds 3500 signs (584 Western words).
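The conversions between signs and Western words above are plain divisions by six. A small sketch, where the function name chars_to_words is made up and rounding up to a whole word is my assumption (chosen because it reproduces the 584 figure quoted above):

```shell
#!/bin/sh
# Convert a character count (spaces included) into Western standard words
# of 6 characters each, rounding up to a whole word.
chars_to_words() {
    echo $(( ($1 + 5) / 6 ))
}

chars_to_words 1800     # 300: a double-spaced typewritten page
chars_to_words 3500     # 584: a single-spaced page in 12 point type
chars_to_words 40000    # 6667: the 40,000-sign standard
```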

After the Soviet Union collapsed, its word count standards, like a great many other standards, remained widely used in the former republics. So if you are paid in units of 250 or 300 words, your client is most probably in Western Europe or America. But if your work is measured in units of 1800 or 3500 signs, I bet that your order came from somewhere in the Commonwealth of Independent States.

I still have to explore the word count specifics of oriental languages. An article on this topic will follow soon.

P.S. You can easily get word count statistics for almost any document format using word count software.
